Poster

High-Quality Joint Image and Video Tokenization with Causal VAE

Dawit Mureja Argaw · Xian Liu · Qinsheng Zhang · Joon Son Chung · Ming-Yu Liu · Fitsum Reda

Hall 3 + Hall 2B #159
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Generative modeling has seen significant advances in image and video synthesis. However, the curse of dimensionality remains a major obstacle, especially for video generation, given its inherently complex and high-dimensional nature. Many existing works rely on low-dimensional latent spaces from pretrained image autoencoders. However, this approach overlooks temporal redundancy in videos and often leads to temporally incoherent decoding. To address this issue, we propose a video compression network that reduces the dimensionality of visual data both spatially and temporally. Our model, based on a variational autoencoder, employs causal 3D convolution to handle images and videos jointly. The key contributions of our work include a scale-agnostic encoder for preserving video fidelity, a novel spatio-temporal down/upsampling block for robust long-sequence modeling, and a flow regularization loss for accurate motion decoding. Our approach outperforms competing methods in video quality and compression rate across various datasets. Experimental analyses also highlight its potential as a robust autoencoder for video generation training.
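The abstract's central mechanism, causal 3D convolution, is what allows a single network to tokenize both a still image (a one-frame video) and a long clip: padding along the time axis is applied only on the "past" side, so the output at frame t never depends on future frames. Below is a minimal PyTorch sketch of this idea; the class name `CausalConv3d` and all design details are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis (illustrative
    sketch, not the paper's code). The output at frame t depends only
    on frames <= t, because temporal padding is applied entirely on
    the left (past). A single image (T = 1) is then a valid input,
    which is what lets one model handle images and videos jointly."""

    def __init__(self, in_channels, out_channels, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        # Spatial padding is symmetric; temporal padding is all "past".
        self.time_pad = kt - 1
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        # F.pad order for 5D input: (W_left, W_right, H_top, H_bottom,
        #                            T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)


# The same layer processes an image (one-frame video) and a video clip.
layer = CausalConv3d(3, 64)
image = torch.randn(1, 3, 1, 64, 64)   # T = 1
video = torch.randn(1, 3, 8, 64, 64)   # T = 8
print(layer(image).shape)  # torch.Size([1, 64, 1, 64, 64])
print(layer(video).shape)  # torch.Size([1, 64, 8, 64, 64])
```

Because the padding is one-sided in time, the temporal extent of the output matches the input for any sequence length, which is the property a joint image-and-video tokenizer needs.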
