Poster
in
Workshop: SCOPE: SCALABLE OPTIMIZATION FOR EFFICIENT AND ADPATIVE FOUNDATION MODELS

STIV: SCALABLE TEXT AND IMAGE CONDITIONED VIDEO GENERATION

Zongyu Lin ⋅ Wei Liu ⋅ Chen Chen ⋅ Jiasen Lu ⋅ Wenze Hu ⋅ Tsu-Jui Fu ⋅ Jesse Allardice ⋅ Zhengfeng Lai ⋅ Liangchen Song ⋅ Bowen Zhang ⋅ cha chen ⋅ Yiran Fei ⋅ Yifan Jiang ⋅ Lezhi Li ⋅ Yizhou Sun ⋅ Kai-Wei Chang ⋅ Yinfei Yang

Keywords: efficiency image to video Scalable video generation

Project Page [ OpenReview]

Abstract

The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV.Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation, etc. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrate strong performance, despite its simple design. An 8.7B model with (512^2) resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on VBench I2V task at (512^2) resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.

Video

Chat is not available.