Skip to yearly menu bar Skip to main content


ControlVideo: Training-free Controllable Text-to-video Generation

Yabo Zhang · Yuxiang Wei · Dongsheng jiang · XIAOPENG ZHANG · Wangmeng Zuo · Qi Tian

Halle B #71
[ ] [ Project Page ]
Wed 8 May 1:45 a.m. PDT — 3:45 a.m. PDT


Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart lags behind due to the excessive training cost.To avert the training burden, we propose a training-free ControlVideo to produce high-quality videos based on the provided text prompts and motion sequences.Specifically, ControlVideo adapts a pre-trained text-to-image model (i.e., ControlNet) for controllable text-to-video generation.To generate continuous videos without flicker effect, we propose an interleaved-frame smoother to smooth the intermediate frames.In particular, interleaved-frame smoother splits the whole videos with successive three-frame clips, and stabilizes each clip by updating the middle frame with the interpolation among other two frames in latent space.Furthermore, a fully cross-frame interaction mechanism have been exploited to further enhance the frame consistency, while a hierarchical sampler is employed to produce long videos efficiently.Extensive experiments demonstrate that our ControlVideo outperforms the state-of-the-arts both quantitatively and qualitatively. It is worthy noting that, thanks to the efficient designs, ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti. Code and videos are available at this link.

Chat is not available.