VOGUE: Unified Understanding, Generation, and Editing for Videos
Abstract
Unified multimodal understanding–generation models have shown promising results in image generation and editing, but remain largely constrained to the image domain. In this work, we present VOGUE, a versatile framework that extends unified modeling to the video domain. VOGUE adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, VOGUE unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that VOGUE matches or surpasses state-of-the-art task-specific baselines in visual understanding, text- and image-to-video generation, and in-context video editing and generation. Beyond these core capabilities, the unified design allows VOGUE to generalize to unseen free-form editing tasks, such as green-screening characters or composing novel tasks (e.g., editing combined with style transfer) in a single instruction. Notably, VOGUE is the first system to support visual-prompt-based video generation in a unified model, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
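To make the dual-stream description above concrete, the following is a minimal sketch of how an MLLM stream that interprets a multimodal instruction could condition an MMDiT-style denoiser over video latents via cross-attention. All class names, dimensions, and the single cross-attention hook are illustrative assumptions, not the VOGUE implementation.

```python
import torch
import torch.nn as nn


class DualStreamSketch(nn.Module):
    """Illustrative dual-stream wiring (assumed, not the paper's code):
    an 'MLLM' stream encodes the multimodal instruction, and an
    'MMDiT'-style stream denoises video latent tokens conditioned on
    those instruction features."""

    def __init__(self, instr_dim=512, latent_dim=512, num_heads=8):
        super().__init__()
        # Stand-in for the MLLM: maps tokenized instruction embeddings
        # (text and/or visual-prompt tokens) to conditioning features.
        self.mllm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=instr_dim, nhead=num_heads, batch_first=True
            ),
            num_layers=2,
        )
        # Stand-in for the MMDiT denoiser: video latent tokens
        # cross-attend to the instruction features.
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, num_heads, batch_first=True
        )
        self.denoise_head = nn.Linear(latent_dim, latent_dim)

    def forward(self, instr_tokens, noisy_video_latents):
        # instr_tokens: (B, L_instr, instr_dim) multimodal instruction embeddings
        # noisy_video_latents: (B, L_video, latent_dim) flattened spatio-temporal tokens
        cond = self.mllm(instr_tokens)  # instruction understanding stream
        attended, _ = self.cross_attn(noisy_video_latents, cond, cond)
        # Predicted denoised latents, guided by the instruction features.
        return self.denoise_head(noisy_video_latents + attended)


if __name__ == "__main__":
    model = DualStreamSketch()
    instr = torch.randn(1, 16, 512)    # e.g., text + visual-prompt tokens
    latents = torch.randn(1, 64, 512)  # e.g., 4 frames x 16 patch tokens
    out = model(instr, latents)
    print(out.shape)  # torch.Size([1, 64, 512])
```

In this sketch, keeping the instruction encoder and the denoiser as separate modules mirrors the dual-stream motivation stated in the abstract: instruction understanding and video synthesis can be specialized independently while interacting through a conditioning interface.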