Poster
in
Workshop: Deep Generative Model in Machine Learning: Theory, Principle and Efficacy

Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer

Junpeng Jiang · Gangyi Hong · Hengtong Hu · Lijun Zhou · Tianyi Yan · Yida Wang · Kun Zhan · Peng Jia · XianPeng Lang · Miao Zhang

Keywords: [ Diffusion Model ] [ Video Generation ] [ Efficient Inference ]


Abstract: Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks presents significant challenges and incurs substantial costs, making generative models for realistic data an appealing alternative. Yet, the videos generated by recent works suffer from poor quality and weak temporal consistency, which restricts their effectiveness in advancing perception tasks under driving scenarios. This gap highlights the need for a more robust and versatile framework capable of generating high-fidelity, temporally consistent multi-view videos tailored to the complexities of driving scenarios. We introduce DiVE, a framework based on the Diffusion Transformer (DiT), designed to generate videos that are both temporally and cross-view consistent, aligning seamlessly with bird's-eye view (BEV) layouts and textual descriptions. Specifically, DiVE leverages cross-attention and a SketchFormer to exert precise control over multimodal data, while incorporating a view-inflated attention mechanism that adds no extra parameters, thereby guaranteeing consistency across views. To address the computational costs of high-resolution video generation, we further propose a training-free acceleration strategy called Resolution Progressively Sampling, achieving a 1.62$\times$ speedup without compromising generation quality. In summary, DiVE delivers multi-view videos with outstanding visual quality and achieves state-of-the-art performance on the nuScenes dataset. Additionally, the highly efficient and robust generation capabilities of DiVE offer promising avenues for supporting 3D perception models in achieving substantial performance improvements.
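The abstract's "view-inflated attention mechanism that adds no extra parameters" can be understood as folding the camera-view axis into the token-sequence axis before a standard self-attention, so that tokens from all views attend to each other while reusing the single-view projection weights. The sketch below is a minimal numpy illustration of that idea under stated assumptions; the function names (`view_inflated_attention`, `attention`) and the single-head, un-masked formulation are ours, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def view_inflated_attention(x, Wq, Wk, Wv):
    # x: (batch, views, tokens, dim).
    # Fold the view axis into the sequence axis so attention spans all
    # views jointly. Wq/Wk/Wv are the same projections a single-view
    # model would use, so no parameters are added -- only the sequence
    # over which attention operates is "inflated".
    b, v, t, d = x.shape
    x_flat = x.reshape(b, v * t, d)
    out = attention(x_flat @ Wq, x_flat @ Wk, x_flat @ Wv)
    return out.reshape(b, v, t, d)
```

Because the only change is a reshape, the module stays drop-in compatible with pretrained single-view attention weights, which is presumably why cross-view consistency comes "for free" in parameter count.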
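The training-free Resolution Progressively Sampling strategy is described only at a high level, but a common pattern it suggests is: run the early, structure-forming denoising steps on a downscaled latent, then upsample and finish the remaining steps at full resolution, saving compute where fine detail is not yet needed. The following is a hypothetical sketch of that pattern, not the paper's algorithm; `denoise_step`, the 50/50 step split, and nearest-neighbor upsampling are all illustrative assumptions.

```python
import numpy as np

def resolution_progressive_sampling(denoise_step, latent_hw, steps,
                                    switch_frac=0.5, scale=2, seed=0):
    # denoise_step(latent, t): stand-in for one sampler step of a
    # diffusion model; latent_hw: target latent (height, width).
    # Phase 1: denoise at reduced resolution (cheaper attention/conv).
    h, w = latent_hw
    rng = np.random.default_rng(seed)
    lat = rng.standard_normal((h // scale, w // scale))
    switch = int(steps * switch_frac)
    for t in range(switch):
        lat = denoise_step(lat, t)
    # Upsample the partially denoised latent (nearest-neighbor here)
    # and finish the remaining steps at full resolution.
    lat = lat.repeat(scale, axis=0).repeat(scale, axis=1)
    for t in range(switch, steps):
        lat = denoise_step(lat, t)
    return lat
```

With half the steps at quarter area, the attention cost of the early phase drops sharply, which is consistent in spirit with the reported 1.62x wall-clock speedup, though the exact schedule in the paper may differ.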
