

Poster

Field-DiT: Diffusion Transformer on Unified Video, 3D, and Game Field Generation

Kangfu Mei · Mo Zhou · Vishal Patel

Hall 3 + Hall 2B #174
[ Project Page ]
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Probabilistic field models capture distributions over continuous functions defined on metric spaces. While these models hold great promise for unifying data generation across modalities, including images, videos, and 3D geometry, they still struggle with long-context generation beyond simple examples. This limitation can be attributed to their MLP architecture, which lacks the inductive bias needed to capture global structure through uniform sampling. To address this, we propose a new and simple model that incorporates a view-wise sampling algorithm to focus on local structure learning, along with autoregressive generation to preserve global geometry. It accommodates cross-modal conditions, such as text prompts for text-to-video generation, camera poses for 3D view generation, and control actions for game generation. Experimental results across these modalities demonstrate the effectiveness of our model at only 675M parameters and highlight its potential as a foundational framework for scalable, modality-unified visual content generation. Our project page can be found at https://kfmei.com/Field-DiT/.
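The abstract names two ingredients: view-wise sampling for local structure, and autoregressive generation across views for global geometry. The Python sketch below illustrates one plausible reading of that recipe; the coordinate layout, the function names, and the model.denoise interface are assumptions made for illustration, not the authors' implementation.

    import torch

    def sample_view_coords(view_idx: int, height: int, width: int) -> torch.Tensor:
        """Sample a dense coordinate grid for a single view (frame),
        rather than uniform random points over the whole field."""
        ys, xs = torch.meshgrid(
            torch.linspace(0.0, 1.0, height),
            torch.linspace(0.0, 1.0, width),
            indexing="ij",
        )
        t = torch.full_like(xs, float(view_idx))  # view index as the third axis
        return torch.stack([t, ys, xs], dim=-1).reshape(-1, 3)  # (H*W, 3)

    @torch.no_grad()
    def generate(model, condition, num_views: int, height: int, width: int):
        """Autoregressively generate views: each view is denoised while
        conditioning on previously generated views to keep global geometry."""
        views = []
        for v in range(num_views):
            coords = sample_view_coords(v, height, width)
            context = torch.stack(views) if views else None
            # model.denoise is a hypothetical interface assumed to run the
            # diffusion sampling loop for the field values at `coords`, given
            # the cross-modal `condition` and previously generated views.
            values = model.denoise(coords, condition=condition, context=context)
            views.append(values.reshape(height, width, -1))
        return torch.stack(views)  # (num_views, H, W, C)

Conditioning each view on earlier views is what the abstract credits with preserving global geometry; swapping the condition input between text embeddings, camera poses, and control actions is how one sketch would cover the video, 3D, and game settings.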
