Skip to yearly menu bar Skip to main content


Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Junyoung Seo · Wooseok Jang · Min-Seop Kwak · Inès Hyeonsu Kim · Jaehoon Ko · Junho Kim · Jin-Hwa Kim · Jiyoung Lee · Seungryong Kim

Halle B #16
[ ] [ Project Page ]
Tue 7 May 7:30 a.m. PDT — 9:30 a.m. PDT


Text-to-3D generation has shown rapid progress in recent days with the advent of score distillation sampling (SDS), a methodology of using pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in a zero-shot setting. However, the lack of 3D awareness in the 2D diffusion model often destabilizes previous methods from generating a plausible 3D scene. To address this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness into the pretrained 2D diffusion model, enhancing the robustness and 3D consistency of score distillation-based methods. Specifically, we introduce a consistency injection module which constructs a 3D point cloud from the text prompt and utilizes its projected depth map at given view as a condition for the diffusion model. The 2D diffusion model, through its generative capability, robustly infers dense structure from the sparse point cloud depth map and generates a geometrically consistent and coherent 3D scene. We also introduce a new technique called semantic coding that reduces semantic ambiguity of the text prompt for improved results. Our method can be easily adapted to various text-to-3D baselines, and we experimentally demonstrate how our method notably enhances the 3D consistency of generated scenes in comparison to previous baselines, achieving state-of-the-art performance in geometric robustness and fidelity.

Chat is not available.