Beyond Text-to-Image: Liberating Generation with a Unified Discrete Diffusion Model
Abstract
Autoregressive (AR) unified models suffer from slow inference due to sequential decoding, while non-autoregressive unified models generalize poorly because they build on limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast, parallel generation across both text and image modalities. By combining efficient token-level discrete denoising with strong visual priors and a lightweight text decoder, Muddit supports flexible, high-quality generation within a compact architecture. Empirical results show that Muddit achieves competitive or superior performance in both quality and speed compared to significantly larger AR-based models. This work highlights the potential of pure discrete diffusion as a scalable and effective backbone for multimodal generation. Code and models will be made available.
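The abstract attributes Muddit's speed to parallel, token-level discrete denoising. As a point of reference only, the sketch below illustrates the generic mask-based parallel decoding idea: a fully masked token sequence is refined over a few forward passes, keeping the most confident predictions at each step. The model interface, vocabulary layout, schedule, and constants are hypothetical placeholders and are not taken from the paper.

```python
import torch

# Illustrative sketch of parallel mask-token denoising (assumed setup, not the authors' code).
MASK_ID = 0          # assumed id of the [MASK] token
SEQ_LEN = 256        # assumed number of discrete tokens to generate
NUM_STEPS = 8        # a few parallel refinement steps instead of SEQ_LEN sequential ones

def parallel_denoise(model, num_steps=NUM_STEPS):
    # Start from a fully masked sequence.
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        logits = model(tokens)                  # one forward pass predicts every position at once
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-token confidence and argmax prediction
        still_masked = tokens == MASK_ID
        # Only positions that are still masked compete for being revealed this step.
        conf = torch.where(still_masked, conf, torch.full_like(conf, -1.0))

        # Linear unmasking schedule (an assumption): after step k, roughly k/num_steps of
        # the tokens are fixed; the final step reveals everything that remains.
        target_unmasked = SEQ_LEN if step == num_steps - 1 else int(SEQ_LEN * (step + 1) / num_steps)
        num_reveal = max(0, target_unmasked - int((~still_masked).sum()))
        if num_reveal == 0:
            continue
        reveal_idx = conf.topk(num_reveal, dim=-1).indices
        tokens.scatter_(1, reveal_idx, pred.gather(1, reveal_idx))
    return tokens
```

In this sketch the total cost is NUM_STEPS forward passes regardless of sequence length, which is the source of the speed advantage over token-by-token AR decoding claimed in the abstract.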