dLLM - Rethinking Generation Beyond Autoregressive Models
Abstract
Diffusion large language models (dLLMs) have emerged as a promising alternative to standard autoregressive (AR) Transformers, offering parallel token generation and flexible infilling instead of strict left-to-right decoding. This post walks through how masked discrete diffusion works at a high level: a forward process that randomly masks tokens, a reverse process that iteratively denoises them, and training setups that either train from scratch or adapt existing AR models. We then discuss how diffusion decoding differs from AR decoding, including its strengths for infilling, structured generation, and long-horizon planning, as well as practical challenges around length control, the number of denoising steps, and blockwise generation. Finally, we examine both sides of the current evidence: where dLLMs shine in data-constrained regimes, and where theoretical and empirical work suggests they can underperform or even collapse back toward autoregressive behavior. We end with possible hybrid futures that combine diffusion for reasoning with autoregression for generation.
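The masking-and-denoising loop described above can be illustrated with a toy sketch. This is not any particular dLLM's implementation: the `forward_mask` / `reverse_denoise` names, the uniform masking rate, and the oracle "denoiser" (which simply looks up the original sequence where a real model would run a Transformer) are all illustrative assumptions.

```python
import random

MASK = "<mask>"

def forward_mask(tokens, mask_prob, rng):
    """Forward process: independently replace each token with <mask>."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

def reverse_denoise(masked, predict_fn, num_steps, rng):
    """Reverse process: over num_steps, unmask a share of the remaining
    masked positions each step, filling them with the model's predictions.
    All positions chosen in a step are filled in parallel."""
    tokens = list(masked)
    for step in range(num_steps):
        masked_idx = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked_idx:
            break
        # Spread the remaining masks evenly over the remaining steps.
        k = max(1, len(masked_idx) // (num_steps - step))
        chosen = rng.sample(masked_idx, k)
        preds = predict_fn(tokens)  # one parallel prediction pass
        for i in chosen:
            tokens[i] = preds[i]
    return tokens

# Toy "denoiser": a real dLLM would predict every masked token at once
# with a Transformer; this oracle just returns the original sequence.
original = "the cat sat on the mat".split()
oracle = lambda toks: original

rng = random.Random(0)
noisy = forward_mask(original, mask_prob=0.8, rng=rng)
recovered = reverse_denoise(noisy, oracle, num_steps=4, rng=rng)
print(recovered == original)  # True: every mask is filled by the final step
```

The key contrast with AR decoding is visible in `reverse_denoise`: each step fills several positions at once from a single prediction pass, so the number of model calls is set by `num_steps` rather than by the sequence length.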