Sitan Chen — Theory for Discrete Diffusions: Parallel Decoding and Variable-Length Generation
Abstract
Compared to autoregressive models and even to continuous diffusions, diffusion language models offer a fundamentally different design space for crafting efficient and flexible generation processes. This talk discusses work along two axes: parallel decoding and variable-length generation. The first half gives an exact characterization of the optimal inference schedule for masked diffusion models. This characterization yields simple schedules that provably sample more efficiently than autoregressive models for any distribution with bounded correlations. The second half presents FlexMDM, a theoretically principled and empirically lightweight method for equipping diffusion language models with the ability to generate sequences of arbitrary length while provably preserving their any-order generation capabilities.