Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
Pengxiang Li ⋅ Jiayin Cai ⋅ Hongwei Xue ⋅ Kunyu Shi ⋅ Shilin Yan
Abstract
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on ``scattered acceptance''---committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the \textbf{Longest Stable Prefix (LSP)} scheduler, a training-free and model-agnostic inference paradigm based on \textit{monolithic prefix absorption}. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4$\times$ across rigorous benchmarks---including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing---while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
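The prefix-selection step described above can be sketched in a few lines. The following is a minimal illustrative Python sketch, not the paper's implementation: the function name, the confidence threshold, and the delimiter set are all hypothetical assumptions, since the abstract does not specify these details.

```python
def longest_stable_prefix(confidences, tokens, threshold=0.9,
                          delimiters=(".", ",", " ", "\n")):
    """Return how many leading tokens to commit atomically this step.

    confidences: per-position model confidence for current predictions
                 (hypothetical stability measure).
    tokens:      decoded token strings at those positions.
    threshold:   stability cutoff (illustrative value, not from the paper).
    """
    # 1. Find the longest contiguous, left-aligned run of stable predictions.
    k = 0
    while k < len(confidences) and confidences[k] >= threshold:
        k += 1
    if k == 0:
        return 0  # nothing stable yet; commit no prefix this step

    # 2. Snap the commit boundary back to the last natural delimiter inside
    #    the stable run, so commits align with linguistic/structural units.
    for j in range(k - 1, -1, -1):
        if any(d in tokens[j] for d in delimiters):
            return j + 1

    # No delimiter found inside the run: commit the whole stable run.
    return k
```

Under this sketch, the committed prefix is appended to the KV cache as one contiguous block, and the denoiser continues refining only the remaining suffix, which is the source of the contiguous-append and shrinking-active-window benefits the abstract claims.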