ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping
Zijian Zhu · Fei Ren · Zhanhong Tan · Kaisheng Ma
Abstract
Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and their potential for parallel generation. Despite these advantages, dLLM generation remains time-consuming, as the full context is processed at every inference iteration. In this work, we analyze the generation characteristics of dLLMs and observe that intermediate states (e.g., key, value, and hidden states) change only subtly across iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens with low importance scores in the earlier layers of the model. Importance is estimated from intermediate tensor variation and confidence scores obtained in previous iterations. Experiments on LLaDA-8B and Dream-7B show that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6$\times$ to 16.8$\times$ speedup over the original implementation and up to 1.85$\times$ over the state-of-the-art caching method, while preserving generation quality.
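The following is a minimal conceptual sketch, not the paper's implementation, of the early-skipping idea the abstract describes: per-token importance is estimated from how much intermediate states varied since the previous iteration together with the previous iteration's confidence, and only the highest-scoring tokens are kept for the early layers. All names and parameters here (`importance_scores`, `select_active_tokens`, `alpha`, `keep_ratio`) are illustrative assumptions.

```python
import torch


def importance_scores(prev_hidden: torch.Tensor,
                      curr_hidden: torch.Tensor,
                      prev_confidence: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Estimate per-token importance from hidden-state variation and the
    confidence produced at the previous denoising iteration (hypothetical)."""
    # Variation: how much each token's hidden state moved between iterations.
    variation = (curr_hidden - prev_hidden).norm(dim=-1)   # [batch, seq]
    # Low previous confidence -> token is still undecided -> more important.
    uncertainty = 1.0 - prev_confidence                     # [batch, seq]
    return alpha * variation + (1.0 - alpha) * uncertainty


def select_active_tokens(scores: torch.Tensor,
                         keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the top-`keep_ratio` fraction of tokens for early layers."""
    k = max(1, int(scores.shape[-1] * keep_ratio))
    return scores.topk(k, dim=-1).indices                   # [batch, k]


# Toy usage: one sequence of 16 tokens with 8-dim hidden states.
prev_h = torch.randn(1, 16, 8)
curr_h = prev_h + 0.01 * torch.randn(1, 16, 8)   # states change only subtly
prev_conf = torch.rand(1, 16)
active = select_active_tokens(importance_scores(prev_h, curr_h, prev_conf))
print(active)  # indices of tokens that would still be processed in early layers
```

In this sketch the remaining (skipped) tokens would reuse their cached intermediate states for the early layers, which is consistent with the abstract's observation that those states change only subtly across iterations; the exact scoring function and skipping schedule are design choices specified in the paper itself.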