ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping
Zijian Zhu · Fei Ren · Zhanhong Tan · Kaisheng Ma
Abstract
Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and their potential for parallel generation. Despite these advantages, dLLM generation remains time-consuming, as the full context is processed at every inference iteration. In this work, we analyze the generation characteristics of dLLMs and observe that intermediate states (e.g., key, value, and hidden states) change only subtly across iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens with low importance scores in the earlier layers of the model. Importance is estimated from intermediate tensor variation and confidence scores obtained in previous iterations. Experiments on LLaDA-8B and Dream-7B show that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6$\times$ to 16.8$\times$ speedup over the original implementation and up to 1.85$\times$ over the state-of-the-art caching method, while preserving generation quality.
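The following is a minimal conceptual sketch, not the paper's implementation, of the early-skipping idea the abstract describes: per-token importance is estimated from how much intermediate states varied since the previous iteration together with the previous iteration's confidence, and only the highest-scoring tokens are kept for the early layers. All names and parameters here (`importance_scores`, `select_active_tokens`, `alpha`, `keep_ratio`) are illustrative assumptions.

```python
import torch


def importance_scores(prev_hidden: torch.Tensor,
                      curr_hidden: torch.Tensor,
                      prev_confidence: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Estimate per-token importance from hidden-state variation and the
    confidence produced at the previous denoising iteration (hypothetical)."""
    # Variation: how much each token's hidden state moved between iterations.
    variation = (curr_hidden - prev_hidden).norm(dim=-1)   # [batch, seq]
    # Low previous confidence -> token is still undecided -> more important.
    uncertainty = 1.0 - prev_confidence                     # [batch, seq]
    return alpha * variation + (1.0 - alpha) * uncertainty


def select_active_tokens(scores: torch.Tensor,
                         keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the top-`keep_ratio` fraction of tokens for early layers."""
    k = max(1, int(scores.shape[-1] * keep_ratio))
    return scores.topk(k, dim=-1).indices                   # [batch, k]


# Toy usage: one sequence of 16 tokens with 8-dim hidden states.
prev_h = torch.randn(1, 16, 8)
curr_h = prev_h + 0.01 * torch.randn(1, 16, 8)   # states change only subtly
prev_conf = torch.rand(1, 16)
active = select_active_tokens(importance_scores(prev_h, curr_h, prev_conf))
print(active)  # indices of tokens that would still be processed in early layers
```

In this sketch the remaining (skipped) tokens would reuse their cached intermediate states for the early layers, which is consistent with the abstract's observation that those states change only subtly across iterations; the exact scoring function and skipping schedule are design choices specified in the paper itself.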