Polestar-Cache: Reconciling Parallel Decoding and Accuracy in Diffusion LLMs via Token Drift-Aware KV Cache Recalibration
Mingyu Lee ⋅ Akshat Ramachandran ⋅ Souvik Kundu ⋅ Tushar Krishna
Abstract
Diffusion language models (dLLMs) offer a promising alternative to autoregressive generation, but their inference efficiency remains limited by high compute cost and instability under approximate key–value (KV) cache reuse. We show that existing KV-cache–enabled dLLM inference schemes suffer from token drift, where cached representations become misaligned with evolving context, leading to degraded prediction confidence, reduced parallelism, and accuracy loss. To address this, we propose Polestar-Cache, a training-free, drift-aware KV caching framework that reconciles parallel decoding with accuracy. Polestar-Cache detects layer-wise representation drift using a lightweight KL-divergence proxy computed from cached hidden states, and selectively refreshes KV cache entries. To reduce overhead, Polestar-Cache clusters hidden states and keeps only centroids on the GPU. The non-centroid tokens are offloaded to CPU memory and fetched according to drift dynamics. Extensive experiments on multiple dLLM benchmarks, including GSM8K, MBPP, and ParallelBench, demonstrate that Polestar-Cache achieves up to 11\% accuracy improvement and $1.7\times$ performance improvement over prior KV-cache–enabled baselines.
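As a rough illustration of the drift-detection idea described above, the following minimal sketch flags cached tokens for refresh when a KL-divergence score exceeds a threshold. It is not the authors' implementation: the abstract does not specify how the proxy is formed. The softmax normalization over hidden-state features, the comparison against freshly computed states, and the names `drifted_tokens` and `threshold` are all assumptions made here for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over one token's feature vector.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def drifted_tokens(cached_states, current_states, threshold):
    # Hypothetical helper: return indices of tokens whose drift score
    # exceeds the threshold. Treating softmax(hidden state) as a
    # distribution is an assumption, not the paper's stated proxy.
    stale = []
    for idx, (old, new) in enumerate(zip(cached_states, current_states)):
        drift = kl_divergence(softmax(new), softmax(old))
        if drift > threshold:
            stale.append(idx)
    return stale

# Toy example: token 0 is unchanged, token 1 has shifted noticeably.
cached = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
current = [[1.0, 0.0, -1.0], [3.0, -1.0, 0.5]]
stale = drifted_tokens(cached, current, threshold=0.05)
```

In this toy setting only the entries listed in `stale` would have their KV cache recomputed, while the remaining entries are reused. A practical system would presumably compute the score per layer and without a full forward pass, which this sketch does not attempt.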