Poster Sat, Apr 25, 2026 • 6:30 AM – 9:00 AM PDT Pavilion 3 P3-#611

Draft-based Approximate Inference for LLMs

Kevin Galim ⋅ Ethan Ewer ⋅ Wonjun Kang ⋅ Minjae Lee ⋅ Hyung Koo ⋅ Kangwook Lee

Abstract

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
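The lookahead idea behind KV cache dropping can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`lookahead_importance`, `select_kv`), the single-head setup, and the max-over-queries aggregation are all assumptions made for illustration. It assumes that query vectors for a few lookahead tokens (which, in the framework described above, would come from a small draft model) are already available, scores each prompt position by the attention it receives from those queries, and keeps only the top-scoring KV pairs under a fixed budget.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lookahead_importance(prompt_keys, lookahead_queries):
    """Score prompt positions by attention received from lookahead queries.

    prompt_keys:       (n_prompt, d) key vectors for the prompt tokens.
    lookahead_queries: (n_lookahead, d) query vectors for lookahead tokens
                       (assumed here to be supplied by a draft model).
    Returns an (n_prompt,) array of importance scores.
    """
    d = prompt_keys.shape[-1]
    scores = lookahead_queries @ prompt_keys.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)  # (n_lookahead, n_prompt)
    # Assumption: aggregate with a max over lookahead queries.
    return attn.max(axis=0)

def select_kv(prompt_keys, prompt_values, importance, budget):
    """Keep the `budget` highest-scoring KV pairs, preserving token order."""
    keep = np.sort(np.argsort(importance)[-budget:])
    return prompt_keys[keep], prompt_values[keep], keep

# Toy usage: 6 prompt tokens, dimension 4, one lookahead query.
keys = np.eye(6, 4)                         # rows 4 and 5 are all zeros
values = np.arange(24, dtype=float).reshape(6, 4)
queries = np.array([[0.0, 0.0, 10.0, 0.0]]) # aligned with prompt token 2

importance = lookahead_importance(keys, queries)
kept_keys, kept_values, keep = select_kv(keys, values, importance, budget=2)
```

In this toy case the single query aligns with the key at position 2, so that position receives the highest score and survives the budget of two. A real system would apply the same selection per layer and per attention head, and the choice of aggregation over lookahead queries is a design decision this sketch does not settle.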
