Norm-Guided KV-Cache Eviction for Memory-Efficient Reasoning
Prasanth Yadla
Abstract
Large language models deployed as autonomous agents face a fundamental memory constraint: the KV-cache required for autoregressive generation grows linearly with context length, and at long contexts it dominates inference memory. We propose \textbf{$\ell_2$-Norm Eviction}, a gradient-free KV-cache compression method that scores tokens by the mean $\ell_2$-norm of their key vectors across attention heads, retaining a hybrid of high-norm heavy hitters and recent tokens. Unlike H2O~\cite{h2o}, which requires accumulating explicit attention scores across all decoding steps, our method operates with a single pass over key tensors and imposes no attention-tracking overhead. We evaluate $\ell_2$-Norm Eviction against a full-cache baseline and a StreamingLLM-style sliding window on the GSM8K mathematical reasoning benchmark and curated logic prompts, using automated Exact Match scoring across four cache budgets (256--2048 tokens) on Mistral-7B-Instruct-v0.3. At budgets 512--2048, the eviction condition ($T > B$) is never satisfied because total sequence lengths remain below 512 tokens in our evaluation set; no tokens are dropped and all methods match the full-cache baseline exactly. At the extreme budget of 256 (87.5\% reduction), where eviction does fire, the sliding window (EM=0.25) outperforms $\ell_2$-Norm Eviction (EM=0.05) on GSM8K, indicating that recency dominates global token importance at very tight budgets. We characterise this as a minimum-viable-budget effect and identify adaptive pool sizing as the key direction for closing this gap.
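The scoring rule described above (mean $\ell_2$-norm of key vectors across heads, plus a protected recency window, with eviction firing only when the sequence length $T$ exceeds the budget $B$) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the function name and the list-of-lists tensor layout are assumptions made for clarity.

```python
import math

def evict_indices(keys, budget, recent):
    """Sketch of l2-norm eviction (hypothetical helper, not the paper's code).

    keys:   list over heads, each a list over tokens of key vectors
            (plain lists of floats), i.e. shape [num_heads][T][head_dim]
    budget: total number of tokens to retain (B); assumed > recent
    recent: size of the protected recency window

    Returns the sorted token indices to keep in the cache.
    """
    num_heads = len(keys)
    T = len(keys[0])
    # Eviction condition T > B: below the budget, nothing is dropped.
    if T <= budget:
        return list(range(T))
    # Score each token by the mean l2-norm of its key vector across heads.
    scores = [
        sum(math.sqrt(sum(x * x for x in keys[h][t]))
            for h in range(num_heads)) / num_heads
        for t in range(T)
    ]
    # Always keep the most recent `recent` tokens.
    recent_idx = list(range(T - recent, T))
    # Fill the remaining budget with the highest-norm older tokens.
    older = sorted(range(T - recent), key=lambda t: scores[t], reverse=True)
    return sorted(older[: budget - recent] + recent_idx)
```

A single sort over per-token norms suffices, which is the source of the method's advantage over attention-score accumulation: no statistics need to be tracked across decoding steps.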