KV Cache as a Reasoning Primitive for Long-Context Reasoning
Abstract
Large language models often produce inconsistent answers across multiple related questions when earlier premises are partially forgotten or distorted in long contexts. We argue this is not only a modeling issue but also a working-memory issue: KV cache policy controls which premises remain accessible to attention and thus mediates logical consistency under finite memory. Current practice sits at two extremes: retain everything (wasteful) or evict uniformly (premise-destructive). Both extremes ignore decades of memory-hierarchy results on working sets and locality. We synthesize empirical evidence that attention working sets are sparse and structurally constrained (heavy hitters, attention sinks, layer heterogeneity), implying that premise-preserving retention is achievable. We provide a small proof-of-concept cache manager with content-aware retention and show favorable memory–quality tradeoffs on a premise-retrieval stress test (passkey retrieval). We then propose a “consistency bundle” evaluation protocol for measuring cross-question contradictions as a function of memory policy. Our conclusion is practical: memory policies should be designed and reported as reasoning controls, not just serving optimizations.