RSCE: Training-Free Residual Stream Encoding for Persistent Context Amortization
Abstract
We propose Residual Stream Context Encoding (RSCE), a training-free method that eliminates redundant long-context prefill costs in retrieval-augmented generation. Given a context document ctx, RSCE extracts a vector C ∈ ℝ^{d_M} by mean-pooling residual stream activations at a calibrated intermediate layer f(M), then injects it as an additive shift at query time. This replaces the O(|T(ctx)|) attention prefill with an O(1) operation that requires no per-query forward pass over the context. For tasks requiring factual precision, we pair C with a compact explicit fact block F, forming a dual-channel representation amortized across N ≥ 2 queries. We evaluate five decoder-only architectures (7B–70B) on multi-document QA (LongBench, n = 108) and six architectures on cross-file code completion (RepoBench-C), comparing against LongLLMLingua and EHPC. A key mechanistic finding is that vector injection alone suppresses parametric recall below the question-only baseline. This dual-pathway interference effect is absent in behavioral steering and motivates the dual-channel design. At extreme compression (∼99% token reduction), RSCE Vec+F is competitive with EHPC on smaller architectures (LLaMA-8B: F1 0.333 vs. EHPC 0.334; DeepSeek-14B: both 0.214), while both substantially outperform LongLLMLingua (0.209 and 0.172, respectively). On larger models, EHPC's capacity-scaling token selection widens the gap, reaching F1 0.539 vs. 0.365 for RSCE on LLaMA-70B, a finding we explain through model capacity scaling of in-context reasoning. On RepoBench-C, LongLLMLingua substantially improves over baseline via compression-as-retrieval; RSCE is the only method achieving 81% compression at 100% operational reliability.
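To make the two core operations concrete, the following is a minimal toy sketch of mean-pooling per-token activations into a single vector C and then adding it to query-time hidden states. It is not the paper's implementation: plain Python lists stand in for residual stream activations, the function names `mean_pool` and `inject` are hypothetical, and both the scale factor `alpha` and the choice to shift every query position are assumptions made for illustration. The layer calibration f(M) and the fact block F are not modeled.

```python
from typing import List

Vector = List[float]


def mean_pool(activations: List[Vector]) -> Vector:
    """Average per-token activations (one vector per context token) into one vector."""
    if not activations:
        raise ValueError("need at least one token activation")
    n = len(activations)
    dim = len(activations[0])
    return [sum(tok[i] for tok in activations) / n for i in range(dim)]


def inject(query_states: List[Vector], context_vec: Vector, alpha: float = 1.0) -> List[Vector]:
    """Add the scaled context vector to every query-token state.

    `alpha` and the all-positions injection are assumptions of this sketch.
    """
    return [[h + alpha * c for h, c in zip(state, context_vec)] for state in query_states]


# Toy data: three context tokens and two query tokens, hidden size 2.
ctx_acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = mean_pool(ctx_acts)          # computed once per context document

query = [[0.0, 0.0], [1.0, -1.0]]
shifted = inject(query, C, alpha=0.5)  # reused for each query, independent of context length
```

The point of the sketch is the cost structure: `mean_pool` runs once per document, while `inject` touches only the query states, so its cost does not grow with the number of context tokens.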