Poster in Workshop: ICLR 2026 Workshop on AI with Recursive Self-Improvement

Reasoning Cache: Learning to Extrapolate to Long Lengths via Short-Length RL

Ian Wu ⋅ Yuxiao Qu ⋅ Amrith Setlur ⋅ Aviral Kumar


Abstract

Large Language Models (LLMs) that continue improving at test-time budgets far beyond their training budgets can solve harder problems by leveraging additional inference compute: we refer to this property as extrapolation. Standard on-policy RL operates on fixed problem distributions and training budgets, giving rise to a train-test distribution shift that limits the model's extrapolation capabilities. To address this, we introduce RC, an iterative decoding algorithm replacing standard autoregressive decoding that enables models to extrapolate to lengths an order of magnitude longer than those seen during training. RC exploits the asymmetry between summarization and generation capabilities present in LLMs to construct a decoding process that improves consistently over iterations. Its effectiveness can be further increased through training, which amplifies the model’s ability to perform summary-conditioned reasoning while avoiding the challenges of long-horizon RL. Training a 4B instruction-following model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to 70% when evaluated with a 512k-token test budget, substantially surpassing comparably sized LLMs.
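The iterative decoding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says that RC iterates and relies on summary-conditioned reasoning, so the loop structure, the function names (`iterative_summary_decoding`, `generate`, `summarize`), and the toy stand-ins for the model calls are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def iterative_summary_decoding(
    problem: str,
    generate: Callable[[str, str, int], str],
    summarize: Callable[[str, str, str], str],
    per_iteration_budget: int,
    num_iterations: int,
) -> Tuple[str, List[str]]:
    """Hypothetical loop: reason under a short budget, summarize, repeat.

    `generate(problem, summary, budget)` and `summarize(problem, summary, trace)`
    are placeholders for model calls; their signatures are assumptions.
    """
    summary = ""  # nothing carried over before the first iteration
    trace = ""
    summaries: List[str] = []
    for _ in range(num_iterations):
        # Reason with a short budget, conditioned on the current summary.
        trace = generate(problem, summary, per_iteration_budget)
        # Compress the new reasoning into a summary for the next iteration.
        summary = summarize(problem, summary, trace)
        summaries.append(summary)
    return trace, summaries


# Toy stand-ins so the sketch runs without a model (purely illustrative).
def toy_generate(problem: str, summary: str, budget: int) -> str:
    previous = int(summary) if summary else 0
    return str(previous + 1)[:budget]  # "progress" is a counter; budget caps length


def toy_summarize(problem: str, summary: str, trace: str) -> str:
    return trace  # the toy summary simply carries the latest trace forward


if __name__ == "__main__":
    final, history = iterative_summary_decoding(
        "toy problem", toy_generate, toy_summarize,
        per_iteration_budget=16, num_iterations=3,
    )
    print(final, history)
```

Under the simplifying assumption that every iteration consumes the full 16k-token training budget and summary tokens are ignored, a 512k-token test budget would correspond to roughly 512k / 16k = 32 iterations. The abstract does not state how the test budget is actually divided.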
