Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Abstract
Reinforcement learning (RL) is a human-designed framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges in LLMs at inference time, a phenomenon known as in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, called ICRL prompting, whose goal is to guide LLMs to perform reinforcement learning for self-improvement on a given task. After the LLM generates a response in the current round, we provide scalar numerical feedback on that response, called the reward. In the next round, we prompt the LLM again with the same task and a context consisting of all previous responses and rewards. We observe that the quality of the LLM's responses improves as the context grows. In other words, the LLM is able to maximize the scalar reward signal at inference time, just like an RL algorithm. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Surprisingly, in some experiments the reward signals are generated by the LLM itself, yet ICRL prompting still yields performance improvements, offering a new paradigm for test-time scaling.
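To make the multi-round procedure concrete, the following is a minimal sketch of an ICRL prompting loop. The helpers `query_llm` and `compute_reward` are hypothetical stand-ins (an LLM API call and a source of scalar feedback, which may be an external verifier or the LLM itself); the prompt wording and round budget are illustrative assumptions, not the paper's exact implementation.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (assumed interface)."""
    raise NotImplementedError


def compute_reward(task: str, response: str) -> float:
    """Placeholder scalar feedback; could be a task verifier or the LLM itself."""
    raise NotImplementedError


def icrl_prompting(task: str, num_rounds: int = 10) -> str:
    """Run multi-round ICRL prompting and return the highest-reward response."""
    history: list[tuple[str, float]] = []  # (response, reward) pairs across rounds
    best_response, best_reward = "", float("-inf")

    for _ in range(num_rounds):
        # The context consists of all previous responses and their rewards.
        context = "\n".join(
            f"Previous response:\n{resp}\nReward: {rew}" for resp, rew in history
        )
        prompt = (
            f"Task: {task}\n{context}\n"
            "Generate a new response that achieves a higher reward."
        )
        response = query_llm(prompt)
        reward = compute_reward(task, response)  # scalar feedback for this round
        history.append((response, reward))
        if reward > best_reward:
            best_response, best_reward = response, reward

    return best_response
```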