Learning to Reason Efficiently with Discounted Reinforcement Learning
Alex Ayoub · Kavosh Asadi · Dale Schuurmans · Csaba Szepesvari · Karim Bouyarmane
Abstract
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency, and we challenge the assumption that longer responses yield higher accuracy. We penalize reasoning tokens through a discounted reinforcement-learning setup, where the discount is interpretable as a small per-token cost; in practice, we discount only the environment (correctness) reward. By analyzing Blackwell optimality in restricted policy classes, we show that this setup encourages reasoning that is concise yet accurate. Experiments support the theory: the approach shortens chains of thought while preserving accuracy.
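A minimal sketch of the reward shaping described above: only the terminal correctness reward is discounted, with the exponent set to the number of reasoning tokens emitted before the answer. The helper name `discounted_return`, its signature, and the value gamma = 0.999 are illustrative assumptions, not details from the paper.

```python
def discounted_return(correctness: float,
                      num_reasoning_tokens: int,
                      gamma: float = 0.999) -> float:
    """Discount only the environment (correctness) reward by the number of
    reasoning tokens preceding the answer. For gamma near 1,
    gamma**n ~= 1 - n * (1 - gamma), so the discount acts like a small
    per-token cost on the chain of thought.

    Note: gamma = 0.999 is an assumed illustrative value.
    """
    return (gamma ** num_reasoning_tokens) * correctness


# A correct answer reached in 200 tokens earns more return than the
# same correct answer reached in 2000 tokens:
print(discounted_return(1.0, 200))   # ~0.819
print(discounted_return(1.0, 2000))  # ~0.135
```

Under this shaping, two policies with equal accuracy are separated by response length, which is the mechanism that drives the shorter chains of thought reported in the abstract.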