Learning to Reason Efficiently with Discounted Reinforcement Learning
Abstract
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. More broadly, in goal-reaching sequential decision problems we typically want to reach the goal quickly, and LRM reasoning can be viewed through this lens. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens through a discounted reinforcement learning setup (interpretable as a small per-token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning, analogous to preferring shorter successful trajectories in a stochastic shortest path problem. Experiments confirm our theoretical results: this approach shortens chains of thought while preserving accuracy.
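As a minimal sketch of the discounted objective described above (the notation here is illustrative rather than the paper's exact formulation: $R$ denotes a terminal correctness reward delivered after $T$ reasoning tokens, and $\gamma \in (0,1)$ is the discount factor):
\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\gamma^{T} R\right],
\qquad
\gamma^{T} \;=\; e^{T \ln \gamma} \;\approx\; e^{-(1-\gamma)\,T} \quad \text{for } \gamma \text{ near } 1,
\]
so a correct answer is discounted exponentially in the length of the reasoning that produced it. For $\gamma$ near $1$ the objective therefore behaves like correctness minus a penalty of roughly $1-\gamma$ per generated token, which is the sense in which discounting is "interpretable as a small token cost." Taking $\gamma \to 1$ is also the regime in which Blackwell optimality, i.e., policies that are optimal for all discount factors sufficiently close to $1$, becomes the relevant solution concept.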