Unrolled Policy Iteration for Tiny Recursive Models
Bahram Behzadian ⋅ Brett Daley ⋅ Gopeshh Raaj Subbaraj ⋅ Houssam Nassif
Abstract
We study recursive self-improvement via internal evaluators trained from verifier feedback, focusing on stability when recursive compute is scaled at test time. In plan-editing models such as Tiny Recursive Models (TRMs), an inner evaluator is unrolled for $n$ recurrent steps to produce a value estimate; the unroll depth $n$ is an architectural compute budget. We formalize plan editing as a Markov decision process and analyze approximate policy iteration with truncated internal evaluation. Under a contraction assumption on the inner recursion, which requires the latent update map to have Lipschitz constant $L_z < 1$, we decompose the value error into an architectural Bellman-residual term plus a finite-unrolling bias that decays geometrically as $L_z^{n}$ with depth. For conservative mixture updates that blend the current policy with a candidate at mixing weight $\alpha$, we show that with statewise-centered advantage estimates, evaluation error enters the improvement bound scaled by $\alpha$ rather than at full strength, reducing sensitivity to imperfect evaluation. Experiments on Sudoku ($4 \times 4$ and $9 \times 9$) trained solely from checker feedback validate feasibility and show that contraction strength and latent projection modulate stability under depth mismatch between training and evaluation. The analysis identifies the contraction modulus and unroll depth as practical stability controls for policy improvement under truncated internal computation.
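The geometric decay with unroll depth can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical scalar affine latent update $z \mapsto L_z z + b$, for which the fixed point is known in closed form, and checks that the distance of the $n$-step iterate from the fixed point is bounded by $L_z^{n}$ times the initial distance. It concerns only the latent iterate, not the full value-error decomposition stated in the abstract. The names `unroll`, `update`, and `b` are illustrative assumptions.

```python
def unroll(update, z0, n):
    """Apply the latent update map n times starting from z0."""
    z = z0
    for _ in range(n):
        z = update(z)
    return z


# Hypothetical scalar latent update (an assumption, not the TRM update):
# an affine map z -> L_z * z + b has Lipschitz constant |L_z|,
# so it is a contraction whenever |L_z| < 1.
L_z = 0.5
b = 1.0


def update(z):
    return L_z * z + b


z_star = b / (1.0 - L_z)  # unique fixed point of the affine map
z0 = 0.0                  # initial latent state

depths = range(6)
# Distance from the fixed point after n unrolled steps.
errors = [abs(unroll(update, z0, n) - z_star) for n in depths]
# Contraction bound: L_z**n times the initial distance.
bounds = [L_z ** n * abs(z0 - z_star) for n in depths]

for n, err, bound in zip(depths, errors, bounds):
    print(f"n={n}  error={err:.4f}  bound={bound:.4f}")
```

For an affine map the bound holds with equality; for a general contraction it is only an upper bound, and raising $n$ or lowering $L_z$ tightens it. In this toy setting, those are the two stability controls the abstract names.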