Recurrent Reasoning on Symbolic Puzzles with Sequence Models
Gowrav Mannem ⋅ Chowdhury Mahjabin ⋅ Jason Chen ⋅ Shivank Garg ⋅ Kevin Zhu
Abstract
Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of current reasoning benchmarks is that many primarily test whether a model can produce a valid answer, while paying less attention to whether the solution is minimal, robust, and stable under controlled difficulty scaling. We introduce RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles (Tower of Hanoi, River Crossing, Block World, and Checkers Jumping) with BFS-verified expert trajectories and a single interpretable difficulty parameter $N \in \{1,\dots,10\}$, totalling 10,817 unique puzzles and 280,106 moves. We benchmark two Transformer families, an encoder-decoder model (T5-style) and a decoder-only model (GPT-2-style), under consistent data splits and evaluation criteria, training on $N{=}1$ to $7$ and evaluating on both held-out in-distribution instances and harder out-of-distribution instances at $N{=}8$ to $10$. Fine-tuned pre-trained T5 achieves 97.27\% validation and 81.00\% OOD accuracy on Block World; all models score 0.00\% on River Crossing under all conditions. Failure mode analysis reveals that architecture is a stronger determinant of success than scale; pre-training transfers only to puzzles with locally structured transition functions. Our code and dataset will be open-sourced upon acceptance.
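As an illustration of what a BFS-verified expert trajectory could look like, the following minimal sketch runs breadth-first search over Tower of Hanoi states and returns a shortest move sequence. It assumes that the difficulty parameter $N$ corresponds to the number of disks and that a move is a `(source_peg, target_peg)` pair; neither the state encoding nor the function name `hanoi_bfs` comes from the paper, and the benchmark's actual generator may differ.

```python
from collections import deque


def hanoi_bfs(n, pegs=3):
    """Return a shortest list of (src, dst) moves taking n disks from peg 0 to the last peg.

    Hypothetical helper for illustration; the state encoding is an assumption.
    """
    # state[d] is the peg currently holding disk d (disk 0 is the smallest).
    start = (0,) * n
    goal = (pegs - 1,) * n
    parent = {start: None}  # state -> (previous state, move), used to rebuild the path
    queue = deque([start])

    while queue:
        state = queue.popleft()
        if state == goal:
            break
        # Find the top (smallest) disk on each non-empty peg.
        tops = {}
        for disk in range(n - 1, -1, -1):
            tops[state[disk]] = disk
        for src, disk in tops.items():
            for dst in range(pegs):
                if dst == src:
                    continue
                # A disk may not be placed on top of a smaller disk.
                if dst in tops and tops[dst] < disk:
                    continue
                nxt = state[:disk] + (dst,) + state[disk + 1:]
                if nxt not in parent:
                    parent[nxt] = (state, (src, dst))
                    queue.append(nxt)

    # Walk the parent links back from the goal to recover the move sequence.
    moves = []
    state = goal
    while parent[state] is not None:
        state, move = parent[state]
        moves.append(move)
    moves.reverse()
    return moves
```

Because BFS explores states in order of distance from the start, the returned trajectory is guaranteed to be minimal; for three pegs its length matches the known optimum of $2^N - 1$ moves, which gives a simple check on any candidate solution's minimality.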