Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

Recurrent Reasoning on Symbolic Puzzles with Sequence Models

Gowrav Mannem ⋅ Chowdhury Mahjabin ⋅ Jason Chen ⋅ Shivank Garg ⋅ Kevin Zhu


Abstract

Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of current reasoning benchmarks is that many primarily test whether a model can produce a valid answer, while paying less attention to whether the solution is minimal, robust, and stable under controlled difficulty scaling. We introduce RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles (Tower of Hanoi, River Crossing, Block World, and Checkers Jumping) with BFS-verified expert trajectories and a single interpretable difficulty parameter $N \in \{1,\dots,10\}$, totalling 10,817 unique puzzles and 280,106 moves. We benchmark two Transformer families, an encoder-decoder model (T5-style) and a decoder-only model (GPT-2-style), under consistent data splits and evaluation criteria, training on $N{=}1$ to $7$ and evaluating on both held-out in-distribution instances and harder out-of-distribution instances at $N{=}8$ to $10$. Fine-tuned pre-trained T5 achieves 97.27% validation and 81.00% OOD accuracy on Block World; all models score 0.00% on River Crossing under all conditions. Failure mode analysis reveals that architecture is a stronger determinant of success than scale; pre-training transfers only to puzzles with locally structured transition functions. Our code and dataset will be open-sourced upon acceptance.
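
As a rough illustration of what a BFS-verified expert trajectory could look like for one of the four puzzles, the sketch below runs breadth-first search over Tower of Hanoi states and returns a shortest move sequence. The state encoding, the `(src, dst)` move format, and the function name `hanoi_bfs` are assumptions made for this example; the abstract does not describe the benchmark's actual representation or generation code.

```python
from collections import deque


def hanoi_bfs(n, pegs=3):
    """Return a shortest list of (src, dst) moves taking n disks from peg 0 to the last peg.

    Hypothetical helper for illustration only; not the benchmark's own code.
    """
    # A state is a tuple of pegs; each peg lists its disks from bottom to top.
    full = tuple(range(n, 0, -1))
    start = (full,) + ((),) * (pegs - 1)
    goal = ((),) * (pegs - 1) + (full,)

    # parent maps each visited state to (previous state, move that reached it).
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        for src in range(pegs):
            if not state[src]:
                continue  # nothing to move from an empty peg
            disk = state[src][-1]
            for dst in range(pegs):
                # Skip no-op moves and moves onto a smaller disk.
                if dst == src or (state[dst] and state[dst][-1] < disk):
                    continue
                nxt = list(state)
                nxt[src] = state[src][:-1]
                nxt[dst] = state[dst] + (disk,)
                nxt = tuple(nxt)
                if nxt not in parent:
                    parent[nxt] = (state, (src, dst))
                    queue.append(nxt)

    # Walk back from the goal to recover the trajectory in forward order.
    moves = []
    state = goal
    while parent[state] is not None:
        state, move = parent[state]
        moves.append(move)
    return moves[::-1]


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, len(hanoi_bfs(n)))
```

Because BFS explores states in order of distance from the start, the first path it finds to the goal is a shortest one, which is the property needed to check whether a model's solution is minimal rather than merely valid. For three pegs the returned length should match the known optimum of $2^N - 1$ moves.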
