Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
Abstract
Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this in a controlled one-dimensional cellular automata (1dCA) framework that excludes memorization by using disjoint train/test rules. Models are trained on short state sequences, must infer the hidden local rule, and then chain it for multiple future steps. We find that most neural architectures learn the rule and achieve high next-step accuracy, but performance drops sharply as the required number of steps increases. Increasing model depth is crucial, and extending effective depth via recurrence, memory, or test-time compute improves results, but the gains remain bounded. Complementing these controlled experiments, a natural-language proxy game shows that contemporary LLMs largely fail on the complex setting. Together, these results separate genuine rule induction from memorization, quantify how difficulty scales with reasoning depth, and highlight the joint roles of architecture and training/inference procedures.
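
To make the task concrete, the following is a minimal sketch of a 1dCA setup of the kind described above: a hidden local rule is applied to a state, chained for several steps, and the rule set is split into disjoint train and test parts. The abstract does not specify these details, so everything here is an assumption for illustration: binary states, a neighbourhood radius of 1, periodic boundaries, Wolfram-style rule numbering, a 20% test split, and the helper names `rule_from_number`, `step`, `rollout`, and `split_rules`.

```python
import random


def rule_from_number(number, radius=1):
    """Build a lookup table for a binary local rule (Wolfram-style numbering).

    Assumption: bit i of `number` is the output for the neighbourhood whose
    cells, read left to right, form the binary number i.
    """
    size = 2 ** (2 * radius + 1)
    return [(number >> i) & 1 for i in range(size)]


def step(state, rule_table, radius=1):
    """Apply the local rule once, assuming periodic (wrap-around) boundaries."""
    n = len(state)
    out = []
    for i in range(n):
        idx = 0
        # Read the neighbourhood left to right as a binary index.
        for d in range(-radius, radius + 1):
            idx = (idx << 1) | state[(i + d) % n]
        out.append(rule_table[idx])
    return out


def rollout(state, rule_table, steps, radius=1):
    """Chain the same rule for `steps` steps (the multi-step target)."""
    for _ in range(steps):
        state = step(state, rule_table, radius)
    return state


def split_rules(num_rules=256, test_fraction=0.2, seed=0):
    """Split rule numbers into disjoint train/test sets (hypothetical split)."""
    rules = list(range(num_rules))
    random.Random(seed).shuffle(rules)
    cut = int(num_rules * test_fraction)
    return rules[cut:], rules[:cut]


if __name__ == "__main__":
    train_rules, test_rules = split_rules()
    table = rule_from_number(test_rules[0])
    start = [random.randint(0, 1) for _ in range(16)]
    # Next-step target versus a deeper, four-step target for the same rule.
    print(rollout(start, table, steps=1))
    print(rollout(start, table, steps=4))
```

Under these assumptions, a model evaluated on `test_rules` never sees those rules during training, so it cannot succeed by recalling a rule table; it has to infer the rule from the observed states. Raising `steps` in `rollout` corresponds to increasing the required reasoning depth.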