TopoBench: Benchmarking LLMs on hard topological reasoning
Abstract
Solving topology grid puzzles requires maintaining global spatial invariants such as connectivity, loop closure, and region symmetry, and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To understand the failure modes, we annotate 750 chain-of-thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions that simulate each error type. These interventions show that certain error patterns, such as premature commitment (going down a wrong solution path early on) and constraint forgetting (making moves that violate rules), directly impair the ability to solve the puzzle, while repeated reasoning (retrying the same reasoning path without meaningful variation) is a benign effect of search. Finally, we study mitigation strategies aimed at reducing these error patterns, including prompt guidance, cell-aligned grid representations, and tool-based constraint checking, and find that the bottleneck lies in extracting constraints from spatial representations rather than in reasoning over them.