Constrained Wikigame: Benchmarking Deductive Reasoning for Multi-Step Planning
Abstract
Benchmarks of LLMs on multi-step planning tasks typically rely on final-answer accuracy, so the evaluation fails to distinguish correct reasoning from lucky outcomes. We introduce Constrained Wikigame, a benchmark that extends the classic Wikigame (navigating Wikipedia from a source article to a target article via hyperlinks) with category constraints. This addition transforms a task where memorization and shortest-path heuristics may drive success into a step-level deduction task, because each decision involves explicitly justifying its consistency with the constraint. We benchmark a suite of frontier reasoning and thinking models using both outcome-level metrics (success rate, constraint violations, and path efficiency) and reasoning validity, directly testing whether extended reasoning translates into reliable constrained planning.
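To make the outcome-level metrics concrete, the following is a minimal sketch of how a single navigation path might be scored. It is illustrative only and rests on several assumptions not stated above: the constraint is modeled as a single forbidden category, path efficiency is taken as the ratio of the shortest constraint-respecting hop count to the hops actually used, and the toy graph, the category labels, and all function names are hypothetical.

```python
from collections import deque

# Toy link graph and category labels, invented purely for illustration.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "E": ["D"], "D": []}
CATEGORIES = {
    "A": {"science"}, "B": {"politics"}, "C": {"science"},
    "D": {"science"}, "E": {"science"},
}


def is_valid_walk(path, links):
    """True if every consecutive pair in the path is an existing hyperlink."""
    return all(b in links.get(a, []) for a, b in zip(path, path[1:]))


def count_violations(path, categories, forbidden):
    """Count visited pages (after the source) that carry the forbidden category."""
    return sum(1 for page in path[1:] if forbidden in categories.get(page, set()))


def shortest_constrained_hops(links, categories, source, target, forbidden):
    """Breadth-first search that never steps onto a forbidden-category page."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, hops = queue.popleft()
        if page == target:
            return hops
        for nxt in links.get(page, []):
            if nxt in seen or forbidden in categories.get(nxt, set()):
                continue
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return None  # no constraint-respecting route exists


def evaluate_path(path, links, categories, source, target, forbidden):
    """Score one path: target reached, violation count, assumed efficiency ratio."""
    reached = (
        bool(path)
        and path[0] == source
        and path[-1] == target
        and is_valid_walk(path, links)
    )
    violations = count_violations(path, categories, forbidden)
    best = shortest_constrained_hops(links, categories, source, target, forbidden)
    hops = len(path) - 1
    # Efficiency is only credited to successful, violation-free paths (an assumption).
    if reached and violations == 0 and best is not None and hops > 0:
        efficiency = best / hops
    else:
        efficiency = 0.0
    return {"reached_target": reached, "violations": violations, "efficiency": efficiency}


if __name__ == "__main__":
    for candidate in (["A", "C", "D"], ["A", "B", "D"], ["A", "C", "E", "D"]):
        print(candidate, evaluate_path(candidate, LINKS, CATEGORIES, "A", "D", "politics"))
```

Under these assumptions, a path that reaches the target through a forbidden page still counts as reaching it but records a violation, which keeps the success, violation, and efficiency signals separate. Reasoning validity, the step-level measure, is not captured by this sketch.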