Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

Constrained Wikigame: Benchmarking Deductive Reasoning for Multi-Step Planning

Rafael Mosquera-Gómez ⋅ Juan Rodriguez ⋅ Martin Velez ⋅ Ivan de Sousa ⋅ Juan Jaramillo


Abstract

Benchmarking LLMs on multi-step planning tasks typically relies on final-answer accuracy, so the evaluation cannot distinguish correct reasoning from lucky outcomes. We introduce Constrained Wikigame, a benchmark that extends the classic Wikigame (navigating Wikipedia from a source article to a target article via hyperlinks) by adding category constraints. This addition turns a task in which memorization and shortest-path heuristics may drive success into a step-level deduction task, because each decision requires explicitly justifying that it is consistent with the constraint. We benchmark a suite of frontier reasoning and thinking models using both outcome-level metrics (success rate, constraint violations, and path efficiency) and reasoning validity, directly testing whether extended reasoning translates into reliable constrained planning.
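To make the outcome-level metrics concrete, the following is a minimal sketch of how a single navigation path might be scored. The abstract does not define the constraint format or the metric formulas, so everything here is an assumption for illustration: the toy link graph `LINKS`, the labels in `CATEGORIES`, a constraint of the form "never visit an article in a forbidden category", and path efficiency taken as the shortest constraint-respecting click count divided by the actual click count. Reasoning validity is not modeled.

```python
from collections import deque

# Hypothetical toy link graph and category labels (not taken from the benchmark).
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}
CATEGORIES = {
    "A": {"science"},
    "B": {"politics"},
    "C": {"science"},
    "D": {"politics"},
    "E": {"science"},
    "F": {"science"},
}


def violates(article, forbidden):
    """Assumed constraint: an article violates it if it belongs to the forbidden category."""
    return forbidden in CATEGORIES.get(article, set())


def shortest_valid_length(source, target, forbidden):
    """Breadth-first search that skips forbidden articles; returns clicks or None."""
    if violates(source, forbidden) or violates(target, forbidden):
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in LINKS.get(node, []):
            if nxt not in seen and not violates(nxt, forbidden):
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def score_path(path, source, target, forbidden):
    """Score one path: success flag, violation count, and assumed efficiency ratio."""
    # Every consecutive pair must be joined by an actual hyperlink.
    links_ok = all(b in LINKS.get(a, []) for a, b in zip(path, path[1:]))
    reached = bool(path) and path[0] == source and path[-1] == target and links_ok
    violations = sum(violates(article, forbidden) for article in path)
    success = reached and violations == 0
    clicks = len(path) - 1
    optimal = shortest_valid_length(source, target, forbidden)
    if success and optimal is not None and clicks > 0:
        efficiency = optimal / clicks
    else:
        efficiency = 0.0
    return {"success": success, "violations": violations, "efficiency": efficiency}


good = score_path(["A", "C", "E", "F"], "A", "F", "politics")
bad = score_path(["A", "B", "D", "F"], "A", "F", "politics")
```

In this toy setup, the path through `B` and `D` reaches the target in the same number of clicks but visits two forbidden articles, so it counts as a failure under the assumed constraint. A plain final-answer check would treat the two paths as equivalent.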
