RHIM: Benchmarking Redundant Hypothesis Identification Reveals Systematic Gaps in LLM Logical Reasoning
Abstract
Identifying and removing redundant hypotheses is fundamental to mathematical discovery, yet the ability of large language models to perform this reasoning remains unexplored. We introduce RHIM (Redundant Hypothesis Identification in Mathematics), to the best of our knowledge the first benchmark for evaluating redundant hypothesis detection in mathematical proof problems, comprising 200 problems with verified ground truth. Through experiments with state-of-the-art models, including DeepSeek-Reasoner, Gemini-2.5-Flash, and GPT-5.2, we reveal critical failures across three hierarchical tasks: detection (false alarm rates of 38–99.5\%), identification (accuracy of 32–64\%, at the low end barely above the 25.8\% random baseline that results from the variable number of hypotheses per problem), and verification (acceptance of logically invalid proofs in 23–34.5\% of cases). These results demonstrate that proof generation ability does not imply an understanding of the logical dependencies between assumptions and conclusions, a capability essential for rigorous mathematical reasoning and theorem refinement.