Scaling Reasoning Depth Reveals Three Tiers of Failure in Multi-Model Mathematical Deduction
Abstract
We present a diagnostic analysis of mathematical reasoning failures across three reasoning-specialized language models on 15 competition-level problems. Through detailed trace analysis of 45 independent reasoning attempts and 90 cross-model verifications (each of 3 models verifying each other model's solutions across 15 problems), we identify a three-tier taxonomy of failure modes that current evaluation methodologies routinely conflate: capacity failures (correct reasoning approaches truncated by generation limits, accounting for 8 of 11 errors in our sample; 95\% CI: 48\%-91\%), correlated deductive failures (genuine logical errors where models independently converge on the same wrong answer through a shared invalid reasoning step, 2 of 11 in our sample; 95\% CI: 2\%-52\%), and meta-cognitive override (a model that correctly refutes its own candidate answer and submits it anyway, documented in a controlled extended-context experiment). We demonstrate that majority voting across models corrected zero errors across our dataset because each failure tier induces distinct correlation structures: capacity failures produce correlated truncation artifacts, and reasoning failures reflect shared deductive blind spots that verification cannot detect. Furthermore, we reveal that cross-model verification actively amplifies these correlated errors. Rather than functioning as an independent logical auditor, the verification phase suffers from severe meta-reasoning pathologies, including problem mutation and training data injection, achieving a 3/3 false-negative rate on the one problem where all models answered correctly (P12). Our findings suggest that failure-tier-aware evaluation is required to accurately assess and improve logical consistency in large language models.