When Shallow Wins: Silent Failures and the Depth–Accuracy Paradox in Latent Reasoning
Subramanyam Sahoo ⋅ Aman Chadha ⋅ Vinija Jain ⋅ Divya Chaudhary
Abstract
Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61\% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4\% of correct predictions employ stable, faithful reasoning, while 81.6\% emerge through computationally inconsistent pathways. Additionally, 8.8\% of all predictions are silent failures, that is, confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows a weak negative correlation with correctness ($r=-0.21$, $p=0.002$), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (a 4.7$\times$ increase) provides zero accuracy benefit on our evaluated subset (6\% of GSM8K), a result that requires validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with $\approx$20\% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms that measure stability beyond single-sample metrics.
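The abstract mixes two denominators: the 18.4\% and 81.6\% figures are shares of correct predictions, while the 61\% and 8.8\% figures are shares of all predictions. The short Python sketch below re-expresses everything as a share of all predictions. It is an illustration, not the authors' analysis code: the function and constant names are hypothetical, and it assumes that silent failures are a subset of the incorrect predictions.

```python
# Figures quoted in the abstract, expressed as fractions.
ACCURACY = 0.61               # correct predictions / all predictions
STABLE_GIVEN_CORRECT = 0.184  # stable, faithful reasoning / correct predictions
SILENT_FAILURES = 0.088       # confident-but-incorrect / all predictions
PARAMS_SMALL_B = 1.5          # smaller model size, billions of parameters
PARAMS_LARGE_B = 7.0          # larger model size, billions of parameters


def prediction_breakdown():
    """Re-express the abstract's figures as shares of all predictions.

    Assumption (not stated in the source): silent failures are a subset of
    the incorrect predictions, so the four shares below partition the total.
    """
    stable_correct = ACCURACY * STABLE_GIVEN_CORRECT
    unstable_correct = ACCURACY - stable_correct
    other_incorrect = 1.0 - ACCURACY - SILENT_FAILURES
    return {
        "stable_correct": stable_correct,
        "unstable_correct": unstable_correct,
        "silent_failures": SILENT_FAILURES,
        "other_incorrect": other_incorrect,
    }


def scale_ratio():
    """Parameter-count ratio between the two model sizes."""
    return PARAMS_LARGE_B / PARAMS_SMALL_B


if __name__ == "__main__":
    for name, share in prediction_breakdown().items():
        print(f"{name}: {share:.1%}")
    print(f"parameter ratio: {scale_ratio():.1f}x")
```

Under these assumptions, roughly 11\% of all predictions are both correct and stably reasoned, and about 50\% are correct but reached through inconsistent pathways. The parameter ratio $7/1.5 \approx 4.67$ rounds to the quoted 4.7$\times$.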