Where Deduction Breaks Down: Diagnosing Reasoning Failures in Research-Level Mathematics
Abstract
Frontier AI systems have achieved gold-medal performance on mathematical olympiads, yet the logical reasoning capabilities underlying these successes remain poorly understood. We introduce MILLENNIUM-BENCH, a benchmark of expert-curated, research-level mathematics problems spanning nine domains, including mathematical physics, topology, and measure-theoretic probability, that demand sustained multi-step deductive reasoning far beyond competition mathematics. Evaluating five frontier models over 100 independent runs per problem, we find that even the strongest model achieves at most 24% pass@1 and 39% pass@3, revealing a substantial gap between pattern-driven problem solving and rigorous deduction. Beyond aggregate pass rates, we present a qualitative analysis of logical reasoning failures at the research-mathematics frontier. Through expert-annotated case studies, we identify two main failure modes: framework hallucination, in which models invent inapplicable theoretical frameworks and reason coherently within them, and theorem fabrication, in which models cite nonexistent results, complete with invented author names and theorem numbers. These failures are not merely computational; they reflect fundamental limitations in how current models perform formal deductive reasoning: the models construct logically valid arguments from false premises.