DEDUCTIVE CONSTRAINT SATISFACTION VS. PREVALENCE PRIORS: BENCHMARKING LLM LOGIC IN CLINICAL DIAGNOSTICS
Abstract
Large language models are increasingly evaluated for complex decision-making, yet their ability to maintain logical invariance when deductive constraints conflict with statistical priors remains poorly characterized. We introduce DiagRare, a structured deductive benchmark of 696 clinical vignettes grounded in an expert-curated biomedical ontology of 58 diseases. By explicitly controlling present (modus ponens) and absent (modus tollens) evidence, DiagRare isolates logical constraint satisfaction from associative pattern matching. Evaluating five frontier LLMs yields Top-1 accuracies ranging from 31.6% to 100% and exposes three phenomena. First, the weakest model exhibits prevalence-driven mode collapse: it assigns a single common diagnosis to 278 of its 476 errors, which span 45 unrelated true diseases, and its diagnostic entropy collapses to 3.33 of a possible 5.86 bits. Second, mid-tier reasoning models exhibit an inductive downgrade: Claude Sonnet 4.6 achieves 99.7% Top-3 accuracy on rare diseases but only 86.3% Top-1, indicating that prevalence priors corrupt the final ranking of correctly deduced candidates. Third, we establish a deductive asymmetry: increasing the number of explicit negative constraints monotonically improves Claude's accuracy from 84.5% to 97.0%, whereas increasing the number of positive findings degrades it, demonstrating that modus tollens binds statistical priors more effectively than modus ponens.
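For readers unfamiliar with the entropy figures quoted above, the following is a minimal sketch of how such a statistic could be computed. It assumes that diagnostic entropy is the Shannon entropy, in bits, of the empirical distribution of a model's Top-1 predictions; the abstract does not spell out the definition, so this is an illustration rather than the benchmark's own scoring code. The function names `diagnostic_entropy` and `max_entropy` are hypothetical.

```python
import math
from collections import Counter


def diagnostic_entropy(predictions):
    """Shannon entropy (bits) of the empirical distribution of predicted labels.

    Assumption: `predictions` is a sequence of Top-1 diagnosis labels,
    one per vignette. This definition is illustrative only.
    """
    counts = Counter(predictions)
    total = sum(counts.values())
    # Sum over labels that actually occur, so log2(0) never arises.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def max_entropy(num_classes):
    """Entropy (bits) of a uniform distribution over `num_classes` labels."""
    return math.log2(num_classes)
```

Under this reading, a uniform spread of predictions over 58 diseases gives log2(58), about 5.86 bits, which matches the upper figure in the abstract, while a model that always returns the same diagnosis scores 0 bits.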