ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale
Noel Thomas
Abstract
Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 first-order logic (FOL) predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime transition reasoning remains near-random (Matthews correlation coefficient, MCC=0.05) even for frontier models, while FOL deduction with given premises reaches MCC=0.52. Per-family decomposition shows that the proprietary advantage concentrates on cross-indicator ($\Delta$MCC=+0.40) and consistency tasks, while the open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, which confusion matrix analysis confirms as systematic anti-correlation.
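To illustrate why the abstract reports MCC rather than accuracy, the following minimal sketch computes both from a binary confusion matrix. It uses the standard definition of the Matthews correlation coefficient; the function names and all counts are hypothetical and chosen only for illustration, not taken from the benchmark or its evaluation code.

```python
import math


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of correct answers in a binary confusion matrix."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient (standard definition).

    Returns 0.0 when a row or column of the confusion matrix is empty,
    a common convention for the otherwise undefined 0/0 case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical "prior collapse": the model answers TRUE to every question
# on a set that is 70% TRUE. Accuracy looks acceptable, MCC shows no signal.
collapsed = dict(tp=70, fp=30, tn=0, fn=0)

# Hypothetical anti-correlated model: wrong more often than right on both
# classes, which yields a negative MCC.
anti = dict(tp=20, fp=30, tn=20, fn=30)

for name, cm in [("prior collapse", collapsed), ("anti-correlated", anti)]:
    print(f"{name}: accuracy={accuracy(**cm):.2f}, MCC={mcc(**cm):.2f}")
```

In this toy setup the always-TRUE model reaches 70% accuracy with an MCC of zero, while the second model's negative MCC is visible directly in its confusion matrix: the off-diagonal counts outweigh the diagonal ones.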