Logical Consistency Under Pressure: Probing and Repairing Cross-Query Contradictions in LLMs
Abstract
Large language models answer individual logic questions with reasonable accuracy, yet frequently contradict themselves across logically related queries -- affirming a conditional while denying its contrapositive, or endorsing a transitive chain while rejecting the implied conclusion. We introduce ConsistencyBench, a benchmark of 493 logically entailed question sets (1,904 questions) spanning six categories of formal and commonsense reasoning, designed to measure cross-query logical consistency. We evaluate 18 frontier LLMs -- including GPT-5.2, GPT-4.1, Claude Opus 4.6, Gemini 2.5 Pro, DeepSeek-R1, o3, and Qwen 2.5 72B -- and find that even the strongest model (GPT-4.1) achieves only 46.7% set-level consistency despite 83.0% individual accuracy. Across all models tested, the gap between individual accuracy and set-level consistency is 36-57 percentage points. We propose Consistency-Guided Decoding (CGD), a training-free, model-agnostic inference-time method that detects and repairs cross-query contradictions via NLI-based checking. Across 17 models, CGD improves set-level consistency by +6.6pp on average (up to +19.7pp for GPT-4o) and also improves individual accuracy by +2.8pp on average, demonstrating that cross-query consistency is a tractable target for inference-time intervention.
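To make the distinction between individual accuracy and set-level consistency concrete, the following minimal sketch computes both on toy data. It rests on assumptions not stated in the abstract: the `QuestionSet` schema and the function names are hypothetical, answers are treated as yes/no, and a set is counted as consistent only when every answer in it matches the entailed reference label. The benchmark's actual scoring rule may differ.

```python
from dataclasses import dataclass


@dataclass
class QuestionSet:
    """One group of logically related yes/no questions (hypothetical schema)."""
    gold: list[bool]       # reference answers entailed by the set's premises
    predicted: list[bool]  # the model's answers, in the same order


def individual_accuracy(sets: list[QuestionSet]) -> float:
    """Fraction of all questions answered correctly, ignoring set boundaries."""
    total = sum(len(s.gold) for s in sets)
    correct = sum(g == p for s in sets for g, p in zip(s.gold, s.predicted))
    return correct / total


def set_level_consistency(sets: list[QuestionSet]) -> float:
    """Fraction of sets in which every answer is correct.

    Assumption: "consistent" is simplified here to "all answers match gold".
    """
    consistent = sum(
        all(g == p for g, p in zip(s.gold, s.predicted)) for s in sets
    )
    return consistent / len(sets)


# Toy data: four sets of four questions; two sets contain a single error each.
toy_sets = [
    QuestionSet([True, True, False, True], [True, True, False, True]),
    QuestionSet([True, False, True, True], [True, False, True, False]),
    QuestionSet([False, True, True, False], [False, True, False, False]),
    QuestionSet([True, True, True, False], [True, True, True, False]),
]

accuracy = individual_accuracy(toy_sets)        # 14 of 16 questions correct
consistency = set_level_consistency(toy_sets)   # 2 of 4 sets fully correct
gap_pp = (accuracy - consistency) * 100         # gap in percentage points

print(f"accuracy={accuracy:.3f} consistency={consistency:.3f} gap={gap_pp:.1f}pp")
```

Under this simplified definition, a single wrong answer invalidates its whole set, which is why two errors out of sixteen questions are enough to halve the set-level score in the toy data.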