Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

Entailment Closure Failures in Large Language Models: A Benchmark for Cross-Query Logical Consistency

Ben Jenkins


Abstract

Large language models (LLMs) are increasingly deployed as implicit knowledge bases, yet their logical consistency across independent queries remains poorly understood. Existing benchmarks evaluate reasoning within a single prompt, neglecting whether an LLM's aggregate commitments satisfy basic properties from classical logic. We introduce ECF-Bench, a benchmark that systematically audits LLMs for entailment-closure failures: cases in which a model affirms a set of premises across separate queries but denies their logically necessary conclusions. ECF-Bench comprises 3,200 test suites spanning propositional logic, first-order taxonomic reasoning, and multi-hop inference chains, with ground-truth labels certified by the Z3 SMT solver. We evaluate seven LLMs and find that all models exhibit substantial closure violations, with failure rates ranging from 17% to 58% depending on reasoning depth and logical structure. Strikingly, models that achieve high single-query accuracy still violate entailment closure at alarming rates, revealing a fundamental gap between local reasoning competence and global logical coherence. We further show that chain-of-thought prompting reduces but does not eliminate these failures. We also test two lightweight mitigations (query self-consistency and recap-conditioned conclusion prompts that surface prior premise text), which cut overall violation rates roughly in half for top models (e.g., GPT-4 from 18.7% to 9.1% CVR) while making explicit where strict query independence is relaxed. Our results highlight the need for cross-query consistency as a first-class evaluation criterion for LLM reasoning.
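The notion of an entailment-closure failure described in the abstract can be illustrated with a minimal sketch. This is not the ECF-Bench implementation: the abstract states that ground-truth labels are certified with the Z3 SMT solver, whereas the sketch below substitutes brute-force truth-table enumeration, which only covers small propositional cases. All function names, the suite layout, and the violation-rate definition (violated suites divided by all suites) are assumptions made for illustration; the abstract does not spell out how CVR is computed.

```python
from itertools import product

# Assumption: a formula is modeled as a function from an assignment
# (dict mapping variable name -> bool) to bool.

def entails(premises, conclusion, variables):
    """Truth-table check: every assignment that satisfies all premises
    must also satisfy the conclusion. (Stand-in for a Z3 query.)"""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample found, so no entailment
    return True

def is_closure_violation(premise_answers, conclusion_answer):
    """Hypothetical per-suite check: the model affirmed every premise in
    separate queries but denied the logically entailed conclusion."""
    return all(premise_answers) and not conclusion_answer

def violation_rate(suites):
    """Hypothetical rate: violated suites / all suites.
    Each suite is (premise_answers, conclusion_answer)."""
    if not suites:
        return 0.0
    violated = sum(is_closure_violation(p, c) for p, c in suites)
    return violated / len(suites)

# Modus ponens: from p and (p -> q), q follows.
p = lambda env: env["p"]
q = lambda env: env["q"]
p_implies_q = lambda env: (not env["p"]) or env["q"]

valid = entails([p, p_implies_q], q, ["p", "q"])    # True
invalid = entails([p_implies_q, q], p, ["p", "q"])  # False (affirming the consequent)

# Illustrative model answers for four suites whose conclusions are entailed.
example_suites = [
    ([True, True], False),   # all premises affirmed, conclusion denied: violation
    ([True, True], True),    # consistent
    ([True, False], False),  # a premise was denied, so not counted as a violation
    ([True, True], True),    # consistent
]
rate = violation_rate(example_suites)  # 0.25
```

In this sketch only the first suite counts as a violation, since it is the only one where every premise was affirmed and the conclusion was denied. Whether suites with a denied premise belong in the denominator is a design choice that the abstract leaves open.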
