AGM-Bench: Do Large Language Models Revise Beliefs Rationally?
Abstract
Large language models (LLMs) are increasingly deployed in settings that require updating conclusions as new information arrives, from multi-turn dialogue to agentic workflows with evolving evidence. Yet virtually all evaluations of LLM logical reasoning focus on static problems: given fixed premises, derive a conclusion. We introduce AGM-Bench, the first benchmark grounded in the AGM theory of belief revision, which tests whether LLMs update their beliefs rationally when confronted with new, potentially contradictory information. AGM-Bench operationalizes six classical rationality postulates, namely Success, Consistency, Inclusion, Vacuity, Extensionality, and Preservation, as well as the Darwiche–Pearl postulates for iterated revision, across 2,400 synthetic reasoning scenarios of controlled logical complexity. We evaluate seven frontier LLMs and find that (1) all models satisfy Success and Consistency at high rates, but systematically violate Inclusion (minimal change) and Preservation (stability of unrelated beliefs); (2) under iterated revision, models exhibit severe belief inertia (retaining retracted information) and collateral damage (retracting beliefs not logically affected by the new evidence); and (3) reasoning-trained models (o3-mini, DeepSeek-R1) show improved single-step revision but degrade faster under iteration than standard chat models. Our results reveal a fundamental gap between LLM reasoning and rational belief dynamics.
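To make the single-step postulates concrete, the following is a minimal Python sketch of how four of them could be checked on one revision step. It is an illustration under strong simplifying assumptions, not the benchmark's implementation: beliefs are restricted to propositional literals with no rules between them (so logical closure is trivial), negation is written as a leading `~`, and the names `negate`, `check_postulates`, and `collateral_damage` are hypothetical. In this restricted setting every belief other than the negation of the new information is logically unaffected by it, which allows a simple reading of belief inertia and collateral damage.

```python
def negate(lit: str) -> str:
    """Return the complementary literal; negation is a leading '~' (assumed encoding)."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def is_consistent(beliefs: frozenset) -> bool:
    """A set of literals is consistent iff it never holds both p and ~p."""
    return all(negate(b) not in beliefs for b in beliefs)


def check_postulates(before, new: str, after) -> dict:
    """Check one revision step (before * new = after) over literal-only belief sets.

    Simplification: with no rules, expansion K + phi is just the union K | {phi}.
    """
    before, after = frozenset(before), frozenset(after)
    expanded = before | {new}
    conflict = negate(new) in before
    return {
        # Success: the new information is accepted.
        "success": new in after,
        # Consistency: the revised set is consistent (a single literal always is).
        "consistency": is_consistent(after),
        # Inclusion: revision adds nothing beyond what expansion would add.
        "inclusion": after <= expanded,
        # Vacuity: if the new information does not conflict, nothing is given up.
        "vacuity": conflict or expanded <= after,
    }


def collateral_damage(before, new: str, after) -> frozenset:
    """Beliefs dropped although the new literal does not contradict them."""
    return frozenset(before) - {negate(new)} - frozenset(after)


# Hypothetical scenario: the model is told the bird does not fly.
before = {"bird", "flies", "lives_in_zoo"}
after = {"bird", "~flies"}  # the model's reported beliefs after revision
report = check_postulates(before, "~flies", after)
dropped = collateral_damage(before, "~flies", after)
```

In this example all four checks in `report` pass, yet `dropped` contains `lives_in_zoo`: the step satisfies the classical postulates while still retracting an unrelated belief. Belief inertia would appear as `flies` remaining in `after`, which fails the consistency check.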