Beyond Clause Count: A Study of Proof-Relevant Difficulty in LLM SAT Reasoning
Abstract
SAT has recently emerged as a controlled setting for evaluating logical reasoning in large language models (LLMs). However, existing SAT-oriented benchmarks mainly vary surface-level properties such as formula size (e.g., clause count and variable count), phase-transition regime, or question format, but these axes do not directly capture refutational difficulty on UNSAT instances. We propose proof complexity as a theory-grounded lens for diagnosing LLM failures on UNSAT reasoning and instantiate this perspective in a preliminary study using clause-controlled Tseitin formulas together with two auxiliary control families, implication-chain formulas and random 3-CNF instances. Using direct CNF input, we find that aggregate accuracy often hides strong label-specific bias, and that structurally different Tseitin families can remain substantially different in difficulty even when clause budgets are matched. These results suggest that surface statistics such as clause count and variable count are insufficient for understanding LLM performance on satisfiability judgments, and motivate broader evaluations of proof-relevant hardness in formal reasoning benchmarks.