Interventional Grounding Audits: Black-Box Premise-Dependency Tests for LLM Chain-of-Thought via Predicate Substitution
Abstract
Large language models produce chain-of-thought (CoT) reasoning that appears logically sound yet may not actually depend on its stated premises. We introduce interventional grounding audits, a black-box, step-level test of premise dependency: we intervene on a single premise by substituting its target predicate with a fresh symbol, re-run the model, and check whether each reasoning step's normalized conclusion (its canonical predicate form) changes. We evaluate on ProntoQA, a synthetic multi-hop deductive reasoning benchmark whose gold proof trees make step-level premise dependencies known. Applied to 50 ProntoQA problems with GPT-4o, our method achieves F1 = 0.783 on detecting proof-tree dependencies (F1 = 0.835 on predicate-determining dependencies; recall = 97.4%), significantly outperforming a self-consistency baseline (F1 = 0.346; 95% bootstrap CIs do not overlap). We further find that 28% of correctly solved problems contain at least one step that is insensitive to proof-tree premises, a "right answer, wrong reasoning" phenomenon invisible to passive methods. All audit certificates, raw outputs, and reproduction scripts are included as supplementary material, and we discuss the limits of the method's scope beyond formal, parsable benchmarks.
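To make the audit loop concrete, the following is a minimal sketch of one intervention and its scoring, not the implementation used in the paper. Every name in it (`substitute_predicate`, `normalize`, `audit_premise`, `f1`, `toy_model`) is hypothetical. In particular, `toy_model` is a deterministic forward chainer that stands in for the black-box LLM call, the premise wording is an invented ProntoQA-like format, and `normalize` is a crude stand-in for the canonical predicate form.

```python
import re
from itertools import zip_longest
from typing import Callable, List, Set, Tuple


def substitute_predicate(premise: str, target: str, fresh: str) -> str:
    """Replace whole-word occurrences of the target predicate with a fresh symbol."""
    return re.sub(rf"\b{re.escape(target)}\b", lambda _: fresh, premise)


def normalize(conclusion: str) -> str:
    """Toy canonical form (assumption): lowercase, drop punctuation, collapse spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", conclusion.lower()).split())


def audit_premise(
    premises: List[str],
    idx: int,
    target: str,
    fresh: str,
    run_model: Callable[[List[str]], List[str]],
) -> List[bool]:
    """Return, for each baseline step, whether its normalized conclusion changed."""
    base = [normalize(s) for s in run_model(premises)]
    intervened = list(premises)
    intervened[idx] = substitute_predicate(premises[idx], target, fresh)
    alt = [normalize(s) for s in run_model(intervened)]
    # A baseline step with no counterpart in the re-run counts as changed.
    changed = [b != a for b, a in zip_longest(base, alt, fillvalue=None)]
    return changed[: len(base)]


def f1(pred: Set[Tuple[int, int]], gold: Set[Tuple[int, int]]) -> float:
    """F1 over (step, premise) dependency pairs."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def toy_model(premises: List[str]) -> List[str]:
    """Stand-in for the LLM (assumption): chain 'Every X is a Y' rules from one fact."""
    rules, fact = {}, None
    for p in premises:
        rule = re.match(r"(?i)every (\w+) is an? (\w+)", p)
        if rule:
            rules[rule.group(1).lower()] = rule.group(2).lower()
            continue
        m = re.match(r"(\w+) is an? (\w+)", p)
        if m:
            fact = (m.group(1), m.group(2).lower())
    if fact is None:
        return []
    entity, pred = fact
    steps = []
    while pred in rules and len(steps) < len(premises):  # bound guards against cycles
        pred = rules[pred]
        steps.append(f"{entity} is a {pred}.")
    return steps


if __name__ == "__main__":
    premises = [
        "Rex is a cat.",
        "Every cat is a mammal.",
        "Every mammal is a vertebrate.",
        "Every dog is a pet.",  # distractor: not in the proof tree
    ]
    # Intervene on the second rule: only the second step should change.
    print(audit_premise(premises, 2, "vertebrate", "zorp", toy_model))
    # Intervene on the distractor: no step should change.
    print(audit_premise(premises, 3, "pet", "zorp", toy_model))
```

In this sketch a step is flagged as dependent on a premise when its normalized conclusion differs after the substitution. Collecting the flagged (step, premise) pairs across premises and comparing them with the gold proof-tree pairs via `f1` mirrors the evaluation described above. Note that intervening on an early premise also flags downstream steps, so whether transitive dependencies count as correct depends on how the gold set is defined.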