Mechanisms of Introspective Awareness
Uzay Macar ⋅ Li Yang ⋅ Atticus Wang ⋅ Peter Wallich ⋅ Emmanuel Ameisen ⋅ Jack Lindsey
Abstract
Human metacognition converts internal cues into explicit reports (e.g., "something feels off"), supporting error correction and communication. We ask whether LLMs exhibit an analogous capacity and what mechanisms underlie it. In a controlled "thought injection" setting, we add concept steering vectors to the residual stream and ask models to detect the perturbation and identify its content. We find that (i) detection is behaviorally robust (moderate true positive rates, 0\% false positives) across diverse prompts and is strongest under post-trained assistant personas, indicating introspection is largely elicited by post-training rather than pretraining; (ii) detection is not reducible to a single linear confound, relies on distributed mid-to-late-layer MLP computation, and involves identifiable gate and evidence-carrier features; (iii) identification depends on partially distinct circuitry; and (iv) introspection is under-elicited by default: ablating refusal directions improves detection by $\sim$50\% and a trained steering vector by $\sim$75\%. Within a testable setting for metacognitive-style monitoring, our results reveal causally identifiable mechanisms and provide a target for interpretable and human-aligned AI reasoning.
Video
Chat is not available.
Successful Page Load