Oral in Workshop: From Human Cognition to AI Reasoning: Models, Methods, and Applications
Sun, Apr 26, 2026 • 7:30 AM – 7:45 AM PDT

Mechanisms of Introspective Awareness

Uzay Macar ⋅ Li Yang ⋅ Atticus Wang ⋅ Peter Wallich ⋅ Emmanuel Ameisen ⋅ Jack Lindsey


Abstract

Human metacognition converts internal cues into explicit reports (e.g., "something feels off"), supporting error correction and communication. We ask whether LLMs exhibit an analogous capacity and what mechanisms underlie it. In a controlled "thought injection" setting, we add concept steering vectors to the residual stream and ask models to detect the perturbation and identify its content. We find that (i) detection is behaviorally robust (moderate true positive rates, 0% false positives) across diverse prompts and is strongest under post-trained assistant personas, indicating that introspection is largely elicited by post-training rather than pretraining; (ii) detection is not reducible to a single linear confound, relies on distributed mid-to-late-layer MLP computation, and involves identifiable gate and evidence-carrier features; (iii) identification depends on partially distinct circuitry; and (iv) introspection is under-elicited by default: ablating refusal directions improves detection by ~50%, and a trained steering vector improves it by ~75%. Within a testable setting for metacognitive-style monitoring, our results reveal causally identifiable mechanisms and provide a target for interpretable and human-aligned AI reasoning.
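
As a rough illustration of the operations the abstract names, the sketch below shows the arithmetic behind adding a steering vector to residual-stream activations, ablating a direction by projecting it out, and scoring detection with true and false positive rates. It is a toy on plain Python lists, not the authors' implementation: the function names (`add_steering_vector`, `ablate_direction`, `detection_rates`), the `strength` parameter, and the choice to apply the vector at every token position are all assumptions made for illustration. A real experiment would apply these operations to a model's hidden states at chosen layers.

```python
import math


def add_steering_vector(residual, steering, strength=1.0):
    """Return a copy of `residual` with strength * steering added at every position.

    residual: list of per-token activation vectors (hypothetical toy format).
    steering: one concept vector with the same dimensionality.
    """
    return [[h + strength * s for h, s in zip(row, steering)] for row in residual]


def ablate_direction(residual, direction):
    """Remove the component of each activation vector along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    ablated = []
    for row in residual:
        # Scalar projection of this activation onto the unit direction.
        proj = sum(h * u for h, u in zip(row, unit))
        ablated.append([h - proj * u for h, u in zip(row, unit)])
    return ablated


def detection_rates(reports, injected):
    """Return (true positive rate, false positive rate).

    reports:  True where the model said it detected an injected thought.
    injected: True where a steering vector was actually added.
    """
    positives = sum(injected)
    negatives = len(injected) - positives
    true_pos = sum(r and i for r, i in zip(reports, injected))
    false_pos = sum(r and not i for r, i in zip(reports, injected))
    return true_pos / positives, false_pos / negatives


if __name__ == "__main__":
    residual = [[1.0, 2.0], [0.0, -1.0]]
    print(add_steering_vector(residual, [1.0, 0.0], strength=2.0))
    print(ablate_direction(residual, [0.0, 2.0]))
    print(detection_rates([True, False, False, False], [True, True, False, False]))
```

In this toy, a "moderate true positive rate, 0% false positives" result corresponds to `detection_rates` returning a first value between 0 and 1 and a second value of exactly 0.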
