Mechanisms of Introspective Awareness
Uzay Macar ⋅ Li Yang ⋅ Atticus Wang ⋅ Peter Wallich ⋅ Emmanuel Ameisen ⋅ Jack Lindsey
Abstract
A growing body of work explores computation that happens implicitly in hidden representations rather than through explicit chain-of-thought (CoT). We study a controlled implicit-reasoning setting in which concept steering vectors are added to the residual stream and the model is asked to detect the perturbation and identify its content. We find that (i) detection is behaviorally robust (moderate true-positive rate, 0\% false positives) across diverse prompts and is strongest under post-trained assistant personas, indicating that the capability is elicited largely by post-training rather than pretraining; (ii) detection is not reducible to a single linear confound, relies on distributed mid-to-late-layer MLP computation, and involves identifiable gate and evidence-carrier features; (iii) identification depends on partially distinct circuitry; and (iv) the capability is under-elicited by default: ablating refusal directions improves detection by $\sim$50\%, and a trained steering vector improves it by $\sim$75\%. Our results reveal causally identifiable mechanisms underlying implicit anomaly detection and provide a concrete target for understanding and engineering latent LLM reasoning.
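For readers unfamiliar with the two interventions named in the abstract, the following is a minimal sketch of what "adding a steering vector to the residual stream" and "ablating a direction" mean as vector operations. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the injection layer, the scaling, or the exact ablation procedure, so `add_steering`, `ablate_direction`, `alpha`, `d_model`, and the random vectors are all hypothetical stand-ins, and a plain Python list stands in for a single residual-stream activation.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def add_steering(resid, direction, alpha):
    """Return resid + alpha * unit(direction).

    Assumption: the steering vector is normalized and scaled by a
    hypothetical strength `alpha`; the paper may scale differently.
    """
    norm = math.sqrt(dot(direction, direction))
    return [r + alpha * d / norm for r, d in zip(resid, direction)]


def ablate_direction(resid, direction):
    """Project out the component of resid along direction.

    Assumption: "ablating a direction" is taken here to mean orthogonal
    projection, which is one common choice, not necessarily the paper's.
    """
    coeff = dot(resid, direction) / dot(direction, direction)
    return [r - coeff * d for r, d in zip(resid, direction)]


# Toy stand-ins: random vectors instead of real model activations.
rng = random.Random(0)
d_model = 8  # hypothetical width, chosen only for illustration
resid = [rng.gauss(0, 1) for _ in range(d_model)]
concept = [rng.gauss(0, 1) for _ in range(d_model)]
refusal = [rng.gauss(0, 1) for _ in range(d_model)]

steered = add_steering(resid, concept, alpha=4.0)
ablated = ablate_direction(steered, refusal)
```

In this toy setup, steering shifts the activation's projection onto the unit concept direction by exactly `alpha`, and ablation leaves the activation with no component along the ablated direction. In a real model these operations would be applied to activations at chosen layers and token positions, details the abstract leaves unspecified.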