Poster
Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng · Stuart Russell · Jacob Steinhardt
Hall 3 + Hall 2B #514
Fri 25 Apr 7 p.m. — 9:30 p.m. PDT
Abstract:
Language models (LMs) are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of LMs could help monitor and correct unfaithful behavior. We hypothesize that LMs faithfully represent their input contexts in a latent world model, and we seek to extract these latent world states as logical propositions. For example, given the input context "Greg is a nurse. Laura is a physicist.", we aim to decode the propositions WorksAs(Greg, nurse) and WorksAs(Laura, physicist) from the model's internal activations. To do so, we introduce _propositional probes_, which compositionally extract lexical concepts from token activations and bind them into propositions. Key to this is identifying a _binding subspace_ in which bound tokens have high similarity (Greg ↔ nurse) but unbound ones do not (Greg ↮ physicist). Despite being trained only on linguistically simple English templates, propositional probes generalize to inputs written as short stories and translated to Spanish. Moreover, in three settings where LMs respond unfaithfully to the input context---prompt injections, backdoor attacks, and gender bias---the decoded propositions remain faithful. This suggests that LMs often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.
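To make the binding-subspace idea concrete, the following is a minimal sketch (not the authors' released implementation) of how a propositional probe could pair entities with attributes: activations at entity and attribute token positions are projected into a hypothesized binding subspace, and a one-to-one assignment over their similarities yields propositions such as WorksAs(Greg, nurse). The projection matrix `B`, the function `decode_propositions`, and the lexical-probe outputs (`entity_names`, `attr_names`) are illustrative assumptions.

```python
# Sketch: decoding propositions by matching entities to attributes in a
# learned binding subspace. Bound pairs (Greg <-> nurse) should have high
# similarity after projection; unbound pairs (Greg <-> physicist) should not.
import numpy as np
from scipy.optimize import linear_sum_assignment


def decode_propositions(entity_acts, attr_acts, B, entity_names, attr_names,
                        predicate="WorksAs"):
    """Pair each entity with the attribute it appears to be bound to.

    entity_acts: (n_entities, d) activations at entity token positions
    attr_acts:   (n_attrs, d)    activations at attribute token positions
    B:           (d, k)          projection onto the binding subspace
    """
    # Project activations into the binding subspace and L2-normalize.
    e = entity_acts @ B
    a = attr_acts @ B
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)

    # Pairwise similarity in the binding subspace.
    sim = e @ a.T

    # Resolve bindings with a one-to-one assignment maximizing similarity.
    rows, cols = linear_sum_assignment(-sim)

    return [f"{predicate}({entity_names[i]}, {attr_names[j]})"
            for i, j in zip(rows, cols)]


# Toy usage for "Greg is a nurse. Laura is a physicist."
# (random vectors stand in for real LM activations, so the specific pairing
# here is arbitrary; with a trained subspace the bindings would be recovered)
rng = np.random.default_rng(0)
d, k = 16, 4
B = rng.normal(size=(d, k))
props = decode_propositions(rng.normal(size=(2, d)), rng.normal(size=(2, d)),
                            B, ["Greg", "Laura"], ["nurse", "physicist"])
print(props)
```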