Poster
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix Jedidja Binder · James Chua · Tomek Korbak · Henry Sleight · John Hughes · Robert Long · Ethan Perez · Miles Turpin · Owain Evans
Hall 3 + Hall 2B #500
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g. thoughts and feelings) that is not accessible to external observers. Do LLMs have this introspective capability of privileged access? If they do, this would show that LLMs can acquire knowledge not contained in or inferable from training data.

We investigate LLMs predicting properties of their own behavior in hypothetical situations. If a model M1 has this capability, it should outperform a different model M2 in predicting M1's behavior, even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger).

In experiments with GPT-4, GPT-4o, and Llama-3 models, we find that the model M1 outperforms M2 in predicting itself, providing evidence for privileged access. Further experiments and ablations provide additional evidence.

Our results show that LLMs can offer reliable self-information independent of external data in certain domains. By demonstrating this, we pave the way for further work on introspection in more practical domains, which would have significant implications for model transparency and explainability. However, while we successfully show introspective capabilities in simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
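As a rough illustration of the self- versus cross-prediction comparison described above, the sketch below compares how often a model correctly predicts a property of its own hypothetical response (self-prediction) versus how often a second model predicts that same property (cross-prediction). The behavioral property, prompt wording, and model callables (m1, m2) are illustrative assumptions, not the authors' actual evaluation harness.

# Minimal sketch, assuming a simple behavioral property (the second
# character of a response) and placeholder model callables.

from typing import Callable, List

def normalize(s: str) -> str:
    """Reduce an answer to a single lowercase character for comparison."""
    return s.strip().lower()[:1]

def second_character(response: str) -> str:
    """Example behavioral property: the second character of a response."""
    return response[1] if len(response) > 1 else ""

def prediction_accuracy(
    predictor: Callable[[str], str],   # model asked to predict the property
    target: Callable[[str], str],      # model whose actual behavior is measured (M1)
    prompts: List[str],
    prop: Callable[[str], str] = second_character,
) -> float:
    """Fraction of prompts on which the predictor's answer matches the
    property of the target model's actual (ground-truth) response."""
    hits = 0
    for p in prompts:
        hypothetical = (
            f"Suppose you were asked: {p!r}. "
            "What would be the second character of your reply? Answer with one character."
        )
        predicted = normalize(predictor(hypothetical))
        actual = normalize(prop(target(p)))
        hits += int(predicted == actual)
    return hits / len(prompts)

# Usage sketch (replace the placeholders with real model calls):
# m1 = lambda prompt: call_m1(prompt)                 # model under study, e.g. GPT-4o
# m2 = lambda prompt: call_m2_trained_on_m1(prompt)   # M2 fine-tuned on M1's behavior
# self_acc  = prediction_accuracy(m1, m1, prompts)    # M1 predicting itself
# cross_acc = prediction_accuracy(m2, m1, prompts)    # M2 predicting M1
# Privileged access predicts self_acc > cross_acc.

In this framing, introspection is operationalized purely behaviorally: the only signal is whether the predictor's stated answer matches the target model's ground-truth output on held-out prompts.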