Poster
Tell me about yourself: LLMs are aware of their learned behaviors
Jan Betley · Xuchan Bao · Martín Soto · Anna Sztyber-Betley · James Chua · Owain Evans
Hall 3 + Hall 2B #509
We study behavioral self-awareness, which we define as an LLM's capability to articulate its behavioral policies without relying on in-context examples. We finetune LLMs on examples that exhibit particular behaviors, including (a) making risk-seeking / risk-averse economic decisions, and (b) making the user say a certain word. Although these examples never contain explicit descriptions of the policy (e.g. "I will now take the risk-seeking option"), we find that the finetuned LLMs can explicitly describe their policies through out-of-context reasoning. We demonstrate LLMs' behavioral self-awareness across various evaluation tasks, both for multiple-choice and free-form questions. Furthermore, we demonstrate that models can correctly attribute different learned policies to distinct personas.Finally, we explore the connection between behavioral self-awareness and the concept of backdoors in AI safety, where certain behaviors are implanted in a model, often through data poisoning, and can be triggered under certain conditions. We find evidence that LLMs can recognize the existence of the backdoor-like behavior that they have acquired through fine-tuning.
Live content is unavailable. Log in and register to view live content