Oral in Workshop: Secure and Trustworthy Large Language Models
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov · Georg Lange · Neel Nanda
A major open problem in mechanistic interpretability is disentangling internal model activations into meaningful features, with recent work focusing on sparse autoencoders (SAEs) as a potential solution. However, verifying that an SAE has found the 'right' features in realistic settings has been difficult, as we don't know the (hypothetical) ground-truth features to begin with. In the absence of such ground truth, current evaluation metrics are indirect and rely on proxies, toy models, or other non-trivial assumptions. To overcome this, we propose a new framework to evaluate SAEs: studying how pre-trained language models perform specific tasks, where model activations can be disentangled, with supervision, in a principled way that allows precise control and interpretability. We develop a task-specific comparison of learned SAEs to our supervised feature decompositions that is agnostic to whether the SAE learned the same exact set of features as our supervised method. We instantiate this framework in the indirect object identification (IOI) task on GPT-2 Small, and report on both successes and failures of SAEs in this setting.
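To make the idea of a "supervised feature decomposition" and an "agnostic" comparison more concrete, here is a minimal sketch, not the authors' code: it uses synthetic stand-ins for GPT-2 Small activations, builds one mean-difference direction per task attribute value (e.g., which name plays a given role in an IOI prompt), and then checks whether an attribute-swapping edit survives an SAE's reconstruction rather than matching SAE features one-to-one. All names (`attr`, `sae_reconstruct`, the random encoder/decoder, the preservation metric) are hypothetical illustrations, not the paper's actual evaluation.

```python
# Minimal sketch (assumptions labeled): supervised attribute directions
# from labeled task prompts, plus an "agnostic" check of an SAE against them.
import torch

torch.manual_seed(0)
d_model, n_prompts = 64, 512

# Synthetic residual-stream activations standing in for real model activations.
acts = torch.randn(n_prompts, d_model)

# Each prompt is labeled with a task attribute value (hypothetical labels).
n_attr_values = 4
attr = torch.randint(0, n_attr_values, (n_prompts,))

# Supervised decomposition: one direction per attribute value, taken as the
# centered mean activation of prompts sharing that value.
overall_mean = acts.mean(dim=0)
attr_features = torch.stack(
    [acts[attr == v].mean(dim=0) - overall_mean for v in range(n_attr_values)]
)

# A toy "SAE": random encoder/decoder, used only to make the sketch runnable.
n_sae_features = 128
W_enc = torch.randn(d_model, n_sae_features) / d_model**0.5
W_dec = torch.randn(n_sae_features, d_model) / n_sae_features**0.5

def sae_reconstruct(x: torch.Tensor) -> torch.Tensor:
    """Encode with a ReLU, then decode (placeholder for a trained SAE)."""
    return torch.relu(x @ W_enc) @ W_dec

# Agnostic comparison: instead of pairing SAE features with supervised ones,
# apply a supervised edit (swap one attribute value for another) and measure
# how much of that edit the SAE's reconstruction preserves.
src, tgt = 0, 1  # hypothetical attribute values to swap between
edited = acts[attr == src] - attr_features[src] + attr_features[tgt]

edit_dir = attr_features[tgt] - attr_features[src]
edit_dir = edit_dir / edit_dir.norm()
kept = ((sae_reconstruct(edited) - sae_reconstruct(acts[attr == src])) @ edit_dir).mean()
intended = ((edited - acts[attr == src]) @ edit_dir).mean()
print(f"fraction of edit preserved through SAE: {kept / intended:.2f}")
```

With a trained SAE in place of the random one, a preserved fraction near 1 would suggest the SAE's feature basis can express the supervised edit, without requiring that the SAE learned the exact same features as the supervised decomposition.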