ICLR Poster Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov · Georg Lange · Neel Nanda

Hall 3 + Hall 2B #558

Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Disentangling model activations into human-interpretable features is a central problem in interpretability. Sparse autoencoders (SAEs) have recently attracted much attention as a scalable unsupervised approach to this problem. However, our imprecise understanding of ground-truth features in realistic scenarios makes it difficult to measure the success of SAEs. To address this challenge, we propose to evaluate SAEs on specific tasks by comparing them to supervised feature dictionaries computed with knowledge of the concepts relevant to the task. Specifically, we suggest that it is possible to (1) compute supervised sparse feature dictionaries that disentangle model computations for a specific task; (2) use them to evaluate and contextualize the degree of disentanglement and control offered by SAE latents on this task. Importantly, we can do this in a way that is agnostic to whether the SAEs have learned the exact ground-truth features or a different but similarly useful representation. As a case study, we apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with SAEs trained on either the IOI or OpenWebText datasets. We find that SAEs capture interpretable features for the IOI task, and that more recent SAE variants such as Gated SAEs and Top-K SAEs are competitive with supervised features in terms of disentanglement and control over the model. We also exhibit, through this setup and toy models, some qualitative phenomena in SAE training illustrating feature splitting and the role of feature magnitudes in solutions preferred by SAEs.
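
As a rough illustration of what a supervised feature dictionary for a task could look like, the sketch below builds one feature vector per attribute value as a conditional mean of activations minus the overall mean. The abstract does not specify how the dictionaries are constructed, so this construction, the function names (`supervised_dictionary`, `reconstruct`, `edit`), and the synthetic data are all assumptions made for illustration, not the authors' implementation.

```python
import itertools
import numpy as np

def supervised_dictionary(acts, attrs):
    """acts: (n, d) activations; attrs: (n, k) integer attribute values.
    Returns the overall mean and, per attribute, a dict value -> feature vector.
    Assumption: each feature is a conditional mean minus the overall mean."""
    mean = acts.mean(axis=0)
    features = []
    for j in range(attrs.shape[1]):
        col = attrs[:, j]
        features.append(
            {int(v): acts[col == v].mean(axis=0) - mean for v in np.unique(col)}
        )
    return mean, features

def reconstruct(mean, features, attr_row):
    """Approximate an activation as the mean plus one feature per attribute."""
    return mean + sum(features[j][int(v)] for j, v in enumerate(attr_row))

def edit(act, features, j, old, new):
    """Change attribute j from `old` to `new` by swapping its feature vector."""
    return act - features[j][old] + features[j][new]

# Synthetic stand-in for model activations: two attributes (3 and 2 values)
# contribute additively to an 8-dimensional activation (hypothetical data).
rng = np.random.default_rng(0)
A, B, base = rng.normal(size=(3, 8)), rng.normal(size=(2, 8)), rng.normal(size=8)
attrs = np.array(list(itertools.product(range(3), range(2))))
acts = base + A[attrs[:, 0]] + B[attrs[:, 1]]

mean, features = supervised_dictionary(acts, attrs)
recon = reconstruct(mean, features, attrs[0])
edited = edit(acts[0], features, 0, 0, 2)  # attribute 0: value 0 -> value 2
```

On this exactly additive, balanced toy data the reconstruction and the edit are exact. On real activations they would only be approximations, and the size of that approximation error is the kind of quantity one could compare between supervised features and SAE latents.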
