Self-Interpretable Concept Representations: Training Lightweight Adapters on Vector-Label Pairs
Keenan Pepper ⋅ Alex McKenzie ⋅ Florin Pop ⋅ Stijn Servaes ⋅ Martin Leitgab ⋅ Michael Vaiana ⋅ Judd Rosenblatt ⋅ Michael Graziano ⋅ Diogo de Lucena
Abstract
Self-interpretation methods prompt language models to describe their own internal states, offering a path toward concept-based self-explanation, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on learned concept representations, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder concept labels that outperform the training labels themselves on generation scoring, a concept quality metric (71% vs 63% at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and surface semantic concepts implicit in multi-hop reasoning, including bridge entities appearing in neither prompt nor response, without chain-of-thought. The learned bias vector alone accounts for 85% of the improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find that self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that faithful self-explanation of learned concepts improves with scale, without modifying the model being interpreted.
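The $d_\text{model}+1$ parameter count can be illustrated with a small sketch. The abstract does not spell out the adapter's functional form, so the code below assumes one natural reading: a single shared scale $s$ and a bias vector $b \in \mathbb{R}^{d_\text{model}}$, applied as $x \mapsto s\,x + b$. The class name `ScalarAffineAdapter`, its fields, and the toy values are hypothetical and chosen only for illustration; how the adapted vector is then fed to the frozen LM is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalarAffineAdapter:
    """Hypothetical adapter (assumed form): one shared scale plus a per-dimension bias."""

    scale: float       # 1 trainable parameter
    bias: List[float]  # d_model trainable parameters

    def __call__(self, vector: List[float]) -> List[float]:
        # Apply x -> scale * x + bias elementwise.
        if len(vector) != len(self.bias):
            raise ValueError("vector and bias must have the same length")
        return [self.scale * v + b for v, b in zip(vector, self.bias)]

    def num_parameters(self) -> int:
        # Scale contributes 1, bias contributes d_model.
        return 1 + len(self.bias)


d_model = 4  # toy size for illustration only
adapter = ScalarAffineAdapter(scale=2.0, bias=[0.5] * d_model)

concept_vector = [1.0, -1.0, 0.0, 3.0]  # stand-in for a learned concept vector
adapted = adapter(concept_vector)

print(adapted)                   # [2.5, -1.5, 0.5, 6.5]
print(adapter.num_parameters())  # 5, i.e. d_model + 1
```

Under this reading, setting `scale` to zero leaves only the bias term, which is one way to picture the abstract's observation that the learned bias vector alone carries most of the improvement.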