Poster in Workshop: Unifying Concept Representation Learning
Sun, Apr 26, 2026 • 7:00 AM – 8:00 AM PDT

Self-Interpretable Concept Representations: Training Lightweight Adapters on Vector-Label Pairs

Keenan Pepper ⋅ Alex McKenzie ⋅ Florin Pop ⋅ Stijn Servaes ⋅ Martin Leitgab ⋅ Michael Vaiana ⋅ Judd Rosenblatt ⋅ Michael Graziano ⋅ Diogo de Lucena

Abstract

Self-interpretation methods prompt language models to describe their own internal states, offering a path toward concept-based self-explanation, but they remain unreliable because of hyperparameter sensitivity. We show that training lightweight adapters on learned concept representations, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices. Trained adapters generate sparse autoencoder concept labels that outperform the training labels themselves on generation scoring, a concept quality metric (71% vs 63% at 70B scale); identify topics with 94% recall@1 versus 1% for untrained baselines; and surface semantic concepts implicit in multi-hop reasoning, including bridge entities that appear in neither the prompt nor the response, without chain-of-thought. The learned bias vector alone accounts for 85% of the improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find that self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that faithful self-explanation of learned concepts improves with scale, without modifying the model being interpreted.
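
To make the $d_\text{model}+1$ parameter count concrete, the following is a minimal sketch of one plausible reading of a scalar affine adapter: a single shared scale plus a bias vector of length $d_\text{model}$. The class name `ScalarAffineAdapter`, its fields, and the toy values are hypothetical illustrations, not the authors' implementation, and the abstract does not say how the adapted vector is passed to the frozen LM, so that step is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalarAffineAdapter:
    """Hypothetical adapter: one shared scalar scale plus a per-dimension bias.

    Assumed form (not confirmed by the source): adapted = scale * vector + bias.
    """
    scale: float
    bias: List[float]

    def __call__(self, vector: List[float]) -> List[float]:
        if len(vector) != len(self.bias):
            raise ValueError("vector and bias must have the same length")
        # Apply the same scalar to every coordinate, then add the bias vector.
        return [self.scale * x + b for x, b in zip(vector, self.bias)]

    def num_parameters(self) -> int:
        # One scalar plus d_model bias entries.
        return 1 + len(self.bias)


# Toy example with d_model = 4; real models use thousands of dimensions.
d_model = 4
adapter = ScalarAffineAdapter(scale=2.0, bias=[0.5, -0.5, 0.0, 1.0])
concept_vector = [1.0, 0.0, -1.0, 0.25]  # stand-in for a learned concept vector
adapted = adapter(concept_vector)

print(adapted)
print(adapter.num_parameters())
```

Under this reading only `scale` and `bias` would be trained on vector-label pairs, while every weight of the language model stays frozen.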
