How To Open the Black Box: Modern Models for Mechanistic Interpretability
Abstract
Understanding how transformers represent and transform internal features is a core challenge in mechanistic interpretability. Traditional tools such as attention maps and probing classifiers reveal only partial structure, often blurred by polysemanticity and superposition. Newer model-based methods offer more principled insight: Sparse Autoencoders extract sparse, interpretable features from dense activations; Semi-Nonnegative Matrix Factorization uncovers how groups of neurons jointly encode concepts; Cross-Layer Transcoders track how these representations evolve across depth; and Weight-Sparse Transformers encourage inherently modular computation through architectural sparsity. Together, these approaches provide complementary pathways for opening the black box and understanding the circuits that underpin transformer behavior.
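To make the first of these methods concrete, the sketch below shows a minimal sparse autoencoder trained to reconstruct a batch of activations under an L1 sparsity penalty. It is an illustrative sketch only: the dimensions, hyperparameters, and the placeholder `acts` tensor are assumptions, not the configuration of any particular published model.

```python
# Minimal sparse-autoencoder sketch (shapes and hyperparameters are illustrative assumptions).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # dense activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # non-negative feature activations
        x_hat = self.decoder(f)           # reconstructed activation
        return x_hat, f

# Hypothetical usage: in practice `acts` would be activations collected from a transformer's
# residual stream; here it is random data so the sketch runs standalone.
sae = SparseAutoencoder(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, 512)

for step in range(100):
    x_hat, f = sae(acts)
    recon = (x_hat - acts).pow(2).mean()  # reconstruction loss
    sparsity = f.abs().mean()             # L1 penalty keeps few features active per input
    loss = recon + 1e-3 * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The trade-off between the reconstruction term and the sparsity penalty is what pulls apart superposed directions: each learned decoder column can then be inspected as a candidate interpretable feature.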