Poster in Workshop: Scientific Methods for Understanding Deep Learning (Sci4DL)

Which Sparse Code? Identifiability Failures in SAE Inference

Alessa Carbo ⋅ Eric Nalisnick


Abstract

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability, but it is unclear whether the encoder’s sparse code is uniquely determined. We compare SAE encoders against classical sparse coding algorithms (OMP, IHT) using frozen dictionaries. We find that alternative methods select substantially different features (Jaccard ∼ 0.43) while producing linearly equivalent codes (R² > 0.88). This dissociation between linear and support identifiability holds across layers and SAE configurations. Our results suggest SAE features represent one valid decomposition among alternatives, with implications for interpretability claims built on specific features.
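
The two quantities contrasted in the abstract, support overlap (Jaccard) and linear equivalence (R²), can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' setup: the dictionary is random rather than taken from a trained SAE, a simple top-k projection (`topk_code`, a hypothetical helper) stands in for an SAE encoder, OMP is a plain NumPy implementation, and R² is computed from an ordinary least-squares map between the two code matrices, which may differ from how the paper defines it.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit: greedily pick k columns of D to fit x."""
    selected = np.zeros(D.shape[1], dtype=bool)
    support, coef = [], np.zeros(0)
    residual = x.copy()
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        scores[selected] = -np.inf              # never reselect an atom
        j = int(np.argmax(scores))
        selected[j] = True
        support.append(j)
        # Refit all selected coefficients jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    z = np.zeros(D.shape[1])
    z[support] = coef
    return z

def topk_code(D, x, k):
    """Hypothetical encoder stand-in: keep the k largest-magnitude projections."""
    z = D.T @ x
    keep = np.argsort(np.abs(z))[-k:]
    out = np.zeros_like(z)
    out[keep] = z[keep]
    return out

def support_jaccard(a, b):
    """Jaccard index between the sets of nonzero entries of two codes."""
    sa, sb = a != 0, b != 0
    union = np.sum(sa | sb)
    return float(np.sum(sa & sb) / union) if union else 1.0

def linear_r2(A, B):
    """R^2 of the best affine map from codes A to codes B (rows are samples)."""
    A1 = np.hstack([A, np.ones((A.shape[0], 1))])
    W, *_ = np.linalg.lstsq(A1, B, rcond=None)
    ss_res = np.sum((B - A1 @ W) ** 2)
    ss_tot = np.sum((B - B.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic data: d-dimensional inputs, m dictionary atoms, k active per sample.
rng = np.random.default_rng(0)
d, m, n, k = 16, 32, 200, 4
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms, held frozen
Z_true = np.zeros((n, m))
for row in Z_true:
    row[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
X = Z_true @ D.T

Z_omp = np.stack([omp(D, x, k) for x in X])
Z_top = np.stack([topk_code(D, x, k) for x in X])

mean_jaccard = np.mean([support_jaccard(a, b) for a, b in zip(Z_omp, Z_top)])
print("mean support Jaccard:", mean_jaccard)
print("linear R^2 (top-k -> OMP):", linear_r2(Z_top, Z_omp))
```

A low Jaccard value alongside a high R² would correspond to the dissociation the abstract describes: the two methods activate different atoms, yet one code is largely recoverable from the other by a linear map. The numbers this toy setup prints are not comparable to the reported values.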
