Which Sparse Code? Identifiability Failures in SAE Inference
Alessa Carbo ⋅ Eric Nalisnick
Abstract
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability, but it is unclear whether the encoder’s sparse code is uniquely determined. We compare SAE encoders against classical sparse coding algorithms (OMP, IHT) using frozen dictionaries. We find that alternative methods select substantially different features (Jaccard ∼0.43) while producing linearly equivalent codes (R² > 0.88). This dissociation between linear and support identifiability holds across layers and SAE configurations. Our results suggest SAE features represent one valid decomposition among alternatives, with implications for interpretability claims built on specific features.
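To make the two notions of agreement in the abstract concrete, the following is a minimal sketch of how support overlap and linear equivalence between two sparse codes might be measured, together with a basic IHT encoder for a frozen dictionary. It is not the authors' implementation: the function names, the row-wise dictionary layout, the step size, the in-sample R² with an intercept, and the absence of a non-negativity constraint are all assumptions made for illustration.

```python
import numpy as np


def iht(x, D, k, n_iter=100):
    """Iterative hard thresholding on a frozen dictionary (illustrative).

    x: (n, d) inputs; D: (m, d) dictionary whose rows are atoms (assumed layout).
    Returns codes z of shape (n, m) with at most k nonzeros per row.
    """
    # Step size 1 / ||D||_2^2 keeps the gradient step non-expansive.
    step = 1.0 / np.linalg.norm(D, ord=2) ** 2
    z = np.zeros((x.shape[0], D.shape[0]))
    for _ in range(n_iter):
        # Gradient step on the reconstruction error ||x - z D||^2.
        z = z + step * (x - z @ D) @ D.T
        # Hard threshold: zero everything except the k largest magnitudes per row.
        drop = np.argpartition(np.abs(z), -k, axis=1)[:, :-k]
        np.put_along_axis(z, drop, 0.0, axis=1)
    return z


def support_jaccard(code_a, code_b, tol=1e-8):
    """Mean per-sample Jaccard overlap between the active sets of two codes."""
    active_a = np.abs(code_a) > tol
    active_b = np.abs(code_b) > tol
    inter = np.logical_and(active_a, active_b).sum(axis=1)
    union = np.logical_or(active_a, active_b).sum(axis=1)
    # Convention (assumed): two empty supports count as perfect overlap.
    per_sample = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return float(per_sample.mean())


def linear_r2(code_a, code_b):
    """In-sample R^2 of a least-squares linear map (with intercept) from code_a to code_b."""
    X = np.hstack([code_a, np.ones((code_a.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, code_b, rcond=None)
    resid = code_b - X @ W
    total = code_b - code_b.mean(axis=0, keepdims=True)
    return float(1.0 - (resid ** 2).sum() / (total ** 2).sum())
```

Under these assumptions, a low `support_jaccard` combined with a high `linear_r2` between an SAE encoder's code and an `iht` code on the same dictionary would correspond to the dissociation the abstract describes. With realistic dictionary sizes the R² should be computed on held-out samples, since an in-sample fit becomes trivially perfect when there are fewer samples than features.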