RT-TopKSAE: Improving Top-k Sparse Autoencoders with the Rotation Trick
Abstract
Mechanistic interpretability seeks to reverse-engineer the computational circuits learnt by neural networks, and sparse autoencoders (SAEs) have emerged as a key tool for this endeavor given their ability to decompose neural activations into interpretable feature dictionaries in an unsupervised manner. However, SAEs face persistent challenges, viz. collapsed feature usage (dead latents), polysemantic features that confound interpretation, and fundamental non-identifiability that admits multiple equally valid decompositions. We hypothesize that these pathologies share a common origin in gradient geometry: specifically, that standard training loses critical directional information when backpropagating through the discrete Top-K selection operation, typically implemented via straight-through estimation. We therefore introduce the RT-TopKSAE, which recovers this directional information by preserving angular relationships at the Top-K selection boundary. This geometric inductive bias yields appreciable improvements in dictionary utilization, feature disentanglement, and activation uniformity while maintaining representational fidelity. Critically, the geometric properties we observe (comprehensive coverage, feature disentanglement, and uniform utilization) are ones that should enable downstream feature steering and targeted behavioral control. By producing representations where individual features align with distinct aspects of variation, RT-TopKSAE transforms sparse dictionaries from passive decomposition tools into controllable computational primitives suitable for dynamic intervention. Our results suggest that preserving gradient angle information acts as a geometric regularizer that, while not resolving identifiability in principle, empirically biases optimization toward representations with desirable interpretability properties.