Single-Token Features as Extreme Monosemanticity: Compositional Transition and Geometric Signatures in SAE Feature Space
Abstract
Sparse Autoencoders (SAEs) decompose neural network activations into interpretable features, but the structure of the resulting feature space remains poorly understood. We study single-token features, the extreme endpoint of monosemanticity, in which a feature activates on one vocabulary item and its variants. Analyzing 3.5 million features across five models, we find that these features concentrate in Layer 0, drop sharply in prevalence with scale within residual-stream SAEs, and show distinctive activation and decoder-geometry signatures (high gap and purity; tighter decoder-space clustering). Decoder tracking reveals a sharp early-layer transition in GPT2-Small, consistent with a boundary between token-identity and compositional processing. At comparable scale, prevalence differs dramatically across SAE objectives: JumpReLU GemmaScope yields ~46× higher Layer 0 single-token prevalence than TopK LlamaScope. These results highlight methodology matching as a key requirement for cross-SAE comparisons.