Poster in: Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Contextual Sparsity as a Tool for Mechanistic Understanding of Retrieval in Hybrid Foundation Models
Davide Zani · Felix Michalak · Steven Abreu
We mechanistically investigate the role of self-attention in hybrid foundation models that combine state-space modules with self-attention. Evaluating the RecurrentGemma-2B model on a synthetic needle-in-a-haystack task, we show that deactivating all attention heads causes total retrieval failure, even though overall generation quality is only modestly affected. Using a contextual sparsity approach inspired by Liu et al. (2023), we find that retaining only 2 out of 10 attention heads is sufficient to nearly preserve full retrieval performance. These findings highlight a specialized function of self-attention in copying and retrieval, suggesting that future work could focus on designing dedicated, interpretable retrieval mechanisms within hybrid architectures.
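To make the idea of contextual, per-input head masking concrete, the sketch below shows one possible way to keep only the top-k attention heads for a given context, in the spirit of the contextual sparsity approach cited above. This is not the authors' implementation: the toy dimensions, the function names (`multi_head_attention`, `contextual_head_mask`), and the norm-based head-importance score are illustrative assumptions standing in for whatever predictor the paper actually uses.

```python
# Minimal sketch of contextual attention-head sparsity (illustrative, not the
# authors' code): score each head's output for the current input and keep only
# the top-k heads, zeroing the rest before the output projection.
import torch
import torch.nn.functional as F


def multi_head_attention(x, w_qkv, n_heads):
    """Plain multi-head self-attention returning per-head outputs.

    x: (seq, d_model), w_qkv: (d_model, 3 * d_model).
    Returns head outputs of shape (n_heads, seq, d_head).
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)
    # Split the model dimension into heads: (n_heads, seq, d_head).
    q = q.reshape(seq, n_heads, d_head).transpose(0, 1)
    k = k.reshape(seq, n_heads, d_head).transpose(0, 1)
    v = v.reshape(seq, n_heads, d_head).transpose(0, 1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1)
    return attn @ v


def contextual_head_mask(head_out, k_keep):
    """Keep the k heads whose outputs have the largest norm for this context.

    The norm score is a stand-in (assumption) for a learned sparsity predictor.
    """
    scores = head_out.flatten(start_dim=1).norm(dim=-1)  # (n_heads,)
    mask = torch.zeros_like(scores)
    mask[scores.topk(k_keep).indices] = 1.0
    return mask  # (n_heads,), 1.0 for retained heads, 0.0 for deactivated ones


# Toy usage: 10 heads, retain only 2, mirroring the 2-of-10 result in the abstract.
torch.manual_seed(0)
n_heads, d_model, seq = 10, 40, 16
x = torch.randn(seq, d_model)
w_qkv = torch.randn(d_model, 3 * d_model) / d_model**0.5
w_out = torch.randn(d_model, d_model) / d_model**0.5

head_out = multi_head_attention(x, w_qkv, n_heads)
mask = contextual_head_mask(head_out, k_keep=2)
sparse = (head_out * mask[:, None, None]).transpose(0, 1).reshape(seq, d_model) @ w_out
```

Setting `k_keep=0` corresponds to the full-deactivation ablation described in the abstract, while `k_keep=2` corresponds to the sparse configuration that nearly preserves retrieval performance.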