Sparse Circuits of Vision Language Alignment
Huizhen Shu ⋅ Xuying Li
Abstract
Multimodal alignment lies at the core of vision--language models (VLMs), enabling them to associate visual inputs with linguistic concepts and reason coherently across modalities. Understanding how such alignment is internally represented is therefore crucial for both the interpretability and controllability of VLMs. However, existing interpretability approaches largely focus on unimodal components or global representations, offering limited insight into the fine-grained neural mechanisms that coordinate vision and language features. In this work, we introduce Joint Sparse Autoencoders ($\textbf{JSAE}$), which extend sparse autoencoders to jointly factorize vision and language activations into shared, sparse, and interpretable features. Applying JSAE to LLaVA, we uncover highly correlated vision--language neuron pairs that form semantically coherent circuits corresponding to concepts such as $\textit{food}$ and $\textit{animals}$. Through bidirectional causal interventions, we further reveal a hierarchical functional asymmetry: early-layer circuits are necessary for semantic grounding, while later-layer circuits are increasingly sufficient to steer generation. Extensive experiments across diverse architectures, including dense and MoE-based VLMs, demonstrate that such sparse, layer-localized circuits constitute a common structural pattern underlying multimodal alignment. Our results offer a principled framework for analyzing multimodal coordination and enable precise, neuron-level control of vision--language generation.
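To make the core idea concrete, below is a minimal sketch of what a joint sparse autoencoder over paired vision and language activations could look like. It assumes a standard ReLU autoencoder with an L1 sparsity penalty, with separate encoders and decoders per modality mapping into a single shared feature dictionary; all names, dimensions, and hyperparameters are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class JointSparseAutoencoder(nn.Module):
    """Illustrative JSAE sketch: paired vision/language activations are
    encoded into one shared sparse feature space and decoded back per
    modality. Architecture details here are assumptions, not the paper's."""

    def __init__(self, d_vision: int, d_language: int, d_features: int):
        super().__init__()
        # Modality-specific encoders into the shared sparse dictionary.
        self.enc_v = nn.Linear(d_vision, d_features)
        self.enc_l = nn.Linear(d_language, d_features)
        # Modality-specific decoders from the shared features.
        self.dec_v = nn.Linear(d_features, d_vision, bias=False)
        self.dec_l = nn.Linear(d_features, d_language, bias=False)

    def forward(self, h_v: torch.Tensor, h_l: torch.Tensor):
        # ReLU codes; the L1 term in the loss below encourages sparsity.
        z_v = torch.relu(self.enc_v(h_v))
        z_l = torch.relu(self.enc_l(h_l))
        return self.dec_v(z_v), self.dec_l(z_l), z_v, z_l


def jsae_loss(model, h_v, h_l, l1_coeff=1e-3):
    """Reconstruction loss for both modalities plus L1 sparsity on the codes."""
    rec_v, rec_l, z_v, z_l = model(h_v, h_l)
    recon = ((rec_v - h_v) ** 2).mean() + ((rec_l - h_l) ** 2).mean()
    sparsity = z_v.abs().mean() + z_l.abs().mean()
    return recon + l1_coeff * sparsity
```

Under this framing, the "highly correlated vision--language neuron pairs" described in the abstract would correspond to shared features whose vision-side and language-side activations co-occur across a dataset of paired inputs, e.g., features $i$ where the batch correlation between $z^v_i$ and $z^l_i$ is high.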