Sparse Circuits of Vision Language Alignment
Huizhen Shu ⋅ Xuying Li
Abstract
Multimodal alignment lies at the core of vision--language models (VLMs), enabling them to associate visual inputs with linguistic concepts and reason coherently across modalities. Understanding how such alignment is internally represented is therefore crucial for both the interpretability and controllability of VLMs. However, existing interpretability approaches largely focus on unimodal components or global representations, offering limited insight into the fine-grained neural mechanisms that coordinate vision and language features. In this work, we introduce Joint Sparse Autoencoders ($\textbf{JSAE}$), which extend sparse autoencoders to jointly factorize vision and language activations into shared, sparse, and interpretable features. Applying JSAE to LLaVA, we uncover highly correlated vision--language neuron pairs that form semantically coherent circuits corresponding to concepts such as $\textit{food}$ and $\textit{animals}$. Through bidirectional causal interventions, we further reveal a hierarchical functional asymmetry: early-layer circuits are necessary for semantic grounding, while later-layer circuits are increasingly sufficient to steer generation. Extensive experiments across diverse architectures, including dense and MoE-based VLMs, demonstrate that such sparse, layer-localized circuits constitute a common structural pattern underlying multimodal alignment. Our results offer a principled framework for analyzing multimodal coordination and enable precise, neuron-level control of vision--language generation.
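To make the core idea concrete, below is a minimal sketch of what a joint sparse autoencoder over paired vision and language activations could look like. It assumes a standard ReLU autoencoder with an L1 sparsity penalty, with separate encoders and decoders per modality mapping into a single shared feature dictionary; all names, dimensions, and hyperparameters are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class JointSparseAutoencoder(nn.Module):
    """Illustrative JSAE sketch: paired vision/language activations are
    encoded into one shared sparse feature space and decoded back per
    modality. Architecture details here are assumptions, not the paper's."""

    def __init__(self, d_vision: int, d_language: int, d_features: int):
        super().__init__()
        # Modality-specific encoders into the shared sparse dictionary.
        self.enc_v = nn.Linear(d_vision, d_features)
        self.enc_l = nn.Linear(d_language, d_features)
        # Modality-specific decoders from the shared features.
        self.dec_v = nn.Linear(d_features, d_vision, bias=False)
        self.dec_l = nn.Linear(d_features, d_language, bias=False)

    def forward(self, h_v: torch.Tensor, h_l: torch.Tensor):
        # ReLU codes; the L1 term in the loss below encourages sparsity.
        z_v = torch.relu(self.enc_v(h_v))
        z_l = torch.relu(self.enc_l(h_l))
        return self.dec_v(z_v), self.dec_l(z_l), z_v, z_l


def jsae_loss(model, h_v, h_l, l1_coeff=1e-3):
    """Reconstruction loss for both modalities plus L1 sparsity on the codes."""
    rec_v, rec_l, z_v, z_l = model(h_v, h_l)
    recon = ((rec_v - h_v) ** 2).mean() + ((rec_l - h_l) ** 2).mean()
    sparsity = z_v.abs().mean() + z_l.abs().mean()
    return recon + l1_coeff * sparsity
```

Under this framing, the "highly correlated vision--language neuron pairs" described in the abstract would correspond to shared features whose vision-side and language-side activations co-occur across a dataset of paired inputs, e.g., features $i$ where the batch correlation between $z^v_i$ and $z^l_i$ is high.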