Process-then-Retrieve: A Mechanistic Study of Cross-Modal Alignment in Vision-Language Models
Abstract
How vision–language models (VLMs) integrate visual and textual information internally remains poorly understood. We present a mechanistic study of adapter-based VLMs, using PaliGemma-3B and Qwen2-VL as representative models, to test the hypothesis that these models follow a two-phase workflow: early layers prioritize text processing, while later layers perform cross-modal retrieval. Using representational similarity analysis, attention patching, and residual stream attribution, we find that early layers focus on text and pass visual embeddings through with minimal modification, and that substantial cross-modal alignment and attention to visual tokens appear only in the final layers. We further find that this structural bias is a primary contributor to textual dominance, in which linguistic priors can override conflicting visual evidence. Our results provide a foundation for addressing the "modality gap" and offer insight into multimodal reasoning in VLM architectures.
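As an illustration of the kind of layer-wise representational similarity measurement referred to above, the following is a minimal sketch using linear centered kernel alignment (CKA). CKA is one common similarity measure and is an assumption here; the abstract does not state which metric the study uses. The function names (`linear_cka`, `layerwise_similarity`) and the toy inputs are hypothetical, and each input is assumed to be a list of per-token activation vectors taken from one layer.

```python
import math


def center_columns(X):
    """Subtract the per-feature mean from every row of X (n tokens x d features)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def gram(X):
    """Gram matrix of row-wise inner products, shape n x n."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def frob_inner(A, B):
    """Frobenius inner product of two equally shaped matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def linear_cka(X, Y):
    """Linear CKA between two activation matrices over the same n tokens.

    X and Y may have different feature widths. The score lies in [0, 1] and
    is invariant to orthogonal transforms and isotropic scaling of either input.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must describe the same number of tokens")
    Kx = gram(center_columns(X))
    Ky = gram(center_columns(Y))
    norm = math.sqrt(frob_inner(Kx, Kx) * frob_inner(Ky, Ky))
    if norm == 0.0:
        raise ValueError("an input has zero variance after centering")
    return frob_inner(Kx, Ky) / norm


def layerwise_similarity(layers):
    """CKA between the first entry (e.g. input embeddings) and every layer."""
    reference = layers[0]
    return [linear_cka(reference, layer) for layer in layers]


if __name__ == "__main__":
    # Toy activations for 4 visual tokens at 3 hypothetical layers.
    layer0 = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
    layer1 = [[3.0 * v for v in row] for row in layer0]  # rescaled copy
    layer2 = [[1.0], [-1.0], [1.0], [-1.0]]              # different geometry
    print(layerwise_similarity([layer0, layer1, layer2]))
```

Under the two-phase hypothesis, scores computed this way for visual-token activations would be expected to stay close to 1 through the early layers and to drop only near the end of the network. The toy data above only exercises the computation and says nothing about the actual models.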