Poster in Workshop: Scientific Methods for Understanding Deep Learning (Sci4DL)

Process-then-Retrieve: A Mechanistic Study of Cross-Modal Alignment in Vision-Language Models

Arpita Shanbhag ⋅ Julia Tran ⋅ Dhruv Mandala ⋅ Ayda Sultan


Abstract

How vision–language models (VLMs) integrate visual and textual information internally remains poorly understood. We present a mechanistic study of adapter-based VLMs, using PaliGemma-3B and Qwen2-VL as representative models, to test the hypothesis that these models follow a two-phase workflow: early layers prioritize textual processing, while later layers perform cross-modal retrieval. Using representational similarity analysis, attention patching, and residual stream attribution, we show that early layers focus on text and leave visual embeddings largely unmodified. Significant cross-modal alignment and attention to visual tokens appear only in the final layers. We find that this structural bias is a primary contributor to textual dominance, in which linguistic priors can override conflicting visual evidence. Our results provide a foundation for addressing the "modality gap" and offer insights into multimodal reasoning in VLM architectures.
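As a rough illustration of what a layer-wise representational similarity check can look like, the sketch below computes, for each layer, the mean cosine similarity between visual-token states at that layer and the same tokens at layer 0. This is not the authors' pipeline: the function names (`cosine`, `visual_drift_profile`), the choice of cosine similarity as the measure, and the toy data are all assumptions made for illustration. A profile that stays near 1.0 in early layers and drops only late would be consistent with visual embeddings being preserved until the final layers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def visual_drift_profile(hidden_states, visual_positions):
    """Mean cosine similarity of visual-token states to their layer-0 states.

    hidden_states: list over layers; each layer is a list of token vectors.
    visual_positions: indices of image tokens in the sequence (hypothetical layout).
    Returns one similarity value per layer.
    """
    reference = hidden_states[0]
    profile = []
    for layer in hidden_states:
        sims = [cosine(reference[p], layer[p]) for p in visual_positions]
        profile.append(sum(sims) / len(sims))
    return profile


# Toy data (made up): 3 layers, 3 tokens, 2 dimensions.
# Tokens 0 and 1 stand in for image tokens; token 2 stands in for text.
toy_states = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # layer 0
    [[2.0, 0.0], [0.0, 3.0], [-1.0, 1.0]],  # layer 1: visual tokens only rescaled
    [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],   # layer 2: visual tokens rotated
]
profile = visual_drift_profile(toy_states, visual_positions=[0, 1])
print(profile)
```

In practice the per-layer states would come from a model's hidden-state outputs rather than hand-written lists, and a measure such as centered kernel alignment could replace cosine similarity without changing the structure of the loop.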
