Visual Representation Alignment for Multimodal Large Language Models
Abstract
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting and spatial reasoning. We find that this limitation arises not merely from the choice of vision encoder, but from the lack of explicit supervision on visual representations during training, which causes detailed visual information to be gradually weakened even when strong vision foundation models (VFMs) are used as vision encoders. To address this, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained VFMs. By explicitly enforcing this alignment, VIRAL preserves rich visual information within the MLLM while enabling it to leverage complementary visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs.
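As a rough illustration of what an alignment regularizer of this kind might look like, the sketch below adds a feature-matching term to a language-modeling loss. The abstract does not specify the loss form, the layer whose hidden states are aligned, or the projection used, so everything here is an assumption made for illustration: the cosine-similarity objective, the linear projection `weight`, the weighting coefficient `lam`, and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def project(hidden, weight):
    """Hypothetical linear projection; weight has shape (out_dim, in_dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def alignment_loss(mllm_visual_tokens, vfm_features, weight):
    """Mean (1 - cosine similarity) over visual token positions.

    mllm_visual_tokens: hidden states at visual token positions (assumed
        to come from some intermediate MLLM layer).
    vfm_features: per-token features from a frozen vision foundation model.
    """
    losses = [
        1.0 - cosine_similarity(project(hidden, weight), target)
        for hidden, target in zip(mllm_visual_tokens, vfm_features)
    ]
    return sum(losses) / len(losses)


def total_loss(lm_loss, align_loss, lam=0.5):
    """Language-modeling loss plus the weighted alignment term.

    lam is an assumed hyperparameter, not a value taken from the paper.
    """
    return lm_loss + lam * align_loss
```

Under these assumptions, the alignment term is near zero when the projected MLLM hidden states point in the same direction as the VFM features, and it grows as they diverge, so minimizing the combined loss discourages the model from discarding the visual detail those features carry.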