Poster in Workshop: ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond

Visual Representation Alignment for Multimodal Large Language Models

Heeji Yoon ⋅ Jaewoo Jung ⋅ Junwan Kim ⋅ Hyungyu Choi ⋅ Heeseong Shin ⋅ Sangbeom Lim ⋅ Honggyu An ⋅ Chaehyun Kim ⋅ Jisang Han ⋅ Donghyun Kim ⋅ Chanho Eom ⋅ Sunghwan Hong ⋅ Seungryong Kim

Abstract

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting and spatial reasoning. We find that this limitation arises not merely from the choice of vision encoder, but from the lack of explicit supervision on visual representations during training, which causes detailed visual information to be gradually weakened even when strong vision foundation models (VFMs) are used as vision encoders. To this end, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained VFMs. By explicitly enforcing this alignment, VIRAL preserves rich visual information within the MLLM while enabling it to leverage complementary visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs.
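As a rough illustration of what such an alignment regularizer could look like, the sketch below adds an auxiliary term to the language-modeling loss that pulls projected MLLM hidden states at visual token positions toward frozen VFM features. The abstract does not specify the objective, so the cosine-similarity form, the linear projection `weight`, the weighting factor `lam`, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def project(x, weight):
    """Apply a linear map; each row of `weight` produces one output dimension."""
    return [sum(w * a for w, a in zip(row, x)) for row in weight]


def alignment_loss(mllm_tokens, vfm_tokens, weight):
    """Mean (1 - cosine similarity) over visual tokens (assumed objective).

    mllm_tokens: hidden states at visual token positions, taken from some
                 intermediate MLLM layer (hypothetical choice of layer).
    vfm_tokens:  features from a frozen VFM for the same image patches.
    weight:      hypothetical learnable projection into the VFM feature space.
    """
    sims = [
        cosine_similarity(project(h, weight), t)
        for h, t in zip(mllm_tokens, vfm_tokens)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(lm_loss, align_loss, lam=0.5):
    """Language-modeling loss plus the weighted alignment regularizer."""
    return lm_loss + lam * align_loss


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    mllm = [[1.0, 0.0], [0.0, 2.0]]
    vfm = [[1.0, 0.0], [1.0, 0.0]]  # second token is orthogonal to its target
    reg = alignment_loss(mllm, vfm, identity)
    print(total_loss(lm_loss=2.0, align_loss=reg))
```

Under these assumptions, the regularizer is zero when every projected visual token points in the same direction as its VFM target and grows as the representations drift apart, which is one way the "explicit supervision on visual representations" described above could be realized.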
