

Poster in Workshop: Second Workshop on Representational Alignment (Re$^2$-Align)

Cross-Modal Alignment Regularization: Enhancing Language Models with Vision Model Representations

Yulu Gan · Zhao · Phillip Isola


Abstract:

Cross-modal distillation has emerged as a critical technique for leveraging the complementary strengths of different modalities. However, existing work has not enabled models trained on data from different modalities to benefit each other directly. In this work, we introduce a method that incorporates a cross-modal alignment regularization term (CMAR) during language model training to promote alignment with the representations of a vision model at specific layers. We present experimental results demonstrating that our method significantly enhances the performance of language models on a variety of downstream tasks in both pre-training and fine-tuning settings. We observe a 1.01\% increase in accuracy on the Language Modeling Broadened to Account for Discourse Aspects (LAMBADA) dataset, and a 1.49\% increase on the Choice of Plausible Alternatives (COPA) causal reasoning dataset. More surprisingly, our method enables a weaker vision model to boost the performance of a stronger language model by 1.20\% on LAMBADA and 2.00\% on COPA, challenging the traditional assumption that the teacher model must be stronger than the student model.
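
The objective can be pictured as the standard language-modeling loss plus a weighted alignment penalty between hidden states from a chosen language-model layer and features from a frozen vision model. The following is a minimal PyTorch sketch of that idea only; the cosine-distance penalty, the learned linear projection, the pooled inputs, and the weight lam are illustrative assumptions, since the abstract does not specify CMAR's exact form, layer selection, or weighting.

    # Hypothetical sketch of a cross-modal alignment regularization term.
    # The distance, projection, and weighting below are assumptions, not the paper's exact CMAR.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlignmentRegularizer(nn.Module):
        """Projects LM hidden states into the vision feature space and penalizes
        their cosine distance to representations from a frozen vision model."""
        def __init__(self, lm_dim: int, vision_dim: int):
            super().__init__()
            self.proj = nn.Linear(lm_dim, vision_dim)  # learned projection (assumed)

        def forward(self, lm_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
            # lm_hidden:    (batch, lm_dim) pooled hidden states from a chosen LM layer
            # vision_feats: (batch, vision_dim) features from the vision model
            z = F.normalize(self.proj(lm_hidden), dim=-1)
            v = F.normalize(vision_feats.detach(), dim=-1)  # vision model is not updated
            return (1.0 - (z * v).sum(dim=-1)).mean()       # mean cosine distance

    def training_loss(lm_loss: torch.Tensor,
                      lm_hidden: torch.Tensor,
                      vision_feats: torch.Tensor,
                      regularizer: AlignmentRegularizer,
                      lam: float = 0.1) -> torch.Tensor:
        # Total objective: language-modeling loss plus the weighted alignment term.
        return lm_loss + lam * regularizer(lm_hidden, vision_feats)

In this reading, the vision model acts only as a source of target representations (its parameters receive no gradients), which is consistent with the abstract's observation that even a weaker vision model can improve a stronger language model.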
