Poster
in
Workshop: Neural Network Weights as a New Data Modality

Hyper-Align: Efficient Modality Alignment via Hypernetworks

Jaisidh Singh · Diganta Misra · Boris Knyazev · Antonio Orvieto

Keywords: [ Multimodal models ] [ Hypernetworks ] [ Parameter Prediction ] [ VLM ]


Abstract: Modern approaches to constructing multimodal models learn specialized modules known as connectors, which align the representations of different unimodal models (for example, a VLM combines a vision-modality model with a language-modality model). However, due to the vast scale of multimodal pre-training datasets and the size of individual models, aligning representations across many pre-trained model combinations is computationally expensive. This challenge is further compounded by the frequent release of new models. To address this, we propose a novel method for aligning $N$ combinations of pre-trained models using a hypernetwork, called ``Hyper-Align'', that approximates the weights of all possible connectors within a fixed compute budget, regardless of $N$. Our approach is computationally efficient while retaining multimodal task performance, matching or surpassing that of independently trained pairwise connectors. For instance, Hyper-Align finds the best unimodal pair configuration and generates the corresponding multimodal connector weights while being $\approx$ 8x cheaper in FLOP cost than state-of-the-art baselines.
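To make the idea concrete, the sketch below shows one plausible shape of such a hypernetwork: a small MLP that, conditioned on an embedding identifying a (vision, language) model pair, emits the weights of a linear connector mapping vision features into the language embedding space. The class name, the pair-embedding mechanism, and all dimensions are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class HyperConnector(nn.Module):
    """Hypothetical hypernetwork: predicts the weights of a linear
    connector (d_v -> d_l) from an embedding of the unimodal model pair,
    so one network amortizes training over many pair combinations."""

    def __init__(self, pair_embed_dim: int, d_v: int, d_l: int, hidden: int = 256):
        super().__init__()
        self.d_v, self.d_l = d_v, d_l
        # MLP that emits the flattened connector weight matrix plus bias
        self.net = nn.Sequential(
            nn.Linear(pair_embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d_v * d_l + d_l),
        )

    def forward(self, pair_embedding: torch.Tensor, vision_features: torch.Tensor):
        params = self.net(pair_embedding)                      # (d_v*d_l + d_l,)
        W = params[: self.d_v * self.d_l].view(self.d_l, self.d_v)
        b = params[self.d_v * self.d_l :]
        # Apply the predicted connector to the vision features
        return vision_features @ W.T + b

# One hypernetwork serves all pair combinations: swapping the pair
# embedding swaps the predicted connector without retraining.
hyper = HyperConnector(pair_embed_dim=16, d_v=384, d_l=512)
z_pair = torch.randn(16)        # embedding identifying one model pair
feats = torch.randn(4, 384)     # vision features for 4 tokens
out = hyper(z_pair, feats)      # language-space features, shape (4, 512)
```

Under this sketch, the fixed compute budget comes from training only the hypernetwork's parameters, regardless of how many pre-trained model pairs it serves.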