Poster
in
Workshop: Neural Network Weights as a New Data Modality
Hyper-Align: Efficient Modality Alignment via Hypernetworks
Jaisidh Singh · Diganta Misra · Boris Knyazev · Antonio Orvieto
Keywords: [ Multimodal models ] [ Hypernetworks ] [ Parameter Prediction ] [ VLM ]
Abstract:
Modern approaches to constructing multimodal models learn specialized modules, known as connectors, that align the representations of different unimodal models (for example, a VLM combines a vision-modality model with a language-modality model). However, the vast scale of multimodal pre-training datasets and the size of individual models make aligning representations across many pre-trained model combinations computationally expensive, a challenge further compounded by the frequent release of new models. To address this, we propose a novel method for aligning $N$ combinations of pre-trained models using a hypernetwork called "Hyper-Align", which approximates the weights of all possible connectors within a fixed compute budget, regardless of $N$. Our approach is computationally efficient while matching or surpassing the multimodal-task performance of independently trained pairwise connectors. For instance, Hyper-Align finds the best unimodal pair configuration and generates its multimodal connector weights at $\approx$ 8x lower FLOP cost than state-of-the-art baselines.
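To make the idea concrete, here is a minimal toy sketch of the hypernetwork pattern the abstract describes: a single shared network maps an embedding of a (vision model, language model) pair to the flattened weights of that pair's connector, so one set of hypernetwork parameters covers all $N$ pairs. All dimensions, model names, and the linear hypernetwork form are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIS, D_TXT, D_EMB = 8, 6, 4  # toy feature/embedding sizes (assumptions)

# Toy identity embeddings for each pretrained unimodal model
# (hypothetical model names, stand-ins for learned model descriptors).
vision_ids = {"vit-a": rng.normal(size=D_EMB), "vit-b": rng.normal(size=D_EMB)}
text_ids = {"lm-a": rng.normal(size=D_EMB), "lm-b": rng.normal(size=D_EMB)}

# Hypernetwork: here just one linear map from the concatenated pair
# embedding to the flattened connector weight matrix (D_TXT x D_VIS).
W_hyper = 0.1 * rng.normal(size=(D_TXT * D_VIS, 2 * D_EMB))

def predict_connector(v_name: str, t_name: str) -> np.ndarray:
    """Generate connector weights for one vision/language model pair."""
    z = np.concatenate([vision_ids[v_name], text_ids[t_name]])
    return (W_hyper @ z).reshape(D_TXT, D_VIS)

# One fixed-size hypernetwork serves all N = |vision| x |text| pairs,
# instead of training N separate connectors.
W_conn = predict_connector("vit-a", "lm-b")
vis_feat = rng.normal(size=D_VIS)
txt_space = W_conn @ vis_feat  # vision feature projected toward the LM space
print(W_conn.shape, txt_space.shape)
```

In this sketch only `W_hyper` (and the model embeddings) would be trained; the per-pair connector weights are predicted, which is what keeps the cost fixed as the number of model combinations grows.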