Poster
REMEDY: Recipe Merging Dynamics in Large Vision-Language Models
Didi Zhu · Yibing Song · tao shen · Ziyu Zhao · Jinluan Yang · Min Zhang · Chao Wu
Hall 3 + Hall 2B #503
Model merging has emerged as a powerful technique for combining task-specific vision models into a unified and multi-functional model. Previous methods represented by task arithmetic, have demonstrated effectiveness and scalability in this domain. When large vision-language models (LVLMs) arise with model size scaling up, this design becomes challenging to fuse different instruction-tuned LVLMs for generalization enhancement. The large scale and multi-modal nature of LVLMs present unique obstacles, including constructing reusable and modular components to accommodate the multi-component architecture of LVLMs and the requirement for dynamic fusion based on multi-modal input tokens. To address these challenges, we propose the \textbf{RE}cipe \textbf{ME}rging \textbf{DY}namics (REMEDY) method, a scalable and flexible paradigm for model merging in LVLMs. We first define reusable modules termed \textit{recipes} including the projector and shallow LLM layers, enhancing visual-language understanding. Then, we introduce a modality-aware allocator dynamically generates weights in a one-shot manner based on input relevance to existing recipes, enabling efficient cross-modal knowledge integration. REMEDY thus offers an adaptive solution for LVLMs to tackle both seen (i.e., multi-task learning) and unseen (i.e., zero-shot generalization) tasks. Experimental results demonstrate that our method consistently improves performance on both seen and unseen tasks, underscoring the effectiveness of REMEDY in diverse multi-modal scenarios.
Live content is unavailable. Log in and register to view live content