Poster

LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation

Fangxun Shu · Yue Liao · Lei Zhang · Le Zhuo · Chenning Xu · Guanghao Zhang · Haonan Shi · Weilong Dai · ZhongTao · Zhelun Yu · Wanggui He · Siming Fu · Haoyuan Li · Si Liu · Hongsheng Li · Hao Jiang

Hall 3 + Hall 2B #295
[ Project Page ]
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models ($s$-MLLM) by distilling knowledge from a large-scale MLLM ($l$-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of the $s$-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge transfer. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable the $s$-MLLM to emulate the $l$-MLLM's understanding. Following this, we introduce preference distillation via Preference Optimization (PO), where the key lies in treating the $l$-MLLM as the reference model. During this phase, the $s$-MLLM's ability to discriminate between superior and inferior examples is significantly enhanced beyond that of the $l$-MLLM, leading to a better $s$-MLLM that surpasses the $l$-MLLM, particularly on hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD surpasses existing works across various benchmarks while maintaining a minimal number of activated parameters and low computational cost. Remarkably, LLaVA-MoD-2B surpasses Qwen-VL-Chat-7B with an average gain of 8.8\%, using merely $0.3\%$ of the training data and 23\% of the trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of efficient MLLMs.
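The two distillation stages described above correspond to standard formulations: a token-level KL divergence between the teacher's and student's output distributions (mimic distillation), and a DPO-style objective in which the frozen $l$-MLLM serves as the reference model (preference distillation). The following is a minimal PyTorch-style sketch of these two losses under those standard formulations; all function and variable names are illustrative and are not taken from the paper's released code.

```python
# Illustrative sketch of the two distillation losses described in the abstract.
# Assumes generic KL-based logit distillation and a DPO-style preference loss;
# names are hypothetical, not from the LLaVA-MoD codebase.
import torch
import torch.nn.functional as F


def mimic_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student output distributions per token."""
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    # F.kl_div takes student log-probs as input and teacher log-probs as target
    # when log_target=True.
    return F.kl_div(s_log_probs, t_log_probs, log_target=True, reduction="batchmean")


def preference_distillation_loss(s_chosen_lp, s_rejected_lp,
                                 ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO-style loss with the large teacher (l-MLLM) as the reference model.

    Each *_lp argument is the summed log-probability of the chosen or rejected
    response under the student (s_) or the frozen teacher reference (ref_).
    """
    margin = beta * ((s_chosen_lp - ref_chosen_lp) - (s_rejected_lp - ref_rejected_lp))
    return -F.logsigmoid(margin).mean()
```

In this sketch the student is pushed to match the teacher's token distributions in the first stage, and in the second stage it is rewarded only when its preference margin between superior and inferior responses exceeds the teacher's, which is consistent with the abstract's claim that the student can surpass the teacher on hallucination benchmarks.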
