Demystifying Supervision Data Generalization in Multimodal LMs
Xuan Qi · Luxi He · Dan Roth · Xingyu Fu
Abstract
Conventional wisdom in selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that are intuitively similar to the target task (e.g. text-rich vs. vision-centric). However, it remains unclear how reliably such similarity translates into improved performance on the test benchmarks. In this paper, we take a first step toward studying this problem in MLLMs: can we predict a training dataset's influence on a target benchmark before any training takes place? To answer this question, we first conduct an in-depth analysis using 14 vision-language datasets covering 7 diverse tasks. Our analysis shows that intuitive task similarity is unreliable for predicting task generalizability, and that transfer depends on the specific dataset rather than the broader task category. We propose DATAPROPHET, a training-free, simple yet effective metric based on multimodal perplexity, similarity, and data diversity. Our experiments demonstrate that the influence rankings of supervision datasets derived from DATAPROPHET are strongly correlated with rankings based on the actual performance increase after training, with a Kendall's $\tau$ correlation coefficient of 86.0\%. Moreover, we show that DATAPROPHET can help select better supervision data, achieving up to a 6.9\% average improvement over uniform selection, 1.4\% over the state-of-the-art training-based baseline, and 0.2\% over an oracle selection based on actual post-training performance. Our code and data will be released.
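As a rough illustration of the evaluation protocol described above, the sketch below combines per-dataset perplexity, similarity, and diversity signals into a single predicted-influence score and measures its rank agreement with observed post-training gains using Kendall's $\tau$. The scoring function, weights, and numbers are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of ranking candidate supervision datasets by a
# DataProphet-style score and comparing against observed gains.
# The combination rule and weights here are placeholder assumptions.
import numpy as np
from scipy.stats import kendalltau

def predicted_influence(perplexity, similarity, diversity,
                        w_ppl=1.0, w_sim=1.0, w_div=1.0):
    """Lower target-conditioned perplexity, higher similarity to the
    benchmark, and higher internal diversity all increase the score."""
    return -w_ppl * perplexity + w_sim * similarity + w_div * diversity

# One entry per candidate supervision dataset (toy numbers).
ppl = np.array([12.3, 9.8, 15.1, 11.0])
sim = np.array([0.42, 0.61, 0.30, 0.55])
div = np.array([0.70, 0.52, 0.80, 0.66])
observed_gain = np.array([1.2, 3.4, 0.3, 2.8])  # accuracy delta after training

scores = predicted_influence(ppl, sim, div)
tau, p_value = kendalltau(scores, observed_gain)
print(f"Kendall's tau between predicted and actual rankings: {tau:.3f}")
```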