MixAtlas: Uncertainty-aware Data Mixture for Multimodal LLM Midtraining
Bingbing Wen ⋅ Sirajul Salekin ⋅ Feiyang Kang ⋅ Bill Howe ⋅ Lucy Lu Wang ⋅ Javier Movellan ⋅ Manjot Bilkhu
Abstract
Domain reweighting can improve sample efficiency and downstream generalization; however, data-mixture optimization for multimodal midtraining remains underexplored. Current multimodal training recipes tune mixtures from only a single perspective, such as data format or task type. We introduce MixAtlas, which produces a benchmark-targeted data recipe that users can inspect, adapt, and transfer to their own corpora and downstream goals. MixAtlas curates the training image corpus along two interpretable axes (\emph{image concepts} and \emph{task supervision}), enabling interpretable mixture control and fine-grained attribution of downstream performance to specific domains within each axis. We search for mixtures using small proxy models and a Gaussian-process surrogate, and show that the resulting optimal mixtures transfer successfully to larger-scale models. The resulting mixtures yield substantial improvements: up to 2$\times$ faster convergence and consistent average gains of 1\%--17.6\% across 10 diverse benchmarks compared with the strongest data-mixture baselines. Overall, MixAtlas makes multimodal mixture optimization interpretable and adaptable, providing concrete, compute-efficient data recipes for training next-generation MLLMs.
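As a rough illustration of surrogate-guided mixture search in general, the sketch below fits a Gaussian-process surrogate to proxy scores and picks the next mixture with an upper-confidence-bound rule. It is not the MixAtlas implementation: the function `proxy_score`, the RBF kernel, the UCB acquisition, the Dirichlet candidate sampling, and all hyperparameters are assumptions introduced here, and `proxy_score` is a toy stand-in for training and evaluating a small proxy model.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3):
    # Squared-exponential kernel between rows of A and rows of B (assumed kernel choice).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_posterior(X, y, Xs, noise=1e-3):
    # Standard GP regression: posterior mean and standard deviation at query points Xs.
    y_mean = y.mean()
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    mean = y_mean + Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - (v ** 2).sum(0), 0.0, None)  # prior variance k(x, x) is 1
    return mean, np.sqrt(var)

def proxy_score(w):
    # Hypothetical stand-in for "train a small proxy model on mixture w and evaluate it".
    target = np.array([0.5, 0.3, 0.2])
    return -np.sum((w - target) ** 2)

def search_mixture(n_domains=3, n_init=5, n_rounds=15, n_candidates=500, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    # Initial mixtures are sampled uniformly from the probability simplex.
    X = rng.dirichlet(np.ones(n_domains), size=n_init)
    y = np.array([proxy_score(w) for w in X])
    for _ in range(n_rounds):
        cand = rng.dirichlet(np.ones(n_domains), size=n_candidates)
        mean, std = gp_posterior(X, y, cand)
        # Upper confidence bound: prefer mixtures that look good or are still uncertain.
        w_next = cand[np.argmax(mean + beta * std)]
        X = np.vstack([X, w_next])
        y = np.append(y, proxy_score(w_next))
    return X[np.argmax(y)], y.max()

best_w, best_score = search_mixture()
```

The returned `best_w` is a weight vector on the simplex, one weight per domain. In a real setting each `proxy_score` call would be an expensive proxy training run, which is why the surrogate's uncertainty is used to decide where to evaluate next.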