Poster in Workshop: 3rd Workshop on Navigating and Addressing Data Problems For Foundation Models (DATA-FM)

MixAtlas: Uncertainty-aware Data Mixture for Multimodal LLM Midtraining

Bingbing Wen ⋅ Sirajul Salekin ⋅ Feiyang Kang ⋅ Bill Howe ⋅ Lucy Lu Wang ⋅ Javier Movellan ⋅ Manjot Bilkhu


Abstract

Domain reweighting can improve sample efficiency and downstream generalization; however, data-mixture optimization for multimodal midtraining remains underexplored. Current multimodal training recipes tune mixtures from only a single perspective, such as data format or task type. We introduce MixAtlas, which produces a benchmark-targeted data recipe that users can inspect, adapt, and transfer to their own corpora and downstream goals. MixAtlas curates the training image corpus along two interpretable axes, image concepts and task supervision, enabling interpretable mixture control and fine-grained attribution of downstream performance to specific domains within each axis. Using small proxy models and a Gaussian-process surrogate, we show that the optimal mixtures obtained transfer successfully to larger-scale models. The resulting mixtures yield substantial improvements: up to 2× faster convergence and consistent average gains of 1% to 17.6% on 10 diverse benchmarks compared with the strongest data-mixture baselines. Overall, MixAtlas makes multimodal mixture optimization interpretable and adaptable, providing concrete, compute-efficient data recipes for training next-generation MLLMs.
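To make the proxy-model and Gaussian-process idea concrete, the following is a minimal sketch of surrogate-guided mixture search in plain Python. It is not the authors' implementation: the RBF kernel, the upper-confidence-bound selection rule, the three-domain setup, and every function name are assumptions chosen for illustration. In particular, `proxy_score` is a hypothetical stand-in for training a small proxy model on a mixture and scoring it on the target benchmarks.

```python
import math
import random

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two mixture-weight vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * length ** 2))

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            # Clamp tiny negative values caused by floating-point round-off.
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(X, y, noise=1e-3):
    """Fit a zero-mean GP to centered scores; returns the state used by predict."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    mean_y = sum(y) / len(y)
    alpha = solve_upper_t(L, solve_lower(L, [v - mean_y for v in y]))
    return X, L, alpha, mean_y

def predict(gp, x_new):
    """Posterior mean and standard deviation of the score at one mixture."""
    X, L, alpha, mean_y = gp
    k = [rbf(a, x_new) for a in X]
    mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve_lower(L, k)
    var = max(1.0 - sum(vi * vi for vi in v), 0.0)  # rbf(x, x) == 1
    return mu, math.sqrt(var)

def sample_simplex(dim, rng):
    """Draw mixture weights uniformly from the probability simplex."""
    w = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(w)
    return [v / total for v in w]

def proxy_score(w):
    """HYPOTHETICAL stand-in for 'train a proxy model on mixture w, then evaluate'."""
    target = [0.5, 0.3, 0.2]  # assumed optimum, for illustration only
    return 1.0 - sum((a - b) ** 2 for a, b in zip(w, target))

def search_mixture(dim=3, n_init=5, n_rounds=10, n_candidates=200, beta=1.0, seed=0):
    """Alternate between fitting the surrogate and evaluating the most promising mixture."""
    rng = random.Random(seed)
    X = [sample_simplex(dim, rng) for _ in range(n_init)]
    y = [proxy_score(w) for w in X]
    for _ in range(n_rounds):
        gp = fit_gp(X, y)
        candidates = [sample_simplex(dim, rng) for _ in range(n_candidates)]
        # Upper confidence bound: prefer a high predicted score or high uncertainty.
        def ucb(w):
            mu, sd = predict(gp, w)
            return mu + beta * sd
        best = max(candidates, key=ucb)
        X.append(best)
        y.append(proxy_score(best))
    i = max(range(len(y)), key=y.__getitem__)
    return X[i], y[i]
```

Calling `weights, score = search_mixture()` returns the best mixture evaluated so far together with its proxy score. The uncertainty term (`beta * sd`) is what makes this kind of search uncertainty-aware: mixtures the surrogate knows little about still get explored. In a real pipeline each `proxy_score` call would be an expensive training run, which is why the surrogate is fitted once per round and queried many times.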
