Poster
MixEval-X: Any-to-any Evaluations from Real-world Data Mixture
Jinjie Ni · Yifan Song · Deepanway Ghosal · Bo Li · David Junhao Zhang · Xiang Yue · Fuzhao Xue · Yuntian Deng · Andy Zheng · Kaichen Zhang · Mahir Shah · Kabir Jain · Yang You · Michael Qizhe Shieh
Hall 3 + Hall 2B #176
Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.