XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Abstract
Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks have advanced multimodal evaluation, it remains unclear whether OLLMs achieve modality-invariant reasoning or inherit modality-specific biases. We introduce \textbf{XModBench}, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench contains 60K multiple-choice questions spanning five task families and systematically covers all six cross-modality directions, enabling diagnosis of task competence, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60\% accuracy, (ii) suffers from modality disparities, with performance dropping by over 20 points on average when audio inputs replace text, and (iii) exhibits directional imbalance, with a 9-point gap when vision rather than text serves as the context modality. These findings suggest that OLLMs fall short of modality-invariant reasoning, and XModBench provides a fundamental diagnostic tool for evaluating and improving their cross-modal competence.