VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models
Abstract
Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but their behavior under real-world image distortions remains poorly understood. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34pp.
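As a rough illustration of the bookkeeping behind these numbers, the sketch below counts corrupted settings and computes an accuracy drop in percentage points (pp). It assumes that every graded augmentation has exactly three severity levels and every binary transform contributes a single setting; under that assumption, 49 types and 133 settings imply 42 graded and 7 binary augmentations, a split the abstract does not state explicitly. The function names are hypothetical and are not part of the benchmark's code.

```python
def count_settings(n_graded: int, n_binary: int, n_levels: int = 3) -> int:
    """Total corrupted settings: each graded type appears once per severity
    level, each binary transform appears once (assumption, see lead-in)."""
    return n_graded * n_levels + n_binary


def solve_split(n_types: int = 49, n_settings: int = 133, n_levels: int = 3):
    """Recover (n_graded, n_binary) from the two totals by solving
        n_graded + n_binary            = n_types
        n_levels * n_graded + n_binary = n_settings
    """
    extra = n_settings - n_types  # equals (n_levels - 1) * n_graded
    if extra < 0 or extra % (n_levels - 1) != 0:
        raise ValueError("totals are inconsistent with the assumed structure")
    n_graded = extra // (n_levels - 1)
    n_binary = n_types - n_graded
    if n_binary < 0:
        raise ValueError("totals are inconsistent with the assumed structure")
    return n_graded, n_binary


def drop_pp(clean_acc: float, corrupted_acc: float) -> float:
    """Accuracy drop in percentage points, with accuracies given in [0, 1]."""
    return (clean_acc - corrupted_acc) * 100.0


if __name__ == "__main__":
    graded, binary = solve_split()
    print(graded, binary, count_settings(graded, binary))
    # Illustrative accuracies only, not results from the paper.
    print(round(drop_pp(0.80, 0.72), 2))
```

Reporting drops in percentage points rather than relative percent keeps the numbers comparable across models with different clean accuracies.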