Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization
Abstract
Spatial reasoning remains a core limitation of current vision-language models (VLMs), which often struggle to understand object relations such as direction, distance, and spatial configuration. In this work, we introduce SVQA-R1, a reinforcement learning framework that improves spatial reasoning by enforcing view consistency during training. At its core is Spatial-GRPO, a group-based reward optimization method that encourages the model to generate consistent answers and reasoning across perturbed views of the same scene. We adopt two complementary perturbation strategies: (1) horizontal flipping, which supervises directional concepts like “left” and “right”; and (2) 2D viewpoint transformations, such as in-plane rotation and perspective warping, which reinforce reasoning about distance and relative positioning. This unified approach enables the model to acquire geometry-aware spatial understanding without relying on supervised fine-tuning. Experiments on multiple Spatial VQA benchmarks demonstrate that SVQA-R1 significantly outperforms strong baselines and produces interpretable, consistent reasoning across diverse viewpoints. Code and data will be released.
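To make the view-consistency idea concrete, the following is a minimal sketch of how a horizontal-flip consistency reward could be combined with group-relative normalization. It is an illustration under stated assumptions, not the Spatial-GRPO implementation: the abstract does not specify the reward formulation, so the binary match reward, the word-level swap of "left" and "right", and all function names below are hypothetical.

```python
import re
import statistics

# Hypothetical helpers for illustration; none of these names come from the paper.
_SWAP = {"left": "right", "right": "left"}


def flip_horizontal(image):
    """Mirror a row-major image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def swap_left_right(text):
    """Lowercase the text and exchange the words 'left' and 'right'."""
    return re.sub(r"\b(left|right)\b", lambda m: _SWAP[m.group(1)], text.lower())


def consistency_reward(answer_original, answer_flipped):
    """Assumed binary reward: 1.0 if the flipped-view answer, mapped back
    through the left/right swap, matches the original-view answer."""
    mapped_back = swap_left_right(answer_flipped).strip()
    return 1.0 if mapped_back == answer_original.lower().strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two sampled responses for the same scene: (original-view answer, flipped-view answer).
pairs = [
    ("The cup is left of the plate", "The cup is right of the plate"),  # consistent
    ("The cup is left of the plate", "The cup is left of the plate"),   # inconsistent
]
rewards = [consistency_reward(a, b) for a, b in pairs]
advantages = group_relative_advantages(rewards)
```

In this sketch, the response whose answer changes correctly under the flip receives a positive advantage and the one that ignores the flip receives a negative advantage. A reward for the rotation and perspective-warp perturbations would instead expect the answer to stay unchanged, and a practical reward would likely also score answer correctness and reasoning format, neither of which is modeled here.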