GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation
Karim Elmaaroufi ⋅ Liheng Lai ⋅ Justin Svegliato ⋅ Yutong Bai ⋅ Sanjit Seshia ⋅ Matei Zaharia
Abstract
Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for applications such as medical imaging and robotics. We present GRAID, a pipeline that generates high-fidelity spatial reasoning data from images through qualitative analysis of 2D geometry using object detectors. By avoiding single-image 3D reconstruction pipelines and generative hallucinations, GRAID produces datasets with higher accuracy, as confirmed by our human study. Crucially, we demonstrate that training on GRAID-generated QA pairs leads to learning transferable concepts and improved reasoning across general visual reasoning problems. We fine-tune several VLM families on GRAID data and compare against models tuned on data from current methods. We find that GRAID-tuned models achieve significant accuracy gains on both spatial reasoning and general visual reasoning benchmarks such as BLINK, A-OKVQA, and RealWorldQA. GRAID will be publicly available after the review period.
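To make the idea of deriving spatial questions from 2D detector geometry concrete, the following is a minimal sketch and not the GRAID implementation. The `Box` type, the `is_left_of` predicate, and the question template are all hypothetical names chosen for illustration; it assumes a detector has already returned labeled, axis-aligned boxes in pixel coordinates.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    """Hypothetical detector output: axis-aligned box, origin at top-left."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def is_left_of(a: Box, b: Box) -> bool:
    """True only when a lies entirely to the left of b (no horizontal overlap)."""
    return a.x_max < b.x_min


def left_of_questions(boxes: list[Box]) -> list[tuple[str, str]]:
    """Emit yes/no QA pairs for ordered pairs whose 2D relation is unambiguous."""
    qa = []
    for a, b in permutations(boxes, 2):
        if a.label == b.label:
            continue  # same-class pairs would make the question ambiguous
        question = f"Is the {a.label} to the left of the {b.label}?"
        if is_left_of(a, b):
            qa.append((question, "Yes"))
        elif is_left_of(b, a):
            qa.append((question, "No"))
        # Horizontally overlapping pairs are skipped (an assumption of this
        # sketch): no answer is emitted when the geometry does not decide it.
    return qa


# Made-up boxes standing in for real detector output.
boxes = [
    Box("car", 10, 50, 100, 120),
    Box("person", 150, 40, 200, 130),
    Box("dog", 90, 60, 160, 110),
]
qa_pairs = left_of_questions(boxes)
for question, answer in qa_pairs:
    print(question, answer)
```

In this sketch every answer follows from a box comparison rather than from a generative model or an estimated depth map, which is the kind of property the abstract attributes to working with 2D geometry. Skipping overlapping pairs trades coverage for label accuracy.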