Poster in Workshop: The First Workshop on Efficient Spatial Reasoning
Mon, Apr 27, 2026 • 7:45 AM – 8:45 AM PDT

GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

Karim Elmaaroufi ⋅ Liheng Lai ⋅ Justin Svegliato ⋅ Yutong Bai ⋅ Sanjit Seshia ⋅ Matei Zaharia


Abstract

Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for applications such as medical imaging and robotics. We present GRAID, a data generation pipeline that produces high-fidelity spatial reasoning data from images by qualitatively analyzing 2D geometry obtained from object detectors. By avoiding single-image 3D reconstruction pipelines and generative hallucinations, GRAID produces datasets with higher accuracy, as confirmed by our human study. Crucially, we demonstrate that training on GRAID-generated QA pairs leads to learning transferable concepts and improves reasoning across general visual reasoning problems. We fine-tune several VLM families on GRAID data and compare against models tuned on data from current methods. We find that GRAID-tuned models achieve significant accuracy gains on both spatial reasoning and general visual reasoning benchmarks such as BLINK, A-OKVQA, and RealWorldQA. GRAID will be publicly available after the review period.
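To make the idea of deriving QA pairs from 2D detector output concrete, here is a minimal sketch. It is not GRAID's implementation: the `Box` type, the `left_of` predicate, the question template, and the `generate_qa` helper are all hypothetical names chosen for illustration, and the boxes are assumed to come from some object detector as pixel coordinates.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    """Hypothetical detector output: a label plus an axis-aligned box in pixels."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def left_of(a: Box, b: Box) -> bool:
    """True when box a lies strictly to the left of box b (no horizontal overlap)."""
    return a.x2 < b.x1


def generate_qa(boxes):
    """Emit (question, answer) pairs only when the 2D relation is unambiguous."""
    qa = []
    for a, b in permutations(boxes, 2):
        if a.label == b.label:
            continue  # skip pairs whose labels cannot be told apart in text
        if left_of(a, b):
            answer = "yes"
        elif left_of(b, a):
            answer = "no"
        else:
            continue  # horizontally overlapping boxes: relation is ambiguous
        qa.append((f"Is the {a.label} to the left of the {b.label}?", answer))
    return qa


# Illustrative boxes, not real detector output.
boxes = [
    Box("car", 0, 0, 10, 10),
    Box("person", 20, 0, 30, 10),
    Box("dog", 5, 0, 25, 10),  # overlaps both other boxes horizontally
]
qa_pairs = generate_qa(boxes)
```

In this sketch, only the car/person pair yields questions, because the dog's box overlaps both others horizontally. Skipping ambiguous cases is one simple way to favor label accuracy over dataset size; whether GRAID makes the same trade-off is an assumption here.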
