Poster Sat, Apr 25, 2026 • 11:15 AM – 1:45 PM PDT Pavilion 4 P4-#3502

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

Nianchen Deng ⋅ Lixin Gu ⋅ Shenglong Ye ⋅ Yinan He ⋅ Zhe Chen ⋅ Songze Li ⋅ Haomin Wang ⋅ Jinhui Yin ⋅ Qi Wei ⋅ Tianshuo Yang ⋅ Min Dou ⋅ Tong He ⋅ Wenqi Shao ⋅ Kaipeng Zhang ⋅ Yi Wang ⋅ Botian Shi ⋅ Yanting Zhang ⋅ Jifeng Dai ⋅ Yu Qiao ⋅ Wenhai Wang ⋅ Hongjie Zhang

Abstract

Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain constrained by limited scale, narrow visual diversity, and restricted instruction expressiveness. To address these gaps, we present InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, alongside InternSpatial-Bench, a comprehensive evaluation benchmark designed to assess spatial understanding across diverse instruction formats. InternSpatial contains 12 million question-answer (QA) pairs covering both single-view and multi-view scenarios, sourced from varied visual environments and supporting 19 distinct instruction formats that mirror real-world query patterns. InternSpatial-Bench targets single-view assessment and extends to multi-view reasoning through a novel rotation estimation task. Experiments show that models trained on InternSpatial achieve substantial improvements of 12.1% on InternSpatial-Bench and 10.7% on VSI-Bench, while preserving competitive performance on general-purpose benchmarks. We expect these resources to advance the development of spatially capable VLMs for practical applications in robotics and embodied AI systems. Our code and datasets are publicly available at https://github.com/dengnianchen/intern-spatial.
