Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
Pingyue Zhang ⋅ Zihan Huang ⋅ Yue Wang ⋅ Jieyu Zhang ⋅ Letian Xue ⋅ Zihan Wang ⋅ Qineng Wang ⋅ Keshigeyan Chandrasegaran ⋅ Ruohan Zhang ⋅ Yejin Choi ⋅ Ranjay Krishna ⋅ Jiajun Wu ⋅ Li Fei-Fei ⋅ Manling Li
Abstract
Spatial embodied intelligence under partial observability requires agents to actively acquire missing information rather than passively consume complete observations. While multimodal foundation models excel at passive perception and reasoning, their ability to support active, self-directed exploration to build and maintain a coherent spatial belief remains unstudied. We therefore propose Theory of Space, defined as an agent's ability to construct, revise, and exploit a spatial belief through self-directed active exploration under partial observability. We operationalize Theory of Space as a benchmark with textual and visual environments. Rather than solving specific tasks, the agent's goal is curiosity-driven exploration to build a complete, accurate spatial belief. A core innovation is spatial belief probing: at each step, we prompt the model to reveal its internal spatial belief as a cognitive map, letting us measure the quality of that belief. Our evaluation of state-of-the-art models on a suite of downstream tasks reveals critical bottlenecks: (1) The Active-Passive Gap: performance degrades when agents must autonomously gather information (e.g., GPT-5.2: 57.1 → 46.0); (2) Inefficiency: models explore unsystematically and with high redundancy, failing to match the efficiency of program-based proxies while producing no better results. Through belief probing, we diagnose that perception acts as an initial bottleneck, yet global beliefs suffer further from instability that causes spatial knowledge to degrade over time. Finally, a false belief paradigm reveals Belief Inertia: agents fail to overwrite obsolete priors, an effect especially severe in vision-based models.