What Lies Beyond the View? Actively Constructing Spatial Beliefs in Foundation Models
Abstract
Current foundation models can answer spatial reasoning questions about a given image or text, yet they lack the fundamental ability to build a genuine spatial understanding of an environment through active exploration. This reflects a critical blind spot in prevailing evaluation protocols, which predominantly test passive reasoning on curated data rather than the active construction of knowledge under uncertainty. To address this, we introduce Theory of Space (ToS), a new framework analogous to Theory of Mind. While Theory of Mind concerns an agent's ability to model the hidden mental states of others, ToS concerns an agent's ability to construct, update, and utilize an internal belief about the unobserved structure of its spatial environment from local, incomplete observations. We implement ToS as a comprehensive benchmark featuring both text-based and visual environments. Rather than performing specific tasks in these environments, the agent's primary objective is to build a complete and accurate spatial belief through curiosity-driven exploration. A core innovation of our framework is the direct probing of this internal belief: we prompt models to explicitly present their cognitive map at each step, allowing us to measure not only task performance but also the quality, consistency, and evolution of the underlying spatial model itself. By evaluating state-of-the-art models both as active explorers and as passive reasoners (the latter using logs from scripted proxy agents), we disentangle exploration strategy from reasoning ability. Our analysis reveals common failure modes in spatial belief management, such as egomotion update errors and the inability to maintain a globally consistent map. The ToS framework provides the concepts and tools necessary to evaluate and build agents with more robust, human-like spatial intelligence.
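To make the idea of a probed spatial belief more concrete, the following is a minimal illustrative sketch, not the benchmark's actual implementation. It assumes a 2D world, a heading convention in which 0 degrees faces +y and angles increase clockwise, and a simple mean-distance score; the names `SpatialBelief` and `map_error` are hypothetical and introduced here only for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SpatialBelief:
    """Hypothetical allocentric belief: agent pose plus estimated object positions."""
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0  # assumed convention: 0 = facing +y, clockwise positive
    objects: dict = field(default_factory=dict)  # name -> (x, y) in world coordinates

    def rotate(self, degrees: float) -> None:
        # Egomotion update for a turn in place.
        self.heading_deg = (self.heading_deg + degrees) % 360.0

    def move_forward(self, distance: float) -> None:
        # Egomotion update for a translation along the current heading.
        theta = math.radians(self.heading_deg)
        self.x += distance * math.sin(theta)
        self.y += distance * math.cos(theta)

    def observe(self, name: str, bearing_deg: float, distance: float) -> None:
        # Convert an egocentric observation (bearing relative to heading,
        # clockwise positive) into a world-frame position estimate.
        theta = math.radians(self.heading_deg + bearing_deg)
        self.objects[name] = (
            self.x + distance * math.sin(theta),
            self.y + distance * math.cos(theta),
        )


def map_error(belief: SpatialBelief, ground_truth: dict) -> tuple:
    """Return (mean Euclidean error over matched objects, coverage fraction)."""
    matched = [name for name in ground_truth if name in belief.objects]
    if not matched:
        return math.inf, 0.0
    total = sum(math.dist(belief.objects[n], ground_truth[n]) for n in matched)
    return total / len(matched), len(matched) / len(ground_truth)


if __name__ == "__main__":
    belief = SpatialBelief()
    belief.observe("chair", bearing_deg=0.0, distance=2.0)
    belief.rotate(90.0)          # skipping this step would mimic an egomotion update error
    belief.move_forward(3.0)
    belief.observe("table", bearing_deg=-90.0, distance=2.0)
    truth = {"chair": (0.0, 2.0), "table": (3.0, 2.0)}
    print(map_error(belief, truth))
```

Under these assumptions, scoring the belief after every action gives a per-step trace of map quality, and omitting a pose update (for example the `rotate` call) displaces every object observed afterwards, which is one way an egomotion update error could surface in such a probe.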