Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
Abstract
Current multimodal foundation models show limitations in maintaining coherent spatial beliefs during active exploration, exhibiting gaps between active and passive performance, inefficient exploration strategies, and difficulties in updating outdated spatial knowledge.
Spatial embodied intelligence requires agents to act in order to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this ability with a benchmark whose goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap: performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency: models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false-belief paradigm, we uncover Belief Inertia: agents fail to update obsolete priors when confronted with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.
Community
Theory of Space studies whether foundation models can construct a globally consistent spatial belief from partial observations via active exploration, revise the belief in dynamic environments when new evidence contradicts prior assumptions, and exploit the belief for downstream spatial tasks. We also probe the model to externalize its spatial belief during exploration to “open the box” and directly observe how beliefs evolve over time.
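To make the per-step probing protocol concrete, below is a minimal sketch of how spatial belief probing could be wired into an exploration loop. This is an illustrative assumption, not the paper's actual implementation: the environment interface, the `query_model` wrapper, and the JSON belief format are all hypothetical placeholders.

```python
# Minimal sketch of spatial-belief probing during active exploration.
# Assumptions (not from the paper): a grid-world environment with partial
# observations, and a chat-style foundation model behind `query_model`.

import json


def query_model(prompt: str) -> str:
    """Placeholder for a call to a multimodal foundation model."""
    raise NotImplementedError("Wire this to your model or API of choice.")


def probe_belief(history: list[str]) -> dict:
    """Ask the agent to externalize its current spatial belief as JSON."""
    prompt = (
        "Based on your observations so far:\n"
        + "\n".join(history)
        + "\n\nReturn your current belief about the environment as JSON with "
          "keys 'known_cells' (mapping 'row,col' to a label) and 'frontier' "
          "(unexplored cells adjacent to known free space)."
    )
    return json.loads(query_model(prompt))


def explore(env, max_steps: int = 50):
    """Curiosity-driven exploration loop with belief probing at every step."""
    history, beliefs = [], []
    obs = env.reset()                              # assumed: returns an observation
    for _ in range(max_steps):
        history.append(f"Observation: {obs}")
        beliefs.append(probe_belief(history))      # externalized belief snapshot
        action = query_model(                      # self-directed action choice
            "\n".join(history) + "\nChoose the next action (N/S/E/W)."
        ).strip()
        obs, done = env.step(action)               # assumed: returns (obs, done)
        if done:
            break
    return beliefs
```

Logging the probed beliefs alongside ground truth at each step is what allows separating perception errors from belief-update failures, as in the false-belief analysis described above.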
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents (2025)
- Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments (2025)
- SpatialTree: How Spatial Abilities Branch Out in MLLMs (2025)
- PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents (2026)
- Current Agents Fail to Leverage World Model as Tool for Foresight (2026)
- CitySeeker: How Do VLMS Explore Embodied Urban Navigation With Implicit Human Needs? (2025)
- Agentic Uncertainty Quantification (2026)