SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Abstract
Double Interactive Reinforcement Learning (DIRL) enables Vision Language Models (VLMs) to coordinate multiple tools for precise spatial reasoning, achieving state-of-the-art performance on benchmarks and real-world tasks.
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding but struggle with the metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises to augment these capabilities by letting VLMs call a wide variety of tools, such as depth estimators, segmentation models, and pose estimators. Yet how to realize this vision remains an open challenge: handcrafted prompting strategies and fixed, predefined tool pipelines limit a VLM's ability to discover optimal tool-use patterns. Reinforcement learning could close this gap, but has so far been limited to reasoning with a single visual tool because of the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. The resulting model, SpaceTools, performs tool-augmented spatial reasoning and achieves state-of-the-art results on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK), and it demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL yields substantial improvements over vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
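To make the two-phase recipe concrete, the sketch below outlines how a DIRL-style pipeline could be organized: a supervised teaching phase on mixed demonstrations, followed by an interactive RL exploration phase over a multi-tool environment. This is an illustration only; the function and class names (teaching_phase, exploration_phase, env.rollout, vlm.rl_update, etc.) are hypothetical placeholders and do not correspond to the authors' released code.

```python
# Minimal sketch of a DIRL-style two-phase training loop (illustrative only).
# All names here are hypothetical; they are not the SpaceTools API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    """One tool-augmented reasoning trajectory: prompt, tool calls, final answer."""
    prompt: str
    steps: List[str]
    answer: str


def teaching_phase(vlm, specialist_traces: List[Trace], frontier_traces: List[Trace]):
    """Phase 1 (teaching): mix demonstrations from a single-tool specialist
    (itself trained with interactive RL) with multi-tool traces from a
    frontier model, then fine-tune the VLM on the combined set."""
    dataset = specialist_traces + frontier_traces
    vlm.finetune(dataset)  # standard SFT on tool-use trajectories
    return vlm


def exploration_phase(vlm, env, reward_fn: Callable[[Trace], float], iterations: int = 1000):
    """Phase 2 (exploration): the VLM rolls out multi-tool trajectories in a
    tool environment (depth, segmentation, pose, robot, ...) and is updated
    from task-level rewards, refining its tool-coordination policy."""
    for _ in range(iterations):
        trace = env.rollout(vlm)    # model decides which tools to call and when
        reward = reward_fn(trace)   # e.g. correctness of the final spatial answer
        vlm.rl_update(trace, reward)  # policy-gradient style update
    return vlm
```

The key design point conveyed by the abstract is the ordering: supervised teaching first narrows the otherwise intractable multi-tool search space, so that the subsequent RL phase explores tool coordination from a reasonable starting policy rather than from scratch.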
Community
TL;DR: SpaceTools empowers VLMs with vision and robotic tools for spatial reasoning via Double Interactive Reinforcement Learning (DIRL), enabled by our Toolshed infrastructure. Achieves state-of-the-art performance on spatial reasoning benchmarks and enables precise real-world robot manipulation.
Project page: https://spacetools.github.io/
The following similar papers were recommended by the Semantic Scholar API (automated suggestion from the Librarian Bot):
- Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs (2025)
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards (2025)
- DeepEyesV2: Toward Agentic Multimodal Model (2025)
- TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics (2025)
- Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch (2025)
- Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning (2025)
- MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning (2025)