SARAH: Spatially Aware Real-time Agentic Humans
Abstract
A causal transformer-based variational autoencoder combined with flow matching enables real-time, spatially-aware conversational motion for embodied agents in virtual reality applications.
As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given the user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent toward the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on the user's trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance that decouples learning from control: the model captures natural spatial alignment from data, while users can adjust eye-contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach in a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
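The gaze control described above pairs a conditioning signal (the gaze score) with classifier-free guidance at inference time. A minimal sketch of that general pattern applied to flow-matching sampling, assuming a hypothetical `velocity_fn` and plain Euler integration -- all names here are illustrative, not the paper's actual implementation:

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: blend conditional and unconditional
    velocity predictions. guidance_scale > 1 strengthens the condition
    (here, the gaze score); 0 ignores it entirely."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def euler_flow_sample(velocity_fn, x0, cond, guidance_scale=2.0, steps=8):
    """Integrate the learned flow from noise x0 toward a motion latent.
    velocity_fn(x, t, cond) returns the predicted velocity; passing
    cond=None selects the unconditional branch (condition dropped)."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = velocity_fn(x, t, cond)
        v_u = velocity_fn(x, t, None)
        x = x + dt * cfg_velocity(v_c, v_u, guidance_scale)
    return x
```

At `guidance_scale = 1` this reduces to ordinary conditional sampling; raising the scale pushes the trajectory further along the direction implied by the gaze condition, which is how eye-contact intensity can be dialed up or down without retraining.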
Community
A real-time, fully causal system that produces spatially aware, gaze-controllable full-body motion for embodied agents from user position and dyadic audio, running at over 300 FPS and deployable on a streaming VR headset.
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation (2026)
- ProAct: A Dual-System Framework for Proactive Embodied Social Agents (2026)
- EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation (2025)
- Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures (2026)
- 3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control (2026)
- DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model (2025)
- StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars (2025)