Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World
Abstract
The Claw-Anything benchmark evaluates large language model agents on comprehensive user activity contexts spanning extended timeframes, multiple services, and diverse device interactions, to assess how well they can act as always-on personal assistants.
Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below its results on prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility as scalable data infrastructure.
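The abstract reports pass@1 but does not say how the evaluation harness computes it. As a minimal sketch, assuming the commonly used unbiased pass@k estimator (for a task with n sampled attempts of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks), the metric could be computed as follows. The task names and outcomes are hypothetical and are not data from the paper.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n sampled attempts, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts contains no pass,
    # subtracted from 1. comb() returns 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, list[bool]], k: int = 1) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    return mean(pass_at_k(len(r), sum(r), k) for r in results.values())


# Hypothetical outcomes for three made-up tasks (illustration only).
toy_results = {
    "reschedule_meeting": [True, False, False, False],
    "file_expense_report": [False, False, False, False],
    "proactive_reminder": [True, True, False, False],
}
score = benchmark_pass_at_k(toy_results, k=1)
print(f"pass@1 = {score:.3f}")
```

For k = 1 the estimator reduces to the fraction of passing attempts per task, so the toy score is the mean of 0.25, 0.0, and 0.5.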
Community
Claw-Anything: See anything, and do anything. Scaling Agent Context.
We believe the next leap for always-on LLM agents lies in scaling agent context — expanding the slice of the user's digital world an assistant can continuously perceive, reason over, and act on.
the standout thing is how claw-anything expands context along long-horizon histories, interdependent backend services, and integrated GUI/CLI across devices, then actually simulates months of user activity with realistic noise. i’d love a clean ablation that drops one axis at a time—history, backend coupling, or cross-device interaction—to see which drives pass@1 most. the arxivlens breakdown helped me parse the method details and it points to where the biggest implementation bets are. one worry is that proactive recommendations hinge on up-to-date world models, but the paper doesn’t quantify how often signals are fresh versus stale in the long-horizon setting.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ClawGym: A Scalable Framework for Building Effective Claw Agents (2026)
- $\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows (2026)
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation (2026)
- Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (2026)
- Terminal-World: Scaling Terminal-Agent Environments via Agent Skills (2026)
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents (2026)
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation (2026)
Get this paper in your agent:
hf papers read 2605.26086
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash