RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator Paper • 2605.21748 • Published 13 days ago • 15
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation Paper • 2605.22355 • Published 12 days ago • 174
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards Paper • 2605.21467 • Published 13 days ago • 204
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration Paper • 2605.20025 • Published 14 days ago • 185
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining Paper • 2605.14747 • Published 19 days ago • 145
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity Paper • 2605.05662 • Published 26 days ago • 11
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation Paper • 2604.28196 • Published Apr 30 • 72
From Context to Skills: Can Language Models Learn from Context Skillfully? Paper • 2604.27660 • Published 30 days ago • 166
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time Paper • 2604.11626 • Published Apr 13 • 102