s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs Paper • 2603.14628 • Published 12 days ago • 3
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL Paper • 2603.19470 • Published 8 days ago • 3
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders Paper • 2603.19209 • Published 8 days ago • 5
AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science Paper • 2603.19005 • Published 8 days ago • 6
Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD Paper • 2603.20155 • Published 7 days ago • 8
Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States Paper • 2603.19987 • Published 7 days ago • 9
WorldAgents: Can Foundation Image Models be Agents for 3D World Models? Paper • 2603.19708 • Published 8 days ago • 12
LoopRPT: Reinforcement Pre-Training for Looped Language Models Paper • 2603.19714 • Published 8 days ago • 13
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents Paper • 2603.19685 • Published 8 days ago • 18
ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models Paper • 2603.19466 • Published 8 days ago • 39
Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models Paper • 2603.17051 • Published 10 days ago • 103
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning Paper • 2603.17024 • Published 10 days ago • 104
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation Paper • 2603.18886 • Published 8 days ago • 6
MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning Paper • 2603.16929 • Published 14 days ago • 12
ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents Paper • 2603.18815 • Published 8 days ago • 14
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining Paper • 2603.15030 • Published 12 days ago • 21