BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 36
BigCodeArena 🚀 Space • 37 • Compare two AI models by sending them code and seeing their responses
BigCodeArena: Judging code generations end to end with code executions Article • Oct 7, 2025 • 19
R-Zero: Self-Evolving Reasoning LLM from Zero Data Paper • 2508.05004 • Published Aug 7, 2025 • 129
Optimizing Decomposition for Optimal Claim Verification Paper • 2503.15354 • Published Mar 19, 2025 • 18
IHEval: Evaluating Language Models on Following the Instruction Hierarchy Paper • 2502.08745 • Published Feb 12, 2025 • 20
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks Paper • 2410.01744 • Published Oct 2, 2024 • 26 • 5
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions Paper • 2406.15877 • Published Jun 22, 2024 • 48
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning Paper • 2406.12050 • Published Jun 17, 2024 • 19