FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain Paper • 2510.15232 • Published Oct 17 • 5
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering Paper • 2510.06426 • Published Oct 7 • 2
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks Paper • 2507.01001 • Published Jul 1 • 47
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective Paper • 2505.15045 • Published May 21 • 54
PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving Paper • 2503.21821 • Published Mar 26 • 21
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21 • 85