Evaluation
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
Paper • arXiv:2408.00765 • 13 upvotes
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
Paper • arXiv:2407.21646 • 18 upvotes
LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
Paper • arXiv:2408.04284 • 25 upvotes
Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
Paper • arXiv:2408.07852 • 16 upvotes
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Paper • arXiv:2409.06595 • 38 upvotes
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Paper • arXiv:2409.07314 • 56 upvotes
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
Paper • arXiv:2409.11242 • 7 upvotes
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper • arXiv:2410.02712 • 37 upvotes
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Paper • arXiv:2410.05262 • 11 upvotes
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Paper • arXiv:2409.15277 • 38 upvotes
Fusion-Eval: Integrating Evaluators with LLMs
Paper • arXiv:2311.09204 • 6 upvotes
HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Paper • arXiv:2505.24098 • 43 upvotes
Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Paper • arXiv:2508.18076 • 6 upvotes