BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models Paper • 2502.07346 • Published Feb 11 • 54
Slamming: Training a Speech Language Model on One GPU in a Day Paper • 2502.15814 • Published Feb 19 • 69
Qwen2-VL Collection Vision-language model series based on Qwen2 • 16 items • Updated Jul 21 • 226
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding Paper • 2501.18362 • Published Jan 30 • 23
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper • 2412.18319 • Published Dec 24, 2024 • 39
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions Paper • 2412.09596 • Published Dec 12, 2024 • 98
ProcessBench: Identifying Process Errors in Mathematical Reasoning Paper • 2412.06559 • Published Dec 9, 2024 • 84
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published Dec 6, 2024 • 159
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 118
Article 🐺🐦‍⬛ LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs • Dec 4, 2024 • 80