InnoGym: Benchmarking the Innovation Potential of AI Agents Paper • 2512.01822 • Published 6 days ago • 33
MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation Paper • 2511.03942 • Published Nov 6 • 1
LightMem: Lightweight and Efficient Memory-Augmented Generation Paper • 2510.18866 • Published Oct 21 • 110
When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation Paper • 2510.07238 • Published Oct 8 • 14
OceanGym: A Benchmark Environment for Underwater Embodied Agents Paper • 2509.26536 • Published Sep 30 • 34
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses Paper • 2510.00232 • Published Sep 30 • 15
SteeringControl: Holistic Evaluation of Alignment Steering in LLMs Paper • 2509.13450 • Published Sep 16 • 7
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning Paper • 2509.04744 • Published Sep 5 • 11
Persona Vectors: Monitoring and Controlling Character Traits in Language Models Paper • 2507.21509 • Published Jul 29 • 32
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions Paper • 2507.05257 • Published Jul 7 • 14