Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions Paper • 2406.10638 • Published Jun 15, 2024
MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos Paper • 2502.12558 • Published Feb 18
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval Paper • 2502.11431 • Published Feb 17
Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification Paper • 2506.19225 • Published Jun 24
TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos Paper • 2509.26360 • Published Sep 30
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist Paper • 2511.08521 • Published 28 days ago • 37
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation Paper • 2503.24379 • Published Mar 31 • 76
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval Paper • 2412.14475 • Published Dec 19, 2024 • 55