LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory Paper • 2603.03269 • Published 17 days ago • 61
Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence Paper • 2603.07660 • Published 12 days ago • 83
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding Paper • 2602.20913 • Published 24 days ago • 11
jina-embeddings-v5-text: Task-Targeted Embedding Distillation Paper • 2602.15547 • Published Feb 17 • 26
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions Paper • 2602.08711 • Published Feb 9 • 28
BrowseComp-V^3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents Paper • 2602.12876 • Published Feb 13 • 10
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models Paper • 2602.13191 • Published Feb 13 • 30
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence Paper • 2602.08683 • Published Feb 9 • 52
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models Paper • 2601.22060 • Published Jan 29 • 155
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding Paper • 2601.10611 • Published Jan 15 • 30
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning Paper • 2601.06943 • Published Jan 11 • 214
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos Paper • 2601.00393 • Published Jan 1 • 133
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding Paper • 2512.17532 • Published Dec 19, 2025 • 68