Collections
Collections including paper arxiv:2409.12191
- Qwen Technical Report
  Paper • 2309.16609 • Published • 37
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
  Paper • 2311.07919 • Published • 10
- Qwen2 Technical Report
  Paper • 2407.10671 • Published • 167
- Qwen2-Audio Technical Report
  Paper • 2407.10759 • Published • 62

- Exploring the Potential of Encoder-free Architectures in 3D LMMs
  Paper • 2502.09620 • Published • 26
- The Evolution of Multimodal Model Architectures
  Paper • 2405.17927 • Published • 1
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 103
- Efficient Architectures for High Resolution Vision-Language Models
  Paper • 2501.02584 • Published

- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 74
- BRAVE: Broadening the visual encoding of vision-language models
  Paper • 2404.07204 • Published • 19
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 47
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 121

- The Evolution of Multimodal Model Architectures
  Paper • 2405.17927 • Published • 1
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 103
- Efficient Architectures for High Resolution Vision-Language Models
  Paper • 2501.02584 • Published
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 133

- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 78
- Multimodal Latent Language Modeling with Next-Token Diffusion
  Paper • 2412.08635 • Published • 48
- AIDC-AI/Ovis2-2B
  Image-Text-to-Text • 2B • Updated • 3.33k • 59
- DAMO-NLP-SG/VideoLLaMA3-2B
  Video-Text-to-Text • 2B • Updated • 4.29k • 15

- The Llama 3 Herd of Models
  Paper • 2407.21783 • Published • 117
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 78
- Baichuan Alignment Technical Report
  Paper • 2410.14940 • Published • 51
- A Survey of Small Language Models
  Paper • 2410.20011 • Published • 46