mmBERT: A Modern Multilingual Encoder with Annealed Language Learning Paper • 2509.06888 • Published Sep 8, 2025 • 12
On the Theoretical Limitations of Embedding-Based Retrieval Paper • 2508.21038 • Published Aug 28, 2025 • 20
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26, 2025 • 75
Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning Paper • 2503.04973 • Published Mar 6, 2025 • 26
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4, 2025 • 249
MegaWika: Millions of reports and their sources across 50 diverse languages Paper • 2307.07049 • Published Jul 13, 2023
Defending Against Poisoning Attacks in Open-Domain Question Answering Paper • 2212.10002 • Published Dec 20, 2022
Learning to Reason via Program Generation, Emulation, and Search Paper • 2405.16337 • Published May 25, 2024
CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation Paper • 2406.17186 • Published Jun 24, 2024 • 2
Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models Paper • 2409.11136 • Published Sep 17, 2024 • 22
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published Dec 18, 2024 • 158
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 98
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions Paper • 2403.15246 • Published Mar 22, 2024 • 11
A Dataset and Strong Baselines for Classification of Czech News Texts Paper • 2307.10666 • Published Jul 20, 2023