mmbert-colab

non-profit

orionweller

authored 2 papers 3 months ago

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Paper • 2509.06888 • Published Sep 8 • 12

On the Theoretical Limitations of Embedding-Based Retrieval

Paper • 2508.21038 • Published Aug 28 • 20

hynky

authored a paper 6 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 75

orionweller

authored a paper 9 months ago

Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

Paper • 2503.04973 • Published Mar 6 • 26

hynky

authored a paper 10 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 249

hynky

authored a paper 11 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63

orionweller

authored 8 papers 12 months ago

NevIR: Negation in Neural Information Retrieval

Paper • 2305.07614 • Published May 12, 2023 • 1

Learning from Task Descriptions

Paper • 2011.08115 • Published Nov 16, 2020

MegaWika: Millions of reports and their sources across 50 diverse languages

Paper • 2307.07049 • Published Jul 13, 2023

Defending Against Poisoning Attacks in Open-Domain Question Answering

Paper • 2212.10002 • Published Dec 20, 2022

Learning to Reason via Program Generation, Emulation, and Search

Paper • 2405.16337 • Published May 25, 2024

CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation

Paper • 2406.17186 • Published Jun 24, 2024 • 2

Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

Paper • 2409.11136 • Published Sep 17, 2024 • 22

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published Dec 18, 2024 • 158

hynky

authored a paper over 1 year ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 98

orionweller

authored a paper over 1 year ago

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

Paper • 2403.15246 • Published Mar 22, 2024 • 11

hynky

authored a paper almost 2 years ago

A Dataset and Strong Baselines for Classification of Czech News Texts

Paper • 2307.10666 • Published Jul 20, 2023