The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15 • 8
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training Paper • 2506.01732 • Published Jun 2 • 6
open-sci-ref-0.01 Collection Research baseline models trained on various open reference datasets • 12 items • Updated Jul 23 • 4
Article Releasing Common Corpus: the largest public domain dataset for training LLMs Mar 20, 2024 • 29
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26 • 75
Article Assisted Generation: a new direction toward low-latency text generation May 11, 2023 • 74
Common Models Collection The first generation of models pretrained on Common Corpus. • 5 items • Updated Dec 5, 2024 • 41
Qwen2.5 Collection Qwen2.5 language models, including pretrained and instruction-tuned models in 7 sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 46 items • Updated Jul 21 • 666
GEITje 7B: A Large Open Dutch Language Model Collection All models and datasets relating to GEITje • 8 items • Updated Jan 25 • 5
Recent models: last 100 repos, sorted by creation date Collection The last 100 repos I have created. Sorted by creation date descending, so the most recently created repos appear at the top. • 121 items • Updated Jan 31, 2024 • 564