LLM PlayBooks Collection All useful playbooks for training LLMs • 6 items • Updated about 18 hours ago • 2
Smol-Data Collection Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated 8 days ago • 12
The Synthetic Data Playbook: Generating Trillions of the Finest Tokens Explore synthetic data experiments in a visual bookshelf • Running on CPU Upgrade • 123
Article GGML and llama.cpp join HF to ensure the long-term progress of Local AI • 18 days ago • 479
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark Paper • 2409.02813 • Published Sep 4, 2024 • 33
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI Paper • 2404.16006 • Published Apr 24, 2024 • 2
Article FineWeb-C: A Community-Driven Dataset for Educational Quality Annotations in 122 Languages • Jul 8, 2025 • 35