In a Training Loop 🔄
codelion
AI & ML interests
Creator of OptiLLM, OpenEvolve, Adaptive Classifier, and Ellora. Pioneering a new category in AI infrastructure: inference-time compute for LLMs.
Recent Activity
reacted to their post with 🤗, 👀, and 🚀 · 2 days ago · Scaling Pedagogical Pre-training to 10 Billion Tokens
New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.
We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
The result is https://huggingface.co/datasets/codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
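For anyone who wants to poke at that metadata, here is a minimal sketch of streaming the dataset with the datasets library. The column names used below (domain, complexity, quality_score) are assumptions inferred from the description above, not a confirmed schema; check the dataset card for the actual fields.

```python
# Minimal sketch: stream sutra-10B and peek at the per-entry metadata.
# The field names below are assumptions inferred from the post, not a
# confirmed schema -- consult the dataset card for the actual columns.
from datasets import load_dataset

ds = load_dataset("codelion/sutra-10B", split="train", streaming=True)

for example in ds.take(5):
    # Each entry is described as carrying domain, complexity, prerequisites,
    # and quality scores alongside the pedagogical text itself.
    print({k: example.get(k) for k in ("domain", "complexity", "quality_score")})
```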
We trained https://huggingface.co/codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
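As a rough illustration of that setup, the snippet below sketches a standard causal-LM training run over the dataset with transformers. The text column name and every hyperparameter are placeholders for illustration, not the exact recipe behind the released checkpoint.

```python
# Illustrative causal-LM training sketch (not the exact recipe used for the
# released checkpoint). Downloading the full 10B-token split is heavy; swap in
# a smaller Sutra scale (10M/100M/1B) to experiment.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "codelion/SmolLM2-70M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("codelion/sutra-10B", split="train")

def tokenize(batch):
    # The "text" column name is an assumption; adjust to the dataset's schema.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="smollm2-70m-sutra",
    num_train_epochs=3,                 # matches the 3 full epochs reported
    per_device_train_batch_size=8,      # placeholder; tune for a single A10
    gradient_accumulation_steps=8,
    learning_rate=3e-4,                 # placeholder
    bf16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```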
Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.
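For context on how a number like that is measured, here is a small sketch of computing perplexity as the exponentiated mean cross-entropy over a handful of samples; the sample count, max length, and text column are placeholders, and in practice you would evaluate on a held-out slice rather than training data.

```python
# Small perplexity sketch: exponentiated mean cross-entropy over a few samples
# (use a held-out slice in practice). Sample count, max length, and the
# "text" column are placeholders.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codelion/SmolLM2-70M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

samples = load_dataset("codelion/sutra-10B", split="train", streaming=True).take(64)

losses = []
with torch.no_grad():
    for ex in samples:
        enc = tokenizer(ex["text"], return_tensors="pt", truncation=True, max_length=1024)
        out = model(**enc, labels=enc["input_ids"])  # causal-LM loss = shifted cross-entropy
        losses.append(out.loss.item())

print("perplexity:", math.exp(sum(losses) / len(losses)))
```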
Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens
All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
Datasets
codelion/sutra-magpie-sft
codelion/fineweb-edu-100M
codelion/dclm-baseline-1B
codelion/dclm-baseline-100M
codelion/dclm-baseline-10M
codelion/execution-world-model-dataset
codelion/SimpleQA-Verified
codelion/ifeval-high-quality-dpo
codelion/Qwen2.5-Coder-0.5B-Instruct-security-preference
codelion/Qwen2.5-Coder-0.5B-Instruct-progressive-2M-context
codelion/Llama-3.2-1B-Instruct-magpie-tool-calling
codelion/Qwen3-0.6B-icm-dpo-pairs
codelion/gemma-3-1b-it-magpie-reasoning