Imu1-Base
First small language model trained on consumer GPUs with competitive performance.
Trained on 80B tokens with the NorMuon optimizer using Cautious Weight Decay and Polar Express Newton-Schulz coefficients, under a WSD (warmup-stable-decay) learning-rate scheduler.
Custom library for training: https://github.com/thepowerfuldeez/sample_efficient_gpt
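The exact NorMuon update and the Polar Express coefficients live in the training repo above. As a rough illustration of the core idea, here is a minimal sketch of the Newton-Schulz orthogonalization step that Muon-family optimizers apply to the momentum matrix; the (a, b, c) coefficients below are the commonly used Muon defaults, not the per-iteration Polar Express values.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a matrix to the nearest semi-orthogonal matrix via a
    quintic Newton-Schulz iteration (the orthogonalization step in Muon-style optimizers)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # standard Muon coefficients; Polar Express tunes these per iteration
    x = g.float()
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # iterate on the smaller Gram matrix
        x = x.T
    x = x / (x.norm() + 1e-7)           # bound the spectral norm so the iteration converges
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)
```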
Stage 1 data: 72B tokens total
- 46B web (27B dclm-edu, 8B finepdfs_edu, 7B finewiki_en, 4B zyda-2-filtered)
- 14B code (12B stack_edu, 2B python_edu)
- 10B math (5B finemath, 5B megamath)
- 2B synth (1B marin q&a, 1B cosmopedia_edu)
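The bucket sizes above define the Stage 1 sampling mixture. A minimal sketch of building such a weighted mixture with `datasets.interleave_datasets` (file paths are placeholders, not the actual processed shards):

```python
from datasets import load_dataset, interleave_datasets

# Stage 1 token budgets from the list above, in billions of tokens.
budgets = {"web": 46, "code": 14, "math": 10, "synth": 2}
probs = [v / sum(budgets.values()) for v in budgets.values()]  # 72B total

# Placeholder shard locations; substitute the real processed data for each bucket.
streams = [
    load_dataset("json", data_files=f"data/stage1/{name}/*.jsonl", split="train", streaming=True)
    for name in budgets
]

# Sample documents from each bucket in proportion to its token budget.
mixture = interleave_datasets(streams, probabilities=probs, seed=42)
```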
Classifier score thresholds (applied as in the filtering sketch below):
- DCLM-Edu 2.75+
- FinePDFs 2020+ & 2+
- The Stack v2 (updated) 3+
- The Stack Edu 3.75+
- FineMath 3+
The Stack v2 (updated) is processed with the StackEdu classifiers.
Stage 2 data: 22.75B tokens total (cooldown)
- 5.5B web (2B dclm-edu 3.0+, 3B finepdfs_edu 2.2+, 500M zyda-2-filtered)
- 5.5B knowledge (3B finewiki_en, 1B stackexchange, 1B arxiv, 0.5B legal)
- 1.75B code (1B stack_edu 4.0+, 0.75B python_edu)
- 2B math (1B finemath 4.0+, 1B megamath)
- 8B synth (2B cosmopedia_edu 3.0+, 1B marin q&a, 5B pleias synth)
FinePDFs is processed with the FineWeb-Edu classifier.
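A minimal sketch of how the score thresholds above are applied, assuming each document already carries its classifier score in a `score` field (field names, the year-cutoff reading of "2020+", and file paths are assumptions):

```python
from datasets import load_dataset

# DCLM-Edu: keep documents with FineWeb-Edu classifier score >= 2.75.
dclm = load_dataset("json", data_files="data/dclm_edu/*.jsonl", split="train", streaming=True)
dclm = dclm.filter(lambda ex: ex["score"] >= 2.75)

# FinePDFs: "2020+ & 2+" read here as document year >= 2020 and score >= 2.
pdfs = load_dataset("json", data_files="data/finepdfs_edu/*.jsonl", split="train", streaming=True)
pdfs = pdfs.filter(lambda ex: ex["year"] >= 2020 and ex["score"] >= 2.0)
```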
Training phase
Stable stage - started with batch size 80 and context length 640, increased to context 704 and batch size 70 during training (~50k tokens per micro step); gradient accumulation 8, so ~400k tokens per update step.
Decay stage - context increased to 1024 with batch size 50; gradient accumulation 10, so ~500k tokens per update step. Inverse-sqrt learning-rate decay.
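The tokens-per-step figures follow from batch size × context length × gradient accumulation, and the decay stage applies an inverse-sqrt tail on top of the WSD schedule. A small sketch; the exact decay parameterization is an assumption:

```python
import math

# Tokens per step, from the batch/context/grad-acc settings above.
stable_micro = 70 * 704          # ≈ 49k tokens per micro step (stable stage)
stable_step = stable_micro * 8   # grad acc 8 → ≈ 394k tokens per optimizer step
decay_micro = 50 * 1024          # ≈ 51k tokens per micro step (decay stage)
decay_step = decay_micro * 10    # grad acc 10 → ≈ 512k tokens per optimizer step

def wsd_inverse_sqrt_lr(step: int, peak_lr: float, warmup: int, stable: int) -> float:
    """Warmup-Stable-Decay learning rate with an inverse-sqrt decay tail.
    One common parameterization; not necessarily the exact one used for this run."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup      # linear warmup
    if step < warmup + stable:
        return peak_lr                            # constant stable phase
    t = step - (warmup + stable)
    return peak_lr / math.sqrt(1.0 + t / warmup)  # inverse-sqrt decay
```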
- EMA during cooldown, computed post hoc due to memory limits, with increased checkpoint frequency (see the sketch after this list)
- MFU: 46%
- Total FLOPs consumed: 749.5 TFLOPS × (4+16+4+60+5+27+25+…) ≈ 6e8 TFLOP
- Total micro steps: 1,750,000, or 200,000 optimizer steps.
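Because a running EMA could not be kept in memory during training, it is reconstructed after the fact from the frequently saved cooldown checkpoints. A minimal sketch of that post-hoc averaging (checkpoint format and the decay constant are illustrative):

```python
import torch

def post_hoc_ema(checkpoint_paths, decay: float = 0.95) -> dict:
    """Fold a chronologically ordered list of checkpoints into one EMA state dict.
    Assumes each file holds a plain parameter state_dict; adapt the loading
    to the actual checkpoint layout."""
    ema = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if ema is None:
            ema = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                ema[k].mul_(decay).add_(v.float(), alpha=1.0 - decay)
    return ema
```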
Evals
| Task | Accuracy |
|---|---|
| hellaswag_zeroshot | 0.397530 |
| jeopardy | 0.174776 |
| bigbench_qa_wikidata | 0.535308 |
| arc_easy | 0.661195 |
| arc_challenge | 0.365188 |
| copa | 0.680000 |
| commonsense_qa | 0.251433 |
| piqa | 0.672470 |
| openbook_qa | 0.328000 |
| lambada_openai | 0.392199 |
| hellaswag | 0.395837 |
| winograd | 0.615385 |
| winogrande | 0.535912 |
| bigbench_dyck_languages | 0.163000 |
| agi_eval_lsat_ar | 0.278261 |
| bigbench_cs_algorithms | 0.465151 |
| bigbench_operators | 0.276190 |
| bigbench_repeat_copy_logic | 0.062500 |
| squad | 0.252886 |
| coqa | 0.278467 |
| boolq | 0.556269 |
| bigbench_language_identification | 0.257400 |
| CORE | 0.226476 |
Inference
A custom fork of transformers is required:
uv pip install "git+https://github.com/thepowerfuldeez/transformers.git@imu1"
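With the fork installed, loading and generation should follow the standard transformers API. A minimal sketch (the Hub repo id below is a guess; substitute the actual model id):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thepowerfuldeez/Imu1-Base"  # assumed repo id; replace with the real one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```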
