Imu1-Base

First small language model trained on consumer GPUs with competitive performance.

Trained on 80B tokens using the NorMuon optimizer with Cautious Weight Decay, Polar Express Newton-Schulz coefficients, and a WSD (warmup-stable-decay) scheduler.
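
For reference, NorMuon (like Muon) orthogonalizes each 2-D update with a short Newton-Schulz iteration, and Polar Express tunes the iteration's polynomial coefficients. Below is a minimal sketch of that step; the fixed quintic coefficients shown are the commonly published Muon defaults, not necessarily the ones used for this run.

```python
# Minimal sketch of the Newton-Schulz orthogonalization used by Muon-style
# optimizers such as NorMuon. The quintic coefficients are illustrative
# defaults; Polar Express replaces them with per-iteration optimized
# coefficients, which are not reproduced here.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map the update matrix G to the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315      # illustrative quintic coefficients
    X = G.bfloat16()                        # low precision, as in reference Muon code
    transposed = G.size(0) > G.size(1)
    if transposed:                          # work in the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)               # scale spectrum into the basin of convergence
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```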

Custom library for training: https://github.com/thepowerfuldeez/sample_efficient_gpt

Stage 1 data: 72B total

46B web (27B dclm-edu, 8B finepdfs_edu, 7B finewiki_en, 4B zyda-2-filtered)

14B code (12B stack_edu, 2B python_edu)

10B math (5B finemath, 5B megamath)

2B synth (1B marin q&a, 1B cosmopedia_edu)
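
A hedged sketch of how a mixture like this can be sampled with the `datasets` library; the repository names are placeholders (the real pipeline lives in `sample_efficient_gpt`), and weighting by token share rather than document count is only an approximation.

```python
# Hypothetical sketch: sample the Stage 1 categories in proportion to their
# token budgets. Dataset names are placeholders, not the actual sources.
from datasets import load_dataset, interleave_datasets

token_budget_b = {"web": 46, "code": 14, "math": 10, "synth": 2}  # billions, from the breakdown above
total = sum(token_budget_b.values())                              # 72B
probabilities = [v / total for v in token_budget_b.values()]

streams = [
    load_dataset("placeholder/web-mix",   split="train", streaming=True),
    load_dataset("placeholder/code-mix",  split="train", streaming=True),
    load_dataset("placeholder/math-mix",  split="train", streaming=True),
    load_dataset("placeholder/synth-mix", split="train", streaming=True),
]
# Note: interleave_datasets samples documents, not tokens, so these
# probabilities only approximate the token-level mixture.
mixture = interleave_datasets(streams, probabilities=probabilities, seed=42)
```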

Thresholds for the FineWeb-Edu classifier:

  • DCLM-Edu 2.75+
  • FinePDFs 2020+ & 2+
  • The Stack V2 Updated 3+
  • The Stack Edu 3.75+
  • FineMath 3+

The Stack V2 Updated is processed with the Stack-Edu classifiers.

Stage 2 data: 22.75B total (cooldown)

5.5B web (2B dclm-edu 3.0+, 3B finepdfs_edu 2.2+, 500M zyda-2-filtered)

5.5B knowledge (3B finewiki_en, 1B stackexchange, 1B arxiv, 0.5B legal)

1.75B code (1B stack_edu 4.0+, 0.75B python_edu)

2B math (1B finemath 4.0+, 1B megamath)

8B synth (2B cosmopedia_edu 3.0+, 1B marin q&a, 5B pleias synth)

FinePDFs is processed with the FineWeb-Edu classifier.
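
A sketch of the score-threshold filtering described above, assuming the publicly released `HuggingFaceFW/fineweb-edu-classifier` checkpoint; the exact scoring pipeline used for this model may differ.

```python
# Sketch of score-threshold filtering with the public FineWeb-Edu classifier.
# Assumes the HuggingFaceFW/fineweb-edu-classifier checkpoint; the actual
# pipeline for this model may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(name)
classifier = AutoModelForSequenceClassification.from_pretrained(name)

def edu_score(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return classifier(**inputs).logits.squeeze(-1).item()  # roughly a 0-5 quality score

def keep(text: str, threshold: float) -> bool:
    # e.g. 2.75 for DCLM-Edu or 2+ for FinePDFs, per the thresholds listed above
    return edu_score(text) >= threshold
```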

Training phase

Stable stage - starts with batch size 80 and context 640, increased to context 704 and batch size 70 during training (~50k tokens per micro step); gradient accumulation 8, so ~400k tokens per update step.

Decay stage - context increased up to 1024 with batch size 50; gradient accumulation 10, so ~500k tokens per update step. Inverse-sqrt learning-rate decay.
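
A minimal sketch of a WSD schedule with an inverse-sqrt decay tail; the warmup length, peak learning rate, and stage boundaries below are placeholders, since they are not stated above.

```python
# Illustrative WSD (warmup-stable-decay) learning-rate schedule with an
# inverse-sqrt decay tail. All constants are placeholders; the real values
# used for this run are not stated in this card.
import math

def wsd_inverse_sqrt_lr(step: int,
                        peak_lr: float = 3e-3,        # placeholder
                        warmup_steps: int = 2_000,    # placeholder
                        stable_steps: int = 160_000,  # placeholder
                        min_lr: float = 3e-4) -> float:
    if step < warmup_steps:                            # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:             # stable (constant-LR) stage
        return peak_lr
    decay_step = step - warmup_steps - stable_steps + 1    # decay stage
    return max(min_lr, peak_lr / math.sqrt(decay_step))    # inverse-sqrt decay
```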

  • EMA of weights during cooldown, applied post-hoc due to memory limits; checkpoint frequency increased
  • MFU: 46%
  • Total FLOP consumed: 749.5 TFLOPS × (4 + 16 + 4 + 60 + 5 + 27 + 25 + …) ≈ 6e8 TFLOP
  • Total micro steps: 1,750,000, or ~200,000 optimizer steps (rough token check below).
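
As a rough consistency check using only the figures above (decay-stage steps at ~500k tokens each push the total slightly higher):

```python
# Rough check: stable-stage tokens per optimizer step × total optimizer steps
tokens_per_update = 704 * 70 * 8                   # ≈ 394k tokens per update step
optimizer_steps = 200_000
print(tokens_per_update * optimizer_steps / 1e9)   # ≈ 78.8B, close to the stated 80B budget
```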

Evals

Task                               , Accuracy
hellaswag_zeroshot                 , 0.397530
jeopardy                           , 0.174776
bigbench_qa_wikidata               , 0.535308
arc_easy                           , 0.661195
arc_challenge                      , 0.365188
copa                               , 0.680000
commonsense_qa                     , 0.251433
piqa                               , 0.672470
openbook_qa                        , 0.328000
lambada_openai                     , 0.392199
hellaswag                          , 0.395837
winograd                           , 0.615385
winogrande                         , 0.535912
bigbench_dyck_languages            , 0.163000
agi_eval_lsat_ar                   , 0.278261
bigbench_cs_algorithms             , 0.465151
bigbench_operators                 , 0.276190
bigbench_repeat_copy_logic         , 0.062500
squad                              , 0.252886
coqa                               , 0.278467
boolq                              , 0.556269
bigbench_language_identification   , 0.257400
CORE                               , 0.226476


Inference

A custom fork of transformers is required:

uv pip install "git+https://github.com/thepowerfuldeez/transformers.git@imu1"
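
Once the fork is installed, generation should follow the usual transformers flow; a minimal sketch, assuming the standard `AutoModelForCausalLM` path works with this checkpoint:

```python
# Minimal generation sketch, assuming the standard transformers API works
# once the imu1 fork above is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "thepowerfuldeez/imu_1_base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("The Newton-Schulz iteration is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
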
Model size: 0.5B parameters (Safetensors, F32)
