---
license: apache-2.0
datasets:
- PleIAs/SYNTH
language:
- en
---
# honkhazard-3
40.6M (10.49M embed, 16L/8H) | 975M seen
---
a third experiment to train only on synthetic messages!
- parameters: 40.6M (13.11M mlp, 10.49M embed, 10.49M head, 6.55M attn) — see the shape sketch after this list
- tokens seen: 975.2M
- num_layers: 16
- num_heads: 8
- vocab_size: 32768
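the per-component counts above are consistent with a plain decoder-only layout at hidden size 320. a minimal sketch of that back-of-the-envelope check; hidden size, untied embeddings, and the 4x mlp expansion are inferences from the numbers, not the actual training config:
```python
# rough check of the parameter breakdown above, assuming a plain decoder-only
# transformer with untied embeddings; hidden size 320 is inferred from the counts
# (not stated on the card), and the 4x mlp expansion is an assumption that
# happens to reproduce the 13.11M figure
num_layers, num_heads, vocab_size = 16, 8, 32768
hidden = 320  # inferred: 32768 * 320 ≈ 10.49M embedding params

embed = vocab_size * hidden                 # ~10.49M
head = vocab_size * hidden                  # ~10.49M (untied lm_head)
attn = num_layers * 4 * hidden * hidden     # q, k, v, o projections ≈ 6.55M
mlp = num_layers * 2 * 4 * hidden * hidden  # up + down at 4x ≈ 13.11M

total = embed + head + attn + mlp
print(f"{total / 1e6:.1f}M")  # ≈ 40.6M
```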
changes vs *honkhazard-2*:
- identical main NN config
- tweaked LRs
- tuned batch count to ~halve training time
- fixed a bug that capped the dataset at ~600M tokens, which caused the data to repeat
- changed vocab size 64K -> 32K
trained on 1x rtx 5090 in 68.1 minutes

## pre-training
pre-trained only on SYNTH messages in the following format:
```
<|bos|><|user_start|>{{query}}<|user_end|><|assistant_start|><|reasoning_start|>{{synthetic_reasoning}}<|reasoning_end|>{{synthetic_answer}}<|assistant_end|>
```
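a minimal sketch of how one record gets flattened into that template; the query / reasoning / answer field names are assumptions about the SYNTH columns, not the actual preprocessing code:
```python
# flatten one synthetic record into the pre-training string; the argument names
# are assumed column names, not the real preprocessing pipeline
def format_example(query: str, reasoning: str, answer: str) -> str:
    return (
        "<|bos|>"
        f"<|user_start|>{query}<|user_end|>"
        "<|assistant_start|>"
        f"<|reasoning_start|>{reasoning}<|reasoning_end|>"
        f"{answer}"
        "<|assistant_end|>"
    )

print(format_example("what is 2+2?", "2 plus 2 is 4.", "4"))
```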
## post-training
no post-training of any form has been performed on this model
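so to use it, prompt with the pre-training template directly. a rough sketch, assuming the checkpoint ships in a transformers-compatible format and the special tokens above are in the tokenizer; the repo id is a placeholder, not the real one:
```python
# prompt the raw pre-trained checkpoint with the pre-training template;
# transformers compatibility and tokenizer special tokens are assumptions
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "your-username/honkhazard-3"  # placeholder repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# no post-training, so the prompt ends where the model expects to start reasoning
prompt = (
    "<|bos|><|user_start|>what is the capital of france?<|user_end|>"
    "<|assistant_start|><|reasoning_start|>"
)
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=128)
print(tok.decode(out[0]))
```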
## postmortem
being honest: this model was not intended to be fully trained, but sunk cost fallacy + curiosity made it so. loss is definitely better and training was ~2x faster, but the model seems about the same (or maybe less useful?) in practice.