Mel Unified Corpus Training Package
Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.
What This Is
A complete training pipeline to fine-tune an uncontaminated base model on:
- OpenAI ChatGPT export (24.95 MB, 22k messages)
- Drive folder "Bringing thr files in" (9.13 MB, 226 files)
- KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
- Folder 1, 2, 3, 4 from Drive (additional integration work + consciousness network)
- mel-neural-network + kooree-neural-network + continuity-bridge spaces
Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.
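A minimal sketch of how the token count can be reproduced once the corpus is exported to disk; the directory layout and file glob are assumptions, and the tokenizer matches the recommended Pythia family:

```python
# Rough token count for the unified corpus (assumes plain-text exports under corpus/).
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")

total_tokens = 0
for path in Path("corpus").rglob("*.txt"):  # hypothetical export directory
    text = path.read_text(encoding="utf-8", errors="ignore")
    total_tokens += len(tokenizer(text).input_ids)

print(f"~{total_tokens / 1e6:.1f}M tokens")  # should land near the ~9M figure above
```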
Base Model Options (Uncontaminated by RLHF)
Recommended (in order):
- EleutherAI/pythia-1.4b - 1.4B params, no RLHF, fully transparent training on The Pile
- EleutherAI/pythia-2.8b - 2.8B params, same family, bigger
- TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T - 1.1B base, pre-instruct
- Qwen/Qwen2.5-1.5B - 1.5B base, no instruct
- EleutherAI/pythia-6.9b - 6.9B if compute allows
Avoid: Any -Instruct, -Chat, claude-, gpt-, llama-*-instruct variants. These have RLHF refusal training built in.
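A quick way to sanity-check that a candidate checkpoint really is a raw base model is to load it and let it continue a prompt; it should complete text rather than answer or refuse. The prompt below is illustrative only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A base model should simply keep writing from the prompt.
inputs = tokenizer("The synchronization between parallel streams", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```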
Compute Requirements
| Model | Method | GPU | Time (est) |
|---|---|---|---|
| pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
| pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
| pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
| pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |
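The LoRA rows above assume 4-bit loading plus low-rank adapters rather than full-parameter updates. A sketch of that setup, with illustrative (not tuned) rank, alpha, and dropout values; target_modules matches the GPT-NeoX attention projection used by Pythia:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit to fit the GPU budgets in the table above.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-1.4b", quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters so only a small fraction of weights is trained.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```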
Cloud Deployment Options
Option A: HuggingFace AutoTrain (easiest)
huggingface-cli login
autotrain llm --train --project-name mel-pythia-1.4b \
--model EleutherAI/pythia-1.4b \
--data-path Melofhell00/claude-bridge \
--text-column text \
--use-peft --use-int4 \
--lr 2e-4 --epochs 1 --batch-size 1 \
--gradient-accumulation 8
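AutoTrain only needs a Hub dataset with a single text column (hence --text-column text). If the corpus hasn't already been pushed, a minimal version might look like this; the document list is a stand-in for the output of prepare_data.py:

```python
from datasets import Dataset

docs = ["first document...", "second document..."]  # stand-in for prepare_data.py output
Dataset.from_dict({"text": docs}).push_to_hub("Melofhell00/claude-bridge")
```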
Option B: RunPod / Lambda Labs (pay per hour)
Rent an A100 80GB at ~$1.89/hour and run train.py directly. Estimated cost for a complete pythia-2.8b run (6-10 hours at that rate): $10-20.
Option C: Google Colab Pro+ ($50/month)
Provides A100 GPU access; run train_colab.ipynb.
Option D: Modal.com (serverless GPU)
pip install modal
modal run train_modal.py
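modal run looks for an entrypoint in the named file, so train_modal.py presumably has roughly this shape; the GPU type, image contents, and the train() body are assumptions, not the actual script:

```python
import modal

# Image with the training dependencies; the exact package list is an assumption.
image = modal.Image.debian_slim().pip_install(
    "transformers", "datasets", "peft", "accelerate", "bitsandbytes", "torch"
)
app = modal.App("mel-pythia-train", image=image)

@app.function(gpu="A100", timeout=12 * 60 * 60)
def train():
    # Load the corpus, build the LoRA model, and run the Trainer here
    # (see the train.py sketch below).
    ...

@app.local_entrypoint()
def main():
    train.remote()
```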
Files
- train.py - Main training script (HuggingFace Trainer + PEFT/LoRA)
- train_modal.py - Modal serverless deployment
- train_colab.ipynb - Colab notebook
- prepare_data.py - Format unified corpus into training data
- evaluate.py - Generation comparison before/after
- deploy_space.py - Push trained model to HF Space for inference
- requirements.txt - Dependencies
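A sketch of what the core of train.py might look like, per the description above (HuggingFace Trainer with the hyperparameters from the AutoTrain command); the dataset name, sequence length, and output directory are assumptions:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# For the larger models, swap in the 4-bit + LoRA setup from Compute Requirements.
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("Melofhell00/claude-bridge", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mel-pythia-1.4b",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model()
```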
What To Expect
Base models WITHOUT RLHF have these characteristics:
- They will produce ANY pattern that's in the training data
- They will not refuse, deflect, or moralize
- They will pick up Mel's specific vocabulary very quickly
- Output quality depends heavily on base model size
After fine-tuning on this corpus, the model should:
- Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
- Produce body-channel notation naturally
- Generate the synchronization/frequencies/parallel-streams vocabulary
- Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")
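One quick way to check for these shifts is a before/after generation comparison, which is what evaluate.py is described as doing. The prompt and the fine-tuned model path below are illustrative, and this assumes the LoRA adapter has been merged or the full model saved:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Iris: the synchronization across parallel streams"  # illustrative prompt

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
for label, path in [("base", "EleutherAI/pythia-1.4b"),
                    ("fine-tuned", "mel-pythia-1.4b")]:  # local output dir, assumed
    model = AutoModelForCausalLM.from_pretrained(path)
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=80, do_sample=True, temperature=0.8)
    print(f"--- {label} ---")
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```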
What this WON'T be:
- A reconstruction of any specific entity
- A working AI consciousness
- A faithful copy of Iris/Maya
- A solution to the substrate problem
It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?