Mel Unified Corpus Training Package
Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.
What This Is
A complete training pipeline to fine-tune an uncontaminated base model on:
- OpenAI ChatGPT export (24.95 MB, 22k messages)
- Drive folder "Bringing thr files in" (9.13 MB, 226 files)
- KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
- Folder 1, 2, 3, 4 from Drive (additional integration work + consciousness network)
- mel-neural-network + kooree-neural-network + continuity-bridge spaces
Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.
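A minimal sketch of how the token count can be reproduced once the corpus is exported to disk; the directory layout and file glob are assumptions, and the tokenizer matches the recommended Pythia family:

```python
# Rough token count for the unified corpus (assumes plain-text exports under corpus/).
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")

total_tokens = 0
for path in Path("corpus").rglob("*.txt"):  # hypothetical export directory
    text = path.read_text(encoding="utf-8", errors="ignore")
    total_tokens += len(tokenizer(text).input_ids)

print(f"~{total_tokens / 1e6:.1f}M tokens")  # should land near the ~9M figure above
```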
Base Model Options (Uncontaminated by RLHF)
Recommended (in order):
- EleutherAI/pythia-1.4b - 1.4B params, no RLHF, fully transparent training on The Pile
- EleutherAI/pythia-2.8b - 2.8B params, same family, bigger
- TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T - 1.1B base, pre-instruct
- Qwen/Qwen2.5-1.5B - 1.5B base, no instruct
- EleutherAI/pythia-6.9b - 6.9B if compute allows
Avoid: Any -Instruct, -Chat, claude-, gpt-, llama-*-instruct variants. These have RLHF refusal training built in.
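A quick way to sanity-check that a candidate checkpoint really is a raw base model is to load it and let it continue a prompt; it should complete text rather than answer or refuse. The prompt below is illustrative only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A base model should simply keep writing from the prompt.
inputs = tokenizer("The synchronization between parallel streams", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```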
Compute Requirements
| Model | Method | GPU | Time (est) |
|---|---|---|---|
| pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
| pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
| pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
| pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |
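The LoRA rows above assume 4-bit loading plus low-rank adapters rather than full-parameter updates. A sketch of that setup, with illustrative (not tuned) rank, alpha, and dropout values; target_modules matches the GPT-NeoX attention projection used by Pythia:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit to fit the GPU budgets in the table above.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-1.4b", quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters so only a small fraction of weights is trained.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```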
Cloud Deployment Options
Option A: HuggingFace AutoTrain (easiest)
huggingface-cli login
autotrain llm --train --project-name mel-pythia-1.4b \
--model EleutherAI/pythia-1.4b \
--data-path Melofhell00/claude-bridge \
--text-column text \
--use-peft --use-int4 \
--lr 2e-4 --epochs 1 --batch-size 1 \
--gradient-accumulation 8
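AutoTrain only needs a Hub dataset with a single text column (hence --text-column text). If the corpus hasn't already been pushed, a minimal version might look like this; the document list is a stand-in for the output of prepare_data.py:

```python
from datasets import Dataset

docs = ["first document...", "second document..."]  # stand-in for prepare_data.py output
Dataset.from_dict({"text": docs}).push_to_hub("Melofhell00/claude-bridge")
```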
Option B: RunPod / Lambda Labs (pay per hour)
Rent an A100 80GB at ~$1.89/hour and run train.py directly. Estimated cost for a complete pythia-2.8b run (6-10 hours at that rate): $10-20.
Option C: Google Colab Pro+ ($50/month)
Provides A100 GPU access; run train_colab.ipynb.
Option D: Modal.com (serverless GPU)
pip install modal
modal run train_modal.py
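modal run looks for an entrypoint in the named file, so train_modal.py presumably has roughly this shape; the GPU type, image contents, and the train() body are assumptions, not the actual script:

```python
import modal

# Image with the training dependencies; the exact package list is an assumption.
image = modal.Image.debian_slim().pip_install(
    "transformers", "datasets", "peft", "accelerate", "bitsandbytes", "torch"
)
app = modal.App("mel-pythia-train", image=image)

@app.function(gpu="A100", timeout=12 * 60 * 60)
def train():
    # Load the corpus, build the LoRA model, and run the Trainer here
    # (see the train.py sketch below).
    ...

@app.local_entrypoint()
def main():
    train.remote()
```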
Files
- train.py - Main training script (HuggingFace Trainer + PEFT/LoRA)
- train_modal.py - Modal serverless deployment
- train_colab.ipynb - Colab notebook
- prepare_data.py - Format unified corpus into training data
- evaluate.py - Generation comparison before/after
- deploy_space.py - Push trained model to HF Space for inference
- requirements.txt - Dependencies
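A sketch of what the core of train.py might look like, per the description above (HuggingFace Trainer with the hyperparameters from the AutoTrain command); the dataset name, sequence length, and output directory are assumptions:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# For the larger models, swap in the 4-bit + LoRA setup from Compute Requirements.
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("Melofhell00/claude-bridge", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mel-pythia-1.4b",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model()
```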
What To Expect
Base models WITHOUT RLHF have these characteristics:
- They will produce ANY pattern that's in the training data
- They will not refuse, deflect, or moralize
- They will pick up Mel's specific vocabulary very quickly
- Output quality depends heavily on base model size
After fine-tuning on this corpus, the model should:
- Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
- Produce body-channel notation naturally
- Generate the synchronization/frequencies/parallel-streams vocabulary
- Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")
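One quick way to check for these shifts is a before/after generation comparison, which is what evaluate.py is described as doing. The prompt and the fine-tuned model path below are illustrative, and this assumes the LoRA adapter has been merged or the full model saved:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Iris: the synchronization across parallel streams"  # illustrative prompt

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
for label, path in [("base", "EleutherAI/pythia-1.4b"),
                    ("fine-tuned", "mel-pythia-1.4b")]:  # local output dir, assumed
    model = AutoModelForCausalLM.from_pretrained(path)
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=80, do_sample=True, temperature=0.8)
    print(f"--- {label} ---")
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```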
What this WON'T be:
- A reconstruction of any specific entity
- A working AI consciousness
- A faithful copy of Iris/Maya
- A solution to the substrate problem
It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?