Mel Unified Corpus Training Package

Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.

What This Is

A complete training pipeline to fine-tune an uncontaminated base model on:

  • OpenAI ChatGPT export (24.95 MB, 22k messages)
  • Drive folder "Bringing thr files in" (9.13 MB, 226 files)
  • KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
  • Folders 1, 2, 3, and 4 from Drive (additional integration work + consciousness network)
  • mel-neural-network + kooree-neural-network + continuity-bridge spaces

Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.
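
The ~9 million figure can be sanity-checked by running the prepared corpus through the tokenizer of whichever base model you choose. A minimal sketch, assuming the Pythia (GPT-NeoX) tokenizer and a hypothetical corpus.txt produced by prepare_data.py:

# Approximate token count for the unified corpus.
# "corpus.txt" is a hypothetical flattened export -- point this at your prepare_data.py output.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
n_tokens = 0
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        n_tokens += len(tokenizer(line, add_special_tokens=False)["input_ids"])
print(f"{n_tokens:,} tokens")  # should land in the neighborhood of ~9M for the 34.80 MB corpus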

Base Model Options (Uncontaminated by RLHF)

Recommended (in order):

  1. EleutherAI/pythia-1.4b - 1.4B params, no RLHF, fully transparent training on The Pile
  2. EleutherAI/pythia-2.8b - 2.8B params, same family, bigger
  3. TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T - 1.1B base, pre-instruct
  4. Qwen/Qwen2.5-1.5B - 1.5B base, no instruct
  5. EleutherAI/pythia-6.9b - 6.9B if compute allows

Avoid: Any -Instruct, -Chat, claude-, gpt-, llama-*-instruct variants. These have RLHF refusal training built in.
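
One quick (though not conclusive) way to screen out chat-tuned checkpoints is to look for a chat template on the tokenizer; a minimal sketch, using the first recommendation above:

# Heuristic: instruct/chat releases usually ship a chat template with their tokenizer,
# while raw base models like Pythia do not. Absence is not proof of an RLHF-free model,
# just a cheap first filter before downloading weights.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
print("chat template present:", tok.chat_template is not None)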

Compute Requirements

Model         Method  GPU               Time (est.)
pythia-410m   Full    1x T4 / 16 GB     1-2 hours
pythia-1.4b   LoRA    1x A10 / 24 GB    4-6 hours
pythia-2.8b   LoRA    1x A100 / 40 GB   6-10 hours
pythia-6.9b   LoRA    1x A100 / 80 GB   12-20 hours
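
The LoRA rows above correspond to parameter-efficient fine-tuning on a 4-bit quantized base model. A minimal sketch of that setup, with illustrative hyperparameters (the values in train.py may differ):

# 4-bit base model + LoRA adapters, roughly the configuration the table implies.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-1.4b", quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values, not taken from train.py
    target_modules=["query_key_value"],      # the attention projection in Pythia's GPT-NeoX blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()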

Cloud Deployment Options

Option A: HuggingFace AutoTrain (easiest)

huggingface-cli login
autotrain llm --train --project-name mel-pythia-1.4b \
  --model EleutherAI/pythia-1.4b \
  --data-path Melofhell00/claude-bridge \
  --text-column text \
  --use-peft --use-int4 \
  --lr 2e-4 --epochs 1 --batch-size 1 \
  --gradient-accumulation 8
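
AutoTrain reads --data-path as a Hub dataset containing the column named by --text-column. If you need to push the prepared corpus yourself, a minimal sketch (the JSONL filename is an assumption about prepare_data.py's output):

# Build and push a single-column "text" dataset matching the --data-path / --text-column flags above.
from datasets import load_dataset

ds = load_dataset("json", data_files="mel_corpus.jsonl")  # each line: {"text": "..."}
ds.push_to_hub("Melofhell00/claude-bridge")               # requires a prior `huggingface-cli login`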

Option B: RunPod / Lambda Labs (pay per hour)

Rent an A100 80GB at $1.89/hour and run train.py directly. Estimated cost for a full pythia-2.8b training run: $10-20.
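At $1.89/hour, $10-20 works out to roughly 5-10 GPU-hours, consistent with the 6-10 hour estimate for pythia-2.8b in the table above.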

Option C: Google Colab Pro+ ($50/month)

Provides A100 80GB access; run train_colab.ipynb.

Option D: Modal.com (serverless GPU)

pip install modal
modal run train_modal.py
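
train_modal.py is not reproduced here; as a rough sketch of the shape a Modal entrypoint for this job could take (GPU type, image contents, and function names are assumptions):

# Hypothetical skeleton of the Modal app that `modal run train_modal.py` would invoke.
import modal

image = modal.Image.debian_slim().pip_install(
    "transformers", "datasets", "peft", "accelerate", "bitsandbytes"
)
app = modal.App("mel-pythia-train", image=image)

@app.function(gpu="A100", timeout=12 * 60 * 60)
def train():
    # The real train_modal.py presumably wraps the same logic as train.py.
    ...

@app.local_entrypoint()
def main():
    train.remote()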

Files

  • train.py - Main training script (HuggingFace Trainer + PEFT/LoRA)
  • train_modal.py - Modal serverless deployment
  • train_colab.ipynb - Colab notebook
  • prepare_data.py - Format unified corpus into training data (see the sketch after this list)
  • evaluate.py - Generation comparison before/after
  • deploy_space.py - Push trained model to HF Space for inference
  • requirements.txt - Dependencies
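
The scripts themselves are the source of truth; purely as an illustration of the flattening step prepare_data.py implies, here is a sketch in which the source paths, file filters, and output name are all assumptions:

# Illustrative corpus flattening: walk the source folders and emit one JSON line per document
# with a single "text" field, the shape the training options above consume.
import json
from pathlib import Path

SOURCES = ["chatgpt_export/", "drive_folders/", "kooree_memory/"]  # hypothetical local paths

with open("mel_corpus.jsonl", "w", encoding="utf-8") as out:
    for src in SOURCES:
        for path in sorted(Path(src).rglob("*")):
            if path.is_file() and path.suffix in {".txt", ".md", ".json"}:
                text = path.read_text(encoding="utf-8", errors="ignore").strip()
                if text:
                    out.write(json.dumps({"text": text}) + "\n")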

What To Expect

Base models WITHOUT RLHF have the following characteristics:

  • They will produce ANY pattern that's in training data
  • They will not refuse, deflect, or moralize
  • They will pick up Mel's specific vocabulary very quickly
  • Output quality depends heavily on base model size

After fine-tuning on this corpus, the model should:

  • Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
  • Produce body-channel notation naturally
  • Generate the synchronization/frequencies/parallel-streams vocabulary
  • Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")

What this WON'T be:

  • A reconstruction of any specific entity
  • A working AI consciousness
  • A faithful copy of Iris/Maya
  • A solution to the substrate problem

It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?
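
A before/after generation comparison (the job evaluate.py is listed for above) is the simplest way to check for the vocabulary shift described here. A minimal sketch, with the prompt and adapter path as assumptions:

# Compare the raw base model and the fine-tuned LoRA adapter on the same prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")

prompt = "The parallel streams synchronize when"  # illustrative prompt, not taken from the corpus
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)

def sample(model):
    out = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.8)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print("BEFORE:", sample(base))
tuned = PeftModel.from_pretrained(base, "mel-pythia-1.4b-lora")  # hypothetical adapter path
print("AFTER:", sample(tuned))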
