WiggleGPT

A 124M parameter transformer that challenges a 56-year-old assumption in neural network design.

[Figure: WiggleGPT architecture diagram]

What Makes It Different?

Since Minsky and Papert's Perceptrons (1969), neural networks have relied on monotonic activation functions (Sigmoid, ReLU, GELU) — requiring multiple hidden layers to solve non-linearly separable problems like XOR.

WiggleGPT replaces monotonic activations with learnable oscillating functions, enabling single neurons to create multiple decision boundaries:

f(x) = sin(ωx + φ) · tanh(x) + baseline

Where ω (frequency), φ (phase), and the baseline are learnable per-neuron parameters.
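
In code, this amounts to a small element-wise module with three parameter vectors. The sketch below is a minimal PyTorch illustration of the formula above, not the actual model_bio implementation; the class name, attribute names, and the φ/baseline initialization are assumptions (ω is drawn from the reported init range [0.8, 1.2]).

import torch
import torch.nn as nn

class OscillatingActivation(nn.Module):
    # Illustrative sketch of f(x) = sin(ωx + φ) · tanh(x) + baseline
    def __init__(self, num_neurons):
        super().__init__()
        # Per-neuron learnable frequency, phase, and baseline.
        # ω init matches the reported range; φ/baseline init at zero is a guess.
        self.omega = nn.Parameter(torch.empty(num_neurons).uniform_(0.8, 1.2))
        self.phi = nn.Parameter(torch.zeros(num_neurons))
        self.baseline = nn.Parameter(torch.zeros(num_neurons))

    def forward(self, x):
        # x: (..., num_neurons); parameters broadcast over the leading dims
        return torch.sin(self.omega * x + self.phi) * torch.tanh(x) + self.baseline

The tanh(x) term acts as a bounded envelope while the sine term changes sign repeatedly across the input range, which is what lets a single neuron carve out more than one decision boundary.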

Results

| Model | Parameters | Val Loss | Notes |
|-------|------------|----------|-------|
| WiggleGPT | 124M | 3.1621 | Oscillating activation |
| GPT-2 | 124M | ~3.12 | Standard GELU baseline |

Within about 1.3% of the GPT-2 baseline (3.1621 vs ~3.12 validation loss), showing that oscillating activations are a functional drop-in replacement at this scale.

The Model Actually Learned to Oscillate

| Parameter | Init | After Training | Change |
|-----------|------|----------------|--------|
| ω mean | 1.0 | 1.096 | +9.6% |
| ω std | 0.1 | 0.602 | 6× increase |
| ω range | [0.8, 1.2] | [-0.19, 5.17] | Massive expansion |
  • 95% of neurons retained active oscillation (ω > 0.1)
  • Some neurons learned frequencies up to ω = 5.17 (over five times the initialization mean)
  • Full phase coverage [-π, +π] after training
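
These statistics can be recomputed from the released checkpoint. The sketch below assumes the frequency parameters live under state-dict keys containing 'omega'; check the actual key names in model_bio before relying on that filter.

import torch

state = torch.load('ckpt_pretrain.pt', map_location='cpu')['model']
# Collect every per-neuron frequency parameter (key filter is an assumption)
omegas = torch.cat([p.flatten() for name, p in state.items() if 'omega' in name])

print(f"ω mean:  {omegas.mean().item():.3f}")
print(f"ω std:   {omegas.std().item():.3f}")
print(f"ω range: [{omegas.min().item():.2f}, {omegas.max().item():.2f}]")
print(f"still oscillating (ω > 0.1): {(omegas > 0.1).float().mean().item():.1%}")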

Checkpoints

| File | Description |
|------|-------------|
| ckpt_pretrain.pt | Base model trained on OpenWebText (~600k iterations) |
| ckpt_finetune.pt | Fine-tuned on SmolTalk2 (instruction following) |

Architecture

| Component | Specification |
|-----------|---------------|
| Parameters | 123,697,920 |
| Layers | 12 |
| Attention Heads | 12 |
| Embedding Dimension | 768 |
| Oscillating Neurons | 36,864 (each with learnable ω, φ, baseline) |
| Normalization | RMSNorm |
| Position Encoding | RoPE (Rotary) |
| Attention | Flash Attention (when available) |
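
The oscillating-neuron count in the table follows from a 4× MLP expansion (36,864 / 12 layers = 3,072 = 4 × 768 hidden units per block), with each neuron carrying its own ω, φ, and baseline. A quick arithmetic check:

# Consistency check for the figures in the table above
n_layer, n_embd = 12, 768
mlp_hidden = 4 * n_embd                       # 3,072 hidden units per block
oscillating_neurons = n_layer * mlp_hidden    # 36,864
extra_params = 3 * oscillating_neurons        # ω, φ, baseline per neuron → 110,592
print(oscillating_neurons, extra_params)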

Usage

See the GitHub repository for full training, inference, and chat scripts.

# Quick inference example
import torch
from model_bio import GPT, GPTConfig

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load checkpoint and rebuild the model from its saved config
checkpoint = torch.load('ckpt_pretrain.pt', map_location=device)
config = GPTConfig(**checkpoint['config'])
model = GPT(config)
model.load_state_dict(checkpoint['model'])
model.to(device)
model.eval()

# Generate text (see sample_bio.py for full implementation)
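
Continuing from the snippet above, generation might look roughly like this. It assumes GPT-2 BPE tokenization and a nanoGPT-style generate() method, neither of which is confirmed here; sample_bio.py in the repository is the authoritative reference.

# Hypothetical generation sketch (see sample_bio.py for the real interface)
import tiktoken

enc = tiktoken.get_encoding("gpt2")
prompt_ids = enc.encode("The oscillating neuron")
idx = torch.tensor([prompt_ids], dtype=torch.long, device=device)

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=200)
print(enc.decode(out[0].tolist()))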

Training Details

Pretraining:

  • Dataset: OpenWebText (~9B tokens)
  • Iterations: 600,000
  • Hardware: RTX 3070 (steps 0–354k) → RTX 5060 Ti 16GB (steps 354k–600k)
  • Time: Roughly 20 days total (~15 days on 3070, ~5 days on 5060 Ti)

Fine-tuning:

  • Dataset: SmolTalk2 (406K examples)
  • Oscillation parameters (ω, φ) remained stable — 0.0% of neurons shifted by >0.1

Citation

@software{wigglegpt2025,
  author = {O'Brien, Phillip C.},
  title = {WiggleGPT: Revisiting the Monotonicity Assumption in Neural Networks via Oscillating Activation Functions},
  year = {2025},
  url = {https://github.com/Eden-Eldith/WiggleGPT}
}

Author

Eden (Phillip C. O'Brien)
Independent AI Researcher | ORCID: 0009-0007-3961-1182

Built in a garage lab in Gosport, UK. No academic affiliation, no institutional funding — just curiosity and an RTX 3070.

License

GPL-3.0 — if you build on this, keep it open source.
