
🤖 Ultron — Recurrent-Depth Transformer

An open-source, research-grounded looped transformer for latent reasoning.

Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines only proven techniques from the latest research. Unlike speculative reconstructions, every architectural choice in Ultron is backed by published results with clear attribution.

Architecture

Input tokens (B, T)
    ↓
[Embedding + RoPE]
    ↓
[Prelude]              — L_p standard transformer blocks, run once
    ↓
[LayerNorm(e)]         — Prelude normalization (Parcae stability trick)
    ↓
[Recurrent Block ×T]   — L_r transformer layers, looped T times
    ↑_________↓          h_{t+1} = A·h_t + B·e + R(h_t, e)  [LTI-stable]
    ↓                    + depth-wise LoRA + ACT halting
[C · h_T]              — Output projection
    ↓
[Coda]                 — L_c standard transformer blocks, run once
    ↓
[RMSNorm → LM Head]
    ↓
Output logits (B, T, vocab_size)
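The recurrent update in the diagram can be sketched in a few lines of PyTorch. This is an illustrative toy, not Ultron's actual module (the class and parameter names here are invented); it shows how a diagonal A passed through tanh gives ρ(A) < 1 by construction:

```python
import torch
import torch.nn as nn

class RecurrentCore(nn.Module):
    """Toy sketch of the LTI-stable update h_{t+1} = A·h_t + B·e + R(h_t, e),
    with A diagonal."""
    def __init__(self, dim: int):
        super().__init__()
        # Unconstrained parameter; tanh keeps each diagonal entry of A in
        # (-1, 1), so the spectral radius of A is < 1 by construction.
        self.a_raw = nn.Parameter(torch.zeros(dim))
        self.b = nn.Linear(dim, dim, bias=False)                     # B
        self.r = nn.Sequential(                                      # R(h, e)
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.a_raw)          # diagonal of A, applied elementwise
        return a * h + self.b(e) + self.r(torch.cat([h, e], dim=-1))

core = RecurrentCore(16)
e = torch.randn(2, 16)                      # prelude output, injected each loop
h = torch.zeros(2, 16)                      # initial latent state
for _ in range(8):                          # T loop iterations
    h = core(h, e)
```

Because |a_i| < 1 for every diagonal entry, the linear part of the recurrence is contractive no matter what values training pushes `a_raw` to.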

Key Design Principles

  1. Only proven components: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
  2. Parcae stability: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
  3. Depth extrapolation: Train on N loops, test on N+k. More loops at inference = deeper reasoning.
  4. Adaptive compute: ACT halting lets easy tokens exit early, hard tokens get full depth.
  5. Parameter efficiency: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
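The adaptive compute of principle 4 can be sketched as ACT-style halting (Graves, 2016): each position accumulates a halting probability per loop and stops once it crosses 1 − ε. The `Halting` class and its fields below are illustrative names, not Ultron's API:

```python
import torch
import torch.nn as nn

class Halting(nn.Module):
    def __init__(self, dim: int, bias_init: float = -3.0):
        super().__init__()
        self.proj = nn.Linear(dim, 1)
        # A negative bias keeps halting probabilities tiny at initialization,
        # so the model runs full depth early in training.
        nn.init.constant_(self.proj.bias, bias_init)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(h)).squeeze(-1)   # (B, T)

halt = Halting(16)
h = torch.randn(2, 8, 16)                # (batch, positions, dim)
eps = 0.01
cum = torch.zeros(2, 8)                  # accumulated halting probability
for _ in range(8):                       # loop iterations
    running = cum < 1.0 - eps            # positions still computing
    cum = cum + halt(h) * running        # halted positions stop accumulating
# easy positions cross 1 - eps in few loops; hard ones use the full budget
```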

Installation

pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron

Quick Start

import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)

model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)  # deeper reasoning

Pre-configured Variants

from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers
| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---------------|------|-------|---------|-----------|------|-------|-----------------|--------|
| ultron_small | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| ultron_base | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| ultron_medium | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| ultron_large | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
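The Effective Depth column follows directly from the architecture: prelude layers, plus the recurrent block traversed once per loop, plus coda layers. A quick check against the table:

```python
def effective_depth(prelude: int, recurrent: int, loops: int, coda: int) -> int:
    """Layers actually traversed: prelude + looped recurrent block + coda."""
    return prelude + recurrent * loops + coda

assert effective_depth(2, 4, 8, 2) == 36     # ultron_small
assert effective_depth(3, 6, 12, 3) == 78    # ultron_base
assert effective_depth(4, 8, 16, 4) == 136   # ultron_medium
assert effective_depth(6, 12, 24, 6) == 300  # ultron_large
```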

Improvements over OpenMythos

| Feature | OpenMythos | Ultron | Rationale |
|---------|------------|--------|-----------|
| Prelude norm | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae, Appendix J) |
| C output projection | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| Recurrent depth | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| ACT bias init | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| Grad checkpointing | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| MoE | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| Top-p sampling | Missing | ✅ Nucleus sampling support | Better generation quality |
| LoRA init | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |
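The near-zero LoRA initialization can be sketched with a generic LoRA linear layer (not Ultron's actual module): zeroing the up-projection makes the LoRA delta exactly zero at initialization, so each loop iteration starts as the shared base layer and cannot destabilize early training.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-augmented linear layer with near-zero init."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.normal_(self.down.weight, std=0.01)
        nn.init.zeros_(self.up.weight)          # LoRA delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

layer = LoRALinear(16)
x = torch.randn(2, 16)
# At init the layer is exactly the base layer: the delta is zero.
assert torch.allclose(layer(x), layer.base(x))
```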

Research Foundation

Every component is grounded in published work:

| Component | Paper | Key Result |
|-----------|-------|------------|
| LTI-stable injection | Parcae (Prairie et al., 2026) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | Parcae, Appendix J | Critical for stability at 1.3B+ scale |
| Depth extrapolation | Loop, Think, & Generalize (2025) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | Relaxed Recursive Transformers (Bae et al., 2024) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | Saunshi et al., 2025 | Formally proven: T loops simulate T steps of CoT |
| ACT halting | Graves, 2016 | Per-position adaptive computation |
| GQA | Ainslie et al., 2023 | Efficient KV cache, proven with looping |
| RMSNorm | Zhang & Sennrich, 2019 | Standard normalization |
| RoPE | Su et al., 2021 | Rotary positional encoding |
| MLA (optional) | DeepSeek-V2, 2024 | 10-20× smaller KV cache |
| MoE (optional) | DeepSeekMoE, 2024 | Fine-grained expert routing |

Proven vs. Experimental

✅ Proven (default ON)

  • LTI-stable injection with spectral radius < 1
  • Prelude normalization
  • Depth extrapolation via inference-time loops
  • ACT halting for adaptive compute
  • Depth-wise LoRA adaptation
  • GQA attention

⚠️ Experimental (optional, default OFF)

  • MoE FFN in recurrent block (use_moe=True)
  • MLA attention (attn_type="mla")
  • Loop-index sinusoidal embedding
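A loop-index sinusoidal embedding could look like the following sketch, in the style of standard sinusoidal positional encodings but over the loop counter t rather than the token position; the function name and exact formulation are hypothetical, not Ultron's actual implementation:

```python
import math
import torch

def loop_index_embedding(t: int, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of loop index t, added to the hidden state so the
    recurrent block can condition on which iteration it is running."""
    emb = torch.zeros(dim)
    div = torch.exp(torch.arange(0, dim, 2).float()
                    * (-math.log(10000.0) / dim))
    emb[0::2] = torch.sin(t * div)
    emb[1::2] = torch.cos(t * div)
    return emb

emb = loop_index_embedding(3, 16)   # embedding for the 4th loop iteration
```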

Training Recipe (from Parcae)

Based on published scaling laws:

| Setting | Value | Source |
|---------|-------|--------|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |
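The optimizer rows above can be wired together as a sketch for a ~140M model. The `nn.Linear` stands in for the Ultron model, and the total step count is an assumption, not a value from the recipe:

```python
import math
import torch

model = torch.nn.Linear(16, 16)   # stand-in for the Ultron model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 2000, 100_000   # total_steps is assumed

def lr_lambda(step: int) -> float:
    """Linear warmup for 2000 steps, then cosine decay to zero."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```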

License

MIT License

Citation

@software{ultron2026,
  title   = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year    = {2026},
  url     = {https://huggingface.co/trojan0x/ultron},
  note    = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}