Parcae: Scaling Laws For Stable Looped Language Models
Paper • 2604.12946 • Published
An open-source, research-grounded looped transformer for latent reasoning.
Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines only proven techniques from the latest research. Unlike speculative reconstructions, every architectural choice in Ultron is backed by published results with clear attribution.
```
Input tokens (B, T)
        ↓
[Embedding + RoPE]
        ↓
[Prelude]             ← L_p standard transformer blocks, run once
        ↓
[LayerNorm(e)]        ← Prelude normalization (Parcae stability trick)
        ↓
[Recurrent Block ×T]  ← L_r transformer layers, looped T times
   ↑_________↓           h_{t+1} = A·h_t + B·e + R(h_t, e)  [LTI-stable]
                         + depth-wise LoRA + ACT halting
[C · h_T]             ← Output projection
        ↓
[Coda]                ← L_c standard transformer blocks, run once
        ↓
[RMSNorm → LM Head]
        ↓
Output logits (B, T, vocab_size)
```
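The stability condition annotated in the diagram (the recurrence h_{t+1} = A·h_t + B·e + R(h_t, e) with spectral radius ρ(A) < 1) can be illustrated with a toy NumPy sketch. Everything below is hypothetical: random matrices, a `tanh` stand-in for the recurrent block, and small dimensions chosen for speed; it is not Ultron code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Toy LTI dynamics: h_{t+1} = A·h_t + B·e + R(h_t, e).
# Rescale A so its spectral radius sits below 1 (the stability condition).
A = rng.normal(size=(d, d)) / np.sqrt(d)
A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()
B = rng.normal(size=(d, d)) / np.sqrt(d)

def R(h, e):
    # Stand-in for the nonlinear residual (a real model runs transformer layers).
    return 0.1 * np.tanh(h + e)

e = rng.normal(size=d)   # stands in for the normalized prelude output
h = np.zeros(d)
norms = []
for t in range(64):      # loop far past any training depth
    h = A @ h + B @ e + R(h, e)
    norms.append(np.linalg.norm(h))

print(f"spectral radius ρ(A): {np.abs(np.linalg.eigvals(A)).max():.3f}")
print(f"||h|| after 64 loops: {norms[-1]:.3f}")  # bounded, not diverging
```

Because the linear part is a contraction in the long run, the hidden-state norm settles rather than exploding as the loop count grows, which is what makes depth extrapolation at inference time possible.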
```
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
```
```python
import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)
model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)  # deeper reasoning
```
```python
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers
```
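The effective depths quoted in the comments are plain arithmetic: prelude and coda layers run once, while the recurrent layers run once per loop. A quick sanity check in plain Python, with the numbers copied from the variant table:

```python
# Effective depth = prelude + recurrent_layers * loops + coda.
# Numbers are taken from the variant table; the dict is just for illustration.
variants = {
    "ultron_small":  dict(prelude=2, recurrent=4,  coda=2, loops=8),
    "ultron_base":   dict(prelude=3, recurrent=6,  coda=3, loops=12),
    "ultron_medium": dict(prelude=4, recurrent=8,  coda=4, loops=16),
    "ultron_large":  dict(prelude=6, recurrent=12, coda=6, loops=24),
}

for name, v in variants.items():
    depth = v["prelude"] + v["recurrent"] * v["loops"] + v["coda"]
    print(f"{name}: effective depth {depth}")
# → 36, 78, 136, 300, matching the table
```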
| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
| Feature | OpenMythos | Ultron | Rationale |
|---|---|---|---|
| Prelude norm | Missing | ✓ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| C output projection | Missing | ✓ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| Recurrent depth | 1 layer per loop | ✓ Multiple layers per loop | More expressive recurrent block |
| ACT bias init | Default | ✓ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| Grad checkpointing | None | ✓ Built-in | Required for memory-efficient loop unrolling |
| MoE | Always on (64 experts) | ✓ Optional (default OFF) | MoE + looping is unproven |
| Top-p sampling | Missing | ✓ Nucleus sampling support | Better generation quality |
| LoRA init | Random | ✓ Near-zero initialization | Starts as near-identity, prevents early instability |
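The ACT bias initialization in the table is easy to see numerically. In a standalone sketch (not Ultron's actual halting head), a freshly initialized head with near-zero weights and bias −3 emits a per-loop halt probability of sigmoid(−3) ≈ 0.047, so the cumulative halting mass stays below the usual 1 − ε threshold for many loops:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Halting probability of a freshly initialized head: weights ~ 0, bias = -3.
p_halt = sigmoid(-3.0)
print(f"initial per-loop halt prob: {p_halt:.3f}")  # ≈ 0.047

# ACT halts once the cumulative halt probability exceeds 1 - eps.
eps = 0.01
cum, loops = 0.0, 0
while cum < 1.0 - eps:
    cum += p_halt
    loops += 1
print(f"loops before halting: {loops}")  # 21
```

With `max_loop_iters=8`, the cumulative mass cannot reach the threshold before the loop budget runs out, so early training always uses the full depth, which is exactly the intent of the −3 init.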
Every component is grounded in published work:
| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | Parcae (Prairie et al., 2026) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | Parcae, Appendix J | Critical for stability at 1.3B+ scale |
| Depth extrapolation | Loop, Think, & Generalize (2025) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | Relaxed Recursive Transformers (Bae et al., 2024) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | Saunshi et al., 2025 | Formally proven: T loops simulate T steps of CoT |
| ACT halting | Graves, 2016 | Per-position adaptive computation |
| GQA | Ainslie et al., 2023 | Efficient KV cache, proven with looping |
| RMSNorm | Zhang & Sennrich, 2019 | Standard normalization |
| RoPE | Su et al., 2021 | Rotary positional encoding |
| MLA (optional) | DeepSeek-V2, 2024 | 10–20× smaller KV cache |
| MoE (optional) | DeepSeekMoE, 2024 | Fine-grained expert routing |
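Two of the rows above (depth-wise LoRA and near-zero initialization, per Relaxed Recursive Transformers) combine into one mechanism: the recurrent block shares its weights across loops, and each loop depth gets its own low-rank delta that starts at zero, so training begins from plain weight tying. A minimal sketch of the idea; `DepthwiseLoRALinear` is a hypothetical module, not Ultron's implementation:

```python
import torch
import torch.nn as nn

class DepthwiseLoRALinear(nn.Module):
    """Shared linear weight plus one low-rank (A, B) delta per loop depth."""
    def __init__(self, dim, rank, n_depths):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02)
        # Per-depth LoRA factors; B starts at zero so every depth begins
        # as the shared (tied) weight -- the "near-identity" init.
        self.lora_A = nn.Parameter(torch.randn(n_depths, rank, dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(n_depths, dim, rank))

    def forward(self, x, depth):
        delta = self.lora_B[depth] @ self.lora_A[depth]  # (dim, dim), rank-limited
        return x @ (self.weight + delta).T

layer = DepthwiseLoRALinear(dim=32, rank=4, n_depths=8)
x = torch.randn(2, 32)
# At init, every depth computes exactly the shared projection.
out0, out7 = layer(x, depth=0), layer(x, depth=7)
print(torch.allclose(out0, out7))  # True: the deltas start at zero
```

As training proceeds, each depth's delta drifts away from zero, letting different loop iterations specialize while most parameters stay shared.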
Optional features are toggled in the config, e.g. `UltronConfig(..., use_moe=True)` for MoE or `UltronConfig(..., attn_type="mla")` for MLA.

The recommended training recipe is based on published scaling laws:
| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |
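The μ_bwd row refers to backprop truncation: the recurrent block is unrolled μ_rec times in the forward pass, but gradients flow only through the last μ_bwd iterations. A minimal PyTorch sketch of the idea, with a hypothetical `recurrent_step` standing in for the recurrent block (this is not Ultron's training loop):

```python
import torch

def recurrent_step(h, e, W):
    # Stand-in for the recurrent block; a real model runs transformer layers.
    return torch.tanh(h @ W + e)

d, mu_rec = 8, 8
mu_bwd = -(-mu_rec // 2)           # ceil(mu_rec / 2) = 4
W = (0.1 * torch.randn(d, d)).requires_grad_()
e = torch.randn(2, d)
h = torch.zeros(2, d)

for t in range(mu_rec):
    if t == mu_rec - mu_bwd:
        h = h.detach()             # cut the graph: earlier loops get no gradient
    h = recurrent_step(h, e, W)

h.sum().backward()
print(f"gradient flows through the last {mu_bwd} of {mu_rec} loops")
print(f"W.grad norm: {W.grad.norm():.3f}")
```

Truncating the backward pass this way keeps activation memory for the loop roughly constant in μ_rec while still training the shared recurrent weights.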
MIT License
```bibtex
@software{ultron2026,
  title = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year  = {2026},
  url   = {https://huggingface.co/trojan0x/ultron},
  note  = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}
```