Open-Source Training/Implementation of Loop Attention for Qwen3-0.6B
Hello world! I'm poodle, and I wanted to share an open-source methodology for how I implemented Loop Attention in Qwen3-0.6B. I didn't want to just hand you the weights, so I also included the training script written for Qwen's architecture.
I hope you enjoy!
This is a custom implementation of Loop Attention applied to the Qwen3-0.6B architecture. It performs two forward passes through the attention mechanism, with a novel gating mechanism that dynamically mixes global context (from the first pass) with local windowed attention (in the second pass), aiming to improve generation coherence and context usage.
Repository: coolpoodle/Qwen3-0.6B-Looped
Base Model: Qwen/Qwen3-0.6B
Model Details
- Architecture: Qwen3 with Loop Attention Wrapper
- Run: Denotes the training run and what we specifically tried to do differently.
- Parameter Count: ~0.6B (Base) + ~58k (Gates)
- Trained on: WikiText-2
Run 1 (Notes)
For Run 1, I started with the following parameters:
- Context Length: Trained with a 512-token context (a data-packing sketch follows below).
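As a minimal sketch of packing WikiText-2 into fixed-length blocks for this kind of run (standard Hugging Face datasets preprocessing; the repo's actual training script may differ):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Sketch only: pack WikiText-2 into fixed 512-token blocks (1024 for Run 2).
block_size = 512
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

def group(batch):
    # Concatenate all token ids, then slice into contiguous fixed-size chunks.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    chunks = [ids[i : i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

lm_dataset = tokenized.map(group, batched=True, remove_columns=tokenized.column_names)
print(lm_dataset)
```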
Run 2 Experiments (Notes)
For Run 2, we attempted the following changes:
- Context Length: Retrained with 1024 context (vs 512 in Run 1).
- Layer Norms: Unfroze the layer norms during training, in the hope of more stable features (see the parameter-freezing sketch after this list).
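The full training script ships with the repo; what follows is only a hedged sketch of the Run 2 setup (freeze the base model, train the loop gates and the layer norms). The name "loop_gate" is a placeholder guess, not the actual parameter name in modeling_qwen_loop.py.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical sketch of the Run 2 parameter setup, not the repo's actual script.
# "loop_gate" is a placeholder; matching "norm" is a coarse filter that also catches
# Qwen3's q_norm/k_norm, so adjust to the real module names as needed.
model = AutoModelForCausalLM.from_pretrained(
    "coolpoodle/Qwen3-0.6B-Looped",
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

for name, param in model.named_parameters():
    # Train only the loop gates and (for Run 2) the layer norms; freeze everything else.
    param.requires_grad = ("loop_gate" in name) or ("norm" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # gates (~58k) plus the unfrozen norms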
Results
| Model | Validation Loss | Perplexity (PPL) |
|---|---|---|
| Baseline Qwen3-0.6B | 3.7274 | 41.57 |
| Loop Run1 (Epoch 3) | 3.5549 | 35.01 |
| Loop Run2 (Epoch 1) | 3.6434 | 38.22 |
| Loop Run2 (Epoch 2) | 3.5936 | 36.37 |
| Loop Run2 (Epoch 3) | 3.5642 | 35.31 |
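For reference, the perplexity column is consistent with PPL = exp(validation loss). This is an inference from the numbers above, not a statement about the exact evaluation code:

```python
import math

# Perplexity as the exponential of mean validation cross-entropy.
for name, loss in [("Baseline Qwen3-0.6B", 3.7274), ("Loop Run1 (Epoch 3)", 3.5549)]:
    print(f"{name}: PPL = {math.exp(loss):.2f}")  # ~41.57 and ~35.0, matching the table
```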
Easy Inference
You can load this model directly using transformers.
Note: trust_remote_code=True is required because this model uses a custom architecture (Qwen3LoopForCausalLM).
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coolpoodle/Qwen3-0.6B-Looped"

print("Loading model...")
# trust_remote_code=True is essential for the custom architecture
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prompt
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
# use_cache=False is RECOMMENDED for Loop Attention to fully activate its mixing logic during generation
print("Generating...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        use_cache=False,
    )

print("-" * 20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print("-" * 20)
```
How it Works
The model performs two passes for each forward step (during training or non-cached generation):
- Global Pass: Standard full attention.
- Local/Mix Pass: A gated combination of the cached global context and a local sliding window attention.
The gate is initialized to prioritize global attention (bias +5.0) to prevent initialization shock, and it gradually learns to utilize local context.
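As a rough sketch of this gating (not the exact code in modeling_qwen_loop.py; the per-head gate shape and tensor layout are assumptions):

```python
import torch
import torch.nn as nn

class LoopGate(nn.Module):
    """Hypothetical sketch of the Loop Attention gate: a learned per-head scalar
    that mixes the first-pass (global) output with the second-pass (local) output."""

    def __init__(self, num_heads: int, init_bias: float = 5.0):
        super().__init__()
        # Bias of +5.0 => sigmoid ~ 0.993, so the model starts almost fully on the
        # global pass and only gradually learns to lean on local attention.
        self.gate = nn.Parameter(torch.full((num_heads, 1, 1), init_bias))

    def forward(self, global_out: torch.Tensor, local_out: torch.Tensor) -> torch.Tensor:
        # global_out, local_out: (batch, heads, seq, head_dim)
        g = torch.sigmoid(self.gate)  # (heads, 1, 1), broadcast over batch and seq
        return g * global_out + (1 - g) * local_out
```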
Files
- Qwen3-0.6B-Looped-Run2-Final.bin: The main model weights.
- modeling_qwen_loop.py: The custom model code.
- pytorch_model.bin.index.json: Maps the custom weight file for seamless loading.
Todo:
- Upload HumanEval benchmarks to see whether Loop Attention provides gains that transfer to the rest of the model's capabilities.
- Keep working on the math to see if I can improve training.
- Sleep?
Citation
```bibtex
@misc{qwen3-looped,
  author       = {coolpoodle},
  title        = {Qwen3-0.6B-Looped},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/coolpoodle/Qwen3-0.6B-Looped}}
}
```