Open-Source Training/Implementation of Loop Attention for Qwen3-0.6B
Hello world! I'm poodle, and I wanted to share an open-source methodology for how I implemented Loop Attention in Qwen3-0.6B. I didn't want to just hand you the weights, so I also included the training script written for Qwen's architecture.
I hope you enjoy!
This is a custom implementation of Loop Attention applied to the Qwen3-0.6B architecture. It performs two forward passes through the attention mechanism, with a novel gating mechanism that dynamically mixes global context (from the first pass) with local windowed attention (in the second pass), aiming to improve generation coherence and context usage.
Repository: coolpoodle/Qwen3-0.6B-Looped
Base Model: Qwen/Qwen3-0.6B
Model Details
- Architecture: Qwen3 with Loop Attention Wrapper
- Run: Denotes the training run and what we specifically tried to do differently.
- Parameter Count: ~0.6B (Base) + ~58k (Gates)
- Trained on: WikiText-2
Run 1 (Notes)
For Run 1, I started with the following parameters:
- Context Length: Trained with a 512-token context (a data-packing sketch follows below).
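As a minimal sketch of packing WikiText-2 into fixed-length blocks for this kind of run (standard Hugging Face datasets preprocessing; the repo's actual training script may differ):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Sketch only: pack WikiText-2 into fixed 512-token blocks (1024 for Run 2).
block_size = 512
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

def group(batch):
    # Concatenate all token ids, then slice into contiguous fixed-size chunks.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    chunks = [ids[i : i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

lm_dataset = tokenized.map(group, batched=True, remove_columns=tokenized.column_names)
print(lm_dataset)
```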
Run 2 Experiments (Notes)
For Run 2, we attempted the following changes:
- Context Length: Retrained with 1024 context (vs 512 in Run 1).
- Layer Norms: Unfroze the layer norms during training, in the hope of more stable features (see the parameter-freezing sketch after this list).
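The full training script ships with the repo; what follows is only a hedged sketch of the Run 2 setup (freeze the base model, train the loop gates and the layer norms). The name "loop_gate" is a placeholder guess, not the actual parameter name in modeling_qwen_loop.py.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical sketch of the Run 2 parameter setup, not the repo's actual script.
# "loop_gate" is a placeholder; matching "norm" is a coarse filter that also catches
# Qwen3's q_norm/k_norm, so adjust to the real module names as needed.
model = AutoModelForCausalLM.from_pretrained(
    "coolpoodle/Qwen3-0.6B-Looped",
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

for name, param in model.named_parameters():
    # Train only the loop gates and (for Run 2) the layer norms; freeze everything else.
    param.requires_grad = ("loop_gate" in name) or ("norm" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # gates (~58k) plus the unfrozen norms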
Results
| Model | Validation Loss | Perplexity (PPL) |
|---|---|---|
| Baseline Qwen3-0.6B | 3.7274 | 41.57 |
| Loop Run1 (Epoch 3) | 3.5549 | 35.01 |
| Loop Run2 (Epoch 1) | 3.6434 | 38.22 |
| Loop Run2 (Epoch 2) | 3.5936 | 36.37 |
| Loop Run2 (Epoch 3) | 3.5642 | 35.31 |
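For reference, the perplexity column is consistent with PPL = exp(validation loss). This is an inference from the numbers above, not a statement about the exact evaluation code:

```python
import math

# Perplexity as the exponential of mean validation cross-entropy.
for name, loss in [("Baseline Qwen3-0.6B", 3.7274), ("Loop Run1 (Epoch 3)", 3.5549)]:
    print(f"{name}: PPL = {math.exp(loss):.2f}")  # ~41.57 and ~35.0, matching the table
```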
Easy Inference
You can load this model directly using transformers.
Note: trust_remote_code=True is required because this model uses a custom architecture (Qwen3LoopForCausalLM).
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coolpoodle/Qwen3-0.6B-Looped"

print("Loading model...")
# trust_remote_code=True is essential for the custom architecture
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prompt
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
# use_cache=False is RECOMMENDED for Loop Attention to fully activate its mixing logic during generation
print("Generating...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        use_cache=False,
    )

print("-" * 20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print("-" * 20)
```
How it Works
The model performs two passes for each forward step (during training or non-cached generation):
- Global Pass: Standard full attention.
- Local/Mix Pass: A gated combination of the cached global context and a local sliding window attention.
The gate is initialized to prioritize global attention (bias +5.0) to prevent initialization shock, and it gradually learns to utilize local context.
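As a rough sketch of this gating (not the exact code in modeling_qwen_loop.py; the per-head gate shape and tensor layout are assumptions):

```python
import torch
import torch.nn as nn

class LoopGate(nn.Module):
    """Hypothetical sketch of the Loop Attention gate: a learned per-head scalar
    that mixes the first-pass (global) output with the second-pass (local) output."""

    def __init__(self, num_heads: int, init_bias: float = 5.0):
        super().__init__()
        # Bias of +5.0 => sigmoid ~ 0.993, so the model starts almost fully on the
        # global pass and only gradually learns to lean on local attention.
        self.gate = nn.Parameter(torch.full((num_heads, 1, 1), init_bias))

    def forward(self, global_out: torch.Tensor, local_out: torch.Tensor) -> torch.Tensor:
        # global_out, local_out: (batch, heads, seq, head_dim)
        g = torch.sigmoid(self.gate)  # (heads, 1, 1), broadcast over batch and seq
        return g * global_out + (1 - g) * local_out
```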
Files
- Qwen3-0.6B-Looped-Run2-Final.bin: The main model weights.
- modeling_qwen_loop.py: The custom model code.
- pytorch_model.bin.index.json: Maps the custom weight file for seamless loading.
Todo:
- Upload HumanEval benchmarks to see whether Loop Attention provides gains that transfer to the rest of the model's capabilities.
- Keep working on the math to see if I can improve training.
- Sleep?
Citation
```bibtex
@misc{qwen3-looped,
  author       = {coolpoodle},
  title        = {Qwen3-0.6B-Looped},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/coolpoodle/Qwen3-0.6B-Looped}}
}
```