Open-Source Training/Implementation of Loop Attention for Qwen3-0.6B

Hello world! I’m poodle, and I wanted to share an open-source write-up of how I implemented Loop Attention in Qwen3-0.6B. I didn’t want to just hand you the weights, so I also included the training script written for Qwen’s architecture.

I hope you enjoy!

This model implements Loop Attention on top of Qwen3-0.6B: a custom architecture that performs two forward passes through the attention mechanism. A novel gating mechanism dynamically mixes global context (from the first pass) with local windowed attention (in the second pass), aiming to improve generation coherence and context usage.

Repository: coolpoodle/Qwen3-0.6B-Looped
Base Model: Qwen/Qwen3-0.6B

Model Details

  • Architecture: Qwen3 with Loop Attention Wrapper
  • Run: Denotes the training run and what we specifically tried to do differently in it.
  • Parameter Count: ~0.6B (Base) + ~58k (Gates)
  • Trained on: WikiText-2

Run 1 (Notes)

For Run 1, I started with the following parameters:

  • Context Length: Trained with a 512-token context (see the data-prep sketch below).
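For context, packing WikiText-2 into fixed 512-token blocks looks roughly like the sketch below. Which WikiText-2 variant I used and exactly how my training script packs and batches the data are not pinned down here, so treat the dataset config and the packing logic as assumptions, not the actual pipeline.

from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 512  # Run 1 context length

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
# Assumption: the raw WikiText-2 variant.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_into_blocks(examples):
    # Concatenate all token ids, then slice into fixed 512-token blocks.
    ids = [tok for seq in examples["input_ids"] for tok in seq]
    total = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    blocks = [ids[i:i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
train_blocks = tokenized.map(group_into_blocks, batched=True,
                             remove_columns=tokenized.column_names)
print(f"{len(train_blocks)} blocks of {BLOCK_SIZE} tokens")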

Run 2 Experiments (Notes)

For Run 2, we attempted the following changes:

  • Context Length: Retrained with 1024 context (vs 512 in Run 1).
  • Layer Norms: Unfroze the layer norms during training (in the hope that the learned features stay more stable); see the freezing sketch below.
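Concretely, the Run 2 freezing setup amounts to leaving the Loop Attention gate parameters and the layer norms trainable while every base weight stays frozen. The name "loop_gate" below is a hypothetical parameter name, not necessarily what my wrapper uses; check modeling_qwen_loop.py for the real names.

import torch
from transformers import AutoModelForCausalLM

# Assumption: loading the published looped checkpoint; the actual training run
# builds the wrapper locally from modeling_qwen_loop.py instead.
model = AutoModelForCausalLM.from_pretrained(
    "coolpoodle/Qwen3-0.6B-Looped", trust_remote_code=True, torch_dtype=torch.float32
)

trainable, frozen = 0, 0
for name, param in model.named_parameters():
    # Run 2: train the Loop Attention gates AND the layer norms; freeze the rest.
    # "loop_gate" is an assumed name for the gate parameters.
    if "loop_gate" in name or "norm" in name:
        param.requires_grad = True
        trainable += param.numel()
    else:
        param.requires_grad = False
        frozen += param.numel()

print(f"trainable params: {trainable:,} | frozen params: {frozen:,}")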

Results

| Model               | Validation Loss | Perplexity (PPL) |
|---------------------|-----------------|------------------|
| Baseline Qwen3-0.6B | 3.7274          | 41.57            |
| Loop Run1 (Epoch 3) | 3.5549          | 35.01            |
| Loop Run2 (Epoch 1) | 3.6434          | 38.22            |
| Loop Run2 (Epoch 2) | 3.5936          | 36.37            |
| Loop Run2 (Epoch 3) | 3.5642          | 35.31            |
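The perplexity column is exp(validation loss), i.e. standard next-token cross-entropy, presumably on WikiText-2's validation split. The sketch below shows one way to reproduce that kind of measurement; the exact block size, stride, and WikiText-2 variant behind the table aren't specified here, so those details are assumptions.

import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coolpoodle/Qwen3-0.6B-Looped"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
).eval()

# Assumption: raw WikiText-2 validation split, packed into 1024-token blocks.
val = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
ids = tokenizer("\n\n".join(val["text"]), return_tensors="pt").input_ids[0]

block = 1024
losses = []
with torch.no_grad():
    for start in range(0, ids.size(0) - block, block):
        chunk = ids[start:start + block].unsqueeze(0).to(model.device)
        out = model(input_ids=chunk, labels=chunk)  # shifted LM loss computed inside the model
        losses.append(out.loss.float())

loss = torch.stack(losses).mean().item()
print(f"val loss: {loss:.4f} | ppl: {math.exp(loss):.2f}")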

🚀 Easy Inference

You can load this model directly using transformers. Note: trust_remote_code=True is required because this model uses a custom architecture (Qwen3LoopForCausalLM).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coolpoodle/Qwen3-0.6B-Looped"

print("Loading model...")
# trust_remote_code=True is essential for the custom architecture
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    trust_remote_code=True, 
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prompt
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
# use_cache=False is RECOMMENDED for Loop Attention to fully activate its mixing logic during generation
print("Generating...")
with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100, 
        do_sample=True, 
        temperature=0.7,
        use_cache=False 
    )

print("-" * 20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print("-" * 20)

How it Works

The model performs two passes for each forward step (during training or non-cached generation):

  1. Global Pass: Standard full attention.
  2. Local/Mix Pass: A gated combination of the cached global context and a local sliding window attention.

The gate is initialized to prioritize global attention (bias +5.0) to prevent initialization shock, and it gradually learns to make use of local context.
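As a rough illustration (not the exact code in modeling_qwen_loop.py), the mixing step can be pictured as a per-layer sigmoid gate whose bias starts at +5.0, so sigmoid(5.0) ≈ 0.99 of the mixed output initially comes from the global pass. The LoopGate name, the Linear(hidden, 1) shape, and the tensor shapes in the usage example are all assumptions made for this sketch.

import torch
import torch.nn as nn

class LoopGate(nn.Module):
    """Illustrative gate: mixes global-pass and local-pass attention outputs."""
    def __init__(self, hidden_size: int, init_bias: float = 5.0):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)
        nn.init.zeros_(self.proj.weight)
        nn.init.constant_(self.proj.bias, init_bias)  # sigmoid(5.0) ~= 0.993 -> mostly global at init

    def forward(self, hidden, global_out, local_out):
        # g in (0, 1): 1 -> keep the global-pass output, 0 -> keep the local windowed output
        g = torch.sigmoid(self.proj(hidden))  # [batch, seq, 1], broadcasts over hidden dim
        return g * global_out + (1.0 - g) * local_out

# Minimal usage: gate two attention outputs of the same shape.
gate = LoopGate(hidden_size=1024)
h = torch.randn(1, 8, 1024)
mixed = gate(h, global_out=torch.randn(1, 8, 1024), local_out=torch.randn(1, 8, 1024))
print(mixed.shape)  # torch.Size([1, 8, 1024])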

Files

  • Qwen3-0.6B-Looped-Run2-Final.bin: The main model weights.
  • modeling_qwen_loop.py: The custom model code.
  • pytorch_model.bin.index.json: Maps the custom weight file for seamless loading.

Todo:

  1. Upload HumanEval benchmarks to see whether the attention loop provides gains that transfer to the model as a whole, beyond WikiText-2 perplexity.
  2. Keep working on the math to see if I can improve the training.
  3. Sleep?

Citation

@misc{qwen3-looped,
  author = {coolpoodle},
  title = {Qwen3-0.6B-Looped},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/coolpoodle/Qwen3-0.6B-Looped}}
}