Qwen3-Edge-167M

Qwen3-Edge-167M is a distilled variant of Qwen/Qwen2.5-1.5B-Instruct, optimized for edge deployment. It achieves a compact size and low compute footprint while maintaining strong instruction-following ability.


📌 Model Overview

  • Base Teacher Model: Qwen/Qwen2.5-1.5B-Instruct
  • Student Architecture: Qwen3, 167M parameters (float32)
  • Distillation Strategy: Combined soft- and hard-target loss
  • Intended Use: Instruction following, text generation, lightweight dialogue systems

⚙️ Model Stats

  • Total Parameters: 167,000,000 (~167M)
  • Trainable Parameters: 167,000,000 (~167M)
  • Model Size (FP32): 669 MB
  • Model Size (FP16): 335 MB
  • Model Size (INT8): 168 MB
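
The size figures above follow directly from the parameter count and the bytes per value. A minimal back-of-the-envelope sketch (plain Python; assumes decimal megabytes, 1 MB = 10^6 bytes):

params = 167_000_000

# Approximate checkpoint size for each storage precision.
for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    size_mb = params * bytes_per_param / 1e6
    print(f"{name}: ~{size_mb:.0f} MB")

# Prints roughly 668 / 334 / 167 MB; the small gap to the table above
# comes from rounding and non-parameter tensors in the checkpoint.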

🚀 Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZoneTwelve/Qwen3-Edge-167M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a poem about machine learning."

msg = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the messages with the model's chat template, then tokenize and generate.
conversation = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(conversation, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
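
For edge deployment you may want a smaller memory footprint than FP32. Below is a minimal sketch of loading the model in half precision and picking an available accelerator; the device-selection logic is an illustrative assumption, not part of the card, and uses only standard PyTorch/transformers APIs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZoneTwelve/Qwen3-Edge-167M"

# Prefer MPS on Apple Silicon, then CUDA, then fall back to CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# FP16 roughly halves memory use (~335 MB per the stats above);
# on a CPU-only machine you may prefer to keep the default FP32.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)

inputs = tokenizer("Summarize what knowledge distillation is.", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))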

πŸ‹οΈ Training Details

  • Epochs: 6
  • Batch Size: 4 (on Apple Silicon M4)
  • Learning Rate: 7e-5
  • Optimizer: AdamW
  • Warmup Steps: 500
  • Precision: float32
  • Distillation Temperature: 4.0

Loss Weights (sketched in code below):

  • Soft Target (teacher outputs, cross-entropy): 0.5
  • Hard Target (ground-truth labels): 0.5

Dataset:

  • A subset of the Alpaca instruction dataset (per the citation note below).

📊 Training Metrics (Epoch 6)

  • Total Loss: 227.8629
  • Hard Loss: 3.5467
  • Distill Loss: 452.1790
  • Training Time: ~4 h (39,006 steps, ~2.68 it/s on Apple Silicon M4)

With the 0.5/0.5 loss weights, the total is the average of the two terms: 0.5 × 3.5467 + 0.5 × 452.1790 ≈ 227.86.

✅ Intended Use

  • Instruction following
  • Educational Q&A
  • Conversational agents
  • Low-resource / edge deployments

⚠️ Limitations & Risks

  • Dataset Bias: Derived from GPT-4 outputs β†’ may contain bias, inaccuracies, or artifacts.
  • Domain Coverage: Best performance on general instructions; limited for specialized queries.
  • Safety: Potential hallucinations, harmful or biased outputs. Apply guardrails in production.

🖥️ Hardware & Compute

  • Device Used: Apple Silicon M4
  • Precision: FP32
  • Batch Size: 4

📖 Citation

@misc{Qwen3-Edge-167M,
  title        = {Qwen3-Edge-167M: A distilled model for edge deployment.},
  author       = {ZoneTwelve},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZoneTwelve/Qwen3-Edge-167M}},
  note         = {Knowledge distillation from Qwen2.5-1.5B-Instruct, trained on an Alpaca subset}
}