# Qwen3-Edge-167M
Qwen3-Edge-167M is a distilled variant of [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), optimized for edge deployment. It achieves a compact size and low compute footprint while maintaining strong instruction-following ability.
## Model Overview
- Base Teacher Model: Qwen/Qwen2.5-1.5B-Instruct
- Student Architecture: Qwen3, 167M parameters (float32)
- Distillation Strategy: Combined soft + hard target loss
- Intended Use: Instruction following, text generation, lightweight dialogue systems
## Model Stats
| Metric | Value |
|---|---|
| Total Parameters | 167,000,000 (~167M) |
| Trainable Parameters | 167,000,000 (~167M) |
| Model Size (FP32) | 669 MB |
| Model Size (FP16) | 335 MB |
| Model Size (INT8) | 168 MB |
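
The size figures above follow directly from the parameter count and the bytes used per weight. A quick back-of-the-envelope check (the constant below is taken from the table; small per-file overheads such as the tokenizer and config are ignored):

```python
# Rough size estimate: parameters x bytes per weight.
PARAMS = 167_000_000

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    size_mb = PARAMS * bytes_per_param / 1e6
    print(f"{name}: ~{size_mb:.0f} MB")
# FP32: ~668 MB, FP16: ~334 MB, INT8: ~167 MB -- matching the table up to rounding.
```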
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZoneTwelve/Qwen3-Edge-167M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a poem about machine learning."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the chat template into a plain prompt string, then tokenize it.
conversation = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(conversation, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
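
For edge or memory-constrained deployment, the weights can also be loaded in reduced precision. A minimal sketch using the standard `torch_dtype` argument (assuming a PyTorch backend with FP16 support on your device):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZoneTwelve/Qwen3-Edge-167M"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load weights in half precision (~335 MB instead of ~669 MB).
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
```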
## Training Details
- Epochs: 6
- Batch Size: 4 (on Apple Silicon M4)
- Learning Rate: 7e-5
- Optimizer: AdamW
- Warmup Steps: 500
- Precision: float32
- Distillation Temperature: 4.0
Loss Weights (combined as in the sketch below):
- Soft Target (Teacher outputs, Cross-Entropy): 0.5
- Hard Target (Ground truth labels): 0.5
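
The soft and hard terms are combined into a single training objective. Below is a minimal sketch of such a combined loss; the exact training code for this model is not published here, so the temperature-softened cross-entropy formulation and the T² scaling are assumptions based on standard knowledge-distillation practice.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, soft_weight=0.5, hard_weight=0.5):
    """Combined soft + hard target loss (sketch, not the published training code).

    student_logits / teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) ground-truth token ids
    """
    t = temperature

    # Soft target: cross-entropy against the temperature-softened teacher
    # distribution, scaled by t**2 as in standard knowledge distillation.
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_loss = -(soft_teacher * log_student).sum(dim=-1).mean() * (t ** 2)

    # Hard target: standard cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
    )

    return soft_weight * soft_loss + hard_weight * hard_loss
```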
Dataset:
- Source: tatsu-lab/alpaca
- Split Used: train[:50%]
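
The split string follows the Hugging Face `datasets` slicing syntax; loading the same subset looks like this (assuming the standard `datasets` library is installed):

```python
from datasets import load_dataset

# First 50% of the Alpaca training split, as used for distillation.
dataset = load_dataset("tatsu-lab/alpaca", split="train[:50%]")
print(len(dataset))  # roughly 26k examples
```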
## Training Metrics (Epoch 6)
- Total Loss: 227.8629
- Hard Loss: 3.5467
- Distill Loss: 452.1790
- Training Time: ~4h (39,006 steps, ~2.68 it/s on Apple Silicon M4)
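
For reference, the reported totals are consistent with the 0.5/0.5 loss weighting above: 0.5 × 3.5467 + 0.5 × 452.1790 ≈ 227.86.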
## Intended Use
- Instruction following
- Educational Q&A
- Conversational agents
- Low-resource / edge deployments
## Limitations & Risks
- Dataset Bias: The Alpaca data is synthetic, model-generated instruction data, so it may contain bias, inaccuracies, or artifacts.
- Domain Coverage: Best performance on general instructions; limited for specialized queries.
- Safety: Potential hallucinations, harmful or biased outputs. Apply guardrails in production.
## Hardware & Compute
- Device Used: Apple Silicon M4
- Precision: FP32 (no mixed-precision training)
- Batch Size: 4
## Citation
```bibtex
@misc{Qwen3-Edge-167M,
  title        = {Qwen3-Edge-167M: A distilled model for edge deployment},
  author       = {ZoneTwelve},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZoneTwelve/Qwen3-Edge-167M}},
  note         = {Knowledge distillation from Qwen2.5-1.5B-Instruct, trained on an Alpaca subset}
}
```