Quark-50m-Instruct

Quark-50m-Instruct is a small (≈50 M parameter) decoder-only language model fine-tuned for instruction following. It is built on the same architecture as the “SmolLM” family and was pretrained from scratch on 5 billion tokens from HuggingFaceTB/smollm-corpus.

  • Model type: Causal Language Model (LLaMA‑style decoder)
  • Architecture: GQA · SwiGLU · RMSNorm · RoPE · Weight‑tying
  • Pretraining tokens: 5 B
  • Fine‑tuning: Instruction‑tuned (details below)
  • Creators: OvercastLab (research & development lab for ML/AI)
  • Release date: 22 April 2026

Model Summary

Quark-50m-Instruct is designed to be an efficient assistant that can run on consumer GPUs (e.g., RTX 3070 with 8 GB VRAM) and even on CPU for light workloads. It is not competitive with large models on knowledge‑intensive tasks, but it excels at:

  • Simple conversational tasks
  • Code generation and explanation (Python)
  • Short text rewriting and summarisation
  • On‑device / edge inference

The architecture closely follows the efficient‑small‑LM blueprint popularised by SmolLM:

  Component       Details
  Vocab size      49,152
  Hidden size     384
  Layers          24
  Attention       Grouped Query Attention (6 query heads, 2 KV heads)
  FFN             SwiGLU, intermediate size 1,024
  Position        RoPE (θ = 10,000)
  Normalisation   RMSNorm (pre-block)

Total trainable parameters: ≈57 M with weight tying (56.7 M per the safetensors metadata).
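With 6 query heads over a hidden size of 384, the head dimension is 64, and the parameter count can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming the standard LLaMA-style layout (q/k/v/o projections, SwiGLU gate/up/down matrices, two RMSNorms per block, a final RMSNorm, and tied embeddings):

```python
# Back-of-envelope parameter count from the architecture table above.
vocab, hidden, layers = 49_152, 384, 24
q_heads, kv_heads = 6, 2
head_dim = hidden // q_heads               # 64
ffn = 1_024

embed = vocab * hidden                     # input embedding, tied with the LM head
attn = hidden * (q_heads + 2 * kv_heads) * head_dim  # q, k, v projections
attn += (q_heads * head_dim) * hidden                # output projection
ffn_params = 3 * hidden * ffn              # gate, up, down (SwiGLU)
norms = 2 * hidden                         # two RMSNorms per block
per_layer = attn + ffn_params + norms

total = embed + layers * per_layer + hidden  # plus the final RMSNorm
print(f"{total / 1e6:.1f}M parameters")      # 56.6M parameters
```

This lands at ≈56.6 M, which agrees with the 56.7 M safetensors figure up to rounding.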

Uses

Direct Use

The model can be used via the 🤗 Transformers library for standard text generation. It expects chat‑formatted input (see example below).

Downstream Use

Because Quark-50m-Instruct is released under the permissive Apache-2.0 license, you may fine-tune it on your own data for domain-specific tasks such as a customer-support bot, a code reviewer, or a story writer.

Limitations

  • Limited world knowledge (stopped at mid‑2025 pretraining data).
  • Short context window (2,048 tokens).
  • Small size means it can make more factual mistakes than larger models.

Training Details

Pretraining

The base model was pretrained from scratch on a single NVIDIA RTX 3070 (8 GB VRAM). Training took approximately X days (wall clock) and consumed about Y kWh (see Environmental Impact).

Data mix

Quark‑50m was trained on exactly 5 billion tokens sampled from HuggingFaceTB/smollm-corpus with the following proportions:

  Subset              Share   Tokens
  cosmopedia-v2       60%     3.0 B
  fineweb-edu-dedup   40%     2.0 B

All data was tokenised with the official Cosmo2 tokenizer (vocab size 49,152).
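The 60/40 mix above can be reproduced conceptually with plain weighted sampling. A toy sketch (the actual sampling pipeline is not published; only the subset names and shares come from the table, everything else is illustrative):

```python
import random

# Illustrative 60/40 weighted draw over the two corpus subsets.
subsets = ["cosmopedia-v2", "fineweb-edu-dedup"]
weights = [0.6, 0.4]

random.seed(0)                       # fixed seed for reproducibility
draws = random.choices(subsets, weights=weights, k=10_000)
share = draws.count("cosmopedia-v2") / len(draws)
print(round(share, 2))               # close to 0.6
```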

Hyperparameters (pretraining)

  Parameter               Value
  Sequence length         2,048
  Micro-batch size        4
  Gradient accumulation   16
  Effective batch         64 sequences (≈131k tokens)
  Optimizer               AdamW (β₁ = 0.9, β₂ = 0.95)
  Learning rate           3e-4 → 3e-5 (cosine decay)
  Warmup steps            1,000
  Weight decay            0.1
  Gradient clipping       1.0
  Mixed precision         bfloat16
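The learning-rate schedule in the table can be written out directly. A minimal sketch, where total_steps is an assumption (≈38 k steps implied by 5 B tokens at ~131 k tokens per optimizer step):

```python
import math

# Cosine decay with linear warmup: peak 3e-4, floor 3e-5, 1,000 warmup steps.
peak_lr, min_lr, warmup, total_steps = 3e-4, 3e-5, 1_000, 38_000

def lr_at(step: int) -> float:
    if step < warmup:
        return peak_lr * step / warmup           # linear warmup
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(1_000))    # 0.0003 (peak, end of warmup)
print(lr_at(38_000))   # 3e-05 (floor, end of training)
```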

Instruction Fine‑tuning

The base model was fine‑tuned on a curated set of instruction‑following data (details to be released).
The fine‑tuning used LoRA with the same sequence length and a lower learning rate (1e‑4) for a few thousand steps.
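LoRA freezes the pretrained weights and learns a low-rank update ΔW = (α/r)·BA instead of updating the full matrix. A toy numeric sketch with illustrative dimensions (the actual rank, α, and target modules for this model are not published):

```python
import random

# LoRA sketch: for a weight W (d_out x d_in), train only B (d_out x r) and
# A (r x d_in), and use W + (alpha / r) * B @ A at inference time.
d_out, d_in, r, alpha = 8, 8, 2, 16

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

random.seed(0)
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
B = [[0.0] * r for _ in range(d_out)]   # B starts at zero, so training
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]  # begins at the base model

delta = matmul(B, A)
W_adapted = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d_in)]
             for i in range(d_out)]

# Trainable parameters: B and A only, versus the full matrix.
full, lora = d_out * d_in, d_out * r + r * d_in
print(full, lora)   # 64 32 at this toy size; the savings grow with matrix size
```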

How to Use

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OvercastLab/Quark-50m-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": "You are Quark, a helpful assistant."},
    {"role": "user", "content": "Explain group query attention in one sentence."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))