SARFTokenizer v0.2
A bilingual Arabic/English tokenizer with a 65,000-token vocabulary, built on HuggingFace's Unigram LM with an AraToken-style normalization pipeline. Ships as a single tokenizer.json that loads via AutoTokenizer.from_pretrained without trust_remote_code.
The headline: what we actually claim
Strongest Arabic chars-per-token in the sub-100k vocabulary tier, outperforming GPT-4o/GPT-5 (o200k_base, 200k vocab) on Arabic by 13.5% at 32.5% of its vocabulary size. Evaluated out-of-distribution on FineWeb-2 Arabic.
Out-of-distribution benchmark (FineWeb-2 AR + FineWeb-Edu EN, 300 samples each)
Every tokenizer is evaluated on the same text, drawn from HuggingFace's FineWeb corpora and outside any tokenizer's training distribution. This is the fair comparison; all other tokenizers' internal benchmarks benefit from an in-domain advantage.
| Rank | Tokenizer | Vocab | AR CpT | EN CpT | Parity |
|---|---|---|---|---|---|
| 1 | SARFTokenizer v0.2 | 65,000 | 3.597 | 4.514 | 0.797 |
| 2 | Qwen3.6-35B-A3B | 248,077 | 3.230 | 4.596 | 0.703 |
| 3 | tiktoken/o200k_base (GPT-4o, GPT-5) | 200,019 | 3.168 | 4.796 | 0.661 |
| 4 | ALLaM-7B-Instruct-preview | 64,000 | 2.929 | 3.453 | 0.848 |
| 5 | google/gemma-4-31B-it | 262,144 | 2.913 | 4.660 | 0.625 |
| 5 (tie) | google/gemma-3-1b-pt | 262,145 | 2.913 | 4.660 | 0.625 |
| 7 | google/gemma-2-2b | 256,000 | 2.864 | 4.679 | 0.612 |
| 8 | QCRI/Fanar-1-9B-Instruct | 128,256 | 2.862 | 4.564 | 0.627 |
| 9 | Qwen2.5-0.5B | 151,665 | 2.668 | 4.649 | 0.574 |
| 10 | Hala-350M | 64,400 | 2.252 | 4.648 | 0.484 |
| 11 | Kimi-K2.6 | 163,840 | 2.115 | 4.743 | 0.446 |
| 12 | tiktoken/cl100k_base (GPT-4) | 100,277 | 1.462 | 4.741 | 0.308 |
| 13 | Falcon-7B | 65,024 | 1.016 | 4.559 | 0.223 |
What this table shows:
- v0.2 is #1 on Arabic CpT by a wide margin on out-of-distribution data.
- On English, GPT-4o wins at 200k vocab (4.796 vs our 4.514, +6.2%). This is expected: English compression scales with vocab budget more cleanly than Arabic, and GPT-4o has 3× our budget for long English n-grams.
- On parity, ALLaM is the only tokenizer ahead of us (0.848 vs our 0.797); every other entrant is below 0.71 on OOD data.
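The two metric columns can be pinned down precisely: CpT is total characters divided by total tokens over the evaluation corpus, and parity is AR CpT divided by EN CpT. The sketch below is just this arithmetic, checked against the v0.2 row of the table:

```python
def chars_per_token(total_chars: int, total_tokens: int) -> float:
    """CpT: characters consumed per emitted token (higher = better compression)."""
    return total_chars / total_tokens

def parity(ar_cpt: float, en_cpt: float) -> float:
    """Parity: Arabic CpT relative to English CpT (1.0 = equal compression)."""
    return ar_cpt / en_cpt

# The v0.2 row: AR CpT 3.597, EN CpT 4.514 → parity 0.797.
print(round(parity(3.597, 4.514), 3))  # → 0.797
```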
Honest comparison to flagship tokenizers (OOD FineWeb)
| Competitor | Vocab | Their AR CpT | Our AR CpT | Δ AR | Their EN CpT | Our EN CpT | Δ EN |
|---|---|---|---|---|---|---|---|
| GPT-5 / GPT-4o (o200k_base) | 200,019 | 3.168 | 3.597 | +13.5% | 4.796 | 4.514 | −5.9% |
| Gemma-4-31B-it | 262,144 | 2.913 | 3.597 | +23.5% | 4.660 | 4.514 | −3.1% |
| Qwen3.6-35B-A3B | 248,077 | 3.230 | 3.597 | +11.4% | 4.596 | 4.514 | −1.8% |
| Kimi-K2.6 | 163,840 | 2.115 | 3.597 | +70.1% | 4.743 | 4.514 | −4.8% |
| ALLaM-7B (64k bilingual AR focus) | 64,000 | 2.929 | 3.597 | +22.8% | 3.453 | 4.514 | +30.7% |
Why this matters
- Arabic-first deployments: ~13% better Arabic chars/token vs GPT-4o means ~13% more Arabic context in the same window, or ~13% faster Arabic generation at the same tokens-per-second. At 32.5% of the vocabulary, that is a ~553M-parameter embedding-table saving for a 4096-dim model.
- Against Arabic-specialized peers: the best bilingual Arabic tokenizer in the 65k class (ALLaM-7B) trails by 22.8% on AR CpT and by 30.7% on EN CpT. The EN win over ALLaM actually widens on FineWeb-Edu relative to our internal benchmark.
- Same infrastructure as GPT-4o: AutoTokenizer.from_pretrained, no trust_remote_code, no Python preprocessing shims.
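The embedding-table figure in the first bullet is easy to verify. The sketch below assumes a single (tied) embedding matrix at a hidden size of 4096; untied input/output embeddings would double the saving:

```python
# Vocabulary rows saved by a 65k tokenizer vs o200k_base's 200,019 entries,
# each row being one 4096-dim embedding vector.
O200K_VOCAB = 200_019
SARF_VOCAB = 65_000
HIDDEN_DIM = 4_096

saved_params = (O200K_VOCAB - SARF_VOCAB) * HIDDEN_DIM
print(f"{saved_params:,}")  # → 553,037,824
```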
Caveats we want you to know
- Our in-domain benchmark (on deeplatent-hq-bilingual) inflated our numbers slightly. On that corpus we reported AR 3.683, EN 3.522; on OOD FineWeb, AR was 3.597 (−2.3%) and EN was 4.514 (+28%, because FineWeb-Edu is structurally easier). The original "beats GPT-4o on both languages" headline was an artifact of the evaluation domain; the "beats GPT-4o on Arabic" claim holds up OOD.
- Our tokenizer applies aggressive Arabic normalization (tashkeel strip, Alef unification, Alef Maksura mapping, Indic-digit conversion). For text with diacritics this is information loss that competitors do not incur, giving us an unfair compression advantage on unnormalized Arabic. On pre-normalized Arabic (the common case in web text) the effect is small (<3%).
- Vocab-size comparison is not perfectly apples-to-apples. GPT-4o's 200k covers 100+ languages + emoji + code; our 65k covers only AR + EN. A fairer framing is that we deliver equivalent effective AR+EN vocab in 1/3 the footprint.
Full honest writeup: FAIR_BENCHMARK.md.
Overview
| Property | Value |
|---|---|
| Model | Unigram LM (HuggingFace tokenizers.models.Unigram) |
| Vocabulary size | 65,000 |
| Pre-tokenizer | Metaspace (▁ marker, SentencePiece-style) |
| Normalizer | AraToken-style: NFKC → Alef + Ya unification → tashkeel/tatweel/zero-width strip → Indic digits → ASCII |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>` |
| Languages | Arabic + English (50/50 balanced during training) |
| Training data | 500M chars from bilingual HQ corpus |
| Public API | AutoTokenizer.from_pretrained without trust_remote_code |
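The normalizer ships inside tokenizer.json, but an equivalent pipeline can be sketched with the tokenizers normalizer API. This is an illustrative partial reconstruction, not the shipped component: it covers NFKC, Alef unification, Alef Maksura, tatweel, and tashkeel, and omits the digit and zero-width steps:

```python
from tokenizers import Regex
from tokenizers.normalizers import NFKC, Replace, Sequence

norm = Sequence([
    NFKC(),
    Replace("أ", "ا"), Replace("إ", "ا"), Replace("آ", "ا"), Replace("ٱ", "ا"),
    Replace("ى", "ي"),                             # Alef Maksura → Ya
    Replace("ـ", ""),                              # tatweel
    Replace(Regex("[\u064B-\u0652\u0670]"), ""),   # tashkeel + dagger Alef
])

print(norm.normalize_str("أُمَّة"))  # → "امة"
```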
Quick start
Colab (private repo, with token)
from huggingface_hub import login
from transformers import AutoTokenizer
from google.colab import userdata
hf_token = userdata.get("HF-Token")
login(token=hf_token, add_to_git_credential=False) # sets global auth
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print("vocab:", tok.vocab_size)
print(tok.decode(tok.encode("المعلم يشرح الدرس."), skip_special_tokens=True))
The login() call is necessary in transformers>=5.6: passing token= directly to AutoTokenizer.from_pretrained doesn't propagate into the internal AutoConfig fetch on tokenizer-only repos.
Local environment (no token)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print("vocab:", tok.vocab_size)
Robust: download-and-load
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
local = snapshot_download("almaghrabima/SARFTokenizer")
tok = AutoTokenizer.from_pretrained(local)
Low-level tokenizers API
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("almaghrabima/SARFTokenizer")
out = tok.encode(
"The student reads a book. الطالب يقرأ كتابا.",
add_special_tokens=False,
)
print(out.tokens)
# → ['▁The', '▁student', '▁reads', '▁a', '▁book.', '▁الطالب', '▁يقرا', '▁كتابا', '.']
Batch encoding for training
batch = tok(
["السلام عليكم", "Hello world"],
padding=True, truncation=True, max_length=128,
return_tensors="pt",
)
Reproduce the benchmark (fair OOD version)
The exact 600-document OOD eval corpus is published at almaghrabima/SARFTokenizer-benchmark-eval.
from datasets import load_dataset
from transformers import AutoTokenizer
from huggingface_hub import login
login(token="your_hf_token", add_to_git_credential=False)
# The original eval uses the internal corpus; you can also run this
# on FineWeb for out-of-distribution numbers.
ds = load_dataset("almaghrabima/SARFTokenizer-benchmark-eval", split="test")
ar_texts = [r["text"] for r in ds if r["language"] == "ar"]
en_texts = [r["text"] for r in ds if r["language"] == "en"]
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
ar_c = sum(len(t) for t in ar_texts); en_c = sum(len(t) for t in en_texts)
ar_t = sum(len(tok.encode(t, add_special_tokens=False)) for t in ar_texts)
en_t = sum(len(tok.encode(t, add_special_tokens=False)) for t in en_texts)
print(f"AR CpT={ar_c/ar_t:.3f} EN CpT={en_c/en_t:.3f} Parity={(ar_c/ar_t)/(en_c/en_t):.3f}")
For the OOD FineWeb fair benchmark, use scripts/fair_bench.py in the GitHub repo.
Head-to-head on your own text
from transformers import AutoTokenizer
sarf = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
gemma = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")
kimi = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.6", trust_remote_code=True)
text = "your text here"
for name, t in [("SARF v0.2", sarf), ("Gemma-4", gemma),
("Qwen3.6", qwen), ("Kimi-K2.6", kimi)]:
ids = t.encode(text, add_special_tokens=False)
print(f"{name:<14} vocab={len(t):>7,} tokens={len(ids):>5} CpT={len(text)/len(ids):.3f}")
Normalization applied at encode time
All Arabic text is normalized at encode time by the embedded normalizer:
- NFKC compatibility normalization
- Tashkeel (U+064B–U+0652, U+0670) removed
- Tatweel (U+0640) removed
- Zero-width and BiDi control characters removed
- Alef variants (أ, إ, آ, ٱ) → bare Alef (ا)
- Alef Maksura (ى) → Ya (ي)
- Arabic-Indic digits (٠–٩) → ASCII 0–9
Encoding is lossy on diacritics and Alef-Hamza variants, by design. If your downstream task requires preserving these (classical poetry with full diacritics, Qur'anic text), v0.2 is not suitable.
Inspect it:
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.normalizer.normalize_str("الْمُعَلِّمُ إلى آل أُمَّة ١٢٣"))
# → "المعلم الي ال امة 123"
Special tokens
| Token | Purpose |
|---|---|
| `<s>` | beginning-of-sequence |
| `</s>` | end-of-sequence |
| `<unk>` | unknown (byte fallback) |
| `<pad>` | padding |
Why Unigram?
Recent literature (AraToken, arXiv:2512.18399, Dec 2025; "Rethinking Tokenization for Rich Morphology", arXiv:2508.08424, Aug 2025) finds Unigram LM outperforms BPE on morphologically rich languages because its EM-based pruning recovers morphology implicitly; no explicit morpheme preprocessing is required.
We confirmed this on the 65k bilingual AR/EN regime: Unigram lifts AR CpT by 10.3% and EN CpT by 19.9% over a well-tuned parity-aware BPE pipeline (v0.1) at the same vocab budget, on OOD FineWeb data.
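To experiment with the same model family, a toy Unigram tokenizer with a matching Metaspace/special-token setup can be trained in a few lines. Everything here (corpus, vocab size, NFKC-only normalizer) is a placeholder for illustration, not the v0.2 recipe:

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

tok = Tokenizer(models.Unigram())
tok.normalizer = normalizers.NFKC()             # stand-in for the full AraToken pipeline
tok.pre_tokenizer = pre_tokenizers.Metaspace()  # SentencePiece-style ▁ marker
tok.decoder = decoders.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=100,                             # toy budget; v0.2 uses 65,000
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
    unk_token="<unk>",
)
corpus = ["The student reads a book.", "Hello world."] * 200
tok.train_from_iterator(corpus, trainer=trainer)

out = tok.encode("The student reads", add_special_tokens=False)
print(out.tokens)  # EM-pruned subword pieces
```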
Files
- `tokenizer.json`: HuggingFace-format tokenizer (model + normalizer + pre_tokenizer + decoder)
- `tokenizer_config.json`: tokenizer configuration (specifies `PreTrainedTokenizerFast`)
- `special_tokens_map.json`: special tokens mapping
- `BENCHMARK.md`: in-domain benchmark (honest about evaluation bias)
- `FAIR_BENCHMARK.md`: out-of-distribution benchmark on FineWeb
- `benchmark_results.json`: raw per-tokenizer metrics (in-domain)
- `benchmark_results_2026flagships.json`: raw metrics for Gemma-4, Qwen3.6, Kimi-K2.6
- `fair_benchmark_results.json`: raw OOD FineWeb metrics for all 13 tokenizers
Related datasets
- `almaghrabima/SARFTokenizer-benchmark-eval`: 300 AR + 300 EN documents used for the in-domain benchmark (public, 474 KB)
Troubleshooting
OSError: ... is not a local folder and is not a valid model identifier
In transformers >= 5.6, token= doesn't always propagate through the internal AutoConfig lookup on tokenizer-only repos. Fix: login(token=hf_token) first.
Unrecognized model in ... Should have a model_type key in its config.json
Same root cause. Use login() or snapshot_download(), then load from the local path.
401 Unauthorized but whoami worked
Token valid but AutoConfig not forwarding it. Use login() to set auth globally.
Version history
- v0.2: Unigram LM at 65k vocab with AraToken normalizer. Best Arabic chars/token in the sub-100k vocab tier on OOD FineWeb (+13.5% vs GPT-4o, +22.8% vs ALLaM, +70% vs Kimi-K2.6).
- v0.1: MYTE + parity-aware BPE at 62,912 vocab with 14k Arabic morpheme atomic tokens.
License
Apache 2.0.