SARFTokenizer v0.2

A bilingual Arabic/English tokenizer with a 65,000-token vocabulary, built on HuggingFace's Unigram LM with an AraToken-style normalization pipeline. Ships as a single tokenizer.json that loads via AutoTokenizer.from_pretrained without trust_remote_code.

The headline: what we actually claim

Strongest Arabic chars-per-token (CpT) in the sub-100k vocabulary tier: outperforms GPT-4o/GPT-5 (o200k_base, 200k vocab) on Arabic by 13.5% at 32.5% of its vocabulary size, measured out-of-distribution on FineWeb-2 Arabic.

Out-of-distribution benchmark (FineWeb-2 AR + FineWeb-Edu EN, 300 samples each)

Every tokenizer is evaluated on the same text, drawn from HuggingFace's FineWeb corpora and outside any tokenizer's training distribution. This is the fair comparison: the other tokenizers' internal benchmarks all carry an in-domain advantage.

Rank Tokenizer Vocab AR CpT EN CpT Parity
1 SARFTokenizer v0.2 65,000 3.597 4.514 0.797
2 Qwen3.6-35B-A3B 248,077 3.230 4.596 0.703
3 tiktoken/o200k_base (GPT-4o, GPT-5) 200,019 3.168 4.796 0.661
4 ALLaM-7B-Instruct-preview 64,000 2.929 3.453 0.848
5 (tie) google/gemma-4-31B-it 262,144 2.913 4.660 0.625
5 (tie) google/gemma-3-1b-pt 262,145 2.913 4.660 0.625
7 google/gemma-2-2b 256,000 2.864 4.679 0.612
8 QCRI/Fanar-1-9B-Instruct 128,256 2.862 4.564 0.627
9 Qwen2.5-0.5B 151,665 2.668 4.649 0.574
10 Hala-350M 64,400 2.252 4.648 0.484
11 Kimi-K2.6 163,840 2.115 4.743 0.446
12 tiktoken/cl100k_base (GPT-4) 100,277 1.462 4.741 0.308
13 Falcon-7B 65,024 1.016 4.559 0.223

What this table shows:

  1. v0.2 is #1 on Arabic CpT by a wide margin on out-of-distribution data.
  2. On English, GPT-4o wins at 200k vocab (4.796 vs our 4.514, +6.2%). This is expected: English compression scales with vocab budget more cleanly than Arabic, and GPT-4o has 3× our budget for long English n-grams.
  3. On parity, ALLaM leads the sub-100k tier (0.848 vs our 0.797); every other tokenizer in the table falls below both on OOD.
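For reference, CpT and parity as used in these tables are simple ratios. A minimal sketch, plugging in the v0.2 OOD numbers from the table above:

```python
def chars_per_token(n_chars: int, n_tokens: int) -> float:
    """Characters per token: higher means better compression."""
    return n_chars / n_tokens

def parity(ar_cpt: float, en_cpt: float) -> float:
    """AR/EN balance: 1.0 means Arabic compresses as well as English."""
    return ar_cpt / en_cpt

# v0.2's OOD FineWeb numbers:
print(round(parity(3.597, 4.514), 3))  # → 0.797
```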

Honest comparison to flagship tokenizers (OOD FineWeb)

Competitor Vocab Their AR CpT Our AR CpT Δ AR Their EN CpT Our EN CpT Δ EN
GPT-5 / GPT-4o (o200k_base) 200,019 3.168 3.597 +13.5% 4.796 4.514 −5.9%
Gemma-4-31B-it 262,144 2.913 3.597 +23.5% 4.660 4.514 −3.1%
Qwen3.6-35B-A3B 248,077 3.230 3.597 +11.4% 4.596 4.514 −1.8%
Kimi-K2.6 163,840 2.115 3.597 +70.1% 4.743 4.514 −4.8%
ALLaM-7B (64k bilingual, AR focus) 64,000 2.929 3.597 +22.8% 3.453 4.514 +30.7%

Why this matters

  • Arabic-first deployments: ~13% better Arabic chars/token vs GPT-4o means ~13% more Arabic context in the same window, or ~13% fewer decode steps for the same Arabic output. At 32.5% of the vocabulary, that is a ~550M-parameter embedding-table saving for a 4096-dim model.
  • Against Arabic-specialized peers: the best bilingual Arabic tokenizer in the 65k class (ALLaM-7B) trails by 22.8% on AR CpT and by 30.7% on EN CpT. The EN gap vs ALLaM actually widens on FineWeb-Edu compared with our internal benchmark.
  • Standard loading path: AutoTokenizer.from_pretrained, no trust_remote_code, no Python preprocessing shims.
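The embedding-table arithmetic behind the first bullet, assuming a single tied embedding table at 4096 dimensions (the model width here is illustrative):

```python
d_model = 4096
rows_saved = 200_019 - 65_000       # o200k_base vocab minus SARF's 65k
params_saved = rows_saved * d_model
print(f"{params_saved / 1e6:.0f}M embedding parameters")  # → "553M embedding parameters"
```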

Caveats we want you to know

  1. Our in-domain benchmark (on deeplatent-hq-bilingual) inflated our numbers slightly. On that corpus we reported AR 3.683, EN 3.522. On OOD FineWeb the AR was 3.597 (−2.3%) and EN was 4.514 (+28%, because FineWeb-Edu is structurally easier). The original "beats GPT-4o on both languages" headline was an artifact of evaluation domain; the "beats GPT-4o on Arabic" claim holds up OOD.
  2. Our tokenizer includes aggressive Arabic normalization (tashkeel strip, Alef unification, Ya Maksura, Indic digits). For text with diacritics, we discard information that competitors preserve, which gives us an unfair compression advantage on unnormalized Arabic. On pre-normalized Arabic (the common case in web text) the effect is small (<3%).
  3. Vocab-size comparison is not perfectly apples-to-apples. GPT-4o's 200k covers 100+ languages + emoji + code; our 65k covers only AR + EN. A fairer framing is that we deliver equivalent effective AR+EN vocab in 1/3 the footprint.

Full honest writeup: FAIR_BENCHMARK.md.


Overview

Property Value
Model Unigram LM (HuggingFace tokenizers.models.Unigram)
Vocabulary size 65,000
Pre-tokenizer Metaspace (▁ marker, SentencePiece-style)
Normalizer AraToken-style: NFKC → Alef + Ya unification → tashkeel/tatweel/zero-width strip → Indic digits → ASCII
Special tokens <s>, </s>, <unk>, <pad>
Languages Arabic + English (50/50 balanced during training)
Training data 500M chars from bilingual HQ corpus
Public API AutoTokenizer.from_pretrained without trust_remote_code

Quick start

Colab (private repo, with token)

from huggingface_hub import login
from transformers import AutoTokenizer
from google.colab import userdata

hf_token = userdata.get("HF-Token")
login(token=hf_token, add_to_git_credential=False)   # sets global auth

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print("vocab:", tok.vocab_size)
print(tok.decode(tok.encode("المعلم يشرح الدرس."), skip_special_tokens=True))

The login() call is necessary with transformers>=5.6: passing token= directly to AutoTokenizer.from_pretrained doesn't propagate into the internal AutoConfig fetch on tokenizer-only repos.

Local environment (no token)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print("vocab:", tok.vocab_size)

Robust: download-and-load

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download("almaghrabima/SARFTokenizer")
tok = AutoTokenizer.from_pretrained(local)

Low-level tokenizers API

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("almaghrabima/SARFTokenizer")
out = tok.encode(
    "The student reads a book. الطالب يقرأ كتابا.",
    add_special_tokens=False,
)
print(out.tokens)
# → ['▁The', '▁student', '▁reads', '▁a', '▁book.', '▁الطالب', '▁يقرأ', '▁كتابا', '.']

Batch encoding for training

batch = tok(
    ["السلام عليكم", "Hello world"],
    padding=True, truncation=True, max_length=128,
    return_tensors="pt",
)

Reproduce the benchmark (fair OOD version)

The exact 600-document OOD eval corpus is published at almaghrabima/SARFTokenizer-benchmark-eval.

from datasets import load_dataset
from transformers import AutoTokenizer
from huggingface_hub import login
login(token="your_hf_token", add_to_git_credential=False)

# The original eval used the internal corpus; this published version can
# also be run against FineWeb for out-of-distribution numbers.
ds = load_dataset("almaghrabima/SARFTokenizer-benchmark-eval", split="test")
ar_texts = [r["text"] for r in ds if r["language"] == "ar"]
en_texts = [r["text"] for r in ds if r["language"] == "en"]

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
ar_c = sum(len(t) for t in ar_texts); en_c = sum(len(t) for t in en_texts)
ar_t = sum(len(tok.encode(t, add_special_tokens=False)) for t in ar_texts)
en_t = sum(len(tok.encode(t, add_special_tokens=False)) for t in en_texts)
print(f"AR CpT={ar_c/ar_t:.3f}  EN CpT={en_c/en_t:.3f}  Parity={(ar_c/ar_t)/(en_c/en_t):.3f}")

For the OOD FineWeb fair benchmark, use scripts/fair_bench.py in the GitHub repo.

Head-to-head on your own text

from transformers import AutoTokenizer

sarf  = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
gemma = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
qwen  = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")
kimi  = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.6", trust_remote_code=True)

text = "your text here"
for name, t in [("SARF v0.2", sarf), ("Gemma-4", gemma),
                ("Qwen3.6", qwen), ("Kimi-K2.6", kimi)]:
    ids = t.encode(text, add_special_tokens=False)
    print(f"{name:<14} vocab={len(t):>7,} tokens={len(ids):>5} CpT={len(text)/len(ids):.3f}")

Normalization applied at encode time

All Arabic text is normalized at encode time by the embedded normalizer:

  • NFKC compatibility normalization
  • Tashkeel (U+064B–U+0652, U+0670) removed
  • Tatweel (U+0640) removed
  • Zero-width + BiDi controls removed
  • Alef variants (أ, إ, آ, ٱ) → bare Alef ا
  • Alef Maksura ى → Ya ي
  • Arabic-Indic digits (٠–٩) → ASCII 0–9

Encoding is lossy on diacritics and Alef-Hamza variants, by design. If your downstream task requires preserving them (classical poetry with full diacritics, Qur'anic text), v0.2 is not suitable.

Inspect it:

from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.normalizer.normalize_str("الْمَعْلَمُ إلى آل أمَّة ١٢٣"))
# → "المعلم الي ال امة 123"
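For environments where the tokenizer isn't installed, the rules listed above can be approximated in pure Python. This is a reference sketch of the documented behavior, not the shipped normalizer (which lives inside tokenizer.json):

```python
import unicodedata

TASHKEEL = set(range(0x064B, 0x0653)) | {0x0670}            # harakat + dagger alef
ALEF_VARIANTS = {0x0623, 0x0625, 0x0622, 0x0671}            # أ إ آ ٱ
INDIC_DIGITS = {0x0660 + i: str(i) for i in range(10)}      # ٠–٩ → 0–9

def normalize_ar(text: str) -> str:
    """Approximate the AraToken-style pipeline described above."""
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in TASHKEEL or cp == 0x0640 or 0x200B <= cp <= 0x200F:
            continue                    # strip tashkeel, tatweel, zero-width/BiDi
        if cp in ALEF_VARIANTS:
            out.append("\u0627")        # unify to bare Alef ا
        elif cp == 0x0649:
            out.append("\u064A")        # Alef Maksura ى → Ya ي
        elif cp in INDIC_DIGITS:
            out.append(INDIC_DIGITS[cp])
        else:
            out.append(ch)
    return "".join(out)

print(normalize_ar("أُمَّة ١٢٣"))  # → "امة 123"
```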

Special tokens

Token Purpose
<s> beginning-of-sequence
</s> end-of-sequence
<unk> unknown (byte fallback)
<pad> padding

Why Unigram?

Recent literature (AraToken arXiv:2512.18399, Dec 2025; "Rethinking Tokenization for Rich Morphology" arXiv:2508.08424, Aug 2025) finds that Unigram LM outperforms BPE on morphologically rich languages because its EM-based pruning recovers morphology implicitly, with no explicit morpheme preprocessing required.

We confirmed this on the 65k bilingual AR/EN regime: Unigram lifts AR CpT by 10.3% and EN CpT by 19.9% over a well-tuned parity-aware BPE pipeline (v0.1) at the same vocab budget, on OOD FineWeb data.
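The same recipe (Unigram model, NFKC normalizer, Metaspace pre-tokenizer, the four special tokens) can be reproduced at toy scale with the HuggingFace tokenizers library. The corpus and vocab size below are illustrative stand-ins, not the actual SARF training configuration:

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import Unigram
from tokenizers.normalizers import NFKC
from tokenizers.trainers import UnigramTrainer

# Tiny stand-in corpus; the real tokenizer trained on 500M bilingual chars.
corpus = [
    "The student reads a book.",
    "The teacher explains the lesson.",
] * 50

tok = Tokenizer(Unigram())
tok.normalizer = NFKC()
tok.pre_tokenizer = pre_tokenizers.Metaspace()  # SentencePiece-style ▁ marker
tok.decoder = decoders.Metaspace()

trainer = UnigramTrainer(
    vocab_size=200,                              # toy budget; v0.2 uses 65,000
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
    unk_token="<unk>",
)
tok.train_from_iterator(corpus, trainer=trainer)

enc = tok.encode("The student reads a book.", add_special_tokens=False)
print(enc.tokens)
```

EM-based pruning happens inside train_from_iterator: the trainer seeds a large candidate vocabulary and iteratively drops pieces that contribute least to corpus likelihood until the budget is met.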


Files

  • tokenizer.json - HuggingFace-format tokenizer (model + normalizer + pre_tokenizer + decoder)
  • tokenizer_config.json - tokenizer configuration (specifies PreTrainedTokenizerFast)
  • special_tokens_map.json - special-tokens mapping
  • BENCHMARK.md - in-domain benchmark (honest about evaluation bias)
  • FAIR_BENCHMARK.md - out-of-distribution benchmark on FineWeb
  • benchmark_results.json - raw per-tokenizer metrics (in-domain)
  • benchmark_results_2026flagships.json - raw metrics for Gemma-4, Qwen3.6, Kimi-K2.6
  • fair_benchmark_results.json - raw OOD FineWeb metrics for all 13 tokenizers

Troubleshooting

OSError: ... is not a local folder and is not a valid model identifier

In transformers >= 5.6, token= doesn't always propagate through the internal AutoConfig lookup on tokenizer-only repos. Fix: login(token=hf_token) first.

Unrecognized model in ... Should have a model_type key in its config.json

Same root cause. Use login(), or snapshot_download() and load from the local path.

401 Unauthorized but whoami worked

Token valid but AutoConfig not forwarding it. Use login() to set auth globally.


Version history

  • v0.2 - Unigram LM at 65k vocab with the AraToken normalizer. Best Arabic chars/token in the sub-100k vocab tier on OOD FineWeb (+13.5% vs GPT-4o, +22.8% vs ALLaM, +70% vs Kimi-K2.6).
  • v0.1 - MYTE + parity-aware BPE at 62,912 vocab with 14k Arabic morpheme atomic tokens.

License

Apache 2.0.
