json-tokenizer: Structure-Aware Tokenization for JSON
A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.
Code: github.com/anthony-maio/json-tokenizer
Key Results
| Metric | Value |
|---|---|
| Token savings vs cl100k_base | 5-15% on schema-repetitive JSON |
| Vocabulary size | 4,251 tokens (vs 100,256 for cl100k_base) |
| Vocab reduction | ~90x smaller |
| Roundtrip fidelity | 100% lossless across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just 558 tokens |
Architecture
Three-tier vocabulary:
- Structural tokens (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, and type markers
- Key vocabulary (IDs 32-N): Learned single-token keys from training data (125 keys)
- BPE subwords (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)
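To make the three tiers concrete, here is a toy sketch in plain Python. It is not the library's implementation: the specific structural IDs, the helper names (`collect_keys`, `build_key_vocab`, `encode`), and the `("bpe", text)` placeholder that stands in for real BPE subword IDs are all assumptions made for illustration. Only the idea of fixed structural IDs, learned key IDs starting at 32, and BPE applied to value content comes from the description above.

```python
import json
from collections import Counter

# Hypothetical ID assignments inside the structural range (IDs 0-15).
STRUCTURAL = {"{": 0, "}": 1, "[": 2, "]": 3, ":": 4, ",": 5,
              "true": 6, "false": 7, "null": 8}
KEY_BASE = 32  # first key ID, following "IDs 32-N" above


def collect_keys(node, counts):
    """Count every object key found anywhere in a parsed JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            counts[key] += 1
            collect_keys(value, counts)
    elif isinstance(node, list):
        for item in node:
            collect_keys(item, counts)


def build_key_vocab(objects, max_keys=512):
    """Give each of the most frequent keys a single dedicated ID."""
    counts = Counter()
    for obj in objects:
        collect_keys(obj, counts)
    ranked = counts.most_common(max_keys)
    return {key: KEY_BASE + i for i, (key, _) in enumerate(ranked)}


def encode(node, key_vocab):
    """Emit int IDs for structure and known keys, ('bpe', text) for the rest."""
    if node is True:
        return [STRUCTURAL["true"]]
    if node is False:
        return [STRUCTURAL["false"]]
    if node is None:
        return [STRUCTURAL["null"]]
    if isinstance(node, dict):
        out = [STRUCTURAL["{"]]
        for i, (key, value) in enumerate(node.items()):
            if i:
                out.append(STRUCTURAL[","])
            # Unknown keys fall back to the BPE tier.
            out.append(key_vocab[key] if key in key_vocab else ("bpe", key))
            out.append(STRUCTURAL[":"])
            out.extend(encode(value, key_vocab))
        out.append(STRUCTURAL["}"])
        return out
    if isinstance(node, list):
        out = [STRUCTURAL["["]]
        for i, item in enumerate(node):
            if i:
                out.append(STRUCTURAL[","])
            out.extend(encode(item, key_vocab))
        out.append(STRUCTURAL["]"])
        return out
    # Strings and numbers: a real tokenizer would run BPE over this text.
    return [("bpe", json.dumps(node))]


corpus = [
    {"name": "Alice", "age": 30, "active": True},
    {"name": "Bob", "age": 41},
    {"name": "Cy"},
]
key_vocab = build_key_vocab(corpus)
print(encode(corpus[0], key_vocab))
```

In this sketch every brace, colon, comma, literal, and known key costs exactly one ID, so the per-object overhead of a repeated schema stays constant and only the values vary in length.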
This Model
This pretrained tokenizer was trained on structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs
- Synthetic training corpus (700 objects)
Total training objects: 72,991
Vocabulary: 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)
Usage
With HuggingFace Transformers
```python
# Requires: pip install json-tokenizer[huggingface]
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")

# Encode JSON
output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
print(output["input_ids"])

# Decode back to JSON (lossless)
decoded = tokenizer.decode(output["input_ids"])
print(decoded)  # {"name": "Alice", "age": 30, "active": true}
```
With Core Library
```python
# Requires: pip install json-tokenizer
from json_tokenizer import JSONTokenizer

tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")

# Encode (accepts Python dicts, lists, or JSON strings)
ids = tokenizer.encode({"name": "Alice", "age": 30})

# Decode (lossless roundtrip)
json_str = tokenizer.decode(ids)
```
Training Your Own
```python
from json_tokenizer import JSONTokenizer
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

# Train on your own data and save in the core format
tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
tok.train_from_json_files(["your_data.jsonl"])
tok.save("./my_tokenizer")

# Convert to HF format
hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
hf_tok.save_pretrained("./my_hf_tokenizer")
```
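If your data lives in Python objects rather than on disk, a training file can be prepared with the standard library alone. This sketch assumes the `.jsonl` input is in the usual JSON Lines layout (one JSON document per line); the library's exact input requirements are not stated here, and `write_jsonl` is a hypothetical helper, not part of json-tokenizer.

```python
import json
from pathlib import Path


def write_jsonl(objects, path):
    """Write one compact JSON document per line; return the number of lines written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for obj in objects:
            # Compact separators keep each record on a single short line.
            handle.write(json.dumps(obj, ensure_ascii=False, separators=(",", ":")))
            handle.write("\n")
            count += 1
    return count


records = [{"name": "Alice", "age": 30}, {"name": "Bob", "tags": ["a", "b"]}]
written = write_jsonl(records, "your_data.jsonl")
print(written)
```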
Where It Wins / Where It Loses
| Scenario | json-tokenizer | cl100k_base |
|---|---|---|
| GeoJSON (schema-repetitive) | +7.8% savings | baseline |
| Telemetry logs | +5.5% savings | baseline |
| Batch JSON arrays | +9.3% savings | baseline |
| Config objects | +12.3% savings | baseline |
| Prose-heavy JSON (Alpaca) | -26.2% | wins |
| K8s manifests (deep nesting) | break-even | break-even |
Best for: API responses, observability logs, function calling, structured outputs
Not for: Prose-heavy JSON, general-purpose text
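To run the same comparison on your own data, the percentages can be read as relative token-count reductions against the baseline tokenizer. That reading is an assumption, since the exact formula is not stated here, and the counts below are made-up numbers chosen for illustration, not measurements.

```python
def savings_pct(baseline_tokens, candidate_tokens):
    """Relative reduction in token count versus a baseline, in percent.

    Positive means the candidate used fewer tokens; negative means more.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


# Illustrative counts only.
print(round(savings_pct(1000, 922), 1))   # fewer tokens than the baseline
print(round(savings_pct(1000, 1262), 1))  # more tokens than the baseline
```

Under this definition, a negative value such as the prose-heavy row means the baseline produced the shorter sequence.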
Citation
```bibtex
@software{maio2026jsontokenizer,
  author  = {Maio, Anthony},
  title   = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
  year    = {2026},
  url     = {https://github.com/anthony-maio/json-tokenizer},
  doi     = {10.5281/zenodo.18879110},
  version = {0.2.0},
  license = {MIT}
}
```
License
MIT