# mmBERT-base
A Transformers v5-compatible checkpoint of `jhu-clsp/mmBERT-base`.
| Property | Value |
|---|---|
| Parameters | 307M |
| Hidden size | 768 |
| Layers | 22 |
| Attention heads | 12 |
| Max sequence length | 8,192 |
| RoPE theta | 160,000 (both global & local) |
## Usage (transformers v5)
```python
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertModel.from_pretrained("datalama/mmBERT-base")

# Korean example sentence: "AI technology is developing rapidly."
inputs = tokenizer("인공지능 기술은 빠르게 발전하고 있습니다.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (768-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
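If a single sentence embedding is needed, mean pooling over non-padding tokens is a common alternative to taking the `[CLS]` vector. A minimal sketch, continuing from the snippet above (not part of the original model card):

```python
# Mean pooling over non-padding tokens (alternative to the [CLS] embedding).
mask = inputs.attention_mask.unsqueeze(-1).to(outputs.last_hidden_state.dtype)
mean_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```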
For masked language modeling:
```python
from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
```
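To turn the output into an actual prediction, the logits at the `[MASK]` position can be decoded. A small sketch, continuing from the snippet above (not part of the original card):

```python
# Find the [MASK] position and decode the highest-scoring token.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```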
## Migration Details
This checkpoint was migrated from jhu-clsp/mmBERT-base with the following changes:
1. Weight format: `pytorch_model.bin` → `model.safetensors`
   - Tied weights (`model.embeddings.tok_embeddings.weight` → `decoder.weight`) were cloned to separate tensors before saving (see the weight-cloning sketch after this list)
   - All 138 tensors verified bitwise equal after conversion
2. Config: Added explicit `rope_parameters` for transformers v5

   ```json
   {
     "global_rope_theta": 160000,
     "local_rope_theta": 160000,
     "rope_parameters": {
       "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
       "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
     }
   }
   ```
The original flat fields (`global_rope_theta`, `local_rope_theta`) are preserved for backward compatibility.

In transformers v5, `ModernBertConfig` defaults `sliding_attention.rope_theta` to 10,000, but mmBERT uses 160,000 for both attention types, so the explicit `rope_parameters` block is required.
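One quick way to confirm that the loaded config carries the intended theta rather than the v5 default (a small check sketch, not from the original card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("datalama/mmBERT-base")
# Both entries should report rope_theta == 160000.0, not the 10,000 default.
print(config.rope_parameters["full_attention"])
print(config.rope_parameters["sliding_attention"])
```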
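For the weight-format change in item 1, the cloning step can be illustrated roughly as follows. This is a hedged sketch of the general approach, not the actual migration script; the output path is illustrative:

```python
from safetensors.torch import save_file
from transformers import ModernBertForMaskedLM

# Load the original checkpoint and clone every tensor so tied weights no longer
# share storage (safetensors rejects shared tensors), then write the file.
model = ModernBertForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")
state_dict = {name: t.detach().clone().contiguous() for name, t in model.state_dict().items()}
save_file(state_dict, "model.safetensors")  # illustrative output path
```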
## Verification
Cross-environment verification was performed between transformers v4 (original) and v5 (this checkpoint):
| Check | Result |
|---|---|
| RoPE config | rope_parameters present, theta=160,000 for both attention types |
| Weight integrity | 138 tensors bitwise equal (jhu-clsp .bin vs datalama .safetensors) |
| Inference output | v4 vs v5 max diff across 4 multilingual sentences: 7.63e-06 |
| Fine-tuning readiness | Tokenizer roundtrip, forward+backward pass, gradient propagation: all OK |
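The inference-output row boils down to an element-wise comparison of hidden states produced in each environment. A rough sketch, assuming the v4 and v5 hidden states were exported to files (file names here are illustrative):

```python
import torch

# Compare hidden states exported from the v4 and v5 environments (illustrative file names).
v4_hidden = torch.load("hidden_states_v4.pt")
v5_hidden = torch.load("hidden_states_v5.pt")
print((v4_hidden - v5_hidden).abs().max().item())  # expected on the order of 1e-5 or below
```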
## Credit
Original model by JHU CLSP. See the [original model card](https://huggingface.co/jhu-clsp/mmBERT-base) for training details and benchmarks.