# mmBERT-base

Transformers v5-compatible checkpoint of `jhu-clsp/mmBERT-base`.

| Property | Value |
|---|---|
| Parameters | 307M |
| Hidden size | 768 |
| Layers | 22 |
| Attention heads | 12 |
| Max sequence length | 8,192 |
| RoPE theta | 160,000 (both global and local attention) |
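
These values can be cross-checked against the published config. A minimal sketch, assuming the standard `ModernBertConfig` attribute names (not taken from the original card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("datalama/mmBERT-base")

# Standard ModernBertConfig attribute names; printed values should match the table above.
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 22
print(config.num_attention_heads)      # 12
print(config.max_position_embeddings)  # 8192
```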

## Usage (transformers v5)

```python
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertModel.from_pretrained("datalama/mmBERT-base")

inputs = tokenizer("์ธ๊ณต์ง€๋Šฅ ๊ธฐ์ˆ ์€ ๋น ๋ฅด๊ฒŒ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (768-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
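
Mean pooling over non-padding tokens is another common way to obtain a sentence embedding. This is a sketch continuing the snippet above, not something prescribed by the original model card:

```python
# Mean-pool the token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
mean_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```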

For masked language modeling:

```python
from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
```
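
To read off the model's guess for the masked token, decode the argmax at the `[MASK]` position. A minimal sketch continuing the snippet above (it assumes the literal `[MASK]` string matches the tokenizer's mask token):

```python
# Find the [MASK] position and decode the highest-scoring token there.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = outputs.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```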

## Migration Details

This checkpoint was migrated from jhu-clsp/mmBERT-base with the following changes:

1. **Weight format**: `pytorch_model.bin` → `model.safetensors`

   - Tied weights (`model.embeddings.tok_embeddings.weight` ↔ `decoder.weight`) were cloned to separate tensors before saving
   - All 138 tensors verified bitwise equal after conversion (see the verification sketch after this list)

2. **Config**: added explicit `rope_parameters` for transformers v5

```json
{
  "global_rope_theta": 160000,
  "local_rope_theta": 160000,
  "rope_parameters": {
    "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
  }
}
```

The original flat fields (`global_rope_theta`, `local_rope_theta`) are preserved for backward compatibility. In transformers v5, `ModernBertConfig` defaults `sliding_attention.rope_theta` to 10,000, but mmBERT uses 160,000 for both attention types, so the explicit `rope_parameters` block is required.
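
A rough sketch of how these two points can be re-checked locally (not the exact script used for the migration); it assumes the original repo still hosts `pytorch_model.bin` and compares only the keys present in both weight files:

```python
import json

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# 1. Both attention types should use RoPE theta 160,000 in the converted config.
with open(hf_hub_download("datalama/mmBERT-base", "config.json")) as f:
    config = json.load(f)
for attn in ("full_attention", "sliding_attention"):
    assert config["rope_parameters"][attn]["rope_theta"] == 160000.0

# 2. Every shared tensor should be bitwise identical between the two formats.
original = torch.load(
    hf_hub_download("jhu-clsp/mmBERT-base", "pytorch_model.bin"),
    map_location="cpu",
    weights_only=True,
)
converted = load_file(hf_hub_download("datalama/mmBERT-base", "model.safetensors"))
for name in original.keys() & converted.keys():
    assert torch.equal(original[name], converted[name]), f"mismatch in {name}"
```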

## Verification

Cross-environment verification was performed between transformers v4 (original) and v5 (this checkpoint):

| Check | Result |
|---|---|
| RoPE config | `rope_parameters` present; theta = 160,000 for both attention types |
| Weight integrity | 138 tensors bitwise equal (jhu-clsp `.bin` vs. datalama `.safetensors`) |
| Inference output | Max diff between v4 and v5 across 4 multilingual sentences: 7.63e-06 |
| Fine-tuning readiness | Tokenizer roundtrip, forward + backward pass, gradient propagation: all OK |
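
The last row corresponds to a smoke test along these lines. This is a hedged sketch, not the exact script used; it reuses the input ids as MLM labels purely to obtain a loss for the backward pass:

```python
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-base")

# Tokenizer roundtrip: encoding then decoding should reproduce the text
# (may need loosening for tokenizers that normalize whitespace).
text = "์ธ๊ณต์ง€๋Šฅ ๊ธฐ์ˆ ์€ ๋น ๋ฅด๊ฒŒ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค."
batch = tokenizer(text, return_tensors="pt")
assert tokenizer.decode(batch["input_ids"][0], skip_special_tokens=True).strip() == text

# Forward + backward pass: use the input ids as labels to get an MLM loss.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()

# Gradient propagation: every trainable parameter should receive a finite gradient.
for name, param in model.named_parameters():
    if param.requires_grad:
        assert param.grad is not None and torch.isfinite(param.grad).all(), name
```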

## Credit

Original model by JHU CLSP. See the original model card for training details and benchmarks.
