LTZ E1 (base)

A ModernBERT-based masked language model pretrained on Luxembourgish, following the Ettin recipe (https://huggingface.co/jhu-clsp/ettin-encoder-150m).

Model Details

  • Architecture: ModernBERT (encoder)
  • Size: base (~0.1B parameters, F32 weights)
  • Vocabulary: 50,368 tokens (BPE, GPTNeoXTokenizerFast)
  • Context length: 1,024 tokens
  • Language: Luxembourgish (lb/ltz)
  • License: CC BY-SA 4.0
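
The numbers above can be cross-checked against the published model config without downloading the weights; a minimal sketch (assumes network access to the Hugging Face Hub):

```python
from transformers import AutoConfig

# Load only the config (no weights) to inspect the architecture
config = AutoConfig.from_pretrained("instilux/ltz-e1-base")

print(config.model_type)               # modernbert
print(config.vocab_size)               # 50368
print(config.max_position_embeddings)  # context length
```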

Usage

Requires transformers>=4.48.0.

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("instilux/ltz-e1-base")
model = AutoModelForMaskedLM.from_pretrained("instilux/ltz-e1-base")

# Tokenize and locate the position of the [MASK] token
inputs = tokenizer("Wéi spéit [MASK] et?", return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)

# Top-5 candidate tokens for the masked position
top_tokens = outputs.logits[0, mask_pos].topk(5)
for token_id, score in zip(top_tokens.indices[0], top_tokens.values[0]):
    token = tokenizer.decode(token_id)
    print(f"{token:15s} {score:.3f}")
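
Alternatively, the same top-k lookup can be done with the transformers fill-mask pipeline, which handles tokenization, mask lookup, and softmax internally (a sketch; the scores here are probabilities rather than raw logits):

```python
from transformers import pipeline

# fill-mask pipeline wraps tokenizer + model in one call
fill = pipeline("fill-mask", model="instilux/ltz-e1-base")

for pred in fill("Wéi spéit [MASK] et?", top_k=5):
    print(f"{pred['token_str']!r:15} {pred['score']:.3f}")
```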

Tokenizer Notes

The tokenizer is BPE-based (GPTNeoXTokenizerFast) with BERT-style special tokens ([CLS], [SEP], [MASK], [PAD]). A [CLS] token is prepended automatically (add_bos_token: true).
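
As a quick sanity check (a sketch assuming the published tokenizer config), the special tokens and the automatic [CLS] can be inspected directly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("instilux/ltz-e1-base")

# BERT-style special tokens on top of a BPE vocabulary
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.mask_token, tokenizer.pad_token)

# [CLS] is prepended automatically (add_bos_token: true)
ids = tokenizer("Moien, wéi geet et?")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```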

Citation

A paper describing this model will be published soon. In the meantime, please cite this repository if you use this model in your work.
