RuBERT fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of DeepPavlov/rubert-base-cased for morphological analysis of the Tatar language. It was trained on a subset of 80,000 sentences from the Tatar Morphological Corpus. The model predicts fine-grained morphological tags (e.g., N+Sg+Nom, V+PRES(Й)+3SG).
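Judging from the two example tags above, a tag is a sequence of labels joined by `+`, starting with a part-of-speech label. The following is a minimal sketch for splitting a predicted tag into its components, assuming that convention holds across the whole tagset. The helper name `split_tag` is hypothetical and not part of the model or any library.

```python
from typing import List, Tuple

def split_tag(tag: str) -> Tuple[str, List[str]]:
    """Split a '+'-joined tag into (first label, remaining labels).

    Assumption: '+' is only used as a separator, never inside a label.
    """
    head, *features = tag.split("+")
    return head, features

print(split_tag("N+Sg+Nom"))       # ('N', ['Sg', 'Nom'])
print(split_tag("V+PRES(Й)+3SG"))  # ('V', ['PRES(Й)', '3SG'])
```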

Performance on Test Set

| Metric | Value | 95% CI |
|---|---|---|
| Token Accuracy | 0.9861 | [0.9852, 0.9870] |
| Micro F1 | 0.9861 | [0.9851, 0.9870] |
| Macro F1 | 0.5059 | [0.5432, 0.5836]* |

*Note: the macro F1 confidence interval is given as reported in the paper; it does not contain the point estimate shown here.
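The gap between micro and macro F1 is typical of large, skewed tagsets. Micro F1 pools all tokens, so for single-label tagging it equals token accuracy. Macro F1 is the unweighted mean of per-tag F1, so rare tags that are often missed pull it down. The sketch below illustrates this on toy data. It assumes macro F1 is averaged over every tag that occurs in the gold or predicted labels, which may differ from the averaging used in the paper. The function name `f1_scores` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def f1_scores(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative
    # Micro: pool the counts over all tags.
    total_tp = sum(tp.values())
    micro = 2 * total_tp / (2 * total_tp + sum(fp.values()) + sum(fn.values()))
    # Macro: unweighted mean of per-tag F1.
    labels = set(gold) | set(pred)
    per_label = [2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels]
    macro = sum(per_label) / len(per_label)
    return micro, macro

# Toy data: a frequent tag that is always right, a rare tag that is always missed.
gold = ["N+Sg+Nom"] * 98 + ["V+PRES(Й)+3SG"] * 2
pred = ["N+Sg+Nom"] * 100
micro, macro = f1_scores(gold, pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
# micro F1 = 0.9800, macro F1 = 0.4949
```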

Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|---|---|
| PUNCT | 1.0000 |
| NOUN | 0.9827 |
| VERB | 0.9640 |
| ADJ | 0.9614 |
| PRON | 0.9914 |
| PART | 0.9995 |
| PROPN | 0.9724 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9897 |
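A breakdown like the one above can be reproduced on your own data by grouping tokens and computing accuracy within each group. This is a minimal sketch, assuming each evaluated token comes with a coarse POS label next to its gold and predicted tag. How the POS labels in this card were derived is not stated, and the function name `accuracy_by_group` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_group(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """rows are (group, gold_tag, predicted_tag); returns accuracy per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, pred in rows:
        total[group] += 1
        correct[group] += int(gold == pred)
    return {g: correct[g] / total[g] for g in total}

# Toy rows for illustration only, not taken from the test set.
rows = [
    ("NOUN", "N+Sg+Nom", "N+Sg+Nom"),
    ("NOUN", "N+Sg+POSS_3(СЫ)+Nom", "N+Sg+Nom"),  # full tag mismatch counts as an error
    ("PUNCT", "PUNCT", "PUNCT"),
]
print(accuracy_by_group(rows))  # {'NOUN': 0.5, 'PUNCT': 1.0}
```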

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/rubert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Get tag mapping from model config
id2tag = model.config.id2label

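# Report one tag per word: use the prediction at each word's first subword
# (word_ids() maps every subword to its word index; special tokens map to None)
word_ids = inputs.word_ids()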
prev_word = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx

Expected output (approximately):

Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
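The loop in the usage example prints one tag per word by keeping only the first subword of each word and skipping special tokens. That selection step can be sketched in isolation without downloading the model. The `word_ids` list below is a made-up example of what `inputs.word_ids()` could return, and the helper name `first_subword_positions` is hypothetical.

```python
from typing import List, Optional

def first_subword_positions(word_ids: List[Optional[int]]) -> List[int]:
    """Return the token position of the first subword of each word."""
    positions = []
    prev = None
    for pos, wid in enumerate(word_ids):
        # None marks special tokens; a repeated id marks a continuation subword.
        if wid is not None and wid != prev:
            positions.append(pos)
        prev = wid
    return positions

# Hypothetical mapping: three words, the second split into two subwords,
# with special tokens at both ends.
word_ids = [None, 0, 1, 1, 2, None]
print(first_subword_positions(word_ids))  # [1, 2, 4]
```

Indexing the prediction tensor at these positions yields exactly one predicted tag per input word, in the original word order.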

Citation

If you use this model, please cite it as:

@misc{arabov-rubert-tatar-morph-2026,
  title = {RuBERT fine-tuned for Tatar Morphological Analysis},
  author = {Arabov, Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/rubert-tatar-morph}
}

License

Apache 2.0
