MarianMT Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation

Model Description

This is a bidirectional translation model fine-tuned from Helsinki-NLP/opus-mt-sem-sem for translating between Samaritan Hebrew (smp) and Samaritan Aramaic (sam). The model supports both translation directions using special language tags (>>smp<< and >>sam<<).

The model is based on the Marian architecture and was fine-tuned specifically for Samaritan Hebrew and Samaritan Aramaic, two closely related Semitic languages written in the Hebrew script.

Model Details

  • Model Type: Seq2Seq (Marian)
  • Base Model: Helsinki-NLP/opus-mt-sem-sem
  • Languages: Samaritan Hebrew (smp) ↔ Samaritan Aramaic (sam)
  • Direction: Bidirectional
  • Vocabulary Size: 33,702 tokens (including 2 added language tokens: >>smp<< and >>sam<<)
  • Model Parameters: 61,918,208
  • Input/Output Max Length: 313 tokens

Training Details

Training Configuration

  • Training Epochs: 96.35 (of 100 planned, early stopping at step 29,000)
  • Batch Size: 16 per device
  • Effective Batch Size: 32 (with gradient accumulation)
  • Learning Rate: 1e-5
  • Warmup Steps: 1,000
  • Weight Decay: 0.01
  • Gradient Accumulation Steps: 2
  • Optimization: AdamW with a cosine-with-restarts learning rate schedule
  • Precision: bfloat16 (BF16)
  • Training Time: ~47.8 minutes (2,866 seconds)

Dataset

  • Train Split: 9,610 sentence pairs (4,805 original bidirectional pairs)
  • Validation Split: 1,080 sentence pairs (540 original pairs)
  • Test Split: 108 sentence pairs (54 original pairs)
  • Total Dataset: 10,798 bidirectional sentence pairs from biblical parallel texts
  • Format: Pipe-delimited CSV with columns: Book|Chapter|Verse|Samaritan|Aramaic
  • Script: Hebrew script for both languages

The dataset contains parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic (Targumic), with both directions included in the training data to enable bidirectional translation.
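Building the bidirectional training set from the pipe-delimited CSV described above can be sketched as follows (the function name and file path are illustrative; the column names follow the format listed above):

```python
import csv

def load_bidirectional_pairs(path):
    """Read the pipe-delimited corpus (Book|Chapter|Verse|Samaritan|Aramaic)
    and emit one training example per direction, each source prefixed with
    its language tag."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="|"):
            heb, arc = row["Samaritan"], row["Aramaic"]
            # Samaritan Hebrew -> Samaritan Aramaic
            pairs.append({"source": f">>smp<< {heb}", "target": arc})
            # Samaritan Aramaic -> Samaritan Hebrew
            pairs.append({"source": f">>sam<< {arc}", "target": heb})
    return pairs
```

Each verse contributes two examples, which is why the split sizes above are exactly double the original pair counts.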

Training Process

Training was conducted with:

  • Early stopping patience: 5 evaluation steps
  • Evaluation every 500 steps
  • Best model checkpoint: checkpoint-26500 (BLEU: 60.48)
  • Final checkpoint: checkpoint-29000 (BLEU: 59.72 after 96.35 epochs)
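Taken together, the settings above map onto a Hugging Face Transformers setup along these lines (a sketch, not the exact training script; argument names follow the Seq2SeqTrainingArguments API, and output_dir is a placeholder):

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

# Sketch of the configuration described above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-smp-sam",
    num_train_epochs=100,            # early stopping ended training at ~96.35 epochs
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32
    learning_rate=1e-5,
    warmup_steps=1000,
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts",
    bf16=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,     # restores checkpoint-26500 (best BLEU)
    metric_for_best_model="bleu",
    predict_with_generate=True,
)

# Stop after 5 evaluations without improvement, as configured above.
callbacks = [EarlyStoppingCallback(early_stopping_patience=5)]
```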

Performance

Evaluation Metrics (Test Set)

  • BLEU Score: 59.72 (best: 60.48 at checkpoint-26500)
  • chrF Score: 77.91
  • Character Accuracy: 51.09%
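BLEU and chrF are standard corpus metrics; character accuracy is less standardized. One plausible position-wise definition is sketched below (the card does not state the exact formula behind the 51.09% figure, so this is an assumption):

```python
def char_accuracy(hypotheses, references):
    """Position-wise character accuracy over a corpus: the fraction of
    aligned character positions that match, with length mismatches
    counted as errors. A plausible reading of the reported metric,
    not a confirmed definition."""
    matches = 0
    total = 0
    for hyp, ref in zip(hypotheses, references):
        total += max(len(hyp), len(ref))
        matches += sum(h == r for h, r in zip(hyp, ref))
    return 100.0 * matches / total if total else 0.0
```

Under this reading, a single character insertion shifts every following position, which is why character accuracy can sit well below BLEU and chrF on otherwise good translations.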

Training Metrics

  • Final Training Loss: 0.722
  • Final Evaluation Loss: 0.825
  • Best BLEU (validation): 60.48 at step 26,500

Usage

Direct Inference with Transformers

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam")
model = AutoModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam")

# Translate from Samaritan Hebrew to Samaritan Aramaic
text_smp = "ืื—ืจ ื”ื“ื‘ืจื™ื ื”ืืœื” ื”ื™ื” ื“ื‘ืจ ื™ื”ื•ื” ืืœ ืื‘ืจื"
input_text = f">>smp<< {text_smp}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)

# Translate from Samaritan Aramaic to Samaritan Hebrew
text_sam = "ื‘ืชืจ ืžืžืœืœื™ื” ืืœื™ืŸ ื”ื•ื” ืžืœืœ ื™ื”ื•ื” ืขื ืื‘ืจื"
input_text = f">>sam<< {text_sam}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
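The two generation snippets above differ only in the language tag, so a small convenience wrapper (illustrative, not part of the published repository) can hide the tag handling; the direction names mirror the inference script's --direction values:

```python
def translate(text, direction, model, tokenizer, max_length=313):
    """Translate one sentence in either direction.

    direction: "smp2sam" (Hebrew -> Aramaic) or "sam2smp" (Aramaic -> Hebrew).
    The tag names the SOURCE language, matching the model card's convention.
    """
    tag = ">>smp<<" if direction == "smp2sam" else ">>sam<<"
    inputs = tokenizer(f"{tag} {text}", return_tensors="pt",
                       max_length=max_length, truncation=True)
    outputs = model.generate(**inputs, max_length=max_length,
                             num_beams=4, length_penalty=0.6)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```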

Using the Pipeline

The model can also be used with the provided inference script (if available in the repository):

python inference_smp_sam.py \
  --model_path johnlockejrr/marianmt-smp-sam \
  --text "אחר הדברים האלה היה דבר יהוה אל אברם" \
  --direction smp2sam

Language Tags

The model uses special language tags to indicate translation direction:

  • >>smp<< - Prefix for Samaritan Hebrew (source) → Samaritan Aramaic (target)
  • >>sam<< - Prefix for Samaritan Aramaic (source) → Samaritan Hebrew (target)

These tags must be included at the beginning of the input text for proper direction control.

Limitations and Considerations

  1. Domain Specificity: The model was trained primarily on biblical texts and may perform better on similar religious or historical texts.
  2. Script Normalization: Input texts may need normalization (removal of diacritics/niqqud) depending on your use case.
  3. Length Constraints: Maximum sequence length is 313 tokens. Longer texts will be truncated.
  4. Character Accuracy: At 51.09%, character-level accuracy indicates room for improvement, though BLEU and chrF scores suggest reasonable translation quality.
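For point 2, Hebrew vowel points and cantillation marks occupy a contiguous region of the Unicode Hebrew block, so normalization can be a small regex filter. This is a generic sketch; whether to apply it depends on how your input compares with the (unpointed) training corpus:

```python
import re

# Strip Hebrew cantillation marks (U+0591-U+05AF) and vowel points, while
# keeping base letters and the punctuation signs maqaf (U+05BE) and
# sof pasuq (U+05C3).
_NIQQUD = re.compile(r"[\u0591-\u05BD\u05BF-\u05C2\u05C4\u05C5\u05C7]")

def strip_niqqud(text: str) -> str:
    """Remove niqqud and cantillation, leaving consonantal text."""
    return _NIQQUD.sub("", text)
```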

Citation

If you use this model, please cite:

@misc{marianmt-smp-sam,
  title={MarianMT Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation},
  author={johnlockejrr},
  year={2025},
  howpublished={\url{https://huggingface.co/johnlockejrr/marianmt-smp-sam}}
}

Acknowledgments

  • Base model: Helsinki-NLP/opus-mt-sem-sem
  • Training framework: Hugging Face Transformers
  • Dataset: Parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic

Model Card Contact

For questions, issues, or contributions, please refer to the model repository.

