MarianMT Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation
Model Description
This is a bidirectional translation model fine-tuned from Helsinki-NLP/opus-mt-sem-sem for translating between Samaritan Hebrew (smp) and Samaritan Aramaic (sam). The model supports both translation directions using special language tags (>>smp<< and >>sam<<).
The model is based on the Marian architecture and was fine-tuned specifically for Samaritan Hebrew and Samaritan Aramaic, two closely related Semitic languages that are written here in Hebrew script.
Model Details
- Model Type: Seq2Seq (Marian)
- Base Model: Helsinki-NLP/opus-mt-sem-sem
- Languages: Samaritan Hebrew (smp) ↔ Samaritan Aramaic (sam)
- Direction: Bidirectional
- Vocabulary Size: 33,702 tokens (2 additional special tokens: >>smp<< and >>sam<<)
- Model Parameters: 61,918,208
- Input/Output Max Length: 313 tokens
Training Details
Training Configuration
- Training Epochs: 96.35 (of 100 planned, early stopping at step 29,000)
- Batch Size: 16 per device
- Effective Batch Size: 32 (with gradient accumulation)
- Learning Rate: 1e-5
- Warmup Steps: 1,000
- Weight Decay: 0.01
- Gradient Accumulation Steps: 2
- Optimization: AdamW with a cosine learning rate schedule with restarts
- Precision: bfloat16 (BF16)
- Training Time: ~47.8 minutes (2,866 seconds)
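The figures in this list fit together arithmetically. The sketch below recomputes the effective batch size and the epoch count at the stopping step. It assumes a single training device (the card does not state the device count) and takes the 9,610-pair training split size from the Dataset section; the variable names are illustrative only.

```python
import math

per_device_batch_size = 16
gradient_accumulation_steps = 2
num_devices = 1          # assumption: the device count is not stated in the card
train_examples = 9_610   # training split size from the Dataset section
stop_step = 29_000       # optimizer step at which early stopping triggered

# One optimizer step consumes this many training examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Optimizer steps needed to see the whole training split once.
steps_per_epoch = math.ceil(train_examples / effective_batch_size)

# Fractional number of epochs completed when training stopped.
epochs_at_stop = stop_step / steps_per_epoch

print(effective_batch_size)       # 32
print(steps_per_epoch)            # 301
print(round(epochs_at_stop, 2))   # 96.35
```

Under the single-device assumption, the result matches both the reported effective batch size of 32 and the reported 96.35 epochs.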
Dataset
- Train Split: 9,610 sentence pairs (4,805 original pairs, each in both directions)
- Validation Split: 1,080 sentence pairs (540 original pairs)
- Test Split: 108 sentence pairs (54 original pairs)
- Total Dataset: 10,798 sentence pairs (5,399 original pairs, each in both directions) from biblical parallel texts
- Format: Pipe-delimited CSV with columns: Book|Chapter|Verse|Samaritan|Aramaic
- Script: Hebrew script for both languages
The dataset contains parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic (Targumic), with both directions included in the training data to enable bidirectional translation.
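As a rough illustration of how one row of the pipe-delimited file could become two training examples, here is a minimal sketch. It assumes the file starts with the header row Book|Chapter|Verse|Samaritan|Aramaic and that fields are not quoted. The function name and the sample rows are hypothetical, and the placeholder strings stand in for real Hebrew-script verses.

```python
import csv
import io

# Two made-up rows with placeholder text; the real file holds Hebrew-script verses.
sample = io.StringIO(
    "Book|Chapter|Verse|Samaritan|Aramaic\n"
    "Genesis|1|1|smp text 1|sam text 1\n"
    "Genesis|1|2|smp text 2|sam text 2\n"
)


def build_bidirectional_pairs(handle):
    """Return (tagged source, target) pairs, two per input row."""
    pairs = []
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        smp = row["Samaritan"].strip()
        sam = row["Aramaic"].strip()
        pairs.append((f">>smp<< {smp}", sam))  # Samaritan Hebrew -> Samaritan Aramaic
        pairs.append((f">>sam<< {sam}", smp))  # Samaritan Aramaic -> Samaritan Hebrew
    return pairs


pairs = build_bidirectional_pairs(sample)
print(len(pairs))  # 4
print(pairs[0])    # ('>>smp<< smp text 1', 'sam text 1')
```

Applied to the 4,805 original training rows, this kind of doubling would yield the 9,610 training pairs listed above.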
Training Process
Training was conducted with:
- Early stopping patience: 5 evaluation steps
- Evaluation every 500 steps
- Best model checkpoint: checkpoint-26500 (BLEU: 60.48)
- Final checkpoint: checkpoint-29000 (BLEU: 59.72 after 96.35 epochs)
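The relationship between the best checkpoint and the final checkpoint follows from the patience setting: five evaluations of 500 steps each after step 26,500 lead to step 29,000. The sketch below simulates a generic patience-based stopping rule to show this. It is not the trainer's actual code, and every BLEU value other than the two reported in this card (60.48 and 59.72) is a hypothetical placeholder.

```python
def simulate_early_stopping(history, patience):
    """Return (stop_step, best_step) for a list of (step, metric) evaluations.

    Training stops once `patience` consecutive evaluations fail to beat
    the best metric seen so far.
    """
    best_score = float("-inf")
    best_step = None
    evals_without_improvement = 0
    for step, score in history:
        if score > best_score:
            best_score, best_step = score, step
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return step, best_step
    return None, best_step


history = [
    (26_000, 60.10),  # hypothetical value
    (26_500, 60.48),  # reported best checkpoint
    (27_000, 60.20),  # hypothetical value
    (27_500, 60.05),  # hypothetical value
    (28_000, 59.90),  # hypothetical value
    (28_500, 59.80),  # hypothetical value
    (29_000, 59.72),  # reported final checkpoint
]

stop_step, best_step = simulate_early_stopping(history, patience=5)
print(stop_step, best_step)  # 29000 26500
```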
Performance
Evaluation Metrics (Test Set)
- BLEU Score: 59.72 (best: 60.48 at checkpoint-26500)
- chrF Score: 77.91
- Character Accuracy: 51.09%
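The card does not define how Character Accuracy is computed. One plausible reading, assumed here purely for illustration, is a position-by-position comparison of hypothesis and reference characters. The function below is a sketch of that assumed definition, not the evaluation code used for the reported 51.09%.

```python
def char_accuracy(hypothesis, reference):
    """Fraction of aligned positions whose characters match (assumed definition)."""
    longest = max(len(hypothesis), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    matches = sum(h == r for h, r in zip(hypothesis, reference))
    return matches / longest


# One substituted character: 5 of 6 positions still match.
print(char_accuracy("abxdef", "abcdef"))

# One inserted character at the start shifts every later position.
print(char_accuracy("xabcdef", "abcdef"))  # 0.0
```

If the metric works this way, a single inserted or dropped character misaligns the rest of the sentence. That could explain why the score sits well below chrF, which is based on character n-gram overlap and tolerates such shifts.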
Training Metrics
- Final Training Loss: 0.722
- Final Evaluation Loss: 0.825
- Best BLEU (validation): 60.48 at step 26,500
Usage
Direct Inference with Transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam")
model = AutoModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam")
# Translate from Samaritan Hebrew to Samaritan Aramaic
text_smp = "אחר הדברים האלה היה דבר יהוה אל אברם"
input_text = f">>smp<< {text_smp}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Translate from Samaritan Aramaic to Samaritan Hebrew
text_sam = "ืืชืจ ืืืืืื ืืืื ืืื ืืื ืืืื ืขื ืืืจื"
input_text = f">>sam<< {text_sam}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
Using the Inference Script
The model can also be run with the inference script inference_smp_sam.py, if it is available in the repository:
python inference_smp_sam.py \
--model_path johnlockejrr/marianmt-smp-sam \
--text "אחר הדברים האלה היה דבר יהוה אל אברם" \
--direction smp2sam
Language Tags
The model uses special language tags to indicate translation direction:
- >>smp<<: Prefix for Samaritan Hebrew (source) → Samaritan Aramaic (target)
- >>sam<<: Prefix for Samaritan Aramaic (source) → Samaritan Hebrew (target)
These tags must be included at the beginning of the input text for proper direction control.
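A small helper can keep the tag handling in one place. The sketch below maps a direction name to its tag and prepends it to the input. The direction name smp2sam comes from the inference script invocation above; sam2smp is assumed by analogy, and the helper name is hypothetical.

```python
# Direction names mapped to the source-language tag the model expects.
DIRECTION_TAGS = {
    "smp2sam": ">>smp<<",  # Samaritan Hebrew -> Samaritan Aramaic
    "sam2smp": ">>sam<<",  # assumed name for the reverse direction
}


def add_direction_tag(text, direction):
    """Prefix `text` with the language tag for the requested direction."""
    try:
        tag = DIRECTION_TAGS[direction]
    except KeyError:
        raise ValueError(f"unknown direction: {direction!r}") from None
    return f"{tag} {text.strip()}"


print(add_direction_tag("  some input sentence ", "smp2sam"))
# >>smp<< some input sentence
```

The returned string can be passed to the tokenizer in place of the hand-built f-string in the inference example.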
Limitations and Considerations
- Domain Specificity: The model was trained primarily on biblical texts and may perform better on similar religious or historical texts than on other domains.
- Script Normalization: Input texts may need normalization (removal of diacritics/niqqud) depending on your use case.
- Length Constraints: Maximum sequence length is 313 tokens. Longer texts will be truncated.
- Character Accuracy: At 51.09%, character-level accuracy indicates room for improvement, though BLEU and chrF scores suggest reasonable translation quality.
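For the Script Normalization item above, one possible preprocessing step is to strip Hebrew vowel points and cantillation marks before translation. The sketch below uses only the standard library. Whether this matches the normalization applied to the training data is an assumption, and the function name is hypothetical.

```python
import unicodedata


def strip_hebrew_points(text):
    """Remove Hebrew combining marks (niqqud and cantillation) from `text`."""
    # Decompose first so precomposed presentation forms expose their marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch
        for ch in decomposed
        # U+0591..U+05C7 holds the Hebrew points; "Mn" restricts removal to
        # combining marks, so punctuation such as maqaf (U+05BE) is kept.
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    ]
    return unicodedata.normalize("NFC", "".join(kept))


pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"  # "shalom" with vowel points
plain = strip_hebrew_points(pointed)
print(plain == "\u05e9\u05dc\u05d5\u05dd")  # True
```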
Citation
If you use this model, please cite:
@misc{marianmt-smp-sam,
title={MarianMT Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation},
author={johnlockejrr},
year={2025},
howpublished={\url{https://huggingface.co/johnlockejrr/marianmt-smp-sam}}
}
Acknowledgments
- Base model: Helsinki-NLP/opus-mt-sem-sem
- Training framework: Hugging Face Transformers
- Dataset: Parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic
Model Card Contact
For questions, issues, or contributions, please refer to the model repository.