MarianMT Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation

Model Description

This is a bidirectional translation model fine-tuned from Helsinki-NLP/opus-mt-sem-sem for translating between Samaritan Hebrew (smp) and Samaritan Aramaic (sam). The model supports both translation directions using special language tags (>>smp<< and >>sam<<).

The model is based on the Marian architecture and was fine-tuned specifically for the Samaritan Hebrew and Samaritan Aramaic pair, two closely related Semitic languages written in the Hebrew script.

Model Details

Model Type: Seq2Seq (Marian)
Base Model: Helsinki-NLP/opus-mt-sem-sem
Languages: Samaritan Hebrew (smp) ↔ Samaritan Aramaic (sam)
Direction: Bidirectional
Vocabulary Size: 33,702 tokens (2 additional special tokens: >>smp<< and >>sam<<)
Model Parameters: 61,918,208
Input/Output Max Length: 313 tokens

Training Details

Training Configuration

Training Epochs: 96.35 (of 100 planned, early stopping at step 29,000)
Batch Size: 16 per device
Effective Batch Size: 32 (with gradient accumulation)
Learning Rate: 1e-5
Warmup Steps: 1,000
Weight Decay: 0.01
Gradient Accumulation Steps: 2
Optimization: AdamW with a cosine learning-rate schedule with restarts
Precision: bfloat16 (BF16)
Training Time: ~47.8 minutes (2,866 seconds)
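
The hyperparameters above map fairly directly onto keyword arguments of Hugging Face's Seq2SeqTrainingArguments. The dictionary below is a sketch of that mapping, not the author's actual training script: the argument names are standard Transformers names, but how the original run was configured (including the single-device setup implied by the effective batch size) is an assumption.

```python
# Hyperparameters from the list above, keyed by the corresponding
# Seq2SeqTrainingArguments argument names (an assumed mapping).
training_kwargs = {
    "num_train_epochs": 100,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-5,
    "warmup_steps": 1000,
    "weight_decay": 0.01,
    "lr_scheduler_type": "cosine_with_restarts",
    "bf16": True,
}

# Effective batch size, assuming training ran on a single device.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 32
```

A dictionary like this could be passed as Seq2SeqTrainingArguments(output_dir=..., **training_kwargs); AdamW is the Trainer's default optimizer, so it needs no extra argument.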

Dataset

Train Split: 9,610 sentence pairs (4,805 original pairs, each used in both directions)
Validation Split: 1,080 sentence pairs (540 original pairs)
Test Split: 108 sentence pairs (54 original pairs)
Total Dataset: 10,798 directional sentence pairs (5,399 original pairs) from biblical parallel texts
Format: Pipe-delimited CSV with columns: Book|Chapter|Verse|Samaritan|Aramaic
Script: Hebrew script for both languages

The dataset contains parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic (Targumic), with both directions included in the training data to enable bidirectional translation.
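
As a rough illustration of how a pipe-delimited file in this layout could be expanded into bidirectional training examples, here is a minimal sketch. The sample rows are placeholders rather than real dataset content, and the function name make_bidirectional and the source/target field names are hypothetical; the card does not describe the actual preprocessing code.

```python
import csv
import io

# Two placeholder rows in the pipe-delimited layout described above.
# The references and texts are illustrative, not real dataset rows.
SAMPLE = """Book|Chapter|Verse|Samaritan|Aramaic
BookA|1|1|smp text one|sam text one
BookA|1|2|smp text two|sam text two
"""


def make_bidirectional(rows):
    """Turn each original pair into two tagged training examples."""
    examples = []
    for row in rows:
        smp = row["Samaritan"].strip()
        sam = row["Aramaic"].strip()
        # >>smp<< marks Samaritan Hebrew input, >>sam<< marks Samaritan Aramaic input.
        examples.append({"source": f">>smp<< {smp}", "target": sam})
        examples.append({"source": f">>sam<< {sam}", "target": smp})
    return examples


reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")
examples = make_bidirectional(reader)
print(len(examples))  # 4: two examples per original pair
```

Applied to the 4,805 original training pairs, this doubling would give the 9,610 examples reported for the train split.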

Training Process

Training was conducted with:

Early stopping patience: 5 evaluation steps
Evaluation every 500 steps
Best model checkpoint: checkpoint-26500 (BLEU: 60.48)
Final checkpoint: checkpoint-29000 (BLEU: 59.72 after 96.35 epochs)
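
The reported step counts, epoch count, and training time are consistent with one another, which the following arithmetic sketch checks. It assumes a single training device and that a partial final batch still counts as an optimizer step; neither detail is stated in the card.

```python
import math

train_examples = 9_610
per_device_batch = 16
grad_accum = 2
eval_every = 500
patience = 5
best_step = 26_500
train_seconds = 2_866

# Optimizer steps per epoch (assumes one device and a counted partial batch).
batches_per_epoch = math.ceil(train_examples / per_device_batch)  # 601
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)       # 301

# Early stopping fires after `patience` evaluations without improvement.
stop_step = best_step + patience * eval_every                     # 29000
epochs_at_stop = round(stop_step / steps_per_epoch, 2)            # 96.35
minutes = round(train_seconds / 60, 1)                            # 47.8

print(stop_step, epochs_at_stop, minutes)
```

Five evaluations after checkpoint-26500 without a better BLEU score lead to the stop at step 29,000, or about 96.35 epochs.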

Performance

Evaluation Metrics (Test Set)

BLEU Score: 59.72 at the final checkpoint (best: 60.48 at checkpoint-26500)
chrF Score: 77.91
Character Accuracy: 51.09%
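
The card does not say how character accuracy was computed. One common and strict definition compares the reference and the hypothesis position by position; the sketch below assumes that definition purely for illustration, and the function name char_accuracy is hypothetical.

```python
def char_accuracy(reference: str, hypothesis: str) -> float:
    """Share of positions where the two strings have the same character."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 1.0
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / longest


print(char_accuracy("abcd", "abcf"))       # 0.75: one substitution
print(char_accuracy("abcdef", "xabcdef"))  # 0.0: one insertion shifts every position
```

Under a position-wise definition like this, a single inserted or dropped character misaligns everything after it, so the score can be much lower than n-gram metrics such as chrF for the same output.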

Training Metrics

Final Training Loss: 0.722
Final Evaluation Loss: 0.825
Best BLEU (validation): 60.48 at step 26,500

Usage

Direct Inference with Transformers

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam")
model = AutoModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam")

# Translate from Samaritan Hebrew to Samaritan Aramaic
text_smp = "אחר הדברים האלה היה דבר יהוה אל אברם"
input_text = f">>smp<< {text_smp}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)

# Translate from Samaritan Aramaic to Samaritan Hebrew
text_sam = "בתר ממלליה אלין הוה מלל יהוה עם אברם"
input_text = f">>sam<< {text_sam}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
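
The two examples above differ only in the tag and the input text, so they can be folded into one helper that also handles several sentences at once. This is a sketch rather than part of the repository: translate_batch is a hypothetical name, and it expects the tokenizer and model objects loaded above to be passed in.

```python
def translate_batch(texts, tag, tokenizer, model, max_length=313):
    """Translate a list of sentences in one direction.

    `tag` is ">>smp<<" or ">>sam<<"; `tokenizer` and `model` are the
    objects loaded in the example above.
    """
    tagged = [f"{tag} {text}" for text in texts]
    inputs = tokenizer(
        tagged,
        return_tensors="pt",
        padding=True,      # pad to the longest sentence in the batch
        truncation=True,
        max_length=max_length,
    )
    outputs = model.generate(
        **inputs, max_length=max_length, num_beams=4, length_penalty=0.6
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

For example, translate_batch([text_smp], ">>smp<<", tokenizer, model) would return a one-element list with the Samaritan Aramaic output.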

Using the Inference Script

The model can also be used with the provided inference script (if available in the repository):

python inference_smp_sam.py \
  --model_path johnlockejrr/marianmt-smp-sam \
  --text "אחר הדברים האלה היה דבר יהוה אל אברם" \
  --direction smp2sam

Language Tags

The model uses special language tags to indicate translation direction:

>>smp<< - Prefix for Samaritan Hebrew (source) → Samaritan Aramaic (target)
>>sam<< - Prefix for Samaritan Aramaic (source) → Samaritan Hebrew (target)

These tags must be included at the beginning of the input text for proper direction control.
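
A small helper can make the tag handling less error-prone. The sketch below maps a direction name to its tag and prepends it; "smp2sam" mirrors the --direction value shown for the inference script, while "sam2smp" and the function name add_language_tag are assumptions made for illustration.

```python
# Maps a direction name to the tag the model expects at the start of the input.
# "smp2sam" matches the inference script's --direction value; "sam2smp" is
# assumed by symmetry.
DIRECTION_TAGS = {"smp2sam": ">>smp<<", "sam2smp": ">>sam<<"}


def add_language_tag(text: str, direction: str) -> str:
    """Prefix `text` with the tag for `direction`, without double-tagging."""
    if direction not in DIRECTION_TAGS:
        raise ValueError(f"unknown direction: {direction!r}")
    tag = DIRECTION_TAGS[direction]
    stripped = text.strip()
    # Leave input alone if it already carries the right prefix.
    if stripped.startswith(tag):
        return stripped
    return f"{tag} {stripped}"


print(add_language_tag("some input", "smp2sam"))  # >>smp<< some input
```

Rejecting unknown directions early avoids silently sending untagged text to the model.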

Limitations and Considerations

Domain Specificity: The model was trained primarily on biblical texts, so it is likely to perform best on similar religious or historical texts and may degrade on other domains.
Script Normalization: Input texts may need normalization (removal of diacritics/niqqud) depending on your use case.
Length Constraints: Maximum sequence length is 313 tokens. Longer texts will be truncated.
Character Accuracy: At 51.09%, character-level accuracy indicates room for improvement, though BLEU and chrF scores suggest reasonable translation quality.
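
For the script normalization point, one possible way to remove niqqud and cantillation marks is to drop Unicode combining marks after canonical decomposition. This is a generic sketch using only the standard library; whether the training data was normalized this way is not stated in the card, and strip_diacritics is a hypothetical name.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (niqqud, cantillation) and keep base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Pointed form of the word בראשית, written with explicit code points.
pointed = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05c1\u05b4\u05d9\u05ea"
print(strip_diacritics(pointed) == "\u05d1\u05e8\u05d0\u05e9\u05d9\u05ea")  # True
```

Note that this removes nonspacing marks from any script, not only Hebrew, so it suits input that is known to be Hebrew-script text.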

Citation

If you use this model, please cite:

@misc{marianmt-smp-sam,
  title={MarianMT Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation},
  author={johnlockejrr},
  year={2025},
  howpublished={\url{https://huggingface.co/johnlockejrr/marianmt-smp-sam}}
}

Acknowledgments

Base model: Helsinki-NLP/opus-mt-sem-sem
Training framework: Hugging Face Transformers
Dataset: Parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic

Model Card Contact

For questions, issues, or contributions, please refer to the model repository.
