asr-whisper-largev2-v5
This model is a domain-adapted version of openchs/asr-whisper-helpline-sw-v1 fine-tuned on real phone call recordings from the Tanzania Child Helpline system, powered by OpenCHS.
Model Description
This ASR model represents domain-specific fine-tuning to bridge the gap between clean, read speech (Common Voice) and real-world telephony audio. While the base model achieved 23.56% WER on Common Voice validation data, this model is optimized for actual call center environments with authentic phone call audio quality challenges.
Key Characteristics:
- Domain: Child helpline phone call transcription (Swahili)
- Best Checkpoint: Step 4,500
- Validation WER: 45.74% on real phone call audio
- Validation Loss: 1.1333
- Training Dataset: Custom Swahili ASR v6 (31,720 augmented samples from real calls)
Performance Context: The higher WER compared to the base model (45.74% vs 23.56%) reflects the significant domain shift from clean Common Voice recordings to real telephony audio. This is expected and represents realistic performance on production audio with:
- Telephone bandwidth limitations (8kHz → upsampled)
- Background noise and cross-talk
- Natural conversational speech (vs. read speech)
- Various audio quality conditions
- Authentic Tanzanian Swahili dialects and speaking styles
Training Strategy
Three-Stage Training Pipeline:
- Stage 1 - Common Voice 17.0: Initial fine-tuning from Whisper Large v2 (10,000 steps)
- Stage 2 - Common Voice 23.0: Continued training on updated Common Voice data (7,500 steps → 23.56% WER)
- Stage 3 (This Model) - Real Phone Calls: Domain adaptation on actual helpline recordings (4,500 steps → 45.74% WER on telephony)
This model represents Stage 3 with domain-specific optimization for production deployment.
Intended Uses & Limitations
Intended Uses
Primary:
- Transcribing Swahili speech in Tanzania Child Helpline call center environments
- Real-time or batch processing of telephony audio (8kHz phone quality)
- Production ASR system for helpline service documentation and analytics
Secondary:
- General Swahili ASR for telephony/call center applications
- Research baseline for domain adaptation studies (clean speech → telephony)
- Transfer learning base for similar low-resource telephony ASR tasks
Key Improvements Over Base Model
- Domain Adaptation: Fine-tuned on 31,720+ augmented samples from real phone calls
- Telephony Robustness: Optimized for phone bandwidth (8kHz) and call quality variations
- Dialect Coverage: Trained on authentic Tanzanian Swahili dialects from real conversations
- Production Ready: Validated on actual helpline audio (not just clean datasets)
Limitations
Domain-Specific Vocabulary:
- Optimized for child helpline and healthcare-related conversations
- May underperform on technical, legal, or specialized domains outside training data scope
Dialect Specificity:
- Best performance on Tanzanian Swahili dialects represented in training data
- May have reduced accuracy on coastal, northern, or other regional variants not well-represented
Audio Quality Requirements:
- Designed for telephony (8kHz-16kHz), may need retuning for high-fidelity audio
- Performance degrades with severe background noise or very poor connections (though trained on augmented noisy data)
Code-Switching:
- Limited handling of Swahili-English code-switching common in urban Tanzania
- May struggle with mixed-language utterances
Model Size:
- Large model (Whisper Large v2 architecture) requires GPU for real-time transcription
- Consider quantization or distillation for edge deployment
Training and Evaluation Data
Custom Swahili ASR v6 Dataset
Source: Real phone call recordings from Tanzania Child Helpline operations
Dataset Composition:
- Training: 31,720 augmented samples
- Real call recordings with authentic telephony characteristics
- Data augmentation: background noise, speed variations, pitch shifts
- Validation: 1,813 samples (500 used during training for faster evaluation)
- Test: 907 samples (reserved for final evaluation)
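As an illustration of the background-noise augmentation listed above, the sketch below mixes a noise clip into speech at a chosen signal-to-noise ratio. The function name `mix_at_snr` and the 20 dB value are assumptions made for this example; the card does not describe the augmentation pipeline actually used for the v6 dataset.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling the noise to the requested SNR (dB).

    Both inputs are equal-length sequences of float samples with non-zero power.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Solve SNR_dB = 10 * log10(speech_power / (scale**2 * noise_power)) for scale
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

# Toy example (made-up samples): noise mixed in 20 dB below the speech level
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=20)
```

Lower SNR values give noisier training examples; speed and pitch changes would need a signal-processing library and are not shown here.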
Data Format:
- Custom TSV format with audio paths, transcriptions, and duration metadata
- Audio: Telephony quality (primarily 8kHz, upsampled to 16kHz for Whisper)
- Language: Tanzanian Swahili with natural conversational characteristics
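A manifest in this kind of TSV layout could be read roughly as follows. The column names `path`, `sentence`, and `duration` are hypothetical, since the card does not document the actual header, and the rows are invented for the example.

```python
import csv
import io

def read_manifest(tsv_file):
    """Yield (audio_path, transcription, duration_seconds) from a TSV manifest."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        yield row["path"], row["sentence"], float(row["duration"])

# In-memory stand-in for a real manifest file (contents are made up)
sample = io.StringIO(
    "path\tsentence\tduration\n"
    "calls/0001.wav\thabari ya asubuhi\t2.4\n"
    "calls/0002.wav\tnaomba msaada\t1.8\n"
)
rows = list(read_manifest(sample))
total_seconds = sum(duration for _, _, duration in rows)
```

Keeping the duration in the manifest makes it cheap to filter out clips that exceed Whisper's 30-second window before any audio is loaded.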
Data Characteristics:
- Real-world phone call audio (not studio recordings)
- Natural conversational speech patterns
- Authentic background noise and acoustic conditions
- Diverse speaker demographics
- Domain-relevant vocabulary (child welfare, healthcare)
Preprocessing:
- Audio normalization and resampling to 16kHz (Whisper requirement)
- Standard Whisper feature extraction (80-channel log-mel spectrogram)
- Text normalization for Swahili
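To make the resampling and normalization steps concrete, here is a deliberately naive, dependency-free sketch. It doubles the sample rate by linear interpolation and peak-normalizes the result. The helper names are invented for this example, and a real pipeline would more likely use a band-limited resampler (for instance via `librosa.load(..., sr=16000)` as in the Usage section).

```python
def upsample_2x(samples):
    """Double the sample rate (e.g. 8 kHz -> 16 kHz) by linear interpolation."""
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) / 2)  # midpoint between neighbours
    out.extend([samples[-1], samples[-1]])  # hold the last sample
    return out

def peak_normalize(samples, peak=1.0):
    """Scale samples so the largest absolute value equals `peak`."""
    largest = max((abs(s) for s in samples), default=0.0)
    if largest == 0.0:
        return list(samples)
    return [s * peak / largest for s in samples]

# Three made-up 8 kHz samples become six normalized 16 kHz samples
audio_8k = [0.0, 0.5, -0.25]
audio_16k = peak_normalize(upsample_2x(audio_8k))
```

Upsampling does not restore the frequencies removed by the telephone channel; it only brings the audio to the sample rate the Whisper feature extractor expects.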
Training Procedure
Training Hyperparameters
Optimization:
- learning_rate: 1e-05
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 500
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
- max_training_steps: 12,000 (stopped at 6,000, best at 4,500)
- seed: 42
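The shape of this schedule can be sketched in plain Python. The function below follows the usual linear-warmup plus cosine-with-hard-restarts formula. `num_cycles=1` is an assumption (the card does not state it), and `total_steps=12_000` comes from `max_training_steps`.

```python
import math

def lr_at_step(step, base_lr=1e-5, warmup_steps=500, total_steps=12_000, num_cycles=1):
    """Learning rate for linear warmup followed by cosine decay with hard restarts."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle decays from base_lr towards 0, then restarts at base_lr
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return base_lr * max(0.0, cosine)

# Print the learning rate at a few steps of interest
for step in (0, 250, 500, 4_500, 6_000, 12_000):
    print(step, f"{lr_at_step(step):.2e}")
```

Because the schedule is defined over the full 12,000 steps, a run that stops early at step 6,000 ends while the learning rate is still well above zero.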
Batch Configuration:
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 1
- Effective batch size: 16
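The relationship between these settings and the number of optimizer steps per pass over the data can be checked with a few lines of arithmetic. The single-device assumption is inferred from the reported effective batch size and is labeled as such in the code.

```python
import math

TRAIN_SAMPLES = 31_720      # training set size reported in this card
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 1
NUM_DEVICES = 1             # assumption: a single GPU

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)

def samples_seen(step):
    """Number of training samples consumed after `step` optimizer steps."""
    return step * effective_batch

print(effective_batch, steps_per_epoch, samples_seen(4_500))
```

With these numbers one pass over the training set takes just under 2,000 steps, which matches the point where the Epoch column in the results table rolls over from 0.x to 1.x.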
Memory Optimization:
- gradient_checkpointing: true (with use_reentrant=False)
- mixed_precision_training: Native AMP (FP16)
- dataloader_num_workers: 2
Evaluation & Checkpointing:
- evaluation_strategy: steps
- eval_steps: 500
- save_steps: 500
- logging_steps: 50
- save_total_limit: 3
Best Model Selection:
- load_best_model_at_end: true
- metric_for_best_model: "wer"
- greater_is_better: false
- early_stopping_patience: 3 evaluations (1,500 steps)
Infrastructure:
- GPU: RunPod A40 (48GB VRAM)
- Training time: ~6.5 hours for 6,000 steps
- Checkpoint size: ~3GB per checkpoint
Training Results
| Training Loss | Epoch | Step | Validation Loss | WER | Notes |
|---|---|---|---|---|---|
| 0.9509 | 0.0417 | 500 | 0.8714 | 49.7126 | Initial adaptation |
| 0.6505 | 0.0833 | 1000 | 0.8277 | 52.6501 | |
| 0.4923 | 0.125 | 1500 | 0.8766 | 50.7503 | |
| 0.3597 | 1.0014 | 2000 | 0.9145 | 48.4994 | |
| 0.2188 | 1.0431 | 2500 | 0.9662 | 48.4036 | |
| 0.1351 | 1.0848 | 3000 | 1.0237 | 46.5358 | |
| 0.1057 | 1.1264 | 3500 | 1.0614 | 47.3819 | |
| 0.0839 | 2.0028 | 4000 | 1.1110 | 46.6156 | |
| 0.0541 | 2.0445 | 4500 | 1.1333 | 45.7375 | Best checkpoint |
| 0.0411 | 2.0862 | 5000 | 1.1670 | 47.1264 | Performance degradation |
| 0.0321 | 2.1278 | 5500 | 1.1806 | 46.5358 | |
| 0.0243 | 3.0042 | 6000 | 1.2159 | 46.8870 | Overfitting signs |
Training Observations:
- Convergence: Best WER achieved at step 4,500 (45.74%)
- Early signs of overfitting: Validation loss rose from its minimum of 0.8277 at step 1,000 onward while training loss kept decreasing; WER still improved until step 4,500 and stayed above that level afterwards
- Model selection: Weights restored to step 4,500 checkpoint for optimal generalization
- Training curve: Overall improvement from 49.71% → 45.74% WER over the first 4,500 steps, though not monotonic (WER peaked at 52.65% at step 1,000)
Final Metrics (Step 4,500):
- Training loss: 0.0541
- Validation loss: 1.1333
- Validation WER: 45.74%
- Training time to reach step 4,500: ~4.7 hours (the full 6,000-step run took ~6.5 hours)
- Total training samples processed: ~72,000 (4,500 steps × batch size 16, about 2.3 passes over the 31,720 training samples)
Domain Adaptation Summary
| Stage | Dataset | WER | Domain |
|---|---|---|---|
| Stage 1 (Base) | Common Voice 17.0 | 23.62% | Clean read speech |
| Stage 2 (Base) | Common Voice 23.0 | 23.56% | Clean read speech |
| Stage 3 (This Model) | Real Phone Calls v6 | 45.74% | Telephony, conversational |
Domain Gap Analysis: The ~22 percentage point WER increase from Common Voice (23.56%) to real phone calls (45.74%) quantifies the domain adaptation challenge:
- Telephony bandwidth vs. full-bandwidth audio
- Conversational vs. read speech
- Real noise conditions vs. clean recordings
- Natural disfluencies vs. prepared text
This gap is expected and normal for production ASR systems deployed on telephony audio.
Performance Comparison
| Model | Test Domain | WER | Notes |
|---|---|---|---|
| Whisper Large v2 (zero-shot) | Common Voice 17.0 | 89.05% | Baseline |
| Base model (v1) - Stage 1 | Common Voice 17.0 | 23.62% | Clean speech tuning |
| Base model (v1) - Stage 2 | Common Voice 23.0 | 23.56% | Clean speech tuning |
| This model (v5) | Real phone calls | 45.74% | Telephony adaptation |
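Key Insight: Although the WER is higher, this model is optimized for the actual production domain (telephony), where the base model is expected to perform worse despite its lower WER on clean data. This card does not report a base-model WER on telephony audio, so the two figures are not a like-for-like comparison.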
Usage
Quick Start
from transformers import pipeline
# Load the model
pipe = pipeline("automatic-speech-recognition",
                model="openchs/asr-whisper-largev2-v5")
# Transcribe phone call audio
result = pipe("path/to/phone_call.wav")
print(result["text"])
Advanced Usage with Audio Preprocessing
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
# Load model and processor
processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-largev2-v5")
model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-largev2-v5")
# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Load and preprocess audio (handles telephony audio)
audio, sr = librosa.load("path/to/phone_call.wav", sr=16000, mono=True)
# Process audio
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)
# Generate transcription with language hint.
# Language and task are passed directly to generate(), the documented way to
# set them in recent Transformers releases (forced_decoder_ids is the legacy route).
predicted_ids = model.generate(
    input_features,
    language="sw",
    task="transcribe",
    max_length=448,
)
# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
Production Deployment Recommendations
Audio Requirements:
- Sample rate: 16kHz (model will work with 8kHz telephony audio upsampled to 16kHz)
- Format: Mono (single channel)
- Duration: Optimal <30 seconds per segment for memory efficiency
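One simple way to respect the 30-second guideline is to cut long calls into fixed-size windows before transcription. The helper below is a minimal sketch with an invented name. It splits on sample counts only, so it can cut through a word, which is why production systems often split on silence or use overlapping chunks instead.

```python
def split_into_segments(num_samples, sample_rate=16_000, max_seconds=30.0):
    """Return (start, end) sample indices covering the audio in chunks of at most `max_seconds`."""
    max_len = int(sample_rate * max_seconds)
    return [(start, min(start + max_len, num_samples))
            for start in range(0, num_samples, max_len)]

# A 70-second call at 16 kHz becomes three segments: 30 s, 30 s and 10 s
segments = split_into_segments(num_samples=70 * 16_000)
```

Each `(start, end)` pair can then be used to slice the loaded audio array before passing it to the model.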
Inference Optimization:
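import torch
from transformers import pipeline

# Load the pipeline in half precision (FP16) on the first GPU for faster inference
pipe = pipeline("automatic-speech-recognition",
                model="openchs/asr-whisper-largev2-v5",
                torch_dtype=torch.float16, device=0)
# Enable batch processing for multiple files (paths are placeholders)
audio_files = ["path/to/call_1.wav", "path/to/call_2.wav"]
results = pipe(audio_files, batch_size=8)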
Real-time Considerations:
- GPU required for real-time transcription (RTF < 1.0)
- CPU inference possible but slower (RTF of roughly 3-5 on modern CPUs, i.e. 3-5 times slower than real time)
- Consider model quantization for edge deployment
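The real-time factor (RTF) quoted above is processing time divided by audio duration. A small helper like the following could be used to measure it on target hardware. The function names and the example timings are made up, and `transcribe` stands for any callable that runs the model.

```python
import time

def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def measure_rtf(transcribe, audio, audio_seconds):
    """Time one call to `transcribe(audio)` and return its real-time factor."""
    start = time.perf_counter()
    transcribe(audio)
    return real_time_factor(time.perf_counter() - start, audio_seconds)

# Example with made-up timings: 12 s of compute for a 30 s segment
rtf = real_time_factor(12.0, 30.0)
```

Measuring over several representative calls, after a warm-up run, gives a more stable figure than a single timing.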
Evaluation Methodology
Validation Set:
- 500 samples randomly selected from 1,813-sample validation split
- Evaluated every 500 training steps
- Represents diverse call scenarios and speakers
WER Calculation:
- Standard Word Error Rate: (Substitutions + Deletions + Insertions) / Total Words
- Normalized text (lowercase, punctuation handling)
- Swahili-specific text normalization applied
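The WER definition above can be computed with a word-level edit distance. The sketch below is a self-contained illustration. Its `normalize` function only lowercases and strips punctuation (keeping apostrophes), which is a generic stand-in and not the Swahili-specific normalization used for the reported numbers.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation (generic stand-in, not the card's Swahili rules)."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev is the previous DP row: distances from a reference prefix
    # to every prefix of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against a four-word reference
score = wer("Habari ya asubuhi, mama.", "habari za asubuhi")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.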
Best Model Selection:
- Automatic selection based on lowest validation WER
- Early stopping after 3 evaluations without improvement
- Final model: Step 4,500 checkpoint
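The selection rule can be replayed against the validation WER values from the Training Results table. This is a simplified sketch of patience-based early stopping, not the trainer's actual callback, and the function name is invented for the example.

```python
# (step, validation WER in %) taken from the Training Results table
EVAL_HISTORY = [
    (500, 49.7126), (1000, 52.6501), (1500, 50.7503), (2000, 48.4994),
    (2500, 48.4036), (3000, 46.5358), (3500, 47.3819), (4000, 46.6156),
    (4500, 45.7375), (5000, 47.1264), (5500, 46.5358), (6000, 46.8870),
]

def select_checkpoint(history, patience=3):
    """Return (best_step, best_wer, stop_step) under lowest-WER early stopping."""
    best_step, best_wer, bad_evals = None, float("inf"), 0
    for step, wer in history:
        if wer < best_wer:
            best_step, best_wer, bad_evals = step, wer, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, best_wer, step
    return best_step, best_wer, history[-1][0]

best_step, best_wer, stop_step = select_checkpoint(EVAL_HISTORY)
```

Replaying the table this way selects step 4,500 as the best checkpoint and stops at step 6,000, in line with the run described in this card.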
Future Work
- Test set evaluation: Comprehensive evaluation on held-out 907-sample test set
- Code-switching support: Improve Swahili-English mixed utterance handling
- Model compression: Quantization and distillation for faster inference
- Streaming ASR: Adapt for real-time streaming transcription
- Dialect expansion: Include more regional Swahili variants
- Noise robustness: Further augmentation with extreme noise conditions
- Benchmark comparison: Evaluate against other Swahili ASR systems
Citation
If you use this model in your research or production systems, please cite:
@misc{openchs-swahili-asr-v5,
title={Domain-Adapted Swahili ASR for Tanzania Child Helpline Telephony},
author={OpenCHS Team},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/openchs/asr-whisper-largev2-v5}},
note={Fine-tuned from openchs/asr-whisper-helpline-sw-v1 on real phone call data}
}
Framework Versions
- Transformers: 4.56.2
- PyTorch: 2.8.0+cu128
- Datasets: 2.21.0
- Tokenizers: 0.22.1
License
Apache 2.0
Acknowledgments
- Base model: openchs/asr-whisper-helpline-sw-v1
- Foundation model: OpenAI Whisper Large v2
- Training infrastructure: RunPod (A40 GPU)
- Project: OpenCHS - Open Source Child Helpline System
- Data collection: Tanzania Child Helpline operations team
Model Status: Production Ready - Optimized for Tanzania Child Helpline telephony transcription
Last Updated: 2025-11-17 (Checkpoint 4,500 restored as best performing model)