---
title: asr-multi-model
emoji: 😊
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
---


# 🎤 Multi-Model Speech Recognition (ASR)

An Automatic Speech Recognition (ASR) application that transcribes recorded or uploaded audio with a choice of Wav2Vec2 and Whisper models, and scores the transcription against a reference text using Word Error Rate (WER).

## ✨ Features

- **🤖 Multiple Models**: Support for Wav2Vec2 and Whisper models
- **🎤 Audio Recording**: Direct microphone recording
- **📁 File Upload**: Support for various audio formats
- **📊 WER Analysis**: Calculate Word Error Rate with detailed breakdown
- **💾 Memory Efficient**: Dynamic model loading and cleanup
- **🌍 Multilingual**: Whisper models support multiple languages

## 🚀 Live Demo

This application is hosted on Hugging Face Spaces. You can access it at:
[Your HF Spaces URL]

## 🤖 Available Models

| Model | Type | Speed | Accuracy | Memory (approx.) | Best For |
|-------|------|-------|----------|------------------|----------|
| Wav2Vec2 Base (100h) | Wav2Vec2 | ⚡ Fast | 🟡 Good | ~300 MB | Basic tasks |
| Wav2Vec2 Base (960h) | Wav2Vec2 | 🟡 Balanced | 🟢 Better | ~1 GB | General use |
| Wav2Vec2 Large (960h) | Wav2Vec2 | 🐌 Slower | 🟢 High | ~3 GB | Difficult audio |
| Whisper Large V3 Turbo | Whisper | 🐌 Slower | 🟢 Best | ~5 GB | Multilingual |

## 📖 How to Use

1. **Select Model**: Choose from available Wav2Vec2 and Whisper models
2. **Load Model**: Click 'Load Model' to load the selected model
3. **Record/Upload**: Record audio or upload an audio file
4. **Transcribe**: Click 'Transcribe' or wait for auto-transcription
5. **WER Analysis**: Enter reference text to calculate Word Error Rate
6. **Copy Text**: Use 'Copy Text' to copy the result

## 🔧 Technical Details

### Models Used
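- **Wav2Vec2**: Facebook AI's (now Meta AI) speech recognition models, pre-trained with self-supervised learning; the 100h/960h labels refer to the hours of labeled speech used for fine-tuning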
- **Whisper**: OpenAI's multilingual speech recognition model
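
The dynamic model loading and cleanup listed under Features could be organized roughly as follows. This is a minimal sketch, not the code in `app.py`: `ModelManager` and the loader callables are hypothetical names, and the stand-in loaders return plain dictionaries where a real app might wrap something like `transformers.pipeline("automatic-speech-recognition", model=...)`.

```python
import gc
from typing import Any, Callable, Dict, Optional


class ModelManager:
    """Keeps at most one model in memory at a time (hypothetical helper)."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]):
        # Maps a display name to a zero-argument function that builds the model.
        self._loaders = loaders
        self.current_name: Optional[str] = None
        self.current_model: Any = None

    def load(self, name: str) -> Any:
        if name not in self._loaders:
            raise KeyError(f"Unknown model: {name}")
        # Reuse the model if it is already loaded.
        if name == self.current_name:
            return self.current_model
        # Drop the previous model before building the next one,
        # so two large models are never resident together.
        self.unload()
        self.current_model = self._loaders[name]()
        self.current_name = name
        return self.current_model

    def unload(self) -> None:
        self.current_model = None
        self.current_name = None
        gc.collect()
        # Assumption: on a GPU, torch.cuda.empty_cache() could also be called here.


# Stand-in loaders for illustration only.
manager = ModelManager({
    "wav2vec2-base": lambda: {"name": "wav2vec2-base"},
    "whisper-turbo": lambda: {"name": "whisper-turbo"},
})
model = manager.load("wav2vec2-base")
model = manager.load("whisper-turbo")  # the first model is released before this loads
```

Unloading before loading trades a short wait on every model switch for a lower peak memory footprint, which matters on small Spaces hardware.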

### Audio Processing
- Automatic resampling to 16 kHz, the input rate both Wav2Vec2 and Whisper expect
- Mono conversion for stereo audio
- Audio normalization
- Support for various formats (MP3, WAV, M4A, FLAC)
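
The steps above can be sketched with NumPy alone. This is an illustration under assumptions, not the app's actual pipeline: `preprocess` is a hypothetical function, peak normalization is one possible reading of "audio normalization", and linear interpolation stands in for a proper band-limited resampler such as `librosa.resample`.

```python
import numpy as np

TARGET_SR = 16_000  # target sample rate from the list above


def preprocess(audio: np.ndarray, sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Return mono, resampled, peak-normalized float32 audio.

    `audio` has shape (samples,) or (samples, channels).
    """
    audio = np.asarray(audio, dtype=np.float32)

    # Stereo (or multi-channel) to mono: average the channels.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)

    # Resample by linear interpolation (a simplification, see the note above).
    if sr != target_sr and audio.size > 0:
        n_out = int(round(audio.size * target_sr / sr))
        old_t = np.arange(audio.size) / sr
        new_t = np.arange(n_out) / target_sr
        audio = np.interp(new_t, old_t, audio).astype(np.float32)

    # Peak normalization to [-1, 1]; silent input is left untouched
    # to avoid dividing by zero.
    peak = np.max(np.abs(audio)) if audio.size else 0.0
    if peak > 0:
        audio = audio / peak
    return audio


# One second of quiet stereo audio at 44.1 kHz.
t = np.linspace(0.0, 1.0, 44_100, endpoint=False)
stereo = 0.25 * np.stack([np.sin(2 * np.pi * 440 * t)] * 2, axis=1)
mono_16k = preprocess(stereo, sr=44_100)
```

Decoding MP3, M4A, or FLAC files into an array is left out here; a library such as `librosa` or `soundfile` would handle that step before `preprocess` is called.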

### WER Calculation
- Uses word-level edit distance (Levenshtein) to align the transcription with the reference text
- Normalizes both texts before comparison (lowercased, punctuation removed)
- Provides detailed breakdown of insertions, deletions, and substitutions
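
The calculation described above can be sketched in plain Python. This is a hedged illustration rather than the app's implementation: `normalize` and `wer_breakdown` are hypothetical names, the punctuation rule is an assumption, and the dynamic-programming table is written out by hand so the error types can be recovered by backtracing.

```python
import re
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split into words (assumed rule)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def wer_breakdown(reference: str, hypothesis: str) -> Dict[str, float]:
    ref, hyp = normalize(reference), normalize(hypothesis)
    n, m = len(ref), len(hyp)
    if n == 0:
        raise ValueError("reference text is empty after normalization")

    # dist[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (word missing from hypothesis)
                dist[i][j - 1] + 1,         # insertion (extra word in hypothesis)
            )

    # Walk back from the bottom-right corner to count each error type.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {
        "wer": (subs + dels + ins) / n,
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "correct": hits,
        "reference_words": n,
    }


result = wer_breakdown("The cat sat on the mat.", "the cat sit on mat")
```

For this pair, the normalized reference has six words and the alignment contains one substitution (`sat` to `sit`) and one deletion (`the`), so the WER is 2/6, about 33%.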

## 🛠️ Local Development

### Prerequisites
- Python 3.9+ (required by `torch>=2.6.0`)
- CUDA-compatible GPU (optional, for faster inference)

### Installation
```bash
git clone [your-repo-url]
cd [your-repo-name]
pip install -r requirements.txt
python app.py
```

### Requirements
```
gradio>=4.44.0
torch>=2.6.0
torchaudio>=2.6.0
transformers>=4.36.2
librosa>=0.10.1
soundfile>=0.12.1
numpy>=1.24.3
editdistance>=1.0.11
```

## 📊 WER Analysis

The application provides detailed Word Error Rate analysis:

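- **Word Error Rate**: (substitutions + deletions + insertions) divided by the number of words in the reference, shown as a percentage; it can exceed 100% when the transcription contains many extra words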
- **Error Breakdown**: Insertions, deletions, substitutions
- **Word Statistics**: Correct words, total words, accuracy
- **Normalized Texts**: Shows processed texts for verification

## 🎯 Performance Tips

- **Clear Speech**: Speak clearly for better accuracy
- **Quiet Environment**: Minimize background noise
- **Good Microphone**: Use quality audio input
- **Model Selection**: Pick a model from the Available Models table based on your speed, accuracy, and memory constraints

## 🤝 Contributing

Feel free to submit issues, feature requests, or pull requests to improve this application.

## 📝 License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- Hugging Face for the transformers library
- Facebook for Wav2Vec2 models
- OpenAI for Whisper models
- Gradio for the web interface framework