---
title: asr-multi-model
emoji: 😊
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
---

# 🎤 Multi-Model ASR Speech Recognition

An Automatic Speech Recognition (ASR) application that supports multiple models and Word Error Rate (WER) analysis.

## ✨ Features

- **🤖 Multiple Models**: Support for Wav2Vec2 and Whisper models
- **🎤 Audio Recording**: Direct microphone recording
- **📁 File Upload**: Support for MP3, WAV, M4A, and FLAC files
- **📊 WER Analysis**: Calculate Word Error Rate with a detailed breakdown
- **💾 Memory Efficient**: Dynamic model loading and cleanup
- **🌍 Multilingual**: Whisper models support multiple languages

## 🚀 Live Demo

This application is hosted on Hugging Face Spaces. You can access it at: [Your HF Spaces URL]

## 🤖 Available Models

| Model | Type | Speed | Accuracy | Memory | Best For |
|-------|------|-------|----------|--------|----------|
| Wav2Vec2 Base (100h) | Wav2Vec2 | ⚡ Fast | 🟡 Good | ~300MB | Basic tasks |
| Wav2Vec2 Base (960h) | Wav2Vec2 | 🟡 Balanced | 🟢 Better | ~1GB | General use |
| Wav2Vec2 Large (960h) | Wav2Vec2 | 🐌 Slower | 🔴 High | ~3GB | Difficult audio |
| Whisper Large V3 Turbo | Whisper | 🐌 Slower | 🔴 Best | ~5GB | Multilingual |

## 📖 How to Use

1. **Select Model**: Choose from the available Wav2Vec2 and Whisper models
2. **Load Model**: Click 'Load Model' to load the selected model
3. **Record/Upload**: Record audio or upload an audio file
4. **Transcribe**: Click 'Transcribe' or wait for auto-transcription
5. **WER Analysis**: Enter reference text to calculate Word Error Rate
6. **Copy Text**: Use 'Copy Text' to copy the result

## 🔧 Technical Details

### Models Used

- **Wav2Vec2**: Facebook's self-supervised speech recognition models
- **Whisper**: OpenAI's multilingual speech recognition model

### Audio Processing

- Automatic resampling to 16kHz
- Mono conversion for stereo audio
- Audio normalization
- Support for various formats (MP3, WAV, M4A, FLAC)

### WER Calculation

- Uses edit distance to align the transcription with the reference text
- Normalizes both texts first (lowercase, no punctuation)
- Provides a detailed breakdown of insertions, deletions, and substitutions

## 🛠️ Local Development

### Prerequisites

- Python 3.9+ (`torch>=2.6.0` no longer supports Python 3.8)
- CUDA-compatible GPU (optional, for faster inference)

### Installation

```bash
git clone [your-repo-url]
cd [your-repo-name]
pip install -r requirements.txt
python app.py
```

### Requirements

```
gradio>=4.44.0
torch>=2.6.0
torchaudio>=2.6.0
transformers>=4.36.2
librosa>=0.10.1
soundfile>=0.12.1
numpy>=1.24.3
editdistance>=1.0.11
```

## 📊 WER Analysis

The application provides detailed Word Error Rate analysis:

- **Word Error Rate**: (substitutions + deletions + insertions) divided by the number of reference words, shown as a percentage
- **Error Breakdown**: Insertions, deletions, substitutions
- **Word Statistics**: Correct words, total words, accuracy
- **Normalized Texts**: Shows the processed texts for verification

## 🎯 Performance Tips

- **Clear Speech**: Speak clearly for better accuracy
- **Quiet Environment**: Minimize background noise
- **Good Microphone**: Use quality audio input
- **Model Selection**: Choose based on your needs (speed vs accuracy)

## 🤝 Contributing

Feel free to submit issues, feature requests, or pull requests to improve this application.

## 📝 License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- Hugging Face for the transformers library
- Facebook for Wav2Vec2 models
- OpenAI for Whisper models
- Gradio for the web interface framework
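
## 🧪 Example: WER Calculation Sketch

The WER steps described under Technical Details (normalize, align by edit distance, count insertions, deletions, and substitutions) can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the function names `normalize` and `wer_breakdown` are hypothetical, and it uses a plain dynamic-programming Levenshtein table in place of the `editdistance` package so that the per-error breakdown is visible. It also assumes that "no punctuation" means stripping ASCII punctuation only.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace (assumed rules)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Return WER plus counts of substitutions, deletions, insertions, and hits."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    n, m = len(ref), len(hyp)
    if n == 0:
        raise ValueError("reference has no words after normalization")

    # dist[i][j] = minimum number of word edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back through the table to classify each edit on one optimal path.
    subs = dels = ins = hits = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {
        "wer": (subs + dels + ins) / n,
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "correct": hits,
        "reference_words": n,
    }


if __name__ == "__main__":
    result = wer_breakdown("The quick brown fox jumps.", "the quick brown dog jumps over")
    print(result)
```

When several alignments have the same minimum cost, the split between substitutions, deletions, and insertions can differ depending on which path the backtrace prefers; the total error count and the WER value stay the same.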