---
language:
- en
tags:
- audio-text-to-text
- speech-translation
- speech-understanding
- audio
- chat
license: apache-2.0
datasets:
- custom
metrics:
- wer
- bleu
- AIR-Bench
---
<div align="center">
<h1>
Soundwave: Less is More for Speech-Text Alignment in LLMs
</h1>
</div>

<p align="center">
<font size="3"><a href="https://github.com/FreedomIntelligence/Soundwave">🐈⬛ Github</a> | <a href="https://arxiv.org/abs/2502.12900">📃 Paper</a> | <a href="https://huggingface.co/spaces/puccho/Soundwave">📼 Online Demo</a></font>
</p>
|
## Model Description
Soundwave is a speech-to-text model that bridges the gap between speech and text. Trained on just 10k hours of data, it delivers strong performance on speech translation and AIR-Bench speech tasks.
|
### Key Features
<div>
<ul>
<font size="3"><li>A speech-to-text model bridging the gap between speech and text</li></font>
<font size="3"><li>A data-efficient strategy and a unique architecture, trained on only 10k hours of data</li></font>
<font size="3"><li>Strong performance on speech translation and AIR-Bench speech tasks</li></font>
<font size="3"><li>Retains its intelligence during conversations, making it well suited to interactive tasks</li></font>
</ul>
</div>

## Usage
Load the Soundwave model and run inference with your audio files as shown in the <a href="https://github.com/FreedomIntelligence/Soundwave">GitHub repository</a>.
|
# <span>📖 Citation</span>
```
@article{zhang2025soundwave,
  title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
  author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
  journal={arXiv preprint arXiv:2502.12900},
  year={2025}
}
```