# MOSS-TTS Family

## Overview

MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, high-expressiveness, and complex real-world scenarios, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.
## Paper Information

This model family is based on the research presented in the paper [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
## Introduction

When a single piece of audio needs to sound like a real person, pronounce every word accurately, switch speaking styles across content, remain stable over tens of minutes, and support dialogue, role-play, and real-time interaction, a single TTS model is often not enough. The MOSS-TTS Family breaks the workflow into five production-ready models that can be used independently or composed into a complete pipeline.
- **MOSS-TTS**: The flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
- **MOSS-TTSD**: A production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
- **MOSS-VoiceGenerator**: An open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
- **MOSS-SoundEffect**: A high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
- **MOSS-TTS-Realtime**: A context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.
## Released Models

| Model | Architecture | Size | Model Card | Hugging Face |
|---|---|---|---|---|
| MOSS-TTS | MossTTSDelay | 8B | moss_tts_model_card.md | 🤗 Hugging Face |
| MOSS-TTS | MossTTSLocal | 1.7B | moss_tts_model_card.md | 🤗 Hugging Face |
| MOSS-TTSD-V1.0 | MossTTSDelay | 8B | moss_ttsd_model_card.md | 🤗 Hugging Face |
| MOSS-VoiceGenerator | MossTTSDelay | 1.7B | moss_voice_generator_model_card.md | 🤗 Hugging Face |
| MOSS-SoundEffect | MossTTSDelay | 8B | moss_sound_effect_model_card.md | 🤗 Hugging Face |
| MOSS-TTS-Realtime | MossTTSRealtime | 1.7B | moss_tts_realtime_model_card.md | 🤗 Hugging Face |
## Supported Languages

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime currently support 20 languages:

| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
| Italian | it | 🇮🇹 | Hebrew | he | 🇮🇱 | Korean | ko | 🇰🇷 |
| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Hungarian | hu | 🇭🇺 |
| Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |
# MOSS-TTS-Realtime

## 1. Overview

### 1.1 TTS Family Positioning

MOSS-TTS-Realtime is a high-performance, real-time speech synthesis model within the broader MOSS-TTS Family. It is designed for interactive voice agents that require low-latency, continuous speech generation across multi-turn conversations. Unlike conventional streaming TTS systems that synthesize each response in isolation, MOSS-TTS-Realtime natively models dialogue context by conditioning speech generation on both textual and acoustic information from previous turns. By tightly integrating multi-turn context awareness with incremental streaming synthesis, it produces natural, coherent, and voice-consistent audio responses, enabling fluid and human-like spoken interactions for real-time applications.
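To make the conditioning scheme concrete, here is a minimal sketch of how a session-level wrapper around such a model could accumulate multi-turn context. All names here (`RealtimeTTSSession`, `stream_speech`) are hypothetical illustrations, not the released API; see Section 2.2 for the actual usage guide.

```python
# Illustrative sketch only: `stream_speech` and this session wrapper are
# hypothetical names, not the released MOSS-TTS-Realtime API.
from typing import Iterator, Optional


class RealtimeTTSSession:
    """Accumulates multi-turn dialogue context (text plus prior user audio)
    so each synthesized response is conditioned on the whole conversation."""

    def __init__(self, model) -> None:
        self.model = model
        self.history: list[dict] = []  # one entry per dialogue turn

    def add_user_turn(self, text: str, audio_tokens: Optional[list] = None) -> None:
        # Prior user acoustics let the model adapt prosody and keep the
        # synthesized voice consistent across turns.
        self.history.append({"role": "user", "text": text, "audio": audio_tokens})

    def speak(self, response_text: str) -> Iterator[bytes]:
        # Incremental streaming: audio chunks are yielded as they are decoded,
        # each conditioned on the accumulated text + acoustic history.
        for chunk in self.model.stream_speech(text=response_text, context=self.history):
            yield chunk
        self.history.append({"role": "assistant", "text": response_text, "audio": None})
```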
#### Key Capabilities

- **Context-Aware & Expressive Speech Generation**: Generates expressive and coherent speech by modeling both textual and acoustic context across multiple dialogue turns.
- **High-Fidelity Voice Cloning with Multi-Turn Consistency**: Achieves exceptionally high voice similarity while maintaining strong speaker-identity consistency across multiple dialogue turns.
- **Long Context**: Supports long-range context with a maximum context length of 32K tokens (about 40 minutes; see the arithmetic sketch after this list), enabling stable and consistent speech generation in extended conversations.
- **Highly Human-Like Speech with Natural Prosody**: Trained on over 2.5 million hours of single-speaker speech and more than 1 million hours of two-speaker and multi-speaker conversational data, resulting in highly natural prosody and strong human-like expressiveness.
- **Multilingual Speech Support**: Supports over 10 languages beyond Chinese and English, including Korean, Japanese, German, and French, enabling consistent and expressive speech across languages.
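As a back-of-envelope check on the 32K-token / 40-minute figure above (the implied token rate below is our arithmetic from those two numbers, not a published spec):

```python
# Rough check of "32K context length ≈ 40 minutes": the implied rate of
# roughly 13-14 context tokens per second of audio is inferred from the
# two numbers above, not a documented model parameter.
context_tokens = 32 * 1024   # 32K maximum context length
duration_seconds = 40 * 60   # ~40 minutes

print(f"{context_tokens / duration_seconds:.1f} tokens per second")  # ~13.7
```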
### 1.2 Model Architecture
## 2. Quick Start

### 2.1 Environment Setup

We recommend a clean, isolated Python environment with Transformers 5.0.0 to avoid dependency conflicts.

```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```

Install all required dependencies:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
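After installation, a quick sanity check can confirm that the CUDA-enabled PyTorch build pulled from the cu128 wheel index actually sees your GPU (optional; assumes an NVIDIA GPU with a CUDA 12.8-compatible driver):

```python
# Optional post-install sanity check for the environment set up above.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```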
### 2.2 Usage

Please refer to the following model card for detailed usage instructions and examples:

📖 Usage Guide: https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md
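For orientation before opening the guide, here is a minimal loading sketch using the standard `transformers` Auto classes. The repo ID, the use of `trust_remote_code`, and the `generate` call are all assumptions for illustration; the linked model card is authoritative.

```python
# Minimal sketch, not the official example: the repo ID and generation
# interface below are assumptions; follow the linked model card instead.
from transformers import AutoModel, AutoProcessor

model_id = "OpenMOSS/MOSS-TTS-Realtime"  # hypothetical Hugging Face repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = processor(text="Hello! How can I help you today?", return_tensors="pt")
audio = model.generate(**inputs)  # exact generation API is also an assumption
```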
## Citation

If you use this code or our results in your work, please cite:

```bibtex
@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
      title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
      author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
      year={2026},
      eprint={2602.10934},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.10934},
}
```