MOSS-TTS Family

Overview

MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It targets high-fidelity, highly expressive synthesis in complex real-world scenarios, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.

Paper Information

This model is based on the research presented in the paper MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models (arXiv:2602.10934).

Introduction

When a single piece of audio needs to sound like a real person, pronounce every word accurately, switch speaking styles across content, remain stable over tens of minutes, and support dialogue, role-play, and real-time interaction, a single TTS model is often not enough. The MOSS-TTS Family breaks the workflow into five production-ready models that can be used independently or composed into a complete pipeline (see the sketch after the list below).

  • MOSS-TTS: the flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation control, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
  • MOSS-TTSD: a production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
  • MOSS-VoiceGenerator: an open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
  • MOSS-SoundEffect: a high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
  • MOSS-TTS-Realtime: a context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.
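
To make the composition concrete, here is a minimal sketch of one such pipeline: design a timbre from text, narrate a script with it, and lay a generated ambience bed underneath. Every function here (design_voice, synthesize_speech, generate_sfx, mix) is an illustrative placeholder standing in for one model's role, not a released API; the real entry points are in the model cards linked under Released Models.

```python
# Hypothetical composition sketch -- placeholder functions, not the released APIs.
from dataclasses import dataclass


@dataclass
class AudioClip:
    samples: list[float]  # mono PCM samples in [-1.0, 1.0]
    sample_rate: int


def design_voice(description: str) -> bytes:
    """MOSS-VoiceGenerator role: create a speaker timbre from free-form text."""
    raise NotImplementedError("see moss_voice_generator_model_card.md")


def synthesize_speech(text: str, voice: bytes) -> AudioClip:
    """MOSS-TTS role: long-form synthesis with the designed voice."""
    raise NotImplementedError("see moss_tts_model_card.md")


def generate_sfx(prompt: str, duration_s: float) -> AudioClip:
    """MOSS-SoundEffect role: text-to-sound with controllable duration."""
    raise NotImplementedError("see moss_sound_effect_model_card.md")


def mix(fg: AudioClip, bg: AudioClip) -> AudioClip:
    """Naive overlay up to the shorter clip; real mixing needs gain/resampling."""
    n = min(len(fg.samples), len(bg.samples))
    mixed = [fg.samples[i] + 0.3 * bg.samples[i] for i in range(n)]
    return AudioClip(mixed, fg.sample_rate)


def produce_scene(script: str) -> AudioClip:
    voice = design_voice("warm, middle-aged narrator with a slight rasp")
    narration = synthesize_speech(script, voice)
    ambience = generate_sfx("light rain on a tin roof", duration_s=30.0)
    return mix(narration, ambience)
```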

Released Models

| Model | Architecture | Size | Model Card | Hugging Face |
|---|---|---|---|---|
| MOSS-TTS | MossTTSDelay | 8B | moss_tts_model_card.md | 🤗 Huggingface |
| MOSS-TTS | MossTTSLocal | 1.7B | moss_tts_model_card.md | 🤗 Huggingface |
| MOSS-TTSD-V1.0 | MossTTSDelay | 8B | moss_ttsd_model_card.md | 🤗 Huggingface |
| MOSS-VoiceGenerator | MossTTSDelay | 1.7B | moss_voice_generator_model_card.md | 🤗 Huggingface |
| MOSS-SoundEffect | MossTTSDelay | 8B | moss_sound_effect_model_card.md | 🤗 Huggingface |
| MOSS-TTS-Realtime | MossTTSRealtime | 1.7B | moss_tts_realtime_model_card.md | 🤗 Huggingface |

Supported Languages

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime currently support 20 languages:

| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
| Italian | it | 🇮🇹 | Hebrew | he | 🇮🇱 | Korean | ko | 🇰🇷 |
| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Hungarian | hu | 🇭🇺 |
| Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |

MOSS-TTS-Realtime

1. Overview

1.1 TTS Family Positioning

MOSS-TTS-Realtime is a high-performance, real-time speech synthesis model within the broader MOSS TTS Family. It is designed for interactive voice agents that require low-latency, continuous speech generation across multi-turn conversations. Unlike conventional streaming TTS systems that synthesize each response in isolation, MOSS-TTS-Realtime natively models dialogue context by conditioning speech generation on both textual and acoustic information from previous turns. By tightly integrating multi-turn context awareness with incremental streaming synthesis, it produces natural, coherent, and voice-consistent audio responses, enabling fluid and human-like spoken interactions for real-time applications.

Key Capabilities

  • Context-Aware & Expressive Speech Generation: Generates expressive and coherent speech by modeling both textual and acoustic context across multiple dialogue turns.

  • High-Fidelity Voice Cloning with Multi-Turn Consistency: Achieves exceptionally high voice similarity while maintaining strong speaker identity consistency across multiple dialogue turns.

  • Long Context: Supports long-range context with a maximum length of 32K tokens (about 40 minutes of audio), enabling stable and consistent speech generation in extended conversations.

  • Highly Human-Like Speech with Natural Prosody: Trained on over 2.5 million hours of single-speaker speech and more than 1 million hours of two-speaker and multi-speaker conversational data, resulting in highly natural prosody and strong human-like expressiveness.

  • Multilingual Speech Support: Supports over 10 languages beyond Chinese and English, including Korean, Japanese, German, and French, enabling consistent and expressive speech across languages.
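
Two of the bullets above invite a quick concretization. First, the 32K-token budget: back-solving from "32K tokens ≈ 40 minutes" gives roughly 13.7 tokens per second of dialogue. Second, "conditioning on both text and prior acoustics" implies the context window holds interleaved transcript and audio tokens per turn. The sketch below illustrates both; the token rate is derived arithmetic from this card, while the Turn layout and eviction policy are assumptions for illustration, not the model's actual scheme.

```python
# Illustrative sketch of a multi-turn TTS context window.
# ASSUMPTIONS: the ~13.7 tokens/s rate is back-solved from "32K tokens ~= 40 min"
# in this card; the real interleaving of text and audio tokens is model-specific.
from dataclasses import dataclass, field

TOKENS_PER_SECOND = 32_768 / (40 * 60)  # ~13.7 tokens/s implied by the card


@dataclass
class Turn:
    role: str                # "user" or "assistant"
    text_tokens: list[int]   # tokenized transcript of the turn
    audio_tokens: list[int]  # discrete acoustic tokens (user audio or prior TTS output)


@dataclass
class DialogueContext:
    max_tokens: int = 32_768
    turns: list[Turn] = field(default_factory=list)

    def append(self, turn: Turn) -> None:
        self.turns.append(turn)
        # Evict the oldest turns once the combined text+audio budget overflows,
        # so speaker identity from recent turns is always retained.
        while self._size() > self.max_tokens and len(self.turns) > 1:
            self.turns.pop(0)

    def _size(self) -> int:
        return sum(len(t.text_tokens) + len(t.audio_tokens) for t in self.turns)
```

At that rate, a 10-second reply costs on the order of 140 audio tokens plus its transcript, which is how tens of minutes of conversation fit in a single window.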

1.2 Model Architecture

2. Quick Start

2.1 Environment Setup

We recommend a clean, isolated Python environment with Transformers 5.0.0 to avoid dependency conflicts.

```bash
# Create and activate an isolated environment (Python 3.12)
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```

Install all required dependencies:

```bash
# Clone the repository and install it in editable mode,
# pulling PyTorch wheels built for CUDA 12.8
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
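
Optionally, before loading any model, verify that the CUDA build of PyTorch installed correctly and can see a GPU:

```python
# Quick sanity check: prints the torch version and whether a CUDA device is visible
import torch

print(torch.__version__, torch.cuda.is_available())
```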

2.2 Usage

Please refer to the following GitHub repository for detailed usage instructions and examples:

👉 Usage Guide:
https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md
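
For orientation before reading the full guide, the snippet below shows only the general shape of loading a Hub-hosted model that ships custom code. It is a hedged sketch: the repository id comes from this page, but trust_remote_code loading is an assumption here; the linked model card is the authoritative reference for the actual streaming synthesis API.

```python
# Hedged loading sketch -- see the usage guide above for the real synthesis API.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "OpenMOSS-Team/MOSS-TTS-Realtime",  # repository id shown on this page
    trust_remote_code=True,             # assumption: custom modeling code on the Hub
)
# Dialogue-history construction, audio tokenization, and chunked streaming output
# are covered in moss_tts_realtime_model_card.md.
```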

Citation

If you use this code or our results in your work, please cite:

```bibtex
@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
      title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
      author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
      year={2026},
      eprint={2602.10934},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.10934},
}
```