# MOSS-TTS Family

## Overview

MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, high-expressiveness, and complex real-world scenarios, covering stable long-form speech, multi-speaker dialogue, voice/character design, environmental sound effects, and real-time streaming TTS.
## Paper Information

This model family is based on the research presented in the paper [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
## Introduction

When a single piece of audio needs to sound like a real person, pronounce every word accurately, switch speaking styles across content, remain stable over tens of minutes, and support dialogue, role-play, and real-time interaction, a single TTS model is often not enough. The MOSS-TTS Family breaks the workflow into five production-ready models that can be used independently or composed into a complete pipeline.
- **MOSS-TTS**: The flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
- **MOSS-TTSD**: A production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
- **MOSS-VoiceGenerator**: An open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
- **MOSS-SoundEffect**: A high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
- **MOSS-TTS-Realtime**: A context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, consistent voice responses across turns.
## Released Models

| Model | Architecture | Size | Model Card | Hugging Face |
|---|---|---|---|---|
| MOSS-TTS | MossTTSDelay | 8B | moss_tts_model_card.md | 🤗 Hugging Face |
| MOSS-TTS | MossTTSLocal | 1.7B | moss_tts_model_card.md | 🤗 Hugging Face |
| MOSS-TTSD-V1.0 | MossTTSDelay | 8B | moss_ttsd_model_card.md | 🤗 Hugging Face |
| MOSS-VoiceGenerator | MossTTSDelay | 1.7B | moss_voice_generator_model_card.md | 🤗 Hugging Face |
| MOSS-SoundEffect | MossTTSDelay | 8B | moss_sound_effect_model_card.md | 🤗 Hugging Face |
| MOSS-TTS-Realtime | MossTTSRealtime | 1.7B | moss_tts_realtime_model_card.md | 🤗 Hugging Face |
## Supported Languages

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime currently support 20 languages:

| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
| Italian | it | 🇮🇹 | Hebrew | he | 🇮🇱 | Korean | ko | 🇰🇷 |
| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Hungarian | hu | 🇭🇺 |
| Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |
# MOSS-TTS-Realtime

## 1. Overview

### 1.1 TTS Family Positioning

MOSS-TTS-Realtime is a high-performance, real-time speech synthesis model within the broader MOSS-TTS Family. It is designed for interactive voice agents that require low-latency, continuous speech generation across multi-turn conversations. Unlike conventional streaming TTS systems that synthesize each response in isolation, MOSS-TTS-Realtime natively models dialogue context by conditioning speech generation on both textual and acoustic information from previous turns. By tightly integrating multi-turn context awareness with incremental streaming synthesis, it produces natural, coherent, and voice-consistent audio responses, enabling fluid and human-like spoken interactions for real-time applications.
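To make the conditioning scheme concrete, here is a minimal sketch of how a session-level wrapper around such a model could accumulate multi-turn context. All names here (`RealtimeTTSSession`, `stream_speech`) are hypothetical illustrations, not the released API; see Section 2.2 for the actual usage guide.

```python
# Illustrative sketch only: `stream_speech` and this session wrapper are
# hypothetical names, not the released MOSS-TTS-Realtime API.
from typing import Iterator, Optional


class RealtimeTTSSession:
    """Accumulates multi-turn dialogue context (text plus prior user audio)
    so each synthesized response is conditioned on the whole conversation."""

    def __init__(self, model) -> None:
        self.model = model
        self.history: list[dict] = []  # one entry per dialogue turn

    def add_user_turn(self, text: str, audio_tokens: Optional[list] = None) -> None:
        # Prior user acoustics let the model adapt prosody and keep the
        # synthesized voice consistent across turns.
        self.history.append({"role": "user", "text": text, "audio": audio_tokens})

    def speak(self, response_text: str) -> Iterator[bytes]:
        # Incremental streaming: audio chunks are yielded as they are decoded,
        # each conditioned on the accumulated text + acoustic history.
        for chunk in self.model.stream_speech(text=response_text, context=self.history):
            yield chunk
        self.history.append({"role": "assistant", "text": response_text, "audio": None})
```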
#### Key Capabilities

- **Context-Aware & Expressive Speech Generation**: Generates expressive and coherent speech by modeling both textual and acoustic context across multiple dialogue turns.
- **High-Fidelity Voice Cloning with Multi-Turn Consistency**: Achieves exceptionally high voice similarity while maintaining strong speaker-identity consistency across multiple dialogue turns.
- **Long Context**: Supports long-range context with a maximum context length of 32K tokens (about 40 minutes; see the arithmetic sketch after this list), enabling stable and consistent speech generation in extended conversations.
- **Highly Human-Like Speech with Natural Prosody**: Trained on over 2.5 million hours of single-speaker speech and more than 1 million hours of two-speaker and multi-speaker conversational data, resulting in highly natural prosody and strong human-like expressiveness.
- **Multilingual Speech Support**: Supports over 10 languages beyond Chinese and English, including Korean, Japanese, German, and French, enabling consistent and expressive speech across languages.
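As a back-of-envelope check on the 32K-token / 40-minute figure above (the implied token rate below is our arithmetic from those two numbers, not a published spec):

```python
# Rough check of "32K context length ≈ 40 minutes": the implied rate of
# roughly 13-14 context tokens per second of audio is inferred from the
# two numbers above, not a documented model parameter.
context_tokens = 32 * 1024   # 32K maximum context length
duration_seconds = 40 * 60   # ~40 minutes

print(f"{context_tokens / duration_seconds:.1f} tokens per second")  # ~13.7
```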
### 1.2 Model Architecture
## 2. Quick Start

### 2.1 Environment Setup

We recommend a clean, isolated Python environment with Transformers 5.0.0 to avoid dependency conflicts.

```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```

Install all required dependencies:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
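After installation, a quick sanity check can confirm that the CUDA-enabled PyTorch build pulled from the cu128 wheel index actually sees your GPU (optional; assumes an NVIDIA GPU with a CUDA 12.8-compatible driver):

```python
# Optional post-install sanity check for the environment set up above.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```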
### 2.2 Usage

Please refer to the following model card for detailed usage instructions and examples:

📖 Usage Guide: https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md
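For orientation before opening the guide, here is a minimal loading sketch using the standard `transformers` Auto classes. The repo ID, the use of `trust_remote_code`, and the `generate` call are all assumptions for illustration; the linked model card is authoritative.

```python
# Minimal sketch, not the official example: the repo ID and generation
# interface below are assumptions; follow the linked model card instead.
from transformers import AutoModel, AutoProcessor

model_id = "OpenMOSS/MOSS-TTS-Realtime"  # hypothetical Hugging Face repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = processor(text="Hello! How can I help you today?", return_tensors="pt")
audio = model.generate(**inputs)  # exact generation API is also an assumption
```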
## Citation

If you use this code or our results in your work, please cite:

```bibtex
@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
      title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
      author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
      year={2026},
      eprint={2602.10934},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.10934},
}
```