Fish Audio S2 Pro

Fish Audio S2 Pro is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio across 80+ languages, it combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

Architecture

S2 Pro pairs a decoder-only transformer with a residual vector quantization (RVQ) audio codec (10 codebooks, ~21 Hz frame rate) in a Dual-Autoregressive (Dual-AR) architecture:

Slow AR (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
Fast AR (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it inherits the LLM-native serving optimizations from SGLang, including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
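The two-level decoding order described above can be illustrated with a toy loop. This is only a sketch of the control flow, not the released implementation: `slow_ar_step`, `fast_ar_step`, and `CODEBOOK_SIZE` are hypothetical stand-ins, and both "models" here simply sample random codes. How the real Fast AR is conditioned on the Slow AR's state is not covered in this card.

```python
import random

NUM_CODEBOOKS = 10    # from the text: 1 semantic codebook + 9 residual codebooks
CODEBOOK_SIZE = 1024  # hypothetical value; the card does not state the codebook size


def slow_ar_step(frames, rng):
    """Stand-in for the Slow AR: pick the semantic code of the next frame."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(frames, frame_prefix, rng):
    """Stand-in for the Fast AR: pick the next residual code within the current frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate(num_frames, seed=0):
    """Return num_frames frames, each a list of NUM_CODEBOOKS integer codes."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        # Outer loop: one step along the time axis (Slow AR).
        frame = [slow_ar_step(frames, rng)]
        # Inner loop: fill in the remaining residual codebooks (Fast AR).
        while len(frame) < NUM_CODEBOOKS:
            frame.append(fast_ar_step(frames, frame, rng))
        frames.append(frame)
    return frames


if __name__ == "__main__":
    frames = generate(num_frames=21)  # roughly one second of audio at ~21 Hz
    print(len(frames), "frames of", len(frames[0]), "codes each")
```

The point of the sketch is the cost split: the large model runs once per frame, while the small model runs nine times per frame.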

Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using [tag] syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.

Common tags (15,000+ unique tags supported):

[pause] [emphasis] [laughing] [inhale] [chuckle] [tsk] [singing] [excited] [laughing tone] [interrupting] [chuckling] [excited tone] [volume up] [echo] [angry] [low volume] [sigh] [low voice] [whisper] [screaming] [shouting] [loud] [surprised] [short pause] [exhale] [delight] [panting] [audience laughter] [with strong accent] [volume down] [clearing throat] [sad] [moaning] [shocked]
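When preparing prompts, it can help to see which parts of a string are control tags and which are spoken text. The helper below is a minimal client-side sketch for that purpose. The function name `split_inline_tags` is hypothetical, and the regex only assumes the [tag] syntax described above (no nested brackets); it does not reflect how the model itself tokenizes tags.

```python
import re

# Matches one [tag] with no nested brackets inside it.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_tags(text):
    """Split text into ("tag", ...) and ("text", ...) segments, in order."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Spoken text between the previous tag and this one.
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("tag", match.group(1).strip()))
        pos = match.end()
    # Any spoken text after the last tag.
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    prompt = "[whisper in small voice] I have a secret. [pause] Do not tell anyone."
    for kind, value in split_inline_tags(prompt):
        print(f"{kind:>4}: {value}")
```

Because tags are free-form descriptions, the helper does not validate them against a fixed vocabulary.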

Supported Languages

S2 Pro supports 80+ languages.

Tier 1: Japanese (ja), English (en), Chinese (zh)

Tier 2: Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

Other supported languages: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.
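For routing or logging, the tier lists above can be captured in a small lookup. This is a sketch under assumptions: `language_tier` is a hypothetical helper, the card does not define what the tiers imply for quality, and because the "other" list is open-ended ("and more"), the fallback value does not confirm that a code is supported.

```python
# Tier membership copied from the lists above.
TIER_1 = {"ja", "en", "zh"}
TIER_2 = {"ko", "es", "pt", "ar", "ru", "fr", "de"}


def language_tier(code):
    """Return "tier 1", "tier 2", or "other" for a language code."""
    code = code.strip().lower()
    if code in TIER_1:
        return "tier 1"
    if code in TIER_2:
        return "tier 2"
    # Not listed in Tier 1 or Tier 2; may or may not be in the open-ended list.
    return "other"


if __name__ == "__main__":
    for code in ("en", "ko", "sv"):
        print(code, "->", language_tier(code))
```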

Production Streaming Performance

On a single NVIDIA H200 GPU:

Real-Time Factor (RTF): 0.195
Time-to-first-audio: ~100 ms
Throughput: 3,000+ acoustic tokens/s while maintaining RTF below 0.5
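To put these figures in context, the sketch below works through the arithmetic, assuming the usual definition of RTF as generation time divided by the duration of the audio produced. It also estimates codec token counts from the ~21 Hz frame rate and 10 codebooks quoted in the Architecture section. The function names are hypothetical, and the frame rate is approximate, so the token counts are rough estimates.

```python
FRAME_RATE_HZ = 21   # approximate codec frame rate from the Architecture section
NUM_CODEBOOKS = 10   # codebooks per frame from the Architecture section
RTF = 0.195          # reported real-time factor on a single H200


def generation_seconds(audio_seconds, rtf=RTF):
    """Wall-clock time to synthesize audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


def codec_tokens(audio_seconds):
    """Approximate number of codec tokens (all codebooks) for audio_seconds of speech."""
    frames = round(audio_seconds * FRAME_RATE_HZ)
    return frames * NUM_CODEBOOKS


if __name__ == "__main__":
    minute = 60.0
    print(f"Time to generate 60 s of audio: {generation_seconds(minute):.1f} s")
    print(f"Speed relative to real time: {1 / RTF:.1f}x")
    print(f"Approximate codec tokens for 60 s: {codec_tokens(minute)}")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the condition for uninterrupted streaming.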

Technical Report

If you find our work useful, please consider citing our report:

@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report}, 
      author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2603.08823},
}

License

This model is licensed under the Fish Audio Research License. Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio — contact [email protected].

Safetensors

Model size: 5B params
Tensor type: BF16
