F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Paper: arXiv 2410.06885
This repository is gated: to access its files and content you must accept the following condition.
You agree not to use the model to generate, share, or promote content that is illegal, harmful, deceptive, or intended to impersonate real individuals without their informed consent.
# Create a Python 3.10 conda env (you could also use virtualenv)
conda create -n ez-vc python=3.10
conda activate ez-vc
git clone https://github.com/EZ-VC/EZ-VC
cd EZ-VC
git submodule update --init --recursive
pip install -e .
# Install espnet for XEUS (exactly this version)
pip install 'espnet @ git+https://github.com/wanchichen/espnet.git@ssl'
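Before moving on to inference, it can help to sanity-check the environment. The snippet below is a minimal check, assuming the editable install exposes the f5_tts module as in the upstream F5-TTS codebase:
# Verify that the editable install and the espnet SSL branch import cleanly
python -c "import f5_tts; print('f5_tts OK')"
python -c "import espnet; print('espnet OK')"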
We provide a Jupyter notebook for inference at src/f5_tts/infer/infer.ipynb. Open the notebook and run all cells; the converted audio is available in the last cell.
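If you prefer to run the notebook non-interactively (for example, on a remote GPU machine), you can execute it from the command line with jupyter nbconvert. This is a generic Jupyter invocation, not a project-specific script:
# Execute all cells and save the outputs back into the notebook
# (requires jupyter in the env: pip install jupyter)
jupyter nbconvert --to notebook --execute --inplace src/f5_tts/infer/infer.ipynb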
If our work and codebase are useful to you, please cite:
@inproceedings{joglekar-etal-2025-ez,
title = "{EZ}-{VC}: Easy Zero-shot Any-to-Any Voice Conversion",
author = "Joglekar, Advait and
Singh, Divyanshu and
Bhatia, Rooshil Rohit and
Umesh, Srinivasan",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1077/",
doi = "10.18653/v1/2025.findings-emnlp.1077",
pages = "19768--19774",
ISBN = "979-8-89176-335-7",
abstract = "Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages. We provide our code, model checkpoint and demo samples here: https://github.com/ez-vc/ez-vc"
}
Our code is released under the MIT License. The pre-trained models are licensed under CC-BY-NC. Sorry for any inconvenience this may cause.