F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Paper: arXiv 2410.06885
This repository is gated: to access its files and content you must accept the following condition.
You agree not to use the model to generate, share, or promote content that is illegal, harmful, deceptive, or intended to impersonate real individuals without their informed consent.
# Create a Python 3.10 conda env (you could also use virtualenv)
conda create -n ez-vc python=3.10
conda activate ez-vc
git clone https://github.com/EZ-VC/EZ-VC
cd EZ-VC
git submodule update --init --recursive
pip install -e .
# Install espnet for XEUS (exactly this version)
pip install 'espnet @ git+https://github.com/wanchichen/espnet.git@ssl'
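Before moving on to inference, it can help to sanity-check the environment. The snippet below is a minimal check, assuming the editable install exposes the f5_tts module as in the upstream F5-TTS codebase:
# Verify that the editable install and the espnet SSL branch import cleanly
python -c "import f5_tts; print('f5_tts OK')"
python -c "import espnet; print('espnet OK')"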
We provide a Jupyter notebook for inference at src/f5_tts/infer/infer.ipynb. Open the notebook and run all cells; the converted audio is available in the last cell.
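If you prefer to run the notebook non-interactively (for example, on a remote GPU machine), you can execute it from the command line with jupyter nbconvert. This is a generic Jupyter invocation, not a project-specific script:
# Execute all cells and save the outputs back into the notebook
# (requires jupyter in the env: pip install jupyter)
jupyter nbconvert --to notebook --execute --inplace src/f5_tts/infer/infer.ipynb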
If our work and codebase are useful to you, please cite:
@inproceedings{joglekar-etal-2025-ez,
title = "{EZ}-{VC}: Easy Zero-shot Any-to-Any Voice Conversion",
author = "Joglekar, Advait and
Singh, Divyanshu and
Bhatia, Rooshil Rohit and
Umesh, Srinivasan",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1077/",
doi = "10.18653/v1/2025.findings-emnlp.1077",
pages = "19768--19774",
ISBN = "979-8-89176-335-7",
abstract = "Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages. We provide our code, model checkpoint and demo samples here: https://github.com/ez-vc/ez-vc"
}
Our code is released under the MIT License. The pre-trained models are licensed under CC-BY-NC. Sorry for any inconvenience this may cause.