cuuupid / f5-tts

F5-TTS, a new state-of-the-art in open source voice cloning

replicate.com
Total runs: 171
24-hour runs: 0
7-day runs: 0
30-day runs: 0
Model last updated: Oct 14, 2024

Readme

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

F5-TTS: Diffusion Transformer with ConvNeXt V2, faster to train and faster at inference.

E2 TTS: Flat-UNet Transformer, the closest reproduction of the E2 TTS paper.

Sway Sampling: an inference-time flow-step sampling strategy that greatly improves performance.
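To make the idea of Sway Sampling concrete, here is a minimal Python sketch. It assumes the sway function described in the F5-TTS paper, f(u; s) = u + s * (cos(pi/2 * u) - 1 + u), applied to uniformly spaced flow steps. The function name and the default coefficient s = -1 are illustrative, not the repository's API.

```python
import math


def sway_sampling_timesteps(nfe_step: int, coef: float = -1.0) -> list[float]:
    """Return nfe_step + 1 flow timesteps on [0, 1], warped by sway sampling.

    Uniform steps u_i = i / nfe_step are mapped through
    f(u; s) = u + s * (cos(pi / 2 * u) - 1 + u).
    A negative coefficient s packs more steps near t = 0 (early in the flow);
    s = 0 leaves the steps uniform.
    """
    if nfe_step < 1:
        raise ValueError("nfe_step must be at least 1")
    uniform = [i / nfe_step for i in range(nfe_step + 1)]
    return [u + coef * (math.cos(math.pi / 2 * u) - 1 + u) for u in uniform]


if __name__ == "__main__":
    # Steps start very small and widen towards t = 1.
    print([round(t, 3) for t in sway_sampling_timesteps(8)])
```

With s = -1 the mapping reduces to 1 - cos(pi/2 * u). It still runs from 0 to 1, but spends more of the step budget early in the flow, where the coarse structure of the speech is formed.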

Installation

Clone the repository:

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS

Install packages:

pip install -r requirements.txt

Install torch and torchaudio builds that match your CUDA version, e.g.:

pip install torch==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

Note: install a NumPy version below 2.0, e.g. pip install numpy==1.22.0.
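To confirm the NumPy constraint after installation, you can run a small check script. This is a hypothetical helper, not part of the repository. It reads the installed version through the standard-library importlib.metadata module.

```python
from importlib.metadata import PackageNotFoundError, version


def numpy_version_ok(numpy_version: str) -> bool:
    """Return True if a NumPy version string (e.g. '1.22.0') is below 2.0."""
    major = int(numpy_version.split(".")[0])
    return major < 2


if __name__ == "__main__":
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        print("NumPy is not installed")
    else:
        status = "OK" if numpy_version_ok(installed) else 'too new, run: pip install "numpy<2"'
        print(f"numpy {installed}: {status}")
```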

Prepare Dataset

Example data processing scripts are provided for Emilia and WenetSpeech4TTS. You can tailor your own, along with a Dataset class in model/dataset.py.

# prepare a custom dataset as needed
# download the corresponding dataset first, and fill in its path in the scripts

# Prepare the Emilia dataset
python scripts/prepare_emilia.py

# Prepare the Wenetspeech4TTS dataset
python scripts/prepare_wenetspeech4tts.py
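
When you tailor a preparation script for your own data, the core work is usually to compute durations, drop clips that are too short or too long, and collect the character vocabulary of the transcripts. The sketch below shows that step in plain Python. The on-disk format the training code actually expects is defined by the repository's scripts and model/dataset.py. Everything here is hypothetical: the class name, the field names, and the 0.3 s / 30 s thresholds.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    # Hypothetical record for one clip; num_samples would come from the audio file.
    audio_path: str
    text: str
    num_samples: int
    sample_rate: int

    @property
    def duration(self) -> float:
        return self.num_samples / self.sample_rate


def build_manifest(utterances, min_sec=0.3, max_sec=30.0):
    """Keep clips within [min_sec, max_sec] seconds and collect a character vocab."""
    manifest = []
    vocab = set()
    for utt in utterances:
        dur = utt.duration
        if not (min_sec <= dur <= max_sec):
            continue  # skip clips outside the duration range
        manifest.append({"audio_path": utt.audio_path, "text": utt.text, "duration": round(dur, 3)})
        vocab.update(utt.text)  # character-level vocabulary
    return manifest, sorted(vocab)
```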
Training

Once your datasets are prepared, you can start the training process.

# setup accelerate config, e.g. use multi-gpu ddp, fp16
# saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
accelerate launch test_train.py

For initial guidance on finetuning, see #57 on GitHub.
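As the title says, F5-TTS is trained with flow matching. The sketch below illustrates the kind of objective involved. It assumes the common straight-line (optimal-transport) path with sigma = 0:

- a noisy point x_t = (1 - t) * x0 + t * x1 lies between Gaussian noise x0 and target mel frames x1;
- the regression target is the velocity x1 - x0;
- the mean-squared error is taken only over the masked frames the model has to infill.

The function names are hypothetical. The real model, masking, and batching in the training script are far more involved.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t, and its velocity."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask is True (the infilled frames)."""
    errs = [(p - q) ** 2 for p, q, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0, 0.0]                # toy "mel" values
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample
    t = rng.random()                          # random flow time in [0, 1)
    x_t, v = cfm_training_pair(x0, x1, t)     # a model would see x_t (plus conditioning)
    mask = [False, True, True, False]         # frames to infill
    print(masked_mse(v, v, mask))             # a perfect velocity prediction gives 0.0
```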

Inference

To run inference with pretrained models, download the checkpoints from 🤗 Hugging Face.

Generation is currently limited to 30 s. This limit is the TOTAL length of the prompt audio plus the generated audio. The Gradio app now supports batch inference with chunks. To avoid possible inference failures, read the following notes:

- A longer prompt audio leaves less room for generated output. Anything beyond 30 s cannot be generated properly. Consider splitting your text and running several separate inferences, or use the local Gradio app, which supports batch inference with chunks.
- Uppercase letters are spoken letter by letter, so use lowercase letters for normal words.
- Add some spaces (blank: " ") or punctuation (e.g. "," or ".") to explicitly introduce pauses. This may also help if the first few words are skipped in code-switched generation, which can happen because different languages are spoken at different speeds.
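The 30 s budget covers both the prompt and the output, so long text has to be split before inference. The Python sketch below shows one way to do this. It assumes that speech duration scales linearly with character count, at the same rate as in the reference prompt. This is a heuristic only, and the Gradio app's own chunking may work differently. The function name and the punctuation set are hypothetical.

```python
import re

MAX_TOTAL_SEC = 30.0  # prompt + generated audio, per the limit described above


def split_text_for_prompt(text, ref_text, ref_audio_sec, max_total_sec=MAX_TOTAL_SEC):
    """Split text at sentence/clause punctuation so each chunk fits the time budget."""
    if ref_audio_sec <= 0:
        raise ValueError("ref_audio_sec must be positive")
    # Heuristic: seconds of speech per character, estimated from the prompt.
    sec_per_char = ref_audio_sec / max(len(ref_text), 1)
    budget_chars = int((max_total_sec - ref_audio_sec) / sec_per_char)
    if budget_chars <= 0:
        raise ValueError("reference audio leaves no room for generation; use a shorter prompt")
    # Split after punctuation, keeping the punctuation with the piece it ends.
    pieces = [p.strip() for p in re.split(r"(?<=[.,;:!?])\s+", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= budget_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # a single over-long piece becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then generated separately with the same prompt, and the outputs are concatenated.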

Single Inference

You can test single inference using the following command. Before running it, modify the config as needed.

# modify the config as needed,
# e.g. fix_duration (the total length of prompt + to_generate, currently up to 30 s)
#      nfe_step     (larger values take more time but solve the inference ODE more precisely)
#      ode_method   (switch to 'midpoint' for better results with a small nfe_step,)
#                   (though 'midpoint' is a 2nd-order ODE solver, slower than the 1st-order 'Euler')
python test_infer_single.py
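
The nfe_step / ode_method trade-off can be seen on a toy ODE. This plain-Python sketch is not the repository's sampler. It integrates dx/dt = x from x(0) = 1 to t = 1, where the exact answer is e. Euler uses one evaluation of the velocity function per step. Midpoint uses two, but as a second-order method it is much more accurate with the same number of steps.

```python
import math


def euler_step(f, t, x, dt):
    # 1st order: one evaluation of f per step
    return x + dt * f(t, x)


def midpoint_step(f, t, x, dt):
    # 2nd order: evaluate f at the start, then at the estimated midpoint (2 evaluations)
    x_mid = x + 0.5 * dt * f(t, x)
    return x + dt * f(t + 0.5 * dt, x_mid)


def integrate(step, f, x0, timesteps):
    """Advance x0 across consecutive timesteps using the given step function."""
    x = x0
    for t0, t1 in zip(timesteps, timesteps[1:]):
        x = step(f, t0, x, t1 - t0)
    return x


if __name__ == "__main__":
    velocity = lambda t, x: x           # dx/dt = x
    steps = [i / 4 for i in range(5)]   # 4 steps on [0, 1]
    for name, step in [("euler", euler_step), ("midpoint", midpoint_step)]:
        x1 = integrate(step, velocity, 1.0, steps)
        print(name, x1, abs(x1 - math.e))
```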
Speech Editing

To test speech editing capabilities, use the following command.

python test_infer_single_edit.py
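
Speech editing regenerates only selected time spans and keeps the rest of the recording as context. As a rough sketch of the bookkeeping involved, the function below converts edit spans given in seconds into a per-frame mask over mel frames. The default sample rate (24 kHz) and hop length (256 samples) are assumptions. Check the edit script and model config for the values actually used. The function name is hypothetical.

```python
import math

SAMPLE_RATE = 24_000  # assumption: audio sample rate in Hz
HOP_LENGTH = 256      # assumption: mel hop length in samples


def edit_mask(total_sec, edit_spans, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Frame mask: True where frames are regenerated, False where original audio is kept."""
    frames_per_sec = sample_rate / hop_length
    num_frames = math.ceil(total_sec * frames_per_sec)
    mask = [False] * num_frames
    for start_sec, end_sec in edit_spans:
        # Round outwards so the whole span is covered.
        start = max(0, int(start_sec * frames_per_sec))
        end = min(num_frames, math.ceil(end_sec * frames_per_sec))
        for i in range(start, end):
            mask[i] = True
    return mask
```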
Gradio App

You can launch a Gradio app (web interface) to run inference through a GUI.

First, make sure you have the dependencies installed (pip install -r requirements.txt). Then install the Gradio app dependencies:

pip install -r requirements_gradio.txt

After installing the dependencies, launch the app. It loads the checkpoint from Hugging Face by default; you can set ckpt_path in gradio_app.py to use a local file instead:

python gradio_app.py

You can specify the port/host:

python gradio_app.py --port 7860 --host 0.0.0.0

Or launch a share link:

python gradio_app.py --share
Evaluation
Prepare Test Datasets
  1. Seed-TTS test set: Download from seed-tts-eval.
  2. LibriSpeech test-clean: Download from OpenSLR.
  3. Unzip the downloaded datasets and place them in the data/ directory.
  4. Update the path for the test-clean data in test_infer_batch.py.
  5. Our filtered LibriSpeech-PC 4-10 s subset is already under data/ in this repo.
Batch Inference for Test Set

To run batch inference for evaluations, execute the following commands:

# batch inference for evaluations
accelerate config  # if not set before
bash test_infer_batch.sh
Download Evaluation Model Checkpoints
  1. Chinese ASR Model: Paraformer-zh
  2. English ASR Model: Faster-Whisper
  3. WavLM Model: Download from Google Drive.
Objective Evaluation

Some Notes

For faster-whisper with CUDA 11:

pip install --force-reinstall ctranslate2==3.24.0

(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:

pip install faster-whisper==0.10.1

Update the paths to point to your batch inference results, then run the WER / SIM evaluations:

# Evaluation for Seed-TTS test set
python scripts/eval_seedtts_testset.py

# Evaluation for LibriSpeech-PC test-clean (cross-sentence)
python scripts/eval_librispeech_test_clean.py
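
The two reported metrics can be sketched in plain Python:

- WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the number of reference words.
- SIM is the cosine similarity between speaker embeddings.

This is a minimal sketch, not the evaluation scripts' code. Those scripts may apply their own text normalization, whereas this one only lowercases and splits on whitespace. Here the embeddings are plain lists of floats; in practice they would come from a speaker-verification model such as the WavLM checkpoint listed above.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # prev[j] = edit distance between ref[:i] and hyp[:j], computed row by row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (free if the words match)
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


def speaker_similarity(emb_a, emb_b) -> float:
    """Cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm = math.sqrt(sum(a * a for a in emb_a)) * math.sqrt(sum(b * b for b in emb_b))
    return dot / norm if norm else 0.0
```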
Acknowledgements
Citation
@article{chen-etal-2024-f5tts,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2410.06885},
      year={2024},
}
License

Our code is released under the MIT License.


More Information About the f5-tts Model on replicate.com

f5-tts license: CC-BY-NC-4.0 (https://spdx.org/licenses/CC-BY-NC-4.0)

cuuupid/f5-tts is a hosted version of F5-TTS ("a new state-of-the-art in open source voice cloning") on replicate.com. replicate.com offers a free trial as well as paid use, and the model can be called through the API from Node.js, Python, or plain HTTP.

Try cuuupid/f5-tts online for free on replicate.com:

https://replicate.com/cuuupid/f5-tts

f5-tts install

F5-TTS is open source and can be installed from GitHub. Alternatively, you can use the hosted version on replicate.com directly for debugging and trials, without a local installation.

GitHub repository for the replicate.com version of f5-tts:

https://github.com/cuuupid/cog-f5-tts


Other APIs from cuuupid (all on replicate.com)

Best-in-class clothing virtual try-on in the wild (non-commercial use only)

Total runs: 581.3K
Run Growth: 65.2K
Growth Rate: 11.26%
Updated: Aug 24, 2024

Embed text with Qwen2-7b-Instruct

Total runs: 337.6K
Run Growth: 155.8K
Growth Rate: 46.48%
Updated: Aug 06, 2024

GLM-4V is a multimodal model released by Tsinghua University that is competitive with GPT-4o and establishes a new SOTA on several benchmarks, including OCR.

Total runs: 76.9K
Run Growth: 2.9K
Growth Rate: 3.77%
Updated: Jul 02, 2024

Microsoft's tool to convert Office documents, PDFs, images, audio, and more to LLM-ready markdown.

Total runs: 3.8K
Run Growth: 3.1K
Growth Rate: 85.83%
Updated: Jan 17, 2025

Convert scanned or electronic documents to markdown, very very very fast

Total runs: 2.3K
Run Growth: 0
Growth Rate: 0.00%
Updated: Dec 07, 2023

Generate high quality videos from a prompt

Total runs: 1.7K
Run Growth: 100
Growth Rate: 5.88%
Updated: Aug 27, 2024

Flux finetuned for black and white line art.

Total runs: 1.4K
Run Growth: 100
Growth Rate: 7.14%
Updated: Aug 23, 2024

SDXL finetuned on line art

Total runs: 1.1K
Run Growth: 0
Growth Rate: 0.00%
Updated: Jun 05, 2024

Translate audio while keeping the original style, pronunciation and tone of your original audio.

Total runs: 767
Run Growth: 70
Growth Rate: 9.13%
Updated: Dec 06, 2023

SOTA open-source model for chatting with videos and the newest model in the Qwen family

Total runs: 448
Run Growth: 21
Growth Rate: 4.69%
Updated: Aug 31, 2024

Zonos-v0.1 beta, a SOTA text-to-speech Transformer model with extraordinary expressive range, built by Zyphra.

Total runs: 164
Run Growth: 93
Growth Rate: 56.71%
Updated: Feb 11, 2025

Finetuned E5 embeddings for instruct based on Mistral.

Total runs: 131
Run Growth: 0
Growth Rate: 0.00%
Updated: Feb 03, 2024

MiniCPM LLama3-V 2.5, a new SOTA open-source VLM that surpasses GPT-4V-1106 and Phi-128k on a number of benchmarks.

Total runs: 127
Run Growth: 0
Growth Rate: 0.00%
Updated: Jun 04, 2024

Llama-3-8B finetuned with ReFT to hyperfocus on New Jersey, the Garden State, the best state, the only state!

Total runs: 105
Run Growth: 0
Growth Rate: 0.00%
Updated: Jun 03, 2024

make meow emojis!

Total runs: 68
Run Growth: 0
Growth Rate: 0.00%
Updated: Jan 11, 2024

An example using Garden State Llama to ReFT on the Golden Gate bridge.

Total runs: 30
Run Growth: 0
Growth Rate: 0.00%
Updated: Jun 03, 2024