Tutorial - VoiceCraft

Let's run VoiceCraft, a model for zero-shot speech editing and text-to-speech in the wild!

What you need

One of the following Jetson devices:

Jetson AGX Orin (64GB)
Jetson AGX Orin (32GB)

Running one of the following versions of JetPack:

JetPack 6 (L4T r36.x)
An NVMe SSD is highly recommended for storage speed and space:
- 15.6 GB for the voicecraft container image
- Space for models

Clone and set up jetson-containers:

git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh
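
Before pulling the image, the prerequisites above can be sanity-checked from a shell. The following is a minimal sketch and not part of jetson-containers: the helper names l4t_major and has_free_gb are hypothetical, it assumes the L4T release is recorded on the first line of /etc/nv_tegra_release in the form "# R36 (release), ...", and it checks free space on / (adjust the path to wherever Docker stores its images on your system).

```shell
# Hypothetical helper: extract the L4T major release (e.g. 36) from a line
# such as "# R36 (release), REVISION: 3.0, ..." (assumed format).
l4t_major() {
    printf '%s\n' "$1" | sed -n 's/^# R\([0-9][0-9]*\) (release).*/\1/p'
}

# Hypothetical helper: succeed if the filesystem holding path $1
# has at least $2 GiB available.
has_free_gb() {
    # df -Pk reports available space in 1 KiB blocks in column 4
    avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# JetPack 6 corresponds to L4T r36.x
if [ -r /etc/nv_tegra_release ]; then
    major=$(l4t_major "$(head -n 1 /etc/nv_tegra_release)")
    [ "$major" = "36" ] || echo "warning: expected L4T r36.x, found r${major:-unknown}"
fi

# 16 GiB covers the 15.6 GB image; models need additional space on top.
has_free_gb / 16 || echo "warning: less than 16 GiB free on /"
```

The threshold of 16 is only a lower bound for the image itself; raise it to leave room for the models you plan to load.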

How to start

Use the jetson-containers run command and the autotag script to automatically pull or build a compatible container image.

jetson-containers run $(autotag voicecraft)

The container has a default run command (CMD) that will automatically start the Gradio app.

Open your browser and go to http://<IP_ADDRESS>:7860, where <IP_ADDRESS> is the address of your Jetson.
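
If the Jetson runs headless, you need its address to fill in <IP_ADDRESS>. Here is a small sketch, assuming a Linux host where hostname -I lists the assigned addresses. The helper name gradio_url is hypothetical, and 7860 is the port from the URL above.

```shell
# Hypothetical helper: build the Gradio URL for host $1 and port $2 (default 7860).
gradio_url() {
    printf 'http://%s:%s\n' "$1" "${2:-7860}"
}

# First address reported by the host (Linux-specific; empty if none is found).
jetson_ip=$(hostname -I 2>/dev/null | awk '{print $1}') || jetson_ip=""
if [ -n "$jetson_ip" ]; then
    gradio_url "$jetson_ip"
fi
```

Run this on the Jetson itself, then open the printed URL from any machine on the same network.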

Gradio app

The VoiceCraft repo comes with a Gradio demo app.

Select which models you want to use. I recommend 330M_TTSEnhanced on the 32GB AGX Orin.
Click load. On the first run, the models are downloaded from Hugging Face; on later runs, they are loaded from the /data folder, where previous runs saved them.
Upload an audio file of your choice (MP3/WAV).
Click transcribe. It uses Whisper to produce a transcription along with the start and end time of each spoken word.
Now you can edit the sentence or use TTS. Click Run to generate the output.

Warning

For TTS, it's fine to use only the first few seconds of audio as the prompt, since TTS generation consumes a lot of memory. On the 32GB AGX Orin, the maximum length of generated TTS audio is around 16 seconds in headless mode.
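
One way to keep the prompt short is to trim the audio before uploading it. The sketch below assumes ffmpeg is installed on the machine where you prepare the file (this tutorial does not mention it). The helper name trim_prompt is hypothetical, and the 5-second default is only an illustration of "the first few seconds".

```shell
# Hypothetical helper: keep only the first $3 seconds (default 5) of input $1,
# writing the result to output $2.
trim_prompt() {
    # -t limits the output duration; -y overwrites an existing output file;
    # -nostdin keeps ffmpeg from reading the terminal when run in scripts
    ffmpeg -nostdin -loglevel error -y -i "$1" -t "${3:-5}" "$2"
}
```

For example, trim_prompt recording.wav prompt.wav 5 writes a 5-second prompt.wav that can then be uploaded in the Gradio app.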

Resources

If you want to know how it works under the hood, you can read the following papers: