Tutorial - VoiceCraft
Let's run VoiceCraft, a model for zero-shot speech editing and text-to-speech in the wild!
What you need
- One of the following Jetson devices: Jetson AGX Orin (64GB) or Jetson AGX Orin (32GB)
- Running one of the following versions of JetPack: JetPack 6 (L4T r36.x)
- Sufficient storage space (preferably on an NVMe SSD):
  - 15.6 GB for the voicecraft container image
  - Space for models
- Clone and set up jetson-containers:
  git clone https://github.com/dusty-nv/jetson-containers
  bash jetson-containers/install.sh
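Before pulling the image, it can help to confirm there is enough free space for it. The sketch below is a plain POSIX shell function, not part of jetson-containers; the name has_free_space and the 16 GiB default are illustrative assumptions, and /var/lib/docker is only Docker's default data-root, so pass a different path if your Docker images live elsewhere (for example on the NVMe SSD).

```shell
# Hypothetical helper: check that the filesystem holding DIR has at least
# NEEDED_GB GiB available. /var/lib/docker is Docker's default data-root;
# if you moved Docker's storage, pass that path instead. The 16 GiB default
# roughly covers the ~15.6 GB voicecraft image but NOT the models.
has_free_space() {
  dir="${1:-/var/lib/docker}"
  needed_gb="${2:-16}"
  # df -P prints one line per filesystem; field 4 of line 2 is the
  # available space in 1K blocks.
  avail_kb=$(df -P -k "$dir" 2>/dev/null | awk 'NR == 2 { print $4 }')
  if [ -z "$avail_kb" ]; then
    echo "Cannot read free space for $dir" >&2
    return 1
  fi
  avail_gb=$((avail_kb / 1024 / 1024))
  if [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]; then
    echo "OK: ${avail_gb} GiB available on $dir"
  else
    echo "Only ${avail_gb} GiB available on $dir, want ${needed_gb} GiB" >&2
    return 1
  fi
}
```

For example, paste the function into your shell and run has_free_space /var/lib/docker 16. If you are unsure where Docker keeps its images, docker info --format '{{.DockerRootDir}}' prints that directory. Remember that the model checkpoints need additional space on top of the image.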
How to start
Use the run.sh and autotag scripts to automatically pull or build a compatible container image:
jetson-containers run $(autotag voicecraft)
The container has a default run command (CMD) that automatically starts the Gradio app.
Open your browser and go to http://<IP_ADDRESS>:7860, replacing <IP_ADDRESS> with the IP address of your Jetson.
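The Gradio app can take a while to come up, especially on the first start. A minimal sketch for waiting on it from the Jetson itself, assuming bash and the coreutils timeout command are available; wait_for_gradio is a hypothetical helper name. It only checks that something accepts TCP connections on port 7860, not that the models have finished loading.

```shell
# Hypothetical helper: poll HOST:PORT until something accepts TCP connections,
# giving up after roughly LIMIT seconds. Port 7860 comes from this tutorial.
# Assumes bash (for its /dev/tcp redirection) and coreutils `timeout`, which
# caps each connection attempt at 2 seconds.
wait_for_gradio() {
  host="${1:-127.0.0.1}"
  port="${2:-7860}"
  limit="${3:-300}"
  waited=0
  until timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; do
    if [ "$waited" -ge "$limit" ]; then
      echo "Nothing listening on ${host}:${port} after ${limit}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Gradio app should be up: http://${host}:${port}"
}
```

For example, run wait_for_gradio 127.0.0.1 7860 in a second terminal on the Jetson. Once it succeeds, hostname -I lists the addresses you can substitute for <IP_ADDRESS> when browsing from another machine.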
Gradio app
The VoiceCraft repo comes with a Gradio demo app:
- Select which models you want to use. I recommend 330M_TTSEnhanced on the 32GB AGX Orin.
- Click Load. The first time you run it, the models are downloaded from Hugging Face; on later runs, they are loaded from the /data folder, where they were saved during previous runs.
- Upload an audio file of your choice (MP3/WAV).
- Click Transcribe. It uses Whisper to get the transcription, along with the start and end time of each spoken word.
- Now you can edit the sentence or use TTS. Click Run to generate the output.
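To see which checkpoints have already been cached, you can look for large files on the host. This sketch assumes the container's /data folder is the data/ directory inside your jetson-containers clone (the usual mount for jetson-containers run, but verify it on your setup). The exact subfolder VoiceCraft writes to is not covered here, so the hypothetical helper list_large_files simply searches the whole tree.

```shell
# Hypothetical helper: list files larger than MIN_MB MiB under DIR, largest
# first, to see which downloaded checkpoints take up space.
# The default DIR assumes the host-side jetson-containers/data folder that is
# mounted at /data in the container; adjust it to where you cloned the repo.
list_large_files() {
  dir="${1:-jetson-containers/data}"
  min_mb="${2:-100}"
  if [ ! -d "$dir" ]; then
    echo "No such directory: $dir" >&2
    return 1
  fi
  # du -m prints "<size in MiB><TAB><path>"; sort -rn puts the biggest first.
  find "$dir" -type f -size +"${min_mb}"M -exec du -m {} + | sort -rn
}
```

For example, list_large_files ~/jetson-containers/data 100 shows every file over 100 MiB, which is a quick way to decide what to delete when storage runs low.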
Warning
For TTS, it's fine to use only the first few seconds of the audio as the prompt, since longer prompts consume a lot of memory. On the 32GB AGX Orin, the maximum length of generated TTS audio is around 16 seconds in headless mode.
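One way to follow this advice is to cut the prompt down before uploading it. The sketch below assumes ffmpeg is installed wherever you prepare the file (it is not part of this tutorial's setup); the helper name trim_prompt and the 5-second default are illustrative choices, not values taken from VoiceCraft.

```shell
# Hypothetical helper: write the first SECONDS of INPUT to OUTPUT.wav so the
# TTS prompt stays short. Assumes ffmpeg is installed; 5 s is an arbitrary default.
trim_prompt() {
  if [ "$#" -lt 2 ]; then
    echo "usage: trim_prompt INPUT OUTPUT.wav [SECONDS]" >&2
    return 2
  fi
  src="$1"
  dst="$2"
  secs="${3:-5}"
  if [ ! -f "$src" ]; then
    echo "No such file: $src" >&2
    return 1
  fi
  if ! command -v ffmpeg >/dev/null 2>&1; then
    echo "ffmpeg not found" >&2
    return 1
  fi
  # -t limits the output duration; -y overwrites OUTPUT if it already exists.
  ffmpeg -hide_banner -loglevel error -y -i "$src" -t "$secs" "$dst"
}
```

For example, trim_prompt speaker.mp3 prompt.wav 5 writes the first five seconds of speaker.mp3 to prompt.wav, which you can then upload in the Gradio app instead of the full recording.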
Resources
If you want to know how it works under the hood, you can read the following papers: