Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

06/24/2022
by   Florian Lux, et al.
0

The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance level normalization in conjunction with an utterance level speaker embedding. We further introduce a lightweight aligner for extracting fine-grained prosodic features, that can be finetuned on individual samples within seconds. We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently without any degradation in quality and high similarity to both original voice and prosody, as our objective evaluation and human study show. All of our code and trained models are available, alongside static and interactive demos.

READ FULL TEXT

page 2

page 4

research
10/28/2022

Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Text-based voice editing (TBVE) uses synthetic output from text-to-speec...
research
10/12/2022

Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Several recently proposed text-to-speech (TTS) models achieved to genera...
research
05/12/2023

Using Deepfake Technologies for Word Emphasis Detection

In this work, we consider the task of automated emphasis detection for s...
research
01/25/2022

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

With recent advancements in voice cloning, the performance of speech syn...
research
06/28/2022

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Typically, singing voice conversion (SVC) depends on an embedding vector...
research
03/01/2021

AdaSpeech: Adaptive Text to Speech for Custom Voice

Custom voice, a specific text to speech (TTS) service in commercial spee...
research
08/24/2023

WavMark: Watermarking for Audio Generation

Recent breakthroughs in zero-shot voice synthesis have enabled imitating...

Please sign up or login with your details

Forgot password? Click here to reset