Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

01/25/2022
by Artem Gorodetskii, et al.

Recent advances in voice cloning have brought speech synthesis for a target speaker close to human-level quality. However, autoregressive voice cloning systems still suffer from text alignment failures, which leave them unable to synthesize long sentences. In this work, we propose a variant of an attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech and also generalizes to very long utterances. The proposed system consists of three independently trained components: a speaker encoder, a synthesizer, and a universal vocoder. Generalization to long utterances is achieved by using an energy-based attention mechanism known as Dynamic Convolution Attention, in combination with a set of modifications to the Tacotron 2-based synthesizer. Moreover, effective zero-shot speaker adaptation is achieved by conditioning both the synthesizer and the vocoder on a speaker encoder that has been pretrained on a large corpus of diverse data. We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency, and ability to synthesize long utterances, and conclude that the proposed model can produce intelligible synthetic speech for extremely long utterances while preserving a high degree of naturalness and similarity for short texts.
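
To make the attention mechanism named in the abstract more concrete, the following is a minimal sketch of a single Dynamic Convolution Attention step in plain Python. It is not the authors' implementation. It assumes the commonly published form of the mechanism, in which the previous alignment is convolved with static filters, dynamic filters, and a causal prior filter, and the resulting energies are normalized with a softmax. All names (`dca_step`, `conv1d_same`, `causal_prior`) and all numeric values are hypothetical. For brevity the sketch uses one static and one dynamic filter with scalar projection weights, and it takes the dynamic filter as an argument, whereas in the full mechanism that filter is predicted from the decoder state by a small network.

```python
import math

def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over signal with zero padding (same length out)."""
    half = len(kernel) // 2
    out = []
    for j in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = j + k - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def causal_prior(signal, prior):
    """Causal filter: position j only receives mass from positions j, j-1, ..."""
    out = []
    for j in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(prior):
            if j - k >= 0:
                acc += w * signal[j - k]
        out.append(acc)
    return out

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dca_step(prev_align, static_filter, dynamic_filter, prior_filter, u, t, b, v):
    """One simplified attention step that depends only on the previous alignment.

    Assumption: scalar weights u, t, b, v stand in for the learned projection
    matrices and vectors of the full mechanism.
    """
    f = conv1d_same(prev_align, static_filter)    # static location features
    g = conv1d_same(prev_align, dynamic_filter)   # dynamic location features
    p = causal_prior(prev_align, prior_filter)    # prior favoring forward motion
    energies = [
        v * math.tanh(u * f_j + t * g_j + b) + math.log(max(p_j, 1e-6))
        for f_j, g_j, p_j in zip(f, g, p)
    ]
    return softmax(energies)

# Hypothetical usage: attention starts on the first of six encoder positions.
align = dca_step(
    prev_align=[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    static_filter=[0.1, 0.5, 0.1],
    dynamic_filter=[0.2, 0.3, 0.2],
    prior_filter=[0.6, 0.3, 0.1],
    u=1.0, t=1.0, b=0.0, v=1.0,
)
print(align)
```

Because the energies are computed only from the previous alignment and not from the text content, and because the prior filter confines attention to the current position and a few positions ahead, the step behaves the same way at any point in the input. This location-relative behavior is the property usually credited for generalization to long utterances.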
