Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows

06/10/2021
by Iván Vallés-Pérez, et al.

Text-to-speech systems have recently achieved quality almost indistinguishable from human speech. However, the prosody of those systems is generally flatter than that of natural speech, producing samples with low expressiveness. Disentangling speaker identity from prosody is crucial for improving naturalness and producing more variable syntheses. This paper proposes a new neural text-to-speech model that approaches the disentanglement problem by conditioning a Tacotron2-like architecture on flow-normalized speaker embeddings, and by replacing the reference encoder with a new learned latent distribution that models the intra-sentence variability due to prosody. Removing the dependency on the reference encoder eliminates the speaker-leakage problem that typically occurs in this kind of system, producing more distinctive syntheses at inference time. The new model achieves significantly higher prosody variance than the baseline on a set of quantitative prosody features, as well as higher speaker distinctiveness, without decreasing speaker intelligibility. Finally, we observe that the normalized speaker embeddings enable much richer speaker interpolations, substantially improving the distinctiveness of the new interpolated speakers.
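As a rough illustration of interpolating speakers in a flow-normalized space, the sketch below maps two embeddings through an invertible transform, mixes them linearly in the transformed space, and maps the result back. It is a toy under stated assumptions, not the paper's model: a single affine coupling layer with fixed, hypothetical parameters `w` and `b` stands in for a learned normalizing flow, and the function names, embedding size, and values are made up for the example.

```python
import math

def coupling_forward(x, w, b):
    """Affine coupling layer: the first half of x passes through unchanged
    and determines a scale and shift applied to the second half.
    Assumes len(x) is even; w and b are hypothetical fixed parameters."""
    h = len(x) // 2
    x1, x2 = x[:h], x[h:]
    log_s = [math.tanh(w * v) for v in x1]  # bounded log-scale
    t = [b * v for v in x1]                 # shift
    z2 = [x2[i] * math.exp(log_s[i]) + t[i] for i in range(h)]
    return x1 + z2

def coupling_inverse(z, w, b):
    """Exact inverse of coupling_forward for the same w and b."""
    h = len(z) // 2
    z1, z2 = z[:h], z[h:]
    log_s = [math.tanh(w * v) for v in z1]
    t = [b * v for v in z1]
    x2 = [(z2[i] - t[i]) * math.exp(-log_s[i]) for i in range(h)]
    return z1 + x2

def interpolate_speakers(e_a, e_b, alpha, w=0.5, b=0.3):
    """Linearly interpolate two speaker embeddings in the flow's latent
    space, then map the mixture back to embedding space."""
    z_a = coupling_forward(e_a, w, b)
    z_b = coupling_forward(e_b, w, b)
    z_mix = [(1 - alpha) * p + alpha * q for p, q in zip(z_a, z_b)]
    return coupling_inverse(z_mix, w, b)

# Hypothetical 4-dimensional speaker embeddings.
speaker_a = [0.2, -1.0, 0.7, 1.5]
speaker_b = [-0.4, 0.3, 1.1, -0.8]
midpoint = interpolate_speakers(speaker_a, speaker_b, alpha=0.5)
```

Because the transform is invertible, `alpha=0` and `alpha=1` recover the original embeddings up to floating-point error, and intermediate values of `alpha` give new points that are mixed in the latent space rather than directly in embedding space.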

