Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech

11/28/2019
by Vatsal Aggarwal, et al.

We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over the state of the art. At synthesis time, we use one example of the expressive style as a reference input to the encoder to generate any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).
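
To make the "VAE plus Householder Flow" idea concrete, here is a minimal NumPy sketch of a Householder flow applied to a VAE latent sample. It is not the authors' implementation: the latent size, the number of flow steps, and the random vectors `vs` are assumptions for illustration only (in a trained system the reflection vectors would come from the encoder network). Each step reflects the latent about a hyperplane, which is volume-preserving, so the flow adds no log-determinant term to the VAE objective.

```python
import numpy as np

def householder_step(z, v):
    """Reflect z about the hyperplane orthogonal to v: z' = z - 2 v (v.z) / (v.v)."""
    v = np.asarray(v, dtype=float)
    return z - 2.0 * v * (v @ z) / (v @ v)

def householder_flow(z0, vs):
    """Apply a sequence of Householder reflections, one per row of vs."""
    z = np.asarray(z0, dtype=float)
    for v in vs:
        z = householder_step(z, v)
    return z

rng = np.random.default_rng(0)
latent_dim = 8  # hypothetical latent size

# Stand-ins for the encoder outputs of a diagonal-Gaussian VAE posterior.
mu = rng.normal(size=latent_dim)
log_var = rng.normal(size=latent_dim)

# Reparameterization trick: z0 = mu + sigma * eps.
eps = rng.normal(size=latent_dim)
z0 = mu + np.exp(0.5 * log_var) * eps

# Hypothetical reflection vectors; three flow steps chosen arbitrarily.
vs = rng.normal(size=(3, latent_dim))
zK = householder_flow(z0, vs)

# A reflection is orthogonal, so the vector norm is unchanged by the flow.
print(np.linalg.norm(z0), np.linalg.norm(zK))
```

Because a product of reflections is an orthogonal matrix, the flow effectively rotates the diagonal-covariance posterior, giving it a full covariance structure at low cost.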

research
03/31/2021

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

Emotional voice conversion (EVC) aims to change the emotional state of a...
research
11/02/2022

Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

A large part of the expressive speech synthesis literature focuses on le...
research
08/04/2020

Expressive TTS Training with Frame and Style Reconstruction Loss

We propose a novel training strategy for Tacotron-based text-to-speech (...
research
04/07/2022

Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder

This paper proposes a controllable singing voice synthesis system capabl...
research
07/20/2023

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

Expressive speech synthesis models are trained by adding corpora with di...
research
10/14/2019

The Theory behind Controllable Expressive Speech Synthesis: a Cross-disciplinary Approach

As part of the Human-Computer Interaction field, Expressive speech synth...
research
06/16/2021

Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

This paper proposes a general enhancement to the Normalizing Flows (NF) ...