Learning latent representations for style control and transfer in end-to-end speech synthesis

12/11/2018
by Ya-Jie Zhang, et al.

In this paper, we introduce the Variational Autoencoder (VAE) into an end-to-end speech synthesis model to learn the latent representation of speaking styles in an unsupervised manner. The style representation learned through the VAE exhibits desirable properties such as disentangling, scaling, and combination, which make style control straightforward. Style transfer can be achieved in this framework by first inferring the style representation through the recognition network of the VAE, then feeding it into the TTS network to guide the style of the synthesized speech. To avoid Kullback-Leibler (KL) divergence collapse during training, several techniques are adopted. Finally, the proposed model demonstrates good style control and outperforms the Global Style Token (GST) model in ABX preference tests on style transfer.
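To make the mechanics concrete, below is a minimal, hypothetical sketch in PyTorch of the two ideas described above: a VAE recognition network that infers a latent style vector z from a reference mel-spectrogram (used at inference time for style transfer), and a KL-weight annealing schedule, one common technique for mitigating KL collapse. Module names, dimensions, and the schedule are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: VAE-style reference encoder + KL annealing.
# All sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn


class StyleVAE(nn.Module):
    def __init__(self, n_mels=80, hidden=256, latent_dim=32):
        super().__init__()
        # Recognition network: summarize the reference mel-spectrogram.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, ref_mel):
        # ref_mel: (batch, frames, n_mels)
        _, h = self.encoder(ref_mel)      # h: (1, batch, hidden)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence between q(z|x) and the standard normal prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()


def kl_weight(step, warmup_steps=10000, max_weight=1e-3):
    """Annealed KL weight: ramp up slowly so the decoder cannot ignore z."""
    return max_weight * min(1.0, step / warmup_steps)


# Usage sketch: z would condition the TTS decoder (e.g. broadcast and
# concatenated with the text encoder outputs). At inference time, z can be
# sampled, scaled, or inferred from a reference utterance for style transfer.
style_vae = StyleVAE()
ref_mel = torch.randn(2, 200, 80)         # dummy reference spectrograms
z, kl = style_vae(ref_mel)
loss = kl_weight(step=500) * kl           # added to the reconstruction loss
```

The annealing schedule here is only one of several techniques mentioned for preventing the KL term from collapsing; the key point is that the KL weight starts small so the decoder is forced to use the latent style variable before the prior-matching pressure grows.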

Related research

10/13/2021
Multiple Style Transfer via Variational AutoEncoder
Modern works on style transfer focus on transferring style from a single...

06/07/2023
Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge
With the demand for autonomous control and personalized speech generatio...

06/24/2022
End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue
The recent text-to-speech (TTS) has achieved quality comparable to that ...

04/04/2019
Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis
Speech style control and transfer techniques aim to enrich the diversity...

12/06/2021
VAE based Text Style Transfer with Pivot Words Enhancement Learning
Text Style Transfer (TST) aims to alter the underlying style of the sour...

06/08/2019
Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
Recent work has explored sequence-to-sequence latent variable models for...

12/03/2022
UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating...
