Uncovering Latent Style Factors for Expressive Speech Synthesis

11/01/2017
by Yuxuan Wang, et al.

Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotated data or any explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.
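The abstract does not spell out how the style tokens are wired into Tacotron, so the following is only a minimal sketch under stated assumptions, loosely modeled on the later Global Style Token formulation. It assumes a bank of K learned token vectors, a weight vector over that bank (soft weights during training, or a hand-picked one-hot vector at synthesis time), and a resulting style embedding that is appended to every encoder timestep. All function names, sizes, and scores here are hypothetical. Because the same vector is attached to every position regardless of the text, selecting one token applies one style uniformly across the utterance, which is one plausible reading of "globally consistent" control.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_tokens(token_bank, weights):
    """Weighted sum of style-token vectors, giving one style embedding."""
    if len(token_bank) != len(weights):
        raise ValueError("need exactly one weight per style token")
    dim = len(token_bank[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_bank)) for d in range(dim)]


def one_hot(k, num_tokens):
    """Weights that select exactly one token (hypothetical synthesis-time control)."""
    return [1.0 if i == k else 0.0 for i in range(num_tokens)]


def condition_on_style(encoder_outputs, style):
    """Append the same style vector to every encoder timestep.

    The style vector does not depend on the position or the text,
    so the chosen style is applied uniformly across the utterance.
    """
    return [list(h) + list(style) for h in encoder_outputs]


# Hypothetical sizes: 4 style tokens of dimension 3; a 5-step text encoding of dimension 2.
random.seed(0)
NUM_TOKENS, TOKEN_DIM = 4, 3
token_bank = [[random.uniform(-1, 1) for _ in range(TOKEN_DIM)] for _ in range(NUM_TOKENS)]
encoder_outputs = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(5)]

# Soft mixture of tokens; in a real model these scores would be learned (placeholders here).
soft_style = combine_tokens(token_bank, softmax([0.2, 1.5, -0.3, 0.0]))

# Direct control: pick token 2 for every input, independent of the text.
hard_style = combine_tokens(token_bank, one_hot(2, NUM_TOKENS))
conditioned = condition_on_style(encoder_outputs, hard_style)
```

With one-hot weights the style embedding is exactly the selected token, and every timestep of `conditioned` carries that same vector, so swapping the index swaps the style without touching the text encoding.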
