Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

03/23/2018
by Yuxuan Wang, et al.

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
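
As a rough illustration of how a bank of embeddings can yield soft "labels" for both control and transfer, the sketch below forms a style embedding as a weighted sum of tokens. It is not the paper's implementation: the abstract does not specify the mechanism, so the single-head scaled dot-product attention, the token count and dimension, and the names `attend` and `style_embedding` are all assumptions made for illustration, and random vectors stand in for trained tokens and for an encoded reference clip.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(reference, tokens):
    """Return soft weights over the token bank (hypothetical attention).

    Assumption: scaled dot-product similarity between one reference
    vector of shape (dim,) and each token in a (num_tokens, dim) bank.
    """
    scores = tokens @ reference / np.sqrt(tokens.shape[1])
    return softmax(scores)


def style_embedding(tokens, weights):
    """Combine the token bank into one style vector as a weighted sum."""
    return weights @ tokens


rng = np.random.default_rng(0)
num_tokens, dim = 10, 256  # illustrative sizes, not taken from the abstract

# Random stand-in for a bank of trained token embeddings.
tokens = np.tanh(rng.normal(size=(num_tokens, dim)))

# Style transfer: weights are inferred from a reference clip's encoding
# (here a random stand-in vector).
reference = rng.normal(size=dim)
weights = attend(reference, tokens)
transfer_style = style_embedding(tokens, weights)

# Style control: weights are set by hand, here selecting token 3 only.
manual = np.zeros(num_tokens)
manual[3] = 1.0
control_style = style_embedding(tokens, manual)
```

In this sketch the same `style_embedding` function serves both uses: the weights either come from a reference signal (transfer) or are chosen directly (control), and in neither case do they depend on the text being synthesized.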


research
08/04/2018

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

Global Style Tokens (GSTs) are a recently-proposed method to learn laten...
research
08/19/2021

Controlled GAN-Based Creature Synthesis via a Challenging Game Art Dataset – Addressing the Noise-Latent Trade-Off

The state-of-the-art StyleGAN2 network supports powerful methods to crea...
research
08/30/2023

The DeepZen Speech Synthesis System for Blizzard Challenge 2023

This paper describes the DeepZen text to speech (TTS) system for Blizzar...
research
08/13/2020

Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion

The increased adoption of digital assistants makes text-to-speech (TTS) ...
research
06/26/2019

End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training

This paper proposes an end-to-end emotional speech synthesis (ESS) metho...
research
07/30/2020

Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning

This paper proposes a controllable end-to-end text-to-speech (TTS) syste...
research
11/06/2018

Robust and fine-grained prosody control of end-to-end speech synthesis

We propose prosody embeddings for emotional and expressive speech synthe...
