Robust and fine-grained prosody control of end-to-end speech synthesis

11/06/2018
by Younggun Lee, et al.

We propose prosody embeddings for emotional and expressive speech synthesis networks. The proposed methods introduce temporal structures in the embedding networks, which enable fine-grained control of the speaking style of the synthesized speech. The temporal structures can be designed on either the speech side or the text side, which leads to different control resolutions in time. The prosody embedding networks are plugged into end-to-end speech synthesis networks and trained without any supervision other than the target speech to be synthesized. The prosody embedding networks learned to extract prosodic features. By adjusting the learned prosody features, we can change the pitch and amplitude of the synthesized speech at both the frame level and the phoneme level. We also introduce temporal normalization of prosody embeddings, which shows better robustness against speaker perturbation in prosody transfer tasks.
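
The abstract does not say how temporal normalization is computed. As an illustration only, the sketch below assumes one plausible reading: the mean over time is subtracted from each dimension of a prosody embedding sequence, so that a constant offset (for example, one tied to the reference speaker) is removed while variation over time is kept. The function name `normalize_over_time` and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List


def normalize_over_time(embeddings: List[List[float]]) -> List[List[float]]:
    """Subtract the per-dimension mean over time from every time step.

    `embeddings` is a sequence of prosody embedding vectors, one per time
    step (a frame or a phoneme). This is an assumed form of normalization,
    not the authors' confirmed method.
    """
    if not embeddings:
        return []
    num_steps = len(embeddings)
    dim = len(embeddings[0])
    # Mean of each embedding dimension across all time steps.
    means = [sum(step[d] for step in embeddings) / num_steps for d in range(dim)]
    # Remove the constant offset; the temporal variation is preserved.
    return [[step[d] - means[d] for d in range(dim)] for step in embeddings]


# Toy sequence: 3 time steps, 2-dimensional embeddings (made-up values).
reference = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = normalize_over_time(reference)
```

Under this assumption, two reference utterances that differ only by a constant shift in embedding space would map to the same normalized sequence, which is one way such a step could reduce sensitivity to the reference speaker.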


