MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

01/17/2022
by Yi Lei, et al.

Expressive synthetic speech is essential for many human-computer interaction and audio broadcasting scenarios, so synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework that models emotion at different levels. Specifically, the proposed method is an attention-based sequence-to-sequence model with three proposed modules, a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, the utterance-level emotion variation, and the syllable-level emotion strength, respectively. Besides modeling emotion at different levels, the proposed method also allows synthesizing emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer and text-based emotion prediction, respectively. The experiments also show that the proposed method can control the emotion expression flexibly. Detailed analysis demonstrates the effectiveness of each module and the soundness of the overall design.
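To make the three-level conditioning concrete, below is a minimal PyTorch sketch of how a GM/UM/LM-style module could combine a global emotion-category embedding, an utterance-level variation vector, and per-syllable strengths before conditioning a sequence-to-sequence decoder. This is an illustrative assumption based only on the abstract, not the authors' implementation; all class names, attribute names, and dimensions (MultiScaleEmotionModule, emo_dim, text_dim, etc.) are hypothetical.

```python
# Hypothetical sketch of multi-scale emotion conditioning (GM/UM/LM),
# based only on the abstract's description; names and dimensions are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class MultiScaleEmotionModule(nn.Module):
    """Fuses global, utterance-level, and local emotion information."""

    def __init__(self, n_emotions=5, emo_dim=128, text_dim=256):
        super().__init__()
        # GM: one learned embedding per global emotion category.
        self.global_emb = nn.Embedding(n_emotions, emo_dim)
        # UM: projects an utterance-level variation vector (e.g., taken
        # from reference audio or predicted from text).
        self.utterance_proj = nn.Linear(emo_dim, emo_dim)
        # LM: predicts a scalar emotion strength per text unit
        # (syllable/phoneme) from the text-encoder outputs.
        self.local_strength = nn.Sequential(
            nn.Linear(text_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, text_enc, emotion_id, utt_vec, strength=None):
        # text_enc: (B, T, text_dim); emotion_id: (B,); utt_vec: (B, emo_dim)
        g = self.global_emb(emotion_id)        # (B, emo_dim)
        u = self.utterance_proj(utt_vec)       # (B, emo_dim)
        if strength is None:
            # Predicted from text; could instead be supplied manually
            # (control) or extracted from reference audio (transfer).
            strength = self.local_strength(text_enc)  # (B, T, 1)
        # Broadcast the fused embedding over time, scaled by local strength.
        emo = (g + u).unsqueeze(1) * strength  # (B, T, emo_dim)
        # Decoder input: text features augmented with emotion features.
        return torch.cat([text_enc, emo], dim=-1)
```

Passing strength=None mirrors the text-based prediction mode described above, while supplying an explicit strength tensor corresponds to manual control or transfer from reference audio.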


Related research

QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis (03/14/2023)
Recent expressive text to speech (TTS) models focus on synthesizing emot...

Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis (11/17/2020)
This paper proposes a unified model to conduct emotion transfer, control...

Controllable Emotion Transfer For End-to-End Speech Synthesis (11/17/2020)
Emotion embedding space learned from references is a straightforward app...

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling (11/19/2022)
This paper aims to synthesize target speaker's speech with desired speak...

Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation (10/19/2021)
Learning emotion embedding from reference audio is a straightforward app...

Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning (08/20/2020)
Despite the growing interest for expressive speech synthesis, synthesis ...

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis (06/25/2022)
Expressive speech synthesis, like audiobook synthesis, is still challeng...
