Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation

10/19/2021
by Fengyu Yang, et al.

Learning an emotion embedding from reference audio is a straightforward approach to multi-emotion speech synthesis in encoder-decoder systems. However, how to obtain a better emotion embedding and how to inject it into the TTS acoustic model more effectively remain open questions. In this paper, we propose a new constraint that helps the VAE extract emotion embeddings with better cluster cohesion. In addition, the resulting emotion embedding is used as a query to aggregate the latent representations of all encoder layers via attention. Queries derived from the encoder layers themselves are also helpful. Experiments show that the proposed methods enhance the encoding of comprehensive syntactic and semantic information and produce more expressive emotional speech.
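The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring function, any learned projections, or whether weights are computed per time step, so the code below assumes plain scaled dot-product scoring and per-time-step softmax weights over layers. The names `aggregate_layers`, `layer_outputs`, and `emotion_query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs, query):
    """Mix the outputs of several encoder layers using one query vector.

    layer_outputs: list of L layers, each a list of T hidden vectors of size d.
    query: vector of size d (a stand-in for the emotion embedding).
    Returns a list of T aggregated vectors of size d.

    Assumption: scaled dot-product scores and a softmax over layers at each
    time step; the paper's exact formulation may differ.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    num_steps = len(layer_outputs[0])
    aggregated = []
    for t in range(num_steps):
        # Hidden state of every layer at time step t.
        states = [layer[t] for layer in layer_outputs]
        # One score per layer: similarity between the query and that layer's state.
        scores = [sum(q * h for q, h in zip(query, state)) / scale for state in states]
        weights = softmax(scores)
        # Weighted sum of the layer states, dimension by dimension.
        mixed = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
        aggregated.append(mixed)
    return aggregated


# Toy input: 2 layers, 2 time steps, hidden size 2.
layer_outputs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

# A zero query scores every layer equally, so the layers are simply averaged.
uniform_mix = aggregate_layers(layer_outputs, [0.0, 0.0])

# A query aligned with the first dimension favours whichever layer is
# strongest in that dimension at each time step.
emotion_query = [10.0, 0.0]
focused_mix = aggregate_layers(layer_outputs, emotion_query)
```

In this toy setup a different query changes which layer dominates the mixture, which is the intuition behind conditioning the text-encoder aggregation on the emotion embedding.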

research
11/17/2020

Controllable Emotion Transfer For End-to-End Speech Synthesis

Emotion embedding space learned from references is a straightforward app...
research
06/17/2021

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

Recently, there has been an increasing interest in neural speech synthes...
research
01/17/2022

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Expressive synthetic speech is essential for many human-computer interac...
research
08/03/2020

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis

Attention-based seq2seq text-to-speech systems, especially those use sel...
research
10/07/2021

StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

Recently, emotional speech synthesis has achieved remarkable performance...
research
06/29/2023

Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data

We propose a method for speech-to-speech emotion-preserving translation t...
research
07/25/2020

Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network

We propose a novel method for emotion conversion in speech based on a ch...
