Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

11/19/2022
by   Xinfa Zhu, et al.

This paper aims to synthesize a target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. Specifically, we address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style, and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labelled data, style-labelled data, and unlabelled data. To better transfer fine-grained expressiveness from the reference to the target speaker in the non-parallel transfer setting, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the effectiveness of our model design.
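To make the MBV discretization step more concrete, below is a minimal PyTorch-style sketch, assuming a sigmoid projection followed by hard thresholding with a straight-through estimator. The class name, dimensions, and threshold are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class MultiLabelBinaryVector(nn.Module):
    """Sketch of discretizing a continuous embedding into a multi-label binary vector (MBV).

    A linear projection plus sigmoid gives per-dimension "on/off" probabilities;
    thresholding at 0.5 yields the binary code, and a straight-through estimator
    keeps the layer differentiable during training.
    """

    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_dim)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.proj(embedding))   # soft per-bit probabilities
        hard = (probs > 0.5).float()                  # discrete multi-label code
        # Straight-through: forward pass uses the hard code,
        # backward pass flows gradients through the soft probabilities.
        return hard + probs - probs.detach()


# Hypothetical usage: discretize style and emotion embeddings with separate MBV layers,
# restricting each factor to its own binary code space before disentanglement.
style_mbv = MultiLabelBinaryVector(in_dim=256, code_dim=32)
style_embedding = torch.randn(8, 256)                 # batch of reference-style embeddings
style_code = style_mbv(style_embedding)               # shape (8, 32), values in {0, 1}
```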


