ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

05/23/2023
by Minki Kang, et al.

Emotional Text-To-Speech (TTS) is an important task in developing systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, aim to produce emotional TTS only for speakers seen during training, without considering generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are available at https://ZET-Speech.github.io/ZET-Speech-Demo/.
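The abstract names two ingredients, domain adversarial learning and guidance on the diffusion model, without implementation details. As a rough illustration of the first, below is a minimal, hypothetical PyTorch sketch of domain adversarial learning via a gradient-reversal layer, which could be used to strip emotion information from a speaker (or style) embedding so that the target emotion label alone controls the synthesized emotion. The module names, dimensions, and the speaker_encoder call are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; negates (and scales) gradients in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdversarialEmotionClassifier(nn.Module):
    # Emotion classifier attached through the gradient-reversal layer, so the
    # upstream encoder is pushed to remove emotion cues from its embedding.
    def __init__(self, hidden_dim=256, num_emotions=5, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_emotions),
        )

    def forward(self, embedding):
        reversed_emb = GradientReversal.apply(embedding, self.lambd)
        return self.classifier(reversed_emb)

# Hypothetical usage inside a TTS training step (speaker_encoder, loss_tts,
# reference_mel, and emotion_labels are assumed names, not from the paper):
# emb = speaker_encoder(reference_mel)
# logits = AdversarialEmotionClassifier()(emb)
# loss_adv = nn.functional.cross_entropy(logits, emotion_labels)
# total_loss = loss_tts + loss_adv

In a sketch like this, minimizing the classifier's cross-entropy while the reversed gradients flow back into the encoder makes the embedding less predictive of emotion, which is one common way to realize domain adversarial learning.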

