DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech – A Study between English and Mandarin

09/02/2023
by   Tao Li, et al.

While cross-lingual TTS based on monolingual corpora has improved significantly in recent work, the generated cross-lingual speech still suffers from a foreign accent, which limits its naturalness. Moreover, current cross-lingual methods do not model emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving emotional expressiveness, the terminal distribution of the forward diffusion process is parameterized as a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder conditioned on the emotion embedding. To address the weakened emotional expressiveness caused by speaker disentanglement in the emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn a speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling of speaker and emotion in the reverse diffusion process, further improving emotional expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the effectiveness of OP-EDM in learning speaker-irrelevant but emotion-discriminative embeddings.
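The abstract does not spell out how OP-EDM is formulated, so the following is only a generic illustration of what "orthogonal projection" means for embeddings: subtracting from one vector its component along another, so that the remainder is orthogonal to it. The function names, the toy three-dimensional vectors, and the idea of projecting the emotion embedding against a single speaker direction are all assumptions made for illustration, not the authors' module.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(emotion, speaker):
    """Return `emotion` minus its component along `speaker`.

    Hypothetical helper: a plain vector projection, not the paper's OP-EDM.
    """
    denom = dot(speaker, speaker)
    if denom == 0.0:
        # A zero speaker vector defines no direction to remove.
        return list(emotion)
    scale = dot(emotion, speaker) / denom
    return [e - scale * s for e, s in zip(emotion, speaker)]


# Toy embeddings (assumed values, for illustration only).
emotion_emb = [0.8, 0.3, -0.5]
speaker_emb = [1.0, 0.0, 1.0]

residual = project_out(emotion_emb, speaker_emb)
# The residual carries no component along the speaker direction.
print(abs(dot(residual, speaker_emb)) < 1e-9)
```

In this toy setting the residual keeps whatever part of the emotion vector is not explained by the speaker direction, which is the intuition behind a "speaker-irrelevant" embedding; a learned module would additionally need a training objective that keeps the result emotion-discriminative.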

research
11/07/2022

ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

Speech representation learning has improved both speech understanding an...
research
10/14/2021

Revisiting IPA-based Cross-lingual Text-to-speech

International Phonetic Alphabet (IPA) has been widely used in cross-ling...
research
05/12/2020

AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN

This paper investigates how to leverage a DurIAN-based average model to ...
research
06/29/2023

Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data

We propose a method for speech-to-speech emotion-preserving translation t...
research
12/21/2020

Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network

By using deep learning approaches, Speech Emotion Recognition (SER) on ...
research
04/10/2018

An Estimation of Favorite Value in Emotion Generating Calculation by Fuzzy Petri Net

Emotion Generating Calculations (EGC) method based on the Emotion Elicit...
research
05/31/2019

Crowdsourcing and Validating Event-focused Emotion Corpora for German and English

Sentiment analysis has a range of corpora available across multiple lang...
