Emotional Voice Conversion Using Multitask Learning with Text-to-Speech

11/11/2019
by   Tae-Ho Kim, et al.

Voice conversion (VC) is the task of transforming a person's voice into a different style while preserving its linguistic content. Previous state-of-the-art VC systems are based on sequence-to-sequence (seq2seq) models, which can corrupt linguistic information. One prior attempt to overcome this used textual supervision, but it requires explicit alignment, which forfeits the benefit of the seq2seq model. In this paper, a voice converter trained with multitask learning alongside text-to-speech (TTS) is presented. The embedding space of a seq2seq-based TTS model carries abundant textual information, and the role of the TTS decoder is to convert that embedding space into speech, which is the same role the decoder plays in VC. In the proposed model, the whole network is trained to minimize the combined VC and TTS losses. Through multitask learning, the VC model is expected to capture more linguistic information and to retain training stability. VC experiments were performed on a male Korean emotional text-speech dataset, and the results show that multitask learning helps preserve linguistic content in VC.
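The joint objective described in the abstract can be sketched as follows. This is a hypothetical minimal illustration, not the paper's implementation: a speech encoder (the VC path) and a text encoder (the TTS path) both feed a shared decoder, and the whole network is trained to minimize the sum of the two reconstruction losses. All module choices, dimensions, and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedDecoderMultitask(nn.Module):
    """Toy multitask model: VC and TTS encoders share one decoder."""

    def __init__(self, embed_dim=64, mel_dim=80, vocab_size=100):
        super().__init__()
        # VC path: encode source-style speech into the shared embedding space
        self.speech_encoder = nn.GRU(mel_dim, embed_dim, batch_first=True)
        # TTS path: embed text tokens into the same embedding space
        self.text_encoder = nn.Embedding(vocab_size, embed_dim)
        # Shared decoder: maps the embedding space back to mel spectrograms
        self.decoder = nn.GRU(embed_dim, mel_dim, batch_first=True)

    def forward_vc(self, src_mel):
        h, _ = self.speech_encoder(src_mel)
        out, _ = self.decoder(h)
        return out

    def forward_tts(self, text_ids):
        h = self.text_encoder(text_ids)
        out, _ = self.decoder(h)
        return out

model = SharedDecoderMultitask()
l1 = nn.L1Loss()

# Dummy batch: 2 utterances, 50 frames, 80-dim mel features
src_mel = torch.randn(2, 50, 80)               # source-style speech (VC input)
tgt_mel = torch.randn(2, 50, 80)               # target-style speech (shared target)
text_ids = torch.randint(0, 100, (2, 50))      # text transcript (TTS input)

# Multitask objective: minimize the VC loss and the TTS loss together,
# so gradients from both tasks update the shared decoder.
loss_vc = l1(model.forward_vc(src_mel), tgt_mel)
loss_tts = l1(model.forward_tts(text_ids), tgt_mel)
loss = loss_vc + loss_tts
loss.backward()
```

Because the decoder is shared, the TTS loss regularizes it toward producing speech from a text-rich embedding space, which is the mechanism the abstract credits for preserving linguistic content in VC.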

Related research

03/31/2021
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training
Emotional voice conversion (EVC) aims to change the emotional state of a...

10/07/2021
Sequence-To-Sequence Voice Conversion using F0 and Time Conditioning and Adversarial Learning
This paper presents a sequence-to-sequence voice conversion (S2S-VC) alg...

10/28/2020
Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset
Emotional voice conversion aims to transform emotional prosody in speech...

03/29/2022
An Overview Analysis of Sequence-to-Sequence Emotional Voice Conversion
Emotional voice conversion (EVC) focuses on converting a speech utteranc...

04/22/2017
Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks
Automatically assessing emotional valence in human speech has historical...

02/21/2020
Modelling Latent Skills for Multitask Language Generation
We present a generative model for multitask conditional language generat...

03/02/2022
U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity
We propose U-Singer, the first multi-singer emotional singing voice synt...
