Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

by Kun Zhou, et al.

Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. In this paper, we propose a novel two-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed EVC framework leverages text-to-speech (TTS), since the two tasks share the common goal of generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style and linguistic content. In stage 2, we perform emotion training with a limited amount of emotional speech data to learn how to disentangle emotional style and linguistic information from the speech. The proposed framework can perform both spectrum and prosody conversion, and it achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
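
The two-stage schedule described in the abstract can be illustrated with a deliberately simplified sketch. This is not the paper's sequence-to-sequence model: a one-dimensional linear regressor stands in for the network, synthetic numbers stand in for speech features, and every name and value here (`make_data`, `train`, `mse`, the corpus sizes, the learning rate, the epoch counts) is an assumption made for illustration. It only shows the general pattern of initializing on a large generic corpus and then continuing training on a small target corpus from those parameters.

```python
import random


def make_data(n, slope, intercept, seed):
    """Toy 1-D regression pairs standing in for (input, target) features."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x + intercept) for x in xs]


def train(params, data, lr, epochs):
    """Full-batch gradient descent on mean squared error for y = w*x + b."""
    w, b = params
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += 2.0 * err * x
            grad_b += 2.0 * err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def mse(params, data):
    """Mean squared error of the linear model on a dataset."""
    w, b = params
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


# Assumption: a large generic corpus plays the role of the multi-speaker TTS data.
large_corpus = make_data(500, slope=2.0, intercept=0.0, seed=0)
# Assumption: a small, related corpus plays the role of the limited emotional data.
small_corpus = make_data(8, slope=2.2, intercept=0.3, seed=1)
held_out = make_data(200, slope=2.2, intercept=0.3, seed=2)

# Stage 1: initialize on the large corpus.
stage1 = train((0.0, 0.0), large_corpus, lr=0.1, epochs=200)
# Stage 2: continue training on the small corpus, starting from stage-1 parameters.
two_stage = train(stage1, small_corpus, lr=0.1, epochs=20)
# Reference: the same small-corpus budget without stage 1.
scratch = train((0.0, 0.0), small_corpus, lr=0.1, epochs=20)

print("two-stage held-out MSE:", mse(two_stage, held_out))
print("from-scratch held-out MSE:", mse(scratch, held_out))
```

In this toy setting the stage-2 run starts close to the target solution, so the same small training budget leaves it with a lower held-out error than the run that starts from zero. Whether and how much this carries over to real speech models depends on the architecture and data, which the sketch does not attempt to model.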

