MASS: Multi-task Anthropomorphic Speech Synthesis Framework

05/10/2021
by Jinyin Chen, et al.

Text-to-Speech (TTS) synthesis plays an important role in human-computer interaction. Currently, most TTS technologies focus on the naturalness of speech, that is, on making synthesized speech sound human. However, the key tasks of expressing emotion and conveying speaker identity are ignored, which limits the application scenarios of TTS synthesis technology. To make synthesized speech more realistic and to expand its application scenarios, we propose a multi-task anthropomorphic speech synthesis framework (MASS), which can synthesize speech from text with a specified emotion and speaker identity. The MASS framework consists of a base TTS module and two novel voice conversion modules: the emotional voice conversion module and the speaker voice conversion module. We propose a deep emotion voice conversion model (DEVC) and a deep speaker voice conversion model (DSVC), both based on convolutional residual networks, which address the problem of feature loss during voice conversion. Training these models does not require parallel datasets, and the models are capable of many-to-many voice conversion. Results from the emotional voice conversion, speaker voice conversion, and multi-task speech synthesis experiments show that DEVC and DSVC convert speech effectively. The quantitative and qualitative evaluations of the multi-task speech synthesis experiments show that MASS can effectively synthesize speech with the specified text, emotion, and speaker identity.
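
The three-module structure described in the abstract can be pictured as a simple composition of stages. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `MassPipeline`, `synthesize`, and `make_demo_pipeline` are hypothetical, the stub modules stand in for the base TTS model, DEVC, and DSVC, and the order in which the two conversion modules are applied (emotion first, then speaker) is assumed, since the abstract does not state it.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for audio samples; a real system would use arrays or spectrograms.
Waveform = List[float]


@dataclass
class MassPipeline:
    """Hypothetical composition of a base TTS module and two conversion modules."""
    tts: Callable[[str], Waveform]                    # text -> neutral speech
    emotion_vc: Callable[[Waveform, str], Waveform]   # placeholder for DEVC
    speaker_vc: Callable[[Waveform, str], Waveform]   # placeholder for DSVC

    def synthesize(self, text: str, emotion: str, speaker: str) -> Waveform:
        neutral = self.tts(text)
        # Assumed order: emotion conversion first, then speaker conversion.
        emotional = self.emotion_vc(neutral, emotion)
        return self.speaker_vc(emotional, speaker)


def make_demo_pipeline(log: List[str]) -> MassPipeline:
    """Build a pipeline from stubs that only record the order of calls."""

    def tts(text: str) -> Waveform:
        log.append("tts")
        return [0.0] * len(text)  # one dummy sample per character

    def emotion_vc(wave: Waveform, emotion: str) -> Waveform:
        log.append(f"emotion:{emotion}")
        return list(wave)  # a real model would transform the signal

    def speaker_vc(wave: Waveform, speaker: str) -> Waveform:
        log.append(f"speaker:{speaker}")
        return list(wave)

    return MassPipeline(tts, emotion_vc, speaker_vc)


if __name__ == "__main__":
    calls: List[str] = []
    pipeline = make_demo_pipeline(calls)
    audio = pipeline.synthesize("hello", emotion="happy", speaker="speaker_a")
    print(len(audio), calls)
```

Keeping each module behind a plain callable reflects the modular design the abstract describes: the base TTS stage and either conversion stage could be swapped or retrained independently.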


research
05/13/2020

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

Emotional voice conversion aims to convert the emotion of the speech fro...
research
11/25/2021

V2C: Visual Voice Cloning

Existing Voice Cloning (VC) tasks aim to convert a paragraph text to a s...
research
07/08/2021

Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer

Traditional voice conversion (VC) has been focused on speaker identity co...
research
09/20/2021

"Hello, It's Me": Deep Learning-based Speech Synthesis Attacks in the Real World

Advances in deep learning have introduced a new wave of voice synthesis ...
research
01/09/2021

Emotion transplantation through adaptation in HMM-based speech synthesis

This paper proposes an emotion transplantation method capable of modifyi...
research
11/13/2021

Direct Noisy Speech Modeling for Noisy-to-Noisy Voice Conversion

Beyond the conventional voice conversion (VC) where the speaker informat...
research
03/15/2022

Text-free non-parallel many-to-many voice conversion using normalising flows

Non-parallel voice conversion (VC) is typically achieved using lossy rep...
