
Cloning one's voice using very limited data in the wild

10/07/2021
by Dongyang Dai, et al.

With the increasing popularity of speech synthesis products, the industry has raised two requirements for personalized speech synthesis: (1) how to clone a person's voice from low-resource, easily accessible data, and (2) how to clone a person's voice while controlling style and prosody. To address both problems, we propose the Hieratron model framework, in which prosody and timbre are modeled separately by two modules, so that timbre and the other characteristics of the audio can be controlled independently when generating speech. In practice, for very limited target-speaker data in the wild, Hieratron shows clear advantages over the traditional method: in addition to allowing control of the style and language of the generated speech, it improves the mean opinion score for speech quality by more than 0.2 points.

05/26/2020

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

With the popularity of deep neural network, speech synthesis task has ac...
12/03/2022

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating...
10/29/2019

Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System

In this study, we define the identity of the singer with two independent...
07/20/2017

VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop

We present a new neural text to speech (TTS) method that is able to tran...
02/19/2018

Voice Impersonation using Generative Adversarial Networks

Voice impersonation is not the same as voice transformation, although th...
09/01/2022

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

In this work, we address the problem of generating speech from silent li...
12/14/2020

Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis

The style of the speech varies from person to person and every person ex...