nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech

02/22/2022
by   Botao Zhao, et al.

Multi-speaker text-to-speech (TTS) with only a small amount of adaptation data is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Rather than using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of Z is approximated by that of another variable conditioned on the reference mel-spectrogram and phonemes. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech that is similar to the target speaker from only one adaptation utterance.
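
The abstract does not give the form of the distributions involved. As a rough illustration only, the sketch below assumes that both the distribution of Z and its approximating distribution are diagonal Gaussians, as is common in conditional variational autoencoders. Under that assumption it shows how a latent vector could be sampled with the reparameterization trick and how the mismatch between the two distributions could be measured with a KL divergence. The function names and the toy values are hypothetical and are not taken from nnSpeech.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total


# Toy 4-dimensional example (assumed values, for illustration only).
rng = random.Random(0)
mu_q, log_var_q = [0.3, -0.1, 0.0, 0.5], [0.0, -0.2, 0.1, 0.0]
mu_p, log_var_p = [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]

z = sample_latent(mu_q, log_var_q, rng)
kl = kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p)
```

In a setup like this, the KL term would typically be minimized during training so that the two distributions agree, which is one plausible reading of "approximated by" in the abstract.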

research
04/01/2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

Adaptive text to speech (TTS) can synthesize new voices in zero-shot sce...
research
12/03/2022

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating...
research
03/04/2020

A Robust Speaker Clustering Method Based on Discrete Tied Variational Autoencoder

Recently, the speaker clustering model based on aggregation hierarchy cl...
research
05/09/2023

Zero-shot personalized lip-to-speech synthesis with face image based voice control

Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speec...
research
03/29/2022

Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Training a text-to-speech (TTS) model requires a large scale text labele...
research
02/27/2023

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

The goal of this work is zero-shot text-to-speech synthesis, with speaki...
research
10/22/2020

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

In this paper, we present AISHELL-3, a large-scale and high-fidelity mul...
