Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice

12/13/2018
by   Yan Deng, et al.

Neural TTS has been shown to generate high-quality synthesized speech. In this paper, we investigate the multi-speaker latent space to improve neural TTS in two ways: adapting the system to a new speaker with only several minutes of speech, and enhancing a premium voice by using data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built with speaker information embedded in both the spectral and the speaker latent space. The experimental results show that, with less than 5 minutes of training data from a new speaker, the new model achieves a MOS of 4.16 in naturalness and 4.64 in speaker similarity, close to human recordings (4.74). For a well-trained premium voice, we achieve a MOS of 4.5 on out-of-domain texts, which is comparable to the MOS of 4.58 for professional recordings and significantly outperforms the single-speaker result of 4.28.


