GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints

08/16/2021
by Ji-Hoon Kim, et al.

Few-shot speaker adaptation is a Text-to-Speech (TTS) task that aims to reproduce a novel speaker's voice from a small amount of training data. Although numerous few-shot speaker adaptation systems have been proposed, a gap in speaker similarity to the target speaker remains, and its size depends on the amount of data available. To bridge this gap, we propose GC-TTS, which achieves high-quality speaker adaptation with significantly improved speaker similarity. Specifically, we leverage two geometric constraints to learn discriminative speaker representations. A TTS model is first pre-trained on base speakers with a sufficient amount of data, and then fine-tuned on a few minutes of data from novel speakers under the two geometric constraints. These constraints enable the model to extract discriminative speaker embeddings from limited data, which leads to the synthesis of intelligible speech. We discuss and verify the effectiveness of GC-TTS by comparing it with popular baseline methods. The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
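The abstract does not say what the two geometric constraints are, so the following is only an illustrative sketch of what a geometric constraint on speaker embeddings could look like, not the method used in GC-TTS. It assumes a cosine-similarity formulation in which a novel speaker's embedding is pulled toward that speaker's own centroid and pushed away from the centroids of other speakers. The function names, the `margin` parameter, and the loss form are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def geometric_loss(embedding, own_centroid, other_centroids, margin=0.2):
    """Hypothetical geometric constraint on a speaker embedding.

    Assumption (not from the paper): the loss is the sum of
      - a pull term, 1 - cos(embedding, own centroid), and
      - a push term, a hinge that penalizes cosine similarity above
        `margin` to any other speaker's centroid.
    """
    pull = 1.0 - cosine(embedding, own_centroid)
    push = sum(max(0.0, cosine(embedding, c) - margin) for c in other_centroids)
    return pull + push


# An embedding aligned with its own centroid and orthogonal to the other
# speaker's centroid incurs no penalty.
well_separated = geometric_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])

# An embedding aligned with the wrong speaker's centroid is penalized by
# both terms.
confused = geometric_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
```

In a fine-tuning setup of the kind the abstract describes, a term like this would be added to the usual TTS reconstruction loss. How the actual constraints are defined and weighted is specified in the full text, not here.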


research
11/08/2018

Speaker-adaptive neural vocoders for statistical parametric speech synthesis systems

This paper proposes speaker-adaptive neural vocoders for statistical par...
research
10/12/2021

Adapting TTS models For New Speakers using Transfer Learning

Training neural text-to-speech (TTS) models for a new speaker typically ...
research
11/07/2021

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

Personalizing a speech synthesis system is a highly desired application,...
research
10/28/2022

Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation

Adapting a neural text-to-speech (TTS) model to a target speaker typical...
research
10/08/2021

A study on the efficacy of model pre-training in developing neural text-to-speech system

In the development of neural text-to-speech systems, model pre-training ...
research
12/13/2018

Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice

Neural TTS has shown it can generate high quality synthesized speech. In...
research
08/02/2021

Speaker Adaptation with Continuous Vocoder-based DNN-TTS

Traditional vocoder-based statistical parametric speech synthesis can be...
