GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

06/29/2021
by Jinhyeok Yang, et al.

Recent advances in neural multi-speaker text-to-speech (TTS) models have made it possible to generate reasonably good speech with a single model and to synthesize the voice of a speaker with limited training data. Fine-tuning the multi-speaker model on the target speaker's data can improve quality; however, a gap to real speech samples remains, and the resulting model is speaker-dependent. In this work, we propose GANSpeech, a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for the feature matching loss used in adversarial training. In subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models and achieved a higher MOS than the speaker-specific fine-tuned FastSpeech2.
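The abstract does not spell out how the feature matching loss is scaled, so the following is only a minimal sketch of one plausible scheme. It assumes the feature matching loss is the mean L1 distance between discriminator feature maps for real and generated speech, and that the scale is the ratio of the reconstruction loss to the feature matching loss, treated as a constant during backpropagation. The function names (`feature_matching_loss`, `auto_scale`) and the ratio rule itself are illustrative assumptions, not the authors' stated method.

```python
from typing import Sequence


def l1_mean(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized feature vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature vectors must be non-empty and the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(
    real_feats: Sequence[Sequence[float]],
    fake_feats: Sequence[Sequence[float]],
) -> float:
    """Average the per-layer L1 distance over all discriminator layers.

    Each element of real_feats / fake_feats is one (flattened) layer output.
    """
    if len(real_feats) != len(fake_feats) or len(real_feats) == 0:
        raise ValueError("need the same, non-zero number of layers")
    per_layer = [l1_mean(r, f) for r, f in zip(real_feats, fake_feats)]
    return sum(per_layer) / len(per_layer)


def auto_scale(recon_loss: float, fm_loss: float, eps: float = 1e-8) -> float:
    """Hypothetical automatic scale: recon_loss / fm_loss.

    In a real training loop this value would be detached from the graph
    (treated as a constant), so it only rebalances the two loss terms.
    """
    return recon_loss / (fm_loss + eps)


# Toy discriminator features for two layers (made-up numbers).
real = [[1.0, 2.0], [0.0, 4.0]]
fake = [[0.0, 2.0], [2.0, 4.0]]

fm = feature_matching_loss(real, fake)
scale = auto_scale(recon_loss=1.5, fm_loss=fm)
scaled_fm = scale * fm  # now on the same magnitude as the reconstruction loss
```

Under this assumption the scaled feature matching term always matches the magnitude of the reconstruction loss, which removes the need to hand-tune a fixed weight for it.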

research
11/08/2018

Speaker-adaptive neural vocoders for statistical parametric speech synthesis systems

This paper proposes speaker-adaptive neural vocoders for statistical par...
research
11/01/2022

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

Fine-tuning is a popular method for adapting text-to-speech (TTS) models...
research
02/07/2023

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that...
research
05/18/2023

FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs

This paper presents FastFit, a novel neural vocoder architecture that re...
research
08/04/2017

Improving Speaker-Independent Lipreading with Domain-Adversarial Training

We present a Lipreading system, i.e. a speech recognition system using o...
research
10/24/2022

Weak-Supervised Dysarthria-invariant Features for Spoken Language Understanding using an FHVAE and Adversarial Training

The scarcity of training data and the large speaker variation in dysarth...
research
06/24/2021

Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Whilst recent neural text-to-speech (TTS) approaches produce high-qualit...
