AILTTS: Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

04/05/2022
by   Hyungchan Yoon, et al.
0

The quality of end-to-end neural text-to-speech (TTS) systems highly depends on the reliable estimation of intermediate acoustic features from text inputs. To reduce the complexity of the speech generation process, several non-autoregressive TTS systems directly find a mapping relationship between text and waveforms. However, the generation quality of these system is unsatisfactory due to the difficulty in modeling the dynamic nature of prosodic information. In this paper, we propose an effective prosody predictor that successfully replicates the characteristics of prosodic features extracted from mel-spectrograms. Specifically, we introduce a generative model-based conditional discriminator to enable the estimated embeddings have highly informative prosodic features, which significantly enhances the expressiveness of generated speech. Since the estimated embeddings obtained by the proposed method are highly correlated with acoustic features, the time-alignment of input texts and intermediate features is greatly simplified, which results in faster convergence. Our proposed model outperforms several publicly available models based on various objective and subjective evaluation metrics, even using a relatively small number of parameters.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/31/2022

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

In neural text-to-speech (TTS), two-stage system or a cascade of separat...
research
12/12/2018

FPUAS : Fully Parallel UFANS-based End-to-End Acoustic System with 10x Speed Up

A lightweight end-to-end acoustic system is crucial in the deployment of...
research
11/21/2019

Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features

This paper presents a simple yet effective method to achieve prosody tra...
research
09/30/2021

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 ...
research
05/21/2019

Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems

In this paper, we propose a high-quality generative text-to-speech (TTS)...
research
07/07/2022

End-to-end Speech-to-Punctuated-Text Recognition

Conventional automatic speech recognition systems do not produce punctua...
research
04/03/2022

On incorporating social speaker characteristics in synthetic speech

In our previous work, we derived the acoustic features, that contribute ...

Please sign up or login with your details

Forgot password? Click here to reset