Towards Robust FastSpeech 2 by Modelling Residual Multimodality

06/02/2023
by Fabian Kögel, et al.

State-of-the-art non-autoregressive text-to-speech (TTS) models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech. For expressive speech datasets, however, we observe characteristic audio distortions. We demonstrate that these artefacts are introduced into the vocoder reconstruction by over-smooth mel-spectrogram predictions, which are induced by the choice of mean-squared-error (MSE) loss for training the mel-spectrogram decoder. With MSE loss, FastSpeech 2 is limited to learning conditional averages of the training distribution, which might not lie close to a natural sample if the distribution still appears multimodal after all conditioning signals have been taken into account. To alleviate this problem, we introduce TVC-GMM, a mixture model of Trivariate-Chain Gaussian distributions, to model the residual multimodality. TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets, as shown by both objective and subjective evaluation.
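
The argument about conditional averages can be illustrated with a toy example. The sketch below is not the authors' TVC-GMM and not FastSpeech 2 code; it assumes a single hypothetical mel-bin value that, for identical conditioning, comes from one of two modes. All names and numbers (`modes`, `gmm_nll`, the mode locations and the noise level) are illustrative assumptions. It shows that the MSE-optimal prediction lands between the modes, where no training sample lies, whereas a generic one-dimensional Gaussian mixture assigns a lower negative log-likelihood to the same data than a single Gaussian does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy target: one mel-bin value that stays bimodal even
# after conditioning (assumed modes at -2 and +2, small Gaussian noise).
modes = np.array([-2.0, 2.0])
samples = rng.choice(modes, size=2000) + 0.1 * rng.standard_normal(2000)

# Grid search for the constant prediction that minimises the MSE.
# Analytically this is the sample mean, i.e. the conditional average.
candidates = np.linspace(-3.0, 3.0, 601)
mse = ((samples[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mse_optimum = candidates[np.argmin(mse)]

# Fraction of real samples within 0.5 of the MSE-optimal prediction.
near_optimum = np.mean(np.abs(samples - mse_optimum) < 0.5)


def gmm_nll(x, weights, means, stds):
    """Mean negative log-likelihood of x under a 1-D Gaussian mixture."""
    x = x[:, None]
    log_comp = (
        np.log(weights)
        - np.log(stds)
        - 0.5 * np.log(2.0 * np.pi)
        - 0.5 * ((x - means) / stds) ** 2
    )
    # Numerically stable log-sum-exp over the mixture components.
    peak = log_comp.max(axis=1, keepdims=True)
    log_mix = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return -log_mix.mean()


# Two-component mixture placed on the assumed modes versus a single
# Gaussian fitted by moment matching.
nll_two = gmm_nll(samples, np.array([0.5, 0.5]), modes, np.array([0.1, 0.1]))
nll_one = gmm_nll(
    samples,
    np.array([1.0]),
    np.array([samples.mean()]),
    np.array([samples.std()]),
)

print(f"MSE-optimal prediction: {mse_optimum:.2f}")
print(f"Share of samples near that prediction: {near_optimum:.3f}")
print(f"NLL single Gaussian: {nll_one:.3f}, two-component mixture: {nll_two:.3f}")
```

In this toy setting the mixture parameters are fixed by hand rather than learned; in a TTS model they would presumably be predicted by the decoder per spectrogram bin, which is outside the scope of this sketch.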

research
09/27/2021

FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis

Recently, non-autoregressive neural vocoders have provided remarkable pe...
research
06/01/2018

Performance Based Cost Functions for End-to-End Speech Separation

Recent neural network strategies for source separation attempt to model ...
research
04/23/2021

Improving Neural Silent Speech Interface Models by Adversarial Training

Besides the well-known classification task, these days neural networks a...
research
10/24/2022

High Fidelity Neural Audio Compression

We introduce a state-of-the-art real-time, high-fidelity, audio codec le...
research
12/08/2019

A Supervised Speech enhancement Approach with Residual Noise Control for Voice Communication

For voice communication, it is important to extract the speech from its ...
research
02/26/2022

Revisiting Over-Smoothness in Text to Speech

Non-autoregressive text to speech (NAR-TTS) models have attracted much a...
research
08/02/2018

Dirichlet Mixture Model based VQ Performance Prediction for Line Spectral Frequency

In this paper, we continue our previous work on the Dirichlet mixture mo...
