1 Introduction
In recent years, Text-to-Speech (TTS) technology has advanced by leaps and bounds, from the concatenative approach [hunt1996unit, ze2013statistical, merritt2016deep] to the statistical parametric approach [zen2009statistical, tokuda2013speech, zen2015unidirectional, wu2016investigating], and on to deep learning [wang2017tacotron]. Modern TTS systems produce high-quality, natural speech that rivals human vocal production [taylor2009text, zen2009statistical]. TTS is also widely used in human-machine communication, such as robotics, call centres, games, entertainment, and healthcare applications. With the advent of deep learning, neural approaches to TTS have become mainstream, such as Tacotron [wang2017tacotron], Tacotron2 [shen2018natural] and their variants [skerry2018towards, hsu2018hierarchical, habib2019semi, liu2017mongolian, liu2019teacher]. End-to-end TTS models are based on the encoder-decoder framework, which has been widely adopted for sequence generation tasks such as speech recognition [graves2014towards, bahdanau2016end, amodei2016deep, chen2019end] and neural machine translation [bahdanau2014neural, johnson2017google]. Tacotron-based TTS typically consists of two modules: 1) feature prediction, and 2) waveform generation. The feature prediction network produces frequency-domain acoustic features, while the waveform generation module converts those frequency-domain acoustic features into a time-domain waveform.
A typical Tacotron implementation adopts the Griffin-Lim algorithm [griffin1984signal, masuyama2019deep] for phase reconstruction, and only uses a loss function derived from the amplitude spectrogram in the frequency domain. Such a loss function does not take the resulting waveform into consideration during optimization. As a result, there is a mismatch between the Tacotron optimization objective and the expected waveform. We note that a similar mismatch also exists in many other speech processing tasks, such as speech separation [wang2018supervised], where incorporating a time-domain loss function [wang2015deep] improves the output speech quality. More recently, deep learning approaches to speech enhancement with time-domain raw waveform outputs [fu2017raw, liu2019multichannel] have also been investigated. However, a time-domain loss function has not been well explored in speech synthesis, which is the focus of this paper.
Tacotron2 [shen2018natural] has been proposed to achieve high-quality synthesized voice. It addresses the waveform optimization problem by using a WaveNet-based neural vocoder [oord2016wavenet, berrakjournal, sisman2018adaptive, berrak_is18, hayashi2017investigation]. WaveNet avoids the artifacts and deterioration caused by deterministic vocoders; it generates time-domain waveform samples conditioned on the predicted mel-spectrum features. Although Tacotron2 allows end-to-end learning of TTS directly from character sequences and speech waveforms, its feature prediction network is trained independently of the WaveNet vocoder. At run-time, the feature prediction network and the WaveNet vocoder are artificially joined together. As a result, the framework suffers from a mismatch between frequency-domain acoustic features and the time-domain waveform. It is reported that samples generated from WaveNet occasionally become unstable, especially when less accurately predicted acoustic features are used as the local conditioning parameters. To overcome this mismatch, we propose a joint time-frequency domain loss for TTS that effectively improves the synthesized voice quality.
In this paper, we propose to add a time-domain loss function to the Griffin-Lim/ISTFT output of the Tacotron-based TTS model at training time. In other words, we use both a frequency-domain loss and a time-domain loss for the training of the feature prediction model. We hypothesize that, under the supervision of the time-domain loss, the feature prediction network will compensate for the possible artifacts that the Griffin-Lim process may introduce. We use Griffin-Lim iterations followed by ISTFT to transform frequency-domain features to a time-domain waveform, and use the scale-invariant signal-to-distortion ratio (SI-SDR) [le2019sdr, kolbaek2019loss] to measure the quality of the time-domain waveform. Our proposed idea shares a similar motivation with [zhao2018wasserstein] in terms of the use of a waveform loss. However, it differs from [zhao2018wasserstein] in many ways; for example, we study Tacotron-based TTS, while [zhao2018wasserstein] mostly deals with Wasserstein GAN-based TTS.
The main contributions of this paper include: 1) we study the use of a time-domain loss for speech synthesis; 2) we improve the Tacotron-based TTS framework by proposing a new training scheme based on a joint time-frequency domain loss; and 3) we propose to use the SI-SDR metric to measure the distortion of the time-domain waveform. The novel training scheme optimizes the frequency-domain acoustic features in a way that leads to a better time-domain waveform. To our best knowledge, this is the first implementation of a joint time-frequency domain training scheme for the Tacotron-based TTS framework.
This paper is organized as follows: In Section 2, we present the Tacotron-based baseline TTS system. In Section 3, we present the novel idea of the joint time-frequency domain loss, and formulate the training and run-time processes. We report the experimental results in Section 4. Section 5 concludes the study.
2 Baseline: Tacotron-based TTS
In this paper, we use a Tacotron-based framework [shen2018natural] as the reference baseline. We illustrate the overall architecture in Figure 1, which includes the feature prediction model, consisting of an encoder and an attention-based decoder, and the Griffin-Lim algorithm for waveform reconstruction. The encoder (blue box in Figure 1) consists of two components: a CNN-based module with 3 convolution layers, and an LSTM-based module with a bidirectional LSTM layer. The decoder (pale yellow box in Figure 1) consists of four components: a 2-layer pre-net, 2 LSTM layers, a linear projection layer and a 5-convolution-layer post-net. The decoder is a standard autoregressive recurrent neural network that generates the mel-spectrum features and stop tokens frame by frame.
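The frame-by-frame generation described above can be sketched as a generic autoregressive loop. The step_fn and stop_fn callables below are hypothetical stand-ins for the real pre-net/LSTM/projection stack and the stop-token predictor:

```python
def decode_loop(step_fn, stop_fn, go_frame, max_steps=1000):
    """Autoregressive decoding: each output frame is fed back as the
    input for the next step, until the stop token fires (or a cap is hit).

    step_fn: previous frame -> next mel frame (pre-net -> LSTMs ->
             linear projection in the real decoder).
    stop_fn: frame -> bool, a stand-in for the stop-token prediction.
    """
    frames, frame = [], go_frame
    for _ in range(max_steps):
        frame = step_fn(frame)
        frames.append(frame)
        if stop_fn(frame):  # end of utterance
            break
    return frames
```

In the real model, the post-net then refines the whole predicted mel-spectrum sequence at once.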
During training, we optimize the feature prediction model to minimize the frequency-domain loss between the generated mel-spectrum features Ŷ and the target mel-spectrum features Y.
3 WaveTTS
In this section, we study the use of the newly proposed time-domain loss function for Tacotron-based TTS. By applying a new training strategy that takes into account both time-domain and frequency-domain loss functions, we effectively reduce the mismatch between the frequency-domain features and the time-domain waveform, and improve the output speech quality. The Griffin-Lim algorithm and the SI-SDR metric are used to compute the proposed loss term. The proposed framework is called WaveTTS hereafter.
3.1 Time-domain and Frequency-domain Loss Functions
In WaveTTS, we define two objective functions during training: 1) the frequency-domain loss, denoted as L_f, which is calculated on the mel-spectrum features in a similar way to that described in [shen2018natural]; and 2) the proposed time-domain loss, denoted as L_t, which is obtained at the waveform level at the output of the Griffin-Lim iteration that estimates the time-domain signal from the mel-spectrum features. The two objective functions are illustrated in Figure 2.
The entire process can be formulated as follows. The encoder takes the character sequence X as input and converts the one-hot vectors to a continuous feature representation H:

H = Encoder(X)    (1)
The decoder outputs a mel-spectrum feature ŷ_t at each step t:

ŷ_t = Decoder(ŷ_{t−1}, c_t),  c_t = Attention(H, ŷ_{t−1})    (2)

where Attention(·) represents the function that calculates the context vector using the location-sensitive attention mechanism [vaswani2017attention].
We first calculate L_f in a similar way to that in [shen2018natural]. L_f ensures that the generated mel-spectrum is close to the natural mel-spectrum, and is given as follows:

L_f = (1/M) Σ_{m=1}^{M} ||Y_m − Ŷ_m||²    (3)

where M is the total number of sequences in the training data. This mean squared error loss minimizes the sum of all the squared differences between the true and the predicted mel-spectrum values.
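As a minimal NumPy sketch (batching and padding masks omitted, and using a per-element mean rather than the per-sequence sum), the frequency-domain loss can be computed as:

```python
import numpy as np

def frequency_loss(y_hat, y):
    # Mean squared error between predicted and target mel-spectra
    # (Eq. 3, up to a constant normalization factor).
    # y_hat, y: arrays of shape (frames, mel_bins) with equal lengths.
    return np.mean((y - y_hat) ** 2)
```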
We then propose the use of a time-domain loss L_t, which is applied to the Griffin-Lim output. This reduces the mismatch between the optimized frequency-domain output and the actual time-domain waveform [zhao2018wasserstein]. The implementation details of L_t are explained in Section 3.2.
Overall, the proposed WaveTTS framework has two loss functions: L_f minimizes the loss between the converted and original mel-spectra, while L_t minimizes this loss at the waveform level. We add a weighting coefficient α to balance the two losses. The training criterion of the whole model is defined as:

L = L_f + α · L_t    (4)
Algorithm 1 shows the complete training process of the proposed WaveTTS. The WaveTTS model predicts the mel-spectrum features Ŷ from the given input character sequence X, and then converts both the estimated and target mel-spectra to time-domain signals ŝ and s using the Griffin-Lim based ISTFT algorithm (blue content in Algorithm 1). Finally, the joint loss function given in Equation 4 is used to optimize the WaveTTS model.
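The training criterion of Equation 4 is then a weighted sum of the two terms. In this sketch the time-domain term is a placeholder waveform MSE standing in for the negative SI-SDR of Section 3.2.2, and the value of α is illustrative only:

```python
import numpy as np

def joint_loss(y_hat, y, s_hat, s, alpha=0.01):
    # Eq. 4: total loss = frequency-domain loss + alpha * time-domain loss.
    l_f = np.mean((y - y_hat) ** 2)   # mel-spectrum loss (Eq. 3)
    l_t = np.mean((s - s_hat) ** 2)   # placeholder for -SI-SDR (Eq. 7)
    return l_f + alpha * l_t
```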
3.2 Implementation of the Time-domain Loss
3.2.1 Time-domain loss
We adopt the Griffin-Lim algorithm [griffin1984signal], followed by ISTFT, to generate the time-domain waveform. The Griffin-Lim algorithm has been widely used in speech synthesis [shen2018natural, tjandra2019vqvae] for its simplicity, and can be formulated as follows:

Â = f(Ŷ)    (5)

ŝ = ISTFT(GriffinLim(Â))    (6)

where Ŷ represents the predicted mel-spectrum sequences and Â represents their amplitude. f(·) is the function that calculates the amplitude of the given input mel-spectrum sequences. It is followed by the Griffin-Lim algorithm, which estimates a complex-valued spectrum while minimizing the change to the input amplitude Â; ISTFT then transforms the estimated complex-valued spectrum to a time-domain signal. The details of the Griffin-Lim algorithm are given in Algorithm 2, where P_C is the metric projection onto a set C. Here, C is the set of consistent spectra, and A is the set of spectra whose amplitude is the same as the given one.
It is worth mentioning that the Griffin-Lim algorithm usually requires many iterations at run-time to obtain a high-quality audio signal, as shown in Algorithm 2. It is an optimization process independent of Tacotron training. We would like the Tacotron feature prediction network to generate acoustic features that are not only close to those of natural speech in the frequency domain, but that also allow Griffin-Lim to produce speech close to natural speech in the time domain.
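A self-contained NumPy sketch of the Griffin-Lim loop on a linear amplitude spectrogram is given below. The window and hop sizes are illustrative, edge frames are not padded, and the real system works from mel-spectra with a matching STFT configuration:

```python
import numpy as np

N_FFT, HOP = 256, 64  # illustrative analysis settings

def stft(x):
    # Hann-windowed frames -> FFT; returns a (frames, bins) complex array.
    win = np.hanning(N_FFT)
    frames = [x[i:i + N_FFT] * win for i in range(0, len(x) - N_FFT + 1, HOP)]
    return np.fft.rfft(np.array(frames), axis=-1)

def istft(S):
    # Inverse FFT per frame, then windowed overlap-add with normalization.
    win = np.hanning(N_FFT)
    n = HOP * (S.shape[0] - 1) + N_FFT
    x, norm = np.zeros(n), np.zeros(n)
    frames = np.fft.irfft(S, n=N_FFT, axis=-1)
    for t in range(S.shape[0]):
        x[t * HOP:t * HOP + N_FFT] += frames[t] * win
        norm[t * HOP:t * HOP + N_FFT] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(A, n_iter=32, seed=0):
    # Alternate projections: onto the set of consistent spectra (spectra
    # that are the STFT of some real signal) and onto the set of spectra
    # whose amplitude equals the given A.
    rng = np.random.default_rng(seed)
    S = A * np.exp(2j * np.pi * rng.random(A.shape))  # random initial phase
    for _ in range(n_iter):
        S = stft(istft(S))                # project onto consistent spectra
        S = A * np.exp(1j * np.angle(S))  # restore the given amplitude
    return istft(S)
```

During WaveTTS training, both the predicted and the target features would pass through the same loop with the same n_iter, so that the loss compares like with like.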
Let us denote the predicted and the original mel-spectra as (Ŷ, Y). We apply Griffin-Lim and ISTFT to generate the corresponding waveforms (ŝ, s), with ŝ = ISTFT(GriffinLim(f(Ŷ))) and s = ISTFT(GriffinLim(f(Y))). We keep the same number of Griffin-Lim iterations to ensure that Griffin-Lim behaves identically for the two transformation pairs, Ŷ → ŝ and Y → s. We measure the distortion between ŝ and s with a time-domain loss that forces the speech waveform generated from the prediction network to be as close as possible to that generated from the mel-spectrum of natural speech.
3.2.2 Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
In speech synthesis, we optimize the feature prediction network to minimize the discrepancy, as measured by a loss function, between the synthesized waveform and the target natural speech. We propose a time-domain loss function, L_t, based on the scale-invariant signal-to-distortion ratio (SI-SDR). SI-SDR has been introduced as a time-domain objective measure in source separation [luo2018tasnet, bahmaninezhad2019comprehensive, venkataramani2018performance] to compare two time-domain speech signals. We adopt SI-SDR to measure the discrepancy between the generated waveform and the target natural speech. To our best knowledge, this is the first use of SI-SDR as a time-domain loss to improve TTS quality.
We note that SI-SDR is evaluated only during training and is not required at run-time inference. During training, the predicted time-domain waveform and the target speech have identical duration. Similarly, the predicted and target mel-spectra share the same frame length, which facilitates the SI-SDR calculation. As a greater SI-SDR indicates better quality, we take the negative value of SI-SDR as the loss function:

L_t = −SI-SDR(ŝ, s)    (7)

where

SI-SDR(ŝ, s) = 10 log₁₀(||s_target||² / ||e_noise||²),  s_target = (⟨ŝ, s⟩ / ||s||²) · s,  e_noise = ŝ − s_target    (8)

SI-SDR is expressed in decibels (dB) and is defined in the range (−∞, +∞); so is L_t.
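Equations 7 and 8 translate directly into NumPy for single-channel signals of equal length:

```python
import numpy as np

def si_sdr(s_hat, s, eps=1e-8):
    # Eq. 8: project s_hat onto s to obtain the target component;
    # the residual is treated as distortion ("noise").
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_hat - s_target
    return 10 * np.log10(np.dot(s_target, s_target) /
                         (np.dot(e_noise, e_noise) + eps))

def time_domain_loss(s_hat, s):
    # Eq. 7: a greater SI-SDR means better quality, so negate it for a loss.
    return -si_sdr(s_hat, s)
```

Note that rescaling s_hat does not change the value, which is what makes the measure scale-invariant.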
Table 1: Comparison of the systems. "NA" indicates that no waveform is generated during training.

                        Training Phase                                   Run-time Inference
          System        Loss Function  Waveform Generation               Waveform Generation
Baseline  Tacotron-GL   L_f            NA                                Griffin-Lim [griffin1984signal]
          Tacotron-WN   L_f            NA                                WaveNet vocoder [hayashi2017investigation]
Proposed  WaveTTS-GL    L_f + α·L_t    Griffin-Lim [griffin1984signal]   Griffin-Lim [griffin1984signal]
          WaveTTS-WN    L_f + α·L_t    Griffin-Lim [griffin1984signal]   WaveNet vocoder [hayashi2017investigation]
4 Experiments
We report TTS experiments on the LJSpeech database (https://keithito.com/LJ-Speech-Dataset/), which consists of 13,100 short clips, nearly 24 hours of speech in total, from a single speaker reading passages from 7 non-fiction books. We develop four systems for a comparative study:
- Tacotron-GL: the Tacotron-based baseline model [shen2018natural], with only the frequency-domain loss function. The Griffin-Lim algorithm is used to generate the waveform at run-time.
- Tacotron-WN: the Tacotron-based baseline model [shen2018natural], with only the frequency-domain loss function. A pre-trained WaveNet vocoder is used to generate the waveform at run-time.
- WaveTTS-GL: the proposed WaveTTS model, trained with the joint time-frequency domain loss. The Griffin-Lim algorithm is used during both training and run-time.
- WaveTTS-WN: the proposed WaveTTS model, trained with the joint time-frequency domain loss. The Griffin-Lim algorithm is used during training, and the pre-trained WaveNet vocoder is used to synthesize speech at run-time.
We also compare these systems with the ground-truth speech, denoted as GT. The systems are summarized in Table 1.
4.1 Experimental Setup
The 80-channel mel-spectrum is extracted with a 12.5 ms frame shift and a 50 ms frame length, and is normalized to zero mean and unit variance as the reference target. The decoder predicts one non-overlapping output frame at each decoding step. We use the Adam optimizer with β1 = 0.9 and β2 = 0.999, and a learning rate that decays exponentially after 50k iterations. We also apply L2 regularization. The hyperparameter α in Equation 4 is set empirically. All models are trained with a batch size of 32 for 100k steps. At run-time, Tacotron-GL and WaveTTS-GL use the Griffin-Lim algorithm with 64 iterations, while Tacotron-WN and WaveTTS-WN use the pre-trained WaveNet vocoder.
4.2 Subjective Evaluation
We conduct listening experiments to evaluate the quality of the synthesized speech. We first evaluate the sound quality in terms of the mean opinion score (MOS) for GT, Tacotron-GL, Tacotron-WN and the proposed WaveTTS-GL and WaveTTS-WN, as reported in Figure 3. Listeners rate the quality on a 5-point scale: "5" for excellent, "4" for good, "3" for fair, "2" for poor, and "1" for bad. The MOS values reported in Figure 3 are the arithmetic average of all scores assigned by the subjects who passed the validation question test. We keep the linguistic content the same across the different models so as to exclude other interference factors. 15 subjects participated in these experiments, and each of them listened to 120 synthesized speech samples. We make three observations:
- The importance of the joint time-frequency domain loss: We compare Tacotron-GL and WaveTTS-GL to observe the effect of the joint time-frequency domain loss. We believe this is a fair comparison, as both frameworks use the Griffin-Lim algorithm for waveform generation during training and/or run-time. As can be seen in Figure 3, WaveTTS-GL outperforms Tacotron-GL, achieving a MOS of 3.30, while Tacotron-GL achieves only 3.18.
- The performance of WaveTTS with a neural vocoder at run-time: We compare Tacotron-WN and WaveTTS-WN to investigate how well the predicted mel-spectrum features perform with the WaveNet vocoder. We observe that even though WaveTTS is trained with the Griffin-Lim algorithm, it performs better than Tacotron when the WaveNet vocoder is available at run-time. This shows how well the proposed WaveTTS performs with other neural vocoders.
- Griffin-Lim vs. WaveNet vocoder at run-time: We compare WaveTTS-GL and WaveTTS-WN in terms of voice quality. Both frameworks are trained under the same conditions; however, WaveTTS-WN uses the WaveNet vocoder for waveform generation at run-time. As expected, WaveTTS-WN outperforms WaveTTS-GL.
We also conduct A/B preference tests to assess the speech quality of the proposed frameworks. In the A/B preference tests, listeners are asked to compare the quality and naturalness of synthesized speech samples from different systems and select the better one. 15 listeners were invited to participate in all tests, and 80 samples were randomly selected from the 200 converted samples of each system. Figure 4 shows the speech quality test results, which suggest that our proposed WaveTTS framework outperforms the baseline system for both Griffin-Lim and WaveNet vocoder settings at run-time.
We further conduct another A/B preference test to examine the effect of the number of Griffin-Lim iterations on WaveTTS performance. To calculate the time-domain loss, WaveTTS needs to generate the synthesized waveform during training. For rapid turnaround, we apply only 1 or 2 Griffin-Lim iterations for phase reconstruction during training, and investigate the effect on voice quality. Figure 5 shows the A/B preference test results for both WaveTTS-GL and WaveTTS-WN. We observe that a single Griffin-Lim iteration yields better performance than 2 iterations.
5 Conclusion
In this paper, we propose a new Tacotron implementation, called WaveTTS. Traditional TTS frameworks calculate only a frequency-domain loss to update the network parameters, which does not directly control the quality of the generated time-domain waveform. The proposed WaveTTS is unique in the sense that it calculates both time-domain and frequency-domain losses, optimizing the model to generate high-quality synthesized voice. We propose to use the scale-invariant signal-to-distortion ratio (SI-SDR) as the time-domain loss function. Even though the proposed model is trained with the Griffin-Lim algorithm for time-domain loss calculation, it performs remarkably well with both Griffin-Lim and the WaveNet vocoder at run-time. Experimental results show that the proposed framework outperforms the baselines and achieves high-quality synthesized speech. To our best knowledge, this is the first implementation of a Tacotron-based TTS model with a joint time-frequency domain loss.
As future work, we will investigate training with the joint time-frequency domain loss and a neural vocoder for high-quality TTS.
6 Acknowledgements
This research was supported by the National Natural Science Foundation of China (No. 61563040, No. 61773224) and the Natural Science Foundation of Inner Mongolia (No. 2018MS06006, No. 2016ZD06).