In recent years, Text-to-Speech (TTS) technology has advanced by leaps and bounds, from the concatenative approach [hunt1996unit, ze2013statistical, merritt2016deep] to the statistical parametric approach [zen2009statistical, tokuda2013speech, zen2015unidirectional, wu2016investigating], and on to deep learning [wang2017tacotron]. These approaches produce high-quality, natural speech that rivals human vocal production [taylor2009text, zen2009statistical]. TTS is also widely used in human-machine communication, such as robotics, call centres, games, entertainment, and healthcare applications.
With the advent of deep learning, neural approaches to TTS have become mainstream, such as Tacotron [wang2017tacotron], Tacotron2 [shen2018natural] and its variants [skerry2018towards, hsu2018hierarchical, habib2019semi, liu2017mongolian, liu2019teacher]. The end-to-end TTS model is based on the encoder-decoder framework, which has been widely adopted for sequence generation tasks, such as speech recognition [graves2014towards, bahdanau2016end, amodei2016deep, chen2019end] and machine translation [bahdanau2014neural, johnson2017google]. Tacotron-based TTS typically consists of two modules: 1) feature prediction, and 2) waveform generation. The main task of the feature prediction network is to obtain frequency-domain acoustic features, while the waveform generation module converts the frequency-domain acoustic features into a time-domain waveform.
A typical Tacotron implementation adopts the Griffin-Lim algorithm [griffin1984signal, masuyama2019deep] for phase reconstruction, and is trained only with a loss function derived from the amplitude spectrogram in the frequency domain. Such a loss function does not take the resulting waveform into consideration in the optimization process. As a result, there is a mismatch between the Tacotron optimization and the expected waveform. We note that such a mismatch also exists in many other speech processing tasks, such as speech separation [wang2018supervised], where incorporating a time-domain loss function [wang2015deep] has been observed to improve the output speech quality. More recently, deep learning approaches to speech enhancement with time-domain raw waveform outputs [fu2017raw, liu2019multichannel] have also been investigated. However, the time-domain loss function has not been well explored in speech synthesis, which is the focus of this paper.
Tacotron2 [shen2018natural] has been proposed to achieve high-quality synthesized voice. It addresses the waveform optimization problem by using a WaveNet-based neural vocoder [oord2016wavenet, berrak-journal, sisman2018adaptive, berrak_is18, hayashi2017investigation]. WaveNet avoids the artifacts and deterioration caused by deterministic vocoders. It generates time-domain waveform samples conditioned on the predicted mel-spectrum features. Although Tacotron2 allows end-to-end learning of TTS directly from character sequences and speech waveforms, its feature prediction network is trained independently of the WaveNet vocoder. At run-time, the feature prediction network and the WaveNet vocoder are artificially joined together. As a result, the framework suffers from the mismatch between frequency-domain acoustic features and the time-domain waveform. It is reported that the samples generated by WaveNet occasionally become unstable, especially when less accurately predicted acoustic features are used as the local condition parameters. To overcome this mismatch, we propose a joint time-frequency domain loss for TTS that improves the synthesized voice quality.
In this paper, we propose to add a time-domain loss function to the Griffin-Lim/ISTFT output of the Tacotron-based TTS model at training time. In other words, we use both a frequency-domain loss and a time-domain loss to train the feature prediction model. We hypothesize that, under the supervision of the time-domain loss, the feature prediction network will compensate for the artifacts that the Griffin-Lim process may introduce. We use Griffin-Lim iterations followed by ISTFT to transform the frequency-domain features into a time-domain waveform, and use the scale-invariant signal-to-distortion ratio (SI-SDR) [le2019sdr, kolbaek2019loss] to measure the quality of the time-domain waveform. Our proposed idea shares a similar motivation with [zhao2018wasserstein] in its use of a waveform loss. However, it differs from [zhao2018wasserstein] in many ways. For example, we study Tacotron-based TTS, while [zhao2018wasserstein] mostly deals with Wasserstein GAN-based TTS.
The main contributions of this paper are: 1) we study the use of a time-domain loss for speech synthesis; 2) we improve the Tacotron-based TTS framework by proposing a new training scheme based on a joint time-frequency domain loss; and 3) we propose to use the SI-SDR metric to measure the distortion of the time-domain waveform. The novel training scheme optimizes the frequency-domain acoustic features in a way that leads to a better time-domain waveform. To the best of our knowledge, this is the first implementation of a joint training scheme on the frequency and time domains for the Tacotron-based TTS framework.
This paper is organized as follows: In Section 2, we present the Tacotron-based baseline TTS system. In Section 3, we present the novel idea of joint time-frequency domain loss, and formulate the training and run-time processes. We report the experimental results in Section 4. Section 5 concludes the study.
2 Baseline: Tacotron-based TTS
In this paper, we use a Tacotron-based framework [shen2018natural] as the reference baseline. We illustrate the overall architecture of the reference baseline in Figure 1. It includes a feature prediction model, which contains an encoder and an attention-based decoder, and the Griffin-Lim algorithm for waveform reconstruction. The encoder (blue box in Figure 1) consists of two components: a CNN-based module that has 3 convolution layers, and an LSTM-based module that has a bidirectional LSTM layer. The decoder (pale yellow box in Figure 1) consists of four components: a 2-layer pre-net, 2 LSTM layers, a linear projection layer and a 5-convolution-layer post-net. The decoder is a standard autoregressive recurrent neural network that generates the mel-spectrum features and stop tokens frame by frame.
During training, we optimize the feature prediction model to minimize the frequency-domain loss between the generated mel-spectrum features ($\hat{y}$) and the target mel-spectrum features ($y$).
In this section, we study the use of a newly proposed time-domain loss function for Tacotron-based TTS. By applying a new training strategy that takes into account both time-domain and frequency-domain loss functions, we reduce the mismatch between the frequency-domain features and the time-domain waveform, and improve the output speech quality. The Griffin-Lim algorithm and the SI-SDR metric are used to compute the proposed loss term. The proposed framework is referred to as WaveTTS hereafter.
3.1 Time-domain and Frequency-domain Loss Functions
In WaveTTS, we define two objective functions during training: 1) the frequency-domain loss, denoted as $\mathrm{Loss}_f$, which is calculated from the mel-spectrum features in a way similar to that described in [shen2018natural]; and 2) the proposed time-domain loss, denoted as $\mathrm{Loss}_t$, which is obtained at the waveform level at the output of the Griffin-Lim iteration that estimates the time-domain signal from the mel-spectrum features. The two objective functions are illustrated in Figure 2.
The entire process can be formulated as follows. The encoder takes the character sequence $x$ as input and converts the one-hot vectors to a continuous feature representation $h$:
$$h = \mathrm{Encoder}(x) \tag{1}$$
The decoder outputs a mel-spectrum feature $\hat{y}_t$ at each step $t$:
$$\hat{y}_t = \mathrm{Decoder}(\hat{y}_{t-1}, \mathrm{Attention}(h)) \tag{2}$$
where $\mathrm{Attention}(\cdot)$ represents a function that calculates the context vector using the location-sensitive attention mechanism [vaswani2017attention].
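We first calculate $\mathrm{Loss}_f$ in a way similar to that in [shen2018natural]. $\mathrm{Loss}_f$ ensures that the generated mel-spectrum is close to the natural mel-spectrum. It is given as follows,
$$\mathrm{Loss}_f = \sum_{n=1}^{N} L_2\big(y^{(n)}, \hat{y}^{(n)}\big) \tag{3}$$
where $N$ is the total number of sequences in the training data, and $y^{(n)}$ and $\hat{y}^{(n)}$ are the target and generated mel-spectrum of the $n$-th sequence. The $L_2$ loss function minimizes the error, which is the sum of all the squared differences between the true value and the predicted value.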
We then propose the use of a time-domain loss $\mathrm{Loss}_t$, which is applied to the Griffin-Lim output. This is to reduce the mismatch between the optimized frequency-domain output and the actual time-domain waveform [zhao2018wasserstein]. The implementation details of $\mathrm{Loss}_t$ are explained in Section 3.2.
Overall, the proposed WaveTTS framework has two loss functions. $\mathrm{Loss}_f$ minimizes the loss between the predicted and original mel-spectrum, while $\mathrm{Loss}_t$ minimizes this loss at the waveform level. We add a weighting coefficient $\lambda$ to balance the two losses. The training criterion of the whole model is defined as:
$$\mathrm{Loss} = \mathrm{Loss}_f + \lambda \, \mathrm{Loss}_t \tag{4}$$
Algorithm 1 shows the complete training process of the proposed WaveTTS. The WaveTTS model predicts the mel-spectrum features $\hat{y}$ from the given input character sequence $x$, and then converts the estimated and target mel-spectrum to the time-domain signals $\hat{y}_w$ and $y_w$ using the Griffin-Lim based ISTFT algorithm (blue content in Algorithm 1). Finally, the joint loss function given in Equation 4 is used to optimize the WaveTTS model.
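As a minimal sketch of how the two objectives are combined, the NumPy snippet below computes a sum-of-squared-differences frequency-domain loss and adds a weighted time-domain term. The function names, the toy arrays and the weight value are illustrative assumptions, not the authors' implementation. The time-domain term is passed in as a precomputed number, and an actual training setup would use an automatic differentiation framework rather than plain NumPy.

```python
import numpy as np


def frequency_domain_loss(mel_pred, mel_target):
    """Sum of squared differences between predicted and target mel-spectra."""
    diff = np.asarray(mel_target, dtype=np.float64) - np.asarray(mel_pred, dtype=np.float64)
    return float(np.sum(diff ** 2))


def joint_loss(loss_f, loss_t, lam):
    """Joint criterion: frequency-domain loss plus weighted time-domain loss."""
    return loss_f + lam * loss_t


# Toy example with made-up values (2 frames, 3 mel channels).
mel_pred = np.zeros((2, 3))
mel_target = np.ones((2, 3))
loss_f = frequency_domain_loss(mel_pred, mel_target)

# loss_t would be the negative SI-SDR of the Griffin-Lim waveforms;
# both it and the weight below are placeholder values for illustration.
total = joint_loss(loss_f, loss_t=-10.0, lam=0.5)
```

Keeping the weight as an explicit argument makes it easy to check how strongly the waveform-level term influences the total.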
3.2 Implementation of time-domain Loss
3.2.1 Time-domain loss
We adopt the Griffin-Lim algorithm [griffin1984signal], followed by ISTFT, to generate the time-domain waveform. The Griffin-Lim algorithm has been widely used in speech synthesis [shen2018natural, tjandra2019vqvae] for its simplicity. The process can be formulated as follows:
$$\hat{y}_w = \mathrm{ISTFT}\big(\mathrm{GriffinLim}(\mathrm{Abs}(\hat{y}))\big)$$
where $\hat{y}$ represents the predicted mel-spectrum sequences, and $\mathrm{Abs}(\hat{y})$ represents their amplitude. $\mathrm{Abs}(\cdot)$ is the function that calculates the amplitude of the given input mel-spectrum sequences. It is followed by the Griffin-Lim algorithm, which estimates a complex-valued spectrum while minimizing the change to the input amplitude. ISTFT transforms the estimated complex-valued spectrum into the time-domain signal $\hat{y}_w$. The details of the Griffin-Lim algorithm are given in Algorithm 2, where $P_S$ is the metric projection onto a set $S$. Here, $C$ is the set of consistent spectra, and $A$ is the set of spectra whose amplitude is the same as the given one.
It is worth mentioning that the Griffin-Lim algorithm usually requires many iterations at run-time, as shown in Algorithm 2, to obtain a high-quality audio signal. It is an optimization process independent of Tacotron training. We would like the Tacotron feature prediction network to generate acoustic features that are not only close to those of natural speech in the frequency domain, but that also allow Griffin-Lim to produce speech that is close to natural speech in the time domain.
Let us denote the predicted and the original mel-spectrum as ($\hat{y}$, $y$). We apply Griffin-Lim and ISTFT to generate ($\hat{y}_w$, $y_w$), with $\hat{y} \rightarrow \hat{y}_w$ and $y \rightarrow y_w$. We keep the same number of Griffin-Lim iterations to ensure that Griffin-Lim behaves the same for the two transformation pairs, $\hat{y} \rightarrow \hat{y}_w$ and $y \rightarrow y_w$. We measure the distortion between $\hat{y}_w$ and $y_w$ with a time-domain loss, which forces the speech waveform generated from the predicted mel-spectrum to be as close as possible to that generated from the mel-spectrum of natural speech.
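The fixed-iteration phase reconstruction described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code. It operates on a linear-frequency magnitude spectrogram (the mel-to-linear conversion is omitted), uses a Hann window with hypothetical `n_fft` and `hop` values, and initialises the phase randomly from a fixed seed. Each iteration applies the consistency projection (ISTFT followed by STFT), then the amplitude projection (restoring the given magnitude).

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse STFT by windowed overlap-add."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        signal[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    return signal / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter, n_fft=256, hop=64, seed=0):
    """Estimate a waveform from a magnitude spectrogram with n_iter iterations."""
    rng = np.random.default_rng(seed)
    # Start from the given amplitude with a random phase.
    spec = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Projection onto consistent spectra: go to the time domain and back.
        consistent = stft(istft(spec, n_fft, hop), n_fft, hop)
        # Projection onto spectra with the given amplitude: keep only the phase.
        spec = magnitude * np.exp(1j * np.angle(consistent))
    return istft(spec, n_fft, hop)
```

Calling `griffin_lim` with the same `n_iter` and the same `seed` on the predicted and the target magnitudes mirrors the requirement that both transformation pairs are processed identically.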
3.2.2 Scale-Invariant Signal-to-Distortion (SI-SDR)
In speech synthesis, we optimize the feature prediction network to minimize the discrepancy between the synthesized waveform and the target natural speech, under the supervision of a loss function. We propose a time-domain loss function, $\mathrm{Loss}_t$, that is based on the scale-invariant signal-to-distortion ratio (SI-SDR). SI-SDR has been introduced as a time-domain objective measure in source separation [luo2018tasnet, bahmaninezhad2019comprehensive, venkataramani2018performance] to compare two time-domain speech signals. We adopt SI-SDR to measure the discrepancy between the generated waveform and the target natural speech. To the best of our knowledge, this is the first use of SI-SDR for time-domain loss calculation to improve TTS quality.
We note that SI-SDR is evaluated only during training, and is not required at run-time inference. During training, the predicted time-domain waveform and the target speech have identical duration. Similarly, the predicted mel-spectrum and the target mel-spectrum share the same number of frames, which facilitates the SI-SDR calculation. As a greater SI-SDR indicates better quality, we take the negative value of SI-SDR as the loss function,
$$\mathrm{Loss}_t = -\text{SI-SDR}(\hat{y}_w, y_w)$$
SI-SDR is expressed in decibels (dB) and is defined in the range $(-\infty, +\infty)$, and so is $\mathrm{Loss}_t$.
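The negative SI-SDR loss can be sketched in NumPy as below, following the commonly used definition in which the target is rescaled by its optimal projection coefficient before the distortion is measured. The function names and the small `eps` stabiliser are assumptions made for illustration. Mean removal, which some implementations apply first, is omitted here.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for equal-length 1-D signals."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Scale the target so that it best explains the estimate.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled_target = alpha * target
    distortion = estimate - scaled_target
    ratio = (np.sum(scaled_target ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)


def time_domain_loss(estimate, target):
    """Negative SI-SDR, so that a higher-quality estimate gives a lower loss."""
    return -si_sdr(estimate, target)
```

Because the target is rescaled, multiplying the estimate by a constant gain leaves the value essentially unchanged, which is the scale-invariance property the metric is named for.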
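Table 1: Summary of the systems compared in the experiments.
| Category | System | Training Phase: Loss Function | Training Phase: Waveform Generation | Run-time Inference: Waveform Generation |
| --- | --- | --- | --- | --- |
| Baseline | Tacotron-GL | Frequency-domain | NA | Griffin-Lim [griffin1984signal] |
| Baseline | Tacotron-WN | Frequency-domain | NA | WaveNet vocoder [hayashi2017investigation] |
| Proposed | WaveTTS-GL | Joint time-frequency domain | Griffin-Lim [griffin1984signal] | Griffin-Lim [griffin1984signal] |
| Proposed | WaveTTS-WN | Joint time-frequency domain | Griffin-Lim [griffin1984signal] | WaveNet vocoder [hayashi2017investigation] |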
We report the TTS experiments on the LJSpeech database (https://keithito.com/LJ-Speech-Dataset/), which consists of 13,100 short clips with a total of nearly 24 hours of speech from a single speaker reading passages from 7 non-fiction books. We develop four systems for a comparative study:
Tacotron-GL: Tacotron-based baseline model [shen2018natural], that has only frequency-domain loss function. Griffin-Lim algorithm is used to generate the waveform at run-time.
Tacotron-WN: Tacotron-based baseline model [shen2018natural], that has only frequency-domain loss function. Pre-trained WaveNet vocoder is used to generate the waveform at run-time.
WaveTTS-GL: proposed WaveTTS model is trained with joint time-frequency domain loss. Griffin-Lim algorithm is used during training and run-time phases.
WaveTTS-WN: proposed WaveTTS model is trained with joint time-frequency domain loss. Griffin-Lim algorithm is used during training and the pre-trained WaveNet vocoder is used to synthesize speech at run-time.
We also compare these systems with the ground truth speech, denoted as GT. The comparison of the systems is also summarized in Table 1.
4.1 Experimental Setup
The 80-channel mel-spectrum is extracted with a 12.5 ms frame shift and a 50 ms frame length. It is normalized to zero mean and unit variance to serve as the reference target. The decoder predicts only one non-overlapping output frame at each decoding step. We use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and a learning rate that decays exponentially starting from 50k iterations. We also apply $L_2$ regularization. The hyper-parameter $\lambda$ in Equation 4 is set empirically. All models are trained with a batch size of 32. The final models are trained for 100k steps for all systems. At run-time, Tacotron-GL and WaveTTS-GL use the Griffin-Lim algorithm with 64 iterations, while Tacotron-WN and WaveTTS-WN use a pre-trained WaveNet vocoder.
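To make the framing parameters concrete, the short sketch below converts the 12.5 ms frame shift and 50 ms frame length into sample counts and counts the full frames in a signal. The sampling rate is an assumption: 22,050 Hz is the native rate of the LJSpeech recordings, but the rate used in these experiments is not stated here. The helper names are hypothetical.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate / 1000.0))


def num_frames(num_samples, frame_length, frame_shift):
    """Number of full frames that fit in a signal when no padding is applied."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


SAMPLE_RATE = 22050  # assumed sampling rate (LJSpeech native rate)
frame_shift = ms_to_samples(12.5, SAMPLE_RATE)
frame_length = ms_to_samples(50.0, SAMPLE_RATE)
frames_per_clip = num_frames(5 * SAMPLE_RATE, frame_length, frame_shift)  # a 5-second clip
```

With a 12.5 ms shift the decoder has to emit 80 frames per second of speech, which is why the predicted and target mel-spectra must share the same frame count for the waveform-level comparison.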
4.2 Subjective Evaluation
We conduct listening experiments to evaluate the quality of the synthesized speech. We first evaluate the sound quality of the synthesized speech in terms of mean opinion score (MOS) for GT, Tacotron-GL, Tacotron-WN and the proposed WaveTTS-GL and WaveTTS-WN. The results are reported in Figure 3.
The listeners rate the quality on a 5-point scale: “5” for excellent, “4” for good, “3” for fair, “2” for poor, and “1” for bad. The MOS values reported in Figure 3 are calculated by taking the arithmetic average of all scores assigned by the subjects who passed the validation question test. We keep the linguistic content the same across the different models so as to exclude other interfering factors. 15 subjects participated in these experiments, and each of them listened to 120 synthesized speech samples. We have three observations from the experiments:
The importance of joint time-frequency domain loss: We compare Tacotron-GL and WaveTTS-GL to observe the effect of joint time-frequency domain loss. We believe that this is a fair comparison as both frameworks use Griffin-Lim algorithm for waveform generation during training and/or run-time. As can be seen in Figure 3, WaveTTS-GL outperforms Tacotron-GL by achieving 3.30 MOS value, while Tacotron-GL achieves only 3.18.
The performance of WaveTTS with a neural vocoder at run-time: We compare Tacotron-WN and WaveTTS-WN to investigate how well the predicted mel-spectrum features perform with the WaveNet vocoder. We observe that even though WaveTTS is trained with the Griffin-Lim algorithm, it performs better than Tacotron when the WaveNet vocoder is used at run-time. This suggests that the benefit of the joint time-frequency domain loss carries over to a neural vocoder at run-time.
Griffin-Lim vs WaveNet vocoder at run-time: We compare WaveTTS-GL and WaveTTS-WN in terms of voice quality. We note that both frameworks are trained under the same conditions. However, WaveTTS-WN uses WaveNet vocoder for waveform generation at run-time. As expected, WaveTTS-WN outperforms WaveTTS-GL.
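The MOS calculation used in this evaluation, an arithmetic average over the scores of the subjects who passed the validation question test, can be sketched as follows. The data layout, the function name and the ratings are made up for illustration and are not the scores collected in the listening test.

```python
def mean_opinion_score(ratings, passed_validation):
    """Average all 1-5 scores from subjects who passed the validation test.

    ratings: dict mapping a subject id to that subject's list of scores.
    passed_validation: dict mapping a subject id to True or False.
    """
    kept = [
        score
        for subject, scores in ratings.items()
        if passed_validation.get(subject, False)
        for score in scores
    ]
    if not kept:
        raise ValueError("no scores from validated subjects")
    return sum(kept) / len(kept)


# Hypothetical ratings for one system from three subjects.
ratings = {"s1": [4, 5], "s2": [1, 1], "s3": [3]}
passed = {"s1": True, "s2": False, "s3": True}
mos = mean_opinion_score(ratings, passed)
```

Subject `s2` fails the validation test in this toy example, so only the scores from `s1` and `s3` enter the average.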
We also conduct A/B preference tests to assess speech quality of proposed frameworks. In A/B preference tests, the listeners are asked to compare the quality and naturalness of the synthesized speech samples from different systems, and select the better one. 15 listeners were invited to participate in all the tests. 80 samples were randomly selected from 200 converted samples from each system. Figure 4 shows the speech quality test results, which suggests that our proposed WaveTTS framework outperforms the baseline system for both Griffin-Lim and WaveNet vocoder settings at run-time.
We further conduct another A/B preference test to examine the effect of the number of Griffin-Lim iterations on the WaveTTS performance. To calculate the time-domain loss, WaveTTS needs to generate the synthesized waveform during training. For rapid turn-around, we only apply 1 and 2 Griffin-Lim iterations for phase reconstruction, and investigate the effect in terms of voice quality. Figure 5 shows A/B preference test results on both WaveTTS-GL and WaveTTS-WN. We observe that the single iteration of Griffin-Lim algorithm presents a better performance than 2 iterations.
In this paper, we propose a new Tacotron implementation, called WaveTTS. Traditional TTS frameworks calculate only a frequency-domain loss to update the network parameters, which does not directly control the quality of the generated time-domain waveform. The proposed WaveTTS is unique in the sense that it calculates both time-domain and frequency-domain losses, and optimizes the model to generate high-quality synthesized voice. We propose to use the scale-invariant signal-to-distortion ratio (SI-SDR) as the time-domain loss function. Even though the proposed model is trained with the Griffin-Lim algorithm for time-domain loss calculation, it performs well with both the Griffin-Lim and WaveNet vocoders at run-time. Experimental results show that the proposed framework outperforms the baselines in the listening tests. To the best of our knowledge, this is the first implementation of a Tacotron-based TTS model with a joint time-frequency domain loss.
As a future work, we will investigate the training phase of joint time-frequency domain loss with a neural vocoder for high-quality TTS.
This research was supported by the National Natural Science Foundation of China (No.61563040, No.61773224) and the Natural Science Foundation of Inner Mongolia (No.2018MS06006, No.2016ZD06).