1 Introduction
In recent years, Text-to-Speech (TTS) technology has advanced by leaps and bounds, from the concatenative approach [hunt1996unit, ze2013statistical, merritt2016deep] to the statistical parametric approach [zen2009statistical, tokuda2013speech, zen2015unidirectional, wu2016investigating], and on to deep learning [wang2017tacotron]. Modern TTS systems produce high-quality, natural speech that rivals human vocal production [taylor2009text, zen2009statistical]. TTS is also widely used in human-machine communication, such as robotics, call centres, games, entertainment, and healthcare applications. With the advent of deep learning, neural approaches to TTS have become mainstream, such as Tacotron [wang2017tacotron], Tacotron2 [shen2018natural] and their variants [skerry2018towards, hsu2018hierarchical, habib2019semi, liu2017mongolian, liu2019teacher]. End-to-end TTS models are based on the encoder-decoder framework, which has been widely adopted for sequence generation tasks such as speech recognition [graves2014towards, bahdanau2016end, amodei2016deep, chen2019end] and neural machine translation [bahdanau2014neural, johnson2017google]. Tacotron-based TTS typically consists of two modules: 1) feature prediction, and 2) waveform generation. The feature prediction network produces frequency-domain acoustic features, while the waveform generation module converts those frequency-domain acoustic features into a time-domain waveform.
A typical Tacotron implementation adopts the Griffin-Lim algorithm [griffin1984signal, masuyama2019deep] for phase reconstruction, and only uses a loss function derived from the amplitude spectrogram in the frequency domain. Such a loss function does not take the resulting waveform into consideration during optimization. As a result, there is a mismatch between the Tacotron optimization objective and the expected waveform. We note that a similar mismatch also exists in many other speech processing tasks, such as speech separation [wang2018supervised], where incorporating a time-domain loss function [wang2015deep] improves the output speech quality. More recently, deep learning approaches to speech enhancement with time-domain raw waveform outputs [fu2017raw, liu2019multichannel] have also been investigated. However, a time-domain loss function has not been well explored in speech synthesis, which is the focus of this paper.
Tacotron2 [shen2018natural] has been proposed to achieve high-quality synthesized voice. It addresses the waveform optimization problem by using a WaveNet-based neural vocoder [oord2016wavenet, berrakjournal, sisman2018adaptive, berrak_is18, hayashi2017investigation]. WaveNet avoids the artifacts and deterioration caused by deterministic vocoders; it generates time-domain waveform samples conditioned on the predicted mel-spectrum features. Although Tacotron2 allows end-to-end learning of TTS directly from character sequences and speech waveforms, its feature prediction network is trained independently of the WaveNet vocoder. At run-time, the feature prediction network and the WaveNet vocoder are artificially joined together. As a result, the framework suffers from a mismatch between frequency-domain acoustic features and the time-domain waveform. It is reported that samples generated from WaveNet occasionally become unstable, especially when less accurately predicted acoustic features are used as the local conditioning parameters. To overcome this mismatch, we propose a joint time-frequency domain loss for TTS that effectively improves the synthesized voice quality.
In this paper, we propose to add a time-domain loss function to the Griffin-Lim/ISTFT output of the Tacotron-based TTS model at training time. In other words, we use both a frequency-domain loss and a time-domain loss for the training of the feature prediction model. We hypothesize that, under the supervision of the time-domain loss, the feature prediction network will compensate for the possible artifacts that the Griffin-Lim process may introduce. We use Griffin-Lim iterations followed by ISTFT to transform frequency-domain features to a time-domain waveform, and use the scale-invariant signal-to-distortion ratio (SI-SDR) [le2019sdr, kolbaek2019loss] to measure the quality of the time-domain waveform. Our proposed idea shares a similar motivation with [zhao2018wasserstein] in terms of the use of a waveform loss. However, it differs from [zhao2018wasserstein] in many ways; for example, we study Tacotron-based TTS, while [zhao2018wasserstein] mostly deals with Wasserstein GAN-based TTS.
The main contributions of this paper include: 1) we study the use of a time-domain loss for speech synthesis; 2) we improve the Tacotron-based TTS framework by proposing a new training scheme based on a joint time-frequency domain loss; and 3) we propose to use the SI-SDR metric to measure the distortion of the time-domain waveform. The novel training scheme optimizes the frequency-domain acoustic features in a way that leads to a better time-domain waveform. To our best knowledge, this is the first implementation of a joint time-frequency domain training scheme for the Tacotron-based TTS framework.
This paper is organized as follows: In Section 2, we present the Tacotron-based baseline TTS system. In Section 3, we present the novel idea of the joint time-frequency domain loss, and formulate the training and run-time processes. We report the experimental results in Section 4. Section 5 concludes the study.
2 Baseline: Tacotron-based TTS
In this paper, we use a Tacotron-based framework [shen2018natural] as the reference baseline. We illustrate the overall architecture in Figure 1, which includes the feature prediction model, consisting of an encoder and an attention-based decoder, and the Griffin-Lim algorithm for waveform reconstruction. The encoder (blue box in Figure 1) consists of two components: a CNN-based module with 3 convolution layers, and an LSTM-based module with a bidirectional LSTM layer. The decoder (pale yellow box in Figure 1) consists of four components: a 2-layer pre-net, 2 LSTM layers, a linear projection layer and a 5-convolution-layer post-net. The decoder is a standard autoregressive recurrent neural network that generates the mel-spectrum features and stop tokens frame by frame.
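The frame-by-frame generation described above can be sketched as a generic autoregressive loop. The step_fn and stop_fn callables below are hypothetical stand-ins for the real pre-net/LSTM/projection stack and the stop-token predictor:

```python
def decode_loop(step_fn, stop_fn, go_frame, max_steps=1000):
    """Autoregressive decoding: each output frame is fed back as the
    input for the next step, until the stop token fires (or a cap is hit).

    step_fn: previous frame -> next mel frame (pre-net -> LSTMs ->
             linear projection in the real decoder).
    stop_fn: frame -> bool, a stand-in for the stop-token prediction.
    """
    frames, frame = [], go_frame
    for _ in range(max_steps):
        frame = step_fn(frame)
        frames.append(frame)
        if stop_fn(frame):  # end of utterance
            break
    return frames
```

In the real model, the post-net then refines the whole predicted mel-spectrum sequence at once.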
During training, we optimize the feature prediction model to minimize the frequency-domain loss between the generated mel-spectrum features Ŷ and the target mel-spectrum features Y.
3 WaveTTS
In this section, we study the use of the newly proposed time-domain loss function for Tacotron-based TTS. By applying a new training strategy that takes into account both time-domain and frequency-domain loss functions, we effectively reduce the mismatch between the frequency-domain features and the time-domain waveform, and improve the output speech quality. The Griffin-Lim algorithm and the SI-SDR metric are used to compute the proposed loss term. The proposed framework is called WaveTTS hereafter.
3.1 Time-domain and Frequency-domain Loss Functions
In WaveTTS, we define two objective functions during training: 1) the frequency-domain loss, denoted as L_f, which is calculated on the mel-spectrum features in a similar way to that described in [shen2018natural]; and 2) the proposed time-domain loss, denoted as L_t, which is obtained at the waveform level at the output of the Griffin-Lim iteration that estimates the time-domain signal from the mel-spectrum features. The two objective functions are illustrated in Figure 2.
The entire process can be formulated as follows. The encoder takes the character sequence X as input and converts the one-hot vectors to a continuous feature representation H:

H = Encoder(X)    (1)
The decoder outputs a mel-spectrum feature ŷ_t at each step t:

ŷ_t = Decoder(ŷ_{t−1}, c_t),  c_t = Attention(H, ŷ_{t−1})    (2)

where Attention(·) represents the function that calculates the context vector using the location-sensitive attention mechanism [vaswani2017attention].
We first calculate L_f in a similar way to that in [shen2018natural]. L_f ensures that the generated mel-spectrum is close to the natural mel-spectrum, and is given as follows:

L_f = (1/M) Σ_{m=1}^{M} ||Y_m − Ŷ_m||²    (3)

where M is the total number of sequences in the training data. This mean squared error loss minimizes the sum of all the squared differences between the true and the predicted mel-spectrum values.
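As a minimal NumPy sketch (batching and padding masks omitted, and using a per-element mean rather than the per-sequence sum), the frequency-domain loss can be computed as:

```python
import numpy as np

def frequency_loss(y_hat, y):
    # Mean squared error between predicted and target mel-spectra
    # (Eq. 3, up to a constant normalization factor).
    # y_hat, y: arrays of shape (frames, mel_bins) with equal lengths.
    return np.mean((y - y_hat) ** 2)
```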
We then propose the use of a time-domain loss L_t, which is applied to the Griffin-Lim output. This reduces the mismatch between the optimized frequency-domain output and the actual time-domain waveform [zhao2018wasserstein]. The implementation details of L_t are explained in Section 3.2.
Overall, the proposed WaveTTS framework has two loss functions: L_f minimizes the loss between the converted and original mel-spectra, while L_t minimizes this loss at the waveform level. We add a weighting coefficient α to balance the two losses. The training criterion of the whole model is defined as:

L = L_f + α · L_t    (4)
Algorithm 1 shows the complete training process of the proposed WaveTTS. The WaveTTS model predicts the mel-spectrum features Ŷ from the given input character sequence X, and then converts both the estimated and target mel-spectra to time-domain signals ŝ and s using the Griffin-Lim based ISTFT algorithm (blue content in Algorithm 1). Finally, the joint loss function given in Equation 4 is used to optimize the WaveTTS model.
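The training criterion of Equation 4 is then a weighted sum of the two terms. In this sketch the time-domain term is a placeholder waveform MSE standing in for the negative SI-SDR of Section 3.2.2, and the value of α is illustrative only:

```python
import numpy as np

def joint_loss(y_hat, y, s_hat, s, alpha=0.01):
    # Eq. 4: total loss = frequency-domain loss + alpha * time-domain loss.
    l_f = np.mean((y - y_hat) ** 2)   # mel-spectrum loss (Eq. 3)
    l_t = np.mean((s - s_hat) ** 2)   # placeholder for -SI-SDR (Eq. 7)
    return l_f + alpha * l_t
```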
3.2 Implementation of the Time-domain Loss
3.2.1 Time-domain loss
We adopt the Griffin-Lim algorithm [griffin1984signal], followed by ISTFT, to generate the time-domain waveform. The Griffin-Lim algorithm has been widely used in speech synthesis [shen2018natural, tjandra2019vqvae] for its simplicity, and can be formulated as follows:

Â = f(Ŷ)    (5)

ŝ = ISTFT(GriffinLim(Â))    (6)

where Ŷ represents the predicted mel-spectrum sequences and Â represents their amplitude. f(·) is the function that calculates the amplitude of the given input mel-spectrum sequences. It is followed by the Griffin-Lim algorithm, which estimates a complex-valued spectrum while minimizing the change to the input amplitude Â; ISTFT then transforms the estimated complex-valued spectrum to a time-domain signal. The details of the Griffin-Lim algorithm are given in Algorithm 2, where P_C is the metric projection onto a set C. Here, C is the set of consistent spectra, and A is the set of spectra whose amplitude is the same as the given one.
It is worth mentioning that the Griffin-Lim algorithm usually requires many iterations at run-time to obtain a high-quality audio signal, as shown in Algorithm 2. It is an optimization process independent of Tacotron training. We would like the Tacotron feature prediction network to generate acoustic features that are not only close to those of natural speech in the frequency domain, but that also allow Griffin-Lim to produce speech close to natural speech in the time domain.
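A self-contained NumPy sketch of the Griffin-Lim loop on a linear amplitude spectrogram is given below. The window and hop sizes are illustrative, edge frames are not padded, and the real system works from mel-spectra with a matching STFT configuration:

```python
import numpy as np

N_FFT, HOP = 256, 64  # illustrative analysis settings

def stft(x):
    # Hann-windowed frames -> FFT; returns a (frames, bins) complex array.
    win = np.hanning(N_FFT)
    frames = [x[i:i + N_FFT] * win for i in range(0, len(x) - N_FFT + 1, HOP)]
    return np.fft.rfft(np.array(frames), axis=-1)

def istft(S):
    # Inverse FFT per frame, then windowed overlap-add with normalization.
    win = np.hanning(N_FFT)
    n = HOP * (S.shape[0] - 1) + N_FFT
    x, norm = np.zeros(n), np.zeros(n)
    frames = np.fft.irfft(S, n=N_FFT, axis=-1)
    for t in range(S.shape[0]):
        x[t * HOP:t * HOP + N_FFT] += frames[t] * win
        norm[t * HOP:t * HOP + N_FFT] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(A, n_iter=32, seed=0):
    # Alternate projections: onto the set of consistent spectra (spectra
    # that are the STFT of some real signal) and onto the set of spectra
    # whose amplitude equals the given A.
    rng = np.random.default_rng(seed)
    S = A * np.exp(2j * np.pi * rng.random(A.shape))  # random initial phase
    for _ in range(n_iter):
        S = stft(istft(S))                # project onto consistent spectra
        S = A * np.exp(1j * np.angle(S))  # restore the given amplitude
    return istft(S)
```

During WaveTTS training, both the predicted and the target features would pass through the same loop with the same n_iter, so that the loss compares like with like.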
Let us denote the predicted and the original mel-spectra as (Ŷ, Y). We apply Griffin-Lim and ISTFT to generate the corresponding waveforms (ŝ, s), with ŝ = ISTFT(GriffinLim(f(Ŷ))) and s = ISTFT(GriffinLim(f(Y))). We keep the same number of Griffin-Lim iterations to ensure that Griffin-Lim behaves identically for the two transformation pairs, Ŷ → ŝ and Y → s. We measure the distortion between ŝ and s with a time-domain loss that forces the speech waveform generated from the prediction network to be as close as possible to that generated from the mel-spectrum of natural speech.
3.2.2 Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
In speech synthesis, we optimize the feature prediction network to minimize the discrepancy, as measured by a loss function, between the synthesized waveform and the target natural speech. We propose a time-domain loss function, L_t, based on the scale-invariant signal-to-distortion ratio (SI-SDR). SI-SDR has been introduced as a time-domain objective measure in source separation [luo2018tasnet, bahmaninezhad2019comprehensive, venkataramani2018performance] to compare two time-domain speech signals. We adopt SI-SDR to measure the discrepancy between the generated waveform and the target natural speech. To our best knowledge, this is the first use of SI-SDR as a time-domain loss to improve TTS quality.
We note that SI-SDR is evaluated only during training and is not required at run-time inference. During training, the predicted time-domain waveform and the target speech have identical duration. Similarly, the predicted and target mel-spectra share the same frame length, which facilitates the SI-SDR calculation. As a greater SI-SDR indicates better quality, we take the negative value of SI-SDR as the loss function:

L_t = −SI-SDR(ŝ, s)    (7)

where

SI-SDR(ŝ, s) = 10 log₁₀(||s_target||² / ||e_noise||²),  s_target = (⟨ŝ, s⟩ / ||s||²) · s,  e_noise = ŝ − s_target    (8)

SI-SDR is expressed in decibels (dB) and is defined in the range (−∞, +∞); so is L_t.
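Equations 7 and 8 translate directly into NumPy for single-channel signals of equal length:

```python
import numpy as np

def si_sdr(s_hat, s, eps=1e-8):
    # Eq. 8: project s_hat onto s to obtain the target component;
    # the residual is treated as distortion ("noise").
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_hat - s_target
    return 10 * np.log10(np.dot(s_target, s_target) /
                         (np.dot(e_noise, e_noise) + eps))

def time_domain_loss(s_hat, s):
    # Eq. 7: a greater SI-SDR means better quality, so negate it for a loss.
    return -si_sdr(s_hat, s)
```

Note that rescaling s_hat does not change the value, which is what makes the measure scale-invariant.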
Table 1: Comparison of the systems. "NA" indicates that no waveform is generated during training.

                        Training Phase                                   Run-time Inference
          System        Loss Function  Waveform Generation               Waveform Generation
Baseline  Tacotron-GL   L_f            NA                                Griffin-Lim [griffin1984signal]
          Tacotron-WN   L_f            NA                                WaveNet vocoder [hayashi2017investigation]
Proposed  WaveTTS-GL    L_f + α·L_t    Griffin-Lim [griffin1984signal]   Griffin-Lim [griffin1984signal]
          WaveTTS-WN    L_f + α·L_t    Griffin-Lim [griffin1984signal]   WaveNet vocoder [hayashi2017investigation]
4 Experiments
We report TTS experiments on the LJSpeech database (https://keithito.com/LJ-Speech-Dataset/), which consists of 13,100 short clips, nearly 24 hours of speech in total, from a single speaker reading passages from 7 non-fiction books. We develop four systems for a comparative study:
- Tacotron-GL: the Tacotron-based baseline model [shen2018natural], with only the frequency-domain loss function. The Griffin-Lim algorithm is used to generate the waveform at run-time.
- Tacotron-WN: the Tacotron-based baseline model [shen2018natural], with only the frequency-domain loss function. A pre-trained WaveNet vocoder is used to generate the waveform at run-time.
- WaveTTS-GL: the proposed WaveTTS model, trained with the joint time-frequency domain loss. The Griffin-Lim algorithm is used during both training and run-time.
- WaveTTS-WN: the proposed WaveTTS model, trained with the joint time-frequency domain loss. The Griffin-Lim algorithm is used during training, and the pre-trained WaveNet vocoder is used to synthesize speech at run-time.
We also compare these systems with the ground-truth speech, denoted as GT. The systems are summarized in Table 1.
4.1 Experimental Setup
The 80-channel mel-spectrum is extracted with a 12.5 ms frame shift and a 50 ms frame length, and is normalized to zero mean and unit variance as the reference target. The decoder predicts one non-overlapping output frame at each decoding step. We use the Adam optimizer with β1 = 0.9 and β2 = 0.999, and a learning rate that decays exponentially after 50k iterations. We also apply L2 regularization. The hyperparameter α in Equation 4 is set empirically. All models are trained with a batch size of 32 for 100k steps. At run-time, Tacotron-GL and WaveTTS-GL use the Griffin-Lim algorithm with 64 iterations, while Tacotron-WN and WaveTTS-WN use the pre-trained WaveNet vocoder.
4.2 Subjective Evaluation
We conduct listening experiments to evaluate the quality of the synthesized speech. We first evaluate the sound quality in terms of the mean opinion score (MOS) for GT, Tacotron-GL, Tacotron-WN and the proposed WaveTTS-GL and WaveTTS-WN, as reported in Figure 3. Listeners rate the quality on a 5-point scale: "5" for excellent, "4" for good, "3" for fair, "2" for poor, and "1" for bad. The MOS values reported in Figure 3 are the arithmetic average of all scores assigned by the subjects who passed the validation question test. We keep the linguistic content the same across the different models so as to exclude other interference factors. 15 subjects participated in these experiments, and each of them listened to 120 synthesized speech samples. We make three observations:
- The importance of the joint time-frequency domain loss: We compare Tacotron-GL and WaveTTS-GL to observe the effect of the joint time-frequency domain loss. We believe this is a fair comparison, as both frameworks use the Griffin-Lim algorithm for waveform generation during training and/or run-time. As can be seen in Figure 3, WaveTTS-GL outperforms Tacotron-GL, achieving a MOS of 3.30, while Tacotron-GL achieves only 3.18.
- The performance of WaveTTS with a neural vocoder at run-time: We compare Tacotron-WN and WaveTTS-WN to investigate how well the predicted mel-spectrum features perform with the WaveNet vocoder. We observe that even though WaveTTS is trained with the Griffin-Lim algorithm, it performs better than Tacotron when the WaveNet vocoder is available at run-time. This shows how well the proposed WaveTTS performs with other neural vocoders.
- Griffin-Lim vs. WaveNet vocoder at run-time: We compare WaveTTS-GL and WaveTTS-WN in terms of voice quality. Both frameworks are trained under the same conditions; however, WaveTTS-WN uses the WaveNet vocoder for waveform generation at run-time. As expected, WaveTTS-WN outperforms WaveTTS-GL.
We also conduct A/B preference tests to assess the speech quality of the proposed frameworks. In the A/B preference tests, listeners are asked to compare the quality and naturalness of synthesized speech samples from different systems and select the better one. 15 listeners were invited to participate in all tests, and 80 samples were randomly selected from the 200 converted samples of each system. Figure 4 shows the speech quality test results, which suggest that our proposed WaveTTS framework outperforms the baseline system for both Griffin-Lim and WaveNet vocoder settings at run-time.
We further conduct another A/B preference test to examine the effect of the number of Griffin-Lim iterations on WaveTTS performance. To calculate the time-domain loss, WaveTTS needs to generate the synthesized waveform during training. For rapid turnaround, we apply only 1 or 2 Griffin-Lim iterations for phase reconstruction during training, and investigate the effect on voice quality. Figure 5 shows the A/B preference test results for both WaveTTS-GL and WaveTTS-WN. We observe that a single Griffin-Lim iteration yields better performance than 2 iterations.
5 Conclusion
In this paper, we propose a new Tacotron implementation, called WaveTTS. Traditional TTS frameworks calculate only a frequency-domain loss to update the network parameters, which does not directly control the quality of the generated time-domain waveform. The proposed WaveTTS is unique in the sense that it calculates both time-domain and frequency-domain losses, optimizing the model to generate high-quality synthesized voice. We propose to use the scale-invariant signal-to-distortion ratio (SI-SDR) as the time-domain loss function. Even though the proposed model is trained with the Griffin-Lim algorithm for time-domain loss calculation, it performs remarkably well with both Griffin-Lim and the WaveNet vocoder at run-time. Experimental results show that the proposed framework outperforms the baselines and achieves high-quality synthesized speech. To our best knowledge, this is the first implementation of a Tacotron-based TTS model with a joint time-frequency domain loss.
As future work, we will investigate training with the joint time-frequency domain loss and a neural vocoder for high-quality TTS.
6 Acknowledgements
This research was supported by the National Natural Science Foundation of China (No. 61563040, No. 61773224) and the Natural Science Foundation of Inner Mongolia (No. 2018MS06006, No. 2016ZD06).