With the advent of deep learning, end-to-end TTS has shown improved performance over conventional TTS techniques [27, 30]. For example, Tacotron-based approaches [28, 21, 25, 11], which use an encoder-decoder architecture with an attention mechanism, have been shown to achieve remarkable performance. The key idea of these techniques is to integrate the conventional TTS pipeline into a unified network and to learn the mapping directly from <text, wav> pairs [11, 3, 7, 13, 31]. Furthermore, together with a neural vocoder [15, 6, 21, 14, 22, 23, 24], natural-sounding, human-like speech can be generated.
However, neural end-to-end TTS is still far from perfect. A typical neural TTS system suffers from the exposure bias problem [18, 20] in the autoregressive model [10] used by the decoder module. Specifically, in the training stage, the decoder generates a frame with the previous frames of natural speech as input, which is called teacher forcing mode. In the inference stage, however, the decoder predicts a frame with the previously predicted frames as input, which is called free running mode. The resulting mismatch between the natural speech frames and the predicted frames, which is especially pronounced for out-of-domain test data, leads to unpredictable outcomes such as skipped or repeated words, incomplete synthesis, and inappropriate prosodic phrase breaks [7, 19, 32, 12].
The techniques to improve in-domain performance of neural TTS frameworks include attention mechanisms [16, 2, 26] and scheduled sampling [1, 9]. Scheduled sampling, however, has a negative side effect: it disrupts the temporal dependency of the acoustic sequence, which causes misalignment between the natural speech frames and the predicted frames. The techniques to improve out-of-domain performance include the GAN-based TTS framework [5], which introduces both real and generated data sequences in discriminator training, and, more recently, stepwise monotonic attention for neural TTS [7].
In this paper, we propose a novel training scheme for the neural end-to-end TTS framework, the teacher-student training scheme, which performs remarkably well for out-of-domain inference. In this scheme, a teacher model learns the text-speech mapping from training data in teacher forcing mode, while a student model learns in free running mode from both the probability distribution of the teacher model and the same training data that was used for the teacher model. The process in which the student learns from the teacher model is called knowledge distillation, and its learning objective is called the distillation loss.
The main contributions of this paper are summarized as follows: 1) we propose a compact method for end-to-end TTS models; and 2) we propose a teacher-student training scheme for Tacotron-based TTS models. To the best of our knowledge, this is the first implementation of a teacher-student training scheme for a Tacotron2-based TTS framework. The proposed training scheme is validated with out-of-domain test data in Chinese and English TTS systems.
This paper is organized as follows: In Section 2, we re-visit the Tacotron2-based TTS framework that serves as a baseline reference. In Section 3, we study the proposed teacher-student training scheme for robust end-to-end TTS. In Section 4, we report the evaluation results. We conclude the paper in Section 5.
2 Tacotron2 based TTS
In this paper, we adopt Tacotron2 [21] with scheduled sampling in the training stage as the reference baseline. For rapid turn-around, we use Griffin-Lim [4] waveform reconstruction instead of a WaveNet vocoder in this study. We note that the choice of waveform generation technique does not affect our judgment of the effectiveness of the proposed training scheme.
We illustrate the overall architecture of the reference baseline in Figure 1, which includes an encoder, an attention-based decoder, and the Griffin-Lim algorithm. The encoder consists of two components: a CNN-based module with 3 convolution layers and an LSTM-based module with a bidirectional LSTM layer. The decoder consists of four components: a 2-layer pre-net, 2 LSTM layers, a linear projection layer, and a 5-convolution-layer post-net. The decoder is a standard autoregressive recurrent neural network that generates the mel-spectrogram features and stop tokens frame by frame.
During training, the decoder generates each frame in scheduled sampling mode. At run-time inference, however, the decoder can only use free running mode to predict future frames. A decoder trained in this way faces two problems: the mismatch between the natural speech frames and the predicted speech frames, and the adverse effect of scheduled sampling on the temporal dependency of the natural acoustic sequence. To address these issues during training, we study a teacher-student training scheme in Section 3.
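The scheduled sampling mode used by the baseline can be illustrated with a minimal sketch. The function names, the `teacher_prob` parameter, and the use of precomputed predicted frames are assumptions made for illustration; they are not taken from the Tacotron2 implementation used in this paper.

```python
import random


def choose_decoder_input(natural_prev, predicted_prev, teacher_prob, rng):
    """Pick the previous frame that is fed to the decoder at one training step.

    With probability `teacher_prob` the natural (ground-truth) frame is used,
    otherwise the frame predicted by the model itself is used.
    """
    return natural_prev if rng.random() < teacher_prob else predicted_prev


def scheduled_sampling_inputs(natural_frames, predicted_frames, teacher_prob, seed=0):
    """Build the mixed decoder-input sequence for one utterance.

    Simplification (assumption): the predicted frames are given up front,
    whereas a real decoder would produce them step by step.
    """
    rng = random.Random(seed)
    return [
        choose_decoder_input(n, p, teacher_prob, rng)
        for n, p in zip(natural_frames, predicted_frames)
    ]
```

With `teacher_prob=1.0` the sequence reduces to teacher forcing, and with `teacher_prob=0.0` it reduces to free running. For values in between, natural and predicted frames are interleaved, which is the source of the disrupted temporal dependency discussed above.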
3 Teacher-student training for Tacotron2 based TTS
In this section, we discuss in detail the teacher model, the student model, and the teacher-student training scheme. While both the teacher model and the student model have the same network architecture as the reference baseline, they adopt different decoding strategies, as illustrated in Figure 2.
In practice, we first train a standard Tacotron2 model for an end-to-end TTS system under the teacher forcing mode, which serves as the teacher model. Because the teacher model learns under the teacher forcing mode, it is expected to represent the true distribution of the natural speech data. We then train another Tacotron2 model, the student model, under the free running mode. The student model learns from both the ground-truth sequence and the hidden states of the teacher model simultaneously. By learning from the hidden states of the teacher model via knowledge distillation, the student model learns the true distribution of the natural speech data effectively. Because the student model is trained under the free running mode, with the predicted speech frames as the decoder input, it is expected to adapt to the run-time inference condition.
3.1 Teacher Model
For the decoder in the teacher model, we implement the teacher forcing mode, which predicts a speech frame by taking the previous natural speech frames in the sequence as input.
Given an input character sequence $x$ and its target mel-spectrogram features $y$, let $f_T$ be the teacher model with model parameters $\theta_T$. Under teacher forcing, the teacher model takes the previous frames $y_{<t}$ from the target natural speech as input to predict the feature frame at time step $t$:
$$\hat{y}_t = f_T(y_{<t}, x; \theta_T) \tag{1}$$
where $\hat{y}_t$ is the predicted value and $y_{<t}$ is from the target natural speech.
With this decoding mode, the teacher model is expected to learn the true probability distribution of the natural speech data, which should be very informative for the student model.
3.2 Student Model
The student model has the same network architecture as the teacher model but uses a completely different decoding mode: free running mode. In this mode, the decoder predicts a speech frame by taking the previously predicted speech frames in the sequence as input. The decoding process of the student model is defined as:
$$\hat{y}_t = f_S(\hat{y}_{<t}, x; \theta_S) \tag{2}$$
where $f_S$ is the student model with parameters $\theta_S$, $x$ is the input character sequence, and $\hat{y}_t$ is the predicted value at time step $t$.
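The difference between the two decoding modes can be sketched with a toy autoregressive loop. Everything here is an illustrative assumption: frames are scalars rather than mel-spectrogram vectors, and `toy_step` is a hypothetical stand-in for a trained decoder step.

```python
def decode(step_fn, text_context, num_frames, natural_frames=None, go_frame=0.0):
    """Run an autoregressive decoder for `num_frames` steps.

    If `natural_frames` is given, every step is fed the previous natural frame
    (teacher forcing). Otherwise every step is fed the decoder's own previous
    prediction (free running).
    """
    predictions = []
    prev = go_frame
    for t in range(num_frames):
        pred = step_fn(prev, text_context)
        predictions.append(pred)
        if natural_frames is not None:
            prev = natural_frames[t]  # teacher forcing: ground-truth frame
        else:
            prev = pred               # free running: own prediction
    return predictions


def toy_step(prev_frame, text_context):
    # Hypothetical decoder step: a fixed linear map, for illustration only.
    return 0.5 * prev_frame + text_context
```

Calling `decode(toy_step, 1.0, 3, natural_frames=[1.0, 2.0, 3.0])` follows the teacher model's mode, while `decode(toy_step, 1.0, 3)` follows the student model's mode. The two outputs diverge as soon as a prediction differs from the natural frame, which is the mismatch that the training scheme targets.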
3.3 Knowledge Distillation
Typically, knowledge distillation is a process where a small model is trained to mimic a pre-trained, larger model [8, 29]. In this paper, we borrow the idea of knowledge distillation in the implementation of the teacher-student training scheme.
The idea is to use a teacher model, which has been trained under the teacher forcing mode, to guide the training of the student model, which runs under the free running mode. Because the teacher model is trained with natural speech frames as the decoder input, we expect its output probability distribution to reflect the true distribution of the natural speech data. The student model is trained under the free running mode and is therefore closer to the actual inference condition. At the same time, the hidden states of the student model are optimized to be close to those of the teacher model by way of knowledge distillation. As can be seen in Figure 2, we define one objective function for the teacher model, the feature loss. We devise two objective functions for the student model: a feature loss, which is the same as in the teacher model, and a distillation loss for knowledge distillation.
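We formulate the entire process next. The encoder takes the input character sequence $x$ from the given text and converts the one-hot vectors into a continuous high-level feature representation $h$:
$$h = \mathrm{Encoder}(x) \tag{3}$$
The teacher decoder, Decoder_T, outputs a hidden state $h^T_t$ at each step $t$:
$$h^T_t = \mathrm{Decoder}_T(y_{<t}, A(h)) \tag{4}$$
where $y_{<t}$ denotes the previous natural speech frames and $A(\cdot)$ represents a function that calculates the context vector by using the location-sensitive attention mechanism.
Similarly, the student decoder, Decoder_S, processes the same input sequence and generates the hidden state $h^S_t$ at each step $t$ at the same time:
$$h^S_t = \mathrm{Decoder}_S(\hat{y}_{<t}, A(h)) \tag{5}$$
where $\hat{y}_{<t}$ denotes the frames previously predicted by the student model.
In both the teacher model and the student model, the feature loss $\mathcal{L}_f$ ensures that the generated speech $\hat{y}$ is close to the target speech $y$:
$$\mathcal{L}_f = D(y, \hat{y}) \tag{6}$$
In the student model, to minimize the discrepancy between the hidden states $h^T$ and $h^S$ of the teacher model and the student model, we introduce the distillation loss $\mathcal{L}_d$:
$$\mathcal{L}_d = D(h^T, h^S) \tag{7}$$
where $D(\cdot, \cdot)$ denotes a distance between two sequences. The total loss function for the student model is therefore
$$\mathcal{L} = \mathcal{L}_f + \lambda \mathcal{L}_d \tag{8}$$
where $\lambda$ is a trade-off parameter for the two loss terms.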
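The combination of the feature loss and the distillation loss for the student model can be sketched as follows. The use of mean squared error for both terms is an assumption made for illustration, since the distance measure is not specified here; the function and argument names are also hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally shaped sequences of vectors."""
    total, count = 0.0, 0
    for vec_a, vec_b in zip(a, b):
        for xa, xb in zip(vec_a, vec_b):
            total += (xa - xb) ** 2
            count += 1
    return total / count


def student_loss(target_mel, student_mel, teacher_hidden, student_hidden, lam=1.0):
    """Total student loss: feature loss plus `lam` times the distillation loss.

    Assumption: both terms use MSE. The teacher hidden states come from the
    already trained teacher model and are treated as fixed targets.
    """
    feature_loss = mse(target_mel, student_mel)
    distillation_loss = mse(teacher_hidden, student_hidden)
    return feature_loss + lam * distillation_loss
```

Setting `lam=0.0` recovers plain free running training with the feature loss only, so the trade-off parameter directly controls how strongly the student is pulled towards the teacher's hidden states.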
With knowledge distillation, the proposed two-step teacher-student training scheme allows for a more compact end-to-end network than alternatives such as the generative adversarial network [5]. The teacher model is trained with the feature loss under the teacher forcing mode, while the student model is trained with a combination of the two loss functions under the free running mode.
4 Experiments
We develop two systems, one on a Chinese corpus (12 hours of Data Baker, https://www.data-baker.com/open_source.html) and one on an English corpus (LJSpeech, https://keithito.com/LJ-Speech-Dataset/). To verify the effectiveness of the proposed knowledge distillation system, denoted as Tacotron2-KD, we choose two baseline frameworks: 1) Tacotron2 with scheduled sampling, denoted as Tacotron2-SS, and 2) Tacotron2 with free running mode, denoted as Tacotron2-FR. In all experiments, we use the Griffin-Lim algorithm [4] for waveform generation for rapid turn-around.
4.1 Experimental Setup
For Chinese experiments, the encoder input is a pinyin sequence with tones and the decoder output is a 160-channel Mel spectrum, two frames at a time. For English experiments, the encoder input is a character sequence and the decoder output is an 80-channel Mel spectrum, two frames at a time. These two encoder input forms are collectively referred to as characters in this paper. For both systems, we use the Adam optimizer with $\beta_1$ = 0.9 and $\beta_2$ = 0.999, and a learning rate that decays exponentially starting after 50k iterations. We also apply regularization during training. The trade-off hyper-parameter $\lambda$ in Equation 8 is set to 1.0, and all models are trained with a batch size of 32. In teacher-student training, we adopt the teacher model trained for 150k steps as the teacher decoder Decoder_T, and train the student decoder Decoder_S for 150k steps with the proposed knowledge distillation method.
4.2 Subjective Evaluation
We conduct experiments with out-of-domain test data to evaluate the naturalness and robustness of the synthesized speech. For Chinese, we select 500 test samples from the Blizzard Challenge 2019 Chinese dataset [17]. For English, we select 50 test samples from FastSpeech [19] that are particularly hard for TTS systems. In addition to these 50 test samples, which consist of single letters, spellings, and repeated numbers, we also include 30 long sentences, each having 128 characters on average. 20 English speakers and 15 Chinese speakers participated in the listening tests. Each subject listens to 80 synthesized utterances in his/her native language.
4.2.1 Naturalness Evaluation
We first evaluate the sound quality of the synthesized speech with the mean opinion score (MOS) for Tacotron2-SS, Tacotron2-FR, and the proposed Tacotron2-KD, as reported in Table 1. The listeners rate the quality on a 5-point scale: “5” for excellent, “4” for good, “3” for fair, “2” for poor, and “1” for bad. The proposed Tacotron2-KD clearly outperforms the baseline Tacotron2-SS for both English and Chinese data. Because Tacotron2-FR achieves an MOS of only 1.33 for English and 2.32 for Chinese, which is significantly lower than that of Tacotron2-SS, we exclude the Tacotron2-FR results from the experiment report.
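As a small illustration of the rating scale above, a MOS is the arithmetic mean of the listeners' scores. The helper name and the validation of the 5-point scale are assumptions for this sketch, not a description of the evaluation tooling used in the listening tests.

```python
# Labels of the 5-point scale described above.
SCALE = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(ratings):
    """Average a list of integer ratings on the 5-point scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in SCALE:
            raise ValueError(f"rating {rating!r} is outside the 5-point scale")
    return sum(ratings) / len(ratings)
```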
4.2.2 Robustness Evaluation
We further conduct experiments to evaluate the robustness of the synthesized speech for Tacotron2-SS and the proposed Tacotron2-KD, as reported in Table 1. We measure robustness by the word error rate (WER, in %), which is the sum of repeats (insertions) and skips (deletions) divided by the total number of characters in the listening tests [5]. Repeats and skips represent the two types of errors that Tacotron2 faces. Tacotron2-KD reduces the errors by 8.77% and 21.65% over the Tacotron2-SS baseline.
A detailed analysis finds that Tacotron2-SS generates 528 skips and 9 repeats for Chinese data, and 251 skips and 12 repeats for English data, while Tacotron2-KD generates only 38 skips for Chinese data and 24 skips for English data. We do not observe any repeats in the Tacotron2-KD outputs, which we believe is remarkable.
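The error measure can be written out as a short sketch. The error counts below are the ones stated above; the total number of characters is not given in this section, so `total_units` is left as a parameter, and the function names are hypothetical.

```python
# Skip and repeat counts as stated in the detailed analysis above.
ERROR_COUNTS = {
    ("Chinese", "Tacotron2-SS"): {"skips": 528, "repeats": 9},
    ("English", "Tacotron2-SS"): {"skips": 251, "repeats": 12},
    ("Chinese", "Tacotron2-KD"): {"skips": 38, "repeats": 0},
    ("English", "Tacotron2-KD"): {"skips": 24, "repeats": 0},
}


def total_errors(language, system):
    """Sum of skips (deletions) and repeats (insertions) for one system."""
    counts = ERROR_COUNTS[(language, system)]
    return counts["skips"] + counts["repeats"]


def error_rate_percent(repeats, skips, total_units):
    """Error rate in percent: (repeats + skips) / total number of units.

    `total_units` is the total number of characters in the test set, which
    must be supplied by the caller (it is not stated in the text).
    """
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    return 100.0 * (repeats + skips) / total_units
```

For example, `total_errors("Chinese", "Tacotron2-SS")` adds 528 skips and 9 repeats, and the corresponding rate follows once the character total of the Chinese test set is passed to `error_rate_percent`.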
5 Conclusion
We have studied a training scheme for Tacotron2 that overcomes the exposure bias problem and enables high-quality speech synthesis for out-of-domain text. We implement the teacher-student training scheme through a knowledge distillation objective function. We have conducted a series of experiments on both Chinese and English to evaluate naturalness and robustness. The proposed Tacotron2-KD framework consistently outperforms the baseline systems in both languages.
In addition to the improvements in naturalness and robustness, we also find that Tacotron2-KD delivers improved prosody rendering. We will report the prosody analysis of the Tacotron2-KD system in future work.
References
[1] (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179.
[2] (2015) Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585.
[3] (2019) Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6940–6944.
[4] (1984) IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243.
[5] (2019) A New GAN-Based End-to-End TTS Training Algorithm. In Proc. Interspeech 2019, pp. 1288–1292.
[6] An investigation of multi-speaker training for wavenet vocoder. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 712–718.
[7] (2019) Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS. In Proc. Interspeech 2019, pp. 1293–1297.
[8] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[9] (2015) How (not) to train your generative model: scheduled sampling, likelihood, adversary?. arXiv preprint arXiv:1511.05101.
[10] Mixture autoregressive hidden markov models for speech signals. IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (6), pp. 1404–1413.
[11] (2019) Robust and fine-grained prosody control of end-to-end speech synthesis. In ICASSP 2019-IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5911–5915.
[12] (2019) Maximizing mutual information for tacotron. arXiv preprint arXiv:1909.01145.
[13] (2019) Training Multi-Speaker Neural Text-to-Speech Systems Using Speaker-Imbalanced Speech Corpora. In Proc. Interspeech 2019, pp. 1303–1307.
[14] (2019) Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders. In Proc. Interspeech 2019, pp. 1308–1312.
[15] (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
[16] Online and linear-time attention by enforcing monotonic alignments. Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70, pp. 2837–2846.
[17] (2019) Submission from cmu for blizzard challenge 2019. In Proceedings of Blizzard_Challenge 2019.
[18] Sequence level training with recurrent neural networks. In Proc. ICLR 2016.
[19] (2019) FastSpeech: fast, robust and controllable text to speech. In Proc. NeurIPS 2019.
[20] (2019) Generalization in generation: a closer look at exposure bias. arXiv preprint arXiv:1910.00292.
[21] (2018) Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783.
[22] (2018) A voice conversion framework with tandem feature sparse representation and speaker-adapted wavenet vocoder. In Proc. Interspeech 2018, pp. 1978–1982.
[23] (2019) Group Sparse Representation with WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion. IEEE/ACM Transactions on Audio, Speech and Language Processing.
[24] (2018) Adaptive wavenet vocoder for residual compensation in gan-based voice conversion. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 282–289.
[25] (2018) Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 4693–4702.
[26] (2018) Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4784–4788.
[27] (2013) Speech synthesis based on hidden markov models. Proceedings of the IEEE 101 (5), pp. 1234–1252.
[28] (2017) Tacotron: a fully end-to-end text-to-speech synthesis model. In INTERSPEECH, pp. 4006–4010.
[29] A gift from knowledge distillation: fast optimization, network minimization and transfer learning. pp. 4133–4141.
[30] (2013) Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7962–7966.
[31] Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet. In Proc. Interspeech 2019, pp. 1298–1302.
[32] (2019) Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis. IEEE Access 7, pp. 65955–65964.