A state-of-the-art text-to-speech (TTS) system can synthesize speech that is almost indistinguishable from human speech. However, the inference speed of such a TTS system is usually too slow to be used in real-world applications [25, 13, 19]. Several TTS systems (i.e., fast TTS systems) have been proposed to reduce the inference time, but there is room for improvement in speech quality [22, 23].
Meanwhile, there is a limitation in training modern TTS systems. Most of them are trained with $L_1$ or $L_2$ loss functions between predicted and ground-truth Mel-spectrograms (we denote this loss as “Mel loss”) and evaluated using the mean opinion score (MOS) to measure the perceptual quality of synthesized speech. Such loss functions cannot directly reflect the perceptual quality of the synthesized speech. Even though conventional TTS models have succeeded in generating high-fidelity speech, we can expect that using a loss function directly related to the perceptual quality may further improve the performance of TTS models. Many studies using a perceptual loss have been conducted in various tasks where there is a mismatch between the training loss and the evaluation metric. We discuss them in detail in Section 2.
In this paper, we propose a novel method to improve the speech quality of the fast TTS systems while keeping the inference fast. To do so, we incorporate the perceptual quality into the loss function during training. More specifically, we train a text-to-Mel-spectrogram conversion model using a perceptual loss calculated by a MOS prediction model.
2 Related Work
Many studies have proposed a perceptual loss to improve the quality of the outputs generated by a deep-learning-based model. In this section, we first introduce those studies and compare them with ours. There are generally two orthogonal approaches to defining a perceptual loss. The first approach is based on a style reconstruction loss proposed by Gatys et al. It assumes that a neural network pre-trained for classification already has the perceptual information which the generative models need to learn. Then, it tries to make the feature representations of the generative model similar to those of the pre-trained classification model. Here, the perceptual loss is defined as the distance between the feature representations from the generative model and those from the pre-trained classification model. This approach has been successfully applied to various fields, including image style transfer [7, 10], audio inpainting, speech enhancement, neural vocoding [20, 11], and expressive TTS.
The second approach uses a perceptual evaluation metric, such as the perceptual evaluation of speech quality (PESQ) or short-time objective intelligibility (STOI), to learn the perceptual information more directly. For the image enhancement task, Talebi and Milanfar proposed to maximize the aesthetic score of the enhanced image that was predicted by a pre-trained convolutional neural network (CNN). For the speech enhancement task, Zhao et al. and Fu et al. proposed to fine-tune a pre-trained speech enhancement model by maximizing the modified STOI and an approximated PESQ function, respectively. Using a PESQ-inspired criterion was also proposed for voice conversion to increase the overall quality of the converted speech. For unit selection TTS, Peng et al. optimized the concatenative cost function with respect to its correlation with the MOS. For deep-learning-based TTS, Baby et al. proposed to select the TTS model that has the lowest phone error rate.
As can be seen above, despite a large number of prior works using the perceptual loss, only a few of them have been proposed for deep-learning-based TTS. One of them is the work of Baby et al., which uses the phone error rate as a perceptual metric. The authors use the metric as a criterion for selecting the best model after training, not as a loss function for training. In this paper, we directly train a TTS model using the MOS. Our proposed method differs from their work in two ways: 1) we use the MOS, not the phone error rate, and 2) we use the perceptual metric during training, not after. As we use the perceptual metric, MOS, to learn the perceptual information, our method follows the second approach above. The work of Talebi and Milanfar, proposed for the image enhancement task, is particularly relevant to ours. They calculate a perceptual loss using a pre-trained model for the quality prediction and use the perceptual loss to train a generative model. Similarly, we define a perceptual loss using a pre-trained MOS prediction model and use the perceptual loss to train a TTS model. To the best of our knowledge, this is the first work that uses the perceptual loss based on MOS to train a deep-learning-based TTS model.
A general state-of-the-art TTS system consists of a text-to-Mel-spectrogram conversion model and a neural vocoder. From now on, we will call the text-to-Mel-spectrogram conversion model the “TTS model” for simplicity. In this paper, we focus on generating more natural Mel-spectrograms with a TTS model.
3.1 TTS model
We first point out that our method can be applied to any TTS model since it only needs the predicted Mel-spectrogram as an input, regardless of the existing model architecture or training method. In this paper, we use FastSpeech as a baseline TTS model, which is one of the most famous and fastest end-to-end TTS models. Our TTS model is almost the same as the original FastSpeech. The only difference is that our model uses Korean characters, not English phonemes, as an input. The size of the character vocabulary is 74, including punctuation.
FastSpeech enables parallel text-to-Mel-spectrogram conversion based on a feed-forward Transformer architecture with a length regulator. The duration prediction model of the length regulator predicts the length for each input time step based on the alignment calculated by the pre-trained Transformer. Then, the length regulator expands the hidden states of the input sequence according to the predicted length. Therefore, parallel computation in FastSpeech is possible during both training and inference. In terms of the text-to-Mel-spectrogram conversion speed, Ren et al. reported that FastSpeech is about 269 times faster than Transformer TTS. FastSpeech can also improve the robustness of the synthesized speech to complicated input text and can control the speed of the synthesized speech by scaling the predicted duration.
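The expansion performed by the length regulator can be illustrated with a minimal sketch. The function name `length_regulate`, the `speed` parameter, and the toy hidden states are assumptions introduced here for illustration; they are not taken from the FastSpeech or ESPNet code.

```python
from typing import List, Sequence


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[float],
                    speed: float = 1.0) -> List[List[float]]:
    """Repeat each input hidden state according to its predicted duration.

    `speed` is a hypothetical control: values above 1 shorten every
    duration (faster speech), values below 1 lengthen it.
    """
    if len(hidden) != len(durations):
        raise ValueError("one duration is needed per input time step")
    if speed <= 0:
        raise ValueError("speed must be positive")
    expanded: List[List[float]] = []
    for state, duration in zip(hidden, durations):
        # Scale the predicted duration, then round to a whole number of frames.
        n_frames = max(0, int(round(duration / speed)))
        expanded.extend(list(state) for _ in range(n_frames))
    return expanded


# Three toy one-dimensional hidden states with predicted durations 2, 1, 3.
frames = length_regulate([[0.1], [0.2], [0.3]], [2, 1, 3])
print(len(frames))
```

Because every output frame depends only on the input states and their durations, all frames can be produced at once, which is what makes the parallel conversion possible.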
3.2 Perceptual loss using the MOS prediction model
To directly improve the perceptual quality of the synthesized speech, we propose to combine a perceptual loss with the conventional loss during the training of FastSpeech. We use the predicted MOS of the generated Mel-spectrogram to calculate the perceptual loss. For MOS prediction, we train three models based on two types of MOS prediction models. The first type is MOSNet, which was proposed to predict a MOS from a 257-dim linear spectrogram. It consists of 12 convolutional layers, one bidirectional long short-term memory (BLSTM) layer, two fully connected layers, and a global pooling layer. The second type is MOSNet+STC+SD from our prior work, which is an advanced version of MOSNet. It uses multi-task learning (MTL) with spoofing type classification (STC) and spoofing detection (SD) to improve the generalization ability of a MOS prediction model. For simplicity of notation, we denote MOSNet+STC+SD as MTL-MOSNet. All the MOS prediction models in this paper have almost the same architecture as either MOSNet or MTL-MOSNet. The only difference arises because we combine the MOS prediction model with FastSpeech: to use the 80-dim Mel-spectrogram predicted by FastSpeech instead of a 257-dim linear spectrogram as an input, we change the number of BLSTM units from 128 to 32. Please refer to our prior work for a detailed explanation of MOSNet and MTL-MOSNet.
Among the three MOS prediction models, the first and second models are MOSNet and MTL-MOSNet, respectively. The third MOS prediction model is MTL-MOSNet trained on an augmented dataset. Since there is a domain mismatch between the training data for the MOS prediction model and the TTS model, we augment the dataset for MOS prediction with the audio samples in the TTS dataset. We train MTL-MOSNet on the augmented training data and call this model “AUG.” Since running a subjective MOS test for all the audio samples in the TTS dataset is expensive and time-consuming, we assume that all the ground-truth MOSs for them are 5. We believe this is reasonable since the TTS dataset is recorded by a professional speaker in a clean environment.
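The augmentation described above amounts to appending the TTS recordings to the MOS prediction data with a fixed label of 5. The following is a minimal sketch under that reading; the function name, the `(path, score)` tuple layout, and the file names are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

MAX_MOS = 5.0  # label assumed for every clean, professionally recorded TTS utterance


def augment_mos_dataset(mos_data: List[Tuple[str, float]],
                        tts_audio_paths: List[str]) -> List[Tuple[str, float]]:
    """Return the MOS dataset extended with TTS recordings labelled MAX_MOS."""
    return list(mos_data) + [(path, MAX_MOS) for path in tts_audio_paths]


# Hypothetical entries: two rated samples plus two unrated TTS recordings.
rated = [("vcc_sample_a.wav", 3.25), ("vcc_sample_b.wav", 2.0)]
augmented = augment_mos_dataset(rated, ["tts_0001.wav", "tts_0002.wav"])
print(augmented[-1])
```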
Now, we need to define the perceptual loss so that minimizing the perceptual loss is equal to maximizing the predicted MOS. We define the perceptual loss as the loss between 5 (the maximum MOS) and the predicted MOS. After that, we combine the perceptual loss with the conventional loss. In the case of FastSpeech, the conventional loss ($L_{\mathrm{conv}}$) is defined as follows:
$$L_{\mathrm{conv}} = L_{\mathrm{mel}} + L_{\mathrm{dur}},$$
where $L_{\mathrm{mel}}$ and $L_{\mathrm{dur}}$ indicate the Mel loss and duration loss of FastSpeech, respectively. One of the biggest problems in using a perceptual loss occurs in the early stages of training. The purpose of using MOS is to evaluate complete systems, not incompletely trained systems. Therefore, the dataset used to train MOS prediction models does not contain audio samples generated by incompletely trained speech generation systems. As a result, the Mel-spectrograms predicted in the early stages of training are entirely unseen by the pre-trained MOS prediction model, which means that their predicted MOS values are unreliable.
To address this problem, we propose two strategies. In the first strategy, we add the perceptual loss ($L_{\mathrm{perc}}$) to $L_{\mathrm{conv}}$ if the Mel loss is equal to or lower than a fixed threshold. Otherwise (i.e., at the early stages of training), we only use $L_{\mathrm{conv}}$. When we apply the perceptual loss, we add it to the scaled $L_{\mathrm{conv}}$. We call this method the “thresholding strategy.” In the second strategy, we propose to use a weighted sum of $L_{\mathrm{conv}}$ and $L_{\mathrm{perc}}$, which is motivated by . The total loss function is defined as follows:
$$L_{\mathrm{total}} = \alpha L_{\mathrm{conv}} + L_{\mathrm{perc}}.$$
Then, we begin training with a large $\alpha$ and gradually reduce $\alpha$ as the epoch increases. We call this method the “weighted sum strategy.” Since we can add $L_{\mathrm{perc}}$ to $L_{\mathrm{conv}}$ using one of these strategies, we denote these as “addition strategies.” Fig. 1 represents the diagram of training our perceptually guided TTS. Under the supervision of the perceptual loss, the TTS model learns to maximize the speech quality directly.
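The two addition strategies can be sketched with plain Python scalars standing in for the loss tensors. Several details are assumptions made only for this sketch: the perceptual loss is taken as the absolute difference between 5 and the predicted MOS, the conventional loss as the unweighted sum of the Mel and duration losses, the weighted sum as unnormalized, and `scale` as a hypothetical parameter for the scaling of the conventional loss. All function names are illustrative.

```python
MAX_MOS = 5.0  # upper end of the five-point MOS scale


def perceptual_loss(predicted_mos: float) -> float:
    """Distance between the maximum MOS and the predicted MOS.

    The absolute difference is an assumption of this sketch.
    """
    return abs(MAX_MOS - predicted_mos)


def conventional_loss(mel_loss: float, duration_loss: float) -> float:
    """Conventional loss, assumed here to be Mel loss plus duration loss."""
    return mel_loss + duration_loss


def thresholding_loss(mel_loss: float, duration_loss: float,
                      predicted_mos: float, threshold: float,
                      scale: float) -> float:
    """Thresholding strategy: use the perceptual loss only once the Mel loss is small."""
    conv = conventional_loss(mel_loss, duration_loss)
    if mel_loss <= threshold:
        # `scale` is a hypothetical factor applied to the conventional loss.
        return scale * conv + perceptual_loss(predicted_mos)
    return conv


def weighted_sum_loss(mel_loss: float, duration_loss: float,
                      predicted_mos: float, weight: float) -> float:
    """Weighted sum strategy: `weight` multiplies the conventional loss."""
    return weight * conventional_loss(mel_loss, duration_loss) + perceptual_loss(predicted_mos)


# Early in training the Mel loss is large, so only the conventional loss is used.
print(thresholding_loss(0.8, 0.1, 3.0, threshold=0.5, scale=10.0))
# Later the Mel loss falls below the threshold and the perceptual term is added.
print(thresholding_loss(0.4, 0.1, 4.0, threshold=0.5, scale=10.0))
```

In both strategies the MOS prediction model influences training only through the perceptual term, so a large weight or a Mel loss above the threshold keeps the unreliable early MOS predictions from dominating the gradient.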
|Quality factor||Level||Instructions|
|Naturalness||High||The audio flow sounds natural and human-like.|
|Naturalness||Low||The audio flow sounds unnatural.|
|Intelligibility||High||The pronunciation is accurate and natural. There is no deleted or repeated sound. The duration of each phoneme or between the phonemes is natural. It requires no listening effort.|
|Intelligibility||Low||The pronunciation is inaccurate and unnatural. There are many deleted or repeated sounds. The duration of each phoneme or between the phonemes is too short or too long. No meaning is understood with any feasible effort.|
|Sound quality||High||The sound is clear without noise or distortion. Degradation of the sound quality is not noticeable.|
|Sound quality||Low||The sound is annoying with severe noise or distortion.|
We use our internal Korean dataset (https://github.com/emotiontts/emotiontts_open_db) for training all the end-to-end TTS models. It consists of 13000 utterances recorded by a professional female speaker. Each utterance is recorded in mono 16-bit PCM WAV format with a sampling rate of 22.05 kHz. The total length of the dataset is about 18 hours. We exclude 48 especially long utterances and use 12822, 65, and 65 utterances for training, validation, and testing, respectively.
To train all the MOS prediction models, we use the evaluation results of the Voice Conversion Challenge (VCC) 2018. For more details about the dataset, please refer to our previous works [5, 4]. As explained in Section 3.2, we train a MOS prediction model called “AUG” using the augmented dataset consisting of both the evaluation results of the VCC 2018 and our internal Korean dataset.
4.2 Implementation details
We first train a Transformer TTS model on our internal Korean dataset using two GTX 1080 Ti GPUs. In the case of FastSpeech-based models, we train them on a single GTX 1080 Ti GPU using the pre-trained Transformer as the teacher model. We implement all the models using ESPNet, which is an end-to-end speech processing toolkit.
For the implementation details of the MOS prediction models, please refer to our previous work. In the case of MOSNet and MTL-MOSNet, we choose the model with the lowest validation mean squared error (MSE). For AUG, we choose the model with the lowest validation loss. To generate the waveform, we train Parallel WaveGAN on the same Korean dataset and use it as the neural vocoder for all the TTS models.
When we use the “thresholding strategy,” we set the threshold to 0.5 and define the loss function as follows:
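When we use the “weighted sum strategy,” we set the maximum and minimum values of the weight on the conventional loss, $\alpha$, to 30 and 3, respectively, and reduce $\alpha$ by 3 every epoch. That is, $\alpha$ is defined as follows:
$$\alpha = \max(30 - 3e,\ 3),$$
where $e = 0, 1, 2, \dots$ is the number of completed training epochs.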
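The weight schedule of the weighted sum strategy can be written as a small helper. This is a sketch under the assumption that epochs are counted from zero, so that the first epoch uses the maximum weight of 30; the function and parameter names are illustrative.

```python
def conventional_weight(epoch: int, start: float = 30.0,
                        step: float = 3.0, floor: float = 3.0) -> float:
    """Weight on the conventional loss: start at `start`, drop by `step`
    per epoch, and never go below `floor`. `epoch` is assumed zero-based."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return max(start - step * epoch, floor)


# Weights for the first eleven epochs.
print([conventional_weight(e) for e in range(11)])
```

With these values the weight reaches its minimum after nine epochs and stays there, so the perceptual loss has its largest relative influence for the remainder of training.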
To investigate the effect of perceptual loss on the perceptual speech quality, we compared eight TTS models: Transformer, FastSpeech, and six perceptually guided TTS models. To train the proposed models, we used different MOS prediction models and addition strategies. To evaluate each TTS model, we chose the model with the lowest validation loss. Similar to , we performed subjective MOS tests in terms of naturalness, intelligibility, and sound quality. Naturalness and intelligibility are the most widely used quality factors for the evaluation of a TTS system. Based on previous works [3, 18, 9], we prepared the instructions for each factor of speech quality, as shown in Table 1. Specifically, prosody and duration are related to naturalness and intelligibility, respectively. A total of 30 utterances per TTS model were evaluated by a total of 14 native Korean speakers. Each utterance was evaluated by at least four raters with five possible responses: 1 = Bad; 2 = Poor; 3 = Fair; 4 = Good; and 5 = Excellent. Audio samples are available online at https://wkadldppdy.github.io/perceptualTTS/index.html.
|TTS model||MOS prediction model||Addition strategy||Naturalness||Intelligibility||Sound quality||Average MOS|
|Transformer||-||-||4.64 ± 0.11||4.66 ± 0.16||4.23 ± 0.13||4.51|
|FastSpeech||-||-||4.12 ± 0.17||4.10 ± 0.32||4.00 ± 0.18||4.07|
|FastSpeech||MOSNet||TH||4.24 ± 0.21||4.21 ± 0.27||3.86 ± 0.17||4.10|
|FastSpeech||MOSNet||WS||4.27 ± 0.19||4.34 ± 0.23||4.00 ± 0.18||4.20|
|FastSpeech||MTL-MOSNet||TH||4.27 ± 0.22||4.36 ± 0.26||3.86 ± 0.20||4.16|
|FastSpeech||MTL-MOSNet||WS||4.33 ± 0.17||4.24 ± 0.23||3.95 ± 0.15||4.17|
|FastSpeech||AUG||TH||3.97 ± 0.19||3.67 ± 0.33||3.89 ± 0.18||3.85|
|FastSpeech||AUG||WS||4.39 ± 0.16||4.50 ± 0.22||4.21 ± 0.14||4.37|
Table 2: The MOS test results. The MOS of each quality factor is shown with its 95% confidence interval. MTL-MOSNet stands for MOSNet using MTL with STC and SD. AUG stands for MTL-MOSNet trained on the augmented dataset. TH and WS stand for “thresholding strategy” and “weighted sum strategy,” respectively. The results of our best model are shown in the last row.
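For readers who want to reproduce the format of these entries, the following sketch computes a MOS and the half-width of a 95% confidence interval from a list of ratings. It assumes a normal approximation (z = 1.96) with the sample standard deviation; the text does not state how the intervals were obtained, so this choice, the function name, and the ratings are all illustrative.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (MOS, half-width of the confidence interval).

    Uses the normal approximation z * s / sqrt(n), which is an assumption
    of this sketch rather than the documented evaluation procedure.
    """
    if len(ratings) < 2:
        raise ValueError("at least two ratings are needed")
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mean(ratings), half_width


# Hypothetical ratings on the five-point scale.
score, half_width = mos_with_ci([5, 4, 4, 5, 3, 5, 4, 4])
print(f"{score:.2f} ± {half_width:.2f}")
```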
4.4 Results of perceptually guided TTS models
Before discussing the TTS results, we first present the MOS prediction results. When we evaluate MOSNet on 2000 samples from VCC 2018, the utterance-level linear correlation coefficient (LCC), Spearman’s rank correlation coefficient (SRCC), and mean squared error (MSE) are 0.624, 0.580, and 0.478, respectively. For MTL-MOSNet, utterance-level LCC, SRCC, and MSE are 0.640, 0.597, and 0.458, respectively. Both models show lower performance compared to when using the 257-dim linear spectrogram as an input (see  for detailed results). We assume that this is because the lower dimensional input contains less information. Nevertheless, both models still have reasonable MOS prediction performance.
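The three utterance-level metrics reported above can be computed as in the following dependency-free sketch. The function names and the toy scores are illustrative; ties in the Spearman computation are handled with average ranks.

```python
import math
from typing import List, Sequence


def mse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared error between predicted and ground-truth MOS."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def lcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear (Pearson) correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return lcc(average_ranks(x), average_ranks(y))


# Toy ground-truth and predicted MOS values for five utterances.
true_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_mos = [1.5, 2.5, 2.0, 4.5, 4.0]
print(lcc(pred_mos, true_mos), srcc(pred_mos, true_mos), mse(pred_mos, true_mos))
```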
Table 2 shows the results of eight different TTS models. The results of Transformer in the first row serve as the upper bound on the performance, because the Transformer TTS model was used as the teacher model for FastSpeech. The results of FastSpeech in the second row are the baseline results. The last column represents the average MOS of the three quality factors. “TH” and “WS” stand for “thresholding strategy” and “weighted sum strategy,” respectively.
First, we analyze the effects of different MOS prediction models. When we use MOSNet as the MOS prediction model, the average MOS is higher when using “WS” than when using “TH” as the addition strategy. The average MOS is slightly higher when using “TH” than the baseline. In terms of both naturalness and intelligibility, the MOSs of two models using different addition strategies are higher than the MOS of FastSpeech. However, the sound quality of the two models is similar to or even worse than that of FastSpeech.
When we use MTL-MOSNet as the MOS prediction model, the average MOS is almost the same regardless of the addition strategy and is higher than that of FastSpeech. Comparing the two addition strategies, naturalness and sound quality are higher when using “WS,” while intelligibility is higher when using “TH.” In any case, the proposed model outperforms FastSpeech in terms of naturalness and intelligibility. The performance gap is even larger than when using MOSNet. Again, the MOS of the sound quality does not improve compared to FastSpeech for both addition strategies.
Finally, the last two rows show the results of AUG as the MOS prediction model. Here, we can observe that they show two extreme results according to the addition strategy. The model using “WS” achieves a much higher average MOS than the baseline and obtains the best result among the proposed models. On the other hand, the performance of the model using “TH” is even worse than the baseline. We observe the same trend consistently across all the quality factors (i.e., naturalness, intelligibility, and sound quality).
Based on these results, we can see that the TTS model trained with the perceptual loss synthesizes speech with better quality. Such improvement occurs most consistently in terms of naturalness and intelligibility. As is well known, naturalness and intelligibility are the most important factors for the MOS test. Even though the MOS responses for VCC 2018 were collected by asking the raters to evaluate the naturalness, our results show that the MOS responses are closely related to intelligibility. Also, in , it was reported that there is a correlation between naturalness and intelligibility. We can conclude the following two things from these results. First, the MOS prediction models learn the relation between the two quality factors and MOS. Second, using the perceptual loss calculated by the MOS prediction models helps the TTS model learn to synthesize more natural and intelligible speech.
Meanwhile, the improvement in terms of sound quality only occurs when using AUG as the MOS prediction model and “WS” as the addition strategy. From this result, we can infer that the ground-truth MOSs used by the MOS prediction models for training are not significantly related to the sound quality. As mentioned earlier, VCC 2018 conducted the MOS test in terms of naturalness. Therefore, the raters would not care much about sound quality. When we augment the MOS prediction data with high-quality TTS data (labeled as the maximum value of the MOS), the MOS prediction model can learn the information related to the sound quality. Then FastSpeech using AUG can synthesize the speech with better sound quality.
Among the three different types of MOS prediction models, MTL-MOSNet shows the best robustness to the addition strategy, and AUG has the best performance when using the appropriate addition strategy.
4.5 Results of our best model
In this section, we focus on our best model: FastSpeech using AUG as the MOS prediction model and “WS” as the addition strategy. Compared to FastSpeech, our model improves the naturalness from 4.12 to 4.39, the intelligibility from 4.10 to 4.50, and the sound quality from 4.00 to 4.21. During inference, our model works as fast as FastSpeech because the MOS prediction model is unnecessary for inference. Therefore, our model improves the speech quality in terms of all the quality factors with no additional inference time.
We also compare our best model with Transformer, which shows the upper-bound performance. In the case of FastSpeech, performance degradation is 0.52, 0.56, and 0.23 compared to Transformer in terms of naturalness, intelligibility, and sound quality, respectively. In the case of our best model, performance degradation is 0.25, 0.16, and 0.02 in terms of naturalness, intelligibility, and sound quality, respectively. We can see that our best model significantly reduces the performance gap with Transformer. In terms of naturalness, intelligibility, and sound quality, it achieves relative performances of 95%, 97%, and 100% to the upper bound, respectively.
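The gap and relative-performance figures in this comparison follow directly from the MOS values in Table 2, as the short check below shows. The dictionary layout and function names are only for illustration; the numbers are the reported MOS values.

```python
from typing import Dict

# MOS values taken from Table 2.
transformer = {"naturalness": 4.64, "intelligibility": 4.66, "sound quality": 4.23}
fastspeech = {"naturalness": 4.12, "intelligibility": 4.10, "sound quality": 4.00}
best_model = {"naturalness": 4.39, "intelligibility": 4.50, "sound quality": 4.21}


def degradation(upper: Dict[str, float], model: Dict[str, float]) -> Dict[str, float]:
    """MOS gap to the upper bound, rounded to two decimals."""
    return {k: round(upper[k] - model[k], 2) for k in upper}


def relative_performance(upper: Dict[str, float], model: Dict[str, float]) -> Dict[str, int]:
    """MOS as a percentage of the upper bound, rounded to the nearest integer."""
    return {k: round(100 * model[k] / upper[k]) for k in upper}


print(degradation(transformer, fastspeech))
print(degradation(transformer, best_model))
print(relative_performance(transformer, best_model))
```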
We proposed a novel approach for a fast TTS model to improve the speech quality while maintaining the inference speed. We first trained the MOS prediction model and then used the model to calculate the perceptual loss for the TTS model. To combine the perceptual loss with the conventional loss, we proposed two addition strategies. Under the supervision of the perceptual loss, the TTS model was trained to maximize the speech quality directly. We performed various experiments using three MOS prediction models. In terms of naturalness, intelligibility, and sound quality, our best model improved FastSpeech, achieving the MOS of 4.39, 4.50, and 4.21, respectively. For future work, we will further validate our approach by using various TTS models and investigate advanced addition strategies.
-  (2020) An ASR guided speech intelligibility measure for TTS model selection. arXiv preprint arXiv:2006.01463. Cited by: §2, §2.
-  (2019) Deep long audio inpainting. arXiv preprint arXiv:1911.06476. Cited by: §2.
-  (2020) Korean singing voice synthesis based on auto-regressive boundary equilibrium gan. In Proc. of ICASSP, pp. 7234–7238. Cited by: §4.3.
-  (2020) Deep MOS predictor for synthetic speech using cluster-based modeling. arXiv preprint arXiv:2008.03710. Cited by: §4.1.
-  (2020) Neural MOS prediction for synthesized speech using multi-task learning with spoofing detection and spoofing type classification. arXiv preprint arXiv:2007.08267. Cited by: §3.2, §4.1, §4.2, §4.4.
-  (2019) Learning with learned loss function: speech enhancement with Quality-Net to improve perceptual evaluation of speech quality. IEEE Signal Processing Letters 27, pp. 26–30. Cited by: §2.
-  (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576. Cited by: §2.
Espnet-TTS: unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In Proc. of ICASSP, pp. 7654–7658. Cited by: §4.2.
-  (2014) Text-to-speech synthesis. In Quality of experience, pp. 179–193. Cited by: §4.3.
Perceptual losses for real-time style transfer and super-resolution. In Proc. of ECCV, pp. 694–711. Cited by: §2.
MelGAN: generative adversarial networks for conditional waveform synthesis. In Advances in NeurIPS, pp. 14910–14921. Cited by: §2.
-  (2020) Voice conversion using a perceptual criterion. Applied Sciences 10 (8). Cited by: §2.
Neural speech synthesis with Transformer network. In Proc. of the AAAI Conference on Artificial Intelligence, pp. 6706–6713. Cited by: §1, §3.1.
-  (2020) Expressive TTS training with frame and style reconstruction loss. arXiv preprint arXiv:2008.01490. Cited by: §2.
-  (2016) Large-margin softmax loss for convolutional neural networks. In Proc. of ICML, pp. 507–516. Cited by: §3.2.
-  (2019) MOSNet: deep learning based objective assessment for voice conversion. In Proc. of Interspeech, pp. 1541–1545. Cited by: §3.2.
-  (2018) The voice conversion challenge 2018: promoting development of parallel and nonparallel methods. In Proc. of Odyssey The Speaker and Language Recognition Workshop, pp. 195–202. Cited by: §4.1, §4.4.
-  (2020) Four-features evaluation of text to speech systems for three social robots. Electronics 9 (2). Cited by: §4.3, §4.4.
-  (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1.
-  (2018) Parallel WaveNet: fast high-fidelity speech synthesis. In Proc. of ICML, Cited by: §2.
-  (2002) Perceptually optimizing the cost function for unit selection in a TTS system with one single run of MOS evaluation. In Seventh International Conference on Spoken Language Processing, Cited by: §2.
-  (2019) Parallel neural text-to-speech. arXiv preprint arXiv:1905.08459. Cited by: §1.
-  (2019) FastSpeech: fast, robust and controllable text to speech. In Advances in NeurIPS, pp. 3171–3180. Cited by: §1, §3.1, §3.1.
-  (2001) Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proc. of ICASSP, pp. 749–752. Cited by: §2.
-  (2018) Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proc. of ICASSP, pp. 4779–4783. Cited by: §1.
HiFi-GAN: high-fidelity denoising and dereverberation based on speech deep features in adversarial networks. arXiv preprint arXiv:2006.05694. Cited by: §2.
-  (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proc. of ICASSP, pp. 4214–4217. Cited by: §2.
-  (2018) Learned perceptual image enhancement. In Proc. of 2018 IEEE ICCP, pp. 1–13. Cited by: §2, §2.
-  (2020) Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proc. of ICASSP, pp. 6199–6203. Cited by: §4.2.
-  (2018) Perceptually guided speech enhancement using deep neural networks. In Proc. of ICASSP, pp. 5074–5078. Cited by: §2.