Data Processing for Optimizing Naturalness of Vietnamese Text-to-speech System

04/20/2020 ∙ by Viet Lam Phung, et al. ∙ 0

Abstract End-to-end text-to-speech (TTS) systems have proved highly successful when a large amount of high-quality training data, recorded in an anechoic room with a high-quality microphone, is available. Another approach is to use available sources of found data such as radio broadcast news. We aim to optimize the naturalness of a TTS system trained on found data using a novel data processing method. The method includes 1) utterance selection and 2) prosodic punctuation insertion to prepare training data that can optimize the naturalness of TTS systems. We showed that, using this data processing method, an end-to-end TTS system achieved a mean opinion score (MOS) of 4.1 compared to 4.3 for natural speech. We showed that the punctuation insertion contributed the most to the result. To facilitate the research and development of TTS systems, we distributed the processed data of one speaker at




1 Introduction

Text-to-speech (TTS) systems play an important role in widely accepted interactive systems such as Siri, Microsoft Cortana, and Amazon’s Alexa. However, collecting data to build those systems is costly. Typically, a professional voice talent is recruited to read dozens of hours of text with good coverage of the target domain in an anechoic room with a high-quality microphone. The speaker should maintain constant fundamental frequency (F0), energy, speaking rate, and articulation throughout. There are 7000 languages in the world, and most do not receive as much research attention as English, Spanish, Mandarin, and Japanese. So-called low-resource languages such as Vietnamese have no carefully recorded and annotated corpora that can be used for TTS systems. The available speech corpora such as VOV (radio broadcast news) [1], VNSpeechCorpus [2], and VIVOS [3] are small and dedicated to automatic speech recognition (ASR). VAIS-1000 [4], the most recent database for TTS, consists of only 1000 sentences from a single speaker. Another approach is to use various sources of found data (e.g. radio broadcast news, ASR corpora, and audiobooks) as TTS corpora [5]. This approach became the main challenge in the TTS evaluation of the Vietnamese Language and Speech Processing (VLSP) 2019 campaign [6]. In this paper, we propose a data processing scheme to integrate the VLSP evaluation’s broadcast news corpora into our end-to-end Vietnamese TTS system, which consists of Tacotron 2 [7] and the WaveGlow vocoder [8]. Our data processing scheme consists of two key elements: 1) utterance selection using different metrics, and 2) prosodic punctuation insertion into text. In our experiments, we significantly improved the naturalness of our TTS system by applying our data processing method to the training data. In the TTS evaluation of VLSP 2019, our system achieved a MOS of 4.1 (compared to 4.3 for natural speech), which was the best result among all participants [9].

2 Background

Researchers have attempted to build high-quality Vietnamese TTS systems over the last two decades. A text normalization method utilizing regular expressions and a language model was investigated [10]. Prosodic features such as phrase breaks proved their efficacy in improving the naturalness of Vietnamese TTS systems [11, 12]. Different types of acoustic models were investigated, such as the hidden Markov model (HMM) [13, 14] and the deep neural network (DNN) [15]. These HMM- and DNN-based TTS systems are limited by the oversmoothing of generated parameters [16]. A post-filtering method based on non-negative matrix factorization was proposed to compensate for the oversmoothing effect [17]. Recently, the use of sequence-to-sequence models in acoustic modeling [7] in combination with neural vocoders such as WaveGlow [8] has enormously reduced the oversmoothing effect, thus achieving human-quality TTS [18, 9].

We need dozens of hours of training data from a single speaker to build a high-quality TTS system. One solution is to use available sources of found data. Three types of found data (radio broadcast news, ASR corpora, and audiobooks) were compared, showing that radio broadcast news is a good match as a TTS corpus [5]. Different criteria such as standard deviation of fundamental frequency (F0), speaking rate, and hypo- and hyper-articulation were explored in utterance selection for HMM-based TTS [5] and DNN-based TTS [19]. We are not aware of any attempts at utterance selection for end-to-end TTS. In this paper, we explore different metrics for utterance selection: misalignment errors, articulation, standard deviation of syllable duration, non-fluency, and standard deviation of F0.

Traditionally, an end-to-end TTS system receives a sequence of syllables or words as input; thus, no explicit prosodic features are incorporated in the input. Prosodic features such as phrase breaks are important for the naturalness of Vietnamese TTS systems [11]. In this paper, we insert prosodic punctuations, which correspond to pauses in the utterance, into the text. We found that the inserted prosodic punctuations led to stable and faster convergence when training Tacotron 2. Moreover, using prosodic punctuation derived from the utterances is a novel way to help the Tacotron 2 model learn a speaker-dependent prosodic pattern.

3 Data

We used the so-called “big training dataset” provided by the TTS evaluation of VLSP 2019 [6]. It contains 15000 utterances of a single speaker (approximately 23 hours) with the corresponding text, which covers broadcast news. The speaker was recorded at home instead of in an anechoic room and was instructed to stay in a place as quiet as possible during each recording session. As a result, the data exhibit three types of errors: 1) variation in channel conditions, 2) mismatch between text and utterance content, and 3) variation in articulation. The variation in channel conditions is caused by microphone conditions, recording environments, channel noise, and so on; consequently, some utterances have mild background noise. The mismatch between text and utterance content is due to misspellings, tricky text, and so on. When the text was meaningless and hard to read, the speaker often gave up and said random things. The variation in articulation includes hyper-articulation and inconsistency in speaking rate and F0. We downsampled the speech data to 16 kHz.

4 Proposed Method

In this section, we present our data processing scheme, shown in Figure 1. Given the raw utterances and the corresponding raw text, we applied noise reduction to the audio files using a minimum mean-squared-error estimator [20]. The text was normalized and tokenized [10]. The denoised audio and the corresponding normalized text were aligned using an audio alignment system. Using the time-stamps obtained from the alignment, we can 1) calculate different metrics for utterance selection and 2) identify prosodic punctuation and insert it into the text. We selected the utterances that satisfied experimentally determined thresholds on the metrics. Each sentence was split into phrases by the prosodic punctuation, reflecting the prosodic pattern of the speaker.

Figure 1: Data Processing Scheme

4.1 Audio Alignment System

Vietnamese is a monosyllabic language. We developed an audio alignment system [21, 22, 23, 24, 25, 26, 27] for Vietnamese to identify the time-stamps of syllables in the audio. We first used a voice activity detection module based on an adaptive context attention model [28] to segment the audio into speech and non-speech segments. This differs from [19], where an HMM-based voice activity detection was used. A Time Delay Neural Network (TDNN)-based ASR system [29] was deliberately overfitted to the speech segments and the corresponding normalized text in our data. We also biased a language model to the normalized text [23]. Each speech segment was decoded using the TDNN-based acoustic model and the biased language model. The resulting time-aligned transcription was aligned with the original normalized text, associating the obtained time-stamps with the original text. In addition, matching sequences of syllables (also called anchor points [22]) between the decoding output and the original text can be a good indicator that the speaker read the sentence correctly. Moreover, we can use the time-stamps of pauses and silences to identify prosodic punctuation.
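The anchor-point idea can be illustrated with a small sketch. It assumes that both the decoding output and the normalized text are available as lists of syllables, and it uses Python's standard `difflib.SequenceMatcher` as a stand-in for the alignment step; the function names and the `min_len` parameter are hypothetical and are not taken from our system.

```python
from difflib import SequenceMatcher


def find_anchor_points(reference, decoded, min_len=2):
    """Return (ref_start, dec_start, length) for each matching run of syllables.

    `min_len` is an illustrative parameter: runs shorter than this are ignored.
    """
    matcher = SequenceMatcher(a=reference, b=decoded, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


def matched_ratio(reference, decoded, min_len=2):
    """Fraction of reference syllables covered by anchor points."""
    if not reference:
        return 0.0
    anchors = find_anchor_points(reference, decoded, min_len)
    return sum(size for _, _, size in anchors) / len(reference)


# Toy example (syllables written without diacritics for simplicity).
reference = ["hom", "nay", "troi", "dep", "qua"]
decoded = ["hom", "nay", "troi", "mua", "qua"]
anchors = find_anchor_points(reference, decoded)
ratio = matched_ratio(reference, decoded)
```

A high matched ratio suggests that the speaker read the sentence as written, while a low one flags the utterance for closer inspection.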

4.2 Utterance Selection Metrics

We introduce different metrics addressing the three types of errors in our data:

  • Word-error rate (WER) of the decoding output, compared to the original text, addresses the mismatch between text and utterance content. Every utterance with a WER less than 90% was removed from our data; thus, we removed 800 utterances.

  • Articulation [5] is used to address the variation in articulation or abnormal articulation. It is calculated as in Equation 1 from the power of the speech segments extracted by the voice activity detection module and the average syllable duration (avg.syl.dur), which is calculated from the aligned time-stamps (obtained in Section 4.1). A speech segment with high articulation is hyper-articulated. Hyper-articulated speech is unnatural because it has a slow speaking rate and high energy [30].

  • Standard deviation of syllable duration (std.syl.dur) is used to address the inconsistency of speaking rate. The duration of each syllable is calculated according to the aligned time-stamps. Given a speech segment, a high value of std.syl.dur indicates that the narrator spoke sometimes fast and sometimes slow within the segment. Speech segments with high inconsistency of speaking rate are unnatural.

  • Non-fluency is used to address reading non-fluency and variation in articulation. Moreover, the alignment procedure in Section 4.1 can produce misalignment errors, which the non-fluency metric can also capture. Non-fluency is calculated as in Equation 2 from the maximum duration of internal silence and the average syllable duration, both obtained from the aligned time-stamps. An internal silence is a silence or pause other than those at the start and end of the utterance. A high value of non-fluency indicates a long pause within an utterance, reflecting non-fluency.

  • Standard deviation of F0 (std.F0) is used to address inconsistency of F0. A high value of std.F0 can be due to more expressive speech. We removed utterances with high values of std.F0.
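As a reference point for the WER metric in the list above, the following is a minimal sketch of the standard edit-distance definition, computed over syllable lists. The function name is hypothetical, and how the resulting value is thresholded is left to the caller.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance between two token lists, divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # previous[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_tok != hyp_tok)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


# One substitution and one deletion against a four-token reference.
example_wer = word_error_rate("a b c d".split(), "a x c".split())
```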

For each metric, we rejected the 5% of data corresponding to the segments with the worst values of the metric.
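The duration-based metrics and the rejection rule can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the alignment is assumed to be a list of `(label, start, end)` tuples with silences labelled `"sil"`, non-fluency is assumed to be the ratio of the longest internal silence to the average syllable duration (one form consistent with the description of Equation 2), and "worst" is taken to mean the highest value of a metric.

```python
import statistics

SIL = "sil"  # hypothetical label for silence/pause tokens in the alignment


def syllable_durations(alignment):
    """Durations (seconds) of all non-silence tokens."""
    return [end - start for label, start, end in alignment if label != SIL]


def std_syl_dur(alignment):
    """Standard deviation of syllable duration within one segment."""
    return statistics.pstdev(syllable_durations(alignment))


def non_fluency(alignment):
    """ASSUMED form: longest internal silence / average syllable duration."""
    speech = [i for i, (label, _, _) in enumerate(alignment) if label != SIL]
    if not speech:
        return 0.0
    first, last = speech[0], speech[-1]
    # Internal silences exclude the leading and trailing ones.
    internal = [
        end - start
        for label, start, end in alignment[first:last + 1]
        if label == SIL
    ]
    longest = max(internal, default=0.0)
    return longest / statistics.mean(syllable_durations(alignment))


def reject_worst(scores, fraction=0.05):
    """Keep utterance ids after dropping the `fraction` with the highest score."""
    ranked = sorted(scores, key=scores.get)  # ascending by metric value
    n_reject = int(len(ranked) * fraction)
    return set(ranked[:len(ranked) - n_reject])
```

Applying `reject_worst` once per metric and intersecting the kept sets gives one possible way to combine the metrics.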

4.3 Prosodic Punctuation Insertion

We detected four types of prosodic punctuation from speech based on the duration of internal silences, which can be calculated from the aligned time-stamps. We represent each type of prosodic punctuation with a special character and insert these characters into the text at the positions of the corresponding silences. By experiment, we determined four ranges of silence duration to identify the prosodic punctuations; the longest range is open-ended, covering every silence above a fixed duration in seconds. The prosodic punctuations are also used to mark the bad pauses caused by non-fluency. Thus, we can prevent the models from aligning the bad-pause frames to any syllable.
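The insertion step can be sketched as below. The duration boundaries and the special tokens are purely hypothetical placeholders chosen for illustration, not the values we determined experimentally, and the alignment format (`(label, start, end)` tuples with `"sil"` for silences) is likewise an assumption.

```python
from bisect import bisect_right

SIL = "sil"  # hypothetical label for silence/pause tokens

# HYPOTHETICAL lower bounds (seconds) of the four silence ranges and their marks.
BOUNDARIES = [0.1, 0.3, 0.6, 1.0]
MARKS = [None, "<p1>", "<p2>", "<p3>", "<p4>"]  # index 0: too short to mark


def insert_prosodic_punctuation(alignment):
    """Return the text with a mark at the position of each internal silence."""
    speech = [i for i, (label, _, _) in enumerate(alignment) if label != SIL]
    if not speech:
        return ""
    first, last = speech[0], speech[-1]
    tokens = []
    # Leading and trailing silences are skipped; only internal ones are marked.
    for label, start, end in alignment[first:last + 1]:
        if label == SIL:
            mark = MARKS[bisect_right(BOUNDARIES, end - start)]
            if mark is not None:
                tokens.append(mark)
        else:
            tokens.append(label)
    return " ".join(tokens)


alignment = [
    ("sil", 0.0, 0.5), ("xin", 0.5, 1.0), ("sil", 1.0, 1.5),
    ("chao", 1.5, 2.0), ("sil", 2.0, 2.05), ("ban", 2.05, 3.0),
    ("sil", 3.0, 4.2), ("nhe", 4.2, 4.5), ("sil", 4.5, 5.0),
]
marked_text = insert_prosodic_punctuation(alignment)
```

Because the marks are ordinary input tokens, the acoustic model can learn to align pause frames to them rather than to neighbouring syllables.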

5 Vietnamese TTS system

Our end-to-end TTS system has two components: 1) an encoder-decoder acoustic model and 2) a neural vocoder. The encoder-decoder acoustic model converts a sequence of syllables with prosodic punctuations into an 80-dimensional Mel-spectrogram. The neural vocoder generates speech from the 80-dimensional Mel-spectrogram. In the normalized text, we treat tokens and inserted prosodic punctuations as syllables. In this paper, we used Tacotron 2 [7] for acoustic modeling and the WaveGlow vocoder [8]. As a result, our end-to-end system can achieve real-time inference speed.

5.1 Encoder-decoder acoustic model

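Generally, the acoustic model in a neural TTS system has an attention-based encoder-decoder structure. Given encoder outputs $h = (h_1, \dots, h_N)$ as memory entries and the previous decoder hidden state $s_{i-1}$, for each decoder output step $i$, an energy value $e_{i,j}$ is calculated for each $h_j$ by a trainable attention mechanism [31, 7] as in Equation 3. The energy is normalized to obtain the alignment vector $\alpha_i$ as in Equation 4; then, we produce the context vector $c_i$ from the alignment vector as in Equation 5.

$$e_{i,j} = \mathrm{Attention}(s_{i-1}, h_j) \quad (3)$$

$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{N} \exp(e_{i,k})} \quad (4)$$

$$c_i = \sum_{j=1}^{N} \alpha_{i,j} \, h_j \quad (5)$$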
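The normalization and context-vector steps can be illustrated numerically with a short sketch. It assumes a plain dot product as a stand-in for the trainable energy function, which in a real system is learned; the function names are hypothetical.

```python
import math


def softmax(energies):
    """Normalize energies into an alignment vector that sums to 1."""
    peak = max(energies)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]


def attend(memory, query):
    """Return (alignment, context) for one decoder step.

    `memory` is a list of encoder output vectors and `query` is the previous
    decoder hidden state. The dot-product energy is an ASSUMED stand-in for
    the trainable attention mechanism.
    """
    energies = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in memory]
    alignment = softmax(energies)
    dim = len(memory[0])
    context = [
        sum(a * h[d] for a, h in zip(alignment, memory)) for d in range(dim)
    ]
    return alignment, context


# With a zero query every energy is 0, so the alignment is uniform.
alignment, context = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```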


5.2 WaveGlow vocoder

In this paper, we used WaveGlow [8] for neural vocoding. WaveGlow is a deep generative model for waveform generation that combines Glow [32], a generative model for image processing, with WaveNet [33]. During training, a speech waveform $y$ is converted to Gaussian white noise $z$. Conversely, during inference, Gaussian white noise is converted to a speech waveform by the inverse operation. With the invertible 1×1 convolution and affine coupling layers, the loss function of the WaveGlow vocoder is the negative log-likelihood, calculated as in Equation 6.

$$L(\theta) = \frac{z^{\top} z}{2\sigma^{2}} - \sum_{j} \log s_j(y, f) - \sum_{k} \log \left| \det W_k \right| \quad (6)$$

Here $\theta$ denotes the network parameters and $f$ is the conditional acoustic features. $s_j$, $W_k$, and $\sigma^2$ are the output coefficients of the $j$-th WaveNet in the affine coupling layers, the $k$-th weighting matrix of the invertible 1×1 convolution layer, and the assumed variance of the Gaussian distribution, respectively.
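The reason an affine coupling layer can be run in both directions is easiest to see on a toy example. The sketch below is not WaveGlow itself: `toy_wn` is a hypothetical stand-in for the WaveNet-like network, and plain Python lists replace tensors. It only shows that the transformed half can be recovered exactly because the scale and shift depend on the untouched half.

```python
import math


def toy_wn(xa, cond):
    """Hypothetical stand-in for the WaveNet-like network: returns (log_s, t)."""
    log_s = [0.1 * a + 0.05 * c for a, c in zip(xa, cond)]
    t = [0.5 * a - 0.2 * c for a, c in zip(xa, cond)]
    return log_s, t


def coupling_forward(x, cond):
    """Training direction: returns the transformed vector and sum of log-scales."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = toy_wn(xa, cond)
    yb = [math.exp(ls) * b + ti for ls, b, ti in zip(log_s, xb, t)]
    return xa + yb, sum(log_s)


def coupling_inverse(y, cond):
    """Inference direction: recompute (log_s, t) from the unchanged half."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, t = toy_wn(ya, cond)
    xb = [(b - ti) / math.exp(ls) for ls, b, ti in zip(log_s, yb, t)]
    return ya + xb


x = [0.3, -1.2, 0.8, 2.0]
cond = [0.5, -0.5]  # one conditioning value per element of the first half
y, log_det = coupling_forward(x, cond)
x_back = coupling_inverse(y, cond)
```

The sum of log-scales returned by the forward pass is the layer's contribution to the log-determinant term of the likelihood.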

Figure 2: Multidimensional scaling (MDS) results. Closer to NAT is better.

6 Experiment

In this section, we trained an end-to-end TTS system on the provided original data (the “big training dataset”) as our Baseline system. We then evaluated the efficacy of our data processing scheme by comparing systems trained on the processed data to the Baseline system. US denotes that the original data was processed by utterance selection alone. Punc denotes that only prosodic punctuation insertion was applied to the original data. Punc&US denotes that both utterance selection and prosodic punctuation insertion were used. In total, we trained four end-to-end TTS systems. We held out 32 sentences for testing. For each data processing case, we used 90% of the remaining sentences for training and the other 10% for validation. Only the training and validation text has prosodic punctuations; the test text does not. Our training configurations were the same as in [7, 8]. We submitted the Punc&US system to the VLSP 2019 evaluation. NAT denotes the target natural speech.

| A \ B | Punc | US | Punc&US | NAT |
|---|---|---|---|---|
| Baseline | 0.81* | 0.31* | 1.03* | 1.16* |
| Punc | | 0.72* | 0.28* | 0.75* |
| US | | | 0.56* | 0.97* |
| Punc&US | | | | 0.53* |

Table 1: Comparative MOS results. Positive values indicate A is better than B. Results marked with an asterisk are significantly different (p < 0.05) from 0 (representing no preference) in a one-sample t-test.

6.1 VLSP 2019’s Evaluation

A MOS test was conducted to compare our submitted system with those of the other participants in the evaluation. More than 30 groups from academia and industry participated. The organizer of VLSP issued 20 test sentences. We did not insert prosodic punctuations into the test sentences because we had no audio files from which to detect the punctuation. There were 24 testers. At each trial, a listener was asked to rate the quality of an utterance on a 5-point scale: “excellent” (5), “good” (4), “fair” (3), “poor” (2), “bad” (1). Our system achieved a MOS of 4.1 (compared to 4.3 for the target NAT) [9]. The result suggests that our data processing method is effective in optimizing the naturalness of an end-to-end TTS system. Moreover, this MOS is the best result among the 30 participating groups in the evaluation.

6.2 Comparative Evaluation

We conducted a comparative MOS (CMOS) test to explore the contribution of the proposed method to the naturalness of end-to-end TTS systems. At each trial, participants listened to samples A and B in sequence and were then asked: “Is A more natural than B?” Responses were selected from a 5-point scale that consisted of “definitely better” (+2), “better” (+1), “same” (0), “worse” (−1), and “definitely worse” (−2). The test involved 32 sentences and 10 system pairs, resulting in 32 × 10 = 320 unique trials. We limited each listener to hearing each unique sentence once (presentation order was randomized); therefore, we need 320 / 32 = 10 listeners to cover all trials. We recruited 20 participants who are native Vietnamese speakers. Table 1 shows the pair-wise relative quality of the systems.
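The trial count and the significance statistic can be checked with a short sketch. The five conditions (four systems plus NAT) give ten unordered pairs; the helper names are hypothetical, and the t statistic below is the textbook one-sample form, shown without the p-value lookup.

```python
import math
import statistics
from itertools import combinations

SYSTEMS = ["Baseline", "Punc", "US", "Punc&US", "NAT"]


def build_trials(systems, n_sentences):
    """One trial per (system pair, sentence) combination."""
    pairs = list(combinations(systems, 2))
    return [(a, b, s) for (a, b) in pairs for s in range(n_sentences)]


def one_sample_t(scores, mu=0.0):
    """t statistic for testing whether the mean of `scores` differs from `mu`."""
    n = len(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return (statistics.mean(scores) - mu) / standard_error


trials = build_trials(SYSTEMS, 32)
n_pairs = len(list(combinations(SYSTEMS, 2)))
# Each listener hears each of the 32 sentences once.
listeners_needed = len(trials) // 32
```

With 20 participants, each trial can be covered by two listeners.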

To approximate the ordering of all systems, we projected the non-negative pair-wise relative quality matrix onto a single dimension using multidimensional scaling (MDS). Figure 2 shows the results. All data processing methods improved the quality of the TTS system. The results suggest that our proposed method is effective in optimizing the naturalness of an end-to-end TTS system. By using our full data processing method (Punc&US), which includes both utterance selection and punctuation insertion, we achieved quality close to natural speech (NAT). Interestingly, prosodic punctuation insertion (Punc) is more effective than utterance selection (US) in improving the quality of the TTS system.
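A one-dimensional projection of this kind can be sketched with classical (Torgerson) MDS. Which MDS variant produced Figure 2 is not stated here, so the choice of classical MDS, the power-iteration eigen-solver, and the helper names are all assumptions; the sketch also assumes the dominant eigenvalue of the centered matrix is positive. The pair-wise values are those of Table 1, treated as symmetric distances.

```python
import math

SYSTEMS = ["Baseline", "Punc", "US", "Punc&US", "NAT"]
# Upper triangle of Table 1, indexed by position in SYSTEMS.
UPPER = {
    (0, 1): 0.81, (0, 2): 0.31, (0, 3): 1.03, (0, 4): 1.16,
    (1, 2): 0.72, (1, 3): 0.28, (1, 4): 0.75,
    (2, 3): 0.56, (2, 4): 0.97,
    (3, 4): 0.53,
}


def symmetric_matrix(upper, n):
    """Expand an upper-triangle dict into a full symmetric distance matrix."""
    dist = [[0.0] * n for _ in range(n)]
    for (i, j), value in upper.items():
        dist[i][j] = dist[j][i] = value
    return dist


def classical_mds_1d(dist, iterations=200):
    """Project a symmetric distance matrix onto one dimension."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering of the squared distances: B = -1/2 * J * D^2 * J.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]
    # Power iteration for the dominant eigenvector, from a zero-mean start.
    v = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    return [math.sqrt(eigenvalue) * x for x in v]


coords = classical_mds_1d(symmetric_matrix(UPPER, len(SYSTEMS)))
positions = dict(zip(SYSTEMS, coords))
```

The sign of the axis is arbitrary, so only the relative positions of the systems along it are meaningful.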

7 Conclusion

In this paper, we proposed a data processing technique that includes utterance selection and prosodic punctuation insertion. We showed that the data processing method can improve the quality of an end-to-end TTS system trained on found data. In the VLSP 2019 evaluation, our system trained on the processed data achieved a MOS of 4.1, which was the best MOS among the participants in the evaluation. Our CMOS test showed that punctuation insertion contributed more to the result than utterance selection. All of the processing methods can improve the quality of an end-to-end TTS system trained on found data. In future work, we will predict the prosodic punctuation for test sentences from text. We distributed the processed data of one speaker at


  • [1] L. C. Mai and N. D. Dung, “Design of vietnamese speech corpus and current status,” in Proceedings of ISCSLP, San Diego, Kent Ridge, Singapore, 2006.
  • [2] V.-B. Le, D.-D. Tran, E. Castelli, L. Besacier, and J.-F. Serignat, “Spoken and written language resources for vietnamese,” in Proceedings of LREC, Lisbon, Portugal, 2004.
  • [3] H.-T. Luong and H.-Q. Vu, “A non-expert kaldi recipe for vietnamese speech recognition system,” in Proceedings WLSI-3 & OIAF4HLT-2, Osaka, Japan, 2016.
  • [4] Q. T. Do and L. C. Mai, “Vais-1000: a vietnamese speech synthesis corpus,” in IEEE Dataport, 2017.
  • [5] E. Cooper, “Text-to-speech synthesis using found data for low-resource languages,” Ph.D. dissertation, Columbia University, 2019.
  • [6] “Vietnamese language and speech processing 2019,”
  • [7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions,” CoRR, vol. abs/1712.05884, 2017. [Online]. Available:
  • [8] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 3617–3621.
  • [9] P. V. Lam, P. H. Kinh, D. A. Tuan, T. K. Duy, and N. Q. Bao, “Development of zalo vietnamese text-to-speech for vlsp 2019,”, accessed: Oct, 2019.
  • [10] D. A. Tuan, P. T. Lam, and P. D. Hung, “A study of text normalization in vietnamese for text-to-speech system,” in Proceedings of Oriental COCOSDA Conference, Macau, China, 2012.
  • [11] A. T. Dinh, T. S. Phan, T. T. Vu, and C. M. Luong, “Vietnamese hmm-based speech synthesis with prosody information,” in Eighth ISCA Workshop on Speech Synthesis, Barcelona, Spain, 2013.
  • [12] T. T. T. Nguyen, “Hmm-based vietnamese text-to-speech : Prosodic phrasing modeling, corpus design system design, and evaluation,” Ph.D. dissertation, Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur (UPR 3251), 2015.
  • [13] T. S. Phan, A. T. Dinh, T. T. Vu, and C. M. Luong, “An improvement of prosodic characteristics in vietnamese text to speech system,” in Knowledge and Systems Engineering, V. N. Huynh, T. Denoeux, D. H. Tran, A. C. Le, and S. B. Pham, Eds.   Cham: Springer International Publishing, 2014, pp. 99–111.
  • [14] D. K. Ninh, “A speaker-adaptive hmm-based vietnamese text-to-speech system,” in 2019 11th International Conference on Knowledge and Systems Engineering (KSE), Oct 2019, pp. 1–5.
  • [15] T. V. Nguyen, B. Q. Nguyen, K. H. Phan, and H. V. Do, “Development of vietnamese speech synthesis system using deep neural networks,” in Journal of Computer Science and Cybernetics, vol. 34, no. 4, 2019, pp. 349–363.
  • [16] T. Toda, A. W. Black, and K. Tokuda, “Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter,” in ICASSP, 2005.
  • [17] A.-T. Dinh, T.-S. Phan, and M. Akagi, “Quality improvement of vietnamese hmm-based speech synthesis system based on decomposition of naturalness and intelligibility using non-negative matrix factorization,” in International Conference on Advances in Information and Communication Technology, 2016, pp. 490–499.
  • [18] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, “Close to human quality TTS with transformer,” CoRR, vol. abs/1809.08895, 2018. [Online]. Available:
  • [19] F. . Kuo, S. Aryal, G. Degottex, S. Kang, P. Lanchantin, and I. Ouyang, “Data selection for improving naturalness of tts voices trained on small found corpuses,” in 2018 IEEE Spoken Language Technology Workshop (SLT), Dec 2018, pp. 319–324.
  • [20] D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, “A minimum-mean-square-error noise reduction algorithm on mel-frequency cepstra for robust speech recognition,” in ICASSP, 2008, pp. 4041–4044.
  • [21] J. Robert-Ribes and R. Mukhtar, “Automatic generation of hyperlinks between audio and transcript,” in Eurospeech, 1997.
  • [22] P. J. Moreno, C. Joerg, J.-M. V. Thong, and O. Glickman, “A recursive algorithm for the forced alignment of very long audio segments,” in International Conference on Spoken Language Processing, 8, Ed., 1998.
  • [23] L. Lamel, J. Gauvain, and G. Adda, “Lightly supervised and unsupervised acoustic model training,” Computer Speech and Language, vol. 16, pp. 115–129, 2002.
  • [24] A. Venkataraman, A. Stolcke, W. Wang, D. Vergyri, V. Gadde, and J. Zheng, “An efficient repair procedure for quick transcriptions,” in ICSLP, 2004.
  • [25] H. Chan and P. Woodland, “Improving broadcast news transcription by lightly supervised discriminative training,” in ICASSP, 2004, pp. 737–740.
  • [26] O. Boeffard, L. Charonnat, S. L. Maguer, D. Lolive, and G. Vidal, “Towards fully automatic annotation of audiobooks for tts,” in International Conference on Language Resources and Evaluation, 2012.
  • [27] P. Lanchantin, P. Karanasou, M. J. F. Gales, X. Liu, L. Wang, Y. Qian, and C. Zhang, “The development of the cambridge university alignment systems for the multi-genre broadcast challenge,” in ASRU, 2015.
  • [28] J. Taekim and M. Hahn, “Voice activity detection using an adaptive context attention model,” IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181–1185, 2018.
  • [29] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [30] J. Hirschberg, D. Litman, and M. Swerts, “Prosodic and other cues to speech recognition failures,” Speech Communication, vol. 43, pp. 155–175, 2004.
  • [31] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” CoRR, vol. abs/1703.10135, 2017. [Online]. Available:
  • [32] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” NeurIPS, pp. 10215–10224, 2018.
  • [33] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: