In recent years, advanced neural text-to-speech (TTS) models have been widely researched [13, 15, 16, 11, 9]. These models depend heavily on a large speech database, which is not easy to obtain in practice. As personal broadcasting on social networks has grown in popularity, so has the opportunity to obtain speech data from the internet. However, the amount of clean speech in broadcast data is typically limited, and most of it is noisy because of background music (BGM). Consequently, constructing a TTS system that can utilize noisy speech data is becoming increasingly important.
One approach to this issue is to preprocess the speech data before using it to train the TTS model. Previous studies processed noisy speech data using speech enhancement techniques to build noise-robust TTS systems. Speech enhancement research has mainly focused on removing types of noise other than BGM, but the primary removal target for broadcast data is BGM. In this work, we refer to a BGM signal as music noise. Methods to remove music noise, including waveform-based and spectral masking-based methods, have been proposed [14, 3]. However, directly using pre-processed speech together with clean speech for TTS training is problematic because their sound quality differs slightly.
Another approach uses a latent variable that embeds the quality of speech. The global style token (GST) defines a finite number of tokens and outputs a style embedding vector that represents the style of the given reference speech in an unsupervised manner. It has been reported that GST-TTS can be trained using both clean and noisy speech data. GST learns to represent the speech quality condition (clean or noisy) as the reference speech style. During the inference step, the GST-TTS system generates clear synthesized speech by selecting a clean token. However, we found experimentally that GST-TTS does not generate synthesized speech successfully when the total amount of speech data is insufficient. This is probably because the limited number of tokens makes it difficult to express various types of music, and because there is no guarantee that any one token represents clean speech when the amount of clean speech data is limited.
To solve this problem, we propose a GST-TTS learning method that utilizes personal broadcast data. First, the music noise is removed by a music filter inspired by VoiceFilter. Next, GST-TTS with an auxiliary quality classifier (AQC) is trained using clean speech and filtered (music-removed) speech from the broadcast data. GST-TTS can prevent the degradation of synthesized speech caused by using the two types of data together. When the amount of training data is insufficient, the AQC helps the style embedding vector focus on representing the sound quality of the input reference speech rather than other elements of speech such as prosody or speed (we call this vector the quality embedding vector in the remainder of this paper). In the inference step, the GST-TTS generates clear synthesized speech by selecting a clean speech sample as the reference speech.
2 Related Works
A similar method that trains a multi-speaker TTS model using crowd-sourced data has been proposed. Since the speech of a given speaker is usually recorded in the same environment, that work decorrelated speaker identity and noise by using noise-augmented speech data and auxiliary classifiers. In our proposed method, music noise is removed first, and an auxiliary quality classifier is then adopted for the pre-processed speech and clean speech in the TTS training step.
VoiceFilter was proposed to separate a target speaker's speech from multi-speaker speech. We introduce a music filter inspired by it. The problem is similar in that the target speaker's voice is separated from noisy speech. The difference is that personal broadcast data does not require a network or feature vector to identify the target speaker, because there is no interfering audio from other speakers, only music noise.
3 Proposed Method
Fig. 1 depicts the proposed learning method. The trained music filter converts the noisy speech (i.e., speech mixed with BGM) data into filtered speech data, after which the GST-TTS with the AQC is trained using the filtered and clean speech data together.
3.1 Pre-Processing Using the Music Filter
The left side of Fig. 1 shows the music filtering process. In the music-filter training phase, the noisy speech dataset is artificially generated by mixing clean speech with randomly selected BGM at a signal-to-noise ratio (SNR) drawn from a pre-defined range, which can be changed according to the target situation. The magnitude spectrogram of the noisy speech is input to the music-filter network, which predicts a spectral mask. The mask is multiplied element-wise with the magnitude spectrogram of the noisy speech to filter out the music noise. The whole music-filter network is then repeatedly updated to reduce the mean squared error between the magnitude spectrograms of the filtered speech and the original clean speech. In the inference phase, the target speaker's noisy speech dataset for TTS training is pre-processed using the trained music filter, i.e., the music noise is filtered out.
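The masking and training objective described above can be illustrated with a small sketch. It is not the authors' implementation: the toy spectrogram values are hypothetical, magnitudes are summed directly (ignoring phase), and an oracle ratio mask computed from the known clean signal stands in for the mask that the music-filter network would have to predict.

```python
def apply_mask(noisy_mag, mask):
    """Multiply a magnitude spectrogram by a spectral mask, element-wise."""
    return [[m * x for m, x in zip(mask_row, mag_row)]
            for mask_row, mag_row in zip(mask, noisy_mag)]


def mse(a, b):
    """Mean squared error between two equally shaped spectrograms."""
    squared = [(x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)]
    return sum(squared) / len(squared)


# Toy spectrograms: 2 frames x 3 frequency bins (hypothetical values).
clean_mag = [[1.0, 0.5, 0.0], [0.8, 0.2, 0.1]]
music_mag = [[0.0, 0.5, 1.0], [0.2, 0.2, 0.4]]
# Simplification: magnitudes are summed directly, ignoring phase.
noisy_mag = [[c + m for c, m in zip(c_row, m_row)]
             for c_row, m_row in zip(clean_mag, music_mag)]

# Assumption: an oracle ratio mask replaces the network prediction.
oracle_mask = [[c / n if n > 0 else 0.0 for c, n in zip(c_row, n_row)]
               for c_row, n_row in zip(clean_mag, noisy_mag)]

filtered_mag = apply_mask(noisy_mag, oracle_mask)
loss_before = mse(noisy_mag, clean_mag)    # error of the unfiltered input
loss_after = mse(filtered_mag, clean_mag)  # training objective after masking
```

With a perfect mask the objective drops to (numerically) zero; a trained network can only approximate such a mask, which is one way to read the slight blurring of filtered speech reported later.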
3.2 GST-TTS with the AQC
The right side of Fig. 1 depicts the GST-TTS framework with the AQC. We used deep convolutional TTS (DCTTS), which can be trained stably even with a fairly small amount of speech data. For the GST layer, we used the same architecture as in the original GST work.
In the TTS training phase, the text sequence and the corresponding speech in mel-spectrogram form are input to the TTS model (for generality, we refer to the temporally downsampled mel-spectrogram in DCTTS simply as a mel-spectrogram). The text encoder and audio encoder output the text and audio embeddings, respectively, after which an attention matrix is calculated between them. The mel-spectrogram of the input speech is also used as the reference speech to represent the speech quality (clean or filtered). The reference encoder outputs a reference embedding, and the GST layer calculates the weights between it and the multiple tokens via a multi-head attention module (we refer to the resulting convex combination of tokens as the quality embedding). The quality embedding is concatenated with the context embedding (the product of the attention matrix and the text embedding) and is also used as input to the AQC. Finally, the audio decoder estimates the mel-spectrogram from the context, quality, and audio embeddings.
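To make the "convex combination of tokens" concrete, the following is a minimal sketch under simplifying assumptions: a single attention head with scaled dot-product scoring, no learned projections, and hypothetical token values. The function names are illustrative and are not taken from the paper or from any GST implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def quality_embedding(reference_embedding, tokens):
    """Attend from the reference embedding to the tokens (single head).

    Returns the attention weights and the weighted sum of tokens. Because
    the weights are non-negative and sum to one, the result is a convex
    combination of the tokens.
    """
    dim = len(reference_embedding)
    scores = [sum(q * k for q, k in zip(reference_embedding, token)) / math.sqrt(dim)
              for token in tokens]
    weights = softmax(scores)
    embedding = [sum(w * token[i] for w, token in zip(weights, tokens))
                 for i in range(len(tokens[0]))]
    return weights, embedding


# Hypothetical 2-dimensional tokens and reference embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, embedding = quality_embedding([1.0, 0.0], tokens)
```

In this toy case the reference embedding is most similar to the first token, so that token receives the largest weight. The AQC then operates on `embedding`, which is how the classifier's loss can push the tokens toward encoding clean versus filtered quality.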
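To train GST-TTS, the sum of the L1 loss ($\mathcal{L}_{1}$) and the binary divergence loss ($\mathcal{L}_{bin}$) between the estimated and input mel-spectrograms is used:

$$\mathcal{L}_{TTS} = \mathcal{L}_{1} + \mathcal{L}_{bin}.$$

The AQC takes the quality embedding as input and ends with a softmax layer that predicts the speech quality. It is trained using a binary cross-entropy loss to determine whether the reference speech is clean or filtered. The final loss function of the proposed learning method can be expressed as

$$\mathcal{L} = \mathcal{L}_{TTS} + \lambda \mathcal{L}_{AQC},$$

where $\mathcal{L}_{AQC}$ and $\lambda$ are the loss function and loss weight of the AQC, respectively.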
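The way the loss terms combine can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: mel-spectrogram values are assumed to be normalized to [0, 1] and flattened into a list, the binary divergence is written as cross-entropy minus the target's own entropy (so it is zero for a perfect estimate), and all function and parameter names (`aqc_weight`, `prob_clean`, and so on) are hypothetical.

```python
import math

EPS = 1e-7  # clipping constant to keep the logarithms finite


def _bce(p, t):
    """Binary cross-entropy of prediction p against target t in [0, 1]."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def l1_loss(pred, target):
    """Mean absolute error between estimated and input mel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def binary_divergence(pred, target):
    """Cross-entropy minus the target's entropy; zero when pred == target."""
    return sum(_bce(p, t) - _bce(t, t) for p, t in zip(pred, target)) / len(pred)


def aqc_loss(prob_clean, is_clean):
    """Binary cross-entropy of the classifier's 'clean' probability."""
    return _bce(prob_clean, 1.0 if is_clean else 0.0)


def total_loss(pred_mel, target_mel, prob_clean, is_clean, aqc_weight):
    """TTS loss (L1 + binary divergence) plus the weighted AQC loss."""
    tts = l1_loss(pred_mel, target_mel) + binary_divergence(pred_mel, target_mel)
    return tts + aqc_weight * aqc_loss(prob_clean, is_clean)


# Hypothetical flattened mel values for one utterance.
target_mel = [0.2, 0.5, 0.9]
pred_mel = [0.25, 0.4, 0.8]
loss = total_loss(pred_mel, target_mel, prob_clean=0.7, is_clean=True, aqc_weight=0.1)
```

Setting `aqc_weight` to zero recovers plain GST-TTS training; a small positive value lets the classifier shape the quality embedding without dominating the reconstruction terms.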
In the TTS inference phase, the GST-TTS generates the mel-spectrogram from the input text, using clean speech as the reference speech. Then, the spectrogram super-resolution network (SSRN) of DCTTS predicts a spectrogram from the generated mel-spectrogram, and the Griffin-Lim algorithm estimates the phase information. Finally, the time-domain signal is reconstructed from the magnitude and phase of the spectrogram.
4 Experiments with the Music Filter
4.1 Training Setup of the Music Filter
We used the KsponSpeech dataset, which comprises approximately 1,000 h of spontaneous speech recorded from 2,000 speakers, to train the speaker-independent music filter. In addition, we collected 68 license-free music samples often used for personal broadcasting from YouTube Studio. The music samples were mixed with clean speech at an SNR in the range 0–20 dB. Spectrograms were calculated using a short-time Fourier transform with a window length of 64 ms and a frame interval of 16 ms.
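Mixing at a chosen SNR amounts to scaling the music so that the speech-to-music power ratio matches the target. The sketch below shows one common way to do this; the synthetic sine-wave signals, the 16 kHz rate, and the function names are assumptions for illustration and do not come from the paper.

```python
import math
import random


def power(signal):
    """Average power of a sampled signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, music, snr_db):
    """Scale the music so the mixture has the requested speech-to-music SNR."""
    target_music_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_music_power / power(music))
    return [s + gain * m for s, m in zip(speech, music)]


def measured_snr_db(speech, mixture):
    """SNR of a mixture, given the clean speech it was built from."""
    noise = [x - s for x, s in zip(mixture, speech)]
    return 10.0 * math.log10(power(speech) / power(noise))


# Synthetic stand-ins for one second of speech and music at 16 kHz.
rate = 16000
speech = [math.sin(2.0 * math.pi * 220.0 * n / rate) for n in range(rate)]
music = [0.3 * math.sin(2.0 * math.pi * 523.0 * n / rate) for n in range(rate)]

random.seed(0)
snr_db = random.uniform(0.0, 20.0)  # SNR drawn from the 0-20 dB range
mixture = mix_at_snr(speech, music, snr_db)
```

Drawing `snr_db` uniformly is an assumption; the text only states the range.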
We followed the architecture and hyperparameters of the VoiceFilter network, except that batch normalization was applied to all CNN layers and a ReLU activation function was applied to all layers except the last one. The networks were trained using the Adam optimizer.
4.2 Performance Evaluation of the Music Filter
The test set for the music filter comprised 550 speech samples recorded from 11 unseen speakers in a quiet office using mobile devices. For the filtered speech dataset, we measured the perceptual evaluation of speech quality (PESQ) score and the syllable error rate (SER) as a measure of pronunciation accuracy. We used a speech recognizer that had an SER of 3.48% on the clean speech dataset.
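The paper does not spell out how the SER is computed. A common definition, assumed here, is the edit distance between the recognized and reference syllable sequences divided by the number of reference syllables. Since each Hangul character is one syllable block, the sketch below treats every non-space character as a syllable; the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_error_rate(ref_text, hyp_text):
    """SER in percent, counting each non-space character as one syllable."""
    ref = [c for c in ref_text if not c.isspace()]
    hyp = [c for c in hyp_text if not c.isspace()]
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical reference transcript and recognizer output.
ser_substitution = syllable_error_rate("안녕하세요", "안녕하세오")
ser_deletion = syllable_error_rate("안녕하세요", "안녕세요")
```

Both examples contain one error in five reference syllables, so each yields an SER of 20%.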
Table 1 indicates that the filtered speech had a higher PESQ score than the noisy speech at each SNR. Indeed, the filtered speech had a PESQ score above 3 at low SNR, which means that the music noise was removed sufficiently well. The SER results in Table 2 show that the filtered speech had lower SERs than the noisy speech at SNRs of 0–5 dB but not at 10–20 dB. At high SNR, the distortion introduced by the music filter had a greater effect on speech recognizer performance than the relatively weak music noise did.
Fig. 2 depicts the mel-spectrogram of noisy, filtered, and clean speech samples. Comparing the areas in the red box, it can be observed that the filtered speech was slightly blurred compared to the clean speech, but music noise was clearly removed when compared to the noisy speech.
5 Performance Evaluation
Ablation tests were conducted to validate the effect of the proposed method. We compared the following five models:
TTS: the TTS model (DCTTS)
GST: the TTS model with quality embedding
GST+Aux.: GST with the AQC
GST+MF: GST with the music filter
GST+MF+Aux.: GST+MF with the AQC
GST and GST+Aux. were trained using the clean and music-mixed speech dataset while GST+MF and GST+MF+Aux. were trained using the clean and music-filtered speech dataset.
5.1 Training Setup
We trained the models with various ratios of clean to noisy/filtered speech data. We used approximately 5 h of speech recorded from a single Korean female speaker at a 16 kHz sampling frequency, and artificially generated the noisy and filtered speech datasets from the clean speech data as described in the previous section. After excluding 1% of the 5 h speech dataset as the test set, we configured training sets with three different amounts of clean speech, with the remainder being noisy or filtered speech. The text input is a character sequence, and the input speech and reference speech are 80-bin mel-spectrograms computed with a fast Fourier transform size of 1024, a hop size of 256, and a window size of 1024.
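The analysis parameters can be cross-checked with a little arithmetic: at 16 kHz, a window of 1024 samples and a hop of 256 samples correspond to 64 ms and 16 ms. The sketch below works this out; the frame-count formula assumes centered (padded) framing, which is a common convention but is not stated in the paper.

```python
SAMPLE_RATE = 16000  # Hz, as stated for the TTS speech data
N_FFT = 1024         # FFT size in samples
WIN_LENGTH = 1024    # window size in samples
HOP_LENGTH = 256     # hop size in samples
N_MELS = 80          # number of mel bins

window_ms = 1000.0 * WIN_LENGTH / SAMPLE_RATE  # window duration in ms
hop_ms = 1000.0 * HOP_LENGTH / SAMPLE_RATE     # frame interval in ms
linear_bins = N_FFT // 2 + 1                   # bins of the linear spectrogram


def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count under the assumed centered-framing convention."""
    return 1 + num_samples // hop_length


frames_per_second = num_frames(SAMPLE_RATE)
mel_shape = (N_MELS, frames_per_second)  # mel-spectrogram shape for 1 s of audio
```

Under these assumptions, one second of audio becomes a mel-spectrogram of 80 bins by 63 frames, before the temporal downsampling applied in DCTTS.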
For the TTS network, we followed the network architecture and hyperparameters of DCTTS, except that we trained the SSRN as a universal model using a multi-speaker speech corpus recorded by 66 speakers (1 h per speaker), because the amount of clean speech (e.g., 0.5 h) was insufficient to train the SSRN. For the GST network, we used the same architecture (except for the TTS model) and hyperparameters as in the original GST work. The number of style tokens and the number of heads in the multi-head attention were set to 10 and 4, respectively, following prior work. The loss weight of the AQC was set separately for each of the three amounts of clean speech. We set a lower loss weight in one case so that the AQC learned slowly: there, the AQC otherwise converged quickly because the data imbalance was worse than in the other two cases, and the quality embedding then did not affect the synthesized sample.
5.2 Visualization of the Quality Embedding
We visualized the quality embedding using principal component analysis (PCA). The quality embeddings were extracted from models trained with 0.5 h of clean speech. As shown in Fig. 3, only the GST+MF+Aux. model clearly clustered the clean and filtered speech. This indicates that the AQC is crucial for ensuring that the quality embedding represents the quality of the reference speech, because it separates the clean and filtered speech.
5.3 Pronunciation Accuracy
We measured the SER to confirm the intelligibility of the synthesized speech. Unfortunately, GST and GST+Aux., which used clean and noisy speech, could not be trained successfully because the training loss diverged. Accordingly, only TTS, GST+MF, and GST+MF+Aux. were evaluated. As baselines, we measured the SER of TTS trained with 5 h of clean speech and of TTS trained with 5 h of filtered speech. The other results are listed in Table 3. Both of the other models showed a tendency for the SER to decrease as the amount of clean speech data increased. The results for GST+MF were extremely poor in all cases, whereas the SER for GST+MF+Aux. was only slightly higher than that of TTS trained with clean speech, even when the amount of clean speech was limited.
Table 3: SER results by amount of clean / filtered speech (h).
5.4 Subjective Evaluation
We conducted mean opinion score (MOS) tests on speech quality and naturalness. Note that models whose SER exceeded a set threshold were excluded from the MOS test because their synthesized speech was either very noisy or difficult to understand. Sixteen native Korean speakers participated and were asked to rate each sample on a scale from bad to excellent. Fifty synthesized speech samples (10 samples for each model) were played in random order.
Table 4 reports the MOS test results. The larger the amount of clean speech data, the higher the MOS scores for both quality and naturalness. In particular, the proposed GST+MF+Aux. model achieved the highest scores for quality and naturalness, which were close to those of TTS trained with clean speech. Moreover, GST+MF+Aux. outperformed GST+MF even when GST+MF was trained with more clean speech.
Audio samples can be found online: https://nc-ai.github.io/speech/publications/tts-with-bgm-data/
Table 4: MOS test results for speech quality and naturalness with 95% confidence intervals. ‘C’ and ‘F’ denote the amount of clean and filtered speech, respectively. Rows include GST + MF and GST + MF + Aux. (Proposed).
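For readers unfamiliar with how a MOS and its 95% confidence interval are typically derived from listener ratings, the following sketch may help. It assumes a normal approximation (z = 1.96) and uses hypothetical ratings; the paper does not state which interval formula was used.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return the mean opinion score and the half-width of its 95% CI.

    Assumes a normal approximation: half-width = z * s / sqrt(n), where s
    is the sample standard deviation and n the number of ratings.
    """
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Hypothetical ratings collected for one model.
ratings = [4, 4, 5, 3, 4, 5, 4, 4]
mos, ci = mos_with_ci(ratings)
summary = f"{mos:.2f} ± {ci:.2f}"
```

With few ratings, a Student's t quantile would give a slightly wider interval than the normal approximation used here.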
6 Conclusion
We proposed a TTS learning method that can generate clean synthesized speech under the limitations of personal broadcast data. To train the TTS model successfully, the music noise is first removed by a music filter, and GST-TTS with the AQC is then trained using the filtered speech and a small amount of clean speech. The AQC enables the quality embedding to be learned effectively so that it represents the speech quality. Using only a small amount of clean speech data, the model trained with the proposed method generated natural and intelligible speech that was almost comparable to that of the baseline TTS trained with much more speech data.
References
[1] (2019) Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN. In Annual Conference of the International Speech Communication Association (Interspeech), pp. 1821–1825.
[2] KsponSpeech: Korean Spontaneous Speech Corpus for Automatic Speech Recognition. Applied Sciences, Vol. 10.
[3] Monoaural audio source separation using deep convolutional neural networks. In 13th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA).
[4] (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32, pp. 236–243.
[5] (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, pp. 417.
[6] (2019) Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5901–5905.
[7] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In The 32nd International Conference on Machine Learning (ICML), pp. 448–456.
[8] (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[9] Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6706–6713.
[10] (2010) Rectified linear units improve restricted Boltzmann machines. In The 27th International Conference on Machine Learning (ICML), pp. 807–814.
[11] (2018) Deep Voice 3: scaling text-to-speech with convolutional sequence learning. In International Conference on Learning Representations (ICLR).
[12] (2001) Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Rec. ITU-T P.862.
[13] (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783.
[14] Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. In 19th International Society for Music Information Retrieval Conference (ISMIR), pp. 334–340.
[15] (2018) Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4784–4788.
[16] (2018) VoiceLoop: voice fitting and synthesis via a phonological loop. In International Conference on Learning Representations (ICLR).
[17] Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks. In Annual Conference of the International Speech Communication Association (Interspeech), pp. 342–356.
[18] (2019) VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking. In Annual Conference of the International Speech Communication Association (Interspeech), pp. 2728–2732.
[19] (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proc. of the 35th International Conference on Machine Learning, Vol. 80, pp. 5180–5189.
[20] YouTube Studio. https://studio.youtube.com