Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise

04/28/2020 ∙ by Shan Yang, et al.

Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance, but a studio-quality corpus with manual transcription is necessary to train such seq2seq systems. In this paper, we propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data, where the training speech contains noisy interferences (acoustic noise) and the texts are imperfect speech recognition transcripts (textual noise). To deal with text-side noise, we propose a VQVAE-based heuristic method to compensate for erroneous linguistic features with phonetic information learned directly from speech. As for the speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which does not need an extra speech enhancement model. Experiments show the effectiveness of the proposed approach in dealing with text-side and speech-side noise. Surpassing the denoising approach based on a state-of-the-art speech enhancement model, our system built on noisy found data can synthesize clean and high-quality speech with a MOS close to that of the system built on the clean counterpart.




I Introduction

Recently, text-to-speech (TTS) has advanced significantly with the wide use of deep neural networks (DNNs). With the success of the attention-based sequence-to-sequence (seq2seq) approach in machine translation [1, 14], DNN-based speech synthesis has evolved into an end-to-end (E2E) framework, which unifies acoustic and duration modeling in a compact seq2seq paradigm, discarding frame-wise linguistic-acoustic mapping [19, 12, 15, 10, 20]. To achieve the best performance from a seq2seq system, studio-quality speech recordings with manual transcripts are necessary. Leveraging the huge amount of speech resources available in the public domain, so-called found data, has drawn much interest lately. However, it is challenging to build a TTS system on low-quality found data because 1) speech may be contaminated by channel and environmental noise (acoustic noise) and 2) transcripts generated by an automatic speech recognizer (ASR) inevitably contain errors (textual noise).

There are several recent studies addressing acoustic noise for seq2seq-based E2E TTS. A straightforward idea is to use de-noised audio from speech enhancement to build the acoustic model [16]. But the inevitable distortion of the training speech propagates to the synthesized speech, resulting in clear quality deterioration. This conclusion has been further confirmed by an unsupervised source separation approach [8], where a multi-node variational auto-encoder (VAE) was introduced to remove background music from the found speech for speech synthesis; the unstable separation directly affects the TTS quality. Another solution is to disentangle the speaker and noise attributes directly in the speech synthesis model. The approach in [9] first encodes the reference audio to disentangle speaker and noise, where adversarial factorization is used to encourage such disentanglement, and then injects the encodings into an acoustic decoder to produce clean speech. This approach can generate clean speech with the help of a clean reference audio, but its domain adversarial training relies on a strong assumption: the audio of one speaker has a fixed type of acoustic noise, which greatly limits its application. In more practical situations, recording conditions may vary and the collected data for the target speaker may come from different sources with different noise interferences.

We find only one recent study investigating textual noise. In [7], the robustness of E2E systems to textual noise was studied by manually corrupting text and by using erroneous ASR transcripts. The results suggest that E2E systems are only partially robust to training on imperfectly transcribed data, and that substitutions and deletions pose a serious problem. To the best of our knowledge, there is still no solution for dealing with textual noise in E2E TTS. Moreover, in many circumstances, building TTS systems on noisy found data has to deal with textual and acoustic noise simultaneously: noisy speech is transcribed by an ASR system, which introduces text errors.

This paper addresses both acoustic and textual noise interferences for building a seq2seq-based speech synthesis system on noisy found data. To deal with textual noise, we propose a heuristic method to compensate for the erroneous linguistic features with phonetic information learned directly from speech. Specifically, VQVAE-based unsupervised clustering on the training speech is adopted to obtain a latent phonetic representation, which is combined with the context vector from the text encoder output to produce synthesized speech. As for the acoustic noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which needs neither an extra speech enhancement model nor the strong assumption on noise conditions made in [9]. Specifically, with the help of clean data from another speaker, we adopt a domain classification network with a gradient reversal layer in the auto-regressive decoder to disentangle the noise conditions in the latent feature space. Experiments show the effectiveness of the proposed approaches in dealing with textual and acoustic noise. Surpassing the denoising approach based on a state-of-the-art speech enhancement model, our system built on noisy data (speech SNR = 4 dB, text CER = 23.3%) can synthesize clean and high-quality speech with a MOS close to that of the system built on the clean counterpart.

II Proposed methods for found data

Fig. 1 illustrates our seq2seq-based speech synthesis framework, which shares a similar architecture with Tacotron [19, 12]. It is composed of a CBHG-based text encoder [19] and an auto-regressive decoder for mel-spectrogram generation, with an attention mechanism serving as the bridge between them. The WaveGlow [11] vocoder is used to reconstruct waveforms from the mel-spectrogram. Our approaches for dealing with text and speech noise are built on this baseline system. Below we briefly describe seq2seq-based speech synthesis.

Fig. 1: Proposed methods for found data

For the seq2seq TTS framework, suppose each speech utterance with $T$ frames of acoustic features $y = (y_1, \dots, y_T)$ has corresponding $N$ frames of golden character- or phoneme-level transcript $x = (x_1, \dots, x_N)$. The goal is to maximize the log probability $\log p(y \mid x)$. In the basic attention-based seq2seq framework [19], the decoder output is computed from

$$\hat{y}_t = \mathrm{Decoder}(y_{t-1}, c_t) \tag{1}$$

where $y_{t-1}$ is the ground-truth acoustic frame at time $t-1$, and $c_t$ is the context vector computed from the attention function $c_t = \mathrm{Attention}(y_{t-1}, h)$, which includes a content- or non-content-based score function to measure the contribution of each memory $h_n$ in the text encoder output $h = (h_1, \dots, h_N)$ [1, 4, 2]. The objective function is to minimize the distance between the predicted $\hat{y}$ and the target $y$.
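The role of the context vector can be illustrated with a minimal sketch. It assumes a plain dot-product (content-based) score, which is a simplification chosen for illustration and not the GMM or location-sensitive attention used in the experiments; the function names and the toy memory are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(query, memory):
    # Score each memory entry against the query (dot product, an assumption).
    scores = [sum(q * h for q, h in zip(query, h_n)) for h_n in memory]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of memory entries.
    dim = len(memory[0])
    context = [sum(w * h_n[d] for w, h_n in zip(weights, memory))
               for d in range(dim)]
    return weights, context

# Toy text-encoder memory with three entries and a decoder-side query.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attention_context(query, memory)
```

If a memory entry comes from a wrongly transcribed character, its weight still contributes to the context vector, which is how text errors reach the decoder.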

II-A Unsupervised clustering for textual noise

In the found data scenario, the golden transcript $x$ is unavailable for model training. Thus we need an extra speech recognizer to auto-transcribe the speech and obtain $\hat{x}$. Compared to the reference $x$, $\hat{x}$ may have irregular insertion, deletion or substitution errors. So the key problem is how to model the relation between the mismatched speech $y$ and text $\hat{x}$. As described in Eq. (1), given a previous speech frame $y_{t-1}$, the attention mechanism computes the contribution of each memory $h_n$ in the hidden space to generate $\hat{y}_t$. Due to the recognition errors of the ASR and the monotonic nature of speech generation, the speech may attend to an unrelated $\hat{x}_n$ rather than the correct $x_n$, which mostly causes mispronunciation according to our experiments.
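Insertion, deletion and substitution errors are what the character error rate (CER) reported later in the paper summarizes. A minimal sketch, assuming the common definition of CER as the Levenshtein distance between reference and hypothesis divided by the reference length (the paper does not spell out its exact scoring setup):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference character
                            curr[j - 1] + 1,      # insertion of a hypothesis character
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # Character error rate relative to the reference length.
    return edit_distance(ref, hyp) / len(ref)

example = cer("abcd", "abxd")  # one substitution in four characters
```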

It is almost impossible to handle the text noise directly in the speech synthesis task, as supervision labels that relate the text to the speech are unavailable. Our approach to dealing with text noise is motivated by recent work on unsupervised speech unit discovery, which has shown that phoneme-like clusters can be learned automatically from speech in an unsupervised manner [17, 6]. Specifically, latent features from waveforms tend to fall into distinct clusters that act as high-level speech descriptors closely related to phonemes [17]. To deal with text noise in found-data TTS, we propose a heuristic approach that conducts unsupervised clustering in the auto-regressive decoder to guide speech generation with a learnable latent phoneme representation, as shown in the unsupervised clustering module in Fig. 1.

In detail, the context vector in the basic seq2seq framework is computed only from the output of the text encoder, which inevitably contains text noise due to the inaccurate speech recognizer. In the proposed method, we compensate for such errors with a phonetic representation learned directly from speech. The context vector and the phonetic latent features are both injected into the decoder to produce synthesized speech, reducing the mismatch between speech and noisy text. There are several ways to learn the above discrete phonetic space [17, 5]. In our work, we adopt a vector quantized variational autoencoder (VQVAE) to obtain a learnable discrete cluster space, which we assume is related to phoneme-like units [17].

Along with the basic auto-regressive process, the latent representation of $y_{t-1}$ is also fed into the VQVAE encoder to obtain the latent $z_e$. We can obtain the discrete latent feature $z_q$ through

$$z_q = e_k, \quad k = \arg\min_j \| z_e - e_j \|_2 \tag{2}$$

where $e_j$ is the $j$-th code vector in the codebook. Here $z_q$ is treated as the latent phoneme representation clustered from speech and can be used to reconstruct the speech through the VQVAE decoder.
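The quantization step is a nearest-neighbour lookup in the codebook, as in VQ-VAE [17]. A minimal sketch with a hypothetical three-entry, two-dimensional codebook (the paper's codebook has 256 code vectors of 128 dimensions):

```python
def quantize(z_e, codebook):
    # Squared Euclidean distance between the encoder output and a code vector.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Pick the index of the closest code vector and return it with the vector.
    k = min(range(len(codebook)), key=lambda j: sq_dist(z_e, codebook[j]))
    return k, codebook[k]

# Toy codebook for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
k, z_q = quantize([0.9, 1.2], codebook)
```

Every encoder output that lands nearest to the same code vector shares one discrete unit, which is what lets the codebook act as a set of phoneme-like clusters.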

Besides, the selected latent cluster $z_q$ is also fed into the auto-regressive decoder along with the context vector $c_t$. Therefore Eq. (1) is updated to

$$\hat{y}_t = \mathrm{Decoder}(y_{t-1}, c_t, z_q) \tag{3}$$
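The objective function of the whole network is:

$$\mathcal{L} = \mathcal{L}_{rec} + \| \mathrm{sg}[z_e] - z_q \|_2^2 + \beta \| z_e - \mathrm{sg}[z_q] \|_2^2 \tag{4}$$

where $z_e$ is the VQVAE encoder output, $z_q$ is the selected code vector, $\mathcal{L}_{rec}$ includes the reconstruction loss of both the auto-regressive decoder and the VQVAE model, and $\mathrm{sg}[\cdot]$ is a stop-gradient operator which has zero partial derivatives at the operation. $\beta$ is the weight of the commitment loss, which makes sure the encoder commits to an embedding [17]. Since there is no real gradient defined for Eq. (2), we copy the gradient from the VQVAE decoder input to the encoder output, as shown by the red line in Fig. 1.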
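The two quantization terms of the standard VQ-VAE objective [17] have the same value up to the weight; the stop-gradient only decides which side receives the gradient. A minimal numeric sketch, assuming the commitment weight of 0.25 reported in the model details; the helper names are hypothetical and no automatic differentiation is involved:

```python
def vq_losses(z_e, z_q, beta=0.25):
    sq = sum((a - b) ** 2 for a, b in zip(z_e, z_q))
    # Codebook term ||sg[z_e] - z_q||^2: its gradient reaches only the code vector.
    codebook_loss = sq
    # Commitment term beta * ||z_e - sg[z_q]||^2: its gradient reaches only the encoder.
    commitment_loss = beta * sq
    return codebook_loss, commitment_loss

def straight_through(grad_wrt_z_q):
    # The argmin has no gradient, so the gradient at the VQVAE decoder
    # input is copied unchanged to the encoder output.
    return list(grad_wrt_z_q)

codebook_loss, commitment_loss = vq_losses([0.9, 1.2], [1.0, 1.0])
encoder_grad = straight_through([0.3, -0.1])
```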

II-B Adversarial feature learning for acoustic noise

Speech utterances in found data may contain different types of background noise, which directly affects the performance of the attention function and of the whole model. We can apply an external speech enhancement module to obtain de-noised speech features for training the downstream speech synthesis model, but this may cause distortion in the generated speech [16, 8].

In order to mitigate the negative effects of speech noise in $y$, we propose to use adversarial training to obtain the noise-independent latent feature $\tilde{y} = A(y)$, where $A(\cdot)$ is the proposed adversarial module. As shown in Fig. 1, the adversarial module contains a pre-net, a single unidirectional gated recurrent unit (GRU) network, and a classification network with a gradient reversal layer (GRL). The classification task is designed to classify each speech sample as clean or noisy. Since we do not have clean samples, similar to the data augmentation strategy in [9], we use another clean speech dataset along with the noisy samples to train the classification network. In a common classification network, the activations of the last latent layer often represent the classification information (the noisy/clean condition in this work). With the gradient reversal operation, the aim becomes disentangling the noise information to obtain noise-independent features [13], that is, encouraging $\tilde{y}$ not to be informative about the acoustic condition (noisy or clean). In [9], a GRL is also adopted to disentangle noise from a reference audio to control the condition of speech synthesis, but we learn the noise-independent features directly from the input speech $y$. Therefore, in the speech synthesis stage, we do not need a clean reference audio to generate clean speech. With the GRL, the context vector in Eq. (1) becomes

$$c_t = \mathrm{Attention}(A(y_{t-1}), h) \tag{5}$$

where $h$ is the memory produced by the text encoder.
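Since there is an extra classification network, the final objective function is

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{ce} \tag{6}$$

where $\mathcal{L}_{rec}$ denotes the mel-spectrogram reconstruction loss, $\mathcal{L}_{ce}$ is the cross-entropy loss for noise classification, and $\lambda$ is the weight for the classification loss.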
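The effect of a gradient reversal layer can be shown without a deep learning framework. The sketch below is an illustration under simplifying assumptions: a hypothetical two-dimensional feature, a fixed logistic-regression noise classifier, and hand-written gradients. It is not the paper's network; it only shows that the layer is the identity in the forward pass and flips the gradient sign in the backward pass, so a descent step on the feature raises the classifier's loss.

```python
import math

class GradientReversal:
    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, x):
        # Identity in the forward pass.
        return list(x)

    def backward(self, grad_output):
        # Sign-flipped (and scaled) gradient in the backward pass.
        return [-self.scale * g for g in grad_output]

def classifier_loss_and_grad(feature, weight, label):
    # Binary cross-entropy of a logistic classifier and its gradient
    # with respect to the input feature.
    logit = sum(w * f for w, f in zip(weight, feature))
    p = 1.0 / (1.0 + math.exp(-logit))
    loss = -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
    grad_feature = [(p - label) * w for w in weight]
    return loss, grad_feature

grl = GradientReversal(scale=1.0)
feature, weight, label = [0.5, -0.2], [1.0, 2.0], 1.0  # label 1.0 stands for "noisy"

loss_before, grad = classifier_loss_and_grad(grl.forward(feature), weight, label)
reversed_grad = grl.backward(grad)

# A gradient-descent step on the feature using the reversed gradient.
lr = 0.1
updated = [f - lr * g for f, g in zip(feature, reversed_grad)]
loss_after, _ = classifier_loss_and_grad(updated, weight, label)
```

After the step the classifier is less able to tell the acoustic condition from the feature, which is the intended pressure toward noise-independent features.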

III Experiments

In our experiments, we use an open-source Chinese corpus, which contains 10 hours of speech from a female speaker. To obtain the target noisy dataset, we mix the clean speech with random types of noise from the CHiME-4 challenge [18]. We use a speech recognition module to transcribe the noisy speech, where the character error rate (CER) depends on the signal-to-noise ratio (SNR). For the adversarial training, we use another clean corpus with 11 hours of speech from another Chinese female speaker as the clean data. Another copy of this corpus is mixed with random noise from CHiME-4 and, together with the target noisy corpus above, forms the noisy data. We use an internal speech recognition system to obtain transcripts. The CER is 8.9% for the clean target speaker. As for the speech enhancement baselines, we test an unsupervised model, Separabl [8], and a state-of-the-art supervised model, DCUnet [3]. The Perceptual Evaluation of Speech Quality (PESQ) scores for Separabl and DCUnet are 2.02 and 3.00, respectively.
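Mixing clean speech with noise at a chosen SNR can be sketched as follows. This assumes the usual power-based definition of SNR and scales the noise to reach the target; the synthetic sine "speech" and uniform random "noise" are stand-ins for illustration, not the corpus or CHiME-4 data.

```python
import math
import random

def power(signal):
    # Mean power of a sampled signal.
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

rng = random.Random(0)
speech = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(200)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=4.0)
measured = 10 * math.log10(power(speech) / power(scaled_noise))
```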

We mainly analyze the phone, tone and prosody information through our text analysis module to obtain the text representation. For the speech representation, an 80-band mel-scale spectrogram is treated as $y$ for the attention-based decoder. For evaluation, we reserve 400 sentences from the target corpus to conduct objective and subjective testing. Twenty listeners take part in the mean opinion score (MOS) test as the subjective evaluation (footnote 1: samples can be found at [URL missing]). Since the length of the predicted acoustic features differs from that of the target, we apply dynamic time warping (DTW) to align the two sequences and then compute the mel-cepstral distortion (MCD) for objective evaluation.
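The DTW-then-MCD evaluation can be sketched as below. It assumes the widely used MCD formula, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over aligned frame pairs, and a Euclidean frame distance for DTW; the paper does not state its exact feature order or whether the energy coefficient is excluded, so those details are left out.

```python
import math

def frame_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_path(ref, hyp):
    # Accumulated-cost matrix with a one-cell border of infinities.
    n, m = len(ref), len(hyp)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(ref[i - 1], hyp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the aligned frame pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

def mcd(ref, hyp):
    # Mean mel-cepstral distortion (dB) over the DTW-aligned frame pairs.
    path = dtw_path(ref, hyp)
    k = 10.0 / math.log(10.0)
    total = sum(k * math.sqrt(2.0 * sum((x - y) ** 2 for x, y in zip(ref[i], hyp[j])))
                for i, j in path)
    return total / len(path)

ref = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
hyp = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # one repeated frame
score = mcd(ref, hyp)
```

In the toy example the hypothesis only repeats one frame, so the alignment absorbs the length difference and the distortion is zero.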

III-A Model details

III-A1 Basic architecture

For the basic seq2seq system, we follow the architecture of Tacotron and Tacotron2 [19, 12]. In the encoder, we adopt three feed-forward layers as pre-net followed by a CBHG module [19]. As for the decoder, the acoustic feature is firstly fed into the decoder pre-net. And a unidirectional LSTM with GMM based attention mechanism [2] is adopted on the latent features. The basic architecture is applied to the baseline systems for noisy found data and the topline system for the original clean recordings.

III-A2 Unsupervised clustering

In the proposed unsupervised clustering for dealing with textual noise, the output of the decoder pre-net is fed into the VQVAE encoder, which contains two feed-forward layers with 256 units, each followed by a ReLU activation. The VQVAE decoder shares a similar architecture with the encoder. For the vector quantization module, the codebook contains 256 code vectors of 128 dimensions. The weight of the commitment loss is set to 0.25.

III-A3 Adversarial feature learning

For the noise-independent feature learning that deals with acoustic noise, we add a 256-unit unidirectional GRU layer on top of the pre-net in the decoder. The output of the GRU layer is fed into the following decoder LSTM layer, together with a controllable speaker and noise condition, as well as into the classification network.

III-B Experimental results

III-B1 Basic systems

We first evaluate the effects of textual and acoustic noise in training data on the baseline systems. Table I shows the objective and subjective results, where CGER means the character-level generation error. There are 7338 Chinese characters in total in the 400 test sentences.

Index  CER (%)  Speech  Attention  MOS   MCD   CGER (%)
R      -        -       -          4.41  -     -
A      0        clean   GMM        4.21  3.08  0.05
B      8.8      clean   GMM        3.62  4.29  0.07
C      11.7     clean   GMM        3.51  4.34  0.10
D      23.3     clean   GMM        3.04  4.55  3.24
E      23.3     clean   LSA        2.63  4.63  9.69
F      0        8 dB    GMM        2.10  7.16  0.05
G      0        4 dB    GMM        1.79  8.78  0.04
H      23.3     4 dB    GMM        0.78  -     -
TABLE I: The performance of the basic architecture for different types of found data, where R means recordings.
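To give the CGER percentages a concrete scale, they can be converted into approximate character counts over the 7338 test characters. This is a small arithmetic sketch, assuming CGER is the number of wrongly generated characters divided by the total number of test characters (the paper does not define the ratio explicitly):

```python
TOTAL_TEST_CHARS = 7338  # characters in the 400 test sentences

def approx_error_count(cger_percent, total=TOTAL_TEST_CHARS):
    # Convert a percentage into an approximate number of characters.
    return cger_percent / 100.0 * total

errors_a = approx_error_count(0.05)  # topline System A
errors_d = approx_error_count(3.24)  # System D, CER 23.3%, GMM attention
errors_e = approx_error_count(9.69)  # System E, CER 23.3%, LSA
```

Under that assumption, System D mispronounces roughly 238 characters and System E roughly 711, against about 4 for the topline.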

System A is the topline trained using golden transcripts and clean speech. We use Systems B to E to examine the text noise only, using clean speech data to train the models. Compared to the topline, Systems B to D get worse in both objective and subjective tests as the CER increases, which confirms the negative effects of textual noise. Comparing Systems A, B and C, although the text noise affects both objective and subjective results, we find that the seq2seq-based model makes few pronunciation errors when the CER is 11.7% or lower (CGER of at most 0.10%). Since there is no punctuation in the noisy ASR transcription, the prosody of the speech generated by Systems B and C is unsatisfactory, which causes the worse MOS values. This could be addressed with a more robust prosody model. For noisier text, the speech generated by System D suffers from mispronunciation. Since the context vector directly depends on the attention alignment, we also compare content-based and non-content-based score functions. In System E, we use content-based location sensitive attention (LSA) [4], where the text memory is taken into account. System D with non-content-based GMM attention outperforms the LSA-based System E. This is because noisy text is used to obtain the alignments in LSA, which affects the attention accuracy.

As for the acoustic noise, we use the golden transcripts to build System F and G with noisy training speech at 8dB and 4dB SNR, respectively. It’s obvious that System F outperforms System G since the speech data used in F contains less noise. But the synthesized speech of both systems is noisy. System H represents the real found data condition without correct transcription and clean speech, which achieves the lowest MOS of 0.78. Note that we do not have MCD for system H, since it always crashes during generation.

III-B2 Unsupervised clustering

To overcome the textual noise, we then build a system with the proposed unsupervised clustering to mitigate the mispronunciation caused by wrong transcriptions. Table II shows the performance, where VQVAE-based unsupervised clustering is applied to the topline System A and the baseline System D. The proposed VQVAE_D significantly outperforms System D in both subjective and objective metrics, and decreases the character generation error rate from 3.24% to 1.29%. This is because each output also depends on the units discovered from speech in an unsupervised manner, which mitigates the textual noise during training. Besides, comparing System A with VQVAE_A, we find that unsupervised clustering does not degrade the performance of the topline system with golden transcription.

Index CER (%) MOS MCD CGER (%)
A 0 4.21 3.08 0.05
VQVAE_A 0 4.25 3.10 0.06
D 23.3 3.04 4.55 3.24
VQVAE_D 23.3 3.47 4.42 1.29
TABLE II: The performance for individual textual noise.

III-B3 Adversarial feature learning

In real applications, we may have to deal with both acoustic and textual noise. Here we use adversarial feature learning to improve System H. The performance of the different approaches is summarized in Table III. We can see that the two systems that use speech enhancement to remove noise before TTS model training clearly improve TTS performance over System H. In System Separabl, we do not need external data for speech enhancement model training: the de-noised speech is obtained directly from the noisy data through the pre-trained unsupervised multi-node VAE model [8]. For System DCUnet, we need extra multi-speaker speech data and noise data to train the speech enhancement model. We notice that the proposed adversarial feature learning method significantly outperforms the speech enhancement methods. System Separabl indeed decreases the noise interference in the generated speech, but the performance of such unsupervised speech enhancement is not stable, which causes obvious distortions in the synthesized speech. Although System DCUnet, which adopts supervised speech enhancement, de-noises better, it also suffers considerably from mispronunciation errors, and there are some noticeable distortions in the generated speech. Note that the proposed adversarial feature learning method needs extra clean speech data from another speaker as augmentation, but it does not need a speech enhancement model.

System        CER (%)  Speech  MOS   MCD   CGER (%)
H             23.3     4 dB    0.78  -     -
Separabl [8]                   2.35  8.70  3.67
DCUnet [3]                     3.23  5.32  2.11
Adv-sen                        3.50  6.51  0.88
Adv-frame                      4.05  5.03  0.23
TABLE III: The performance for textual and acoustic noise.

With the help of another clean speech synthesis dataset, we conduct adversarial training to obtain noise-independent features at both the sentence level and the frame level, named Adv-sen and Adv-frame respectively. For System Adv-sen, classification is conducted on the mean and variance of the latent features to obtain a sentence-level representation. We find that although it can produce clean speech with good prosody, its pronunciation is not stable. With frame-level adversarial feature learning, System Adv-frame achieves the best performance among all systems. We assume the result benefits from two aspects: 1) the auxiliary dataset decreases the CER over the whole set of training texts and guides the model to generate clean speech, and 2) the adversarial feature learning can disentangle the noise information from speech, so that the control vector can directly control the generation process to produce clean speech.
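The difference between the two granularities lies in what the noise classifier receives. A minimal sketch, assuming sentence-level pooling simply concatenates the per-dimension mean and (population) variance over time; the function name and the toy features are hypothetical:

```python
def sentence_level_stats(frames):
    # Pool a sequence of latent frames into one vector: [mean..., variance...].
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean + var

latent = [[1.0, 2.0], [3.0, 4.0]]  # two frames of a toy two-dimensional latent

# Sentence level: the classifier sees one pooled vector per utterance.
sentence_input = sentence_level_stats(latent)

# Frame level: the classifier sees every frame, one decision per frame.
frame_inputs = latent
```

Frame-level classification applies the adversarial pressure to every decoder step, whereas pooling leaves individual frames free to retain noise cues as long as the utterance statistics do not.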

IV Conclusions and Future Work

This paper proposes an unsupervised clustering method to handle textual noise in found data, and an adversarial feature learning method to generate clean synthesized speech from noisy training speech. Experiments show that the proposed methods are effective for building a high-quality and stable seq2seq-based speech synthesis model from noisy found data. Future work will explore more robust methods for handling textual noise and extend the approach to multi-speaker found data.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [2] E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby (2019) Location-relative attention mechanisms for robust long-form speech synthesis. arXiv preprint arXiv:1910.10288.
  • [3] H. Choi, J. Kim, J. Huh, A. Kim, J. Ha, and K. Lee (2019) Phase-aware speech enhancement with deep complex U-Net. arXiv preprint arXiv:1903.03107.
  • [4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Proc. NIPS, pp. 577–585.
  • [5] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
  • [6] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, et al. (2019) The Zero Resource Speech Challenge 2019: TTS without T. arXiv preprint arXiv:1904.11469.
  • [7] J. Fong, P. O. Gallegos, Z. Hodari, and S. King (2019) Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data. In Proc. Interspeech.
  • [8] N. Gurunath, S. K. Rallabandi, and A. Black (2019) Disentangling speech and non-speech components for building robust acoustic models from found data. arXiv preprint arXiv:1909.11727.
  • [9] W. Hsu, Y. Zhang, R. J. Weiss, Y. Chung, Y. Wang, Y. Wu, and J. Glass (2019) Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5901–5905.
  • [10] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou (2019) Close to human quality TTS with Transformer. In Proc. AAAI.
  • [11] R. Prenger, R. Valle, and B. Catanzaro (2019) WaveGlow: a flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621.
  • [12] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al. (2017) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884.
  • [13] S. Sun, B. Zhang, L. Xie, and Y. Zhang (2017) An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing 257, pp. 79–87.
  • [14] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112.
  • [15] H. Tachibana, K. Uenoyama, and S. Aihara (2018) Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In Proc. ICASSP, pp. 4784–4788.
  • [16] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp. 146–152.
  • [17] A. van den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
  • [18] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer (2017) An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language 46, pp. 535–557.
  • [19] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, et al. (2017) Tacotron: Towards end-to-end speech synthesis. In Proc. INTERSPEECH, pp. 4006–4010.
  • [20] S. Yang, H. Lu, S. Kang, L. Xue, J. Xiao, D. Su, L. Xie, and D. Yu (2020) On the localness modeling for the self-attention based end-to-end speech synthesis. Neural Networks 125, pp. 121–130.