Improving Pseudo-label Training For End-to-end Speech Recognition Using Gradient Mask

by   Shaoshi Ling, et al.
ByteDance Inc.

In the recent trend of semi-supervised speech recognition, both self-supervised representation learning and pseudo-labeling have shown promising results. In this paper, we propose a novel approach that combines their ideas for end-to-end speech recognition models. Without any extra loss function, we use the Gradient Mask to optimize the model when training on pseudo labels. This method forces the speech recognition model to predict from the masked input, so that it learns a strong acoustic representation and training becomes robust to label noise. In our semi-supervised experiments, the method improves model performance when training on pseudo labels and achieves competitive results compared with other semi-supervised approaches on the Librispeech 100 hours experiments.




1 Introduction

Pseudo-labeling [33, 13, 23, 15, 27, 14, 36, 25, 16] is one of the most popular semi-supervised learning approaches and has recently demonstrated its efficacy in automatic speech recognition. In this approach, a smaller labeled set is used to train an initial seed model, which is applied to a larger amount of unlabeled data to generate hypotheses. The unlabeled data with the most reliable hypotheses are added to the training data for re-training. This process can be repeated iteratively to improve the quality of the pseudo labels [36]. However, pseudo-label training is sensitive to the quality of the hypotheses. Errors or noise in the labels can make training unstable and lead to sub-optimal states, especially for end-to-end speech recognition models [14]. Thus, pseudo-label training usually requires careful calibration by confidence measures [14, 25]. But confidence-based data filtering does not always work perfectly, since most pseudo-label sequences contain errors.

Starting from BERT [4], masked prediction has become a new principle for solving problems in self-supervised settings in NLP. The core idea of masked prediction is to force the model to learn good high-level representations of the unmasked inputs in order to infer the targets of the masked ones correctly. In speech, approaches sharing the same spirit have been proposed: masked prediction of audio acoustic features [21, 20], masked prediction of quantized acoustic features [18] and masked prediction of unsupervised clusters [11]. Experiments in [11] also showed that computing the loss only from the masked regions achieves better performance than computing it from all regions.

We draw inspiration from masked prediction and integrate its idea into pseudo-label training. We propose the Gradient Mask to improve pseudo-label training in end-to-end speech recognition. In our approach, we first train a seed model to generate pseudo labels and then use the Gradient Mask to train a student model on the pseudo labels. The model only allows gradients corresponding to masked inputs to back-propagate through the encoder, by masking the gradients corresponding to unmasked inputs. The model is trained by jointly minimizing the loss on labeled and pseudo-label data, while the Gradient Mask is turned off on labeled data.

Our training method forces the model to learn a strong acoustic representation in order to infer from the masked input. Moreover, it can also improve pseudo-label training by making the model less affected by label noise. The intuition is that only the gradients of the masked part are used when updating the model's parameters, so the method can avoid sudden dramatic changes in gradients caused by label errors and also alleviate overfitting to corrupted labels. Our approach is simple and efficient since it does not require any extra parameters, extra loss or data filtering steps. We run our experiments using the Transducer model [8]. The experiments show that our method is robust to label noise and achieves competitive results compared with other self/semi-supervised approaches in the Librispeech 100 hours experiments.

2 Related work

2.1 Combating noisy labels

DNNs are known to be susceptible to noisy labels [6, 31], and errors in labels can be extremely harmful to models. Beyond conventional data filtering/cleaning techniques, deep learning techniques have recently gained vast interest. Several works investigate supervised learning under noisy labels in computer vision. However, these models cannot be directly applied to ASR, and fewer studies have proposed ways to combat noisy labels for ASR. In [10], the phonetic sequence was inferred from several noisy transcriptions made by non-native transcribers using a misperception model, and then used to train a conventional hybrid ASR model. [5] proposes a novel loss function that jointly learns the ASR model and a transcription graph that can search for better transcriptions of the training data.

2.2 Joint training with self-supervised and ASR tasks

The idea of self-supervised learning [21, 20, 18, 11, 29, 3, 17, 1] is to learn speech representations that are useful for ASR. By first pre-training on a large amount of unlabeled data using a proxy task, the model can be fine-tuned on labeled data and achieve impressive results. This is a two-stage process, as it requires running separate pre-training and fine-tuning. Joint training with speech recognition and self-supervised representation [19, 32, 35, 34] is a line of work that simplifies this process and is the closest to our method. Those methods typically have two training objectives: one for the ASR task on the labeled data, and the other to train a self-supervised representation (e.g. masked feature prediction [19]) on the unlabeled data. Our method is much simpler and uses only one loss on both labeled and unlabeled (pseudo-labeled) data.

3 Method

In speech recognition, an E2E model predicts the conditional distribution $P(Y \mid X)$ of token sequences $Y = (y_1, \dots, y_U)$ given a speech-feature sequence $X = (x_1, \dots, x_T)$ as the input, where $y_u \in V$ and $x_t \in \mathbb{R}^d$ is the acoustic feature vector at time $t$. $V$ is the set of all possible output tokens. We explain and demonstrate our method with the transducer model [8], but it can also be adapted to other end-to-end ASR models (e.g. CTC [7], seq2seq [2]).

3.1 Transducer model

The transducer model [8] consists of an encoder, a prediction network and a joint network. The encoder encodes the inputs $X$ into a higher-level representation $h^{enc}_{1:T}$.

The prediction network takes the embedding vectors of the previous non-blank labels $y_{1:u-1}$ as input to produce its output $h^{pred}_u$ at step $u$. Then the logits over the vocabulary at frame $t$ and step $u$ can be computed by the joint network:

$$z_{t,u} = \mathrm{Joint}(h^{enc}_t, h^{pred}_u) \qquad (1)$$

The probability distribution over the vocabulary at frame $t$ and step $u$ is calculated using a soft-max layer. With the forward-backward algorithm, the sum probability $P(Y \mid X)$ of all alignment paths is adopted as the objective function.
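
For reference, the objective described above can be written out explicitly. This is a sketch in the standard notation of [8] for an input with $T$ frames and a label sequence of length $U$. The symbols $a$ (an alignment of length $T+U$ over labels and blanks), $\mathcal{B}$ (the map that removes blanks from an alignment) and $(t_i, u_i)$ (the frame and step at which the $i$-th symbol is emitted) are introduced here for illustration and are not taken from the text above.

```latex
P(Y \mid X) = \sum_{a \in \mathcal{B}^{-1}(Y)} \; \prod_{i=1}^{T+U} P(a_i \mid t_i, u_i),
\qquad
\mathcal{L}_{\text{transducer}} = -\log P(Y \mid X)
```

Each factor $P(a_i \mid t_i, u_i)$ is the soft-max output of the joint network at frame $t_i$ and step $u_i$, and the forward-backward algorithm evaluates the sum without enumerating the alignments.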

3.2 Gradient mask

Consider a sequence $X = (x_1, \dots, x_T)$ that has pseudo labels $\hat{Y}$. The objective is to enable the model to predict the labels from the masked features. In other words, the encoder is trained to be a strong acoustic representation model, which benefits the ASR task.

Before feeding the features to the encoder, we randomly generate a sequence $m = (m_1, \dots, m_T)$ representing the mask positions for the input sequence $X$. Specifically, $m_t$ is 1 if the features are masked at time $t$, otherwise $m_t$ is 0. The features, for example $x_t$, are masked by replacing them with a learnt mask embedding $e$. Then the encoder encodes this masked sequence $\tilde{X}$ as $h^{enc}_{1:T}$:

$$\tilde{x}_t = (1 - m_t)\, x_t + m_t\, e, \qquad h^{enc}_{1:T} = \mathrm{Encoder}(\tilde{X})$$

Our mask strategy is the same as in [1]: we randomly sample, without replacement, a certain proportion $p$ of all time steps to be starting indices and then mask the subsequent $M$ consecutive time steps from every sampled index; spans may overlap.
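
The span sampling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `sample_span_mask`, the rounding of the number of start indices, and the truncation of spans at the end of the sequence are choices made here, not details given in the text. The example values (proportion 0.065, span length 3) are the ones reported in Section 4.2.

```python
import random


def sample_span_mask(num_steps, p, span_len, rng=None):
    """Return a 0/1 list of length num_steps, where 1 marks a masked time step."""
    rng = rng or random.Random()
    # Assumption: the number of start indices is p * num_steps, rounded.
    num_starts = int(round(p * num_steps))
    # Start indices are drawn without replacement.
    starts = rng.sample(range(num_steps), num_starts)
    mask = [0] * num_steps
    for start in starts:
        # Assumption: spans running past the end are truncated.
        for t in range(start, min(start + span_len, num_steps)):
            mask[t] = 1  # overlapping spans simply stay masked
    return mask


mask = sample_span_mask(num_steps=500, p=0.065, span_len=3, rng=random.Random(0))
```

Because spans may overlap, the number of masked steps is at most the number of start indices times the span length, and usually somewhat lower.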

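When the gradient is back-propagated to the encoder, we mask the gradients corresponding to the non-masked inputs using the mask sequence $m = (m_1, \dots, m_T)$, where $h^{enc}_t$ denotes the encoder output at frame $t$:

$$\tilde{h}^{enc}_t = m_t\, h^{enc}_t + (1 - m_t)\, \mathrm{sg}(h^{enc}_t)$$

The prediction network takes the pseudo-label sequence $\hat{Y}$ as its input and produces the output $h^{pred}_u$ at step $u$. The joint network then produces its output as in (1). But during back-propagation, we also block the gradient flow into the prediction network. This process can be expressed as:

$$z_{t,u} = \mathrm{Joint}\big(\tilde{h}^{enc}_t,\ \mathrm{sg}(h^{pred}_u)\big)$$

where $\mathrm{sg}(x) \equiv x$, $\frac{d}{dx}\mathrm{sg}(x) \equiv 0$ is the stop-gradient operator. The objective function is still the same transducer loss, where we minimize $-\log P(\hat{Y} \mid X)$, with $P(\hat{Y} \mid X)$ the sum probability of all alignment paths.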
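
To make the effect of the gradient mask concrete, here is a deliberately tiny, hand-derived example in plain Python. It is not the transducer model: it assumes a hypothetical one-parameter "encoder" `h_t = w * x_t` with an independent squared-error loss per frame, and it leaves out the input masking step. It only shows that multiplying each frame's upstream gradient by a 0/1 mask value removes that frame's contribution to the parameter update whenever the value is 0. In an autograd framework the same effect is commonly obtained by mixing a tensor with a detached copy of itself (for example via `Tensor.detach()` in PyTorch).

```python
def encoder_grad(w, xs, targets, mask=None):
    """Gradient of sum_t 0.5 * (w * x_t - y_t)**2 w.r.t. w, optionally gradient-masked."""
    if mask is None:
        mask = [1] * len(xs)  # no gradient mask: every frame contributes
    grad = 0.0
    for x, y, m in zip(xs, targets, mask):
        upstream = w * x - y       # dLoss/dh_t for this frame
        grad += m * upstream * x   # m = 0 blocks this frame's gradient
    return grad


xs = [1.0, 2.0, 3.0, 4.0]
clean_labels = [0.5, 1.0, 1.5, 2.0]
noisy_labels = [0.5, 9.0, 1.5, 2.0]   # label error at frame index 1
mask = [1, 0, 0, 1]                   # frame 1 is not masked, so its gradient is blocked

g_clean = encoder_grad(1.0, xs, clean_labels, mask)
g_noisy = encoder_grad(1.0, xs, noisy_labels, mask)
g_noisy_no_mask = encoder_grad(1.0, xs, noisy_labels)
```

In this toy setting the label error at the unmasked frame leaves the masked update unchanged (`g_clean == g_noisy`), while the update without the gradient mask is pulled strongly by the wrong label. In a real encoder the frames interact through attention and convolution, so the isolation is not this clean; the example only illustrates the intuition.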

Method          Model size  Criterion   LM    G/TPU-days  dev-clean  dev-other  test-clean  test-other
NST [25]        360M        S2S         lstm  1600        3.9        8.8        4.2         8.6
w2v2-base [1]   95M         CTC         None  102.4       6.1        13.5       6.1         13.3
w2v2-large [1]  317M        CTC         None  294.4       4.6        9.3        4.7         9.0
IPL [36]        322M        CTC         None  192         5.5        9.3        6.0         10.3
slimIPL [16]    322M        CTC         None  83.2        3.7        7.3        3.8         7.5
NST-iter1       118M        transducer  None  14          5.3        12.7       5.4         12.9
GM-iter1        118M        transducer  None  14          4.8        11.1       4.9         11.2
GM-iter5        118M        transducer  None  54          4.1        8.8        4.3         8.8
Table 1: Semi-supervised LibriSpeech results (WER) using 100 hours as labeled data and 860 hours as unlabeled data. Our experiments are in the lower part of the table (NST-iter1, GM-iter1 and GM-iter5).

3.3 Training procedure

The whole training process is similar to the standard pseudo-labeling approach. Let $\mathcal{L}$ be a labeled dataset and $\mathcal{U}$ be a large unlabeled dataset. We first train a seed acoustic model M on the labeled dataset $\mathcal{L}$. We use this seed model M to generate pseudo labels for $\mathcal{U}$, which gives a pseudo-labeled dataset $\hat{\mathcal{U}}$ that is used together with all the labeled data in $\mathcal{L}$.

The next step is to train a student model using both datasets $\mathcal{L}$ and $\hat{\mathcal{U}}$. The model is trained by alternately minimizing the losses on $\mathcal{L}$ and $\hat{\mathcal{U}}$. When updating the model parameters using a minibatch from the pseudo-label dataset $\hat{\mathcal{U}}$, we apply the gradient mask method described in 3.2. On a minibatch from the labeled dataset, we update the parameters in the standard way for the transducer in 3.1. This process is repeated until the word error rate on the validation dataset converges. Since the loss function is the same for both datasets, we use only one momentum optimizer and the same learning rate for simplicity. The ratio of minibatches from $\mathcal{L}$ to minibatches from $\hat{\mathcal{U}}$ is a hyper-parameter to be tuned.
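
The alternation between the two datasets can be sketched as a simple schedule that tells the training loop which dataset the next minibatch comes from and whether the gradient mask is switched on. This is a minimal sketch with assumptions: the name `minibatch_schedule` and the ordering inside one cycle (labeled minibatches first, then pseudo-label minibatches) are illustrative choices, and the 1:9 ratio is the value reported in Section 4.2.

```python
import itertools


def minibatch_schedule(labeled_per_cycle=1, pseudo_per_cycle=9):
    """Return an endless iterator of (source, use_gradient_mask) pairs."""
    cycle = (
        [("labeled", False)] * labeled_per_cycle   # standard transducer update
        + [("pseudo", True)] * pseudo_per_cycle    # gradient mask switched on
    )
    return itertools.cycle(cycle)


steps = list(itertools.islice(minibatch_schedule(1, 9), 20))
num_labeled = sum(1 for source, _ in steps if source == "labeled")
num_pseudo = sum(1 for source, _ in steps if source == "pseudo")
```

A training loop would draw one pair per step, fetch a minibatch from the corresponding dataset, and apply the same loss and optimizer in both cases, with the gradient mask active only when the flag is true.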

Method       LM      dev-clean  dev-other  test-clean  test-other
Hybrid [22]  4-gram  5.0        19.5       5.8         18.6
LAS [25]     lstm    5.3        16.5       5.5         16.9
CTC [16]     None    6.2        16.8       6.2         16.8
Transducer   None    6.3        16.8       6.4         16.7
Table 2: WER on the Librispeech 100 hours for supervised systems

4 Experiments

4.1 Data

We conducted our experiments on the LibriSpeech [24] datasets. The labeled dataset is a 100 hours subset (train-clean-100) of Librispeech, and the remaining 860 hours (train-clean-360, train-other-500) is the unlabeled dataset. During training, samples in the dataset that are longer than 20 seconds are filtered out. The performance of the trained model is validated on the dev-clean and dev-other datasets of Librispeech and tested on the test-clean/other dataset. We did not use any extra text or LM information for any of our experiments.

We use around 5k subword [30] units as our prediction targets. We extract 80-channel filterbank features computed from a 25ms window with a stride of 10ms. When training on labeled data, we use speed perturbation and SpecAugment [26, 9] with mask parameter (F = 27), and ten time masks with maximum time-mask ratio (pS = 0.05), where the maximum size of the time mask is set to pS times the length of the utterance.
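
As a small worked example of these numbers, the snippet below counts the feature frames of an utterance and derives the widest allowed time mask from pS. It is a sketch under assumptions: the helper names are hypothetical, the frame count assumes no padding at the utterance edges, and the mask limit is rounded down.

```python
def num_feature_frames(duration_ms, window_ms=25, stride_ms=10):
    """Number of full analysis windows that fit in the utterance (no padding assumed)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // stride_ms


def max_time_mask_frames(num_frames, p_s=0.05):
    """Largest allowed time-mask width: pS times the utterance length, rounded down."""
    return int(p_s * num_frames)


frames = num_feature_frames(10_000)    # a 10-second utterance
limit = max_time_mask_frames(frames)   # widest single time mask, in frames
```

For a 10-second utterance this gives just under 1000 frames, so each of the ten time masks can cover at most about 50 frames, or roughly half a second.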

4.2 Setup

The filterbank features are first passed into 2 blocks of 2d-conv layers; time reduction layers are added after each block to down-sample the frame rate by a factor of 4 before passing into the encoder. The encoder consists of 17 conformer blocks, where we set the model dimension to 512, the inner dimension of the feed-forward layer to 2048, with 8 attention heads and a kernel size of 32 in the convolution block, the same setting as Conformer-L [9]. We use an LSTM as our predictor; it contains 1 layer with 640 units and a projection layer with 640 units. The Transducer's joint network is a simple feed-forward layer. The total number of parameters is about 130M. Our model is implemented in PyTorch and optimized using Adam. We use this same model in all of our experiments.

For the 100 hours seed model, we first train a GMM-based model in Kaldi [28] to obtain alignments on the 100 hours subset, and we use the frame-wise phoneme labels to pre-train the encoder. Then we use the pre-trained encoder to initialize our transducer model [12]. For training the transducer model, we warm up the learning rate over the first 10k updates to a peak of 1e-4, hold it for 60k steps, then linearly decay it. We group the input sequences by length with a batch size of 10k frames per GPU, and train the models on 4 GPUs for 160k steps in total.

For training the student model, the mask proportion $p$ is set to 0.065 and the span length $M$ is set to 3 (equal to 12 frames or 0.12 second). This masking schema is similar to [1], and it resulted in around half of the frames being masked. We set the ratio of minibatches from labeled data to minibatches from pseudo-label data to 1:9. This ratio roughly matches the ratio of the amounts of data, and it produces the ASR model with the best performance. We warm up the learning rate over the first 10k updates to a peak of 2e-4, hold it for 80k steps, then linearly decay it. We group the input sequences by length with a batch size of 10k frames per GPU, and train the models on 8 GPUs for 180k steps.
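
The two learning-rate schedules described in this section (seed model and student model) share the same warm-up, hold and linear-decay shape. The function below is a sketch of that shape, not the authors' implementation. It assumes a linear warm-up starting from zero and a decay that reaches zero at the final training step; neither detail is stated in the text, and the function name is hypothetical.

```python
def learning_rate(step, peak, warmup_steps, hold_steps, total_steps):
    """Warm-up, hold, then linear decay (assumed to reach 0 at total_steps)."""
    if step < warmup_steps:
        return peak * step / warmup_steps          # assumed linear warm-up from 0
    decay_start = warmup_steps + hold_steps
    if step < decay_start:
        return peak                                # hold at the peak value
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - decay_start)


def seed_lr(step):
    # Seed model: 10k warm-up to 1e-4, 60k hold, 160k steps in total.
    return learning_rate(step, 1e-4, 10_000, 60_000, 160_000)


def student_lr(step):
    # Student model: 10k warm-up to 2e-4, 80k hold, 180k steps in total.
    return learning_rate(step, 2e-4, 10_000, 80_000, 180_000)
```

Under these assumptions the seed schedule decays over its last 90k steps and the student schedule over its last 90k steps as well, so the student simply trains at twice the peak rate with a longer hold phase.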

4.3 Results

4.3.1 Supervised baseline

Table 2 shows the results of our seed model and a comparison with Librispeech 100 hours supervised models from other papers. We use this seed model to generate the first version of the pseudo labels. The resulting 860 hours of pseudo labels have a WER of around 9.

4.3.2 Semi-supervised experiments

Table 1 shows the results of the semi-supervised experiments. NST-iter1 is the result of the experiment where we simply mixed the pseudo-label data and the labeled data to form the new training dataset and trained the student model on it. This process is a simplified version of noisy student training, since we did not do any filtering, LM fusion, or data selection [14, 25].

GM-iter1 and GM-iter5 are the models trained using the gradient mask method. GM-iter1 is the student model trained directly on the pseudo labels generated by the seed model. Our proposed approach significantly outperforms the 100 hours supervised baselines in Table 2 and also the noisy student training baseline. For GM-iter5, we iterate the pseudo-labeling process 5 times. In particular, the GM-iter5 model achieved highly competitive performance, with a WER of 4.1/8.8 for dev-clean/dev-other and 4.3/8.8 for test-clean/test-other. It is worth noting that our method is highly efficient: we use far fewer computing resources or a much smaller model size compared with the other approaches in Table 1.

4.4 Ablation study and analysis

4.4.1 Pseudo-labeling iterations

To study the performance across pseudo-labeling iterations, Table 3 shows the WER on test-clean/other after each training iteration using the gradient mask method. We stopped after the 5th iteration since the improvement was already minimal at iter5.

Iteration  test-clean  test-other
100h seed  6.2         16.8
iter1      4.9         11.2
iter2      4.6         9.7
iter3      4.4         9.2
iter4      4.3         8.9
iter5      4.3         8.8
Table 3: Ablation on each pseudo-labeling iteration

4.4.2 Gradient mask on labels of different qualities

We conduct an ablation study to investigate the effect of the gradient mask on labels of different qualities. We run the experiments with and without the gradient mask method on those labels. Without the gradient mask, training is the same as standard transducer training. The training data includes the 860 hours of pseudo-label data and the 100 hours of labeled data. Pseudo(WER-9) is the pseudo-label set generated by the 100h seed model, which has a WER of around 9. Pseudo(WER-15) is generated by the same supervised system but from an early epoch, and has a WER of around 15. Pseudo(WER-5) is generated by the student model from the 3rd iteration. Pseudo(WER-2) is generated by an intermediate model trained on 960 hours of labeled data.

When the pseudo labels contain a lot of errors (WER 15), simply adding them causes the model performance to degrade compared with the 100h baseline in Table 2. Even when we have high-quality pseudo labels (WER 5), the noise in the labels still hurts the model performance. On the other hand, the gradient mask method is robust to low-quality labels and works consistently well on labels of different qualities. We found that the worse the pseudo-label quality, the larger the gain from the gradient mask method over standard training. Standard training performs comparably to the gradient mask method when the pseudo labels have a WER of around 2, and it performs better when we use the ground-truth reference labels.

Data             dev-other WER
                 gm     w/o gm
Reference        7.5    6.7
Pseudo(WER-2)    7.8    7.6
Pseudo(WER-5)    8.8    9.4
Pseudo(WER-9)    11.1   12.7
Pseudo(WER-15)   14.2   18.9
Table 4: With and without gradient mask on different pseudo labels
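
The trend in Table 4 is easier to see as a relative WER reduction of the gradient mask over standard training. The snippet below computes it from the table values; the helper name `relative_reduction` is introduced here for illustration, and a negative value means standard training is better.

```python
# (WER with gradient mask, WER without gradient mask) on dev-other, from Table 4.
results = {
    "Reference": (7.5, 6.7),
    "Pseudo(WER-2)": (7.8, 7.6),
    "Pseudo(WER-5)": (8.8, 9.4),
    "Pseudo(WER-9)": (11.1, 12.7),
    "Pseudo(WER-15)": (14.2, 18.9),
}


def relative_reduction(with_gm, without_gm):
    """Relative WER reduction in percent from using the gradient mask."""
    return 100.0 * (without_gm - with_gm) / without_gm


reductions = {name: relative_reduction(*pair) for name, pair in results.items()}
```

The reduction grows steadily as label quality drops, from about 6% at WER 5 to about 25% at WER 15, and turns negative for the WER-2 pseudo labels and the reference labels.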

5 Conclusion

In this paper, we present the Gradient Mask method, a simple and efficient method to improve pseudo-label training for end-to-end speech recognition. Our method forces the model to learn an acoustic representation and makes it robust to errors in labels. It can be used to combat label noise in pseudo-label training. In semi-supervised experiments, our method achieved much better performance than the conventional pseudo-label training approach and performed comparably to the SOTA approach while being much more computation-efficient. Future work includes exploring the extension to other end-to-end ASR systems such as LAS and to other sequence-to-sequence tasks such as machine translation.


References

  • [1] A. Baevski, H. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477. Cited by: §2.2, §3.2, Table 1, §4.2.
  • [2] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In ICASSP, pp. 4960–4964. Cited by: §3.
  • [3] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. In Interspeech, pp. 146–150. Cited by: §2.2.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186. Cited by: §1.
  • [5] A. Dufraux, E. Vincent, A. Hannun, A. Brun, and M. Douze (2019) Lead2Gold: towards exploiting the full potential of noisy transcriptions for speech recognition. In ASRU, pp. 78–85. Cited by: §2.1.
  • [6] B. Frénay and M. Verleysen (2013) Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems 25 (5), pp. 845–869. Cited by: §2.1.
  • [7] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376. Cited by: §3.
  • [8] A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §1, §3.1, §3.
  • [9] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §4.1, §4.2.
  • [10] M. A. Hasegawa-Johnson, P. Jyothi, D. McCloy, M. Mirbagheri, G. M. Di Liberto, A. Das, B. Ekin, C. Liu, V. Manohar, H. Tang, et al. (2016) ASR for under-resourced languages from probabilistic transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (1), pp. 50–63. Cited by: §2.1.
  • [11] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. arXiv preprint arXiv:2106.07447. Cited by: §1, §2.2.
  • [12] H. Hu, R. Zhao, J. Li, L. Lu, and Y. Gong (2020) Exploring pre-training with alignments for rnn transducer based end-to-end speech recognition. In ICASSP, pp. 7079–7083. Cited by: §4.2.
  • [13] Y. Huang, Y. Wang, and Y. Gong (2016) Semi-supervised training in deep learning acoustic model. In Interspeech, pp. 3848–3852. Cited by: §1.
  • [14] J. Kahn, A. Lee, and A. Hannun (2020) Self-training for end-to-end speech recognition. In ICASSP, pp. 7084–7088. Cited by: §1, §4.3.2.
  • [15] S. Karita, S. Watanabe, T. Iwata, A. Ogawa, and M. Delcroix (2018) Semi-supervised end-to-end speech recognition. In Interspeech, pp. 2–6. Cited by: §1.
  • [16] T. Likhomanenko, Q. Xu, J. Kahn, G. Synnaeve, and R. Collobert (2020) Slimipl: language-model-free iterative pseudo-labeling. arXiv preprint arXiv:2010.11524. Cited by: §1, Table 1, Table 2.
  • [17] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff (2020) Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP, pp. 6429–6433. Cited by: §2.2.
  • [18] S. Ling and Y. Liu (2020) Decoar 2.0: deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659. Cited by: §1, §2.2.
  • [19] S. Ling, J. Salazar, Y. Liu, and K. Kirchhoff (2020) BERTphone: phonetically-aware encoder representations for utterance-level speaker and language recognition. In Proc. Odyssey, pp. 9–16. Cited by: §2.2.
  • [20] A. T. Liu, S. Li, and H. Lee (2020) TERA: Self-supervised Learning of Transformer Encoder Representation for Speech. arXiv preprint arXiv:2007.06028. Cited by: §1, §2.2.
  • [21] A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP, pp. 6419–6423. Cited by: §1, §2.2.
  • [22] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney (2019) RWTH asr systems for librispeech: hybrid vs attention–w/o data augmentation. arXiv preprint arXiv:1905.03072. Cited by: Table 2.
  • [23] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur (2018) Semi-supervised training of acoustic models using lattice-free mmi. In ICASSP, pp. 4844–4848. Cited by: §1.
  • [24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In ICASSP, pp. 5206–5210. Cited by: §4.1.
  • [25] D. S. Park, Y. Zhang, Y. Jia, W. Han, C. Chiu, B. Li, Y. Wu, and Q. V. Le (2020) Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629. Cited by: §1, Table 1, Table 2, §4.3.2.
  • [26] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Interspeech, pp. 2613–2617. Cited by: §4.1.
  • [27] S. H. K. Parthasarathi and N. Strom (2019) Lessons from building acoustic models with a million hours of speech. In ICASSP, pp. 6670–6674. Cited by: §1.
  • [28] D. Povey, A. Ghoshal, and G. Boulianne (2011) The Kaldi speech recognition toolkit. In ASRU, Cited by: §4.2.
  • [29] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Interspeech, pp. 3465–3469. Cited by: §2.2.
  • [30] R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §4.1.
  • [31] H. Song, M. Kim, D. Park, Y. Shin, and J. Lee (2020) Learning from noisy labels with deep neural networks: a survey. arXiv preprint arXiv:2007.08199. Cited by: §2.1.
  • [32] C. Talnikar, T. Likhomanenko, R. Collobert, and G. Synnaeve (2021) Joint masked cpc and ctc training for asr. In ICASSP, pp. 3045–3049. Cited by: §2.2.
  • [33] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky (2013) Deep neural network features and semi-supervised training for low resource speech recognition. In ICASSP, pp. 6704–6708. Cited by: §1.
  • [34] C. Wang, Y. Wu, S. Liu, J. Li, Y. Qian, K. Kumatani, and F. Wei (2021) UniSpeech at scale: an empirical study of pre-training method on large-scale speech recognition dataset. arXiv preprint arXiv:2107.05233. Cited by: §2.2.
  • [35] C. Wang, Y. Wu, Y. Qian, K. Kumatani, S. Liu, F. Wei, M. Zeng, and X. Huang (2021) Unispeech: unified speech representation learning with labeled and unlabeled data. arXiv preprint arXiv:2101.07597. Cited by: §2.2.
  • [36] Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun, G. Synnaeve, and R. Collobert (2020) Iterative pseudo-labeling for speech recognition. arXiv preprint arXiv:2005.09267. Cited by: §1, Table 1.