Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy

10/11/2021, by Yosuke Higuchi, et al. (MERL)

Pseudo-labeling (PL), a semi-supervised learning (SSL) method where a seed model performs self-training using pseudo-labels generated from untranscribed speech, has been shown to enhance the performance of end-to-end automatic speech recognition (ASR). Our prior work proposed momentum pseudo-labeling (MPL), which performs PL-based SSL via an interaction between online and offline models, inspired by the mean teacher framework. MPL achieves remarkable results in various semi-supervised settings, showing robustness to variations in the amount of data and domain mismatch severity. However, there is further room for improving the seed model used to initialize the MPL training, as it is in general critical for a PL-based method to start training from high-quality pseudo-labels. To this end, we propose to enhance MPL by (1) introducing the Conformer architecture to boost the overall recognition accuracy and (2) exploiting iterative pseudo-labeling with a language model to improve the seed model before applying MPL. The experimental results demonstrate that the proposed approaches effectively improve MPL performance, outperforming other PL-based methods. We also present in-depth investigations to make our improvements effective, e.g., with regard to the batch normalization typically used in Conformer and the LM quality.


1 Introduction

Recent progress in automatic speech recognition (ASR) has shifted towards the end-to-end (E2E) framework, which aims to model direct speech-to-text conversion using a single deep neural network [1, 2, 3]. With well-established sequence-to-sequence modeling techniques [4, 5, 6, 7] and more sophisticated neural network architectures [8, 9], E2E ASR models have shown promising results on various benchmarks [10, 11, 12]. However, the performance often depends on the availability of a large quantity of labeled (transcribed) speech data, which is not always feasible due to high annotation costs.

To compensate for the limited amount of labeled data, semi-supervised learning (SSL) [13] methods, which make use of a large amount of unlabeled data to improve model performance, can be applied. While various efforts have been made to perform SSL in E2E ASR [14, 15, 16, 17, 18, 19], pseudo-labeling (PL) [20] (or self-training [21]) approaches have been attracting increasing attention due to their effectiveness and simplicity [22, 23, 24, 25, 26, 27, 28, 29]. In PL, a seed model is first trained on labeled data and used to generate pseudo-labels for unlabeled data. Both the labeled and pseudo-labeled data are then used to train a better-performing model. In our previous work [29], we proposed a PL-based method for E2E ASR, called momentum pseudo-labeling (MPL). MPL trains a pair of online and offline models that interact and learn from each other, inspired by the mean teacher framework [30]. The online model is trained to predict pseudo-labels generated on the fly by the offline model, which maintains an exponential moving average of the online model parameters. Through the interaction between the two models, MPL effectively stabilizes the training with unlabeled data and significantly improves over the seed model performance.

One of the crucial factors for making PL-based approaches successful is to avoid generating severely erroneous pseudo-labels, which can limit the improvement of an E2E ASR model. In a typical SSL setting in ASR, the amount of labeled data is quite small and the quality of pseudo-labels is not necessarily guaranteed. To this end, an external language model (LM) and beam-search decoding are often incorporated into the labeling process [22, 23]. In [25, 26], low-quality pseudo-labels are excluded via confidence-based filtering to promote model training with high-quality labels. In [27], an N-best list of pseudo-labels is leveraged to incorporate more appropriate supervision from alternative ASR hypotheses.

We believe that MPL still has room for further improvement by making the models capable of generating higher-quality pseudo-labels. Thus, in this work, we propose to enhance MPL by (1) introducing the Conformer architecture to boost the overall recognition accuracy, and (2) using iterative pseudo-labeling [23] to transfer LM knowledge into a seed model before performing MPL. The key contributions of this work are summarized as follows. (a) We show that a vanilla Conformer struggles to generalize to unlabeled data, especially when there is a domain mismatch against the labeled data. We mitigate this issue by substituting batch normalization with group normalization in the convolution module. (b) We demonstrate that the improved MPL is robust to over-fitting to the LM training text, which has been reported as problematic when using an LM in PL [26, 28]. We also investigate the importance of LM quality in our framework. (c) We show that the proposed approaches effectively enhance MPL through experiments on a variety of SSL scenarios with varying amounts of unlabeled data and degrees of domain mismatch.

2 Momentum Pseudo-Labeling

In this section, we review the MPL method proposed in our prior work [29]. MPL is described in two steps: 1) the supervised training of a seed E2E ASR model, and 2) the MPL-based semi-supervised training of the model using unlabeled data.

2.1 Supervised training of a seed model

E2E ASR is formulated as a sequence mapping problem between a $T$-length input sequence $X = (\mathbf{x}_t \in \mathbb{R}^D \mid t = 1, \dots, T)$ and an $L$-length output sequence $Y = (y_l \in \mathcal{V} \mid l = 1, \dots, L)$. Here, $\mathbf{x}_t$ is a $D$-dimensional acoustic feature at frame $t$, $y_l$ an output token at position $l$, and $\mathcal{V}$ a vocabulary. This work focuses on the connectionist temporal classification (CTC)-based E2E ASR model [4, 1], which is less prone to the looping and early-stopping issues often caused by autoregressive decoder networks [31, 22]. CTC predicts a frame-level latent sequence $Z = (z_t \in \mathcal{V} \cup \{\epsilon\} \mid t = 1, \dots, T)$, which is obtained by augmenting $\mathcal{V}$ with a special blank token $\epsilon$. Based on the conditional independence assumption between token predictions, CTC models the conditional probability of $Y$ given $X$ by marginalizing over latent sequences as

$p_{\mathrm{ctc}}(Y \mid X) = \sum_{Z \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} p(z_t \mid X)$,    (1)

where $\mathcal{B}^{-1}(Y)$ returns all possible latent sequences compatible with $Y$. Given labeled data $\mathcal{D}_{\mathrm{L}} = \{(X, Y)\}$, a seed model with parameters $\theta$ is optimized by minimizing the negative log-likelihood of Eq. (1):

$\mathcal{L}_{\mathrm{sup}} = -\log p_{\mathrm{ctc}}(Y \mid \mathcal{A}(X); \theta)$,    (2)

where $\mathcal{A}(\cdot)$ indicates SpecAugment [32] for augmenting the input.
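To make Eq. (2) concrete, below is a minimal PyTorch-style sketch of the supervised CTC objective with a simplified SpecAugment-like masking step; the encoder interface (including the output_lengths helper), tensor shapes, and mask sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def spec_augment(x, num_freq_masks=2, max_f=27, num_time_masks=2, max_t=40):
    """Very simplified SpecAugment: zero out random frequency and time bands.

    x: (batch, frames, feat_dim) acoustic features.
    """
    x = x.clone()
    B, T, D = x.shape
    for b in range(B):
        for _ in range(num_freq_masks):
            f = torch.randint(0, max_f + 1, (1,)).item()
            f0 = torch.randint(0, max(1, D - f), (1,)).item()
            x[b, :, f0:f0 + f] = 0.0
        for _ in range(num_time_masks):
            t = torch.randint(0, max_t + 1, (1,)).item()
            t0 = torch.randint(0, max(1, T - t), (1,)).item()
            x[b, t0:t0 + t, :] = 0.0
    return x


def ctc_supervised_loss(encoder, x, x_lens, y, y_lens, blank_id=0):
    """Negative log-likelihood of Eq. (2): -log p_ctc(Y | SpecAugment(X); theta).

    encoder maps (B, T, D) features to (B, T', V+1) logits over tokens + blank.
    """
    logits = encoder(spec_augment(x))                           # (B, T', V+1)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (T', B, V+1)
    out_lens = encoder.output_lengths(x_lens)                   # assumed helper for subsampling
    return F.ctc_loss(log_probs, y, out_lens, y_lens, blank=blank_id)
```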

2.2 Semi-supervised training with MPL

The goal of semi-supervised training is to exploit unlabeled data $\mathcal{D}_{\mathrm{U}} = \{X\}$ for enhancing the seed model trained on labeled data $\mathcal{D}_{\mathrm{L}}$. MPL performs the training using a pair of online and offline models that interact and learn from each other. Let $\theta$ and $\xi$ be the parameters of the online and offline models, respectively, both initialized with the seed model parameters.

Online model training: Given an unlabeled sample $X \in \mathcal{D}_{\mathrm{U}}$, the online model is trained on pseudo-labels generated on the fly by the offline model:

$\hat{Y} = \mathop{\mathrm{argmax}}_{Y} \, p_{\mathrm{ctc}}(Y \mid X; \xi)$,    (3)

where the argmax is approximated by the best path decoding of CTC [4]. With the pseudo-labels $\hat{Y}$ generated from Eq. (3), the objective of the online model is defined in the same manner as Eq. (2):

$\mathcal{L}_{\mathrm{unsup}} = -\log p_{\mathrm{ctc}}(\hat{Y} \mid \mathcal{A}(X); \theta)$,    (4)

where $\theta$ is optimized via gradient descent. Note that, during the semi-supervised training, we also use the labeled data $\mathcal{D}_{\mathrm{L}}$ and the supervised loss $\mathcal{L}_{\mathrm{sup}}$, which helps stabilize the online model and promotes learning from unlabeled data with $\mathcal{L}_{\mathrm{unsup}}$.

Offline model training: After every update of the online model, the offline model accumulates the online model parameters via a momentum-based moving average:

$\xi \leftarrow \alpha \xi + (1 - \alpha)\theta$,    (5)

where $\alpha \in [0, 1)$ is a momentum coefficient. This momentum update makes the offline model evolve more smoothly than the online model, preventing the pseudo-labels from deviating too quickly from the labels initially generated by the seed model. To handle the sensitive tuning of the momentum coefficient $\alpha$, we follow our prior work and indirectly derive it from a weight $\gamma$ as $\alpha = \gamma^{1/K}$, where $K$ is the number of iterations (i.e., batches) in a training epoch. The weight $\gamma$ can be regarded as the proportion of the seed model retained after a training epoch, and we fix it to 50% (i.e., $\gamma = 0.5$), as this has been shown to be consistently effective in various semi-supervised settings [29].
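As a rough illustration of the online/offline interaction in Eqs. (3)-(5), the sketch below shows one MPL step on an unlabeled batch, with the momentum coefficient derived from the retention weight; best_path_decode_batch and ctc_supervised_loss are assumed helpers (the latter as sketched in Sec. 2.1), and the batch count is illustrative, so this is not the authors' code.

```python
import copy
import torch


def make_offline_model(online_model):
    """The offline model starts as a frozen copy of the seed/online parameters."""
    offline = copy.deepcopy(online_model)
    for p in offline.parameters():
        p.requires_grad_(False)
    return offline


@torch.no_grad()
def momentum_update(online_model, offline_model, alpha):
    """Eq. (5): xi <- alpha * xi + (1 - alpha) * theta."""
    for p_off, p_on in zip(offline_model.parameters(), online_model.parameters()):
        p_off.mul_(alpha).add_(p_on, alpha=1.0 - alpha)


def mpl_step(online, offline, optimizer, x, x_lens, alpha):
    """One MPL update on an unlabeled batch, with pseudo-labels made on the fly.

    `best_path_decode_batch` (greedy CTC decoding returning padded labels and
    their lengths) and `ctc_supervised_loss` (Eqs. (2)/(4)) are assumed helpers.
    """
    with torch.no_grad():
        logits = offline(x)                                    # offline forward pass
        labels, label_lens = best_path_decode_batch(logits)    # Eq. (3)
    loss = ctc_supervised_loss(online, x, x_lens, labels, label_lens)  # Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    momentum_update(online, offline, alpha)                    # Eq. (5), after every update
    return loss.item()


# Momentum coefficient from the per-epoch retention weight gamma:
# after K updates the seed contributes alpha**K = gamma, so alpha = gamma**(1/K).
K = 1000       # illustrative number of batches per epoch
gamma = 0.5    # 50% of the seed retained after one epoch, as in Sec. 2.2
alpha = gamma ** (1.0 / K)
```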

3 Proposed Improvements for Momentum Pseudo-Labeling

We propose to enhance MPL by (1) introducing the Conformer architecture [9] to improve the overall accuracy, and (2) adopting iterative pseudo-labeling (IPL) [23] to transfer LM knowledge into the seed model. We expect these approaches to promote the MPL training by enabling the models to generate higher-quality pseudo-labels.

3.1 Conformer for semi-supervised training

Conformer is a variant of Transformer augmented with convolution to increase the capability for capturing local feature patterns [9]. In addition to the multi-head self-attention layer in the Transformer encoder, Conformer introduces a module based on depthwise separable convolution [33]. Unlike Transformer, Conformer employs relative positional encoding and macaron-like feed-forward layers.

While Conformer-based models have achieved outstanding ASR performance compared with standard Transformers [34], we empirically observe that Conformer suffers from poor generalization from labeled to unlabeled data. A similar issue has been reported in other ASR tasks [35, 36, 37]. Simply adopting Conformer for MPL makes the training unstable and prone to divergence, especially when a domain mismatch exists between the labeled and unlabeled data.

We attribute this problem to unreliable statistics computed and used by batch normalization (BN) [38] in the convolution module. As we assume the amount of labeled data to be relatively small (i.e., 100 hours), the mean and variance estimated over the whole dataset are likely to become less accurate in BN [39]. A simple solution is to increase the mini-batch size. However, we observe that a large mini-batch size degrades the seed model performance, which can in turn degrade the quality of pseudo-labels during the MPL training. Hence, we replace BN with group normalization (GN) [40] in the convolution module, as has been investigated in [41, 35]. GN divides the feature maps into groups and normalizes the features within each group, which makes the training less dependent on the mini-batch size. This is found to be critical for stabilizing the Conformer-based MPL training, as examined in Sec. 4.2.
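As one way to realize this substitution, the snippet below swaps BatchNorm1d for GroupNorm (8 groups) inside a generic Conformer-style convolution module; the module structure and hyperparameters are a simplified sketch and only approximate the ESPnet implementation.

```python
import torch.nn as nn


class ConvModule(nn.Module):
    """Conformer-style convolution module with GroupNorm instead of BatchNorm."""

    def __init__(self, channels=256, kernel_size=31, num_groups=8):
        super().__init__()
        self.pointwise1 = nn.Conv1d(channels, 2 * channels, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=(kernel_size - 1) // 2, groups=channels)
        # nn.BatchNorm1d(channels) would normalize over the mini-batch;
        # GroupNorm normalizes within each utterance, independent of batch size.
        self.norm = nn.GroupNorm(num_groups, channels)
        self.activation = nn.SiLU()           # "Swish" activation
        self.pointwise2 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                     # x: (batch, channels, time)
        x = self.glu(self.pointwise1(x))
        x = self.activation(self.norm(self.depthwise(x)))
        return self.pointwise2(x)
```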

Input: labeled data D_L and unlabeled data D_U; an ASR model architecture; a momentum coefficient α
1:  # 1. Seed model training
2:  Train a seed model θ_seed on D_L using Eq. (2)
3:  # 2. Iterative pseudo-labeling
4:  for n = 1, ..., N_IPL do
5:      Generate pseudo-labels Ŷ for all X ∈ D_U using θ_seed and an LM with beam-search decoding
6:      for epoch = 1, ..., E_IPL do
7:          for all (X, Y) ∈ D_L ∪ {(X, Ŷ)} do
8:              Compute the loss for θ_seed with Eq. (2)
9:              Update θ_seed via gradient descent
10:         end for
11:     end for
12: end for
13: # 3. Momentum pseudo-labeling
14: Initialize an online model θ and an offline model ξ with θ_seed
15: for epoch = 1, ..., E_MPL do
16:     for all X ∈ D_L ∪ D_U do
17:         Obtain the label Y if X ∈ D_L, or the pseudo-label Ŷ via Eq. (3) if X ∈ D_U
18:         Compute the loss for θ with Eq. (2) or (4)
19:         Update θ via gradient descent
20:         Update ξ via Eq. (5)
21:     end for
22: end for
23: return the online model θ for final evaluation
Algorithm 1: Momentum pseudo-labeling using iterative pseudo-labeling for transferring LM knowledge into the seed model

3.2 Iterative pseudo-labeling for enhancing seed model

To provide the MPL training with a better model for initializing the online and offline models, we consider enhancing the seed model using iterative pseudo-labeling (IPL). IPL continuously trains a model with periodic regeneration of pseudo-labels, where an external LM and beam-search decoding are used to generate the labels [23]. While beam-search decoding with an LM plays an important role in generating high-quality pseudo-labels [22, 42], it is computationally too intensive for MPL due to the on-the-fly label generation. Hence, we exploit IPL to implicitly transfer LM knowledge to the seed model before applying MPL, providing the MPL training with a better initialization for generating higher-quality pseudo-labels. Moreover, by not using the LM-based pseudo-labels during the MPL training, we prevent the model from over-fitting to the LM training text data, which often degrades the generalization capability of the ASR model [26, 28].

Algorithm 1 shows the proposed MPL training with IPL initialization. In the beginning, a seed model is trained on the labeled set D_L as in Sec. 2.1 (lines 1–2). Then, the seed model is further trained via IPL with an LM and beam-search decoding (lines 3–12), as also sketched below. Here, we denote by N_IPL the number of IPL iterations (i.e., pseudo-label updates) and by E_IPL the number of epochs trained in each iteration; the procedure reduces to standard pseudo-labeling (PL) [22] when N_IPL = 1 and to IPL [23] when N_IPL > 1. Finally, the enhanced seed model is used to initialize the models for MPL (lines 13–23), and the MPL training lasts E_MPL epochs.
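A compact Python rendering of the IPL stage (lines 3–12 of Algorithm 1) could look as follows; generate_with_lm (beam-search decoding with the LM) and supervised_ctc_step (one gradient step on Eq. (2)) are hypothetical helpers, and the loop is a sketch rather than the authors' implementation.

```python
import random


def iterative_pseudo_labeling(model, labeled, unlabeled, lm, optimizer,
                              n_iters=8, n_epochs=25):
    """Sketch of lines 3-12 of Algorithm 1 (n_iters=1 recovers plain PL).

    `generate_with_lm` and `supervised_ctc_step` are hypothetical helpers:
    the former stands for beam-search decoding with the LM, the latter for
    one gradient step on the CTC loss of Eq. (2).
    """
    for _ in range(n_iters):
        # Periodically regenerate pseudo-labels for the whole unlabeled set.
        pseudo = [(x, generate_with_lm(model, lm, x)) for x in unlabeled]
        for _ in range(n_epochs):
            data = list(labeled) + pseudo
            random.shuffle(data)
            for x, y in data:
                supervised_ctc_step(model, optimizer, x, y)   # Eq. (2)
    return model
```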

In our prior work, we briefly discussed applying PL as an initialization strategy for MPL [29] and demonstrated its effectiveness. This work extends that early idea by focusing on the better-performing IPL. In Sec. 4.5, we also investigate how the quality of the LM used for PL influences the resulting MPL improvements.

4 Experiments

4.1 Experimental setting

Data: We conducted experiments using the LibriSpeech (LS) [43] and TEDLIUM3 (TED3) [44] datasets. LS is a corpus of read English speech, containing 960 hours of training data (split into “train-clean-100”, “train-clean-360”, and “train-other-500”). TED3 is a corpus of English TED Talks consisting of 450 hours of training data (“train-ted3”). We used the standard development and test sets of each dataset. Kaldi [45] was used to extract 80 mel-scale filterbank coefficients with three-dimensional pitch features. For text tokenization, we used a 1k subword vocabulary, which was constructed from the “train-clean-100” transcriptions using SentencePiece [46].

Semi-supervised settings: After training a seed model on the labeled “train-clean-100” (LS-100) set, we considered three semi-supervised settings with different unlabeled sets: LS-100/LS-360, an in-domain setting with “train-clean-360” (LS-360); LS-100/LS-860, an in-domain setting with “train-{clean-360,other-500}” (LS-860); and LS-100/TED3, an out-of-domain setting with “train-ted3”.

ASR model: We used the Conformer architecture [9] implemented in ESPnet [47], which consists of two convolutional neural network layers followed by a stack of 12 self-attention layers. The number of attention heads, the dimension of the self-attention layers, the dimension of the feed-forward network, and the convolution kernel size were set to 4, 256, 2048, and 31, respectively. We set the number of groups to 8 for group normalization when used in the convolution module.

Training configuration: We largely followed our prior work [29]. The seed model was trained for 150 epochs using the Adam optimizer [48] with Noam learning rate scheduling [49]. The semi-supervised training was run for up to 200 epochs, using the Adam optimizer for the gradient-based optimization. IPL was performed by iterating PL at most 8 times (N_IPL = 8), where the model was trained for 25 epochs (E_IPL = 25) in each iteration. Note that, after each iteration, we averaged model parameters over the last 5 checkpoints to stabilize the pseudo-label generation. We set γ = 0.5 for the MPL training, following our prior work [29].
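Checkpoint averaging of this kind amounts to averaging the parameter tensors of the saved state dicts; a minimal sketch, assuming PyTorch checkpoints saved as plain state_dict files (the file naming is illustrative):

```python
import torch


def average_checkpoints(paths):
    """Average model parameters over several saved checkpoints (state_dicts)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}


# e.g., averaging the last 5 checkpoints of an IPL iteration (paths are hypothetical):
# model.load_state_dict(average_checkpoints([f"ckpt_ep{e}.pt" for e in range(21, 26)]))
```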

Decoding configuration: For evaluation, a final model was obtained by averaging model parameters over the 10 checkpoints with the best validation performance. We trained an LM consisting of 4 long short-term memory (LSTM) layers with 2048 units, using the LS-100 transcriptions combined with the external text data provided by LibriSpeech [43]. For decoding with the LM, we adopted a frame-synchronous CTC prefix beam-search algorithm [50, 51] with a beam size of 20, a score-based pruning threshold of 14.0, an LM weight of 1.0, and an insertion bonus of 2.0. For decoding without an LM, we performed the best path decoding of CTC [4].
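Best path decoding of CTC reduces to taking the per-frame argmax token, collapsing repeats, and removing blanks; the following is a minimal sketch for a single utterance (the blank index is an assumption):

```python
import torch


def best_path_decode(logits, blank_id=0):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.

    logits: (frames, vocab) output of the CTC model for one utterance.
    """
    path = logits.argmax(dim=-1).tolist()          # most likely token per frame
    tokens, prev = [], None
    for t in path:
        if t != prev and t != blank_id:            # collapse repeats, skip blanks
            tokens.append(t)
        prev = t
    return tokens
```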

Model        Norm. type   LS dev-clean   LS dev-other   TED3 Dev
Transformer  –            12.2           30.0           31.2
Conformer    Batch         8.6           23.1           27.3
Conformer    Instance      8.9           23.5           27.1
Conformer    Group         8.4           22.5           26.4
Conformer    Layer         8.4           22.9           26.9

Table 1: Validation WER [%] for seed models trained on labeled LS-100. For the Conformer-based models, we explored different normalization methods for the convolution module.
Figure 1: Validation token error rate [%] of MPL training using Conformer with batch (dotted line) or group (solid line) normalization.
                                             |        Decoding without LM        |         Decoding with LM
Setting          ID  Method      Init.       | Dev WER    Test WER   Test WRR    | Dev WER    Test WER   Test WRR
                                             | (cl/oth)   (cl/oth)   (cl/oth)    | (cl/oth)   (cl/oth)   (cl/oth)
LS-100           L0  seed (Cfm)  –           | 8.4/22.5   8.6/23.3   0.0/0.0     | 5.2/15.2   5.5/16.0   0.0/0.0
LS-100 / LS-360  A0  MPL (Trf)   seed (Trf)  | 8.7/21.4   9.0/21.7   –/–         | 4.8/13.0   5.1/13.1   –/–
                 A1  MPL         L0          | 6.1/16.0   6.6/15.8   52.3/76.4   | 4.5/11.2   4.7/11.1   34.7/71.7
                 A2  PL          L0          | 5.7/15.9   6.1/15.8   64.6/76.0   | 4.3/11.4   4.5/11.8   40.6/62.2
                 A3  IPL         L0          | 5.4/15.1   5.7/15.3   73.3/81.5   | 4.2/11.5   4.5/11.7   42.2/62.5
                 A4  MPL         A2@ep100    | 5.7/15.5   6.1/15.6   64.8/77.8   | 4.2/11.1   4.5/11.3   44.0/69.3
                 A5  MPL         A3@ep100    | 5.5/15.0   5.6/15.1   75.1/83.3   | 4.1/10.8   4.3/11.1   51.4/72.7
                 A6  topline     L0          | 4.1/13.6   4.7/13.4   100.0/100.0 | 2.9/9.4    3.2/9.2    100.0/100.0
LS-100 / LS-860  B0  MPL (Trf)   seed (Trf)  | 8.1/16.5   8.3/16.8   –/–         | 4.6/9.7    4.8/10.1   –/–
                 B1  MPL         L0          | 5.7/12.2   6.2/12.2   48.1/76.4   | 4.1/8.5    4.4/8.7    36.5/74.6
                 B2  PL          L0          | 5.4/13.9   5.7/14.2   57.8/62.3   | 4.0/10.5   4.2/10.7   43.0/53.7
                 B3  IPL         L0          | 4.7/11.5   5.0/11.7   71.0/79.3   | 4.1/9.7    4.4/10.2   36.2/58.9
                 B4  MPL         B2@ep100    | 5.1/12.1   5.3/12.4   64.0/75.1   | 3.7/8.4    3.9/8.8    51.0/73.3
                 B5  MPL         B3@ep100    | 4.7/11.0   5.0/11.1   70.0/83.9   | 3.6/7.8    3.8/8.2    54.0/79.6
                 B6  topline     L0          | 3.3/9.0    3.5/8.7    100.0/100.0 | 2.4/6.1    2.5/6.2    100.0/100.0

Table 2: Word error rate (WER) [%] and WER recovery rate (WRR) [%] on the in-domain LibriSpeech (LS) settings, reported as clean/other for each column. The results are divided into two sections depending on whether beam-search decoding with the LM was applied in the final evaluation. “@ep100” indicates the model obtained after 100 epochs of training.
                                             |    Decoding without LM      |     Decoding with LM
Setting          ID  Method      Init.       | Dev WER  Test WER  Test WRR | Dev WER  Test WER  Test WRR
LS-100           L0  seed (Cfm)  –           | 26.4     26.5      0.0      | 21.3     21.1      0.0
LS-100 / TED3    C0  MPL (Trf)   seed (Trf)  | 18.4     17.0      –        | 14.9     13.3      –
                 C1  MPL         L0          | 15.1     13.9      81.0     | 12.7     11.6      77.3
                 C2  IPL         L0          | 16.8     16.8      62.2     | 16.6     16.9      34.2
                 C3  MPL         C2@ep100    | 14.6     13.8      81.1     | 12.4     12.0      73.8
                 C4  topline     L0          | 10.4     10.9      100.0    |  8.6      8.8      100.0

Table 3: WER [%] and WRR [%] on the out-of-domain TEDLIUM3 (TED3) setting.

4.2 Effectiveness of adopting Conformer for MPL

In Table 1, we compare the word error rate (WER) of seed models trained with the Transformer (Trf) or the Conformer (Cfm) architecture. For the Cfm-based models, we investigated different normalization methods for the convolution module, namely batch [38], instance [52], group [40], and layer [53] normalization (BN, IN, GN, LN). Note that IN and LN are equivalent to GN with group sizes of 1 and 256 (i.e., the full channel dimension), respectively. Comparing the two architectures, the Cfm-based models significantly improved over the Trf-based model. Among the Cfm-based models, GN resulted in the best performance on both LS and TED3, and the 100-hour training set seemed to be too small to take advantage of BN. As normalizing across feature maps (i.e., IN, GN, and LN) achieved better performance than BN on the out-of-domain TED3 set, BN appears to lower the generalization capability due to unreliable statistics. Note that in [41], BN achieved better performance than the other normalization methods when another depthwise separable convolution-based ASR model was trained on the full 960-hour set of LS.
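The relation between these normalization variants can be seen directly in how GroupNorm is instantiated; the snippet below is an illustrative PyTorch view (with the 256-channel convolution module of our encoder), not the exact ESPnet configuration.

```python
import torch.nn as nn

C = 256  # channel dimension of the convolution module

norm_variants = {
    "batch":    nn.BatchNorm1d(C),    # statistics shared across the mini-batch
    "instance": nn.GroupNorm(C, C),   # group size 1: one group per channel
    "group":    nn.GroupNorm(8, C),   # 8 groups of 32 channels (our setting)
    "layer":    nn.GroupNorm(1, C),   # group size 256: all channels in one group
}
```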

Figure 1 shows learning curves from MPL training using Cfm with BN or GN. In all semi-supervised settings, BN caused the training to become unstable. Especially in the out-of-domain setting with TED3, the model diverged more quickly than in the other settings. In contrast, GN successfully stabilized the MPL training with Cfm.

4.3 Results on in-domain setting

Table 2 lists results on the in-domain LS settings in terms of the WER and WER recovery rate (WRR) [54]. The topline results were obtained via fully supervised training in each setting. Looking at the MPL results (A1, B1), MPL led to a substantial improvement over the seed model (L0), effectively learning from the unlabeled data using Cfm with GN. These Cfm results significantly outperformed those of the prior Trf-based MPL [29] (A0, B0 vs. A1, B1). With pseudo-labels generated using the LM, PL [22] and IPL [23] achieved lower WERs on the “clean” sets than MPL, and IPL also outperformed MPL on the “other” sets (*2, *3 vs. *1). However, when decoded with the LM, the performance gain was larger for MPL with only a slight decrease in WRR, and MPL achieved much lower WERs on the “other” sets. PL and IPL, in contrast, showed smaller improvements with degraded WRRs, which indicates that they overfit to the LM knowledge and produce less varied hypotheses during beam-search decoding. *4 and *5 show results for the proposed MPL training using the seed model enhanced by PL and IPL, respectively. Note that we performed PL or IPL for 100 epochs and MPL for another 100 epochs to match the total number of training epochs of the other methods. The initialization strategy provided MPL with distinct improvements, pushing past the limits of the other methods (*4, *5 vs. *1, *2, *3). With the IPL-based initialization, MPL achieved the best overall performance in both settings with different amounts of unlabeled data (A5, B5). Moreover, when decoded with the LM, the improved MPL retained higher WRRs than IPL (*3 vs. *5), maintaining the advantage of MPL and making the model less dependent on the LM knowledge.

4.4 Results on out-of-domain setting

Table 3 shows the MPL results on the TED3 setting. Cfm with GN significantly improved MPL over the seed model and the Trf-based MPL (C1 vs. L0, C0), successfully stabilizing training on the out-of-domain data. IPL led to a decent improvement over the seed model, but the gain was more substantial for MPL (C1 vs. C2). As there is a domain mismatch between the LM training text and the actual transcriptions of TED3, IPL was less effective at learning from the out-of-domain unlabeled data. Moreover, IPL gained little from decoding with the LM, indicating that the model was prone to over-fitting to the LM knowledge. By using IPL to enhance the seed model, MPL further reduced the WERs (C1 vs. C3). However, the improvement was much smaller than those observed in the in-domain settings, and the standard MPL already performed sufficiently well when decoded with the LM.

4.5 Does a better language model lead to better MPL results?

                               Small LM         Large LM
Setting          Test data     PL     MPL       PL     MPL
LS-100 / LS-360  test-clean    6.3    6.2       6.2    6.1
                 test-other    16.8   15.9      16.4   15.6
LS-100 / LS-860  test-clean    6.2    5.7       5.7    5.3
                 test-other    15.3   12.9      14.5   12.4

Table 4: WER [%] for MPL initialized by PL with LMs of different quality.

In Table 4, we study how the quality of the LM used in PL affects the resulting MPL performance. We focus on the in-domain settings (A4, B4 in Table 2), where the initialization strategy was especially effective. We compare a small LM (1-layer LSTM) and a large LM (4-layer LSTM), whose validation perplexities were 20.9 and 14.3, respectively. PL was evaluated at epoch 100, and the resulting model was used to initialize MPL. As a result, the large LM led to better PL performance and, accordingly, improved MPL through better pseudo-label generation.

5 Conclusions

We proposed several improvements to momentum pseudo-labeling (MPL) for semi-supervised ASR. Experimental results on various semi-supervised settings demonstrated the effectiveness of the enhanced MPL, showing clear improvements over our prior results and other PL-based methods. Moreover, we investigated and shared the key components that make the proposed approaches effective, including the normalization method used in Conformer and the quality of the LM used for generating pseudo-labels. Future work should consider evaluating MPL in lower-resource scenarios (e.g., 10 hours of labeled data [55]).

References

  • [1] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. ICML, 2014.
  • [2] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho et al., “Attention-based models for speech recognition,” in Proc. NeurIPS, 2015.
  • [3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
  • [4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006.
  • [5] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [6] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. NeurIPS, 2014.
  • [7] D. Bahdanau et al., “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2014.
  • [8] L. Dong, S. Xu, and B. Xu, “Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. ICASSP, 2018.
  • [9] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar et al., “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech, 2020.
  • [10] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018.
  • [11] C. Lüscher, E. Beck, K. Irie, M. Kitza et al., “RWTH ASR systems for LibriSpeech: Hybrid vs attention,” in Proc. Interspeech, 2019.
  • [12] S. Karita, N. Chen, T. Hayashi, T. Hori et al., “A comparative study on Transformer vs RNN in speech applications,” in Proc. ASRU, 2019.
  • [13] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning,” IEEE Transactions on Neural Networks, vol. 20, no. 3, 2009.
  • [14] A. Tjandra, S. Sakti, and S. Nakamura, “Listening while speaking: Speech chain by deep learning,” in Proc. ASRU, 2017.
  • [15] T. Hori, R. Astudillo, T. Hayashi, Y. Zhang et al., “Cycle-consistency training for end-to-end speech recognition,” in Proc. ICASSP, 2019.
  • [16] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, “Deep contextualized acoustic representations for semi-supervised speech recognition,” in Proc. ICASSP, 2020.
  • [17] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020.
  • [18] Y. Zhang, J. Qin, D. S. Park, W. Han et al., “Pushing the limits of semi-supervised learning for automatic speech recognition,” arXiv preprint arXiv:2010.10504, 2020.
  • [19] M. K. Baskar, L. Burget, S. Watanabe, R. F. Astudillo et al., “EAT: Enhanced ASR-TTS for self-supervised speech recognition,” in Proc. ICASSP, 2021.
  • [20] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Proc. ICML, 2013.
  • [21] H. Scudder, “Probability of error of some adaptive pattern-recognition machines,” IEEE Trans. Inf. Theory, vol. 11, no. 3, 1965.
  • [22] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in Proc. ICASSP, 2020.
  • [23] Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun et al., “Iterative pseudo-labeling for speech recognition,” in Proc. Interspeech, 2020.
  • [24] Y. Chen, W. Wang, and C. Wang, “Semi-supervised ASR by end-to-end self-training,” in Proc. Interspeech, 2020.
  • [25] D. S. Park, Y. Zhang, Y. Jia, W. Han et al., “Improved noisy student training for automatic speech recognition,” in Proc. Interspeech, 2020.
  • [26] S. Khurana, N. Moritz, T. Hori, and J. Le Roux, “Unsupervised domain adaptation for speech recognition via uncertainty driven self-training,” in Proc. ICASSP, 2021.
  • [27] N. Moritz, T. Hori, and J. Le Roux, “Semi-supervised speech recognition via graph-based temporal classification,” in Proc. ICASSP, 2021.
  • [28] T. Likhomanenko, Q. Xu, J. Kahn, G. Synnaeve et al., “slimIPL: Language-model-free iterative pseudo-labeling,” in Proc. Interspeech, 2021.
  • [29] Y. Higuchi, N. Moritz, J. Le Roux, and T. Hori, “Momentum pseudo-labeling for semi-supervised speech recognition,” in Proc. Interspeech, 2021.
  • [30] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Proc. NeurIPS, 2017.
  • [31] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” arXiv preprint arXiv:1612.02695, 2016.
  • [32] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu et al., “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019.
  • [33] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. CVPR, 2017.
  • [34] P. Guo, F. Boyer, X. Chang, T. Hayashi et al., “Recent developments on ESPnet toolkit boosted by Conformer,” in Proc. ICASSP, 2021.
  • [35] B. Li, A. Gulati, J. Yu, T. N. Sainath et al., “A better and faster end-to-end model for streaming ASR,” in Proc. ICASSP, 2021.
  • [36] Y. C. Liu, E. Han, C. Lee, and A. Stolcke, “End-to-end neural diarization: From Transformer to Conformer,” in Proc. Interspeech, 2021.
  • [37] J. Kim, J. Lee, and Y. Lee, “Generalizing RNN-transducer to out-domain audio via sparse self-attention layers,” arXiv preprint arXiv:2108.10752, 2021.
  • [38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, 2015.
  • [39] S. Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batch-normalized models,” in Proc. NeurIPS, 2017.
  • [40] Y. Wu and K. He, “Group normalization,” in Proc. ECCV, 2018.
  • [41] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang et al., “QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions,” in Proc. ICASSP, 2020.
  • [42] E. Wallington, B. Kershenbaum, O. Klejch, and P. Bell, “On the learning dynamics of semi-supervised training for ASR,” in Proc. Interspeech, 2021.
  • [43] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015.
  • [44] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko et al., “TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in Proc. SPECOM, 2018.
  • [45] D. Povey, A. Ghoshal, G. Boulianne, L. Burget et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
  • [46] T. Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” in Proc. ACL, 2018.
  • [47] S. Watanabe, T. Hori, S. Karita, T. Hayashi et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018.
  • [48] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
  • [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit et al., “Attention is all you need,” in Proc. NeurIPS, 2017.
  • [50] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs,” arXiv preprint arXiv:1408.2873, 2014.
  • [51] N. Moritz, T. Hori, and J. Le Roux, “Streaming end-to-end speech recognition with joint CTC-attention based models,” in Proc. ASRU, 2019.
  • [52] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
  • [53] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [54] J. Ma and R. Schwartz, “Unsupervised versus supervised training of acoustic models,” in Proc. Interspeech, 2008.
  • [55] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov et al., “Libri-light: A benchmark for ASR with limited or no supervision,” in Proc. ICASSP, 2020.