Speech recognition accuracy naturally suffers when only a small labeled dataset is available. There are two major approaches to resolving this problem: self-supervised learning (self-sup) and semi-supervised learning (semi-sup). Self-sup pre-trains the audio encoder to learn meaningful representations. Contrastive Predictive Coding (CPC), Wav2vec, and Wav2vec2.0 show that contrastive losses can be good pre-training objectives; Autoregressive Predictive Coding (APC) shows that mean squared error (MSE) can also be a good objective. Semi-sup trains the ASR model using pseudo labels generated by a teacher model. Noisy student training (NST) is a popular semi-supervised learning approach, which also works for ASR [14, 18]. Recently, it has also been shown that combining self- and semi-sup methods improves automatic speech recognition (ASR) performance significantly, leading to the state-of-the-art (SOTA) Librispeech Word Error Rate (WER). However, these methods have focused on maximizing performance on in-domain data, and the largest datasets in these studies are the labeled Librispeech data (1k hours) and the unlabeled LibriLight data (60k hours).
In large-scale production ASR systems, there is often a mismatch between the domain of the training data (source domain) and that of the real-world data (target domain). It is also common to have datasets an order of magnitude larger than public datasets like Librispeech, covering various contexts (e.g., command, chat, caption). To the best of our knowledge, there are no extensive studies tackling large-scale domain adaptation using both self- and semi-supervised learning.
In this paper, we propose a combined self- and semi-sup approach for domain adaptation. Our method completely recovers target domain accuracy using only a small fraction of labeled target data. In addition, the improved generalization power enhances source domain accuracy as well. Another main contribution of our work is an analysis of how much self- and semi-sup each contribute to domain adaptation: self-supervised learning improves overall model generalization, while semi-sup directly closes the out-of-domain generalization gap, which means the two are complementary.
2 Related works
The Wav2vec loss maximizes the mutual information between the context latent vector and the future inputs. Instead of the convolution layers, we use conformer blocks  for the context network. As we directly use the log-mel features instead of the waveform, we exclude the feature network. The output of the audio encoder directly predicts the future log-mel features.
The Wav2vec2.0 loss maximizes the mutual information between the context latent vector and the masked input features . We exclude the feature network for the same reason. We do not quantize the input features as we do not observe any performance difference in our experiments.
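As an illustration of this family of contrastive objectives, here is a minimal pure-Python sketch of an InfoNCE-style loss over masked time steps. The function name, cosine scoring, temperature value, and use of all other frames as distractors are our illustrative choices, not the paper's implementation:

```python
import math

def info_nce_loss(context, targets, masked_idx, temperature=0.1):
    """Contrastive loss sketch: at each masked step t, the context vector
    should score high against the true target frame t and low against the
    other frames, which act as distractors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cos(a, b):
        na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
        return dot(a, b) / (na * nb + 1e-8)

    losses = []
    for t in masked_idx:
        logits = [cos(context[t], z) / temperature for z in targets]
        m = max(logits)                                   # numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[t])                  # -log softmax at t
    return sum(losses) / len(losses)
```

A context network that reproduces the true frame at each masked position drives this loss toward zero, while a mismatched one does not.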
The APC loss minimizes the MSE between the input log-mel features and the log-mel features predicted by the audio encoder. The APC paper uses LSTM or transformer blocks, but we use conformer blocks.
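The shifted-MSE idea can be sketched in a few lines; the 3-frame shift and the function names below are illustrative, not the paper's settings:

```python
def apc_loss(encoder_out, log_mel, shift=3):
    """APC-style objective sketch: the encoder output at frame t is treated
    as a prediction of the log-mel frame t + shift, and the loss is the
    mean squared error between prediction and actual future frame."""
    preds = encoder_out[:len(encoder_out) - shift]   # predictions for t+shift
    targets = log_mel[shift:]                        # actual future frames
    errs = [(p - t) ** 2
            for pred, tgt in zip(preds, targets)
            for p, t in zip(pred, tgt)]
    return sum(errs) / len(errs)
```

An encoder that outputs exactly the frame `shift` steps ahead achieves zero loss.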
Noisy student training (NST) is one of the most popular semi-sup methods. It trains a model with both labeled and unlabeled data: the teacher model produces pseudo labels for the unlabeled data from non-augmented input features, and the student model is then trained on augmented input features with the pseudo labels as the ground truth. The current Librispeech SOTA paper uses NST with Librispeech data (labeled) and LibriLight data (unlabeled). Unlike in the image domain, it is hard to use soft labels for sequence models, so NST in the speech domain [14, 20] uses hard labels.
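One NST round can be sketched abstractly as follows; all callables here are placeholders standing in for the real teacher model, data augmentation, and student training, not an actual ASR API:

```python
def noisy_student_round(teacher, train_student, labeled, unlabeled, augment):
    """One noisy-student round (sketch): the teacher produces hard pseudo
    labels from non-augmented inputs, then the student is trained on
    augmented inputs with those pseudo labels as ground truth."""
    pseudo_labeled = [(augment(x), teacher(x)) for x in unlabeled]
    return train_student(labeled + pseudo_labeled)
```

The key asymmetry is that the teacher sees clean input while the student sees augmented input, which is what makes the student "noisy".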
3.1 Self-Supervised learning
All of the self-supervised methods are used to pre-train the audio encoder of the RNN-T model using all the source and target domain data. This is followed by supervised training of the entire model using only the labeled source domain data. However, with this 2-stage training approach, it is difficult to determine the optimal pre-training hyper-parameters and checkpoints, because the frame accuracy metrics in self-sup are not always a good indicator of the final WER performance. To resolve this issue, we propose Combined Self-supervised Learning, where a multi-task objective combines the RNN-T loss and self-sup loss:

$\mathcal{L} = \mathcal{L}_{\text{RNN-T}} + \lambda \mathcal{L}_{\text{self-sup}},$

where $\lambda$ is a hyper-parameter balancing the two losses. We find that a fixed value of $\lambda$ works well.
Wav2vec2.0 is a bi-directional method: when we train it with a causal audio encoder, it does not converge. Our workaround is to pre-train with right context and subsequently fine-tune with left context only. Combined Wav2vec2.0 and RNN-T training does not have this issue; the combined training works on a causal audio encoder because the RNN-T loss can guide the Wav2vec2.0 objective. As our RNN-T model is online, we use a causal audio encoder for all combined experiments. We add an additional MLP layer between the audio encoder and the joint layer, and another between the audio encoder and the self-supervised loss. The representation of the last audio encoder layer is thus specialized by two MLP layers, one for the RNN-T loss and one for the self-supervised loss.
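The two projection heads on top of the shared encoder output, together with the multi-task objective, can be sketched as follows. The dimensions, the ReLU activation, and the weight of 0.5 are illustrative choices, not the paper's settings:

```python
import random

random.seed(0)

def mlp_layer(d_in, d_out):
    """One MLP (projection) layer with its own weights and a ReLU."""
    w = [[random.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    def apply(x):
        return [max(sum(xi * w[i][j] for i, xi in enumerate(x)), 0.0)
                for j in range(d_out)]
    return apply

d_model = 4
rnnt_head = mlp_layer(d_model, d_model)      # specializes output for the joint layer
selfsup_head = mlp_layer(d_model, d_model)   # specializes output for the self-sup loss

encoder_out = [0.5, -1.0, 2.0, 0.3]          # last audio-encoder layer (illustrative)
rnnt_repr = rnnt_head(encoder_out)
selfsup_repr = selfsup_head(encoder_out)

def combined_loss(rnnt_loss, selfsup_loss, lam=0.5):
    """Multi-task objective: RNN-T loss plus weighted self-sup loss."""
    return rnnt_loss + lam * selfsup_loss
```

Each head sees the same shared representation but learns its own specialization, so the two losses do not compete directly at the last encoder layer.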
3.2 Semi-supervised learning
After self-supervised pre-training, we train the ASR model using the RNN-T loss with both source domain data (labeled) and target domain data (unlabeled). NST produces the pseudo labels for the target domain data.
The teacher model is trained with source domain data (the same data as the student model). As a result, the pseudo labels generated for the target domain data are error-prone, which is harmful for domain adaptation. When pseudo labels are generated, the teacher model therefore filters out low-confidence utterances using a Confidence Estimation Module (CEM). When the teacher model is trained by the RNN-T loss, we add a CEM whose inputs are the audio encoding and the beam search labels from the RNN-T model. The CEM is trained to minimize the binary cross entropy between the estimated confidence and a binary target sequence, which contains a 1 when the predicted word is correct and a 0 otherwise. The average word-level confidence is used to filter utterances. The teacher's multi-task training objective consists of both the RNN-T loss and the confidence loss.
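A minimal sketch of the CEM training target and the utterance-level filter, in pure Python; the 0.9 threshold and the dictionary layout are our illustrative choices, not the paper's values:

```python
import math

def confidence_bce(confidences, correct):
    """Binary cross entropy between estimated word-level confidences and the
    0/1 correctness targets (CEM training objective sketch)."""
    eps = 1e-7
    total = 0.0
    for c, y in zip(confidences, correct):
        c = min(max(c, eps), 1 - eps)
        total += -(y * math.log(c) + (1 - y) * math.log(1 - c))
    return total / len(confidences)

def filter_utterances(utterances, threshold=0.9):
    """Keep only utterances whose average word-level confidence clears the
    threshold; low-confidence pseudo labels are dropped."""
    return [u for u in utterances
            if sum(u["word_conf"]) / len(u["word_conf"]) >= threshold]
```

Training pushes confidences toward 1 on correctly predicted words and toward 0 on errors, so the average word confidence becomes a usable utterance-level quality score.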
3.3 Domain adaptation approach
Figure 1 visualizes the proposed domain adaptation method. First, self-sup pre-trains the audio encoder with source and target domain data. Then, the RNN-T model is trained by the RNN-T loss with source and target domain data, where NST produces the pseudo labels for the target domain data.
4 Experimental Setup
4.1 Model Architecture Details
We use a Conformer RNN-T model in the experiments. The audio encoder has Conformer blocks with model dimension . As the model is an online ASR model, we restrict it from using any future information. Each conformer block [4, 7] uses causal convolution and left-context attention layers. The convolution kernel size is and the self-attention layer consists of heads with left context length. The RNN-T decoder consists of a label encoder with LSTM layers with units projected down to output units, and a joint network with a single feed-forward layer with units, as shown in Fig 1. The architecture and training procedure are similar to those of the LibriSpeech SOTA model. The total number of weights is 137M, while that of the SOTA model is 1B.
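The left-context attention constraint can be sketched as a boolean mask over time steps; the function below is illustrative (the actual left context length is elided in the source):

```python
def left_context_mask(seq_len, left_context):
    """Streaming attention mask sketch: frame t may attend to itself and to
    at most `left_context` past frames, and never to future frames."""
    mask = []
    for t in range(seq_len):
        lo = max(0, t - left_context)
        mask.append([lo <= j <= t for j in range(seq_len)])
    return mask
```

Applying this mask inside self-attention is what makes the encoder causal and therefore usable for online ASR.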
4.2 Data Sets
We use large multi-domain (MD) data sets in English. MD utterances cover multiple domains such as search, farfield, telephony, and YouTube. All datasets are anonymized and hand-transcribed, except for the YouTube data, which uses YouTube video transcription with a confidence filter. We divide the MD data into source domain data and target domain data: we pre-train the model on source and target domain data with self-sup and transfer it to the target domain with semi-sup. As shown in Table 1, we use Medium-form (MF) utterances as the target domain and MD-MF as the source domain. MF utterances average 10.4 seconds in length and come from the natural conversation domain; Short-form (SF) utterances average 4.6 seconds and come from the voice command domain.
For evaluation metrics, we calculate the WER on MF to measure performance on the target domain and the WER on SF for performance on the source domain. The goal of domain adaptation is to minimize the amount of required MF transcription while maintaining WERs on both MF and SF. In the final experiments, we use 3% of MF as labeled data, because 800 hours is a manageable amount for hand-transcription. NST produces pseudo labels for the remaining 97% of the MF data.
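The WER metric used throughout is the standard word-level edit distance; a self-contained sketch:

```python
def word_error_rate(ref, hyp):
    """WER: (substitutions + insertions + deletions) / reference word count,
    computed via the Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```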
Table 1: Training data sets.

| Data set | Role | Hours |
|---|---|---|
| Multi-domain (MD) | Source + Target | k |
| 3% MF + MDsrc (MD3p) | Source + Target | k |
5 Experimental results
In this section, we conduct extensive domain adaptation experiments using self- and semi-supervised learning.
5.1 Self-supervised Learning
In Table 2, we compare the WERs of supervised and self-supervised learning. The first block lists supervised experiments with MD, MDsrc, and MD3p data. MD has full MF data. MDsrc doesn’t have any MF data. MD3p has 3% MF data. As expected, SF WERs are the same but MF WERs are better when more MF data is used.
The second block lists three different self-sup experiments: one of the three self-sup methods pre-trains the audio encoder with MD, and the RNN-T loss then trains the RNN-T model with MDsrc. Wav2vec and APC have better WERs on both MF and SF than Wav2vec2.0, unlike what the Wav2vec2.0 paper reported. The downstream ASR model is an online RNN-T, which is a causal model. Wav2vec and APC are causal methods like GPT-3, but Wav2vec2.0 is a full-context (non-causal) model like BERT. This suggests that causal self-sup performs better for a causal downstream task. Even though Wav2vec and APC achieve the same WERs, we use Wav2vec for the rest of the experiments because, in our experience, APC is more sensitive to checkpoint fluctuations: when choosing a pre-trained checkpoint, Wav2vec works between 50k and 1.2M steps, but APC works only near 100k steps. In addition, APC requires a total variation auxiliary loss to stabilize training.
In the third block of Table 2, Wav2vec MD3p improves WERs on both the source domain (SF, 6.0 to 5.8) and the target domain (MF, 3.7 to 3.6) compared to Supervised MD3p. However, a large gap remains between Supervised MD and Wav2vec MD3p. Self-sup enhances overall model generalization, but cannot close the out-of-domain (OOD) generalization gap by itself.
Table 2: Word Error Rate (%) by algorithm and data, reported on MF (Target) and SF (Source).
5.2 Semi-Supervised Learning
The first block in Table 3 shows semi-sup results. Semi-sup MDsrc denotes RNN-T training using MDsrc as labeled data and MF as unlabeled data; Semi-sup MD3p denotes RNN-T training using MD3p as labeled data and 97% of MF as unlabeled data. NST produces the pseudo labels for the unlabeled data. Compared to Supervised MDsrc, Semi-sup MDsrc improves the MF (target domain) WER from 6.2 to 3.4, leaving only a 0.2% gap to Supervised MD. Semi-sup can thus close most of the OOD generalization gap, even without labeled target domain data. Semi-sup MD3p closes the gap entirely: its MF WER matches the full-data baseline at 3.2%. All semi-supervised experiments have the same SF WER as the baseline, as all of them use the same amount of source domain data.
The teacher model for semi-sup has almost the same architecture as the student model, the one difference being that it is non-causal: the self-attention uses right context and the convolution layers are not causal. The teacher model uses the same amount of data as the student model; it is trained with MDsrc for Semi-sup MDsrc and with MD3p for Semi-sup MD3p.
Table 3: Word Error Rate (%).

| Algorithms | Data | MF (Target) | SF (Source) |
|---|---|---|---|
| Self + Semi-sup | MD3p | 3.1 | 5.7 |
5.3 Self + Semi-Supervised Learning
The second block in Table 3 shows self + semi-sup WERs. In Table 2, Supervised MD uses all the labels of both the source domain (MDsrc) and target domain (MF) data. Wav2vec MD3p has a better SF (source domain) WER than the supervised baseline because self-sup improves overall generalization, as mentioned in Section 5.1. Semi-sup MD3p has the same MF (target domain) WER as the baseline because semi-sup resolves OOD generalization, as mentioned in Section 5.2. Combining both shows that self- and semi-sup are complementary: in Table 3, Self + Semi-sup MD3p achieves even better WERs on both MF (target domain) and SF (source domain). Our domain adaptation method not only closes the OOD generalization gap entirely but also improves source domain performance.
5.4 Confidence filter for Semi-Supervised Learning
In Table 3, all semi-supervised experiments use a confidence filter that discards target domain (MF) utterances whose utterance-level confidence score falls below the threshold. This drops a fraction of the MF utterances and improves the teacher WER on the target domain. Without the confidence filter, both MF and SF WERs are slightly worse.
Using only high-confidence utterances results in higher quality pseudo labels, which improves the MF (target domain) WER from 3.2 to 3.1. As the teacher is trained with MD3p, it tends to be more confident on source-domain-like SF data, which yields an additional SF (source domain) WER improvement from 5.8 to 5.7. This indicates that the confidence score threshold should be tuned for each fraction of labeled target data.
Table 4: Word Error Rate (%) with and without the confidence filter.

| Algorithms | MF (Target) | SF (Source) |
|---|---|---|
| Self + Semi-sup w/o filter | 3.2 | 5.8 |
| Self + Semi-sup w/ filter | 3.1 | 5.7 |
5.5 Combined Self-Supervised Learning
In Fig 2, Combined W2V2 (green line) reaches a WER of 4.6 at 300k steps, while W2V2 requires 900k steps for pre-training and supervised learning combined, so the total number of training steps is reduced significantly. In addition, Combined W2V2 reaches a WER of 4.4, which W2V2 never reaches. This is because Combined W2V2 applies the Wav2vec2.0 loss on the causal encoder, so there is no transition gap from non-causal to causal, unlike in the Wav2vec2.0 experiment; this brings out the full potential of the Wav2vec2.0 objective. However, Combined W2V (purple line) requires the same number of steps as W2V (blue line) to converge, as both use a causal audio encoder.
In Table 5, Combined W2V + NST and Combined W2V2 + NST add a semi-sup stage. They match the MF (target domain) WER of Self + Semi-sup MD3p in Table 3, but their SF (source domain) WER is worse than that of Self + Semi-sup MD3p (5.7%). We leave the weaker in-domain generalization power of combined training for future study.
Table 5: Word Error Rate (%).

| Algorithms | MF (Target) | SF (Source) |
|---|---|---|
| Combined W2V + NST | 3.1 | |
| Combined W2V2 + NST | 3.1 | |
6 Conclusion

In this paper, we investigated a domain adaptation method that combines self- and semi-supervised learning. Self-sup improves overall generalization and semi-sup closes the out-of-domain generalization gap, so the two methods complement each other. Extensive experimental results demonstrate that the proposed methods, using only 3% of the labeled target domain data, obtain better WERs on both the target and source domains than the baseline.
We thank Hasim Sak, Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Yonghui Wu, Ruoming Pang, Yu Zhang, Arun Narayanan, Tony Bruguier, Lillian Zhou, Petr Zadrazil, Zhehuai Chen, Andrew Rosenberg for helpful discussions.
References

- (2020) Wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
- (2020) Generative pre-training for speech with autoregressive predictive coding. In ICASSP 2020, pp. 3497–3501.
- Sequence transduction with recurrent neural networks. In International Conference on Machine Learning: Representation Learning Workshop.
- (2020) Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
- (2021) Incremental layer-wise self-supervised learning for efficient speech domain adaptation on device. arXiv preprint arXiv:2110.00155.
- (2020) Libri-Light: A benchmark for ASR with limited or no supervision. In ICASSP 2020, pp. 7669–7673.
- (2021) A better and faster end-to-end model for streaming ASR. In ICASSP 2021, pp. 5634–5638.
- Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 368–373.
- (2021) A comparison of supervised and unsupervised pre-training of end-to-end models. In Interspeech.
- (2018) Toward domain-invariant speech recognition via large scale training. In IEEE Spoken Language Technology Workshop (SLT), pp. 441–447.
- (2019) Recognizing long-form speech using streaming end-to-end models. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 920–927.
- (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- (2015) Librispeech: An ASR corpus based on public domain audio books. In ICASSP 2015, pp. 5206–5210.
- (2020) Improved noisy student training for automatic speech recognition. In Interspeech 2020.
- (2021) Learning word-level confidence for subword end-to-end ASR. In ICASSP 2021, pp. 6393–6397.
- (2019) wav2vec: Unsupervised pre-training for speech recognition. In Interspeech 2019, pp. 3465–3469.
- (2012) Japanese and Korean voice search. In ICASSP 2012, pp. 5149–5152.
- (2021) Contrastive semi-supervised learning for ASR. In ICASSP 2021, pp. 3870–3874.
- Self-training with noisy student improves ImageNet classification. pp. 10687–10698.
- (2020) Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504.