Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning

10/01/2021 ∙ by Dongseong Hwang, et al.

Self- and semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance model performance. However, these approaches mostly focus on in-domain performance for public datasets. In this study, we utilize a combination of self- and semi-supervised learning methods to solve the unseen-domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source domain data with a small fraction of the target domain data (3%) can recover the performance gap compared to a full data baseline: relative 13.5% improvement for target domain data.


1 Introduction

It is natural to have poor speech recognition accuracy with a small labeled dataset. There are two major approaches to resolving this problem: self-supervised learning (self-sup) and semi-supervised learning (semi-sup). Self-sup pre-trains the audio encoder to learn meaningful representations. Contrastive Predictive Coding (CPC) [12], Wav2vec [16] and Wav2vec2.0 [1] show that contrastive losses can be good pre-training objectives. Autoregressive Predictive Coding (APC) [2] shows that mean squared error (MSE) can also be a good objective. Semi-sup trains the ASR model using pseudo labels generated by a teacher model. Noisy student training (NST) [19] is a popular semi-supervised learning approach, which also works for ASR [14, 18]. Recently, it has also been shown that combining self- and semi-sup methods significantly improves the performance of automatic speech recognition (ASR) [20], leading to the state-of-the-art (SOTA) Librispeech Word Error Rate (WER). However, the current methods focus on maximizing performance for in-domain data. The largest datasets in these studies are the labeled Librispeech data (1k hours) [13] and the unlabeled LibriLight data (60k hours) [6].

In large-scale production ASR systems, there is often a mismatch between the domain of the training data (source domain) and that of the real-world data (target domain). It is also common to have datasets an order of magnitude larger, spanning various contexts (e.g. command, chat, caption), than public datasets like Librispeech. To the best of our knowledge, there are no extensive studies tackling large-scale domain adaptation using both self- and semi-supervised learning.

In this paper, we propose a combined self- and semi-sup approach for domain adaptation. Our method completely recovers target domain accuracy using only a small fraction of labeled target data. In addition, the improved generalization enhances source domain accuracy as well. Another main contribution of our work is an analysis of how much self- and semi-sup each contribute to domain adaptation: self-supervised learning improves overall model generalization, while semi-sup directly closes the out-of-domain generalization gap, which means the two are complementary.

2 Related works

We use three popular self-supervised learning methods in this work: Wav2vec [12, 16], Wav2vec2.0 [1], and APC [2].

The Wav2vec loss maximizes the mutual information between the context latent vector and the future inputs [16]. Instead of convolution layers, we use conformer blocks [4] for the context network. As we directly use the log-mel features instead of the waveform, we exclude the feature network; the output of the audio encoder directly predicts the future log-mel features.
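To make this objective concrete, below is a minimal sketch of such a future-prediction contrastive (CPC/wav2vec-style) loss on log-mel features. It assumes the context-network outputs and input frames are already available as tensors; the prediction horizon `k`, the projection head, and all shapes are illustrative choices, not the configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def future_prediction_contrastive_loss(context, logmel, proj, k=3):
    """CPC/wav2vec-style loss: predict log-mel frames k steps ahead.

    context: (B, T, D) outputs of the causal context network (conformer stack).
    logmel:  (B, T, F) input log-mel frames, used directly as targets.
    proj:    linear head mapping context dim D to feature dim F (assumed).
    """
    B, T, _ = context.shape
    pred = proj(context[:, : T - k])      # (B, T-k, F) predicted future frames
    target = logmel[:, k:]                # (B, T-k, F) actual future frames

    # Dot-product similarity of every prediction against every candidate frame
    # of the same utterance; the time-aligned frame is the positive, the rest
    # act as negatives (InfoNCE).
    logits = torch.einsum("btf,bsf->bts", pred, target)          # (B, T-k, T-k)
    labels = torch.arange(T - k, device=logits.device).repeat(B)  # positive index = same time step
    return F.cross_entropy(logits.reshape(-1, T - k), labels)

# Toy usage with random tensors (shapes are arbitrary).
B, T, D, Fdim = 2, 50, 256, 80
proj = torch.nn.Linear(D, Fdim)
loss = future_prediction_contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, Fdim), proj)
```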

The Wav2vec2.0 loss maximizes the mutual information between the context latent vector and the masked input features [1]. We exclude the feature network for the same reason. We do not quantize the input features as we do not observe any performance difference in our experiments.
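A similarly minimal sketch of a masked contrastive objective without quantization, as described above; the masking scheme, the shapes, and the choice of drawing distractors from all frames of the same utterance are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, features, mask):
    """Wav2vec2.0-style loss without a quantizer: at each masked position,
    identify the true (unmasked) input feature among all frames of the
    same utterance.

    context:  (B, T, D) encoder outputs computed from the masked input.
    features: (B, T, D) projected, unmasked input features (the targets).
    mask:     (B, T) boolean tensor, True where the input was masked.
    """
    # Similarity of every context vector against every candidate target frame.
    logits = torch.einsum("btd,bsd->bts", context, features)   # (B, T, T)
    T = logits.shape[1]
    labels = torch.arange(T, device=logits.device).expand_as(mask)

    # Only masked positions contribute to the loss.
    return F.cross_entropy(logits[mask], labels[mask])
```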

The APC loss minimizes the MSE between the input log-mel features and the log-mel features predicted by the audio encoder [2]. The APC paper uses LSTM or transformer blocks, but we use conformer blocks.
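The APC objective is simpler; a sketch follows, assuming the audio encoder's output is projected back to the log-mel dimension and the prediction horizon is a free hyper-parameter (both are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def apc_loss(encoder_out, logmel, proj, k=3):
    """APC-style loss: regress log-mel frames k steps ahead with MSE.

    encoder_out: (B, T, D) outputs of the causal audio encoder.
    logmel:      (B, T, F) input log-mel frames.
    proj:        linear head mapping D to F (assumed, for dimension matching).
    k:           prediction horizon in frames (illustrative value).
    """
    pred = proj(encoder_out[:, :-k])   # predictions for frames t+k
    target = logmel[:, k:]             # the actual future frames
    return F.mse_loss(pred, target)
```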

Noisy student training (NST) [14] is one of the most popular semi-sup methods. It trains a model with both labeled and unlabeled data: the teacher model produces pseudo labels for the unlabeled data from non-augmented input features, and the student model is then trained with augmented input features using the pseudo labels as ground truth. The current Librispeech SOTA paper [20] uses NST with Librispeech data (labeled) and LibriLight data (unlabeled). Unlike in the image domain, it is hard to use soft labels for sequence models, so NST in the speech domain [14, 20] uses hard labels.
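A schematic of one NST round in hedged, pseudocode-style Python: `teacher`, `student`, `augment`, `decode`, and `train_step` stand in for the actual models, SpecAugment-style augmentation, beam-search decoding, and the training loop, none of which are specified by this sketch.

```python
def noisy_student_round(teacher, student, labeled_data, unlabeled_audio,
                        augment, decode, train_step):
    """One NST round with hard pseudo labels (all helper names are hypothetical)."""
    # 1. Teacher produces hard pseudo labels from *non-augmented* audio.
    pseudo_labeled = [(audio, decode(teacher, audio)) for audio in unlabeled_audio]

    # 2. Student is trained on augmented inputs, treating the pseudo labels
    #    as ground truth alongside the real labeled data.
    for audio, transcript in labeled_data + pseudo_labeled:
        train_step(student, augment(audio), transcript)
    return student
```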

3 Methods

3.1 Self-Supervised learning

All of the self-supervised methods are used to pre-train the audio encoder of the RNN-T model [3] using all the source and target domain data. This is followed by supervised training of the entire model using only the labeled source domain data. However, with this 2-stage training approach, it is difficult to determine the optimal pre-training hyper-parameters and checkpoints because the frame accuracy metrics in self-sup are not always a good indicator of the final WER performance. To resolve this issue, we propose Combined Self-supervised Learning, where a multi-task objective is used to combine the RNN-T loss and self-sup loss:

$$\mathcal{L} = \mathcal{L}_{\text{RNN-T}} + \alpha \, \mathcal{L}_{\text{self-sup}} \qquad (1)$$

where $\alpha$ is a hyper-parameter weighting the self-supervised loss. We find that a fixed value of $\alpha$ works well.

Wav2vec2.0 is a bi-directional method; when we train it with a causal audio encoder, it does not converge. Our workaround is to pre-train with right context and subsequently fine-tune with left context only. Combined Wav2vec2.0 and RNN-T training does not have this issue: the combined training works on a causal audio encoder because the RNN-T loss can guide the Wav2vec2.0 objective. As our RNN-T model is online, we use a causal audio encoder for all combined experiments. We add an additional MLP layer between the audio encoder and the joint layer, and another between the audio encoder and the self-supervised loss, so the representation of the last audio encoder layer is specialized by two MLP heads for the RNN-T loss and the self-supervised loss respectively.
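A sketch of the combined objective and the two specialization heads, assuming generic `rnnt_loss` and `selfsup_loss` callables on top of the shared encoder output; the MLP sizes and the default weighting are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class CombinedHeads(nn.Module):
    """Two MLP heads on top of the shared (causal) audio encoder: one feeds
    the RNN-T joint network, the other feeds the self-supervised loss."""
    def __init__(self, enc_dim, hidden=512):
        super().__init__()
        self.rnnt_head = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, enc_dim))
        self.ssl_head = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, enc_dim))

    def forward(self, enc_out):
        return self.rnnt_head(enc_out), self.ssl_head(enc_out)

def combined_loss(enc_out, heads, rnnt_loss, selfsup_loss, alpha=1.0):
    """L = L_RNN-T + alpha * L_self-sup (Eq. 1); alpha here is a placeholder value."""
    rnnt_in, ssl_in = heads(enc_out)
    return rnnt_loss(rnnt_in) + alpha * selfsup_loss(ssl_in)
```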

3.2 Semi-supervised learning

After self-supervised pre-training, we train the ASR model using RNN-T loss with both source domain data (labeled data) and target domain data (unlabeled data) [9]. NST produces pseudo labels for target domain data.

The teacher model is trained with source domain data (the same as the student model). As a result, the pseudo labels generated for the target domain data are error-prone, which is harmful for domain adaptation. When pseudo labels are generated, the teacher model therefore filters out low-confidence utterances using a Confidence Estimation Module (CEM) [15]. When the teacher model is trained with the RNN-T loss, we add a CEM whose inputs are the audio encoding and the beam-search labels from the RNN-T model. The CEM is trained to minimize the binary cross entropy between the estimated confidence and a binary target sequence that contains a 1 when the predicted word is correct and a 0 otherwise. The average word-level confidence is used to filter utterances. The teacher's multi-task training objective consists of both the RNN-T loss and the confidence loss.
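A sketch of the word-level confidence training target and the utterance filtering rule, under the assumption that a CEM-like module emits one confidence logit per predicted word; the threshold and helper names are illustrative, not the values used in this work.

```python
import torch
import torch.nn.functional as F

def confidence_loss(word_logits, word_correct):
    """Binary cross entropy between estimated word confidences and the
    0/1 correctness targets (1 where the predicted word is correct)."""
    return F.binary_cross_entropy_with_logits(word_logits, word_correct.float())

def keep_utterance(word_logits, threshold=0.9):
    """Filtering rule: keep an utterance for pseudo labeling only if its
    average word-level confidence exceeds the threshold (illustrative value)."""
    confidences = torch.sigmoid(word_logits)   # per-word confidence in [0, 1]
    return confidences.mean().item() > threshold
```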

Figure 1: Domain adaptation: Self-sup pre-trains the audio encoder. Supervised and semi-sup train the model.

3.3 Domain adaptation approach

Figure 1 visualizes the proposed domain adaptation method. First, self-sup trains the audio encoder with source and target domain data. Then, the RNN-T model is trained with the RNN-T loss on source and target domain data, where NST produces the pseudo labels for the target domain data.

4 Experimental Setup

4.1 Model Architecture Details

We use a Conformer RNN-T [3] model in the experiments. The audio encoder is a stack of Conformer blocks. As the model is an online ASR model, we restrict it from using any future information [7]: each conformer block [4, 7] uses causal convolution and self-attention with left context only. The RNN-T decoder consists of a label encoder with LSTM layers projected down to the output units, and a joint network with a single feed-forward layer, as shown in Fig 1. The architecture and training procedure are similar to those of the LibriSpeech SOTA model [20]. The total number of weights is 137M, while that of the SOTA model is 1B.

The model input vector consists of stacked contiguous frames of log-mel features [11], sub-sampled in time and concatenated with a one-hot domain-id vector. The model uses word pieces [17] as sub-word labels.
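A sketch of how such an input vector can be assembled from log-mel frames via frame stacking, time sub-sampling, and a one-hot domain id; the stacking factor, sub-sampling rate, and number of domains below are placeholders, since the exact values are not reproduced here.

```python
import torch

def build_model_input(logmel, domain_id, stack=4, subsample=3, num_domains=16):
    """Stack contiguous log-mel frames, sub-sample in time, and append a
    one-hot domain-id vector. All sizes here are illustrative placeholders.

    logmel: (T, F) log-mel features for one utterance.
    """
    T, F = logmel.shape
    T_trim = (T // stack) * stack
    stacked = logmel[:T_trim].reshape(T_trim // stack, stack * F)   # frame stacking
    stacked = stacked[::subsample]                                  # time sub-sampling

    one_hot = torch.zeros(stacked.shape[0], num_domains, dtype=logmel.dtype)
    one_hot[:, domain_id] = 1.0                                     # domain-id one-hot
    return torch.cat([stacked, one_hot], dim=1)    # (T', stack*F + num_domains)
```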

4.2 Data Sets

We use large multi-domain (MD) data sets [10] in English. MD utterances cover multiple domains such as search, farfield, telephony and YouTube. All datasets are anonymized and hand-transcribed, except for YT, which uses YouTube video transcriptions with confidence filtering [8]. We divide the MD data into source domain data and target domain data: we pre-train the model on source domain data with self-sup and transfer it to the target domain with semi-sup. As shown in Table 1, we use Medium-form (MF) utterances as the target domain and MD-MF as the source domain. MF utterances average 10.4 seconds in length and come from natural conversation; Short-form (SF) utterances average 4.6 seconds and come from the voice command domain.

For evaluation, we calculate the WER on MF to measure performance on the target domain and the WER on SF to measure performance on the source domain. The goal of domain adaptation is to minimize the amount of required MF transcription while maintaining the WERs on both MF and SF. In the final experiments, we use 3% of MF as labeled data, because 800 hours of data is a manageable amount for hand-transcription. NST produces pseudo labels for the remaining 97% of the MF data.
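For completeness, a standard word error rate computation (word-level edit distance normalized by the reference length); this is the usual definition, not code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```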

Data set Domain Hours
Medium-form (MF) Target k
Short-form (SF) Source k
Youtube (YT) Source k
Multi-domain (MD) Source + Target k
MD-MF (MDsrc) Source k
3% MF + MDsrc (MD3p) Source + Target k
Table 1: Overview of training data sets. MF denotes the target domain utterances. MD denotes all domain utterances including MF, SF and YT. MDsrc (MD-MF) denotes the source domain utterances.

5 Experimental results

In this section, we conduct extensive domain adaptation experiments using self- and semi-supervised learning.

5.1 Self-supervised Learning

In Table 2, we compare the WERs of supervised and self-supervised learning. The first block lists supervised experiments with MD, MDsrc, and MD3p data. MD contains the full MF data, MDsrc contains no MF data, and MD3p contains 3% of the MF data. As expected, the SF WERs are the same, but the MF WERs improve as more MF data is used.

The second block lists three different self-sup experiments. One of the three self-sup methods pre-trains the audio encoder with MD, and the RNN-T loss then trains the RNN-T model with MDsrc. Wav2vec and APC obtain better WERs than Wav2vec2.0 on both MF and SF, unlike what the Wav2vec2.0 paper reported [1]. The downstream ASR model is an online RNN-T, which is a causal model; Wav2vec and APC are causal models like GPT-3, whereas Wav2vec2.0 is a full-context (non-causal) model like BERT. This suggests that causal self-sup performs better for a causal downstream task. Even though Wav2vec and APC have the same WERs, we use Wav2vec for the rest of the experiments. In our experience, APC is more sensitive to checkpoint fluctuations: when choosing a pre-trained checkpoint, Wav2vec works between 50k and 1.2M steps, but APC works only near 100k steps. In addition, APC requires a total variation auxiliary loss to stabilize it [5].

In the third block of Table 2, Wav2vec MD3p improves the WERs for both the source domain (SF, 6.0 to 5.8) and the target domain (MF, 3.7 to 3.6) compared to Supervised MD3p. However, there is still a large gap between Supervised MD and Wav2vec MD3p. Self-sup enhances overall model generalization, but cannot close the out-of-domain (OOD) generalization gap by itself.

Algorithms Data Word Error Rate (%)
MF (Target) SF (Source)
Supervised MD 3.2 6.0
Supervised MDsrc 6.2 6.0
Supervised MD3p 3.7 6.0
Wav2vec MDsrc
Wav2vec2.0 MDsrc
APC MDsrc
Wav2vec MD3p 3.6 5.8
Table 2: Comparison of Word Error Rates (WER) between supervised baselines and the three self-sup algorithms.

5.2 Semi-Supervised Learning

The first block in Table 3 shows semi-sup results. Semi-sup MDsrc denotes RNN-T training using MDsrc as labeled data and MF as unlabeled data. Semi-sup MD3p denotes RNN-T training using MD3p as labeled data and 97% of MF as unlabeled data; NST produces the pseudo labels for the unlabeled data. Compared to Supervised MDsrc, Semi-sup MDsrc improves the MF (target domain) WER from 6.2 to 3.4, leaving only a 0.2% gap to Supervised MD. Semi-sup can thus close most of the OOD generalization gap, even without labeled target domain data. Semi-sup MD3p closes all of the gap; its MF WER matches the full data baseline at 3.2%. All semi-supervised experiments have the same SF WER as the baseline, as all of them use the same amount of source domain data.

The teacher model for semi-sup has almost the same architecture as the student model. The one difference is that the teacher is non-causal: its self-attention layers use right context and its convolution layers are not causal. The teacher model uses the same amount of data as the student model; it is trained with MDsrc for Semi-sup MDsrc and with MD3p for Semi-sup MD3p.

Algorithms Data Word Error Rate (%)
MF (Target) SF (Source)
Semi-sup MDsrc 3.4 6.0
Semi-sup MD3p 3.2 6.0
Self + Semi-sup MD3p 3.1 5.7
Table 3: WERs for semi-supervised learning.

5.3 Self + Semi-Supervised Learning

The second block in Table 3 shows self + semi-sup WERs. In Table 2, Supervised MD uses all the labels of both the source domain (MDsrc) and target domain (MF) data. Wav2vec MD3p has a better SF (source domain) WER than this supervised baseline because self-sup improves overall generalization, as mentioned in Section 5.1. Semi-sup MD3p has the same MF (target domain) WER as the baseline because semi-sup resolves OOD generalization, as mentioned in Section 5.2. Combining them shows that self- and semi-sup are complementary: in Table 3, Self + Semi-sup MD3p shows even better WERs for both MF (target domain) and SF (source domain). Our domain adaptation method not only closes the entire OOD generalization gap but also improves source domain performance.

5.4 Confidence filter for Semi-Supervised Learning

In Table 3, all semi-supervised experiments use a confidence filter [15] that discards target domain (MF) utterances whose utterance-level confidence score falls below a threshold. This drops a fraction of MF utterances and improves the teacher WER on the target domain. Without the confidence filter, both MF and SF WERs are slightly worse.

Using only high-confidence utterances results in higher-quality pseudo labels, which improves the MF (target domain) WER from 3.2 to 3.1. As the teacher is trained with MD3p, it tends to be more confident on source-domain-like SF data. This results in an additional SF (source domain) WER improvement from 5.8 to 5.7. It indicates that the confidence score threshold should be tuned for each fraction of labeled target data.

Algorithms Word Error Rate (%)
MF (Target) SF (Source)
Self + Semi-sup w/o filter 3.2 5.8
Self + Semi-sup w/ filter 3.1 5.7
Table 4: WERs of combined self- and semi-sup with/without confidence filter on MD3p.

5.5 Combined Self-Supervised Learning

In Fig 2, Combined W2V2 (green line) reaches WER 4.6 at 300k steps, while W2V2 requires 900k steps for both pre-training and supervised learning; the total number of training steps is reduced significantly. In addition, Combined W2V2 reaches WER 4.4, which W2V2 never reaches. This is because Combined W2V2 applies the Wav2vec2.0 loss on the causal encoder, so there is no transition gap from non-causal to causal, unlike the Wav2vec2.0 experiment; it brings out the full potential of the Wav2vec2.0 objective. However, Combined W2V (purple line) requires the same number of steps as W2V (blue line) to converge, as both utilize a causal audio encoder.

In Table 5, Combined W2V + NST and Combined W2V2 + NST have an additional semi-sup stage. They have the same MF (target domain) WER as Self + Semi-sup MD3p in Table 3. However, their SF (source domain) WER is worse than that of Self + Semi-sup MD3p (5.7%). We leave the weaker in-domain generalization of combined training for future study.

Figure 2: Comparison of WER convergence with and without combined training for wav2vec and wav2vec2.0.
Algorithms Word Error Rate (%)
MF (Target) SF (Source)
Combined W2V
Combined W2V2
Combined W2V + NST 3.1
Combined W2V2 + NST 3.1
Table 5: WERs of combined self-supervised learning with MD3p.

6 Conclusion

In this paper, we investigated a domain adaptation method that combines self- and semi-supervised learning. Self-sup improves overall generalization and semi-sup closes the out-of-domain generalization gap; the two methods complement each other. Extensive experimental results demonstrate that the proposed method, using only 3% of the labeled target domain data, obtains better WERs on both the target and source domains than the baseline.

7 Acknowledgements

We thank Hasim Sak, Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Yonghui Wu, Ruoming Pang, Yu Zhang, Arun Narayanan, Tony Bruguier, Lillian Zhou, Petr Zadrazil, Zhehuai Chen, Andrew Rosenberg for helpful discussions.

References

  • [1] A. Baevski, H. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477. Cited by: §1, §2, §2, §5.1.
  • [2] Y. Chung and J. Glass (2020) Generative pre-training for speech with autoregressive predictive coding. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 3497–3501. Cited by: §1, §2, §2.
  • [3] A. Graves (2012) Sequence transduction with recurrent neural networks. In International Conference on Machine Learning: Representation Learning Workshop. Cited by: §3.1, §4.1.
  • [4] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §2, §4.1.
  • [5] Z. Huo, D. Hwang, K. C. Sim, S. Garg, A. Misra, N. Siddhartha, T. Strohman, and F. Beaufays (2021) Incremental layer-wise self-supervised learning for efficient speech domain adaptation on device. arXiv preprint arXiv:2110.00155. External Links: 2110.00155 Cited by: §5.1.
  • [6] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, et al. (2020) Libri-light: a benchmark for asr with limited or no supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7669–7673. Cited by: §1.
  • [7] B. Li, A. Gulati, J. Yu, T. N. Sainath, C. Chiu, A. Narayanan, S. Chang, R. Pang, Y. He, J. Qin, et al. (2021) A better and faster end-to-end model for streaming asr. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5634–5638. Cited by: §4.1.
  • [8] H. Liao, E. McDermott, and A. Senior (2013) Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription. In IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 368–373. Cited by: §4.2.
  • [9] A. Misra, D. Hwang, Z. Huo, S. Garg, N. Siddhartha, A. Narayanan, and K. C. Sim (2021) A comparison of supervised and unsupervised pre-training of end-to-end models. In Interspeech, Cited by: §3.2.
  • [10] A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani (2018) Toward domain-invariant speech recognition via large scale training. In 2018 IEEE Spoken Language Technology Workshop (SLT), Vol. , pp. 441–447. Cited by: §4.2.
  • [11] A. Narayanan, R. Prabhavalkar, C. Chiu, D. Rybach, T. N. Sainath, and T. Strohman (2019) Recognizing long-form speech using streaming end-to-end models. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 920–927. Cited by: §4.1.
  • [12] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §2.
  • [13] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. Cited by: §1.
  • [14] D. S. Park, Y. Zhang, Y. Jia, W. Han, C. Chiu, B. Li, Y. Wu, and Q. V. Le (2020-10) Improved noisy student training for automatic speech recognition. Interspeech 2020. External Links: Link, Document Cited by: §1, §2.
  • [15] D. Qiu, Q. Li, Y. He, Y. Zhang, B. Li, L. Cao, R. Prabhavalkar, D. Bhatia, W. Li, K. Hu, et al. (2021) Learning word-level confidence for subword end-to-end asr. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6393–6397. Cited by: §3.2, §5.4.
  • [16] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pp. 3465–3469. External Links: Document, Link Cited by: §1, §2, §2.
  • [17] M. Schuster and K. Nakajima (2012) Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. Cited by: §4.1.
  • [18] A. Xiao, C. Fuegen, and A. Mohamed (2021) Contrastive semi-supervised learning for asr. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3870–3874. Cited by: §1.
  • [19] Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698. Cited by: §1.
  • [20] Y. Zhang, J. Qin, D. S. Park, W. Han, C. Chiu, R. Pang, Q. V. Le, and Y. Wu (2020) Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504. Cited by: §1, §2, §4.1.