
Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition

06/11/2022
by A. Arunkumar, et al.
Indian Institute Of Technology, Madras

Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of these models optimizes a different loss, which gives rise to the possibility of their features being complementary. This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pre-trained models. We hypothesize that this results in a richer feature representation and show results for the ASR downstream task. To this end, we use three SSL models that have shown excellent results on ASR tasks, namely HuBERT, wav2vec2.0, and WavLM. We explore the ensemble of models fine-tuned for the ASR task and the ensemble of features using the embeddings obtained from the pre-trained models for a downstream ASR task. We obtain improved performance over the individual models and pre-trained features using LibriSpeech (100 h) and the WSJ dataset for the downstream tasks.


1 Introduction

Self-Supervised Learning (SSL) models [oord2018representation, chung2019unsupervised, schneider2019wav2vec, baevski2019vq, liu2020mockingjay, baevski2020wav2vec, 9053569, 9478264, hsu2021hubert, chen2021wavlm]

have been shown to provide significant improvements for various downstream tasks such as Automatic Speech Recognition (ASR), phoneme recognition, speaker verification, etc. Progress in ASR for many languages has been significantly hampered by the lack of good-quality transcribed data. Self-supervised methods are highly desirable in such scenarios since they only require large amounts of audio-only data, which are more readily available. Several pre-trained SSL models are now available in the public domain and can be used to build good downstream ASR models. These models vary in the loss function used during self-supervised learning and in the type, nature, and amount of data used for self-supervision. We look at three models that have state-of-the-art ASR results on the Libri subsets as well as on the SUPERB benchmark

[baevski2020wav2vec, hsu2021hubert, chen2021wavlm, yang2021superb], namely wav2vec2.0, HuBERT, and WavLM. Each of these self-supervised training methods is motivated by a different objective (the first two losses are written schematically after the list below):

  • wav2vec2.0 masks the latent speech representations and solves a contrastive task with respect to a quantised version of the masked latents.

  • HuBERT discovers acoustic units using a clustering approach, which is used to label the input features. Masking is then applied to the input features, and training minimises a masked-prediction loss with the cluster labels as targets.

  • WavLM adds a gated relative position bias to the transformer structure and, in addition to a HuBERT-style masked-prediction loss, also applies a denoising task during self-supervised learning.
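
As a rough schematic (our restatement of the objectives described in [baevski2020wav2vec, hsu2021hubert]; the symbols are defined here only for illustration and the published formulations differ in detail), the first two losses over a set of masked time steps M can be written as:

    % wav2vec2.0: contrastive loss with context vector c_t, true quantised latent q_t,
    % candidate set Q_t (q_t plus distractors), cosine similarity sim(.,.) and temperature kappa
    \mathcal{L}_{\mathrm{w2v2}} = -\sum_{t \in M} \log
        \frac{\exp\big(\mathrm{sim}(c_t, q_t)/\kappa\big)}
             {\sum_{\tilde{q} \in Q_t} \exp\big(\mathrm{sim}(c_t, \tilde{q})/\kappa\big)}

    % HuBERT: masked prediction (cross-entropy) over discovered cluster labels z_t,
    % predicted from the masked input \tilde{X}
    \mathcal{L}_{\mathrm{HuBERT}} = -\sum_{t \in M} \log p\big(z_t \mid \tilde{X}, t\big)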

The last two models use non-contrastive criteria and therefore do not require the large batch sizes that are important when training wav2vec2.0 models. As different SSL models optimise different objective functions, each of them learns to extract a different set of features. This opens up the possibility that the extracted features are complementary in nature. In this paper, an ensemble method that combines such complementary features from different pre-trained self-supervised models is investigated. Apart from the objective functions being different, each of these models has been trained on varying amounts of self-supervised data and is readily available in the public domain. Similarly, many of these models are also available after fine-tuning for various tasks.

The following state-of-the-art self-supervised models, which are readily available for download from HuggingFace [wolf-etal-2020-transformers], are used in this paper (a sketch of how such checkpoints can be loaded is given after the list):

  • "facebook/wav2vec2-base" trained on LibriSpeech 960 hrs data [95M parameters]

  • "facebook/hubert-base-ls960" trained on LibriSpeech 960 hrs data [95M parameters]

  • "microsoft/wavlm-base" trained on LibriSpeech 960 hrs data [94.70M parameters]

  • "facebook/wav2vec2-large-lv60" trained on Libri-Light 60k hours data [317.3M parameters]

  • "facebook/hubert-xlarge-ll60k" trained on Libri-Light 60k hours data [316.6M parameters]

  • "microsoft/wavlm-large" trained on a 94k hours data mix [60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli] [316.6M parameters]

  • Fine-tuned models "facebook/wav2vec2-large-960h-lv60-self" and "facebook/hubert-xlarge-ls960-ft", which are pre-trained on Libri-Light 60k hours of unlabeled data and fine-tuned on LibriSpeech 960 hours of labeled data.
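
As referenced above, the following minimal sketch shows one way to load such a checkpoint with the HuggingFace transformers library and extract frozen last-layer features. It is an illustration only (random audio in place of a real utterance, no batching or padding handling) and not the SpeechBrain-based pipeline used for the experiments in this paper.

    # Minimal sketch: extract frozen last-layer SSL features with HuggingFace transformers.
    # A random waveform stands in for a real 16 kHz utterance.
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    ckpt = "facebook/wav2vec2-base"            # any of the checkpoints listed above
    feature_extractor = AutoFeatureExtractor.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)
    model.eval()                               # frozen: used only for feature extraction

    waveform = torch.randn(16000 * 3)          # 3 seconds of fake 16 kHz audio
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    features = outputs.last_hidden_state       # shape: (1, num_frames, hidden_dim)
    print(features.shape)                      # hidden_dim is 768 for the base models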

2 Fine-Tuning for the ASR Task

Pre-trained self-supervised models can be used in downstream ASR tasks in two ways:

  • We fine-tune the pre-trained self-supervised model using the supervised downstream dataset. This is done by adding a linear CTC layer [graves2006connectionist] on top of the last-layer embeddings of the pre-trained model and optimising for the character output obtained from supervision. In this approach, all the parameters of the pre-trained model are also updated (after being frozen for a few initial iterations), which is usually expensive. For example, wav2vec2.0 fine-tuning on the Libri splits takes at least 50 V100 GPU hours [baevski2020wav2vec].

  • We freeze the pre-trained self-supervised model and use the extracted features for downstream tasks. Since the SSL model parameters are not updated, a simple linear CTC layer on top during fine-tuning is not sufficient to obtain good results. In practice, a couple of BiLSTM layers or transformer encoder layers are added on top of the pre-trained features, followed by a CTC layer. During fine-tuning, the BiLSTM/transformer layers as well as the CTC layer are optimised for the supervised character output. Often a learnt linear combination of the SSL layer outputs is used as the feature before feeding the transformer encoder layers; a minimal sketch of this frozen-feature setup is given after this list.
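
The second (frozen) strategy is the one followed in the rest of this paper. Below is a minimal PyTorch sketch of such a downstream head: a learnt weighted combination of the hidden-layer outputs of a frozen SSL model, a small transformer encoder, and a linear CTC layer. The class and parameter names are ours, chosen for illustration; this is not the actual SpeechBrain recipe used in our experiments.

    # Sketch of a frozen-SSL downstream head: learnable layer weights + transformer encoder + CTC.
    # Names (DownstreamASRHead, num_chars, ...) are illustrative only.
    import torch
    import torch.nn as nn

    class DownstreamASRHead(nn.Module):
        def __init__(self, num_layers: int, feat_dim: int, num_chars: int,
                     n_encoder_layers: int = 2):
            super().__init__()
            # One trainable weight per SSL hidden layer (softmax-normalised at use time).
            self.layer_weights = nn.Parameter(torch.zeros(num_layers))
            # feat_dim must be divisible by nhead (768 and 1024 both work with nhead=8).
            encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                                       batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_encoder_layers)
            self.ctc_head = nn.Linear(feat_dim, num_chars + 1)   # +1 for the CTC blank

        def forward(self, hidden_states):
            # hidden_states: list of (batch, frames, feat_dim) tensors, one per SSL layer.
            stacked = torch.stack(hidden_states, dim=0)                # (L, B, T, D)
            w = torch.softmax(self.layer_weights, dim=0)
            mixed = (w[:, None, None, None] * stacked).sum(dim=0)      # weighted layer mix
            encoded = self.encoder(mixed)
            return self.ctc_head(encoded).log_softmax(dim=-1)          # for nn.CTCLoss

During fine-tuning, only the parameters of this head are trained with the CTC loss against the character targets, while the SSL backbone remains frozen.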

This paper considers the strategy of freezing the pre-trained self-supervised model and using the models or extracted features for ASR tasks. Ensemble learning is a common approach in deep learning, predicated on the premise that combining the outputs of multiple models is more effective than using a single model and typically produces superior results. There are various approaches to combining features from multiple models, including: 1) summation of features, 2) weighted averaging of features, 3) concatenation of features, and 4) soft mixing using an attention layer. In this paper, we combine features from different models by concatenating the extracted features and using a linear layer with CTC loss to learn the optimal combination of these concatenated features. However, since the SSL models are frozen, using only the CTC loss is not enough to guide the selection of the best feature vectors. Therefore, we also propose to use a Transformer encoder

[vaswani2017attention] on top of the ensemble of features to compute a soft mixture of the features, which allows for an even richer representation. A similar strategy is also adopted in the S3PRL and SUPERB benchmarks. We first describe an approach to combine SSL models that have been specifically fine-tuned for the ASR task. This is followed by a method to combine embeddings obtained from different pre-trained (but not fine-tuned) SSL models for a downstream ASR task.
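
For concreteness, the four combination options listed above can be sketched as follows for a list of feature tensors with matching frame rate and feature dimension. This is a toy illustration; in particular, the attention mixer shown is just one possible parameterisation, not the exact design used in our experiments.

    # Toy sketch of four ways to combine aligned feature tensors feats[i] of shape (B, T, D).
    import torch
    import torch.nn as nn

    def combine_sum(feats):                       # 1) summation of features
        return torch.stack(feats, dim=0).sum(dim=0)

    def combine_weighted(feats, weights):         # 2) weighted average (weights: learnable)
        w = torch.softmax(weights, dim=0)
        return (w[:, None, None, None] * torch.stack(feats, dim=0)).sum(dim=0)

    def combine_concat(feats):                    # 3) concatenation along the feature axis
        return torch.cat(feats, dim=-1)

    class AttentionMixer(nn.Module):              # 4) soft mixing using an attention layer
        def __init__(self, dim):
            super().__init__()
            self.score = nn.Linear(dim, 1)
        def forward(self, feats):
            stacked = torch.stack(feats, dim=2)                  # (B, T, n_models, D)
            alpha = torch.softmax(self.score(stacked), dim=2)    # per-frame model weights
            return (alpha * stacked).sum(dim=2)                  # (B, T, D)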

2.1 Ensemble Model

In this section, we describe our approach to combining SSL models that have been fine-tuned on the ASR task. Specifically, we consider the wav2vec2.0 model fine-tuned on the Libri-960 hour supervised data, as well as the HuBERT model fine-tuned on the Libri-960 hour supervised data. These models have been fine-tuned with a CTC linear layer on top of the pre-trained transformer encoders, with all the SSL model parameters being updated during fine-tuning (after being frozen for some initial iterations). In our proposed approach, we remove the final CTC layers from both fine-tuned models and concatenate the two final-layer embeddings. We then add a randomly initialised CTC linear layer on top of these concatenated features and fine-tune it on a small amount of training data for a few epochs. Since these are already well fine-tuned models, the few epochs of fine-tuning suffice to learn the mapping from the concatenated features to characters. Please note that the fine-tuned SSL model parameters are

not updated; only the CTC layer parameters on top of the concatenated features are learnt. This is shown in Figure 1.
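
A minimal sketch of this ensemble-model setup, using the two fine-tuned HuggingFace checkpoints listed in Section 1, is given below. It keeps the backbones frozen and trains only a new linear CTC layer on the concatenated final-layer embeddings; the training loop, data pipeline, and hyperparameters of our SpeechBrain-based experiments are not reproduced here, and the character-vocabulary size shown is illustrative.

    # Sketch: concatenate final-layer embeddings of two fine-tuned ASR models (their CTC
    # heads are ignored) and train only a fresh linear CTC layer on top.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2ForCTC, HubertForCTC

    w2v2 = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
    hubert = HubertForCTC.from_pretrained("facebook/hubert-xlarge-ls960-ft")
    for p in list(w2v2.parameters()) + list(hubert.parameters()):
        p.requires_grad = False                   # both fine-tuned backbones stay frozen

    concat_dim = w2v2.config.hidden_size + hubert.config.hidden_size   # 1024 + 1280 = 2304
    num_chars = 32                                # illustrative character-vocabulary size
    new_ctc_layer = nn.Linear(concat_dim, num_chars + 1)               # +1 for the CTC blank

    def ensemble_log_probs(input_values):
        with torch.no_grad():
            f1 = w2v2.wav2vec2(input_values).last_hidden_state         # (B, T, 1024)
            f2 = hubert.hubert(input_values).last_hidden_state         # (B, T, 1280)
        T = min(f1.size(1), f2.size(1))           # both use 20 ms frames; trim any off-by-one
        feats = torch.cat([f1[:, :T], f2[:, :T]], dim=-1)
        return new_ctc_layer(feats).log_softmax(dim=-1)                # feed to nn.CTCLoss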

Figure 1: Proposed method for Ensemble model to combine HuBERT and wav2vec2.0 fine-tuned ASR models.
Figure 2: Proposed method to combine the embeddings from pre-trained models for downstream ASR task

2.2 Ensemble Features

In this section, we describe how we combine the features or embeddings from the different pre-trained models for a downstream ASR task. Again, we freeze the pre-trained model parameters and do not allow them to update. Therefore, a simple CTC layer on top of the embeddings during fine-tuning will not give good performance; we need a few learnable layers on top. This is similar to the approach taken by the S3PRL and SUPERB benchmarks [yang2021superb], where a bidirectional LSTM on top of a pre-trained model is used for fine-tuning the downstream ASR task. The authors proposed to freeze the pre-trained model and use a weighted average of all the hidden-layer features with trainable weights. Instead of combining features from different layers of the same model, this work proposes to combine the last-layer features of different pre-trained models and pass these features to transformer encoder layers. We use transformer encoders instead of BiLSTMs, since they offer faster processing with improved representations. The proposed ensemble is generated by combining multiple state-of-the-art models. All models in the ensemble are frozen, and the feature representations collected from the final layer of each model are concatenated to produce the ensemble features. This is shown in Figure 2.
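
A condensed sketch of this ensemble-features setup is shown below: several frozen pre-trained models, their final-layer outputs concatenated frame-wise, and a small trainable transformer encoder with a linear CTC layer on top. As before, the class and parameter names are illustrative, and padding masks and the real data pipeline are omitted.

    # Sketch of ensemble features from frozen pre-trained SSL models: concatenate
    # final-layer features and train a small transformer encoder + CTC head on top.
    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class EnsembleFeatureASR(nn.Module):
        def __init__(self, ckpts=("microsoft/wavlm-base", "facebook/hubert-base-ls960"),
                     num_chars=32, n_encoder_layers=2):
            super().__init__()
            self.backbones = nn.ModuleList(AutoModel.from_pretrained(c) for c in ckpts)
            for p in self.backbones.parameters():           # pre-trained models stay frozen
                p.requires_grad = False
            concat_dim = sum(m.config.hidden_size for m in self.backbones)   # e.g. 768 + 768
            layer = nn.TransformerEncoderLayer(d_model=concat_dim, nhead=8,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_encoder_layers)
            self.ctc_head = nn.Linear(concat_dim, num_chars + 1)             # +1 for blank

        def forward(self, input_values):
            with torch.no_grad():
                feats = [m(input_values).last_hidden_state for m in self.backbones]
            T = min(f.size(1) for f in feats)               # align frame counts
            x = torch.cat([f[:, :T] for f in feats], dim=-1)
            return self.ctc_head(self.encoder(x)).log_softmax(dim=-1)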

3 Experiments

3.1 Ensemble Model

For all the experiments presented in this section, the large variants of wav2vec2.0 and HuBERT, pre-trained on Libri-Light 60k hours and fine-tuned on LibriSpeech 960 h, are considered. These fine-tuned models are available for download on HuggingFace [wolf-etal-2020-transformers]. All our experiments are done with the SpeechBrain toolkit [ravanelli2021speechbrain]. The CTC layer is removed from these models and the last-layer features are extracted. Note that these features are obtained from well fine-tuned ASR models, and therefore carry rich information about the final classification task of characters. While wav2vec2.0 has 1024-dimensional embeddings, HuBERT has 1280-dimensional embeddings. In our proposed approach, these features are concatenated before being fed to a new randomly initialised CTC layer. The parameters of this linear classification layer are learnt over a few epochs on a small amount of training data. The linear layer learns an appropriate combination of the concatenated features and a mapping of these features to characters. To enable a fair comparison, the features from the individual models are obtained by removing the CTC layer of the corresponding fine-tuned model; these features are fed to a randomly initialised CTC layer whose parameters are likewise learnt by running a few epochs on a small amount of training data.

Table 1 shows the result of combining the fine-tuned models using the approach of Figure 1. In this case, we fine-tune the randomly initialised CTC layer on 10 hours of Libri data for about 8 epochs; the remaining model parameters are frozen. As seen from the table, the ensemble of models gives a significant improvement over the individual models. Although both models have been fine-tuned on 960-hour LibriSpeech, we also test them on WSJ data using the same approach. Once again we find improved performance with the proposed approach when compared to the individual models.

Method                   LibriSpeech 10 h                                  WSJ 10 h
                         dev-clean  test-clean  dev-other  test-other      Test-Dev93  Test-Eval93
Wav2Vec2.0               4.13       4.31        6.91       7.21            37.21       35.36
HuBERT                   4.35       4.03        5.79       6.30             7.32        6.54
Wav2Vec2.0 + HuBERT      3.15       3.05        4.76       5.16             6.79        6.13

Table 1: Evaluating results of individual SSL methods and the ensemble model for in-domain (Libri-Light 10 h) and out-of-domain (10 h subset of WSJ) downstream data. Only a CTC layer is added on top of the features. No external language model is used.

3.2 Ensemble Features

In this section, we present results of using an ensemble of features from pre-trained models to improve the performance of the downstream ASR task. Both the base and large variants of the pre-trained models are used for the ensemble-feature experiments. Apart from wav2vec2.0 and HuBERT, we also consider the pre-trained WavLM model, since it is also readily available on HuggingFace [wolf-etal-2020-transformers]. All these experiments were implemented with the SpeechBrain toolkit [ravanelli2021speechbrain].

The last-layer features are extracted from the pre-trained models and then passed to the CTC layer or transformer network. Note that all the pre-trained models we use have been trained on LibriSpeech data in a self-supervised manner. We conduct experiments in two scenarios: (i) where the fine-tuning is done on matched data, namely the Libri subsets, and (ii) where the fine-tuning is done on data from a different domain, namely WSJ. Since these embeddings are obtained from pre-trained models that have not been tuned for the ASR task, and since we do not allow the pre-trained model parameters to be updated, a simple linear CTC layer on top does not give good performance. We need additional learnable layers, and in our case we use transformer encoder layers.

3.2.1 Finetuning with Same Domain Data

In this section, fine-tuning is done with data from the same domain, namely the LibriSpeech 100-hour data.

Feature Extraction Model          CTC layer only                                        2 encoder layers + CTC
                                  dev-clean  test-clean  dev-other  test-other          dev-clean  test-clean  dev-other  test-other

Base models
WavLM                             59.54      60.19       67.73      67.85                9.65      10.34       20.87      20.88
HuBERT                            67.96      68.43       76.33      76.20               12.41      13.37       24.80      25.16
Wav2Vec2.0                        98.84      98.54       99.16      98.97               28.56      29.68       44.99      46.05
WavLM + HuBERT                    51.02      51.57       60.43      60.59                8.72       9.25       19.63      19.60
WavLM + Wav2Vec2.0                60.08      60.64       68.43      68.57               11.58      12.59       23.38      23.60
Wav2Vec2.0 + HuBERT               68.59      68.49       76.95      76.19               13.77      14.34       26.19      27.00
Wav2Vec2.0 + HuBERT + WavLM       52.21      52.87       61.53      61.73                9.03       9.58       19.96      20.16

Large models
WavLM                             47.02      47.66       54.04      53.78                5.79       5.97       11.09      11.01
HuBERT                            58.98      58.44       62.64      62.19                9.94      10.36       13.54      13.93
Wav2Vec2.0                        99.99      99.99       99.97      99.98               98.59      98.57       98.67      98.64
WavLM + HuBERT                    41.17      41.61       46.38      46.06                5.60       5.49        9.76       9.53
WavLM + Wav2Vec2.0                47.44      48.01       53.94      53.89                6.64       6.99       12.22      12.21
Wav2Vec2.0 + HuBERT               56.47      55.84       66.77      60.69                9.64       9.90       12.96      13.35
Wav2Vec2.0 + HuBERT + WavLM       40.00      40.24       46.21      46.37
Table 2: A CTC-only model or a downstream transformer-encoder model with CTC is fine-tuned with the individual features and with the ensemble features extracted from the LibriSpeech 100 h data. In the first set of experiments, only a CTC layer is fine-tuned; in the second set, a two-layer transformer encoder with a CTC layer on top is fine-tuned.

The results are shown in Table 2. Since the features are extracted from pre-trained models and not from fine-tuned models, CTC-only fine-tuning is not good enough. It can be seen that the ensemble features improve over the individual features. Even though the performance is poor in the CTC-only case, the ensemble features are still relatively better than the individual features. The more interesting results are obtained with transformer encoder layers trained on top of these features. From Table 2, the following observations can be made:

  • SSL models trained on larger amounts of data consistently give better performance on the same downstream task, as expected.

  • WavLM consistently performs better than HuBERT, which in turn performs better than Wav2Vec2.0, in all cases for this ASR task.

  • Wav2vec2.0 features are significantly worse and therefore hurt performance when combined with other features. As observed in previous works, the final layers of WavLM and HuBERT capture information relevant to the ASR task, whereas for Wav2Vec2.0 it is the middle layers that are more appropriate.

  • Combining WavLM and HuBERT features gives the best performance for this task, significantly better than any of the individual models.

  • Compared to the best individual model, the WavLM+HuBERT features give a relative improvement of about 10%.

3.2.2 Finetuning with Different Domain Data

In this section, we fine-tune with WSJ data for the downstream ASR task. While the WSJ data is mostly modern English from the business news domain, LibriSpeech consists mostly of audiobooks of older English material from Project Gutenberg; there is therefore a domain mismatch. Since the domain of the labeled data used for fine-tuning is different from the data used for pre-training, we need a more complex model to learn well. Therefore, while in the previous section we could get good performance with a 2-layer encoder, for the mismatched WSJ task we use 8 encoder layers followed by CTC. In the first set of experiments, only a CTC layer is applied on top and fine-tuned; as expected, these models perform far worse in the mismatched case. In the next set of experiments, an eight-layer transformer encoder is fine-tuned. From Table 3 we make the following observations:

  • When fine-tuning on data from a different domain too, the WavLM model performs better than HuBERT, which in turn performs better than Wav2Vec2.0.

  • In this case also, the combination of WavLM+HuBERT gives the best performance.

  • The relative improvement from ensemble features is more significant with the large pre-trained models than with the base models.

  • Even for the WSJ task, the ensemble features provide about a 10% relative improvement over the best-performing individual features.

Feature Extraction Model          CTC layer only                 8 encoder layers + CTC
                                  Test-Dev93  Test-Eval93        Test-Dev93  Test-Eval93

Base models
WavLM                             66.81       65.28              15.77       14.93
HuBERT                            74.97       75.01              22.54       20.69
Wav2Vec2.0                                                       35.12       31.84
WavLM + HuBERT                    59.46       58.05              16.04       14.93
WavLM + Wav2Vec2.0                68.05       67.37              16.24       15.37
Wav2Vec2.0 + HuBERT               74.91       74.87              17.94       17.66
Wav2Vec2.0 + HuBERT + WavLM       60.14       59.79              16.40       15.43

Large models
WavLM                             55.83       55.08              11.09       10.37
HuBERT                            66.97       66.47              13.22       12.73
Wav2Vec2.0                        98.01       98.08
WavLM + HuBERT                    49.81       49.30               9.57        8.86
WavLM + Wav2Vec2.0                56.80       56.13              10.32        9.59
Wav2Vec2.0 + HuBERT               66.68       65.43              18.50       17.75
Wav2Vec2.0 + HuBERT + WavLM       48.79       47.62              15.09       13.89
Table 3: Evaluating results of individual SSL features and ensemble features fine-tuned on the out-of-domain WSJ downstream data. Either a CTC layer only or an 8-layer transformer encoder with CTC is added on top of the frozen features. No external language model is used.

4 Conclusion

In this paper, we have explored the use of ensembles of models and of embeddings from pre-trained models to improve performance over individual SSL methods. In all cases, we have used publicly available models. The motivation for using an ensemble is that different SSL methods employ different objective functions, such as masked-prediction loss or contrastive loss, and may therefore capture complementary information. On the downstream ASR task, our proposed approaches provide a relative improvement of about 10% over the best individual models for both the Libri-100 and the WSJ tasks.

5 Acknowledgement

We would like to thank Metilda for technical discussions and all her help in preparing this paper.

References