An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

by Xuankai Chang, et al.

Self-supervised pretraining on speech data has made rapid progress. High-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance. Several recent works focus on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across the many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, reach or outperform current state-of-the-art (SOTA) recognition performance. Moreover, we explore additional scenarios to test where pretrained representations remain effective, such as cross-language and overlapped speech. The scripts, configurations and trained models have been released in ESPnet so that the community can reproduce and improve on our experiments.








1 Introduction

The performance of speech recognition systems has improved greatly over the last decade. On the one hand, the rapid development of deep neural networks has dramatically pushed the limits of the models [1, 15, 7, 45, 13, 16]. On the other hand, increasing computing resources have made it possible to train an automatic speech recognition (ASR) system on a large amount of transcribed data [38, 9], leading to better performance. Deep neural networks are known to be data-hungry, so some researchers have tried to improve model capacity by incorporating ever more transcribed data [8]. However, using transcribed data alone is inefficient, because untranscribed data makes up the great majority of all available data. Motivated by this, researchers have proposed to exploit untranscribed data through unsupervised and semi-supervised learning. This has recently become a hot topic in speech recognition and can be roughly divided into two approaches. In [28, 43, 23], a semi-supervised method called pseudo-labelling was proposed to use both transcribed and untranscribed data: a seed model is first trained on the transcribed data in a supervised manner and then used to generate pseudo-labels for the untranscribed data, after which a model can be trained on all the data in a supervised manner.

Previous studies in computer vision (CV) and natural language processing (NLP) have investigated learning representations from data, showing advantages on the corresponding downstream tasks [47, 12, 39]. Similarly, another approach is to pretrain models on a large amount of untranscribed data to learn high-quality speech features. In this context, many pretrained speech representation models have been proposed, often referred to as self-supervised learning representations (SSLRs). These SSLRs can be categorized by their training objectives. One direct way to learn speech representations is to predict future information given the history. In [10, 11], the authors adopted a method similar to autoregressive language models (LMs) to predict future acoustic features (e.g., FBANK) conditioned on past input features, called autoregressive predictive coding (APC). Instead of autoregressive modeling, some researchers proposed masked-prediction techniques, as in BERT [12], to learn speech representations, including Mockingjay [32], TERA [31] and NPC [30]. However, it is not necessary to learn speech representations by reconstructing acoustic features: in [36, 42], models were optimized with a contrastive loss to distinguish a positive sample from negative samples when predicting the future. Later, in [3, 4], a BERT-style Transformer was stacked on top of an encoder trained with the contrastive loss. Recently, a novel model called HuBERT [21] was proposed to pretrain the representation model with a classification task over pseudo-labels, motivated by deep clustering models [6, 49].

All of these representations have shown promising results; however, it is hard to conclude which representation suits which task, because the original experiments focused on a limited number of tasks and were performed independently. Recently, the Speech processing Universal PERformance Benchmark (SUPERB) [50] was proposed to provide a fair, standardized evaluation of various speech representations with a unified toolkit, S3PRL. SUPERB focuses on the shallow information in each representation: during evaluation, all representation models are frozen and applied to several downstream tasks, each using a quite lightweight downstream model. For example, the ASR task is evaluated with a two-layer RNN-based connectionist temporal classification (CTC) model. Such an evaluation provides informative clues for comparing the capacity and information concentration of each SSLR. It also provides easy access to many pretrained SSLR models. SUPERB is thus, without doubt, a very strong benchmark for evaluating SSLRs.

That said, it remains an open question how well these SSLRs perform in advanced speech recognition systems. In this paper, we investigate the performance of end-to-end ASR (E2E-ASR) systems using pretrained SSLRs. To achieve this, we incorporated the SSLRs from S3PRL, the toolkit used in SUPERB, into ESPnet [48], a widely used E2E speech processing toolkit. We can thus easily evaluate the E2E-ASR performance of the pretrained SSLRs available in S3PRL with current state-of-the-art (SOTA) neural network models such as Transformers [45, 13] and Conformers [16], and just as easily evaluate the SSLRs on other downstream tasks, including speech translation (ST) [22] and speech enhancement (SE) [29]. The generalization ability of these SSLRs is also an open question, given that most were trained and tested mainly on LibriSpeech [38, 24]. In this project, we explored these pretrained SSLRs on as many open-source and publicly available corpora as possible, covering different characteristics including read vs. spontaneous speech, single-speaker vs. multi-speaker speech, noisy/distant-talk environments, and the telephone channel. We show that some of the SSLRs achieve much better results than the commonly used log-Mel filterbank (FBANK) feature.

The contributions of this study include:

  • We implement the use of pretrained SSLRs in advanced E2E-ASR models, based on which we compare the performance of different representations.

  • The experimental results show that simply replacing the FBANK features with SSLRs can surpass our current best E2E-ASR systems. On several ASR benchmark corpora, such as WSJ, LibriSpeech and TEDLIUM2, our results are competitive with SOTA systems.

  • We explore scenarios with a domain mismatch between the raw speech used to pretrain the representations and the target task. Our observations shed light on the relationship between pretrained representations and the scenarios in which they apply.

  • We provide reproducible benchmark results, recipes, setups and well-trained models on several publicly available corpora in our open source toolkit ESPnet.

2 End-to-End ASR

In this section, we will briefly describe the E2E-ASR models used in this paper, including the CTC and the attention-based encoder-decoder (AED) framework.

2.1 CTC

Given an input speech feature sequence $X = (\mathbf{x}_t \in \mathbb{R}^D \mid t = 1, \dots, T)$, where $T$ is the number of frames, and the corresponding output label sequence $Y = (y_n \in \mathcal{V} \mid n = 1, \dots, N)$, where $N \leq T$ and $\mathcal{V}$ is a vocabulary, CTC [14] was proposed to estimate a frame-level input-output alignment $Z = (z_t \in \mathcal{V} \cup \{\epsilon\} \mid t = 1, \dots, T)$ by allowing repetitions of labels or emission of a special blank label $\epsilon$. CTC optimizes the model to maximize the probability distribution over all possible alignments:

$$P_{\mathrm{ctc}}(Y \mid X) = \sum_{Z \in \mathcal{Z}(Y)} P(Z \mid X),$$

where $\mathcal{Z}(Y)$ denotes all possible alignments of $Y$.

2.2 Joint CTC/Attention-based Encoder-Decoder

The AED model directly maps an input speech feature sequence $X$ into a label sequence $Y$ of words or characters without any intermediate representation. It models the distribution of each label by conditioning on both the previously estimated labels and the input features:

$$P_{\mathrm{att}}(Y \mid X) = \prod_{n=1}^{N} P(y_n \mid y_{1:n-1}, X).$$

For better performance, we adopt the joint CTC/attention-based encoder-decoder architecture [26] in this work. The loss is defined as a weighted sum of the negative log-likelihoods of CTC and AED:

$$\mathcal{L} = -\lambda \log P_{\mathrm{ctc}}(Y \mid X) - (1 - \lambda) \log P_{\mathrm{att}}(Y \mid X),$$

where $\lambda \in [0, 1]$ is a tunable hyper-parameter.
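As a minimal sketch (the function name is ours), the interpolated objective is:

```python
import math

def joint_ctc_att_loss(p_ctc, p_att, lam=0.3):
    """Joint CTC/attention loss: a weighted sum of the two
    negative log-likelihoods, interpolated by lambda."""
    return -lam * math.log(p_ctc) - (1.0 - lam) * math.log(p_att)
```

With `lam=0` this reduces to a pure AED objective, and with `lam=1` to pure CTC; intermediate values let the monotonic CTC alignment regularize the attention decoder.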

In this work, we mainly use the Transformer [45] and the Conformer [16] as the basic blocks to build our E2E-ASR models.

2.2.1 Transformer

The Transformer [45] was proposed by Vaswani et al. and has become the dominant model in various E2E-ASR tasks [13, 25]. Both the encoder and the decoder are multi-block architectures, each block consisting of a multi-headed self-attention (MHSA) module and a position-wise feed-forward network (FFN) module. In the decoder, an additional source-target attention module is inserted after the MHSA module to jointly model acoustic and linguistic information.

2.2.2 Conformer

In [16], Gulati et al. proposed a novel architecture combining self-attention and convolution in ASR models, named the Conformer encoder. Specifically, it includes an MHSA module, a convolution (CONV) module, and a pair of FFN modules in the Macaron-Net style. While MHSA learns the global context, the CONV module efficiently captures local correlations. In this work, our model follows the same setup as [17], which integrates the Conformer encoder with a standard Transformer decoder.

3 Speech Representations for ASR

Raw speech signal conveys information much less efficiently than text, so for ASR, feature extraction is an important module to condense the information in the speech signal. Traditionally, handcrafted features such as the log-Mel filterbank (FBANK) and Mel-frequency cepstral coefficients (MFCCs) are used in ASR tasks. Many SSLRs have been proposed in previous studies. In this project, we cover the following 8 representative methods in our evaluation: APC, CPC, HuBERT, Mockingjay, NPC, TERA, VQ-APC, and Wav2Vec2.0 (we only evaluated Wav2Vec2.0 among the Wav2Vec series because we believe it represents the best of the Wav2Vec techniques).

SSLR        | Objective        | Data              | Arch                  | #Params | Stride
APC         | autoregressive   | LibriSpeech 960hr | 3-GRU                 | 4.1M    | 10ms
CPC         | contrastive      | LibriLight 60K    | 5-Conv, 1-LSTM        | 1.8M    | 10ms
HuBERT      | pseudo-labelling | LibriLight 60K    | 7-Conv, 24-Trans      | 316.2M  | 20ms
Mockingjay  | masking          | LibriSpeech 960hr | 3-Trans               | 21.3M   | 10ms
NPC         | masking          | LibriSpeech 960hr | 4-Conv, 4-Masked Conv | 19.4M   | 10ms
TERA        | masking          | LibriSpeech 960hr | 3-Trans               | 23.3M   | 10ms
VQ-APC      | autoregressive   | LibriSpeech 960hr | 3-GRU                 | 4.6M    | 10ms
Wav2Vec2.0  | contrastive      | LibriSpeech 960hr | 7-Conv, 24-Trans      | 317.4M  | 20ms
Table 1: Summary of the SSLRs used in this paper, including the training objective (Objective), the data used in pretraining (Data), model architecture (Arch), number of parameters (#Params) and the stride of the features.

3.1 Autoregressive-prediction based representations

One type of representation learning method is to learn to predict the future acoustic features given the past.

APC. Autoregressive Predictive Coding (APC) was proposed in [10]. The idea comes from autoregressive LMs for text, which model a probability distribution over token sequences. APC uses an RNN to model the temporal information of the speech signal and to predict the features of the frame $n$ steps ahead. The model is optimized by minimizing the L1 loss between the predicted sequence and the target future frames.

VQ-APC. VQ-APC is a variant of APC that incorporates vector quantization (VQ) layers. A VQ layer in the middle of the APC network is used to control the amount of information passed on from the previous layer. The model is therefore forced to learn better representations to predict future frames.
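The APC objective can be sketched as follows (the function name and the offset symbol `n` are ours; the real models predict with an RNN rather than receiving predictions as an argument):

```python
import numpy as np

def apc_l1_loss(feats, predictions, n=3):
    # APC objective sketch: the prediction at frame t is compared
    # with the input feature n steps ahead under an L1 loss.
    target = feats[n:]                       # frames shifted n steps
    pred = predictions[: len(feats) - n]     # model outputs aligned to t
    return np.abs(target - pred).mean()
```

A model that exactly reproduces the shifted sequence incurs zero loss, which is what "predicting the frame $n$ steps ahead" means operationally.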

3.2 Masking-prediction based representations

Speech signals are context dependent, so using both past and future information when learning speech representations can be useful. Some researchers therefore proposed to generate the current frame conditioned on both past and future information.

Mockingjay. Similar to BERT [12], the model is trained to recover masked input features based on the left and right context frames. During training, a certain proportion of input frames is selected to be dynamically masked by zero-masking, filling with random values, or leaving unchanged. The model is optimized by minimizing the L1 reconstruction loss between the original feature sequence and the prediction. To prevent the model from exploiting local smoothness, the masked frames are chosen as consecutive subsequences. The MFCC feature is used as the acoustic feature, a Transformer with multi-headed self-attention is used as the model, and a two-layer feed-forward network on top predicts the original features.
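A minimal sketch of the consecutive-span masking policy (the span length, ratio, and 80/10/10 branch proportions here are assumptions in the BERT style, not Mockingjay's exact values):

```python
import numpy as np

def mask_frames(feats, span=7, mask_ratio=0.15, rng=None):
    # Pick consecutive spans of frames, then zero them, replace them
    # with random frames, or leave them unchanged.
    rng = rng or np.random.default_rng(0)
    x = feats.copy()
    T = len(x)
    n_spans = max(1, int(T * mask_ratio / span))
    for _ in range(n_spans):
        start = rng.integers(0, max(1, T - span))
        r = rng.random()
        if r < 0.8:
            x[start:start + span] = 0.0                          # zero-mask
        elif r < 0.9:
            x[start:start + span] = rng.standard_normal((span, x.shape[1]))
        # else: leave the span unchanged
    return x
```

Masking whole spans, rather than isolated frames, is what forces the model to use longer-range context instead of interpolating from neighbors.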

TERA. Transformer encoder representation from alteration (TERA) [31] is an extension of Mockingjay. As in Mockingjay, the model takes manipulated acoustic features as input and minimizes the difference between the ground truth and the prediction at the masked portions of the input. In addition to the time-axis masking used in Mockingjay, TERA employs two further alterations: masking along the frequency axis and magnitude alteration by adding random noise.

NPC. Nonautoregressive predictive coding (NPC) [30] shares a similar principle with Mockingjay and TERA but uses convolutional neural networks. The model generates the acoustic feature at each time step $t$ conditioned on both past and future information within a certain range. To avoid the local-smoothness problem, the nearest frames within $k$ steps of $t$ are masked out, so the input to the model excludes the range $(t-k, t+k)$.

3.3 Contrastive learning based representations

CPC. Instead of using a conditional generative model to reconstruct the original input signal, the contrastive predictive coding (CPC) [36] model learns representations by maximizing the mutual information between the current context and future embeddings, minimizing a noise-contrastive estimation (NCE) loss [18]. We use an updated version of the CPC model, called modified CPC [40]. The model contains a 5-layer convolutional neural network (CNN) encoder that generates latent representations at a lower temporal resolution and a 1-layer LSTM that summarizes the latent representations into a context representation.
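The NCE-style contrast can be sketched as a softmax cross-entropy over dot-product scores (the function name and the temperature are our assumptions; CPC's actual loss operates on batches of encoder outputs):

```python
import numpy as np

def nce_loss(ctx, pos, negs, temp=0.1):
    # The context vector should score its true future embedding (pos)
    # above the negative samples, via dot-product similarities.
    scores = np.array([ctx @ pos] + [ctx @ n for n in negs]) / temp
    scores -= scores.max()                     # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return -np.log(p[0])                       # cross-entropy on the positive
```

The loss is small when the positive dominates the negatives, so minimizing it pushes context and true-future embeddings together while pushing negatives apart.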

Wav2Vec2.0. Wav2Vec2.0 [4] is also a contrastive learning-based approach. It is composed of a CNN-based encoder network, a VQ module, and a Transformer-based context representation network. The masking idea is also applied: the input to the Transformer is randomly masked. During pretraining, Wav2Vec2.0 is optimized with a contrastive loss, plus a regularization term that increases the diversity of the codebook in the VQ module.

3.4 HuBERT

Hidden-Unit BERT (HuBERT) [21] is another novel self-supervised speech representation learning approach. During pretraining, an iterative pseudo-labelling method is adopted. First, an offline clustering method (e.g., k-means) is applied to acoustic features of the untranscribed speech, such as MFCCs, generating discrete labels for the speech data. The pretrained HuBERT model has two major modules: a CNN-based encoder and a BERT encoder. The model is trained to predict the distribution of the discrete units; as in BERT-style LMs, the input to the BERT encoder is partially masked, and the cross-entropy loss is computed on both masked and unmasked frames. After training, the model can be used to generate new, higher-quality pseudo-labels, and this refinement can be repeated multiple times.

3.5 Application of SSLRs

Figure 1: End-to-End speech processing with various speech representations. The framework can be used in speech recognition (ASR), speech translation (ST), speech enhancement (SE), etc.

All the SSLR models we use are pretrained and can be accessed via S3PRL. For HuBERT and CPC, we use models pretrained on the 60,000 h Libri-Light corpus [24]; for the rest of the SSLRs, we use models pretrained on LibriSpeech 960 h [38]. Detailed information on the SSLRs is given in Table 1. Following the policy in SUPERB [50], we use a weighted sum of the hidden states of each SSLR model as the input to E2E-ASR, with a trainable weight per hidden state. The overall E2E-ASR framework is shown in Fig. 1: the input speech goes through a frontend to extract features, where any of the SSLRs can be extracted in place of conventional features such as FBANK; the feature sequence is then fed into the ASR network to predict the hypothesis.
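The layer combination can be sketched as follows (a stand-in for the actual S3PRL featurizer; the softmax normalization of the scalar weights is an assumption consistent with SUPERB's description):

```python
import numpy as np

def weighted_sum(hidden_states, w):
    # hidden_states: stack of L layer outputs, shape (L, T, D)
    # w: one trainable scalar per layer, shape (L,)
    e = np.exp(w - w.max())        # softmax over layers
    alpha = e / e.sum()
    # contract the layer axis: result has shape (T, D)
    return np.tensordot(alpha, hidden_states, axes=1)
```

During ASR training, only `w` (and the back-end) is updated while the SSLR itself stays frozen, so the learned `alpha` reveals which layers carry the information the task needs, as examined later in Fig. 3.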

4 Experiments

4.1 Setups

To evaluate the performance of the various self-supervised learning representations, we conduct experiments on several open-source and publicly available corpora. Although the pretrained representation models were learned on LibriSpeech 960 h or Libri-Light, we test them on other datasets to verify their generalization ability. We conducted experiments on 7 ASR corpora (we did our best to cover as many datasets as possible; due to time limits, we could not finish all experiments before submission, and we are still working on other corpora including, but not limited to, REVERB [27], AMI [34], VoxForge, HKUST [33] and TEDLIUM3 [19]). We try to cover various aspects of the ASR task, including read vs. spontaneous speech, noisy/distant-talk environments and the telephone channel.

We follow the recipes in the latest ESPnet [48] to conduct the experiments. The default FBANK feature extractor is replaced with the pretrained self-supervised representation models, and downsampling layers bring all SSLRs to a 40 ms stride in the encoder. For most corpora, we use a Conformer encoder and Transformer decoder as our network architecture, which showed promising results [17]. The hyper-parameters of the ASR models are listed in Table 2.
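The stride alignment can be sketched as follows (the function names are ours, and the frame-stacking subsampling is a simple stand-in for ESPnet's strided convolutions):

```python
import numpy as np

def downsample_factor(stride_ms, target_ms=40):
    # e.g. HuBERT / Wav2Vec2.0 emit 20 ms frames -> factor 2;
    # the 10 ms SSLRs (and FBANK) -> factor 4
    assert target_ms % stride_ms == 0
    return target_ms // stride_ms

def subsample(feats, factor):
    # frame stacking: concatenate `factor` consecutive frames so the
    # sequence length shrinks while no information is discarded
    T = (len(feats) // factor) * factor
    return feats[:T].reshape(-1, factor * feats.shape[1])
```

Normalizing every frontend to the same 40 ms encoder stride is what lets the back-end architecture stay identical across all SSLRs.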

Dataset     | Encoder     | Kernel | #Enc | #Dec | #Heads | d_ff | d_att
AISHELL     | Conformer   | 15     | 12   | 6    | 4      | 2048 | 256
CHiME4      | Conformer   | 15     | 12   | 6    | 4      | 2048 | 256
LibriSpeech | Conformer   | 31     | 12   | 6    | 8      | 2048 | 512
Switchboard | Conformer   | 31     | 12   | 6    | 4      | 2048 | 256
TEDLIUM2    | Conformer   | 15     | 12   | 6    | 4      | 2048 | 256
WSJ         | Conformer   | 15     | 12   | 6    | 4      | 2048 | 256
WSJ0-2mix   | Transformer | -      | 12   | 6    | 4      | 2048 | 256
Table 2: Hyper-parameters of the ASR models for each dataset, including encoder type (Encoder), kernel size of the Conformer encoder (Kernel), number of encoder layers (#Enc), number of decoder layers (#Dec), number of attention heads (#Heads), feed-forward dimension (d_ff) and attention dimension (d_att).

4.2 Performance on E2E-ASR tasks

In this part, we present the E2E-ASR results of the various SSLRs using the joint CTC/attention-based encoder-decoder system. The results on the different corpora are shown in Table 3.

Dataset     Vocab Metric Evaluation Sets         FBANK        APC (960h)   CPC (60kh)   HuBERT (60kh) Mockingjay (960h) NPC (960h)   TERA (960h)  VQ-APC (960h) Wav2Vec2.0 (960h)
AISHELL     Char  CER    dev / test              4.4 / 4.7    6.1 / 6.5    4.9 / 5.3    4.4 / 4.7     5.0 / 5.4         5.0 / 5.5    4.7 / 5.1    5.0 / 5.5     4.6 / 5.0
CHiME4      Char  WER    {dt05/et05}-1ch         13.6 / 23.2  16.8 / 27.4  16.2 / 25.8  11.6 / 22.8   -                 16.8 / 28.0  16.8 / 28.0  17.1 / 27.4   13.5 / 26.1
                         {dt05/et05}-5ch         9.4 / 15.8   11.1 / 18.3  10.5 / 17.1  5.0 / 10.2    -                 11.0 / 18.8  10.9 / 18.1  11.0 / 18.0   6.3 / 12.5
LibriSpeech BPE   WER    {dev / test}-clean      1.7 / 1.9    2.4 / 2.6    2.2 / 2.4    1.5 / 1.6     2.3 / 2.4         2.5 / 2.6    2.4 / 2.5    2.2 / 2.6     1.7 / 1.9
                         {dev / test}-other      4.2 / 4.2    7.3 / 7.5    6.2 / 6.4    3.1 / 3.2     6.4 / 6.9         7.3 / 7.5    6.7 / 7.0    7.0 / 7.3     4.9 / 4.7
Switchboard BPE   WER    eval2000 (callhm/swbd)  15.6 / 8.4   17.7 / 8.9   15.7 / 8.4   18.1 / 7.3    16.9 / 9.9        17.6 / 9.1   17.2 / 8.9   17.4 / 8.6    14.9 / 7.9
TEDLIUM2    BPE   WER    dev / test              9.5 / 8.9    9.1 / 8.5    9.0 / 8.7    6.4 / 6.2     9.2 / 8.6         9.3 / 8.5    8.9 / 8.4    9.7 / 9.0     7.6 / 7.4
WSJ         Char  WER    dev93 / eval92          6.6 / 4.4    7.2 / 4.5    7.1 / 4.7    3.0 / 1.5     6.8 / 4.6         7.3 / 4.8    6.3 / 4.4    7.5 / 4.6     3.7 / 2.1
WSJ0-2mix   Char  WER    test                    17.7         16.5         14.9         12.1          15.9              16.5         15.1         17.1          13.2

Table 3: Performance (WER / CER, %) of the joint CTC/attention-based encoder-decoder on various open-source ASR corpora. The amount of pretraining data for each SSLR is given in parentheses.

4.2.1 Read English Speech

LibriSpeech [38] is a corpus of English speech extracted from audiobooks. Here we use all 960 hours of data as the training set. HuBERT, pretrained on 60,000 h of data, outperformed the baseline (for APC, NPC, and TERA, the number of gradient-accumulation steps was set incorrectly, so these results may be worse than they should be; we are re-training the corresponding models). In particular, HuBERT achieved relative improvements of as much as 23.8% and 26.2% on the dev-other and test-other sets, respectively. On the other hand, no improvement was observed for the other models, which were pretrained on LibriSpeech 960 h. In this case, the same data was used for representation learning and for the ASR task without fine-tuning, so we infer there was little room for improvement compared with the other corpora.

WSJ is a read-speech corpus drawn from Wall Street Journal news text, with a training set of about 80 hours. From Table 3, most SSLRs show results similar to the FBANK feature, except HuBERT and Wav2Vec2.0. Simply using the speech representations learned by HuBERT (60 kh) and Wav2Vec2.0 (960 h), we obtain superior WERs of 3.0%/1.5% and 3.7%/2.1% on the dev93 and eval92 sets, respectively, approaching the SOTA result (1.3% on eval92) presented in [8], which was trained on over 5,000 hours of English speech.

4.2.2 Spontaneous English Speech

We further evaluate performance on the TEDLIUM2 [41] dataset, a semi-spontaneous English speech corpus. Using HuBERT or Wav2Vec2.0 brings a significant performance gain over FBANK, with large relative word error rate (WER) reductions on both the dev and test sets. The other representations also achieve somewhat better performance than FBANK; among them, TERA is slightly better than the rest.

4.2.3 Noisy English Speech

CHiME4 [46] is a noisy multichannel ASR corpus. It is challenging for self-supervised models because the representations were learned from clean speech. We evaluate both single-mic and multi-mic sets; the 5ch set is enhanced with delay-and-sum beamforming [2]. HuBERT and Wav2Vec2.0 give significant performance improvements over FBANK features in the multi-mic (5ch) scenario, but perform similarly to FBANK in the single-mic (1ch) scenario. The other representations degrade performance in both scenarios. With Mockingjay, we observed severe overfitting, and it seems inappropriate for this dataset.

4.2.4 Telephone Channel English Speech

The Switchboard corpus consists of approximately 260 hours of telephone conversation speech collected at 8 kHz; we upsample it to 16 kHz to match the pretrained models. Since all SSLR models were trained on read English speech, data mismatches such as speaking style and channel distortion will evidently hinder performance. From the table, while most SSLR models suffer performance degradation, both HuBERT and Wav2Vec2.0 are robust to such domain mismatches, obtaining a slight improvement over the FBANK feature.
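A naive sketch of the 8 kHz to 16 kHz upsampling by linear interpolation (in practice a polyphase resampler such as `scipy.signal.resample_poly` or an `sox`-based pipeline would be used; the function name is ours):

```python
import numpy as np

def upsample_2x(wav):
    # doubles the sample count by linearly interpolating between
    # neighboring samples: 8 kHz in -> 16 kHz out
    t_out = np.arange(2 * len(wav)) / 2.0
    return np.interp(t_out, np.arange(len(wav)), wav.astype(float))
```

Note that interpolation only matches the sample rate; it cannot recover the frequency content above 4 kHz that the telephone channel discarded, which is part of the domain mismatch discussed above.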

4.2.5 Non-English Speech

Another remaining question is whether these SSLR models suit cross-lingual scenarios, such as Mandarin ASR. We therefore conduct experiments on the open-source Mandarin AISHELL corpus [5], which contains about 178 hours of Mandarin speech recorded in a clean environment. Due to limited time and computation resources, we could only test a few SSLR models. From the character error rate (CER) results, the English-pretrained models do not generalize well to other languages: although HuBERT matches the FBANK feature, other models such as APC and Wav2Vec2.0 show severe degradation, with up to roughly 30% relative increase in CER.

4.2.6 Multi-speaker overlapped Speech

The results on several single-speaker corpora in the previous sections show the varying performance of the pretrained models. To further explore whether models that work well on single-speaker data, e.g., HuBERT and Wav2Vec2.0, also apply to multi-speaker overlapped speech, we conduct experiments on the WSJ0-2mix dataset [20], the de-facto benchmark for speech separation and multi-speaker speech recognition systems. In WSJ0-2mix, the 30-hour training set and 10-hour validation set contain two-speaker overlapped mixtures; the 5-hour test set was generated similarly from utterances of 18 other speakers from the WSJ0 validation and evaluation sets, with no speaker overlap with the training set.

As shown in the last line of Table 3, even with an identical Transformer-based architecture, the results with different pretrained representations vary widely in WER. Similar to the observations on the single-speaker datasets, HuBERT and Wav2Vec2.0 again show clear advantages, with WERs of 12.1% and 13.2%, respectively. It is worth noting that the raw speech used to train the pretrained representations is all single-speaker speech, which is quite different from overlapped multi-speaker speech. We infer that, because of this mismatch between the data distributions, the results with representations such as CPC and Mockingjay are not satisfactory even though the back-end models are fully trained.

Figure 2: Validation accuracies of different self-supervised learning representations on WSJ ASR dataset.

4.3 Performance of E2E Non-autoregressive ASR

The experiments above suggest the pretrained representations are effective for E2E-ASR, especially HuBERT and Wav2Vec2.0, under the joint CTC/attention-based encoder-decoder (AED) framework. Recently, non-autoregressive (NAR) ASR has become popular. We use the simplest NAR model, CTC-based ASR, to evaluate the performance of each SSLR; the results are shown in Table 4. When decoding without a language model, i.e., greedy search over the CTC output, HuBERT and Wav2Vec2.0 still outperform all other representations. Mockingjay performs slightly worse than APC and CPC, unlike in the joint CTC/AED case. Adding a language model with beam search improves the results considerably. For the CTC models, it is striking that the WERs of HuBERT without an LM are even better than those of FBANK with an LM.
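Greedy CTC decoding can be sketched in a few lines (the function name is ours; ESPnet's actual decoder also supports prefix beam search with LM fusion):

```python
def ctc_greedy_decode(frame_logits, blank=0):
    # greedy CTC: take the argmax label per frame, then apply the
    # CTC collapsing rule (merge repeats, drop blanks)
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_logits]
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out
```

Because each frame is decoded independently, this is fully non-autoregressive; the LM and beam search in Table 4 recover some of the linguistic constraints this simple rule ignores.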

Frontend    CTC w/o LM    CTC w/ LM    Joint CTC & AED
FBANK 17.4 / 13.6 8.5 / 5.9 6.6 / 4.4
APC 18.5 / 15.0 9.3 / 6.5 7.2 / 4.5
CPC 18.9 / 14.7 9.6 / 6.7 7.1 / 4.7
HuBERT 8.1 / 5.6 3.3 / 2.1 3.0 / 1.5
Mockingjay 19.2 / 15.6 9.7 / 7.0 6.8 / 4.6
NPC 18.5 / 13.7 9.4 / 6.5 7.3 / 4.8
TERA 18.2 / 14.8 8.7 / 6.5 6.3 / 4.4
VQ-APC 19.4 / 14.7 10.0 / 6.5 7.5 / 4.6
Wav2Vec2.0 9.4 / 7.2 4.3 / 2.5 3.7 / 2.1
Table 4: WERs of dev/test sets on WSJ ASR corpora. Comparison between CTC and joint CTC/attention-based encoder-decoder.

4.4 Weighted-sum vs. Last layer vs. Finetuned Last layer

In this part, we explore different ways to use the pretrained representations. Since HuBERT and Wav2Vec2.0 achieve much better results than the other representations, we mainly evaluate these two models. We test three methods on the WSJ dataset for fast development; the results are presented in Table 5. To isolate the impact of the language model, we report results both with and without an LM. First, decoding with HuBERT or Wav2Vec2.0, even without an LM, beats the FBANK baseline with an LM in Table 3. For Wav2Vec2.0 without an LM, the weighted-sum feature performs better than the last-layer variants, indicating that the hidden states in the middle of the Wav2Vec2.0 Transformer encoder are also helpful. We further inspect the learned summarization weights after training, shown in Fig. 3: HuBERT concentrates more on the last layer. This phenomenon remains to be investigated.

Figure 3: The weights of the feature summarization after training on HuBERT and Wav2Vec2.0 representations.
Frontend    LM      Weighted-Sum  Last-layer  Finetuned Last-layer
HuBERT      w/ LM   3.0 / 1.5     2.6 / 1.6   2.8 / 1.6
            w/o LM  4.9 / 3.4     4.8 / 3.4   4.6 / 3.5
Wav2Vec2.0  w/ LM   3.7 / 2.1     3.7 / 2.1   3.7 / 2.0
            w/o LM  6.1 / 3.9     6.1 / 4.5   6.1 / 4.2
Table 5: WERs of the dev/test sets on the WSJ ASR corpus in w/ and w/o LM conditions, using weighted-sum, last-layer and finetuned last-layer outputs as the ASR input feature.

4.5 Discussions

According to the experiments, the self-supervised learning representations can improve performance in some cases, especially the HuBERT and Wav2Vec2.0 models. Fig. 2 shows the validation accuracies of different SSLRs on the WSJ dataset, all trained with the same optimization parameters. The learning curves with the pretrained representations are clearly better than FBANK in the first few epochs, showing the fast convergence of all SSLRs. HuBERT and Wav2Vec2.0 maintain a clear lead throughout training, which suggests that one can judge whether a pretrained representation is working correctly without waiting for the full training run.

We summarize the other training tips we have observed in our experiment:

  • If the number of GPUs is insufficient, the accumulating gradient strategy[37] can be employed to emulate a large mini-batch.

  • When a model suffers from over-fitting, dropout on the positional encodings and the attention layers in the Transformer and Conformer blocks can be enabled.

  • Strong self-supervised representation models, including HuBERT and Wav2Vec2.0, are robust to optimization parameters such as the learning rate and batch size (or gradient accumulation). Consequently, the hyperparameters tuned for HuBERT and Wav2Vec2.0 may not transfer directly to other representations.

  • Because the HuBERT and Wav2Vec2.0 representations are robust, their training curves can serve as an upper bound when monitoring the training trend of other representations.

  • Computing global normalization statistics on CPU takes a long time; one option is to compute them on GPU. For simplicity, we use utterance-level normalization instead of global normalization.
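The gradient accumulation tip above can be illustrated with a minimal PyTorch training loop. `train_with_accumulation` is a hypothetical helper written for this sketch (ESPnet exposes a similar option in its training configuration); the key detail is scaling each loss by the number of accumulation steps so the accumulated gradient matches that of one large batch:

```python
import torch
import torch.nn as nn


def train_with_accumulation(model, optimizer, batches, accum_steps=4):
    """Emulate a large mini-batch on limited GPUs by accumulating
    gradients over `accum_steps` small batches per optimizer update.
    """
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches, start=1):
        loss = nn.functional.mse_loss(model(x), y)
        # Scale so the summed gradient equals the large-batch gradient.
        (loss / accum_steps).backward()
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

With `accum_steps=4` and a per-step batch of 2, the update behaves like a single batch of 8, at the cost of 4x forward/backward passes per update.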
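The utterance-level normalization mentioned in the last tip needs no corpus-wide pass: each utterance is standardized with its own statistics. A minimal sketch (the function name `utterance_cmvn` is ours, chosen for illustration):

```python
import torch


def utterance_cmvn(feats: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Utterance-level mean-variance normalization of a (time, dim)
    feature matrix, used in place of global statistics that are slow
    to accumulate over a whole corpus.
    """
    mean = feats.mean(dim=0, keepdim=True)
    std = feats.std(dim=0, keepdim=True)
    # eps guards against division by zero for constant feature dims.
    return (feats - mean) / (std + eps)
```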

5 Conclusion

In this paper, we explore the application of different pretrained self-supervised learning representations to E2E-ASR with the joint CTC/attention-based encoder-decoder architecture. We conduct experiments on several publicly available corpora. The results show that HuBERT and Wav2Vec2.0, which have the largest numbers of parameters, can dramatically improve the performance over the commonly used log-Mel filterbank feature. To achieve better performance, we can directly re-use the existing representation models. We also find that the pretrained representations generalize across corpora rather than being restricted to a specific dataset. We open-source all of our configurations and scripts in the ESPnet project to help the community easily reproduce and improve these results. In the future, we will extend this work to more corpora and to speech processing tasks beyond ASR.

6 Acknowledgement

This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [44], which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system [35], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).


  • [1] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn (2012) Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Proc. ICASSP, pp. 4277–4280. Cited by: §1.
  • [2] X. Anguera, C. Wooters, and J. Hernando (2007) Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing 15 (7), pp. 2011–2022. Cited by: §4.2.3.
  • [3] A. Baevski, S. Schneider, and M. Auli (2020) Vq-wav2vec: Self-supervised learning of discrete speech representations. In Proc. ICLR, Cited by: §1.
  • [4] A. Baevski, H. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. NeurIPS, Cited by: §1, §3.3.
  • [5] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng (2017) Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proc. O-COCOSDA, pp. 1–5. Cited by: §4.2.5.
  • [6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proc. ECCV, pp. 132–149. Cited by: §1.
  • [7] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals (2016) Listen, attend and spell. In Proc. ICASSP, pp. 4960–4964. Cited by: §1.
  • [8] W. Chan, D. Park, C. Lee, Y. Zhang, Q. Le, and M. Norouzi (2021) SpeechStew: Simply mix all available speech recognition data to train one large neural network. In Proc. Interspeech, Cited by: §1, §4.2.1.
  • [9] G. Chen, S. Chai, G. Wang, J. Du, W. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, et al. (2021) GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Proc. Interspeech, Cited by: §1.
  • [10] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. In Proc. Interspeech, pp. 146–150. Cited by: §1, §3.1.
  • [11] Y. Chung, H. Tang, and J. Glass (2020) Vector-quantized autoregressive predictive coding. In Proc. Interspeech, pp. 3760–3764. Cited by: §1, §3.1.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, pp. 4171–4186. Cited by: §1, §3.2.
  • [13] L. Dong, S. Xu, and B. Xu (2018) Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proc. ICASSP, pp. 5884–5888. Cited by: §1, §1, §2.2.1.
  • [14] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, pp. 369–376. Cited by: §2.1.
  • [15] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pp. 6645–6649. Cited by: §1.
  • [16] A. Gulati, J. Qin, C. Chiu, N. Parmar, et al. (2020) Conformer: Convolution-augmented Transformer for speech recognition. In Proc. Interspeech, pp. 5036–5040. Cited by: §1, §1, §2.2.2, §2.2.
  • [17] P. Guo, F. Boyer, X. Chang, T. Hayashi, Y. Higuchi, H. Inaguma, N. Kamo, C. Li, D. Garcia-Romero, J. Shi, et al. (2021) Recent developments on ESPnet toolkit boosted by Conformer. In Proc. ICASSP, pp. 5874–5878. Cited by: §2.2.2, §4.1.
  • [18] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proc. AISTATS, pp. 297–304. Cited by: §3.3.
  • [19] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Esteve (2018) TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Proc. SPECOM, pp. 198–208. Cited by: footnote 2.
  • [20] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: Discriminative embeddings for segmentation and separation. In Proc. ICASSP, pp. 31–35. Cited by: §4.2.6.
  • [21] W. Hsu, Y. H. Tsai, B. Bolte, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: How much can a bad teacher benefit ASR pre-training?. In Proc. ICASSP, pp. 6533–6537. Cited by: §1, §3.4.
  • [22] H. Inaguma, S. Kiyono, K. Duh, S. Karita, N. Yalta, T. Hayashi, and S. Watanabe (2020) ESPnet-ST: All-in-one speech translation toolkit. In Proc. ACL, pp. 302–311. Cited by: §1.
  • [23] J. Kahn, A. Lee, and A. Hannun (2020) Self-training for end-to-end speech recognition. In Proc. ICASSP, pp. 7084–7088. Cited by: §1.
  • [24] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, et al. (2020) Libri-light: A benchmark for ASR with limited or no supervision. In Proc. ICASSP, pp. 7669–7673. Cited by: §1, §3.5.
  • [25] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, et al. (2019) A comparative study on Transformer vs RNN in speech applications. In Proc. ASRU, pp. 449–456. Cited by: §2.2.1.
  • [26] S. Kim, T. Hori, and S. Watanabe (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. ICASSP, pp. 4835–4839. Cited by: §2.2.
  • [27] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, et al. (2013) The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In Proc WASPAA, pp. 1–4. Cited by: footnote 2.
  • [28] D. Lee et al. (2013) Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proc. ICML, pp. 896. Cited by: §1.
  • [29] C. Li, J. Shi, W. Zhang, A. S. Subramanian, X. Chang, N. Kamo, M. Hira, T. Hayashi, C. Boeddeker, Z. Chen, et al. (2021) ESPnet-SE: End-to-end speech enhancement and separation toolkit designed for ASR integration. In Proc. SLT, pp. 785–792. Cited by: §1.
  • [30] A. H. Liu, Y. Chung, and J. Glass (2020) Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406. Cited by: §1, §3.2.
  • [31] A. T. Liu, S. Li, and H. Lee (2020) Tera: Self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028. Cited by: §1, §3.2.
  • [32] A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee (2020) Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In Proc. ICASSP, pp. 6419–6423. Cited by: §1.
  • [33] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff (2006) HKUST/MTS: A very large scale mandarin telephone speech corpus. In Proc. ISCSLP, pp. 724–735. Cited by: footnote 2.
  • [34] I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, et al. (2005) The AMI meeting corpus. In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, Vol. 88, pp. 100. Cited by: footnote 2.
  • [35] N. A. Nystrom, M. J. Levine, R. Z. Roskies, and J. R. Scott (2015) Bridges: A uniquely flexible HPC resource for new communities and data analytics. In Proc. XSEDE, pp. 1–8. Cited by: §6.
  • [36] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §3.3.
  • [37] M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 1–9. Cited by: 1st item.
  • [38] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: An ASR corpus based on public domain audio books. In Proc. ICASSP, pp. 5206–5210. Cited by: §1, §1, §3.5, §4.2.1.
  • [39] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • [40] M. Riviere, A. Joulin, P. Mazaré, and E. Dupoux (2020) Unsupervised pretraining transfers well across languages. In Proc. ICASSP, pp. 7414–7418. Cited by: §3.3.
  • [41] A. Rousseau, P. Deléglise, Y. Esteve, et al. (2014) Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks.. In Proc. LREC, pp. 3935–3939. Cited by: §4.2.2.
  • [42] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: Unsupervised pre-training for speech recognition. In Proc. Interspeech, pp. 3465–3469. Cited by: §1.
  • [43] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert (2020) End-to-end ASR: From supervised to semi-supervised learning with modern architectures. In Proc. ICML, Cited by: §1.
  • [44] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, et al. (2014) XSEDE: Accelerating scientific discovery. Computing in Science & Engineering 16 (5), pp. 62–74. Cited by: §6.
  • [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, et al. (2017) Attention is all you need. In Proc. NeurIPS, pp. 5998–6008. Cited by: §1, §1, §2.2.1, §2.2.
  • [46] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer (2017) An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language 46, pp. 535–557. Cited by: §4.2.3.
  • [47] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2016) Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 652–663. Cited by: §1.
  • [48] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al. (2018) ESPnet: End-to-end speech processing toolkit. In Proc. Interspeech, pp. 2207–2211. Cited by: §1, §4.1.
  • [49] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In Proc. ICML, pp. 478–487. Cited by: §1.
  • [50] S. Yang, P. Chi, Y. Chuang, C. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G. Lin, et al. (2021) SUPERB: Speech processing universal performance benchmark. In Proc. Interspeech, Cited by: §1, §3.5.