Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

12/03/2019 ∙ by Shaoshi Ling, et al. ∙ Amazon

We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech followed by supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.




1 Introduction

Current state-of-the-art models for speech recognition require vast amounts of transcribed audio data to attain good performance. In particular, end-to-end ASR models are more demanding in the amount of training data required when compared to traditional hybrid models. While obtaining a large amount of labeled data requires substantial effort and resources, it is much less costly to obtain abundant unlabeled data.

For this reason, semi-supervised learning (SSL) is often used when training ASR systems. The most commonly used SSL approach in ASR is self-training [26, 15, 20, 17, 23]. In this approach, a smaller labeled set is used to train an initial seed model, which is applied to a larger amount of unlabeled data to generate hypotheses. The unlabeled data with the most reliable hypotheses are added to the training data for re-training. This process is repeated iteratively. However, self-training is sensitive to the quality of the hypotheses and requires careful calibration of the confidence measures. Other SSL approaches include: pre-training on a large amount of unlabeled data with restricted Boltzmann machines (RBMs) [27]; entropy minimization [9, 14, 29], where the uncertainty of the unlabeled data is incorporated as part of the training objective; and graph-based approaches [18], where the manifold smoothness assumption is exploited. Recently, transfer learning from large-scale pre-trained language models (LMs) [24, 7, 28] has shown great success and achieved state-of-the-art performance in many NLP tasks. The core idea of these approaches is to learn efficient word representations by pre-training on massive amounts of unlabeled text via word completion. These representations can then be used for downstream tasks with labeled data.

Inspired by this, we propose an SSL framework that learns efficient, context-aware acoustic representations using a large amount of unlabeled data, and then applies these representations to ASR tasks using a limited amount of labeled data. In our implementation, we perform acoustic representation learning using forward and backward LSTMs and a training objective that minimizes the reconstruction error of a temporal slice of filterbank features given previous and future context frames. After pre-training, we fix these parameters and add output layers with connectionist temporal classification (CTC) loss for the ASR task.

The paper is organized as follows: in Section 2, we give a brief overview of related work in acoustic representation learning and SSL. In Section 3, we describe an implementation of our SSL framework with DeCoAR learning. We describe the experimental setup in Section 4 and the results on WSJ and LibriSpeech in Section 5, followed by our conclusions in Section 6.

Figure 1: Illustration of our semi-supervised speech recognition system.

2 Related work

While semi-supervised learning has been exploited in a plethora of works on hybrid ASR systems, very little work has been done on their end-to-end counterparts [17, 2, 8]. In [17], an intermediate representation of speech and text is learned via a shared encoder network. To train these representations, the encoder network was trained to optimize a combination of ASR loss, text-to-text autoencoder loss and inter-domain loss. The latter two loss functions did not require paired speech and text data. Learning efficient acoustic representations can be traced back to restricted Boltzmann machines [12, 11, 3], which allow pre-training on large amounts of unlabeled data before training the deep neural network acoustic models.

More recently, acoustic representation learning has drawn increasing attention [13, 5, 6, 4, 25, 1] in speech processing. For example, an autoregressive predictive coding model (APC) was proposed in [6] for unsupervised speech representation learning and was applied to phone classification and speaker verification. WaveNet auto-encoders [4] proposed contrastive predictive coding (CPC) to learn speech representations and were applied to the unsupervised acoustic unit discovery task. Wav2vec [25] proposed a multi-layer convolutional neural network optimized via a noise contrastive binary classification and was applied to WSJ ASR tasks.

Unlike the speech representations described in [25, 6], our representations are optimized to use bi-directional contexts to auto-regressively reconstruct unseen frames. Thus, they are deep contextualized representations that are functions of the entire input sentence. More importantly, our work is a general semi-supervised training framework that can be applied to different systems and requires no architecture change.

3 DEep COntextualized Acoustic Representations

3.1 Representation learning from unlabeled data

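Our approach is largely inspired by ELMo [24]. In ELMo, given a sequence of $T$ tokens $(w_1, w_2, \dots, w_T)$, a forward language model (implemented with an LSTM) computes its probability using the chain rule decomposition:

$$p(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, w_2, \dots, w_{t-1}).$$

Similarly, a backward language model computes the sequence probability by modeling the probability of token $w_t$ given its future context $w_{t+1}, \dots, w_T$ as follows:

$$p(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_{t+1}, w_{t+2}, \dots, w_T).$$

ELMo is trained by maximizing the joint log-likelihood of both forward and backward language model probabilities:

$$\sum_{t=1}^{T} \Big[ \log p(w_t \mid w_1, \dots, w_{t-1}; \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(w_t \mid w_{t+1}, \dots, w_T; \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big] \tag{1}$$

where $\Theta_x$ is the parameter for the token representation layer, $\Theta_s$ is the parameter for the softmax layer, and $\overrightarrow{\Theta}_{\mathrm{LSTM}}$, $\overleftarrow{\Theta}_{\mathrm{LSTM}}$ are the parameters of forward and backward LSTM layers, respectively. As the word representations are learned with neural networks that use past and future information, they are referred to as deep contextualized word representations.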

For speech processing, predicting a single frame $x_t$ may be a trivial task, as it could be solved by exploiting the temporal smoothness of the signal. In the APC model [6], the authors propose predicting a frame $n$ steps ahead of the current one. Namely, the model aims to minimize the $\ell_1$ loss between an acoustic feature vector $x$ at time $t+n$ and a reconstruction $y$ predicted at time $t$: $\sum_t |x_{t+n} - y_t|$. They conjectured this would induce the model to learn more global structure rather than simply leveraging local information within the signal.
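As a minimal sketch of the $n$-step-ahead objective described above, the following plain-Python function sums the $\ell_1$ distance between each frame $x_{t+n}$ and the prediction $y_t$ made $n$ steps earlier. The function name `apc_l1_loss` and the toy values are illustrative assumptions and are not taken from [6].

```python
def apc_l1_loss(frames, predictions, n):
    """Sum of L1 distances between frame t+n and the prediction made at time t."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0.0
    # Only positions that have a target n steps ahead contribute to the loss.
    for t in range(len(frames) - n):
        target = frames[t + n]
        total += sum(abs(a - b) for a, b in zip(target, predictions[t]))
    return total


# Toy 2-dimensional "filterbank" frames and hypothetical model outputs.
frames = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
predictions = [[2.0, 3.0], [3.0, 3.5], [0.0, 0.0], [0.0, 0.0]]
loss = apc_l1_loss(frames, predictions, n=2)
```

With `n=2`, only the predictions at the first two positions have a target inside the sequence, so the last two predictions are ignored.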

We propose combining the bidirectionality of ELMo and the reconstruction objective of APC to give deep contextualized acoustic representations (DeCoAR). We train the model to predict a slice of acoustic feature vectors, given past and future acoustic vectors. As depicted on the left side of Figure 1, a stack of forward and backward LSTMs is applied to the entire unlabeled input sequence $(x_1, \dots, x_T)$. The network computes a hidden representation that encodes information from both previous and future frames (i.e. $\overrightarrow{z}_t, \overleftarrow{z}_t$) for each frame $x_t$. Given a sequence of acoustic feature inputs $(x_1, \dots, x_T)$, for each slice $(x_t, x_{t+1}, \dots, x_{t+K})$ starting at time step $t$, our objective is defined as follows:

$$\mathcal{L}_t = \sum_{i=0}^{K} \left| x_{t+i} - \mathrm{FFN}_i\big([\overrightarrow{z}_t; \overleftarrow{z}_{t+K}]\big) \right| \tag{2}$$

where $[\overrightarrow{z}_t; \overleftarrow{z}_{t+K}]$ are the concatenated forward and backward states from the last LSTM layer, and

$$\mathrm{FFN}_i(v) = W_{i,2}\,\mathrm{ReLU}(W_{i,1} v + b_{i,1}) + b_{i,2}$$

is a position-dependent feed-forward network with 512 hidden dimensions. The final loss $\mathcal{L}$ is summed over all possible slices in the entire sequence:

$$\mathcal{L} = \sum_{t=1}^{T-K} \mathcal{L}_t \tag{3}$$

Note this can be implemented efficiently as a layer which predicts these $(K+1)$ frames at each position $t$, all at once. We compare with the use of unidirectional LSTMs and various slice sizes in Section 5.
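The slice reconstruction loss can be sketched in plain Python as follows. This is an illustration under stated assumptions and not the authors' implementation: the LSTM states are passed in as precomputed lists, the position-dependent predictors are hypothetical callables `ffns[i]`, time is 0-indexed, and an $\ell_1$ distance is assumed for the per-frame error.

```python
def decoar_loss(frames, z_fwd, z_bwd, ffns, K):
    """Sum of L1 reconstruction errors over all slices of K+1 frames.

    frames: list of T feature vectors.
    z_fwd, z_bwd: per-frame forward and backward states (lists of vectors).
    ffns: K+1 hypothetical callables; ffns[i] maps the concatenated context
          [z_fwd[t]; z_bwd[t+K]] to a prediction of frames[t+i].
    """
    T = len(frames)
    total = 0.0
    for t in range(T - K):                   # every slice start that fits
        context = z_fwd[t] + z_bwd[t + K]    # list concatenation
        for i in range(K + 1):               # every position inside the slice
            pred = ffns[i](context)
            total += sum(abs(a - b) for a, b in zip(frames[t + i], pred))
    return total


# Toy setup: 1-dimensional frames, states equal to the frames themselves,
# and predictors that interpolate linearly between the two slice ends.
K = 2
ffns = [lambda c, i=i: [c[0] + (c[1] - c[0]) * i / K] for i in range(K + 1)]
ramp = [[0.0], [1.0], [2.0], [3.0]]
bumpy = [[0.0], [5.0], [2.0], [3.0]]
loss_ramp = decoar_loss(ramp, ramp, ramp, ffns, K)
loss_bumpy = decoar_loss(bumpy, bumpy, bumpy, ffns, K)
```

A linear ramp is reconstructed exactly by interpolation, so its loss is zero, while the sequence with a spike in the middle of a slice incurs a positive loss.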

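| Representation | 100h test-clean | 100h test-other | 360h test-clean | 360h test-other | 460h test-clean | 460h test-other | 960h test-clean | 960h test-other |
|---|---|---|---|---|---|---|---|---|
| filterbank | 9.36 | 30.20 | 7.57 | 25.28 | 7.11 | 24.31 | 5.82 | 14.50 |
| wav2vec [25] | 6.92 | 20.00 | 6.26 | 18.17 | 6.01 | 17.00 | 5.12 | 13.07 |
| DeCoAR | 6.10 | 17.43 | 5.23 | 14.67 | 5.12 | 14.10 | 4.74 | 12.20 |

Table 1: Semi-supervised LibriSpeech results (WER, %). Column headers give the amount of labeled training data and the evaluation set.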

3.2 End-to-end ASR training with labeled data

After we have pre-trained the DeCoAR on unlabeled data, we freeze the parameters in the architecture. To train an end-to-end ASR system using labeled data, we remove the reconstruction layer and add two BLSTM layers with CTC loss [10], as illustrated on the right side of Figure 1. The DeCoAR vectors induced by the labeled data in the forward and backward layers are concatenated. We fine-tune the parameters of this ASR-specific new layer on the labeled data.

While we use LSTMs and CTC loss in our implementation, our SSL approach should work for other layer choices (e.g. TDNN, CNN, self-attention) and other downstream ASR models (e.g. hybrid, seq2seq, RNN transducers) as well.

4 Experimental Setup

4.1 Data

We conducted our experiments on the WSJ and LibriSpeech datasets, pre-training by using one of the two training sets as unlabeled data. To simulate the SSL setting in WSJ, we used 30%, 50% and 100% of the labeled data for ASR training, consisting of 25 hours, 40 hours, and 81 hours, respectively. We used dev93 for validation and eval92 for evaluation. For LibriSpeech, the amount of training data used varied from 100 hours to the entire 960 hours. We used dev-clean for validation and test-clean and test-other for evaluation.

4.2 ASR systems

Our experiments consisted of three different setups: 1) a fully-supervised system using all labeled data; 2) an SSL system using wav2vec features; 3) an SSL system using our proposed DeCoAR features. All models used were based on deep BLSTMs with the CTC loss criterion.

In the supervised ASR setup, we used conventional log-mel filterbank features, which were extracted with a 25ms sliding window at a 10ms frame rate. The features were normalized via mean subtraction and variance normalization on a per-speaker basis. The model had 6 BLSTM layers, with 512 cells in each direction. We found that increasing the number of cells did not further improve the performance and thus used this configuration as our best supervised ASR baseline. The output CTC labels were 71 phonemes (see the CMU lexicon) plus one blank symbol.
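The per-speaker mean and variance normalization mentioned above can be sketched as below. This is a minimal illustration that assumes statistics are pooled over all frames of a speaker and that features arrive as plain lists; the function name `cmvn_per_speaker` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def cmvn_per_speaker(utterances):
    """Normalize features to zero mean and unit variance per speaker.

    utterances: list of (speaker_id, frames), where frames is a list of
    equal-length feature vectors.
    """
    # Pool all frames that belong to the same speaker.
    pooled = defaultdict(list)
    for speaker, frames in utterances:
        pooled[speaker].extend(frames)

    # Per-dimension mean and standard deviation for each speaker.
    stats = {}
    for speaker, frames in pooled.items():
        dims = list(zip(*frames))
        means = [fmean(d) for d in dims]
        stds = [pstdev(d) or 1.0 for d in dims]  # guard against zero variance
        stats[speaker] = (means, stds)

    normalized = []
    for speaker, frames in utterances:
        means, stds = stats[speaker]
        scaled = [[(v - m) / s for v, m, s in zip(f, means, stds)] for f in frames]
        normalized.append((speaker, scaled))
    return normalized


utterances = [
    ("a", [[1.0], [3.0]]),
    ("a", [[5.0]]),
    ("b", [[2.0], [4.0]]),
]
normalized = cmvn_per_speaker(utterances)
```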

In the SSL ASR setup, we pre-trained a 4-layer BLSTM (1024 cells per sub-layer) to learn DeCoAR features according to the loss defined in Equation 2, using a slice size of 18. We optimized the network with SGD and used a Noam learning rate schedule, where we started with a learning rate of 0.001, gradually warmed up for 500 updates, and then performed inverse square-root decay. We grouped the input sequences by length with a batch size of 64, and trained the models on 8 GPUs. After the representation network was trained, we froze the parameters and added a projection layer, followed by a 2-layer BLSTM with CTC loss on top of it. We fed the labeled data to the network. For comparison, we obtained 512-dimensional wav2vec representations [25] from the wav2vec-large model. Their model was pre-trained on 960-hour LibriSpeech data with a contrastive loss and had 12 convolutional layers with skip connections.
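A warm-up followed by inverse square-root decay can be sketched as below. The exact parameterization is not given in the text, so this sketch assumes that 0.001 is the peak rate reached after a linear warm-up of 500 updates; the function name `noam_lr` and its defaults are illustrative.

```python
def noam_lr(step, peak_lr=0.001, warmup_steps=500):
    """Learning rate at a 1-indexed update step (assumed parameterization)."""
    if step < 1:
        raise ValueError("step is 1-indexed")
    if step <= warmup_steps:
        # Linear warm-up towards the peak learning rate.
        return peak_lr * step / warmup_steps
    # Inverse square-root decay after the warm-up phase.
    return peak_lr * (warmup_steps / step) ** 0.5


schedule = [noam_lr(s) for s in (1, 250, 500, 2000)]
```

Under these assumptions the rate rises linearly until update 500 and has halved again by update 2000.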

For evaluation purposes, we applied WFST-based decoding using EESEN [21]. We composed the CTC labels, lexicons and language models (an unpruned trigram LM for WSJ and a downloaded 4-gram LM for LibriSpeech) into a decoding graph. The acoustic model score was set separately for WSJ and LibriSpeech, and the blank symbol prior scale was set to the same value for both tasks. We report the performance in word error rate (WER).

| Representation | Unlabeled | Labeled | dev93 | eval92 |
|---|---|---|---|---|
| filterbank | - | 81h | 8.21 | 5.44 |
| wav2vec [25] | 960h Libri. | 81h | 6.84 | 3.97 |
| DeCoAR | 960h Libri. | 81h | 6.30 | 3.17 |
| filterbank | - | 25h | 18.16 | 11.04 |
| filterbank | - | 40h | 13.50 | 9.20 |
| DeCoAR | 81h WSJ | 25h | 10.38 | 5.81 |
| DeCoAR | 81h WSJ | 40h | 9.41 | 5.09 |
| DeCoAR | 81h WSJ | 81h | 8.34 | 4.64 |

Table 2: Semi-supervised WSJ results (WER, %). Unlabeled indicates the amount of unlabeled data used for acoustic representation learning, and Labeled indicates the amount of labeled data in ASR training.
Figure 2: The spectrograms for a portion of LibriSpeech dev-clean utterance 2428-83699-0034, reconstructed by taking the $i$-th frame prediction (slice size 18) at each time step. The reconstruction becomes less noisy but more simplistic when predicting further into the masked slice.

5 Results

5.1 Semi-supervised WSJ results

Table 2 shows our results on semi-supervised WSJ. DeCoAR features outperform filterbank and wav2vec features on eval92, with relative improvements of 42% and 20%, respectively. The lower part of the table shows that with smaller amounts of labeled data, the DeCoAR features are significantly better than the filterbank features: compared to the system trained on 100% of the labeled data with filterbank features, we achieve comparable results on eval92 using 30% of the labeled data and better performance on eval92 using 50% of the labeled data.
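The relative improvements quoted above follow directly from the eval92 column of Table 2 (81 hours of labeled WSJ data). The short check below is only a worked example of that arithmetic; the helper name `relative_improvement` is an illustrative choice.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction over a baseline, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# eval92 WERs from Table 2, all trained with 81h of labeled data.
filterbank, wav2vec, decoar = 5.44, 3.97, 3.17
vs_filterbank = relative_improvement(filterbank, decoar)
vs_wav2vec = relative_improvement(wav2vec, decoar)
print(f"{vs_filterbank:.1f}% over filterbank, {vs_wav2vec:.1f}% over wav2vec")
```

The two values come out at roughly 41.7% and 20.2%, which round to the 42% and 20% reported in the text.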

5.2 Semi-supervised LibriSpeech results

Table 1 shows the results on semi-supervised LibriSpeech. Both our representations and wav2vec [25] are trained on 960h of LibriSpeech data. We conduct our semi-supervised experiments using 100h (train-clean-100), 360h (train-clean-360), 460h, and 960h of training data. Our approach outperforms both the baseline and the wav2vec model in each SSL scenario. One notable observation is that using only 100 hours of transcribed data achieves very similar performance on test-clean to the system trained on the full 960-hour data with filterbank features. On the more challenging test-other dataset, we also achieve performance on par with the filterbank baseline using a 360h subset. Furthermore, training with our DeCoAR features improves on the baseline even when using the exact same training data (960h). Note that while [22] introduced SpecAugment to significantly improve LibriSpeech performance via data augmentation, and [19] achieved state-of-the-art results using both hybrid and end-to-end models, our approach focuses on the SSL case with less labeled training data via our DeCoAR features.

5.3 Ablation Study and Analysis

5.3.1 Context window size

We study the effect of the context window size during pre-training. Table 3 shows that masking and predicting a larger slice of frames can actually degrade performance, while increasing training time. A similar effect was found for SpanBERT [16], another deep contextual word representation, whose authors found that masking a mean span of 3.8 consecutive words was ideal for their word reconstruction objective.

| Slice size | dev93 | eval92 |
|---|---|---|
| 12 | 6.58 | 3.50 |
| 18 | 6.30 | 3.17 |
| 22 | 6.62 | 3.62 |

Table 3: Comparison of WERs on WSJ after pre-training with different slice window sizes on LibriSpeech.

5.3.2 Unidirectional versus bidirectional context

Next, we study the importance of bidirectional context by training a unidirectional LSTM, which corresponds to only using the forward state $\overrightarrow{z}_t$ to predict the slice. Table 4 shows that this unidirectional model achieves comparable performance to the wav2vec model [25], suggesting that bidirectionality is the largest contributor to DeCoAR’s improved performance.

| Representation | Context | dev93 | eval92 |
|---|---|---|---|
| filterbank | - | 8.21 | 5.44 |
| wav2vec | unidirectional | 6.84 | 3.97 |
| DeCoAR | unidirectional | 6.87 | 3.62 |
| DeCoAR | bidirectional | 6.30 | 3.17 |

Table 4: Comparison of WERs on WSJ after pre-training using unidirectional or bidirectional context on LibriSpeech.

5.3.3 DeCoAR as denoiser

Since our model is trained by predicting masked frames, DeCoAR has the side effect of learning decoder feed-forward networks $\mathrm{FFN}_i$ which reconstruct the $(t+i)$-th filterbank frame from contexts $\overrightarrow{z}_t$ and $\overleftarrow{z}_{t+K}$. In this section, we consider the spectrogram reconstructed by taking the output of $\mathrm{FFN}_i$ at all times $t$.
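The reconstruction procedure described here can be sketched as follows: a single position-dependent predictor is applied at every slice start, and each output is placed at the frame index it predicts. This is an illustrative sketch with 0-indexed time; the states are passed in as lists and `ffn_i` is a hypothetical callable standing in for a trained decoder network.

```python
def reconstruct_with_offset(z_fwd, z_bwd, ffn_i, i, K):
    """Map each predicted frame index t+i to the output of the i-th predictor."""
    T = len(z_fwd)
    recon = {}
    for t in range(T - K):
        # Context: forward state at the slice start, backward state at its end.
        context = z_fwd[t] + z_bwd[t + K]
        recon[t + i] = ffn_i(context)
    return recon


# Toy example: 1-dimensional states and a midpoint predictor for i = 1, K = 2.
states = [[0.0], [1.0], [2.0], [3.0]]
midpoint = lambda c: [(c[0] + c[1]) / 2]
recon = reconstruct_with_offset(states, states, midpoint, i=1, K=2)
```

Stacking the values of `recon` in order of their keys gives the reconstructed spectrogram for that choice of `i`.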

The qualitative result is depicted in Figure 2 where the slice size is 18. We see that when $i = 0$ (i.e., when reconstructing the $t$-th frame from $[\overrightarrow{z}_t; \overleftarrow{z}_{t+K}]$), the reconstruction is almost perfect. However, as soon as one predicts one of the 16 unseen frames, the reconstruction becomes more simplistic, but not by much. Background energy in the silent frames 510-550 is zeroed out. Further into the slice, artifacts begin to occur, such as an erroneous sharp band of energy being predicted around frame 555. This behavior is compatible with recent NLP works that interpret contextual word representations as denoising autoencoders [28].

The surprising ability of DeCoAR to broadly reconstruct a frame in the middle of a missing 16-frame slice suggests that its representations capture longer-term phonetic structure during unsupervised pre-training, as with APC [6]. This motivates its success in the semi-supervised ASR task with only two additional layers, as it suggests DeCoAR learns phonetic representations similar to those likely learned by the first 4 layers of a corresponding end-to-end ASR model.

6 Conclusion

In this paper, we introduce a novel semi-supervised learning approach for automatic speech recognition. We first propose a novel objective for a deep bidirectional LSTM network, where large amounts of unlabeled data are used to learn deep contextualized acoustic representations (DeCoAR). These DeCoAR features are used as the representations of labeled data to train a CTC-based end-to-end ASR model. In our experiments, we show a 42% relative improvement on WSJ compared to a baseline trained on log-mel filterbank features. On LibriSpeech, we achieve similar performance to training on 960 hours of labeled data by pre-training and then using only 100 hours of labeled data. While we use BLSTM-CTC as our ASR model, our approach can be applied to other end-to-end ASR models.


References

  • [1] A. Baevski, S. Schneider, and M. Auli (2019) Vq-wav2vec: self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453.
  • [2] M. K. Baskar, S. Watanabe, R. Astudillo, T. Hori, L. Burget, and J. Černocký (2019) Semi-Supervised Sequence-to-Sequence ASR Using Unpaired Speech and Text. In Interspeech, pp. 3790–3794.
  • [3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007) Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pp. 153–160.
  • [4] J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord (2019) Unsupervised speech representation learning using wavenet autoencoders. arXiv preprint arXiv:1901.08810.
  • [5] Y. Chung and J. Glass (2018) Speech2Vec: a sequence-to-sequence framework for learning word embeddings from speech. In Interspeech, pp. 811–815.
  • [6] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. In Interspeech, pp. 146–150.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186.
  • [8] S. Dey, P. Motlicek, T. Bui, and F. Dernoncourt (2019) Exploiting Semi-Supervised Training Through a Dropout Regularization in End-to-End Speech Recognition. In Interspeech, pp. 734–738.
  • [9] Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536.
  • [10] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376.
  • [11] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29.
  • [12] G. E. Hinton, S. Osindero, and Y. Teh (2006) A fast learning algorithm for deep belief nets. Neural computation 18 (7), pp. 1527–1554.
  • [13] W. Hsu and J. Glass (2018) Extracting domain invariant features by unsupervised learning for robust automatic speech recognition. In ICASSP, pp. 5614–5618.
  • [14] J. Huang and M. Hasegawa-Johnson (2010) Semi-supervised training of gaussian mixture models by conditional entropy minimization. In Interspeech, pp. 1353–1356.
  • [15] Y. Huang, Y. Wang, and Y. Gong (2016) Semi-supervised training in deep learning acoustic model. In Interspeech, pp. 3848–3852.
  • [16] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.
  • [17] S. Karita, S. Watanabe, T. Iwata, A. Ogawa, and M. Delcroix (2018) Semi-supervised end-to-end speech recognition. In Interspeech, pp. 2–6.
  • [18] Y. Liu and K. Kirchhoff (2016) Graph-based semisupervised learning for acoustic modeling in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24 (11), pp. 1946–1956.
  • [19] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney (2019) RWTH ASR systems for LibriSpeech: hybrid vs attention. In Interspeech, pp. 231–235.
  • [20] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur (2018) Semi-supervised training of acoustic models using lattice-free mmi. In ICASSP, pp. 4844–4848.
  • [21] Y. Miao, M. Gowayyed, and F. Metze (2015) EESEN: end-to-end speech recognition using deep RNN models and WFST-based decoding. In ASRU, pp. 167–174.
  • [22] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Interspeech, pp. 2613–2617.
  • [23] S. H. K. Parthasarathi and N. Strom (2019) Lessons from building acoustic models with a million hours of speech. In ICASSP, pp. 6670–6674.
  • [24] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • [25] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Interspeech, pp. 3465–3469.
  • [26] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky (2013) Deep neural network features and semi-supervised training for low resource speech recognition. In ICASSP, pp. 6704–6708.
  • [27] K. Veselý, M. Hannemann, and L. Burget (2013) Semi-supervised training of deep neural networks. In ASRU, pp. 267–272.
  • [28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • [29] D. Yu, B. Varadarajan, L. Deng, and A. Acero (2010) Active learning and semi-supervised learning for speech recognition: a unified framework using the global entropy reduction maximization criterion. Computer Speech & Language 24 (3), pp. 433–444.