DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

by   Shaoshi Ling, et al.

Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition models. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to learn a feature representation. A smaller amount of labeled data is then used to train a downstream ASR system on the new feature representations. Building on our previous work DeCoAR and inspired by other speech representation learning methods, we propose DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We introduce several modifications over DeCoAR: first, we use Transformers in the encoding module instead of LSTMs; second, we introduce a vector quantization layer between the encoder and reconstruction modules; third, we propose an objective that combines the reconstructive loss with a vector quantization diversity loss to train speech representations. Our experiments show consistent improvements over other speech representations in different data-sparse scenarios. Without fine-tuning, a light-weight ASR model trained on 10 hours of LibriSpeech labeled data with DeCoAR 2.0 features outperforms the model trained on the full 960-hour dataset with filterbank features.







1 Introduction

In the long history of semi-supervised learning (SSL) in speech recognition, self-training [34, 14, 31] and knowledge distillation [13], also known as teacher-student model training [18], are the two most commonly used methods. The recent success of representation learning enables a new approach to leveraging unlabeled data. In the natural language processing community, BERT [9], ELMo [26], XLNet [38], GPT [27] and its follow-ups are classical examples of representation learning. The key philosophy of representation learning is self-supervised learning, where we obtain ‘free’ labels from unlabeled data and train on them in a supervised manner via some proxy tasks. In the context of BERT [9], two proxy tasks are defined: a masked language model task and a two-sequence prediction task. These proxy tasks are designed to force the learning of a robust, meaningful representation. After the representation has been learned, a downstream task model is trained on labeled data using the learned representation. Optionally, the representation learning block and the downstream task block can be fine-tuned together.

Learning efficient speech representations can be traced back to the restricted Boltzmann machine [12, 11, 3], which allows pre-training on large amounts of unlabeled data before training deep neural network speech models. More recently, speech representation learning has drawn increasing attention in the speech processing community and has shown promising results in semi-supervised speech recognition [29, 2, 19, 21]. The design of proxy tasks for learning speech representations can be categorized into two types. The first type is based on contrastive loss [32] and has been applied to speech representations such as wav2vec and its variants [29, 1, 2]. The model is trained to learn representations containing the information that best discriminates the future or masked frame from a set of negative samples via contrastive loss. The second type is based on reconstructive loss. The proxy task for these representation learning methods is to reconstruct temporal slices of acoustic features based on contextual information. These reconstruction tasks can be defined as autoregressive reconstruction or mask-based reconstruction. APC [7] and its follow-up [6] are examples that use an autoregressive reconstruction loss. Many state-of-the-art pretrained language models, such as BERT [9] and XLNet [38], adopt mask-based prediction as the proxy task. In speech, instead of prediction, we randomly mask temporal slices of acoustic features and attempt to reconstruct them [16, 30, 35, 22, 5, 21].

Orthogonal to contrastive- and reconstructive-loss based speech representation learning, vector-quantized speech representations have been proposed [32, 20, 1, 2, 8]. One motivation for applying vector quantization (VQ) is that enforcing quantization can lead to better linguistic unit discovery [25, 10] due to the discrete nature of phonetic units. In VQ-APC [8], the authors use VQ as a way to limit model capacity and control the information needed to encode the representation. In VQ-wav2vec [1] and wav2vec 2.0 [2], the authors use VQ to facilitate the direct application of BERT and other NLP algorithms.

In this paper, we introduce DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We take inspiration from many recent advances in speech representation learning and propose multiple improvements over vanilla DeCoAR. We summarize the contributions of this paper as follows:

  • We propose to use a Transformer as the encoding block, replacing the LSTM in vanilla DeCoAR;

  • We present a deep contextualized acoustic representation learning approach with the addition of a vector quantization layer;

  • We propose a new objective function that combines masked-based reconstruction loss with VQ diversity loss.

2 Related Work

2.1 An Overview on DeCoAR

DeCoAR stands for deep contextualized acoustic representations, and was proposed in our previous work [19]. As depicted in Figure 1, DeCoAR consists of two modules, an encoder module and a reconstruction module. For an input speech sequence $X = (x_1, \dots, x_T)$, the encoder module consists of stacked forward and backward LSTMs, and computes a hidden representation that encodes information from both previous and future frames (i.e. $\overrightarrow{z}_t, \overleftarrow{z}_t$). For each temporal slice $(x_t, x_{t+1}, \dots, x_{t+K})$, the reconstruction module takes the concatenated forward state at time $t$ and backward state at time $t+K$ as inputs, and uses position-dependent feed-forward networks to reconstruct each frame. Formally, the DeCoAR objective is defined as follows:

$$L_t = \sum_{i=0}^{K} \left| x_{t+i} - \mathrm{FFN}_i\left([\overrightarrow{z}_t; \overleftarrow{z}_{t+K}]\right) \right|_1$$

where $\mathrm{FFN}_i$ is a position-dependent feed-forward network to reconstruct the $i$-th frame in the slice. The final loss is calculated over all possible slices in the entire sequence in an autoregressive manner, defined as: $L = \sum_{t=1}^{T-K} L_t$.
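The slice-wise objective can be illustrated with a minimal pure-Python sketch. It assumes the forward and backward states are already computed and that each position-dependent network is passed in as a plain callable; the names `decoar_loss`, `fwd_states`, `bwd_states` and `ffns` are hypothetical and not taken from the DeCoAR code base.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def decoar_loss(
    frames: List[Vector],
    fwd_states: List[Vector],
    bwd_states: List[Vector],
    ffns: List[Callable[[Vector], Vector]],
) -> float:
    """Sum the slice losses over every slice (x_t, ..., x_{t+K}).

    ffns holds K + 1 callables, one per position in the slice
    (hypothetical stand-ins for the position-dependent networks).
    """
    K = len(ffns) - 1
    T = len(frames)
    total = 0.0
    for t in range(T - K):  # 0-based start index of each slice
        # Concatenate the forward state at t with the backward state at t + K.
        context = fwd_states[t] + bwd_states[t + K]
        for i, ffn in enumerate(ffns):
            total += l1_distance(frames[t + i], ffn(context))
    return total
```

With `T` frames and slices of `K + 1` frames, the outer loop visits `T - K` slices, which matches the range of the final sum over slices.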

Figure 1: Illustration of DeCoAR.

2.2 Vector-quantized Representation Learning

2.2.1 wav2vec 2.0

Wav2vec 2.0 [2] is one of the successful examples of representation learning. It uses 10 minutes of labeled data with 53k hours of unlabeled data to achieve a word error rate (WER) of 5.2%/8.6% on the LibriSpeech benchmark. The model relies on a diverse codebook, learned via contrastive loss, that relates the underlying speech units to representations. Discretizing the continuous representation enables the application of many state-of-the-art NLP algorithms. In wav2vec 2.0, after applying VQ operations, the model is trained using a masked-LM-style loss, similar to BERT.

One potential challenge in learning optimal codebooks with contrastive loss is posed by data with nuisance factors such as noise and other adverse conditions. In these cases, the codebook can be trivially optimized by assigning acoustic condition (e.g. voice activity, noise) to the codebook. A potential work-around is to use frame reconstruction as objective so that the network can leverage all available information of the input feature to guide the learning of a robust representation.

2.2.2 VQ-APC

VQ-APC [8] introduced a novel approach that inserts a VQ layer before frame prediction. The motivation for using VQ is to quantify the information needed to encode the speech representation and to control the capacity of the model. The model uses autoregressive predictive coding (APC) as its objective, instead of contrastive predictive coding (CPC). Their experiments showed that the APC/reconstruction objective performed better than the CPC/contrastive objective under the same conditions. They also demonstrated that the learned VQ codes correlate highly with the phoneme path, suggesting that VQ can be used to capture linguistic units in an implicit way.

Figure 2: Illustration of the DeCoAR 2.0 framework. The left side shows the architecture of the speech representation model trained on unlabeled data. The right side shows an example of using labeled data with the learned speech representation. Note that the quantization and reconstruction modules are removed, and a frozen encoder is attached to a downstream ASR model (such as an acoustic model in a hybrid-based system, or an end-to-end ASR system). Only the parameters in the ASR model block are trained.

3 Proposed Framework

DeCoAR 2.0 is a follow-up to DeCoAR that draws on recent advances in natural language and speech representation learning. The left side of Figure 2 illustrates the proposed DeCoAR 2.0 architecture. The model consists of three modules. The first module is the encoder network, which maps the masked input acoustic frames $X$ into a latent representation $Z$ via multiple Transformer blocks. The second module is the vector quantization network, which maps the latent representation $Z$ to a new quantized representation $Q$. The last module, the reconstruction network, passes the quantized representation through a feed-forward network and reconstructs the original input frames as $Y$. We describe the design of each module and its training criterion in the following sections.

3.1 Encoder Module

We replace the forward/backward LSTMs with a Transformer because of its superiority in modeling long context [33, 36]. While an RNN/LSTM can model long context in theory, the Transformer achieves better performance thanks to its multi-head attention mechanism, which captures the relationship between any arbitrary pair of samples in a long input sequence. In our encoder, we use a 1D convolutional layer with a kernel size of 256 and 16 groups. This performs an implicit relative positional encoding, as pointed out in [24]. The convolution is followed by a Gaussian Error Linear Unit (GELU) and layer normalization. The output is then fed into the deep Transformer encoder network, which produces a sequence of hidden vectors $z_1, \dots, z_T$.

In our masking strategy, we mask a proportion of the feature frames and replace them with a trainable feature vector. We randomly sample start indices and mask a span of consecutive time steps from every sampled index; spans do not overlap, and around 40% of the frames are masked in total.
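A minimal sketch of one way to sample non-overlapping mask spans follows. The sampling procedure itself is an assumption, since the text only states that spans do not overlap and that about 40% of frames are masked; `sample_span_mask`, `span_len` and `rng` are hypothetical names.

```python
import random
from typing import List, Optional


def sample_span_mask(
    num_frames: int,
    span_len: int,
    mask_ratio: float = 0.4,
    rng: Optional[random.Random] = None,
) -> List[bool]:
    """Return a boolean mask with non-overlapping spans of span_len frames.

    Assumed procedure: fix the number of spans from mask_ratio, then
    spread the remaining unmasked frames randomly around the spans.
    """
    rng = rng or random.Random()
    num_spans = round(mask_ratio * num_frames / span_len)
    free = num_frames - num_spans * span_len  # frames left unmasked
    # Sorted offsets guarantee consecutive spans start >= span_len apart.
    offsets = sorted(rng.randint(0, free) for _ in range(num_spans))
    mask = [False] * num_frames
    for i, offset in enumerate(offsets):
        start = offset + i * span_len
        for t in range(start, start + span_len):
            mask[t] = True
    return mask


mask = sample_span_mask(1000, span_len=20, rng=random.Random(0))
print(sum(mask) / len(mask))  # fraction of masked frames
```

In a real model the masked positions would then be replaced by the trainable feature vector before the frames enter the encoder; that step is omitted here.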

3.2 Quantization Module

We introduce a quantization module in the DeCoAR 2.0 framework. The quantization module takes the latent representation $z$ from the encoder module and maps it to a new representation $q$. This is done by selecting one entry from a fixed-size codebook $C = \{c_1, \dots, c_V\}$, where $V$ is the size of the codebook, and applying a linear transformation to obtain $q$. Selecting an entry in a discrete codebook is not differentiable. To mitigate the problem, we use the Gumbel-Softmax with the reparameterization trick. In line with VQ-wav2vec [1], wav2vec 2.0 [2] and VQ-APC [8], we use the straight-through Gumbel-Softmax estimator [15].


In our quantization module, we use multiple codebooks [2] to obtain quantized representations. Formally, given the latent representation $z$ from the encoder module and a set of codebooks $\{C_1, \dots, C_G\}$, where $G$ is the number of codebooks with $V$ entries in each codebook, we select one entry from each codebook and stack the resulting vectors, followed by a linear transformation, to obtain the new representation $q$. In order to train which entry to select, we map the encoder output $z$ to logits $l \in \mathbb{R}^{G \times V}$ via a linear layer, and the probability of selecting the $j$-th code in the $g$-th codebook is defined as follows:

$$p_{g,j} = \frac{\exp\left((l_{g,j} + n_j)/\tau\right)}{\sum_{k=1}^{V} \exp\left((l_{g,k} + n_k)/\tau\right)}$$

where $\tau$ is the softmax temperature, $n = -\log(-\log(u))$, and $u$ are uniformly sampled from $\mathcal{U}(0, 1)$. In inference, the index with the largest value in the logits is selected from each codebook.
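The selection rule can be sketched in plain Python as follows. This is an illustrative sketch only: it computes the Gumbel-Softmax probabilities for one codebook and the inference-time argmax, and omits the straight-through gradient, which needs an autodiff framework. The function names are hypothetical.

```python
import math
import random
from typing import List


def gumbel_softmax_probs(
    logits: List[float], tau: float, rng: random.Random
) -> List[float]:
    """Selection probabilities over the entries of one codebook.

    Adds Gumbel noise n = -log(-log(u)), u ~ U(0, 1), to each logit and
    applies a softmax with temperature tau.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-10), 1.0 - 1e-10)
        noise = -math.log(-math.log(u))
        noisy.append((logit + noise) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in noisy]
    norm = sum(exps)
    return [e / norm for e in exps]


def select_codes_inference(logits_per_codebook: List[List[float]]) -> List[int]:
    """At inference, pick the index of the largest logit in each codebook."""
    return [
        max(range(len(logits)), key=logits.__getitem__)
        for logits in logits_per_codebook
    ]


probs = gumbel_softmax_probs([0.1, 2.0, -1.0], tau=2.0, rng=random.Random(0))
codes = select_codes_inference([[0.1, 2.0, -1.0], [3.0, 0.0, 1.0]])
```

Lower temperatures push the sampled distribution toward a one-hot choice, which is why the temperature is typically annealed during training.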

3.3 Training Objective

The training objective consists of two parts. The first objective is the reconstruction loss. We use the $\ell_1$ loss between the acoustic feature vector $x_t$ at time $t$ and the reconstruction $y_t$ predicted at time $t$ for all masked indices $t \in M$, defined as $L_r = \sum_{t \in M} \left| x_t - y_t \right|_1$. We use the $\ell_1$ loss as it is less sensitive to outliers.

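Since vector quantization layers are known to significantly disrupt model training, we apply the diversity loss proposed in wav2vec 2.0 [2] to encourage the equal use of all entries in each codebook. The diversity loss maximizes the entropy of the averaged softmax distribution over the entries for each codebook in each mini-batch. Formally, the diversity loss is defined as:

$$L_d = \frac{1}{GV} \sum_{g=1}^{G} -H(\bar{p}_g) = \frac{1}{GV} \sum_{g=1}^{G} \sum_{j=1}^{V} \bar{p}_{g,j} \log \bar{p}_{g,j}$$

where $G$ is the number of codebooks, $V$ is the number of entries per codebook, and $\bar{p}_g$ is the softmax distribution over the entries of codebook $g$ averaged over the mini-batch.

Our final training objective is a combination of the two loss functions, weighted by a hyperparameter $\alpha$:

$$L = L_r + \alpha L_d$$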
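As an illustration of how the diversity term behaves, the sketch below computes the negative entropy of the batch-averaged softmax distribution for each codebook and combines it with a reconstruction loss. The nested-list layout `probs[b][g][j]` and the function names are assumptions made for this example.

```python
import math
from typing import List


def diversity_loss(probs: List[List[List[float]]]) -> float:
    """Negative entropy of the batch-averaged distribution, per codebook.

    probs[b][g][j] is the softmax probability of entry j in codebook g
    for frame b of the mini-batch (assumed layout).
    """
    B, G, V = len(probs), len(probs[0]), len(probs[0][0])
    total = 0.0
    for g in range(G):
        for j in range(V):
            p_bar = sum(probs[b][g][j] for b in range(B)) / B
            if p_bar > 0.0:  # treat 0 * log(0) as 0
                total += p_bar * math.log(p_bar)
    return total / (G * V)


def total_loss(recon_loss: float, div_loss: float, alpha: float) -> float:
    """Reconstruction loss plus the alpha-weighted diversity loss."""
    return recon_loss + alpha * div_loss


# Uniform code usage reaches the minimum; collapsed usage gives 0.
uniform = [[[0.25, 0.25, 0.25, 0.25]] for _ in range(8)]
collapsed = [[[1.0, 0.0, 0.0, 0.0]] for _ in range(8)]
print(diversity_loss(uniform), diversity_loss(collapsed))
```

Because the loss is lowest when every entry is used equally often on average, minimizing it discourages the model from collapsing onto a few codes.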



3.4 Semi-supervised Speech Recognition with DeCoAR 2.0

After we have pre-trained the DeCoAR 2.0 model on unlabeled data, we freeze all the parameters in the network and remove the quantization and reconstruction modules. The representations from the Transformer encoder module are then fed to a downstream ASR system. This ASR system can be either a conventional acoustic model in a hybrid-based ASR system, or an end-to-end speech recognition model such as an RNN-Transducer [28, 17] or an Encoder-Decoder based model [4, 37]. Note that in our framework, we only train the parameters of the downstream ASR model and leave all parameters in the encoder module fixed (i.e. no backpropagation into any layer of the encoder module).

Representation     Encoder Model     1 hour                   10 hours                 100 hours                960 hours
                                     test-clean  test-other   test-clean  test-other   test-clean  test-other   test-clean  test-other
filterbank         -                 50.90       78.66        17.45       47.18        9.36        30.20        5.82        14.50
wav2vec 2.0 [2]    12 Transformer    13.63       29.97        5.63        13.39        5.10        11.94        -           -
VQ-APC [8]         3 uni-GRU         28.66       61.12        12.38       32.28        7.42        23.38        -           -
DeCoAR [19]        4 bi-LSTM         17.93       38.38        10.40       27.41        6.10        17.43        -           -
DeCoAR 2.0         12 Transformer    13.75       29.13        5.43        13.27        5.02        12.07        -           -
Table 1: Semi-supervised LibriSpeech results (WER, %).

4 Experimental Setup and Results

Our experiments were conducted on the publicly available LibriSpeech dataset. To simulate different SSL scenarios, we varied the labeled data size from 1 hour to 10 hours and up to 100 hours. The 100-hour dataset is based on the train-clean-100 split, and the 1-hour/10-hour subsets are randomly selected from it.

4.1 Pretrain DeCoAR 2.0 Model using Unlabeled Data

To train the DeCoAR 2.0 model, we used the entire 960 hours of the LibriSpeech dataset as unlabeled data. We followed conventional frontend feature extraction and used 80-dimensional log-mel filterbank features, which were extracted with a 25ms sliding window at a 10ms frame rate. The features were normalized via mean subtraction and variance normalization on a per-speaker basis.
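The framing arithmetic and the per-speaker normalization can be sketched as below. This is a hedged illustration, not the authors' frontend: the frame count assumes no padding at the utterance edges, and `num_frames` and `per_speaker_cmvn` are hypothetical helper names.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Frames = List[List[float]]


def num_frames(duration_ms: int, window_ms: int = 25, shift_ms: int = 10) -> int:
    """Number of full windows in an utterance (assumes no edge padding)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms


def per_speaker_cmvn(utts: List[Tuple[str, Frames]]) -> List[Tuple[str, Frames]]:
    """Mean and variance normalization with statistics pooled per speaker."""
    pooled: Dict[str, Frames] = defaultdict(list)
    for speaker, frames in utts:
        pooled[speaker].extend(frames)
    stats = {}
    for speaker, frames in pooled.items():
        n, dim = len(frames), len(frames[0])
        mean = [sum(f[d] for f in frames) / n for d in range(dim)]
        var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
        # Fall back to 1.0 for constant dimensions to avoid dividing by zero.
        std = [math.sqrt(v) if v > 0.0 else 1.0 for v in var]
        stats[speaker] = (mean, std)
    normalized = []
    for speaker, frames in utts:
        mean, std = stats[speaker]
        normalized.append(
            (speaker, [[(f[d] - mean[d]) / std[d] for d in range(len(f))] for f in frames])
        )
    return normalized


print(num_frames(1000))  # frames in a one-second utterance under these assumptions
```

Pooling the statistics over all utterances of a speaker, rather than per utterance, is what makes the normalization "per-speaker".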

For the encoder network in DeCoAR 2.0, we used 12 Transformer blocks, each consisting of a multi-head self-attention sublayer followed by a feed-forward sublayer. For a fair comparison, we set the model dimension to 768 and the inner dimension of the feed-forward sublayer to 3072, with 8 attention heads, as used in the wav2vec 2.0 base model. The slice size was set to 20. We optimized the network with Adam and used learning rate warm-up for the first 32000 updates to a peak of 0.0003, and then linearly decayed it. We grouped the input sequences by length with a batch size of 128 (we capped the maximum length at 15 seconds), and trained the models on 16 GPUs for 150 epochs. The Gumbel-Softmax temperature $\tau$ is annealed from 2 to a minimum of 0.5 by a factor of at every update. We use weight for the diversity loss and we set and for the quantization module.
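The two schedules described above can be sketched as follows. Several details are assumptions: the warm-up is taken to be linear, the total number of updates (`total_updates`) is not stated in the text, and the per-update temperature decay factor (`decay`) is not given in the text above, so both must be supplied by the caller.

```python
def learning_rate(
    step: int,
    total_updates: int,  # hypothetical: not stated in the text
    peak: float = 3e-4,
    warmup: int = 32000,
) -> float:
    """Warm up to the peak rate (assumed linear), then decay linearly to zero."""
    if step < warmup:
        return peak * step / warmup
    remaining = max(total_updates - step, 0)
    return peak * remaining / (total_updates - warmup)


def temperature(
    step: int,
    decay: float,  # hypothetical: the per-update factor is not given above
    start: float = 2.0,
    minimum: float = 0.5,
) -> float:
    """Multiply the temperature by decay at every update, floored at minimum."""
    return max(start * decay ** step, minimum)


# Illustrative values only; neither number comes from the paper.
print(learning_rate(16000, total_updates=400000))
print(temperature(1000, decay=0.9999))
```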

4.2 Semi-supervised Speech Recognition Experimental Results

We trained acoustic models using CTC loss on labeled data as downstream tasks. Unlike conventional HMM-based hybrid ASR, training an acoustic model with CTC loss removes the need to prepare frame-wise alignments and other tedious steps such as preparing state-tying trees. The CTC label set consisted of 71 phonemes derived from the CMU lexicon, plus one blank symbol. For decoding, we used WFST-based decoding with EESEN [23]. CTC labels, lexicons and a 4-gram language model for LibriSpeech were composed into a WFST-based decoding graph. We set the acoustic model scale to , and the blank symbol prior scale to . We used dev-clean for validation and test-clean, test-other for evaluation.

We trained different ASR systems for comparison, using different acoustic representations: wav2vec 2.0 features [2], VQ-APC features [8], our previously proposed DeCoAR features [19], and the DeCoAR 2.0 features proposed in this work. For wav2vec 2.0 features [29], we obtained 768-dimensional representations from the publicly available wav2vec 2.0 base model, which was pre-trained on 960-hour LibriSpeech data with contrastive loss and had exactly the same encoding network as ours. For VQ-APC features, we trained a VQ-APC model on 960-hour LibriSpeech using the official code provided by the authors. We obtained 512-dimensional VQ-APC representations as input features. DeCoAR and DeCoAR 2.0 have dimensionality of 2048 and 768, respectively. For all systems trained on learned speech representations, the downstream ASR model is 2 layers of bidirectional LSTMs with CTC loss. In line with our previous work [19], we also trained purely supervised systems using conventional filterbank features. These models use 6 layers of bidirectional LSTMs with CTC loss. We also trained a purely supervised system on the entire 960-hour dataset using filterbank features as a baseline.

Table 1 shows the results of the semi-supervised LibriSpeech experiments. We conducted our semi-supervised experiments using 1 hour, 10 hours, and 100 hours of training data. Our proposed approach significantly outperforms the purely supervised filterbank baselines. In particular, under extremely data-sparse conditions, the proposed DeCoAR 2.0 method achieved highly competitive performance, with a WER of 5.43%/13.27% on test-clean/test-other using 10 hours of labeled data, and a WER of 13.75%/29.13% on test-clean/test-other using only 1 hour of labeled data. One notable observation is that using 10 hours of labeled data already outperforms the system trained on the full 960-hour data with filterbank features, by 6.7%/8.5% relative WER on test-clean/test-other.
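The relative improvements quoted above follow directly from the Table 1 entries. A small sketch of the arithmetic, with `relative_improvement` as a hypothetical helper name:

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# 960-hour filterbank baseline vs. DeCoAR 2.0 with 10 hours of labels (Table 1).
clean = relative_improvement(5.82, 5.43)
other = relative_improvement(14.50, 13.27)
print(round(clean, 1), round(other, 1))  # 6.7 8.5
```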

Among the different speech representations, wav2vec 2.0 and DeCoAR 2.0 performed favorably compared to VQ-APC and DeCoAR. DeCoAR 2.0 is also comparable to wav2vec 2.0 in all SSL conditions. It is worth noting that we did not fine-tune any of the representation learning layers, as these models were trained in different stacks. Our interest is in comparing performance when directly using the speech representations produced by different pre-trained models.

We conduct an ablation study to investigate the effect of inserting VQ layer in DeCoAR 2.0 in Table 2, and confirm the VQ module is beneficial for ASR tasks. We hypothesize that vector quantization forces the DeCoAR model to reduce the model capacity and focus more on informative factors such as linguistic/phonetic unit discovery and less so on other factors such as speaker traits, acoustic condition.

VQ     test-clean   test-other
no     6.29         18.54
yes    5.43         13.27
Table 2: Ablation on the effect of using the VQ layer in the LibriSpeech 10-hour SSL experiment (WER, %).

5 Conclusion

In this paper, we present the vector-quantized Deep Contextualized Acoustic Representation (DeCoAR 2.0), an improved speech representation learning approach based on DeCoAR and vector quantization. DeCoAR 2.0 makes multiple modifications to its predecessor, with a deep Transformer as the encoding block and the addition of a vector quantization module before the reconstruction module. In extremely data-limited semi-supervised conditions, we observe that using 10 hours of labeled data with DeCoAR 2.0 achieved performance on par with the system trained on 960 hours of conventional filterbank features. DeCoAR 2.0 also performed comparably to wav2vec 2.0 in all semi-supervised scenarios. Future work includes exploring the efficacy of representation learning on real-world data, including noisy and adverse conditions, and extending to neural transducers [28, 17] and other end-to-end ASR systems as downstream tasks.


  • [1] A. Baevski, S. Schneider, and M. Auli (2020) vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. In Proc. ICLR, Cited by: §1, §1, §3.2.
  • [2] A. Baevski, H. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477. Cited by: §1, §1, §2.2.1, §3.2, §3.2, §3.3, Table 1, §4.2.
  • [3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007) Greedy Layer-wise Training of Deep Networks. In Proc. NeurIPS, Cited by: §1.
  • [4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In Proc. ICASSP, Cited by: §3.4.
  • [5] P. Chi, P. Chung, T. Wu, C. Hsieh, S. Li, and H. Lee (2020) Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation. arXiv preprint arXiv:2005.08575. Cited by: §1.
  • [6] Y. Chung and J. Glass (2020) Improved Speech Representations with Multi-target Autoregressive Predictive Coding. In Proc. ACL, Cited by: §1.
  • [7] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An Unsupervised Autoregressive Model for Speech Representation Learning. In Proc. Interspeech, Cited by: §1.
  • [8] Y. Chung, H. Tang, and J. Glass (2020) Vector-Quantized Autoregressive Predictive Coding. In Proc. Interspeech, Cited by: §1, §2.2.2, §3.2, Table 1, §4.2.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. NAACL-HLT, Cited by: §1, §1.
  • [10] D. Harwath, W. Hsu, and J. Glass (2020) Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech. In Proc. ICLR, Cited by: §1.
  • [11] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal processing magazine 29. Cited by: §1.
  • [12] G. E. Hinton, S. Osindero, and Y. Teh (2006) A Fast Learning Algorithm for Deep Belief Nets. Neural computation. Cited by: §1.
  • [13] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531. Cited by: §1.
  • [14] Y. Huang, Y. Wang, and Y. Gong (2016) Semi-supervised Training in Deep Learning Acoustic Model. In Proc. Interspeech, pp. 3848–3852. Cited by: §1.
  • [15] E. Jang, S. Gu, and B. Poole (2017) Categorical Reparameterization with Gumbel-Softmax. In Proc. ICLR, Cited by: §3.2.
  • [16] D. Jiang, X. Lei, W. Li, N. Luo, Y. Hu, W. Zou, and X. Li (2019) Improving Transformer-based Speech Recognition using Unsupervised Pre-training. arXiv preprint arXiv:1910.09932. Cited by: §1.
  • [17] J. Li, R. Zhao, H. Hu, and Y. Gong (2019) Improving RNN Transducer Modeling for End-to-end Speech Recognition. In Proc. ASRU, Cited by: §3.4, §5.
  • [18] J. Li, R. Zhao, J. Huang, and Y. Gong (2014) Learning Small-size DNN with Output-distribution-based Criteria. In Proc. Interspeech, Cited by: §1.
  • [19] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff (2020) Deep Contextualized Acoustic Representations for Semi-supervised Speech Recognition. In Proc. ICASSP, Cited by: DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization, §1, §2.1, Table 1, §4.2.
  • [20] A. H. Liu, T. Tu, H. Lee, and L. Lee (2020) Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning. In Proc. ICASSP, Cited by: §1.
  • [21] A. T. Liu, S. Li, and H. Lee (2020) TERA: Self-supervised Learning of Transformer Encoder Representation for Speech. arXiv preprint arXiv:2007.06028. Cited by: §1.
  • [22] A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee (2020) Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders. In Proc. ICASSP, Cited by: §1.
  • [23] Y. Miao, M. Gowayyed, and F. Metze (2015) EESEN: End-to-end Speech Recognition using Deep RNN Models and WFST-based Decoding. In Proc. ASRU, Cited by: §4.2.
  • [24] A. Mohamed, D. Okhonko, and L. Zettlemoyer (2019) Transformers with Convolutional Context for ASR. arXiv preprint arXiv:1904.11660. Cited by: §3.1.
  • [25] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748. Cited by: §1.
  • [26] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep Contextualized Word Representations. In Proc. NAACL-HLT, Cited by: §1.
  • [27] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving Language Understanding by Generative Pre-training. Cited by: §1.
  • [28] K. Rao, H. Sak, and R. Prabhavalkar (2017) Exploring Architectures, Data and Units for Streaming End-to-end Speech Recognition with RNN-Transducer. In Proc. ASRU, Cited by: §3.4, §5.
  • [29] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) wav2vec: Unsupervised Pre-training for Speech Recognition. In Proc. Interspeech, Cited by: §1, §4.2.
  • [30] X. Song, G. Wang, Z. Wu, Y. Huang, D. Su, D. Yu, and H. Meng (2020) Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-attention Networks. In Proc. Interspeech, Cited by: §1.
  • [31] G. Synnaeve, Q. Xu, J. Kahn, E. Grave, T. Likhomanenko, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert (2019) End-to-end ASR: from Supervised to Semi-supervised Learning With Modern Architectures. arXiv preprint arXiv:1911.08460. Cited by: §1.
  • [32] A. Van Den Oord, O. Vinyals, et al. (2017) Neural Discrete Representation Learning. In Proc. NeurIPS, Cited by: §1, §1.
  • [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is All You Need. In Proc. NeurIPS, Cited by: §3.1.
  • [34] K. Veselỳ, M. Hannemann, and L. Burget (2013) Semi-supervised Training of Deep Neural Networks. In Proc. ASRU, Cited by: §1.
  • [35] W. Wang, Q. Tang, and K. Livescu (2020) Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction. In Proc. ICASSP, Cited by: §1.
  • [36] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, et al. (2020) Transformer-based Acoustic Modeling for Hybrid Speech Recognition. In Proc. ICASSP, Cited by: §3.1.
  • [37] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi (2017) Hybrid CTC/attention Architecture for End-to-end Speech Recognition. IEEE Journal of Selected Topics in Signal Processing. Cited by: §3.4.
  • [38] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proc. NeurIPS, Cited by: §1, §1.