Many high-level properties of speech, e.g., phonetic content and speaker characteristics, are not easily accessible without sufficiently powerful transformations from surface features such as audio waveforms and spectrograms. (Following other recent works [1, 2, 3], in this study we use linear separability, or separability with a shallow network, to define the accessibility of information for a downstream task.) Speech representation learning aims to search for a transformation from the surface features that better reveals these properties to downstream tasks.
Recently, self-supervised learning—a paradigm that treats the input itself or modifications of the input as learning targets—has obtained promising results for learning such transformations [1, 2, 4, 5, 6, 7, 8, 9, 3, 10]. These methods, mostly inspired by the techniques for pre-training NLP models [11, 12, 13], learn speech representations by either inferring future information conditioned on historical audio [1, 2], or predicting masked parts of input audio [9, 3]. The resulting representations, obtained without requiring any additional labeled data, are able to outperform the surface features across downstream tasks such as speech recognition, speech translation, speaker identification, and emotion recognition [14, 15].
Autoregressive Predictive Coding (APC) [2] is one of the recent self-supervised speech representation models. APC defines a prediction task that trains an autoregressive neural model (e.g., a unidirectional RNN) to predict a future frame given the past context. Although the learned representations contain highly accessible phonetic and speaker information, the reason why this seemingly unrelated self-supervised objective produces such a representation remains unclear. In this work, we aim to provide an explanation by investigating the constituents that lead to low objective values and connecting them with the properties of the learned representations.
Our approach is to study the properties of the learned representation as we limit the model capacity. Models with limited capacity are forced to retain only the information that contributes most to the prediction task, thereby allowing us to study the constituents of the task and the learned representations. Several options are available to obtain a spectrum of models with different capacity, including reducing the number of layers, reducing the hidden layer size, or enforcing a bottleneck along the feed-forward process. The impact of different numbers of hidden layers has been studied in prior work. However, it is difficult to quantify the amount of information when changing the number of layers, changing the hidden layer size, or using low-rank matrices to enforce bottlenecks. In this work, we study the use of vector quantization (VQ), where the amount of information (i.e., bits required to transmit the codebook and the sequence of codes) can be exactly quantified, to control the capacity of the models.
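To make the "bits required to transmit the codebook and the sequence of codes" concrete, the following is a minimal sketch of the bookkeeping. The function names, the fixed-length index encoding, and the 32-bit float assumption are illustrative choices, not details taken from the paper.

```python
import math


def code_sequence_bits(num_frames: int, codebook_size: int) -> int:
    """Bits to send one code index per frame with a fixed-length encoding."""
    bits_per_code = math.ceil(math.log2(codebook_size))
    return num_frames * bits_per_code


def codebook_bits(codebook_size: int, code_dim: int, bits_per_float: int = 32) -> int:
    """Bits to send the codebook itself (assumes 32-bit floats per entry)."""
    return codebook_size * code_dim * bits_per_float


# Codebook sizes and code dimension taken from the experiments in this paper.
for size in (2048, 512, 64):
    print(size, code_sequence_bits(1000, size), codebook_bits(size, 512))
```

Under these assumptions, shrinking the codebook from 2048 to 64 entries lowers the per-frame cost from 11 bits to 6 bits, which is the sense in which the capacity of the bottleneck is exactly quantifiable.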
Recent studies on VQ for representation learning, mostly motivated by the discrete nature of phonetic units, attempt to show that enforcing the quantization leads to a better representation for acoustic unit discovery [16, 17, 7, 18]. In contrast, our goal is not to discover the discrete units in speech. We treat VQ as a general approach to limit the model capacity, and study its impact on the information encoded in the learned representations.
2 Proposed Models
2.1 A review on APC
Given a sequence of acoustic feature vectors $(x_1, x_2, \dots, x_t)$ as context, APC incorporates an autoregressive neural model $g_{\mathrm{AR}}$, e.g., a unidirectional RNN or a Transformer decoder [19, 12], to summarize the sequence for predicting a future frame $x_{t+n}$ that is $n$ steps ahead of $x_t$. Let $y_t$ denote the prediction of $g_{\mathrm{AR}}$ at time $t$. In practice, for a speech utterance $(x_1, x_2, \dots, x_T)$, where $T$ is the sequence length, $g_{\mathrm{AR}}$ is trained by minimizing the frame-wise $L_1$ loss between the predicted sequence $(y_1, \dots, y_{T-n})$ and the target sequence $(x_{1+n}, \dots, x_T)$:
$$\sum_{i=1}^{T-n} \lVert x_{i+n} - y_i \rVert_1 \qquad (1)$$
Once $g_{\mathrm{AR}}$ is trained, one can extract features by taking its hidden representations, e.g., the last layer output, to replace the surface features as the new input to downstream models.
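The frame-wise objective can be sketched in a few lines. This is a minimal illustration, assuming an L1 distance between each prediction and the frame $n$ steps ahead; the function name `apc_l1_loss` and the list-of-lists data layout are hypothetical, and a real implementation would use a tensor library.

```python
from typing import Sequence


def apc_l1_loss(frames: Sequence[Sequence[float]],
                predictions: Sequence[Sequence[float]],
                n: int) -> float:
    """Sum of |x_{i+n} - y_i| over all positions i that have a target n steps ahead."""
    total = 0.0
    for i in range(len(frames) - n):
        target = frames[i + n]   # the frame n steps in the future
        pred = predictions[i]    # the model output at position i
        total += sum(abs(t - p) for t, p in zip(target, pred))
    return total


# Toy 1-dimensional utterance; the "model" here simply copies its current input.
frames = [[0.0], [1.0], [2.0], [3.0]]
copy_predictions = frames
print(apc_l1_loss(frames, copy_predictions, n=1))  # 3.0
print(apc_l1_loss(frames, copy_predictions, n=2))  # 4.0
```

The last $n$ predictions have no target inside the utterance, which is why the sum stops at $T - n$.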
Figure 1 illustrates the VQ-APC architecture, which is based upon APC with additional quantization layer(s). Assume $g_{\mathrm{AR}}$ has $L$ layers. Let $g^{(l)}$ denote the $l$-th layer of $g_{\mathrm{AR}}$. After feeding $(x_1, \dots, x_T)$ to $g_{\mathrm{AR}}$, each $g^{(l)}$ produces a sequence of hidden vectors $(h^{(l)}_1, \dots, h^{(l)}_T)$. Next, we add a vector quantization (VQ) layer that replaces each $h^{(l)}_t$ by $v^{(l)}_t$, where $v^{(l)}_t$ is one of the elements in a codebook $\{e_1, \dots, e_V\}$. We then pass the resulting hidden vectors to the next layer and continue the feed-forward process.
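We use the Gumbel-Softmax with the straight-through estimator for selecting discrete codebook variables in a fully differentiable way. Specifically, we apply a linear layer to map the hidden vector $h^{(l)}_t$ to a vector of logits $z \in \mathbb{R}^V$, where $V$ is the codebook size. At test time, we simply choose the index of the largest entry in $z$. At training time, the probability of selecting the $j$-th code variable is computed as follows:
$$p_j = \frac{\exp\left((z_j + \eta_j)/\tau\right)}{\sum_{k=1}^{V} \exp\left((z_k + \eta_k)/\tau\right)} \qquad (2)$$
where $\eta_j = -\log(-\log(u_j))$ and $u_j$ is uniformly sampled from $\mathcal{U}(0, 1)$. The temperature $\tau$ controls how close the approximation is to argmax. During the forward pass, the argmax code $e_i$ where $i = \arg\max_j p_j$ is chosen; during the backward pass, the true gradients of the Gumbel-Softmax outputs are used. The training objective is the same as APC (Equation 1).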
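The selection step can be sketched without any deep learning library. The snippet below is a minimal illustration of the forward computation only: it draws Gumbel noise, applies the temperature-scaled softmax, and picks the argmax code. The function names are hypothetical, and the straight-through backward pass is not shown because it needs automatic differentiation.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax_probs(logits: Sequence[float], tau: float,
                         rng: random.Random) -> List[float]:
    """Selection probabilities from noisy, temperature-scaled logits."""
    scaled = []
    for z in logits:
        # Keep u strictly inside (0, 1) so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        scaled.append((z + gumbel_noise) / tau)
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda j: values[j])


logits = [0.2, 1.5, -0.3, 0.9]  # toy output of the linear layer, 4 codes
probs = gumbel_softmax_probs(logits, tau=0.1, rng=random.Random(0))
train_index = argmax(probs)   # code picked in the forward pass during training
test_index = argmax(logits)   # code picked at test time, without noise
```

With a small temperature such as 0.1, the probabilities concentrate on a single code, so the soft distribution used for gradients stays close to the hard choice made in the forward pass.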
The codebook size and the code dimension of a VQ layer control the amount of information from the previous layer flowing to the next, and changing either of these two factors allows us to explicitly control the capacity of the models. As we limit the model capacity, the model is forced to retain only the information that contributes most to predicting future frames. By studying a sequence of increasingly limited models, we are able to reveal the constituents of the prediction task and the learned representations.
We conduct experiments to study the properties of the learned representations and their connection to the self-supervised objective. Since VQ layers are known to significantly disrupt model training, we first examine where VQ layer(s) should be inserted. Next, by using phonetic and speaker classification as probing tasks, we show the model’s preference in preserving speech information as its capacity becomes constrained. We then visualize the learned VQ codes to show the presence and absence of phonetic information and the correspondence between codes and phones. Finally, we compare VQ-APC with other self-supervised speech representation models.
Training of self-supervised models. All self-supervised models, including the VQ-APC variants and other models to be compared, are trained on the clean 360-hour subset of LibriSpeech [21]. We use 80-dimensional log Mel spectrograms (normalized to zero mean and unit variance per speaker) as input features, that is, $x_t \in \mathbb{R}^{80}$
, and train all models for 100 epochs using Adam with a batch size of 32 and an initial learning rate of .
Probing tasks. We consider phonetic and speaker classification for measuring the accessibility of the phonetic and speaker information contained in the representations, respectively. Both experiments are carried out on the Wall Street Journal corpus (WSJ) [23]. For phonetic classification, the goal is to correctly classify each frame in an utterance into one of the 42 phones. The phone alignments are generated with a speaker-adapted GMM-HMM model. We follow the standard split of WSJ, using 90% of si284 for training, the remaining 10% for validation, and reporting phone error rate on dev93. For speaker classification, the goal is to correctly predict the speaker identity of an utterance. Following prior work, we consider a 259-class classification task where each class corresponds to a unique speaker, using 80% of si284 for training, another 10% for validation, and reporting classification error rate on the remaining 10%. We note that speaker classification is not a typical task for WSJ, and only serves as a sanity check for the presence of speaker information (and its potentially correlated channel information) [1, 3]. The classifier for both tasks is a linear logistic regression that takes the features extracted from the self-supervised models as input. For speaker classification, the features from the same utterance are averaged before being fed to the classifier. All self-supervised models are kept frozen when training the linear classifier. All reported numbers are averaged over 5 runs; the variances are negligibly small and are not reported.
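Two small pieces of this probing setup are easy to state in code: averaging frame-level features into one utterance-level vector for speaker classification, and scoring predictions by error rate. The sketch below is illustrative only; the helper names are hypothetical, and reporting the error rate as a percentage is an assumption about how the numbers in this paper are scaled.

```python
from typing import List, Sequence


def mean_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level feature vectors into one utterance-level vector."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / num_frames
            for d in range(dim)]


def error_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Percentage of items whose predicted label differs from the reference."""
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return 100.0 * wrong / len(reference)


utterance = [[1.0, 2.0], [3.0, 4.0]]        # two frames, 2-dimensional features
print(mean_pool(utterance))                 # [2.0, 3.0]
print(error_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 25.0
```

For phonetic classification the items are frames, whereas for speaker classification each item is a whole utterance represented by its pooled vector.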
[Table 1: training loss and phone error rate for each VQ configuration.]
3.2 Preliminary VQ experiments
We first explore several potential places to insert VQ layers. For all VQ-APC variants in our experiments, the autoregressive model $g_{\mathrm{AR}}$ is a 3-layer unidirectional GRU with 512 hidden units, and the number of steps ahead for the target frame, $n$, is set to 5 when training (Equation 1) on LibriSpeech. Whenever a VQ layer is added, the embedding dimension of each code is 512, and the temperature $\tau$ of the Gumbel-Softmax straight-through estimator (Equation 2) is fixed to 0.1 throughout training.
Table 1 presents the phonetic classification results of adding VQ layers at different layers of $g_{\mathrm{AR}}$. In the “VQ config.” column, the numbers inside the parentheses denote the layers after which a VQ layer is inserted. For example, a configuration listing a single layer means that a VQ layer is added only after that layer, and an empty configuration denotes the case where no VQ layer is applied, which is equivalent to the regular APC. The codebook size here is 128. We try both the hidden vectors and their quantized codes (when applicable) as extracted features. We also include the final VQ-APC training loss on the LibriSpeech 360-hour subset after 100 epochs (not the downstream linear classifier’s training loss on WSJ).
Quantizing one layer. As indicated by the training loss, the bottleneck imposed by the VQ layer indeed handicaps the models’ ability to predict the future: all three single-layer VQ configurations have higher training loss than the regular APC. In terms of phone error rate, regardless of where VQ is inserted, we see improvement over the APC representations. Inserting VQ at the third layer leads to the most improvement, from 33.3 to 30.5. The quantized codes, when applicable, can also be used as extracted features, and they perform similarly to their corresponding pre-quantized representations. For example, with VQ at the third layer, the quantized codes (30.8) are close to the pre-quantized hidden vectors (30.5).
Quantizing multiple layers. We find that our VQ-APC models with multiple VQ layers have trouble fitting the training set. Their representations are also much worse than the regular APC on phonetic classification. One potential remedy is to enable VQ with a schedule, but this is beyond the scope of the paper.
3.3 The constituents of the learned representations
Experiments so far suggest that the phonetic information is still present (if not more accessible) after using VQ. For the rest of the paper, we focus on the case where VQ is inserted at the third layer. To study the constituents of the learned representations, we train a series of increasingly limited VQ-APC models by decreasing the codebook size from 2048 to 64 while fixing the code dimension to 512. As the codebook size becomes smaller, the model is forced to choose what information to encode and what to discard, thus revealing the constituents of the learned representations. We show the training losses of these models at convergence and the respective phone and speaker error rates in Figure 2. The dashed lines are the training loss, phone error rate, and speaker error rate of a regular APC model.
First, the training loss (purple curve) increases as expected, showing a worse fit on the training set as we limit the codebook size. Note that in theory, when the codebook size goes to infinity, we recover the regular APC. The phone error rate (red curve) reaches a minimum at codebook size 512 and worsens with smaller codebook sizes. This sharp degradation in phone error rate suggests that the model discards certain phonetic information in order to minimize the self-supervised objective.
The speaker error rate (blue curve), on the contrary, does not change much as we limit the codebook size. This shows that the speaker information (and its potentially correlated channel information) is mostly retained. Given the sharp degradation in phone error rate, we can conclude that the model prefers to retain speaker information over phonetic information in order to predict future frames as well as possible. This preference can potentially stem from the use of GRUs, the VQ configuration, and the self-supervised future prediction objective. More analyses are needed to disentangle these causes.
On the other hand, when the codebook size becomes large, the model falls back to regular APC and might suffer from overfitting, paying unnecessary attention to spectral details that do not generalize for predicting future frames. Finally, even with a codebook size of 64, we still see gains over regular APC, showing the strong performance of VQ-APC in representation learning.
3.4 Visualizations of learned codes
To better measure the correspondence between the learned VQ codes and English phones, we compute co-occurrence statistics (at the frame level) across the 360-hour subset of LibriSpeech, the dataset we use to train the VQ-APC models. We compare three settings, in which VQ is inserted after the first, second, or third layer, each with a codebook size of 128. The conditional probabilities $P(\text{phone} \mid \text{code})$, as shown in Figure 3, are estimated from the co-occurrence statistics, i.e., via maximum likelihood. In each sub-figure, the rows and columns are ordered via spectral co-clustering with 15 clusters to group together phones that share similar sets of codes, and a diagonal segment would imply a high correspondence between phones and codes. Note that the phone labels of LibriSpeech are only used for analysis and never seen during training.
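The maximum-likelihood estimate mentioned above amounts to normalizing co-occurrence counts. The following is a minimal sketch, assuming the quantity of interest is the probability of a phone given a code and that the input is a list of frame-level (phone, code) pairs; the function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def phone_given_code(pairs: Iterable[Tuple[Hashable, Hashable]]
                     ) -> Dict[Hashable, Dict[Hashable, float]]:
    """Estimate P(phone | code) from frame-level (phone, code) co-occurrences."""
    joint = Counter(pairs)                 # count of each (phone, code) pair
    code_totals: Counter = Counter()
    for (_, code), count in joint.items():
        code_totals[code] += count         # how often each code occurs overall
    table: Dict[Hashable, Dict[Hashable, float]] = {}
    for (phone, code), count in joint.items():
        table.setdefault(code, {})[phone] = count / code_totals[code]
    return table


# Toy alignment: code 0 mostly fires on "aa", code 1 only on "iy".
toy_pairs = [("aa", 0), ("aa", 0), ("iy", 0), ("iy", 1)]
table = phone_given_code(toy_pairs)
```

Each row of the resulting table sums to one, so a code that corresponds cleanly to one phone shows up as a row with a single dominant entry.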
From Figure 3, we see that the correspondence between phones and VQ codes is stronger when quantized at higher layers, and is especially strong when VQ is inserted at the third layer. Recall that probing tasks are useful for showing the presence of certain information, but have difficulty showing the absence of it. In contrast, given the co-occurrence statistics, mutual information can be estimated to support the absence of information. The normalized mutual information values are 0.167, 0.285, and 0.406 for VQ at the first, second, and third layer, respectively. In other words, not only can we conclude that the representations learned with VQ at the third layer contain phonetic information, we can also readily conclude that those learned with VQ at the first and second layers contain much less information for certain phones.
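Normalized mutual information can be estimated directly from the same frame-level co-occurrence counts. The sketch below is one possible estimator; the normalization by the geometric mean of the two entropies is an assumption, since the text does not state which normalization was used, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def normalized_mutual_information(pairs: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """NMI between phones and codes; normalized by sqrt(H(phone) * H(code))."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    phones = Counter(phone for phone, _ in pairs)
    codes = Counter(code for _, code in pairs)
    mutual_info = 0.0
    for (phone, code), count in joint.items():
        p_joint = count / n
        p_indep = (phones[phone] / n) * (codes[code] / n)
        mutual_info += p_joint * math.log(p_joint / p_indep)
    h_phone = -sum((c / n) * math.log(c / n) for c in phones.values())
    h_code = -sum((c / n) * math.log(c / n) for c in codes.values())
    denom = math.sqrt(h_phone * h_code)
    return mutual_info / denom if denom > 0 else 0.0


perfect = [("aa", 0), ("iy", 1)] * 5                      # one code per phone
unrelated = [("aa", 0), ("aa", 1), ("iy", 0), ("iy", 1)]  # codes ignore phones
```

A value near 1 means the codes determine the phones almost completely, while a value near 0 means the codes carry essentially no phonetic information, which is what makes the measure useful for arguing absence.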
3.5 A comparison with other self-supervised models
Finally, we compare VQ-APC with other self-supervised speech representation models, including Contrastive Predictive Coding (CPC) [1], Bidirectional Masked Reconstruction [9], Mockingjay [3], and Multi-Target APC [24]. We briefly review these methods below, and show their results on phonetic and speaker classification in Table 2. To stay as close to the original implementations as possible, we do not separate the effect of the model architecture, such as the use of RNNs or Transformers, from that of the self-supervised objective.
CPC and APC share a similar methodology, as both use an autoregressive model to learn representations by conditioning on past frames to predict information about a future frame. Their difference is that while APC tries to directly predict the future frame via regression, CPC aims to learn representations containing information that most discriminates the future frame from a set of negative samples. We mainly follow the original paper [1] for implementing CPC, with some modifications meant to minimize the architectural differences between APC and CPC while maintaining their training objectives.
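The contrast between the two objectives can be illustrated on a single prediction. The sketch below compares an L1 regression loss against a softmax-based contrastive loss over one positive and several negative frames. It is a simplification: the dot-product score and the function names are assumptions for illustration, and the original CPC formulation uses a learned scoring function.

```python
import math
from typing import Sequence


def l1_regression_loss(prediction: Sequence[float], target: Sequence[float]) -> float:
    """APC-style loss: distance between the prediction and the true future frame."""
    return sum(abs(p - t) for p, t in zip(prediction, target))


def contrastive_loss(prediction: Sequence[float], positive: Sequence[float],
                     negatives: Sequence[Sequence[float]]) -> float:
    """CPC-style loss: negative log-probability of picking the true future frame."""
    def score(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))  # dot product as a stand-in score

    scores = [score(prediction, positive)] + [score(prediction, neg) for neg in negatives]
    top = max(scores)  # log-sum-exp with the max subtracted for stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[0]


pred = [1.0, 0.0]
true_future = [1.0, 0.0]
distractors = [[0.0, 1.0], [-1.0, 0.0]]
reg = l1_regression_loss(pred, true_future)
con = contrastive_loss(pred, true_future, distractors)
```

The regression loss is zero only when the prediction matches the target frame exactly, whereas the contrastive loss only requires the true frame to score higher than the distractors.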
Multi-Target APC (MT-APC) is an extension of APC. It incorporates an auxiliary objective serving as a regularizer to improve the generalization of the main future prediction task. The exact same setup described in [24] is used in our experiments.
Different from CPC and APC, which are based on the idea of future prediction, Bidirectional Masked Reconstruction (BMR) and Mockingjay fall under the category of masked prediction. Inspired by the masked language modeling technique from BERT [13], BMR and Mockingjay mask parts of the input signals, and predict them by conditioning on both past and future contexts with a bidirectional RNN and a Transformer encoder [25], respectively. For experiments, we mainly follow the implementations in the original papers [9, 3], except that the number of layers is reduced to match ours to minimize the architectural differences.
|Method||Architecture||Phone error rate||Speaker error rate|
|log Mel + MLP-1||-||43.1||12.3|
|log Mel + MLP-3||-||41.2||11.9|
|CPC [1]||3-layer uni-GRU||34.1||9.7|
|APC [2]||3-layer uni-GRU||33.3||8.5|
|MT-APC [24]||3-layer uni-GRU||30.5||7.3|
|VQ-APC (ours)||3-layer uni-GRU||28.4||5.5|
|BMR [9]||3-layer bi-GRU||32.4||6.2|
|Mockingjay [3]||3-layer Transformer||30.8||5.1|
Table 2: Phonetic and speaker classification results of different self-supervised speech representation models. All features are fed to a linear logistic regression. For log Mel, we also include the results of using a 1- and 3-layer multi-layer perceptron, denoted as MLP-1 and MLP-3, respectively. We also note the neural architectures used by each model.
On phonetic classification, we see that VQ-APC (28.4) improves over APC (33.3) and MT-APC (30.5), demonstrating the effectiveness of VQ layers. It also outperforms other self-supervised models despite using the same (vs. CPC) or smaller (vs. BMR and Mockingjay) network. On speaker classification, VQ-APC (5.5) again improves over the other two APC models (8.5 and 7.3), and is on par with the best model (Mockingjay, 5.1). On both tasks, all self-supervised models outperform log Mel regardless of the type of classifier applied to the log Mel features.
We have demonstrated that incorporating vector quantization (VQ) layers into an Autoregressive Predictive Coding model imposes a bottleneck, forcing the model to learn better representations. Extensive experiments have been conducted to compare different VQ configurations, to study the effect of varying codebook sizes, and to compare with other self-supervised speech representation models. We show evidence for the presence and absence of phonetic and speaker information in the learned representations, and also show the model’s preference in retaining information when the model capacity is limited, in the hope of bridging the self-supervised objective and the properties of the learned representations. When the phonetic information is present, the learned VQ codes also correspond well with English phones.
- [1] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [2] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” in Interspeech, 2019.
- [3] A. Liu, S.-W. Yang, P.-H. Chi, P.-C. Hsu, and H.-Y. Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional Transformer encoders,” in ICASSP, 2020.
- [4] J. Chorowski, R. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” TASLP, vol. 27, no. 12, pp. 2041–2053, 2019.
- [5] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” in Interspeech, 2019.
- [6] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, “Learning problem-agnostic speech representations from multiple self-supervised tasks,” in Interspeech, 2019.
- [7] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” in ICLR, 2020.
- [8] A. Baevski and A. Mohamed, “Effectiveness of self-supervised pre-training for speech recognition,” in ICASSP, 2020.
- [9] W. Wang, Q. Tang, and K. Livescu, “Unsupervised pre-training of bidirectional speech encoders via masked reconstruction,” in ICASSP, 2020.
- [10] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, “Deep contextualized acoustic representations for semi-supervised speech recognition,” in ICASSP, 2020.
- [11] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in NAACL-HLT, 2018.
- [12] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” OpenAI, Tech. Rep., 2018.
- [13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in NAACL-HLT, 2019.
- [14] Y.-A. Chung and J. Glass, “Generative pre-training for speech with autoregressive predictive coding,” in ICASSP, 2020.
- [15] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, “Multi-task self-supervised learning for robust speech recognition,” in ICASSP, 2020.
- [16] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in NIPS, 2017.
- [17] D. Harwath, W.-N. Hsu, and J. Glass, “Learning hierarchical discrete linguistic units from visually-grounded speech,” in ICLR, 2020.
- [18] A. Liu, T. Tu, H.-Y. Lee, and L.-S. Lee, “Towards unsupervised speech recognition and synthesis with quantized speech representation learning,” in ICASSP, 2020.
- [19] P. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer, “Generating Wikipedia by summarizing long sequences,” in ICLR, 2018.
- [20] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-Softmax,” in ICLR, 2017.
- [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in ICASSP, 2015.
- [22] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
- [23] D. Paul and J. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Speech and Natural Language Workshop, 1992.
- [24] Y.-A. Chung and J. Glass, “Improved speech representations with multi-target autoregressive predictive coding,” in ACL, 2020.
- [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.