In such techniques, the unlabelled data is used to design an input and a corresponding target output, without any manual annotations. The learned representations are then used as input to a supervised model for a downstream task, and the representation model is often fine-tuned as well. The expected outcome is either improved downstream task performance or a reduction in the amount of labelled data required for training.
For the speech domain, various SSL techniques have recently been shown to improve downstream task performance [33, 36, 41, 10, 46, 27, 20, 3, 26, 47]. Although new and improved approaches are being proposed at a rapid rate, the pre-trained representations themselves are not well-understood, leaving the development and application of SSL models as a time- and resource-consuming process of trial and error.
We seek to fill this gap by analyzing pre-trained models to understand how the representations evolve across layers, how they relate to a range of linguistic properties, and how they change when fine-tuned for a downstream task. We are especially interested in developing tools to study representations directly, rather than training additional classifiers as probes, to avoid the overhead and unclear dependence on design decisions involved in training classifiers. In this work, we focus our analysis on the open-source wav2vec 2.0 (W2V2) models, which have been successful for speech recognition [19, 11, 2] and translation.
The main findings of this work are: (i) the W2V2 transformer layers follow an autoencoder-style behavior, where as we go deeper into the model, the representation starts deviating from the input speech features followed by a reverse trend where even deeper layers become more similar to the input, as if reconstructing the input; (ii) the layer-wise evolution of the representations follows an acoustic-linguistic hierarchy, with the shallowest layers encoding acoustic features, followed by phonetic, word identity, and word meaning information, in that order (and then followed by a reverse trend as described above), as illustrated in Fig. 1; (iii) fine-tuning the model for ASR breaks the autoencoder-style behavior in the final few layers, which accordingly also get better at encoding word identity; (iv) the final convolutional (CNN) layers and initial transformer layers are highly correlated with mel spectrogram features, suggesting that the model learns to extract features similar to human-engineered ones; (v) the model seems to encode some word meaning information; (vi) the last two layers often defy the trends observed for other layers; (vii) a modified fine-tuning protocol for ASR, designed based on these findings, improves performance in a low-resource setting.
2 Related work
There has been extensive work on analyzing supervised speech models [4, 34], but research on analyzing SSL models has been much more limited. Some very recent work has explored the phonetic content in SSL models using a classifier probe [20, 2] and the relationships between models with different training objectives and architectures. We study a broader range of linguistic content, and also study lightweight methods that do not require training classifiers. The 2021 Zero Resource Speech Benchmark introduces zero-shot analysis datasets and metrics to evaluate the ability of SSL speech representations to encode different levels of linguistic information. While we share much of the motivation of that benchmark, our approach focuses on layer-wise analysis without relying on custom downstream tasks.
Layer-wise analysis of linguistic structure has been done before for visually grounded speech and SSL text models. Our methods are closest to Voita et al.’s work on layer-wise analysis of text representations with canonical correlation analysis (CCA) and discrete mutual information (MI) estimates, but we apply them to the continuous domain of speech (as opposed to discrete text tokens), analyze the relationship between representations and both discrete and continuous labels, and analyze the relationship between pre-trained and fine-tuned models. To our knowledge, this is the first work on layer-wise analysis of a pre-trained speech representation model to assess a range of linguistic properties.
3 Analysis Methods
Fig. 2 sketches the W2V2 model structure and the representations used in many of our analyses.
Canonical Correlation Analysis. CCA is a statistical technique that measures the relationship between two continuous-valued random vectors as represented by the maximum correlations between their linear combinations. CCA has been previously used as a measure of similarity to compare representations within and across neural network models [39, 25, 44]. Here we use it in the same way, and also to measure the similarity between a layer representation and another vector, such as word embeddings or acoustic features.
CCA takes $n$ pairs of vectors $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, sampled from the random vectors (or “views”) $X \in \mathbb{R}^{d_1}$ and $Y \in \mathbb{R}^{d_2}$, as input and returns a correlation score as a measure of similarity between the two views. The solution can be defined iteratively as follows. First we define the directions of maximum correlation between linear projections of $X$ and $Y$: $(v_1, w_1) = \arg\max_{v, w} \mathrm{corr}(v^\top X, w^\top Y)$. The subsequent directions $(v_i, w_i)$, $i = 2, \ldots, \min(d_1, d_2)$, maximize the same correlation subject to each new projection being uncorrelated with the others in the same view.
In standard CCA the canonical correlation is the sum (or mean) of the correlations $\rho_i = \mathrm{corr}(v_i^\top X, w_i^\top Y)$. We use a variant, projection-weighted CCA (PWCCA), which computes a weighted mean of the $\rho_i$s, with higher weights for directions accounting for a higher proportion of the input. PWCCA has been found to be more robust to spurious correlations in the data. Since PWCCA is asymmetric, we report the mean of the two quantities $\mathrm{PWCCA}(X, Y)$ and $\mathrm{PWCCA}(Y, X)$. Henceforth we refer to this average as the “CCA similarity”, and it has a maximum value of 1.
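To make the CCA similarity concrete, the following is a minimal NumPy sketch and not the implementation used for the reported experiments. It assumes that each view is given as a matrix whose rows are paired samples, that there are more samples than dimensions, and that both matrices have full column rank. The weighting follows one common formulation of PWCCA, in which each canonical direction is weighted by the summed absolute inner products between its canonical variate and the original dimensions of the first view. The function names `pwcca_one_way` and `cca_similarity` are hypothetical.

```python
import numpy as np

def pwcca_one_way(X, Y):
    """Projection-weighted CCA of X with respect to Y (asymmetric).

    X: (n, d1) array, Y: (n, d2) array; rows are paired samples.
    Assumes n > max(d1, d2) and full column rank (illustrative sketch).
    """
    # Center both views so that inner products correspond to covariances.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two views.
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations rho_i.
    U, rho, _ = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    rho = np.clip(rho, 0.0, 1.0)
    # Canonical variates of X, one column per canonical direction.
    H = Qx @ U
    # Weight each direction by how much of X it accounts for.
    alpha = np.abs(H.T @ X).sum(axis=1)
    return float((alpha * rho).sum() / alpha.sum())

def cca_similarity(X, Y):
    """Mean of the two asymmetric PWCCA scores; at most 1."""
    return 0.5 * (pwcca_one_way(X, Y) + pwcca_one_way(Y, X))
```

Under these assumptions, a view compared with an invertible linear transform of itself scores close to 1, while two independent random views score much lower.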
As illustrated in Fig. 2, we use PWCCA to measure similarity between the W2V2 layer representations and various continuous-valued quantities of interest, either (i) from a different layer of the same model (CCA-intra), (ii) from a fine-tuned version of the model (CCA-inter), or (iii) from an external representation. For the third type of analysis we use mel filter bank features (CCA-mel), acoustically grounded word embeddings (CCA-agwe), and GloVe word embeddings (CCA-glove) as ways to assess the local acoustic, word-level acoustic-phonetic, and word meaning information encoded in the W2V2 representations, respectively. (Footnote 1: AGWEs are trained to be close to acoustic embeddings of the corresponding words, so we expect that they encode mainly acoustic-phonetic information.)
Mutual information. While CCA is natural for relating continuous-valued vectors, we use mutual information (MI) to measure the dependence between the learned phone-level or word-level representations (Fig. 2) and the corresponding phone (MI-phone) or word (MI-word) labels. We cluster the continuous-valued representation vectors to obtain discrete cluster IDs, following prior work. We then estimate MI using the co-occurrence counts of the cluster IDs and the phone/word labels.
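The count-based estimate can be sketched as a plug-in estimator over paired cluster IDs and labels. The function name `mutual_information` and the choice of bits (base-2 logarithm) as the unit are assumptions made for illustration; the text above does not specify a unit or any normalization.

```python
import math
from collections import Counter

def mutual_information(cluster_ids, labels):
    """Plug-in MI estimate (in bits) from paired discrete observations."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must be paired")
    n = len(labels)
    joint = Counter(zip(cluster_ids, labels))   # co-occurrence counts
    cluster_counts = Counter(cluster_ids)       # marginal counts of cluster IDs
    label_counts = Counter(labels)              # marginal counts of labels
    mi = 0.0
    for (c, l), n_cl in joint.items():
        # p(c, l) * log2( p(c, l) / (p(c) * p(l)) ), written with raw counts
        mi += (n_cl / n) * math.log2(n_cl * n / (cluster_counts[c] * label_counts[l]))
    return mi
```

For example, if every cluster ID maps to exactly one of two equally frequent labels, the estimate is 1 bit; if cluster IDs and labels are independent in the sample, it is 0.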
Word discrimination. Word discrimination (word-disc) is the task of detecting whether two speech segments correspond to the same or different words, and is commonly used to evaluate acoustic word embeddings and other acoustic representations [24, 21, 1, 22]. We follow a typical evaluation protocol, where we label a pair of segments as “same word” if the cosine similarity between their word-level representations is above some threshold, and measure performance via the average precision as the threshold is varied. We use this task-specific measure primarily to corroborate our findings from MI-word.
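The evaluation protocol described above can be sketched as follows. This is an illustrative version that scores every pair of segments and computes average precision by ranking pairs by cosine similarity, which is equivalent to sweeping the threshold over all observed similarity values. The function names are hypothetical, and tied similarities are broken arbitrarily by the sort order.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_disc_average_precision(embeddings, words):
    """Average precision for same/different word discrimination.

    embeddings: list of word-level vectors; words: the matching word labels.
    """
    # Score every unordered pair and record whether it is a same-word pair.
    scored = [
        (cosine(embeddings[i], embeddings[j]), words[i] == words[j])
        for i, j in combinations(range(len(words)), 2)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, same) in enumerate(scored, start=1):
        if same:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Scoring all pairs is quadratic in the number of segments, which is manageable for evaluation sets of a few thousand segments.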
Word similarity tasks. As another measure of the extent to which the model encodes word meaning, we use it to perform a suite of 11 standard word similarity tasks (word-sim). (Footnote 2: https://github.com/vecto-ai/word-benchmarks) We extract context-independent word embeddings from W2V2 representations as described in Sec. 4.2. The semantic similarity score for each word pair is measured as the cosine similarity between these extracted embeddings. Performance is measured as the Spearman’s correlation between these scores and the human similarity judgements.
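As a rough sketch of this scoring procedure, the code below computes cosine similarities for a list of word pairs and correlates them with human judgements using Spearman's rank correlation with average ranks for ties. The data layout (a dict from word to vector and a list of `(word_a, word_b, human_score)` triples) and all function names are assumptions made for illustration; the benchmark files themselves are not parsed here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def word_sim_score(embeddings, pairs):
    """pairs: iterable of (word_a, word_b, human_score) triples."""
    model = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
    human = [score for _, _, score in pairs]
    return spearman(model, human)
```

Only the rank order of the model scores matters for this metric, so any monotonic rescaling of the cosine similarities leaves the result unchanged.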
4 Experimental Setup
4.1 Representation Learning Model
The W2V2 model  maps raw waveforms to higher-level contextual features via a set of convolutional layers followed by self-attention (transformer) layers, as shown in Fig. 2, and is trained with a contrastive objective that measures the ability of the model to differentiate between a true masked input segment and a set of distractors. The self-attention layers in the transformer allow the model to encode information from the context surrounding a given masked segment.
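For reference, the contrastive term of this objective can be written as below, following the formulation in the wav2vec 2.0 paper. The symbols are taken from that paper rather than from this one: $c_t$ is the transformer output at masked time step $t$, $q_t$ is the true (quantized) target for that step, $Q_t$ is the candidate set containing $q_t$ and the distractors, $\mathrm{sim}$ is cosine similarity, and $\kappa$ is a temperature. The full training loss in that paper also includes a codebook diversity term, which is omitted here.

```latex
\mathcal{L}_m = -\log
  \frac{\exp\left(\mathrm{sim}(c_t, q_t) / \kappa\right)}
       {\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(c_t, \tilde{q}) / \kappa\right)},
\qquad
\mathrm{sim}(a, b) = \frac{a^\top b}{\lVert a \rVert \, \lVert b \rVert}
```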
We analyze representations extracted from three pre-trained W2V2 variants: (i) Base has 12 layers and is trained on 960 hrs of LibriSpeech, (ii) Large-960 has 24 layers and is trained on 960 hrs of LibriSpeech, (iii) Large-60k has 24 layers and is trained on 60k hrs of LibriVox. Additionally, we use representations from the models fine-tuned with 10 minutes (ft-10m), 100 hours (ft-100h), and 960 hours (ft-960h) of supervised data. (Footnote 3: All models are downloaded from the wav2vec 2.0 repository: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec)
4.2 Setup Details
We perform all experiments on LibriSpeech. The sampled utterances (details in Tab. 1) are passed through each W2V2 model, and the outputs from all layers are extracted. Random masking is turned off except for experiments analyzing the effect of masking (Sec. 5.5).
| Experiment | # labels | # representation examples |
| | 2.7k words | 4.8k word segments |
| MI-phone | 39 phones | train: 187k phone segments |
| | | dev: 7.6k phone segments |
| MI-word | 500 words | train: 427k word segments |
| | | dev: 6.9k word segments |
| word-disc | 300 words | 2.4k word segments |
Representation extraction: We use LibriSpeech alignments generated using the Montreal forced aligner [29, 28] to define phone and word segments. As illustrated in Fig. 2, word-level representations are obtained by averaging the frame representations of all frames in a given word segment. Phone-level representations are obtained by averaging the frame representations of the central third of each phone segment; the first and last third are discarded to reduce co-articulation effects. These segment representations are used for all experiments in Tab. 1 except the first row. The context-independent embedding (used for the word-sim experiments) for each word is computed by averaging the representations across all the instances of that word in train-clean. (Footnote 4: We also tried weighted mean pooling, using averaged attention weights from all the attention heads, which produced similar results.)
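The pooling steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions: segment boundaries are already converted to frame indices with an exclusive end, and for segments shorter than three frames the "central third" falls back to the whole segment, a choice the text does not specify. All function names are hypothetical.

```python
import numpy as np

def word_representation(frames, start, end):
    """Mean of all frame vectors in the word segment [start, end)."""
    return frames[start:end].mean(axis=0)

def phone_representation(frames, start, end):
    """Mean of the central third of the phone segment [start, end).

    The first and last thirds are dropped to reduce co-articulation effects.
    For segments shorter than 3 frames, `third` is 0 and all frames are kept.
    """
    third = (end - start) // 3
    return frames[start + third:end - third].mean(axis=0)

def context_independent_embeddings(segment_vectors, words):
    """Average the segment vectors over all instances of each word."""
    grouped = {}
    for vec, word in zip(segment_vectors, words):
        grouped.setdefault(word, []).append(vec)
    return {word: np.mean(vecs, axis=0) for word, vecs in grouped.items()}
```

Here `frames` is a `(num_frames, dim)` array holding the outputs of a single layer for one utterance, so the same functions can be applied layer by layer.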
Mel filter bank features: 80-dimensional mel filter bank (fbank) features are extracted using a frame length of 25 ms and an overlap of 10 ms. In order to make the W2V2 representations comparable to the fbank features, we compute moving averages of CNN features or downsample fbank features as needed to ensure that their strides and receptive fields match.
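The two alignment operations mentioned above can be sketched as follows. The window size and downsampling factor needed in practice depend on the stride and receptive field of the specific CNN layer, which are not given here, so the parameters below are placeholders, and the function names are hypothetical.

```python
import numpy as np

def moving_average(features, window):
    """Average every run of `window` consecutive frames (stride 1).

    features: (num_frames, dim) array; returns (num_frames - window + 1, dim).
    """
    padded = np.vstack([np.zeros((1, features.shape[1])), features])
    csum = np.cumsum(padded, axis=0)
    return (csum[window:] - csum[:-window]) / window

def downsample(features, factor, offset=0):
    """Keep every `factor`-th frame, starting at frame `offset`."""
    return features[offset::factor]

def truncate_to_common_length(a, b):
    """Trim two frame sequences to the same number of frames for pairing."""
    n = min(len(a), len(b))
    return a[:n], b[:n]
```

After alignment, the paired frames from the two sequences can be passed directly to a CCA similarity computation.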
Discrete cluster IDs: For MI experiments, we discretize the continuous-valued W2V2 representations by clustering. Specifically, we cluster a set of phone/word-level representations sampled from the train-clean LibriSpeech split, with roughly the same number of examples of each label, using mini-batch k-means. (Footnote 5: Similar trends are obtained when the chosen instances are uniformly sampled from the data instead.) Then we assign each example in the development set to the nearest cluster. We use 500 clusters for MI-phone and 5000 clusters for MI-word.
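A minimal NumPy sketch of this discretization step is given below. It implements a basic form of mini-batch k-means with a per-center running-mean update and then assigns held-out examples to the nearest center. The initialization (random training points), batch size, iteration count, and function names are assumptions made for illustration; an off-the-shelf implementation would normally be preferable for large data.

```python
import numpy as np

def assign_clusters(points, centers):
    """Return the index of the nearest center (Euclidean) for each point."""
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return sq_dists.argmin(axis=1)

def minibatch_kmeans(train, k, batch_size=64, n_iters=100, seed=0):
    """Fit k centers on `train` (num_examples, dim) with mini-batch updates."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct training points.
    centers = train[rng.choice(len(train), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iters):
        batch = train[rng.choice(len(train), size=batch_size)]
        nearest = assign_clusters(batch, centers)  # fixed for this batch
        for x, c in zip(batch, nearest):
            counts[c] += 1
            # Per-center learning rate 1/count gives a running mean.
            centers[c] += (x - centers[c]) / counts[c]
    return centers
```

The centers are fit on training-split representations, and `assign_clusters` is then applied to the development-set representations to obtain the cluster IDs used for MI estimation. The pairwise distance computation materializes a `(num_points, k, dim)` array, so large sets would need to be processed in chunks.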
We present results for experiments done on the dev-clean split on some of the W2V2 variants (base, base-ft-960, large-60k, large-60k-ft-960); the findings generalize to dev-other and to the large-960 model unless stated otherwise. We analyze pre-trained models in Sec. 5.1-5.3 and their fine-tuned counterparts in Sec. 5.4. Each plot below gives the mean of the relevant measure across the four sample sets; typical variation across the sets is for CCA measures, for MI, and less than 2% for word-disc. We refer to the output of transformer layer $i$ as the representation at layer $i$ and the output of the CNN feature encoder as layer .
5.1 How do the representations evolve across layers?
In Fig. 3 we compare (via CCA similarity) the transformer layer representations with the “local features” extracted by the CNN module. We see that the pre-trained model (solid black curve) follows an autoencoder-style behavior, where as we go deeper into the model, the representation starts deviating from the input features, followed by a reverse trend where even deeper layers become more similar to the input, as if reconstructing the input (although this trend seems to break for the last two layers; see Sec. 5.5). Since the training objective is to distinguish the masked input segment from distractors, it is natural for the final layers to have similar properties to the input. A similar behavior, referred to as context encoding and reconstruction, has been previously observed for the BERT text model, where the objective is based on masked reconstruction rather than contrastive prediction.
5.2 Where is acoustic/linguistic information encoded?
Next we consider how certain properties are encoded in different layers. As a reminder, all our experiments are performed on features extracted locally from a short span of frames (frame/phone/word-level). Any increase in “information” across layers for these local representations is possible due to contextualization from the self-attention layers that enable each frame-level output to access the whole utterance. For the same reason, any decrease in “information” across layers could be attributed to de-localization, i.e. the information is no longer localized to the frame/phone/word segment.
5.2.1 Frame-level acoustic content
Fig. 4 shows the layer-wise CCA similarity between fbank vectors and Base model layers. For the first few layers the correlation increases with depth. The Large models follow a similar curve, with high correlation for layers C4-T2. We can infer that the model learns to compute features much like fbank, suggesting a potential simplification to W2V2 to take fbank as input (which we leave to future work).
5.2.2 Phonetic information
We measure the phonetic information encoded in the pre-trained model in two ways, MI-phone and CCA-agwe (see Sec. 3), both shown in Fig. 5. (Footnote 6: We use AGWEs trained on LibriSpeech in a manner similar to prior work.) We expect AGWEs to encode mostly phonetic information, and indeed the phone and AGWE curves in Fig. 5 follow broadly similar trends.
5.2.3 Word identity
Fig. 6 shows the mutual information between the layer representations and word labels. For Base, the trends are similar to those of MI-phone (Fig. 5a). For Large-60k (Fig. 6b), word identity appears to be encoded similarly well by layers 12 to 18, without the dip seen in the MI-phone curve.
5.2.4 Evidence for the most contextual layers
For the Large-60k model, the curves measuring acoustic-phonetic information (Figs. 3b, 5c, 5d) all have a dip around layers 13-17 (see also Fig. 1). These are also the same layers that seem to have the most word content (Fig. 6b). This suggests that around these layers, the model may be extracting the most contextual and high-level information, and retaining less lower-level information like phonetic content. Beyond these layers, the model enters the reconstruction phase, thus encoding more local representations at even deeper layers.
The Base model does not have the same significant intermediate drop for phonetic content (Figs. 1, 5a, 5b) as does Large-60k, which could indicate less contextualization. In additional experiments (not shown here) on Large-960, the MI-phone and CCA-agwe scores do not show this drop either, suggesting that this effect is the result of the larger training set of Large-60k, and not its larger model size.
5.3 Does the pre-trained model learn word meaning?
While some linguistic properties seem essential for the model to learn to solve the self-supervised task, it is not obvious that word meaning is one such property. We probe for word meaning in W2V2 by measuring the CCA similarity between word segment representations and GloVe embeddings, shown in Fig. 8. These plots (also a part of Fig. 1) provide further evidence that the central layers (7-8 for Base and 14-16 for Large-60k) encode the most contextual information. Note that these curves have a narrower plateau of maximum performance around these layers than the MI-word curves (Fig. 6), suggesting that the most contextual layers are better at encoding word meaning while the peripheral layers are good at encoding lower-level linguistic content but not meaning.
To further calibrate our measure of semantic information, we evaluate the W2V2 representations on standard word similarity benchmarks, as described in Sec. 3. Fig. 9 reports the performance of the best layers for Large-60k and Large-60k-ft-960. The best performance for both models occurs at layer 15, which again agrees with our hypothesis that layers 14-16 contain the most semantic information.
We also present two baselines: (i) the naive baseline defines word distance as the character edit distance for each word pair; this baseline has non-trivial performance when orthography is a helpful clue; (ii) the AGWE baseline uses AGWEs in place of the W2V2 representations, and may succeed for word pairs where acoustic-phonetic similarity correlates with semantic similarity. We also compare with two models that are trained (on LibriSpeech) specifically to encode semantics: (i) Speech2Vec learns word embeddings from speech using an approach similar to word2vec, and (ii) GloVe embeddings are popular written word embeddings trained on text. Since W2V2 is not trained with an explicit semantic criterion, it is not surprising that it is outperformed by Speech2Vec and GloVe. It is interesting, however, that W2V2 representations perform better than the non-semantic baselines, suggesting that some meaning is being encoded.
5.4 How does fine-tuning affect the above observations?
We see in CCA-intra (Fig. 3) that fine-tuning breaks the autoencoder-style behavior. After fine-tuning for ASR, the deeper layers that were originally reconstructing the input are now diverging from the input, and presumably learning more task-specific information. We also see from Fig. 10 that the higher layers change the most in fine-tuning, suggesting that the pre-trained model does not serve as a good initialization of these top layers for ASR. This finding suggests re-initializing these layers before fine-tuning, as has recently been found for BERT. We design a fine-tuning experiment to validate this idea, described in Sec. 6.
The MI with word identity consistently improves across the top layers (19-24) after fine-tuning (Fig. 6). The same does not always hold for phone identity (Fig. 5). These results indicate that, as might be expected, fine-tuning with character-level CTC loss is more directly related to the word identity than it is to phone identity. Additionally, for the semantic measures of CCA-glove and word-sim, we see some improvement after fine-tuning but not the same large improvements in the final few layers as for MI-word, again as may be expected since ASR does not necessarily require high-quality word meaning representations.
5.5 What about those peculiar last two layers?
We see a peculiar pattern in most of the CCA similarity curves for pre-trained W2V2 models, where at least one of the last two layers fails to follow the trend of the previous layers. We find that this peculiarity disappears when we turn random masking on and consider only the representations of the masked segments. Moreover, the phonetic and word content, as measured by MI, improves for the last two layers (while decreasing for the rest) when working with the representations of masked segments. This finding suggests that the representations of the final two layers are more meaningful when the input segment is masked. Furthermore, this discrepancy is not present in the fine-tuned models, suggesting that this effect is connected to the training objective, but the exact relationship is unclear. We also note that this peculiarity has been observed for local representations extracted from BERT.
6 Practical Implications for ASR
We have noted that the last few layers of W2V2 change the most during fine-tuning (Fig. 10), and that the linguistic content that should be helpful for ASR is less well represented in the final few layers (Figs. 5a, 6a). Based on these observations, we hypothesize that some of these final layers do not provide a good initialization for fine-tuning and would benefit from being re-initialized before fine-tuning. To test this hypothesis, we experiment with fine-tuning W2V2-base for ASR. We retain all of the CNN layers and the first $N$ transformer layers, and re-initialize the top 12-$N$ layers before fine-tuning all the transformer layers with character-level connectionist temporal classification (CTC) loss. We conduct all ASR experiments using the SpeechBrain toolkit. We find that re-initializing the last 1-3 layers indeed outperforms the standard approach of directly fine-tuning the whole model (Tab. 2), with large improvements when fine-tuning on the 10-minute training set and minor improvements for larger training sets.
| train set | re-init last 12-$N$ layers | standard fine-tuning |
We have presented a set of analyses to assess the layer-specific information in pre-trained speech representations, applied to wav2vec 2.0 models. We find that various acoustic and linguistic properties tend to be encoded in different layers, and the pre-trained model follows an autoencoder-style behavior. We also find that the model encodes some non-trivial word meaning information, although more work is needed to determine the nature of the semantic content. We corroborate most of our findings with multiple analytical measures and certain downstream tasks. Such analyses can help understand the abilities and limitations of models trained without external supervision, and also help direct research toward additional useful modifications. For example, some of our findings have motivated a modification to the fine-tuning protocol, which leads to improved downstream ASR performance in the very low-resource setting.
Our analyses focus on representations extracted locally (over a frame/phone/word), so they do not measure the information delocalization that may be happening as a result of the self-attention layers. We leave in-depth analysis of self-attention to future work. Additional future directions include applying the same analytical tools to additional models with different architectures or training objectives, and further studying the implications for additional downstream tasks.
We thank Shane Settle for providing the AGWEs trained on LibriSpeech, and David Yunis, Puyuan Peng, and Shubham Toshniwal for their help with preliminary experiments and ideation. This research was funded by NSF award IIS-1816627, by Air Force Office of Scientific Research award FA9550-18-1-0166, and by an AWS Machine Learning Research Award.
- (2020) Evaluating the reliability of acoustic speech embeddings. In Interspeech.
- (2021) Unsupervised speech recognition. arXiv preprint arXiv:2105.11084.
- (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. In NeurIPS.
- (2019) Analysis methods in neural language processing: A survey. TACL.
- (2011) Rapid evaluation of speech representations for spoken term discovery. In Interspeech.
- (2020) A simple framework for contrastive learning of visual representations. In ICML.
- (2017) Representations of language in a model of visually grounded speech signal. In ACL.
- (2021) Similarity analysis of self-supervised speech representations. In ICASSP.
- (2018) Speech2vec: a sequence-to-sequence framework for learning word embeddings from speech. In Interspeech.
- (2020) Generative pre-training for speech with autoregressive predictive coding. In ICASSP.
- (2020) Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
- (2015) Unsupervised visual representation learning by context prediction. In CVPR.
- How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In EMNLP.
- (2014) Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML.
- (2020) Momentum contrast for unsupervised visual representation learning. In CVPR.
- (1936) Relations between two sets of variates. Biometrika 28 (3/4), pp. 321–377.
- (2021) Robust wav2vec 2.0: analyzing domain shift in self-supervised pre-training. arXiv preprint arXiv:2104.01027.
- (2021) HuBERT: how much can a bad teacher benefit ASR pre-training?. In ICASSP.
- (2020) Multilingual jointly trained acoustic and written word embeddings. In Interspeech.
- (2021) Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation. In SLT.
- (2020) Libri-light: a benchmark for ASR with limited or no supervision. In ICASSP.
- (2015) Unsupervised neural network based feature extraction using weak top-down constraints. In ICASSP.
- (2019) Similarity of neural network representations revisited. In ICML.
- (2020) Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP.
- (2020) Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP.
- (2019) Speech model pre-training for end-to-end spoken language understanding. In Interspeech.
- (2017) Montreal forced aligner: trainable text-speech alignment using kaldi. In Interspeech.
- (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
- (2018) Insights on representational similarity in neural networks with canonical correlation. In NeurIPS.
- (2020) The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. In Self-Supervised Learning for Speech and Audio Processing Workshop @ NeurIPS.
- (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- (2019) Learned in speech recognition: contextual acoustic word embeddings. In ICASSP.
- (2015) LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP.
- (2019) Learning problem-agnostic speech representations from multiple self-supervised tasks. In Interspeech.
- (2014) GloVe: Global vectors for word representation. In EMNLP.
- (2018) Improving language understanding by generative pre-training. Technical Report, OpenAI.
- SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS.
- (2021) SpeechBrain: a general-purpose speech toolkit. arXiv preprint arXiv:2106.04624.
- (2019) Wav2vec: unsupervised pre-training for speech recognition. In Interspeech.
- (2019) Acoustically grounded word embeddings for improved acoustics-to-word speech recognition. In ICASSP.
- (2019) BERT rediscovers the classical NLP pipeline. In ACL.
- (2019) The bottom-up evolution of representations in the transformer: a study with machine translation and language modeling objectives. In NAACL.
- Large-scale self-and semi-supervised learning for speech translation. arXiv preprint arXiv:2104.06678.
- (2020) Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. In ICASSP.
- (2021) SUPERB: speech processing universal PERformance benchmark. arXiv preprint arXiv:2105.01051.
- (2020) Revisiting few-sample BERT fine-tuning. arXiv preprint arXiv:2006.05987.