Continual-wav2vec2: an Application of Continual Learning for Self-Supervised Automatic Speech Recognition

07/26/2021 ∙ by Samuel Kessler, et al. ∙ 0

We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL) and applying these for automatic speech recognition (ASR). There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and finetuning on a small annotated dataset is a promising direction for building speech recognition systems. Wav2vec models perform SSL on raw audio in a pretraining phase and then finetune on a small fraction of annotated data. SSL models have produced state-of-the-art results for ASR. However, these models are very expensive to pretrain with self-supervision. We tackle the problem of learning new language representations continually from audio without forgetting a previous language representation. We use ideas from continual learning to transfer knowledge from a previous task to speed up pretraining on a new language task. Our continual-wav2vec2 model can decrease pretraining times by 32% when learning a new language task, and learn this new audio-language representation without forgetting the previous language representation.




1 Introduction

Neural networks require large labelled datasets to train for applications such as image recognition or neural machine translation. In automatic speech recognition (ASR), labelled datasets are expensive to obtain and speech recognition systems generally need thousands of hours of speech annotated with text for good performance. Furthermore, there are thousands of different languages spoken in the world and only a select few have large annotated datasets. Self-supervised learning (SSL) has recently garnered a lot of attention in machine learning since it can learn representations from unlabelled data alone and achieve results competitive with fully supervised methods while training on only a small amount of labelled data. SSL for ASR has been shown to perform very well when pre-training on unlabelled raw audio data and subsequently finetuning on a small labelled dataset (Baevski et al., 2020).

SSL for ASR models do two things simultaneously: first, they learn how to map raw audio into a vector representation, and second, they learn a language-specific representation of the extracted speech features. SSL models use state-of-the-art architectures for sequence learning based on multi-head self-attention (Vaswani et al., 2017). Both of these elements combined mean that these models are extremely large, even in the case of the wav2vec2.0 base model. Training this model typically takes on the order of two weeks on a GPU cluster (which we simulate with gradient accumulation steps). Going one step further, if we want to learn a representation for a second language, then obtaining a multi-lingual representation from the union of two unlabelled datasets will take even longer to train.

We are interested in leveraging ideas from the Continual Learning (CL) paradigm to be more economical when learning a new language representation and to make SSL for audio more experimentation friendly. In CL a learner is required to learn from a sequence of tasks one after another. After training on one task the learner loses access to that task's training dataset. However, the learner is still required to remember how to perform well on the task. When training a NN continually, the parameters learnt for previous tasks are overwritten while learning a new task; this is referred to as catastrophic forgetting (French, 1999; Goodfellow et al., 2013). We aim to learn representations for language tasks continually without forgetting, and to transfer knowledge from a previous task to speed up training on a new task; this is summarized in Fig 1. We also want to do this in a parameter-efficient manner which scales sub-linearly compared to training an independent model for each new task. An example of where CL might be of use is under legislation such as GDPR, where data needs to be stored for specific reasons and storage periods can be limited.

Our core contributions are to develop an ASR model which can continually learn representations using SSL. Our model can retain performance on old tasks and crucially transfer knowledge from a previous task to be more computationally efficient when training a new task. To do this we use a modular approach to CL where we add new parameters as language adapters (Houlsby et al., 2019) for learning new languages, see Section A for a background of different CL methods. The power of this method is demonstrated in that we are able to prevent forgetting completely and speed up training a new task by reusing model components, see the results in Section 4.

Figure 1: Each task corresponds to learning a representation of a language. Different colours are different languages. After pretraining on a new language we assess the performance on all languages seen so far by finetuning.

2 Preliminaries

2.1 Continual Learning

In Continual Learning (CL) an agent needs to learn a sequence of tasks. Each task comprises a dataset. The learner trains on each task sequentially and is evaluated on the test sets of the current task and all previous tasks. The learner has access to the task identifier, which it uses to decide which task to train and evaluate on.
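The protocol above can be sketched as a short loop. This is a minimal illustration rather than the authors' code: the names `run_continual`, `train` and `evaluate` are hypothetical, and the toy learner exists only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence


def run_continual(
    tasks: Sequence[object],
    train: Callable[[int, object], None],
    evaluate: Callable[[int, object], float],
) -> List[Dict[int, float]]:
    """Train on tasks one after another and evaluate on every task seen so far.

    The task identifier (the index) is passed to both callbacks, mirroring
    the task-aware setting described in the text.
    """
    history = []
    for task_id, task in enumerate(tasks):
        train(task_id, task)
        # After each task, score the learner on the current and all earlier tasks.
        scores = {seen_id: evaluate(seen_id, tasks[seen_id]) for seen_id in range(task_id + 1)}
        history.append(scores)
    return history


# Toy learner (hypothetical): it "knows" a task once it has trained on it.
seen_tasks = []
history = run_continual(
    ["english", "french"],
    train=lambda task_id, task: seen_tasks.append(task),
    evaluate=lambda task_id, task: float(task in seen_tasks),
)
```

Each entry of `history` holds one score per task seen so far, which is the quantity the learning curves in the experiments track over the task sequence.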

2.2 wav2vec2.0

Figure 2: Diagram of the wav2vec2 model, which takes a raw waveform as input and uses a self-supervised loss to learn a representation of speech. The model consists of a convolutional encoder and a masked transformer encoder; the grey time step is masked out, predicted by the transformer encoder and contrasted against the discrete tokens from the quantizer.

The wav2vec2.0 model in Fig 2 takes a raw waveform as input. A CNN encodes the waveform into a latent representation, which is quantized and contrasted against the output of a multi-head self-attention (MHSA) encoder (Vaswani et al., 2017). Inputs to the transformer encoder are randomly masked and its outputs are compared to the true tokens of the quantizer (Baevski et al., 2020), so the model has to perform BERT-style masked prediction (Devlin et al., 2018) to learn a representation from audio (pretraining). Formally, the model takes a raw waveform as input and a CNN extracts a sequence of features. The extracted sequence is passed to an MHSA context network. The wav2vec2.0 model also discretizes the output of the feature extractor with a quantization module, which comprises multiple codebooks (Gumbel-Softmax distributions (Jang et al., 2016; Maddison et al., 2016)), each with a fixed number of entries. The outputs of the quantizer are then contrasted against the outputs of the MHSA encoder.

Figure 3: Diagram of the main components of the cwav2vec2 architecture. (A) MHSA layers in the original wav2vec2 encoder. (B) The additional cwav2vec2 layers in the encoder: additional adapters and layer norms per pretraining task. (C) The adapter architecture. Modules with a diagonal line through them denote layers which are frozen for tasks after the first.

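Objective. The contrastive loss identifies the true quantized latent speech representation at masked time steps. In the notation of Baevski et al. (2020), the MHSA encoder output $c_t$ at a masked time step $t$ is contrasted against the true quantized latent speech representation $q_t$ from a set of candidate representations $Q_t$, which contains $q_t$ and the distractors. Distractors are uniformly sampled from other masked time steps of the same utterance. The loss is:

$$\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(c_t, q_t)/\kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(c_t, \tilde{q})/\kappa\right)},$$

where $\kappa$ is a temperature and $\mathrm{sim}$ is the cosine similarity, defined as $\mathrm{sim}(a, b) = a^\top b / (\lVert a \rVert \lVert b \rVert)$. See Section C for further details.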
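The contrastive objective described above can be sketched in plain Python for a single masked time step. This is an illustrative sketch only: the function names are hypothetical, vectors are plain lists, and the default temperature is an assumed placeholder rather than a value taken from this text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true quantized target.

    `context` is the encoder output at a masked step, `target` the true
    quantized representation, `distractors` the other candidates.
    The temperature default is an assumption for illustration.
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_normaliser = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_normaliser - logits[0]


aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = contrastive_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
```

The loss is small when the context vector points towards the true target and large when it points towards a distractor, which is the behaviour the pretraining objective rewards.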

2.3 Adapters

Adapters are intermediate layers which are inserted into a deep neural network to allow adaptation to a new domain. The intermediate layers can be convolutional layers which allow adaptation to a new vision dataset after having trained the network on ImageNet to obtain domain-agnostic parameters (Rebuffi et al., 2017). Similarly, in NLP, adapters in the form of fully-connected bottleneck layers can be inserted in each layer of BERT to allow parameter-efficient finetuning on a wide range of text classification tasks (Houlsby et al., 2019). We use these adapters for cwav2vec2.0 (Fig 3).

3 Continual-wav2vec2.0

The wav2vec2.0 model does two things: a) it learns a representation from raw waveform to a vector and b) learns a language representation of speech, both by self-supervision. The objective of our continual-wav2vec2.0 (cwav2vec2.0) model is to transfer as much knowledge from the first task’s language representation to 1) enable quicker learning of a second language representation, 2) prevent catastrophic forgetting of our first task’s language representation.

We take a modular approach to CL by using language adapters (LAs) (Houlsby et al., 2019; Pfeiffer et al., 2020a, b). LAs have been used before for finetuning large language models such as BERT (Devlin et al., 2018) on different tasks after pre-training. The cwav2vec2.0 model aims to speed up pretraining on a new language speech task by freezing the learnt parameters of the first task. The feature extractor and the MHSA layers in the encoder are frozen. We assume that the waveform-to-vector mapping does not need further training for tasks after the first. Two LAs are added in each layer of the encoder to learn a language-specific representation during the self-supervised pretraining (Houlsby et al., 2019), in addition to a language-task-specific layer norm (Ba et al., 2016). The LA architecture is simply two fully connected layers followed by a layer norm with a skip connection (Figure 3). Although this restricts the number of parameters used for pre-training a new language/task representation, recent work has shown that BERT-style pretraining learns universal representations which transfer to a wide variety of tasks (Lu et al., 2021). Thus, our hypothesis is that even though we train the first task on a single-language audio dataset, the features can transfer well to a second audio pretraining task and enable us to meet criterion 1). Since we freeze the parameters from our first task, we should also meet criterion 2).
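To make the adapter description concrete, here is a minimal forward-pass sketch for a single feature vector. Several details are assumptions made for illustration and are not confirmed by the text: the ReLU non-linearity, the exact placement of the skip connection before the layer norm, and the omission of the layer norm's learned gain and bias. All function names are hypothetical.

```python
import math


def linear(x, weight, bias):
    """Fully connected layer: weight has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Layer normalisation without learned gain and bias (simplification)."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def adapter(x, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: down-project, non-linearity, up-project, skip, norm.

    The ReLU and the ordering of skip connection and layer norm are
    assumptions; the text only states two fully connected layers followed
    by a layer norm with a skip connection.
    """
    hidden = [max(0.0, h) for h in linear(x, w_down, b_down)]
    update = linear(hidden, w_up, b_up)
    return layer_norm([xi + ui for xi, ui in zip(x, update)])


# With zero-initialised projections the adapter reduces to a layer norm of its input.
x = [1.0, 2.0, 3.0, 4.0]
w_down = [[0.0] * 4 for _ in range(2)]
w_up = [[0.0] * 2 for _ in range(4)]
out = adapter(x, w_down, [0.0, 0.0], w_up, [0.0] * 4)
```

Because only the adapter and layer norm parameters would be trained for a new language, the frozen encoder weights from the first task are left untouched, which is the mechanism behind criterion 2).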

We also work with multi-head architectures which are commonly used in CL to retain task specific knowledge. These work by learning task specific mappings from the feature extractor and from the encoder to the objective (Li & Hoiem, 2017).

Finetuning. We finetune on 10 hours of data. We follow the procedure in (Baevski et al., 2020) by freezing the feature extractor and appending a linear classification layer to the output of the wav2vec2.0 architecture. However we additionally freeze the MHSA encoder layers and append a new task specific LA on top of the pretrained LA. This is partially inspired by (Pfeiffer et al., 2020b) which uses both language and task based adapters for adaptation of NLP models. We only optimise the task specific LA and the layer norms in the MHSA encoder, see Section C.4 for details.

4 Experiments

We first pretrain wav2vec2.0 on English and finetune on English to gain a baseline. We then continue pretraining on either French or Spanish, and finetune the model throughout training on all languages seen so far. See figure 1. Continual learning approaches are applied after training on each task.

Handful of tasks. SSL for ASR with wav2vec2.0 models takes a very long time to train. Hence CL in this setting is only sensible with a handful of different tasks. This is in contrast to other CL benchmarks, such as SplitCIFAR100 (Lopez-Paz & Ranzato, 2017), which can have many more tasks. However, using only a handful of tasks does not mean that we are not performing CL: we are equally concerned with forgetting, knowledge transfer and building scalable models.

Baselines. We compare our cwav2vec2.0 model to several baselines. Warm-starting: simply continuing training with wav2vec2.0 on a new dataset. Multi-head wav2vec2.0 (MH wav2vec2.0): we tune the learning rate, add task-specific projection heads between the main blocks of wav2vec2.0 and freeze the feature extractor. We also compare to MH wav2vec2.0 with an L2 regularization of the MHSA layers about the previous task's optimum (MH wav2vec2.0 + L2): we add to the loss function a squared L2 penalty on the difference between the current MHSA parameters and the previous task's optimal parameters, weighted by a hyperparameter. This is a simpler mechanism than other CL regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017).
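The L2 baseline can be illustrated with a small helper. This is a sketch under assumptions: the parameter vectors are flattened into plain lists, the name `l2_penalty` is hypothetical, and the exact scaling of the penalty (for example, whether a factor of one half is included) is not recoverable from the text.

```python
def l2_penalty(params, prev_optimum, strength):
    """Squared L2 distance to the previous task's optimum, scaled by `strength`.

    `params` and `prev_optimum` are flattened parameter lists of equal length.
    The scaling convention is an assumption for illustration.
    """
    return strength * sum((p - q) ** 2 for p, q in zip(params, prev_optimum))


def regularised_loss(task_loss, params, prev_optimum, strength):
    """Total loss for the new task: task loss plus the L2 anchor term."""
    return task_loss + l2_penalty(params, prev_optimum, strength)


at_optimum = l2_penalty([1.0, 2.0], [1.0, 2.0], strength=0.5)
drifted = l2_penalty([1.0, 2.0], [1.0, 0.0], strength=0.5)
```

Unlike EWC or SI, every parameter receives the same weighting here, which is what makes this baseline simpler than those methods.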

Figure 4: Word-error rate (WER) learning curves after finetuning on 10 hours of data at successive steps of self-supervised pretraining. The purple dashed lines denote the monolingual and multilingual wav2vec2.0 performances. By using LAs we protect against forgetting and allow learning new languages.

Results. In Fig. 4 and Fig. 5 we can see that simply warm-starting the wav2vec2.0 model when learning a new language representation for French or Spanish after English yields an unstable model, as reflected in its word-error rate (WER). With MH wav2vec2.0 we can learn a more stable representation; however, this model is susceptible to forgetting the original English language representation: we see a degradation in the English WER when finetuning. Thirdly, we validate our hypothesis that by using LAs we can entirely stop forgetting and still learn a new language representation. However, the new language representation is not as powerful as that of MH wav2vec2.0, since fewer parameters are trained when learning a second task. We see that using an L2 regularization on the MHSA layers (MH wav2vec2 + L2) can help to stop forgetting; however, it is not as robust as using adapters, as can be seen in a small drop in performance when pretraining on Spanish. Additionally, it does not give the same efficiency benefits as cwav2vec2 (Fig. 6); see Section C.3 for details.

Our objective 1), to transfer knowledge from our first task in order to learn a new language representation of audio and decrease the amount of time required to train a new task, is achieved using our cwav2vec2.0 model (Fig. 6). This is because a large share of the parameters are frozen. Please see Section C for implementation details and parameter values. Objective 2) is also achieved, as LAs preserve the previous task parameters and thus prevent forgetting. Additionally, we demonstrate qualitatively that the task-aware design of cwav2vec2 ensures that we learn a language-specific embedding which does not overlap with the previous task's language embedding, thus helping to alleviate forgetting; see Section D.

5 Discussion and Conclusion

We introduce cwav2vec2.0, a simple and effective modular CL approach to build up self-supervised language representations from raw audio. cwav2vec2.0 builds on the successful wav2vec2.0 model. It is able to limit forgetting of our initial audio representation while learning a new task by freezing parameters and training LAs which specialize to the new task. By freezing parameters and training new LA modules we are able to significantly speed up learning a new self-supervised language representation on audio. Furthermore, when learning a new representation in a multi-task scenario we are required to train on all the data, from both the first task and the new task, making training even more expensive. By using cwav2vec2.0 we can train a new representation in fewer days than with wav2vec2.0, enabling us to be more economical with computational resources and making self-supervision with audio more experimentation friendly. Additionally, we can achieve close to the monolingual performance on a new language despite having frozen a significant fraction of the model's parameters and training LAs only. Future directions include allowing cwav2vec2 to scale to further tasks and allowing different LAs to be combined.

Figure 5: Word-error rate (WER) learning curves after finetuning on 10 hours of data at successive steps of self-supervised pretraining, first on English and then on Spanish. The results for Spanish are very similar to the learning curves for French in Fig 4.
Figure 6: Left: time required to pre-train our models on a second task. Right: the total number of parameters in the models, split into trainable and frozen parameters.


Appendix A Related Works

A.1 Continual Learning

CL methods can be categorized into three different types. The first is regularization methods, where the optimal parameters from the previous task are used to regularize learning a new task. EWC uses sequential Bayes with a Laplace approximation to motivate an L2 regularization with respect to previous task weights (Kirkpatrick et al., 2017). Other methods also use an L2 regularization about the previous task's optimal parameters but with different penalty weightings (Schwarz et al., 2018; Zenke et al., 2017), or use variational inference with a KL penalty (Nguyen et al., 2017; Kessler et al., 2019). Task-specific functions can also be regularized to ensure forgetting is minimized (Li & Hoiem, 2017).

Replay methods use a single predictor for all tasks and tackle forgetting by replaying data from past tasks while training on the current task. This is done either by storing a small memory of data from previous tasks (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018), or by using a generative model which can sample previous task data to augment the current task dataset, preventing the learner from forgetting past tasks (Shin et al., 2017; van de Ven et al., 2020).

The third category is expansion methods, which add new parameters to a model when learning a new task. In Progressive Nets a new NN predictor is added for each new task with connections between adjacent layers (Rusu et al., 2016); this means that the number of network parameters grows super-linearly as the number of tasks increases. Dynamically Expanding Networks use a sparsity regularization to dynamically select and limit the number of new weights to add for each new task (Yoon et al., 2017). New modules can be added by a controller by casting the expansion process as a reinforcement learning problem (Xu & Zhu, 2018), although this is very expensive to train. Evolution methods and random search have also been applied to select the optimal sub-network in a large NN for each task (Fernando et al., 2017; Rajasegaran et al., 2019). By finding the task subnetwork which most resembles the previous task and adding modules to this subnetwork, one can then use exhaustive search to find the subnetwork which performs best on a new task while freezing previous task weights (Veniat et al., 2020). This approach allows the number of parameters to increase sub-linearly in the number of tasks. CL has also been applied to RL, where different tasks are different environments (Rolnick et al., 2018; Kessler et al., 2021).

A.2 Continual Learning applied to Speech

CL has also been applied to ASR models using a modular approach with a separate model encapsulating each new English dataset (Sadhu & Hermansky, 2020). This results in a linear increase in the number of parameters and an increase in inference time. CL has also been applied to ASR models where different tasks are different English dialects, with RNN predictors trained using EWC and LwF (Houston & Kirchhoff, 2020). Different CL strategies have been applied to deep LSTM-based models trained with the CTC loss, using different English datasets as different tasks (Chang et al., 2021). The authors compare regularization and memory-based approaches and show that memory approaches (Lopez-Paz & Ranzato, 2017) work well. As far as we are aware, we are the first to apply CL methods to very large ASR models which leverage SSL.

A.3 Automatic Speech Recognition and Self-supervised Learning

The wav2vec model uses the CPC loss (Oord et al., 2018) for ASR (Schneider et al., 2019). Extending wav2vec, the vq-wav2vec model discretizes the audio representations (using a Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) or K-means) with CPC-style pretraining. In a second step it performs BERT-style pre-training by masking input tokens (Devlin et al., 2018) to create a language representation for subsequent ASR finetuning (Baevski et al., 2019). It is more effective to use discrete audio representations than a vector representation (Baevski & Mohamed, 2020). The wav2vec2.0 model solves the quantization and language representation learning end-to-end (Baevski et al., 2020). Using the representation from pretraining and finetuning on a small amount of labelled ASR data, the model achieves state-of-the-art results when finetuning on only minutes of data.

State of the art results for ASR using SSL have been achieved by combining different methods (Zhang et al., 2020): wav2vec2.0 is used for pretraining with a Conformer encoder (Gulati et al., 2020). Then during finetuning, iterative pseudo-labelling is performed to augment the labelled dataset (Xu et al., 2020).

Using CPC (Oord et al., 2018) and a simple GRU network on top of a convolutional encoder has been shown to produce representations which generalise to CommonVoice datasets when pre-trained on English (Rivière et al., 2020). Also, the speech representations from CPC on English alone have been shown to generalise to other low-resource African, Asian and Latin American languages (Kawakami et al., 2020). wav2vec2.0 trained out-of-the-box on multilingual audio can learn an audio representation which can perform multi-language ASR using a single quantisation module (Conneau et al., 2020). Multi-lingual ASR by directly optimizing the CTC loss (Graves et al., 2006) is explored with language adapters and multi-lingual BERT to generalize to low-resource languages (Winata et al., 2020).

Appendix B Datasets

B.1 English: LibriSpeech

We use the LibriSpeech dataset, an English ASR dataset derived from audiobooks (Panayotov et al., 2015). The corpus is divided into "clean" and "other" partitions: lower-WER speakers are placed into the clean partition and the rest into the other partition. We validate our models on the dev-clean dataset; the sizes of the datasets are given in Table 1.

Dataset            Hours
dev-clean          5.4
dev-other          5.3
test-clean         5.4
test-other         5.1
train-clean-100    100.6
train-clean-360    363.6
train-other-500    596.7
Table 1: LibriSpeech datasets and their lengths.

B.2 French

The datasets used for self-supervised pretraining on French audio are listed in Table 2. A held-out portion of the raw audio is used for validation during self-supervision. For finetuning we use 10 hours from the common_voice training set and validate on the common_voice dev set.

Dataset                               Train (hours)   dev (hours)   test (hours)
common_voice (Ardila et al., 2019)    962.4           23.8          25.0
fr FR HQ                              142.8           -             -
french single speaker                 19.1            -             -
Voxforge                              37.2            -             -
LibriFrench                           45.6            -             -
Siwis                                 10.7            -             -
African Accented                      0               1.0           0.3
Table 2: French datasets used for pretraining and finetuning.

B.3 Spanish

The datasets used for self-supervised pretraining on Spanish audio are listed in Table 3. A held-out portion of the raw audio is used for validation during self-supervision. As for French, we finetune on 10 hours from the common_voice training set and validate on the common_voice dev set.

Dataset                               Train (hours)   dev (hours)   test (hours)
common_voice (Ardila et al., 2019)    825.19          24.8          25.5
tedX (Hernandez-Mena, 2019)           14.7            -             -
Table 3: Spanish datasets used for pretraining and finetuning.

Appendix C Implementations

We use the fairseq framework (Ott et al., 2019) to implement our continual-wav2vec2 models. The pre-training and finetuning proceeds similarly to (Baevski et al., 2020) apart from the model changes in MH wav2vec2 and cwav2vec2. We provide a recap for completeness.

Feature extractor. The feature extractor contains several temporal convolutional layers with layer norms (Ba et al., 2016) and GeLU activation functions (Hendrycks & Gimpel, 2016).

Quantization module. The quantization module discretizes the output of the feature extractor. It comprises multiple codebooks, each with a fixed number of entries. One entry from each codebook is chosen and the chosen entries are concatenated. To select the entries, the outputs of the feature encoder are mapped to logits over the entries of each codebook. In the wav2vec2.0 BASE model used, the dimensionalities of the encoder outputs and of each codebook are fixed by the model configuration. A linear transformation is then applied to the concatenated vectors to obtain the quantized representation.
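The entry selection described above can be sketched as a hard Gumbel-max choice per codebook followed by concatenation. This is a forward-pass illustration only: it leaves out the straight-through gradient used during training and the final linear transformation, and the function names and toy dimensions are hypothetical.

```python
import math
import random


def gumbel_noise(rng):
    """Sample standard Gumbel noise, clamping the uniform draw away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def quantize(logits, codebooks, rng, temperature=1.0):
    """Pick one entry per codebook and concatenate the chosen entries.

    logits[g][v] scores entry v of codebook g; codebooks[g][v] is that
    entry's vector. Selection is argmax over Gumbel-perturbed logits.
    """
    chosen = []
    for group_logits, codebook in zip(logits, codebooks):
        noisy = [(l + gumbel_noise(rng)) / temperature for l in group_logits]
        index = max(range(len(noisy)), key=noisy.__getitem__)
        chosen.extend(codebook[index])
    return chosen


# Two toy codebooks with two entries each (sizes chosen only for illustration).
codebooks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
logits = [[100.0, 0.0], [0.0, 100.0]]
quantized = quantize(logits, codebooks, random.Random(0))
```

With logits this far apart the noise cannot change the winner, so the first entry of the first codebook and the second entry of the second codebook are selected; with closer logits the choice becomes stochastic, which is what lets training explore the codebook.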

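Objective. The contrastive loss, which identifies the true quantized latent speech representation at masked time steps, is written in the notation of Baevski et al. (2020) as:

$$\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(c_t, q_t)/\kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(c_t, \tilde{q})/\kappa\right)},$$

where $c_t$ is the context network output at masked time step $t$, $q_t$ is the true quantized representation, $Q_t$ is the candidate set containing $q_t$ and the distractors, $\kappa$ is a temperature and $\mathrm{sim}$ is the cosine similarity, defined as $\mathrm{sim}(a, b) = a^\top b / (\lVert a \rVert \lVert b \rVert)$. A diversity regularization is also added to encourage the model to use all $V$ entries in each of the $G$ codebooks by maximizing the entropy of the averaged softmax distribution $\bar{p}_g$ over the entries of each codebook:

$$\mathcal{L}_d = \frac{1}{GV} \sum_{g=1}^{G} -H(\bar{p}_g) = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v}.$$

Thus the objective is $\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d$, where $\alpha$ is a scalar which controls the strength of the codebook diversity regularization.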
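The diversity regularization can be sketched as the negative entropy of the batch-averaged softmax over each codebook, averaged over codebooks and entries. This is a minimal illustration: the nested-list layout `batch_logits[b][g][v]` and the function names are assumptions, not the fairseq implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def diversity_loss(batch_logits):
    """Negative entropy of the averaged codebook usage, scaled by 1 / (G * V).

    batch_logits[b][g][v] is the logit of entry v in codebook g for sample b.
    Minimising this value maximises the entropy of the average usage.
    """
    num_groups = len(batch_logits[0])
    num_entries = len(batch_logits[0][0])
    total = 0.0
    for g in range(num_groups):
        probs = [softmax(sample[g]) for sample in batch_logits]
        average = [sum(p[v] for p in probs) / len(probs) for v in range(num_entries)]
        total += sum(p * math.log(p) for p in average if p > 0.0)
    return total / (num_groups * num_entries)


uniform_usage = diversity_loss([[[0.0, 0.0]], [[0.0, 0.0]]])
peaked_usage = diversity_loss([[[10.0, -10.0]], [[10.0, -10.0]]])
```

Uniform usage of the entries gives the lowest value, while a batch that always selects the same entry is penalised, which pushes the model to spread its selections across the whole codebook.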


The training objective is to identify the correct quantized latent audio representation by comparing it to the masked prediction. The feature encoder outputs are masked before being fed into the context network. To mask, we randomly sample a proportion of all time steps to be starting indices and then mask a fixed number of subsequent consecutive time steps from every sampled index; spans can overlap. This masking procedure is identical to that of wav2vec2.0 (Baevski et al., 2020).
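The span masking procedure can be sketched as follows. This is an illustrative sketch: the function name, the rounding of the number of start indices and the values used in the usage line are assumptions chosen for the example, not the settings used in the paper.

```python
import random


def sample_span_mask(num_steps, start_fraction, span_length, rng):
    """Return a boolean mask with spans of `span_length` starting at sampled indices.

    A fraction of all time steps is sampled without replacement as span
    starts; spans may overlap and are truncated at the end of the sequence.
    """
    num_starts = int(round(start_fraction * num_steps))
    starts = rng.sample(range(num_steps), num_starts)
    mask = [False] * num_steps
    for start in starts:
        for step in range(start, min(start + span_length, num_steps)):
            mask[step] = True
    return mask


# Illustrative values only: 100 steps, 10% of steps as starts, spans of 10 steps.
mask = sample_span_mask(100, 0.1, 10, random.Random(0))
```

Because spans can overlap, the fraction of masked steps is usually smaller than the start fraction multiplied by the span length.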

Pre-training. We performed pre-training on Tesla V100 GPUs. We simulated the number of GPUs used by wav2vec2.0 by using gradient accumulation steps before updating the parameters. We follow the pre-training procedure of wav2vec2.0 (Baevski et al., 2020). For masking, we sample the starting timesteps and mask the timesteps that follow each of them, as described above. The feature extractor has seven convolutional blocks, each with its own stride and kernel width. A convolutional layer models relative positional embeddings (Mohamed et al., 2019). We use the BASE model configuration for the encoder (Baevski et al., 2020), which fixes the number of transformer blocks, the model dimension, the inner dimension (FC 2 in Figure 3) and the number of attention heads. Audio crops are capped at a maximum number of samples per GPU. We use Adam and warm up the learning rate over an initial fraction of the updates to its maximum value; we use the same learning rate scheduling procedure for all our models and the learning rate quoted is always the maximum. We train for a similar number of steps to the wav2vec2.0 BASE model. The diversity loss is weighted by a penalty coefficient. The Gumbel-Softmax temperature is annealed from a maximum to a minimum value by a constant factor at every update (Jang et al., 2016). The contrastive loss uses a temperature and a fixed number of distractors. An L2 penalty is added to the activations of the final layer of the feature encoder.

Fine-tuning. After pretraining on each unlabelled task dataset we finetune the learned representations on 10 hours of labelled data, following the same procedure as Baevski et al. (2020). A new classification layer is placed on the final layer of the network, and the feature extractor is frozen during finetuning. The network is optimized with Adam and the CTC loss (Graves et al., 2006) using a tri-stage learning rate: the learning rate is warmed up over an initial fraction of the updates, then held constant at the maximum learning rate for a further fraction of the steps, and then linearly decayed. We do not use a language model for fine-tuning.
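The tri-stage schedule described above can be sketched as a function of the update step. This follows the description in the text (linear warm-up, hold, linear decay); the function name, the stage fractions and the learning rates in the usage lines are illustrative assumptions, not the values used in the paper.

```python
def tri_stage_lr(step, total_steps, max_lr, warmup_frac, hold_frac,
                 init_lr=0.0, final_lr=0.0):
    """Learning rate at `step` for a warm-up / hold / linear-decay schedule."""
    warmup_steps = int(warmup_frac * total_steps)
    hold_steps = int(hold_frac * total_steps)
    decay_steps = total_steps - warmup_steps - hold_steps
    if step < warmup_steps:
        # Stage 1: linear warm-up from init_lr to max_lr.
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Stage 2: hold at the maximum learning rate.
        return max_lr
    # Stage 3: linear decay from max_lr to final_lr.
    progress = min(1.0, (step - warmup_steps - hold_steps) / max(1, decay_steps))
    return max_lr + (final_lr - max_lr) * progress


# Illustrative settings: 100 updates, 10% warm-up, 40% hold, 50% decay.
schedule = [tri_stage_lr(s, 100, 1.0, 0.1, 0.4) for s in range(101)]
```

Holding the rate at its maximum before decaying gives the newly added classification layer time to train before the step size shrinks.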

C.1 Warm-starting wav2vec2.0

If we use the default parameters for wav2vec2.0, including the default learning rate, we find that the model yields unstable results when we simply warm-start the network to learn an audio representation for a new language as a second task. We can lower the learning rate for the new task to prevent the instability. By lowering the learning rate for a new task we are essentially balancing how much we can learn on the new task against remaining stable with respect to forgetting the previous task. We find that by simply lowering the learning rate of wav2vec2.0 when learning a new task we can restrict forgetting while still learning a new audio task; see Figure 4 and Figure 5. However, by simply warm-starting the model we do not get the efficiency benefits for a second or third task that we get with cwav2vec2.0.

C.2 Multi-headed wav2vec2.0

Figure 7: Diagram of the cwav2vec2.0 v1 architecture. The strike-through means that the module is frozen and no gradient is taken when training with SGD.

We propose a very simple baseline, which is to use multi-head architectures for wav2vec2.0. Multi-head architectures are commonly used in CL to retain task-specific knowledge by learning task-specific mappings from hidden to output layers (Li & Hoiem, 2017; Nguyen et al., 2017). We use heads to learn task-specific mappings from the feature extractor to the encoder and from the encoder to the objective function. After training on a first task, the feature extractor has learnt how to extract features from the audio, so we can freeze the feature extractor and add a language-specific head before the MHSA encoder. We also tune the learning rate over a small set of candidate values by pretraining for a limited number of steps.

Additionally, we enable learning a new language representation in the quantization module by adding a language-specific quantizer. This enables learning a new language without interference from a previous language. This is in contrast to previous multi-task wav2vec2.0 models, which share a quantizer for all languages (Conneau et al., 2020). Instead of warm-starting the quantizer from the previous task, we learn a new quantizer from randomly initialized weights. In the CTC finetuning step we train the entire encoder and an additional fully connected projection layer mapping into the classes representing the vocabulary of the ASR task.

Finetuning proceeds in the same way as wav2vec2.0.

Figure 8: Diagram of cwav2vec2.0 finetuning architecture: the pretraining adapters are frozen and the new finetuning adapters are placed on top of the pretraining adapters and trained together with the layer norms during the fine-tuning stage.

C.3 Multi-headed wav2vec2.0 and L2 regularization

To combat forgetting, regularization methods like EWC (Kirkpatrick et al., 2017) and SI (Zenke et al., 2017) add a regularization penalty around the previous task's optimal parameters with different weightings for each parameter. We apply a simple regularization in a similar vein: for each new task we add an L2 penalty on every parameter in the MHSA layers, anchored at the optimal parameters for the previous task and scaled by a hyperparameter which controls the regularization strength. We select the regularization strength from a small set of candidate values after a limited number of pretraining steps. We also experiment with regularizing only the top layers of the MHSA encoder and find little difference in performance after the same number of pretraining steps; hence we regularize only the top layers, to make the optimization simpler and to ensure we do not constrain the learning of the new language representation too much.

C.4 Continual-wav2vec2.0

We place LAs in each layer of the MHSA encoder. For French, the adapter bottleneck dimension and the learning rate are each tuned over a small set of candidate values by pretraining for only a limited number of steps. For Spanish, the bottleneck dimension and the learning rate are tuned in the same way. Performance is very dependent on adapter size and learning rate, and further tuning could lead to better performance.
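Since performance depends on adapter size, it can help to see how the bottleneck dimension drives the number of trainable parameters. The sketch below counts the parameters of one bottleneck adapter; the helper names are hypothetical and the dimensions in the usage lines are illustrative assumptions rather than the values used in the paper.

```python
def adapter_param_count(model_dim, bottleneck_dim, layer_norm=True):
    """Parameters in one adapter: two fully connected layers plus an optional layer norm."""
    down = model_dim * bottleneck_dim + bottleneck_dim  # weights + biases
    up = bottleneck_dim * model_dim + model_dim         # weights + biases
    norm = 2 * model_dim if layer_norm else 0           # gain + bias
    return down + up + norm


def total_adapter_params(num_layers, adapters_per_layer, model_dim, bottleneck_dim):
    """Trainable adapter parameters added to the encoder for one new task."""
    return num_layers * adapters_per_layer * adapter_param_count(model_dim, bottleneck_dim)


# Illustrative dimensions only (assumed for this example).
small = total_adapter_params(num_layers=12, adapters_per_layer=2, model_dim=768, bottleneck_dim=64)
large = total_adapter_params(num_layers=12, adapters_per_layer=2, model_dim=768, bottleneck_dim=256)
```

The count grows linearly in the bottleneck dimension, so the bottleneck size trades off capacity for the new language against the number of parameters added per task.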


In wav2vec2 the encoder and a new classification layer are finetuned. For cwav2vec2, by contrast, we place LAs on top of the pretraining adapters and train only the newly placed adapters and the corresponding layer norms (Figure 8). The finetuning LAs have their own bottleneck dimension, and we use a tri-stage learning rate schedule.

Figure 9: t-SNE embeddings (Van der Maaten & Hinton, 2008). We compare the cwav2vec2 embeddings of different languages from different quantizers: after learning on English only (column 1), then after learning a second task, French (row 1) or Spanish (row 2). The third column shows the multilingual embeddings from XLSR (Conneau et al., 2020).

Appendix D Qualitative analysis of cwav2vec2 versus wav2vec2

In Figure 9 we show embeddings from the quantizer of cwav2vec2.0 in columns 1 and 2 and compare them to the multi-task/multi-lingual representation from the XLSR model (Conneau et al., 2020) in column 3. We visualize the features from the quantizers, which gives an understanding of the language representation, since the quantizer and the encoder are learnt jointly. We notice from Figure 9 that when training on the first task, which is English only (column 1), we can achieve a good zero-shot embedding for French and Spanish, reflected in the WER we obtain when finetuning on these languages without any pretraining on them. More importantly, we show in the second column that there is a separation between the languages which have been learnt by cwav2vec2.0. cwav2vec2.0 is task aware and learns to represent a new language as a different task, hence the separation between language embeddings in column 2; note how the representation for English in column 2 has been preserved from column 1. This is in contrast to the final column, where the model has been trained multi-lingually on many different languages and is not task aware; as a result the language embeddings overlap.

We can see from Figure 9 how the separability of the embeddings is produced by constraining the pretraining process for the new task to focus specifically on learning a new language representation, distinct from that of the first task.