Neural networks require large labelled datasets to train for applications such as image recognition or neural machine translation. In automatic speech recognition (ASR), labelled datasets are expensive to obtain, and speech recognition systems generally need thousands of hours of speech annotated with text for good performance. Furthermore, there are thousands of different languages spoken in the world and only a select few have large annotated datasets. Self-supervised learning (SSL) has recently garnered a lot of attention in machine learning since it can learn representations from unlabelled data alone and achieve extremely competitive results compared to fully supervised methods while training on only a small amount of labelled data. SSL for ASR has been shown to produce very good performance when pre-training on unlabelled raw audio and subsequently finetuning on a small labelled dataset (Baevski et al., 2020).
SSL models for ASR do two things simultaneously: firstly, they learn how to map raw audio into a vector representation, and secondly, they learn a language-specific representation of the extracted speech features. SSL models use state-of-the-art architectures for sequence learning based on multi-head self-attention (Vaswani et al., 2017). Both of these elements combined mean that these models are extremely large; the wav2vec2.0 BASE model has roughly 95M parameters. Training this model typically takes on the order of two weeks on a GPU cluster (we simulate a larger GPU cluster with gradient accumulation steps). Going one step further, if we want to learn a representation for a second language, then obtaining a multi-lingual representation from the union of two unlabelled datasets will take even longer to train.
We are interested in leveraging ideas from the Continual Learning (CL) paradigm to be more economical when learning a new language representation and to make SSL for audio more experimentation-friendly. In CL, a learner is required to learn from a sequence of tasks one after another. After training on one task the learner loses access to that task's training dataset, but is still required to perform well on the task. When a neural network is trained continually, parameters important for previous tasks are overwritten while learning a new task; this is referred to as catastrophic forgetting (French, 1999; Goodfellow et al., 2013). We aim to learn representations for language tasks continually without forgetting, and to transfer knowledge from a previous task to speed up training on a new task; this is summarized in Fig 1. We also want to do this in a parameter-efficient manner which scales sub-linearly compared to training an independent model for each new task. An example of where CL might be of use is under legislation such as GDPR, where data may only be stored for specific reasons and storage periods can be limited (see ec.europa.eu/info/law/law-topic/data-protection).
Our core contribution is an ASR model which can continually learn representations using SSL. Our model retains performance on old tasks and, crucially, transfers knowledge from a previous task to be more computationally efficient when training a new task. To do this we use a modular approach to CL where we add new parameters in the form of language adapters (Houlsby et al., 2019) for learning new languages; see Section A for a background on different CL methods. The power of this method is that we are able to prevent forgetting completely and speed up training on a new task by reusing model components; see the results in Section 4.
2.1 Continual Learning
In Continual Learning (CL) an agent needs to learn a sequence of tasks T_1, T_2, ..., up to T_K. Each task t is comprised of a dataset D_t. The learner trains on each task sequentially and is evaluated on the test sets of all previous tasks. The learner has access to the task identifier t, which it uses to decide which task to evaluate and train on.
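The protocol above can be sketched as a simple training loop; `train` and `evaluate` are hypothetical user-supplied callables, not part of the paper's codebase:

```python
# Schematic continual learning protocol: train on tasks sequentially and,
# after each task, evaluate on the test sets of all tasks seen so far.
def continual_learn(model, tasks, train, evaluate):
    """tasks: list of (train_set, test_set) pairs.
    Returns results[t][i] = performance on task i's test set after
    training on task t (the learner never revisits old train sets)."""
    results = []
    seen_test_sets = []
    for task_id, (train_set, test_set) in enumerate(tasks):
        train(model, train_set, task_id)  # access to D_t only
        seen_test_sets.append(test_set)
        results.append([evaluate(model, ts, i)
                        for i, ts in enumerate(seen_test_sets)])
    return results
```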
The wav2vec2.0 model in Fig 2 takes a raw waveform as input; this is encoded by a CNN into a latent representation which is then quantized and contrasted against the tokens output by a multi-head self-attention (MHSA) encoder (Vaswani et al., 2017). Inputs to the transformer encoder are randomly masked and its outputs are compared to the true tokens of the quantizer (Baevski et al., 2020), so the model performs BERT-style masked prediction to learn a representation from audio (pretraining) (Devlin et al., 2018). Formally, the model takes as input a raw waveform X, from which a CNN extracts features z_1, ..., z_T. The extracted sequence is passed to a MHSA context network, which outputs c_1, ..., c_T. The wav2vec2.0 model also discretizes the output z_t of the feature extractor into q_t. The quantization module is comprised of G codebooks (Gumbel-Softmax distributions (Jang et al., 2016; Maddison et al., 2016)), each with V entries. The outputs of the quantizer are then contrasted against the outputs of the MHSA encoder.
Objective. The contrastive loss identifies the true quantized latent speech representation among distractors at masked time steps. The MHSA encoder output c_t at a masked time step t is contrasted against the true quantized latent speech representation q_t from a set Q_t of candidate representations which includes K distractors. Distractors are uniformly sampled from other masked time steps of the same utterance. The loss is:

L_m = -log [ exp(sim(c_t, q_t)/kappa) / Sum_{q~ in Q_t} exp(sim(c_t, q~)/kappa) ],

where sim(a, b) = a^T b / (||a|| ||b||) is the cosine similarity and kappa is a temperature. See Section C for further details.
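A minimal numpy sketch of this loss for a single masked time step (function names and the default temperature are illustrative):

```python
import numpy as np

def cosine_sim(a, b):
    # sim(a, b) = a.b / (||a|| ||b||)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """InfoNCE-style loss for one masked time step t.
    c_t: context network output; q_t: true quantized latent;
    distractors: quantized latents sampled from other masked time
    steps of the same utterance; kappa: temperature."""
    candidates = [q_t] + list(distractors)  # true latent at index 0
    logits = np.array([cosine_sim(c_t, q) / kappa for q in candidates])
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())
    return log_denominator - logits[0]  # -log softmax of the true latent
```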
Adapters are intermediate layers which are inserted into a deep neural network to allow adaptation to a new domain. The intermediate layers can be convolutional layers which allow adaptation to a new vision dataset after the network has been trained on ImageNet to obtain domain-agnostic parameters (Rebuffi et al., 2017). Similarly, in NLP, adapters in the form of fully-connected bottleneck layers can be inserted in each layer of BERT to allow parameter-efficient finetuning on a wide range of text classification tasks (Houlsby et al., 2019). We use these adapters for cwav2vec2.0, Fig 3.
The wav2vec2.0 model does two things: a) it learns a representation from raw waveform to a vector and b) learns a language representation of speech, both by self-supervision. The objective of our continual-wav2vec2.0 (cwav2vec2.0) model is to transfer as much knowledge from the first task’s language representation to 1) enable quicker learning of a second language representation, 2) prevent catastrophic forgetting of our first task’s language representation.
Language Adapters (LAs). LAs have been used before for finetuning large language models such as BERT (Devlin et al., 2018) on different tasks after pre-training. The cwav2vec2.0 model aims to speed up pretraining on a new language speech task by freezing the parameters learnt on the first task: the feature extractor and the MHSA layers are frozen. We assume that the waveform-to-vector function mapping doesn't need further training for subsequent tasks. Two LAs are added in each encoder layer to learn a language-specific representation during self-supervised pretraining (Houlsby et al., 2019), in addition to a language-task-specific layer norm (Ba et al., 2016). The LA architecture is simply two fully connected layers followed by a layer norm, with a skip connection, Figure 3. Despite restricting the number of parameters used for pre-training a new language/task representation, recent work has shown that BERT-style pretraining learns universal representations which are transferable to a wide variety of tasks (Lu et al., 2021). Thus, our hypothesis is that even though we train the first task on a single-language audio dataset, the features can transfer well to a second audio pretraining task and enable us to meet criterion 1). Since we freeze the parameters from our first task, we should also meet criterion 2).
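The LA described above (two fully connected layers, a layer norm, and a skip connection) can be sketched in PyTorch; the bottleneck width and the GELU nonlinearity here are illustrative choices, not values stated in the text:

```python
import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    """Bottleneck adapter: down-projection (FC 1), nonlinearity,
    up-projection (FC 2), layer norm, plus a residual skip connection."""
    def __init__(self, d_model=768, bottleneck=256):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.act = nn.GELU()

    def forward(self, x):
        # adapter output is added back to the input (skip connection)
        return x + self.norm(self.up(self.act(self.down(x))))
```

Because the adapter preserves the model dimension, it can be dropped between existing frozen transformer sub-layers without changing the rest of the architecture.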
We also work with multi-head architectures, which are commonly used in CL to retain task-specific knowledge. These work by learning task-specific mappings from the feature extractor and from the encoder to the objective (Li & Hoiem, 2017).
Finetuning. We finetune on 10 hours of data. We follow the procedure in (Baevski et al., 2020) by freezing the feature extractor and appending a linear classification layer to the output of the wav2vec2.0 architecture. However we additionally freeze the MHSA encoder layers and append a new task specific LA on top of the pretrained LA. This is partially inspired by (Pfeiffer et al., 2020b) which uses both language and task based adapters for adaptation of NLP models. We only optimise the task specific LA and the layer norms in the MHSA encoder, see Section C.4 for details.
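The finetuning recipe above freezes everything except the task-specific LAs and the encoder layer norms. A hedged sketch of this selective freezing (the parameter-name keywords are hypothetical and depend on how the model is defined):

```python
import torch.nn as nn

def freeze_for_finetuning(model, trainable_keywords=("adapter", "layer_norm")):
    """Freeze all parameters except those whose names contain one of
    trainable_keywords (e.g. the task-specific LAs and layer norms).
    Returns the names of the parameters left trainable."""
    trainable = []
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in trainable_keywords)
        if p.requires_grad:
            trainable.append(name)
    return trainable
```

Only the unfrozen parameters are then passed to the optimizer, so the bulk of the pretrained network stays fixed while finetuning.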
We first pretrain wav2vec2.0 on English and finetune on English to gain a baseline. We then continue pretraining on either French or Spanish, and finetune the model throughout training on all languages seen so far. See figure 1. Continual learning approaches are applied after training on each task.
Handful of tasks. SSL for ASR with wav2vec2.0-style models takes a very long time to train. Hence CL in this setting is only sensible with a handful of different tasks. This is in contrast to other CL benchmarks which can have many tasks, such as 20-way SplitCIFAR100 (Lopez-Paz & Ranzato, 2017). However, just because we are using a handful of tasks doesn't mean that we are not performing CL: we are equally concerned with forgetting, knowledge transfer, and building scalable models.
Baselines. We compare our cwav2vec2.0 model to several baselines. Warm-starting: simply continuing training with wav2vec2.0 on a new dataset. Multi-head wav2vec2.0 (MH wav2vec2.0): we tune the learning rate, add task-specific projection heads between the main blocks of wav2vec2.0, and freeze the feature extractor. We also compare to MH wav2vec2.0 with an L2 regularization of the MHSA layers about the previous task's optimum (MH wav2vec2.0 + L2). We add the term

(lambda/2) ||theta - theta*_{t-1}||^2_2

to the loss function, where theta*_{t-1} is the previous task's optimal parameters and lambda is a hyperparameter. This is a simpler mechanism than other CL regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017).
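The penalty is straightforward to compute; a numpy sketch (applied here to whichever parameter subset is being regularized, e.g. the MHSA layers):

```python
import numpy as np

def l2_to_previous(params, prev_params, lam):
    """Quadratic penalty (lam/2) * ||theta - theta_prev||^2 added to the
    new task's loss. params / prev_params: matching lists of arrays
    (e.g. only the MHSA layer weights)."""
    diff = np.concatenate([(p - q).ravel()
                           for p, q in zip(params, prev_params)])
    return 0.5 * lam * float(diff @ diff)
```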
Results. In Fig. 4 and Fig. 5 we can see that simply warm-starting the wav2vec2.0 model when learning a new language representation for French or Spanish after English yields an unstable model with a very high word-error rate (WER). With MH wav2vec2.0 we can learn a more stable representation; however, this model is susceptible to forgetting of the original English language representation: we see a degradation in the English WER when finetuning. Thirdly, we validate our hypothesis that by using LAs we can entirely stop forgetting while still learning a new language representation. However, the new language representation is not as powerful as MH wav2vec2.0, since the number of trainable parameters is smaller when learning a second task. We see that using an L2 regularization on the MHSA layers (MH wav2vec2 + L2) can help to stop forgetting, but it is not as robust as using adapters, as can be seen in a small drop in performance when pretraining on Spanish. Additionally, we do not get the same efficiency benefits as cwav2vec2, Fig 6; see Section C.3 for details.
Our objective 1), to transfer knowledge from our first task to learn a new language representation of audio and decrease the amount of time required to train a new task, is achieved by our cwav2vec2.0 model, Fig. 6. The reason for this is the large fraction of parameters which are frozen. Please see Section C for implementation details and parameter values. Objective 2) is also achieved, as LAs preserve the previous task's parameters, thus preventing forgetting. Additionally, we demonstrate qualitatively that the task-aware design of cwav2vec2 ensures that we learn a language-specific embedding which doesn't overlap with the previous task's language embedding, thus helping to alleviate forgetting; see Section D.
5 Discussion and Conclusion
We introduce cwav2vec2.0, a simple and effective modular CL approach to build up self-supervised language representations from raw audio. cwav2vec2.0 builds on the successful wav2vec2.0 model: it limits forgetting of the initial audio representation while learning a new task by freezing parameters and training LAs which specialize to the new task. By freezing parameters and training new LA modules we are able to significantly speed up learning a new self-supervised language representation on audio. Furthermore, learning a new representation in a multi-task scenario requires training on all the data, first task and new task, making training even more expensive. By using cwav2vec2.0 we can train a new representation in a matter of days, compared to roughly two weeks using wav2vec2.0, enabling us to be more economical with computational resources and making self-supervision with audio more experimentation-friendly. Additionally, we achieve close to mono-lingual performance on a new language despite having frozen a significant fraction of the model's parameters and training only the LAs. Future directions include allowing cwav2vec2 to scale to further tasks and allowing different LAs to combine.
- Ardila et al. (2019) Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.
- Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Baevski & Mohamed (2020) Baevski, A. and Mohamed, A. Effectiveness of self-supervised pre-training for asr. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7694–7698. IEEE, 2020.
- Baevski et al. (2019) Baevski, A., Schneider, S., and Auli, M. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453, 2019.
- Baevski et al. (2020) Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.
- Chang et al. (2021) Chang, H.-J., Lee, H.-y., and Lee, L.-s. Towards lifelong learning of end-to-end asr. arXiv preprint arXiv:2104.01616, 2021.
- Chaudhry et al. (2018) Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
- Conneau et al. (2020) Conneau, A., Baevski, A., Collobert, R., Mohamed, A., and Auli, M. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979, 2020.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Fernando et al. (2017) Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
- French (1999) French, R. M. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
- Goodfellow et al. (2013) Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
- Graves et al. (2006) Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376, 2006.
- Gulati et al. (2020) Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
- Hendrycks & Gimpel (2016) Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
- Hernandez-Mena (2019) Hernandez-Mena, C. D. TEDx Spanish Corpus. Audio and transcripts in Spanish taken from the TEDx Talks; shared under the CC BY-NC-ND 4.0 license. Web Download, 2019.
- Houlsby et al. (2019) Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.
- Houston & Kirchhoff (2020) Houston, B. and Kirchhoff, K. Continual learning for multi-dialect acoustic models. Proc. Interspeech 2020, pp. 576–580, 2020.
- Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- Kawakami et al. (2020) Kawakami, K., Wang, L., Dyer, C., Blunsom, P., and Oord, A. v. d. Learning robust and multilingual speech representations. arXiv preprint arXiv:2001.11128, 2020.
- Kessler et al. (2019) Kessler, S., Nguyen, V., Zohren, S., and Roberts, S. Hierarchical indian buffet neural networks for bayesian continual learning. arXiv preprint arXiv:1912.02290, 2019.
- Kessler et al. (2021) Kessler, S., Parker-Holder, J., Ball, P., Zohren, S., and Roberts, S. J. Same state, different task: Continual reinforcement learning without interference. arXiv preprint arXiv:2106.02940, 2021.
- Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
- Li & Hoiem (2017) Li, Z. and Hoiem, D. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
- Lopez-Paz & Ranzato (2017) Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. arXiv preprint arXiv:1706.08840, 2017.
- Lu et al. (2021) Lu, K., Grover, A., Abbeel, P., and Mordatch, I. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 2021.
- Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- Mohamed et al. (2019) Mohamed, A., Okhonko, D., and Zettlemoyer, L. Transformers with convolutional context for asr. arXiv preprint arXiv:1904.11660, 2019.
- Nguyen et al. (2017) Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
- Oord et al. (2018) Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Ott et al. (2019) Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.
- Panayotov et al. (2015) Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. IEEE, 2015.
- Pfeiffer et al. (2020a) Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., and Gurevych, I. Adapterfusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247, 2020a.
- Pfeiffer et al. (2020b) Pfeiffer, J., Vulić, I., Gurevych, I., and Ruder, S. Mad-x: An adapter-based framework for multi-task cross-lingual transfer. arXiv preprint arXiv:2005.00052, 2020b.
- Rajasegaran et al. (2019) Rajasegaran, J., Hayat, M., Khan, S., Khan, F. S., Shao, L., and Yang, M.-H. An adaptive random path selection approach for incremental learning. arXiv preprint arXiv:1906.01120, 2019.
- Rebuffi et al. (2017) Rebuffi, S.-A., Bilen, H., and Vedaldi, A. Learning multiple visual domains with residual adapters. arXiv preprint arXiv:1705.08045, 2017.
- Rivière et al. (2020) Rivière, M., Joulin, A., Mazaré, P.-E., and Dupoux, E. Unsupervised pretraining transfers well across languages. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7414–7418. IEEE, 2020.
- Rolnick et al. (2018) Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T. P., and Wayne, G. Experience replay for continual learning. arXiv preprint arXiv:1811.11682, 2018.
- Rusu et al. (2016) Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
- Sadhu & Hermansky (2020) Sadhu, S. and Hermansky, H. Continual learning in automatic speech recognition. Proc. Interspeech 2020, pp. 1246–1250, 2020.
- Schneider et al. (2019) Schneider, S., Baevski, A., Collobert, R., and Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862, 2019.
- Schwarz et al. (2018) Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528–4537. PMLR, 2018.
- Shin et al. (2017) Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. arXiv preprint arXiv:1705.08690, 2017.
- van de Ven et al. (2020) van de Ven, G. M., Siegelmann, H. T., and Tolias, A. S. Brain-inspired replay for continual learning with artificial neural networks. Nature communications, 11(1):1–14, 2020.
- Van der Maaten & Hinton (2008) Van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
- Veniat et al. (2020) Veniat, T., Denoyer, L., and Ranzato, M. Efficient continual learning with modular networks and task-driven priors. arXiv preprint arXiv:2012.12631, 2020.
- Winata et al. (2020) Winata, G. I., Wang, G., Xiong, C., and Hoi, S. Adapt-and-adjust: Overcoming the long-tail problem of multilingual speech recognition. arXiv preprint arXiv:2012.01687, 2020.
- Xu & Zhu (2018) Xu, J. and Zhu, Z. Reinforced continual learning. arXiv preprint arXiv:1805.12369, 2018.
- Xu et al. (2020) Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., Synnaeve, G., and Auli, M. Self-training and pre-training are complementary for speech recognition. arXiv preprint arXiv:2010.11430, 2020.
- Yoon et al. (2017) Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
- Zenke et al. (2017) Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995. PMLR, 2017.
- Zhang et al. (2020) Zhang, Y., Qin, J., Park, D. S., Han, W., Chiu, C.-C., Pang, R., Le, Q. V., and Wu, Y. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504, 2020.
Appendix A Related Works
a.1 Continual Learning
CL methods can be categorized into three different types. The first is regularization methods, where the optimal parameters from the previous task are used to regularize learning a new task. EWC uses sequential Bayes with a Laplace approximation to motivate a quadratic regularization with respect to the previous task's weights (Kirkpatrick et al., 2017). Other methods also use a quadratic regularization about the previous task's optimal parameters but with different penalty weightings (Schwarz et al., 2018; Zenke et al., 2017), or use variational inference with a KL penalty (Nguyen et al., 2017; Kessler et al., 2019). Task-specific functions can also be regularized to ensure forgetting is minimized (Li & Hoiem, 2017).
Replay methods use a single predictor for all tasks and tackle forgetting by replaying data from past tasks while training on the current task. This is performed by storing a small memory of data from previous tasks (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018), or by using a generative model which can sample previous task data and augment the current task's dataset to prevent forgetting of past tasks (Shin et al., 2017; van de Ven et al., 2020).
The third category is expansion methods, which add new parameters to a model when learning a new task. In Progressive Nets a new NN predictor is added for each new task with connections between adjacent layers (Rusu et al., 2016); this means that the number of network parameters grows super-linearly as the number of tasks increases. Dynamically Expandable Networks use a sparsity regularization to dynamically select and limit the number of new weights to add for each new task (Yoon et al., 2017). New modules can be added by a controller by casting the expansion process as a reinforcement learning problem (Xu & Zhu, 2018), though this is very expensive to train. Evolutionary methods and random search have also been applied to select the optimal sub-network in a large NN for each task (Fernando et al., 2017; Rajasegaran et al., 2019). By finding the task subnetwork which most resembles the previous task and adding modules to this subnetwork, one can then use exhaustive search to find the subnetwork which performs best on a new task while freezing previous task weights (Veniat et al., 2020). This approach allows the number of parameters to increase sub-linearly in the number of tasks. CL has also been applied to RL, where different tasks are different environments (Rolnick et al., 2018; Kessler et al., 2021).
a.2 Continual Learning applied to Speech
CL has also been applied to ASR models using a modular approach with a separate model encapsulating each new English dataset (Sadhu & Hermansky, 2020), resulting in a linear increase in the number of parameters and increased inference time. CL has also been applied to ASR models where different tasks are different English dialects; RNN predictors are used with EWC and LwF (Houston & Kirchhoff, 2020). As far as we are aware, we are the first to apply CL methods to very large ASR models which leverage SSL. Different CL strategies have been applied to deep LSTM-based models which use different English datasets as different tasks trained with the CTC loss (Chang et al., 2021). The authors compare regularization and memory-based approaches and show that memory approaches (Lopez-Paz & Ranzato, 2017) work well.
a.3 Automatic Speech Recognition and Self-supervised Learning
The wav2vec model uses the CPC loss (Oord et al., 2018) for ASR (Schneider et al., 2019). Extending wav2vec, the vq-wav2vec model discretizes the audio representations (using a Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) or K-means) with CPC-style pretraining. Then, in a second step, it performs a BERT-style pre-training by masking input tokens (Devlin et al., 2018) to create a language representation for subsequent finetuning on ASR (Baevski et al., 2019). It is more effective to use discrete audio representations than a continuous vector representation (Baevski & Mohamed, 2020). The wav2vec2.0 model solves the quantization and language representation learning end-to-end (Baevski et al., 2020). Using the representation from pretraining and finetuning on a small amount of labelled ASR data, the model achieves state-of-the-art results when finetuning on as little as 10 minutes of data.
State of the art results for ASR using SSL have been achieved by combining different methods (Zhang et al., 2020): wav2vec2.0 is used for pretraining with a Conformer encoder (Gulati et al., 2020). Then during finetuning, iterative pseudo-labelling is performed to augment the labelled dataset (Xu et al., 2020).
Using CPC (Oord et al., 2018) and a simple GRU network on top of a convolutional encoder has been shown to produce representations which generalise to CommonVoice datasets when pre-trained on English (Rivière et al., 2020). Also, speech representations from CPC on English alone have been shown to generalise to other low-resource African, Asian and Latin American languages (Kawakami et al., 2020). wav2vec2.0 trained out-of-the-box on multilingual audio can learn an audio representation which can perform multi-language ASR using a single quantisation module (Conneau et al., 2020). Multi-lingual ASR by directly optimizing the CTC loss (Graves et al., 2006) has been explored with language adapters and multi-lingual BERT to generalize to low-resource languages (Winata et al., 2020).
Appendix B Datasets
b.1 English: LibriSpeech
We use the LibriSpeech dataset, an English ASR dataset derived from audiobooks which contains 960 hours of speech (Panayotov et al., 2015). The corpus is divided into "clean" and "other" partitions; lower-WER speakers are placed into the clean partition and the rest into the other partition. We validate our models on the dev-clean dataset; the sizes of the datasets are described in Table 1.
We used the following training datasets for performing self-supervision on French audio, listed in Table 2. We use the raw audio to perform self-supervision, holding out a portion for validation. For finetuning we use 10 hours from the common_voice training set and we validate on the common_voice dev set.
| Dataset | Train (hours) | Dev (hours) | Test (hours) |
|---|---|---|---|
| common_voice (Ardila et al., 2019) | 962.4 | 23.8 | 25.0 |
| fr FR HQ (voxforge.org/fr/Downloads) | 142.8 | - | - |
| french single speaker (librivox.org/api/info) | 19.1 | - | - |
| Siwis | 10.7 | - | - |
| African Accented | 0 | 1.0 | 0.3 |
We used the following training datasets for performing self-supervision on Spanish audio, listed in Table 3. We use the raw audio to perform self-supervision, holding out a portion for validation. For finetuning we use 10 hours from the common_voice training set and we validate on the common_voice dev set, similarly to French.
Appendix C Implementations
We use the fairseq framework (Ott et al., 2019) to implement our continual-wav2vec2 models. The pre-training and finetuning proceeds similarly to (Baevski et al., 2020) apart from the model changes in MH wav2vec2 and cwav2vec2. We provide a recap for completeness.
Feature extractor. The feature extractor contains several temporal convolutional layers with layer norms (Ba et al., 2016) and GeLU activation functions (Hendrycks & Gimpel, 2016).
Quantization module. The quantization module discretizes the output z of the feature extractor into q. It is comprised of G codebooks, each with V entries. One entry is chosen from each codebook and the entries are concatenated: q = [e_1; ...; e_G]. To construct the quantizer codebook, the outputs of the feature encoder are mapped to logits l in R^{G x V}. A linear transformation is then applied to the concatenated vectors so that the quantized outputs match the dimensionality of the context network outputs.
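A numpy sketch of the codebook selection; at training time wav2vec2.0 samples entries differentiably with the Gumbel-Softmax, whereas this illustration takes a hard argmax (the names and the final projection are illustrative):

```python
import numpy as np

def quantize(z, codebooks, proj_w):
    """z: (d_z,) feature vector from the encoder.
    codebooks: (G, V, e) array of codebook entries.
    proj_w: (d_z, G*V) projection mapping features to logits.
    Returns the concatenation of one chosen entry per codebook."""
    G, V, e = codebooks.shape
    logits = (z @ proj_w).reshape(G, V)   # l in R^{G x V}
    idx = logits.argmax(axis=1)           # hard choice per codebook
    return np.concatenate([codebooks[g, idx[g]] for g in range(G)])
```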
Objective. The contrastive loss, which identifies the true quantized latent speech representation among distractors at masked time steps, is:

L_m = -log [ exp(sim(c_t, q_t)/kappa) / Sum_{q~ in Q_t} exp(sim(c_t, q~)/kappa) ],

where sim(a, b) = a^T b / (||a|| ||b||) is the cosine similarity. A diversity regularization is also added to encourage the model to use all V entries in each of the G codebooks by maximizing the entropy of the averaged softmax distribution over the codebook entries:

L_d = (1/GV) Sum_{g=1}^{G} -H(p_bar_g) = (1/GV) Sum_{g=1}^{G} Sum_{v=1}^{V} p_bar_{g,v} log p_bar_{g,v}.

Thus the objective is L = L_m + alpha * L_d, where alpha is a scalar which controls the strength of the codebook diversity regularization.
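The diversity term is cheap to compute from the batch of quantizer logits; a numpy sketch under the formulation above:

```python
import numpy as np

def diversity_loss(logits):
    """logits: (B, G, V) quantizer logits for a batch.
    Softmax over the V entries, average over the batch, then return
    (1/GV) * sum_g -H(p_bar_g); minimizing this maximizes the entropy
    of codebook usage so all entries get used."""
    B, G, V = logits.shape
    m = logits.max(axis=-1, keepdims=True)
    p = np.exp(logits - m)
    p = p / p.sum(axis=-1, keepdims=True)              # softmax over entries
    p_bar = p.mean(axis=0)                             # (G, V) batch average
    H = -(p_bar * np.log(p_bar + 1e-12)).sum(axis=-1)  # entropy per codebook
    return -H.sum() / (G * V)
```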
The training objective is to identify the correct quantized latent audio representation by comparing it to the masked prediction. The feature encoder outputs are masked before being fed into the context network. To mask, we randomly sample a proportion p of all time steps to be starting indices and then mask the subsequent M consecutive time steps from every sampled index; spans can overlap. This masking procedure is identical to that of wav2vec2.0 (Baevski et al., 2020).
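The span-sampling step can be sketched as follows; the defaults p = 0.065 and M = 10 are the published wav2vec 2.0 values, used here as an assumption:

```python
import numpy as np

def sample_mask(T, p=0.065, M=10, rng=None):
    """Sample starting indices independently with probability p among T
    time steps, then mask the next M consecutive steps from each start;
    spans may overlap. Returns a boolean mask of length T."""
    if rng is None:
        rng = np.random.default_rng()
    starts = rng.random(T) < p
    mask = np.zeros(T, dtype=bool)
    for t in np.flatnonzero(starts):
        mask[t:t + M] = True      # span may run past later starts (overlap)
    return mask
```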
Pre-training. We performed pre-training on Tesla V100 GPUs. We simulated the larger GPU cluster used for wav2vec2.0 by using gradient accumulation steps before updating. We follow the pre-training procedure of wav2vec2.0 (Baevski et al., 2020). For masking, starting time steps are sampled with probability p = 0.065 and the next M = 10 time steps are masked. The feature extractor has seven blocks with 512 channels, strides (5, 2, 2, 2, 2, 2, 2) and kernel widths (10, 3, 3, 3, 3, 3, 2). A 1-d convolutional layer models relative positional embeddings and has kernel size 128 and 16 groups (Mohamed et al., 2019). We use the BASE model configuration for the encoder (Baevski et al., 2020): 12 transformer blocks with model dimension 768, inner dimension 3072 (FC 2 in Figure 3) and 8 attention heads. Audio crops do not exceed 250k samples per GPU. We use Adam and warm up the learning rate over the initial updates to a maximum value; we use the same learning rate scheduling procedure for all our models and the learning rate quoted is always the maximum. We train for 400k steps, similarly to the wav2vec2.0 BASE model. The penalty for the diversity loss is alpha = 0.1. The Gumbel-Softmax temperature is annealed from a maximum of 2 to a minimum of 0.5 by a factor of 0.999995 every update (Jang et al., 2016). The temperature of the contrastive loss is kappa = 0.1 and we use K = 100 distractors. An L2 penalty is added to the activations of the final layer of the feature encoder.
Fine-tuning. After pretraining on each unlabelled task dataset we finetune the learned representations on a small amount of labelled data following the same procedure as in Baevski et al. (2020). A new classification layer is placed on the final layer of the network, and the feature extractor is frozen during finetuning. The network is optimized with Adam and the CTC loss (Graves et al., 2006) using a tri-stage learning rate schedule: the learning rate is warmed up for the first 10% of updates, then held constant at its maximum for the next 40% of updates, and then linearly decayed. We do not use a language model for fine-tuning.
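The tri-stage schedule can be sketched as below; the 10%/40% stage proportions follow wav2vec2.0, while the remaining details (warm-up from zero, decay to zero) are illustrative assumptions:

```python
def tri_stage_lr(step, total_steps, lr_max, warmup_frac=0.1, hold_frac=0.4):
    """Warm up linearly, hold at lr_max, then decay linearly."""
    warm = int(warmup_frac * total_steps)
    hold = int(hold_frac * total_steps)
    if step < warm:                          # stage 1: linear warm-up
        return lr_max * step / max(1, warm)
    if step < warm + hold:                   # stage 2: constant at the maximum
        return lr_max
    decay = max(1, total_steps - warm - hold)
    return lr_max * max(0.0, 1 - (step - warm - hold) / decay)  # stage 3: decay
```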
C.1 Warm-starting wav2vec2.0
If we use the default wav2vec2.0 hyperparameters and simply warm-start the network when learning a new audio representation for a second-language task, we find that training is unstable. We can, however, lower the learning rate for the new task to prevent this instability. By lowering the learning rate for a new task we are essentially balancing how well we can learn the new task against remaining stable with respect to forgetting of the previous task. What we find is that by simply lowering the learning rate of wav2vec2.0 when learning a new task we can restrict forgetting while still enabling learning of the new audio task, see Figure 4 and Figure 5. However, by simply warm-starting the model we do not get the performance benefits for a second or third task that we get with cwav2vec2.0.
C.2 Multi-headed wav2vec2.0
We propose a very simple baseline which uses multi-head architectures for wav2vec2.0. Multi-head architectures are commonly used in CL to retain task specific knowledge by learning task specific mappings from hidden to output layers (Li & Hoiem, 2017; Nguyen et al., 2017). We use heads to learn task specific mappings from the feature extractor to the encoder and from the encoder to the objective function. After having trained on a first task, the feature extractor has learnt how to extract features from the audio, so we can freeze the feature extractor and add a language specific head before the MHSA encoder. We also tune the learning rate for the new task by pretraining for a small number of steps over a set of candidate values.
Additionally, we enable learning a new language representation in the quantization module by adding a language specific quantizer. This enables learning a new language without interference from a previous language, in contrast to previous multi-task wav2vec2.0 models which share a quantizer across all languages (Conneau et al., 2020). Instead of warm-starting the quantizer from the previous task, we learn a new quantizer from randomly initialized weights. In the CTC finetuning step we train the entire encoder and an additional fully connected projection layer mapping into the classes representing the vocabulary of the ASR task.
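The per-language bookkeeping described above might be organized as in this hypothetical sketch (class and method names are our own; the real model is built in fairseq):

```python
class MultiHeadWav2Vec2:
    """Shared (frozen) feature extractor and shared MHSA encoder, with a
    language-specific head and a language-specific quantizer per task."""

    def __init__(self, feature_extractor, encoder):
        self.feature_extractor = feature_extractor   # frozen after the first task
        self.encoder = encoder                       # shared across languages
        self.heads, self.quantizers = {}, {}

    def add_language(self, lang, make_head, make_quantizer):
        # The new quantizer is randomly initialized, not warm-started.
        self.heads[lang] = make_head()
        self.quantizers[lang] = make_quantizer()

    def forward(self, lang, audio):
        z = self.feature_extractor(audio)
        return self.encoder(self.heads[lang](z)), self.quantizers[lang](z)
```

A toy usage with stand-in callables: `m = MultiHeadWav2Vec2(lambda x: x * 2, lambda x: x + 1)`, then `m.add_language("fr", lambda: (lambda z: z), lambda: (lambda z: z))` routes French audio through its own head and quantizer while reusing the shared extractor and encoder.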
Finetuning proceeds in the same way as for wav2vec2.0.
C.3 Multi-headed wav2vec2.0 and L2 regularization
To combat forgetting, regularization methods like EWC (Kirkpatrick et al., 2017) and SI (Zenke et al., 2017) add a regularization penalty around the previous task's optimal parameters, with a different weighting for each parameter. We use a simple regularization in a similar vein: an L2 penalty on each parameter in the MHSA layers. For task $t$ we add the regularization $\mathcal{L}_{\mathrm{reg}} = \lambda \| \theta - \theta^*_{t-1} \|_2^2$, where $\theta^*_{t-1}$ are the optimal parameters for the previous task and $\lambda$ is a hyperparameter which controls the regularization strength. We experiment with several values of $\lambda$ and select the one that works best after a short pretraining run. We also experiment with regularizing only the top layers of the MHSA encoder and find little difference in performance after the same number of pretraining steps; hence we regularize only the top layers to make the optimization simpler and to ensure we are not constraining learning of the new language representation too much.
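Concretely, the penalty might be computed as follows (a NumPy sketch; in the real model `params` would be the weights of the regularized top MHSA layers):

```python
import numpy as np

def l2_towards_previous(params, prev_params, lam):
    """lam * sum_i ||theta_i - theta*_i||_2^2 over the regularized layers."""
    return lam * sum(np.sum((p - q) ** 2) for p, q in zip(params, prev_params))

# The total pretraining loss for task t then adds this penalty to the
# contrastive + diversity objective described earlier in this appendix.
```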
We place LAs in each layer of the MHSA encoder, whose model dimension is 768. For both French and Spanish, every adapter has a bottleneck dimension which we tune by pretraining for a small number of steps over a set of candidate sizes; the learning rate is tuned in the same way over a set of candidate values. Performance is very dependent on the adapter size and learning rate, and further tuning could lead to better performance.
In wav2vec2 the encoder and a new classification layer are finetuned. For cwav2vec2, we instead place LAs on top of the pretraining adapters and train only the newly placed adapters and the corresponding layer norms, Figure 8. The finetuning LAs have a fixed bottleneck dimension and we use a tri-stage learning rate schedule.
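A language adapter layer can be sketched as a residual bottleneck; the dimensions, the ReLU activation, and the zero-initialized up-projection below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

class LanguageAdapter:
    """Residual bottleneck adapter: down-project, nonlinearity, up-project,
    then add back the input so the frozen path is preserved at init."""

    def __init__(self, d_model=768, bottleneck=256, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
        self.W_up = np.zeros((bottleneck, d_model))  # zero-init: adapter starts as identity

    def __call__(self, h):
        x = np.maximum(h @ self.W_down, 0.0)         # bottleneck activation
        return h + x @ self.W_up                     # residual connection

adapter = LanguageAdapter()
h = np.ones((4, 768))
out = adapter(h)        # same shape as the input
```

Because only the small `W_down`/`W_up` matrices are trained per language, the parameter cost of a new task scales with the bottleneck size rather than the full model.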
Appendix D Qualitative analysis of cwav2vec2 versus wav2vec2
In Figure 9 we show embeddings from the quantizers of cwav2vec2.0 in columns 1 and 2, and compare them to the multi-task/multi-lingual representation from the XLSR model (Conneau et al., 2020) in column 3. We visualize the features from the quantizers because they give us an understanding of the language representation learnt by the encoder, as both are learnt jointly. We notice from Figure 9 that when training on the first task, which is English only (column 1), we can achieve a good zero-shot embedding for French and Spanish, reflected in the fact that we obtain a reasonable WER when we perform finetuning in these languages without any pretraining on them. More importantly, we show in the second column that there is a separation between the languages which have been learnt by cwav2vec2.0. cwav2vec2.0 is task aware and learns to represent a new language as a different task, hence the separation between language embeddings in column 2; note how the representation for English has been preserved from column 1. This is in contrast to the final column, which has been trained multi-lingually on many different languages and is not task aware; as a result the language embeddings overlap.
We can see from Figure 9 how the separability of the embeddings is produced by constraining the pretraining process for each new task to focus specifically on learning a new language representation, distinct from those of the previous tasks.