Incremental learning is a common scenario in practical applications of deep language models. In such applications, training data arrives in batches rather than all at once, so for reasons of time and computational efficiency, incremental updates to the model are preferred over retraining it from scratch every time new training data becomes available. When multilingual models are deployed, they are expected to deliver good performance on data across multiple languages and domains. It is therefore desirable that the model keep acquiring new knowledge from incoming training data in different languages, while preserving its ability on languages it was trained on in the past. Ideally, the model should keep improving over time, or at the very least not deteriorate on any language over the course of the incremental learning lifecycle.
It is known that incremental fine-tuning with data in different languages leads to catastrophic forgetting (French, 1999; McCloskey and Cohen, 1989) of languages that were fine-tuned in the past (Liu et al., 2021b; Vu et al., 2022): performance on previously fine-tuned tasks or languages decreases after training on a new task or language. Multiple strategies have been proposed to mitigate catastrophic forgetting. Data-focused strategies such as augmentation and episodic memories (Hayes et al., 2019; Chaudhry et al., 2019b; Lopez-Paz and Ranzato, 2017) maintain a cache of a subset of examples from previous training data, which are mixed with new examples from the current training data. The network is then fine-tuned over this mixture as a whole, helping the model "refresh" its "memory" of prior information so that it can leverage previous experience to transfer knowledge to future tasks.
Closely related to our current work is the work by M’hamdi et al. (2022) and Ozler et al. (2020) on understanding the effect of incrementally fine-tuning models with multilingual data. They suggest that joint fine-tuning is the best way to mitigate the tendency of cross-lingual language models to erase previously acquired knowledge. In other words, their results show that joint fine-tuning should be used instead of incremental fine-tuning, if possible.
Optimization-focused strategies such as Mirzadeh et al. (2020); Kirkpatrick et al. (2017) focus on the training regime, and show that techniques such as dropout, large learning rates with decay, and smaller batch sizes can create training regimes that result in more stable models.
Translation augmentation has been shown to be an effective technique for improving performance as well. Wang et al. (2018); Fadaee et al. (2017); Liu et al. (2021a) and Xia et al. (2019) use various types of translation augmentation strategies and show substantial improvements in performance. Encouraged by these gains, we incorporate translation as our data augmentation strategy.
In our analysis, we consider an additional constraint that affects our choice of data augmentation strategies: data that has already been used for training cannot be accessed again at a future time step. Privacy is an important consideration for continuously deployed models in corporate applications and similar scenarios, and privacy protocols often limit access to each tranche of additional fine-tuning data to the current training step only. Under such constraints, joint fine-tuning or maintaining a cache as in Chaudhry et al. (2019a); Lopez-Paz and Ranzato (2017) is infeasible. Thus, we use translation augmentation to improve cross-lingual generalization over a large number of fine-tuning steps without storing previous data.
In this paper we present a novel translation-augmented sequential fine-tuning approach that mixes in translated data at each step of sequential fine-tuning and makes use of a special training regime. Our approach minimizes the effects of catastrophic forgetting and the interference between languages. The results show that with the baseline approaches, incremental learning over dozens of training steps results in catastrophic forgetting: it may take multiple steps to reach this point, but performance eventually collapses.
The main contribution of our work is combining data augmentation with adjustments to the training regime and evaluating this approach over a sequence of 50 incremental fine-tuning steps. The training regime ensures that incremental fine-tuning of models using translation augmentation remains robust without access to previous data. We show that our model delivers good performance, surpassing the baseline across multiple evaluation metrics. To the best of our knowledge, this is the first work to provide a multi-stage cross-lingual analysis of incremental learning over a large number of fine-tuning steps with recurrence of languages.
2 Related Work
Our work fits into the area of incremental learning in cross-lingual settings. M’hamdi et al. (2022) is the closest work to our research. The authors compare several cross-lingual incremental learning methods and provide evaluation measures for model quality after each sequential fine-tuning step. They show that combining the data from all languages and fine-tuning the model jointly is more beneficial than sequential fine-tuning on each language individually. We use some of their evaluation protocols, but we have different constraints: we do not keep the data from previous sequential fine-tuning steps and we do not control the sequence of languages. In addition, they considered only six hops of incremental fine-tuning, whereas we are interested in dozens of steps. Ozler et al. (2020) do not perform a cross-lingual analysis, but study a scenario closely related to our work. Their findings fall in line with those of M’hamdi et al. (2022): they show that combining data from different domains into one training set for fine-tuning performs better than fine-tuning on each domain separately. However, this type of joint fine-tuning is ruled out for our scenario, where we assume that access to previous training data is not available, and so we focus on sequential fine-tuning exclusively.
Mirzadeh et al. (2020) study the impact of various training regimes on forgetting mitigation, focusing on the learning rate, batch size, and regularization method. Like ours, this work shows that applying a learning rate decay plays a significant role in reducing catastrophic forgetting; however, our type of decay is different from theirs. Mirzadeh et al. (2020) start with a high initial learning rate for the first task to obtain a wide and stable minimum; then, for each subsequent task, they slightly decrease the learning rate while simultaneously reducing the batch size, as recommended by Smith et al. (2017). We, on the other hand, apply our decay rate across the transformer model’s layer stack, so that the deviations from the current optimum get progressively smaller as one moves down the layers, and we do this at each step of incremental fine-tuning.
Memory-based approaches such as Chaudhry et al. (2019b); Lopez-Paz and Ranzato (2017) have been explored to mitigate forgetting. Such methods make use of an episodic memory or a cache which stores a subset of data from previous tasks. These examples are then used for training along with the current examples in the current optimization step. Similarly, Xu et al. (2021) suggest a gradual fine-tuning approach, wherein models are eased towards the target domain by increasing the concentration of in-domain data at every fine-tuning stage. This work builds on the findings from Bengio et al. (2009), who show that a multi-stage curriculum strategy of learning easier examples first, and gradually increasing the difficulty level leads to better generalization and faster convergence. While we cannot maintain a cache of this sort because of our constraints, we take inspiration from this line of research and generate “easier examples” using translation in languages that are expected to appear in our data.
Sequential fine-tuning of languages has not been extensively studied for long sequences. Liu et al. (2021b) and Garcia et al. (2021) go up to two stages, whereas M’hamdi et al. (2022) go up to six stages. We provide an analysis of a much longer fine-tuning sequence with fifty stages. We are also the first to present an analysis of sequences with repetition of languages.
We propose a translation-augmented sequential fine-tuning approach for incremental learning in a cross-lingual setting. Our approach addresses the scenario in which a pre-trained model is incrementally fine-tuned over dozens of steps without access to previously seen training data. There is a set of languages $L_1, \ldots, L_N$ that can appear during the incremental fine-tuning steps, and we assume that at each step the data comes from only one language. We exploit the benefits of data augmentation, as well as specialized optimization techniques.
We begin with a pre-trained multilingual model $M_0$ which is fine-tuned over multiple stages to create incremental versions $M_t$, where $t = 0 \ldots T$. The training data at incremental fine-tuning step $t$ is $D_t$ and is in a randomly selected language $l_t$, where $l_t \in \{L_1, \ldots, L_N\}$. At each step, we sample a small random subset $S_t$ from $D_t$ and translate that subset into every language in $\{L_1, \ldots, L_N\}$ except $l_t$, creating multiple additional subsets of training data $T_L(S_t)$. We denote the augmented training set as $\tilde{D}_t$, where $\tilde{D}_t = D_t \cup \bigcup_{L \neq l_t} T_L(S_t)$.
Figure 1 provides a graphical representation of our approach.
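The augmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: `translate` is a hypothetical stand-in for a machine-translation service (the paper uses the Google Translate API), and the 100-example toy data mirrors the training-set sizes used later.

```python
import random

def translate(example, target_lang):
    """Hypothetical placeholder for an MT call; returns a 'translated' copy."""
    text, label = example
    return (f"[{target_lang}] {text}", label)

def augment_with_translations(data, source_lang, all_langs, fraction=0.1, seed=0):
    """Sample a fraction of `data` and translate it into every other
    language, returning the union of originals and translated subsets."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    subset = rng.sample(data, k)
    augmented = list(data)
    for lang in all_langs:
        if lang == source_lang:
            continue
        augmented.extend(translate(ex, lang) for ex in subset)
    return augmented

langs = ["de", "en", "es", "fr", "ja", "zh"]
data = [(f"review {i}", i % 2) for i in range(100)]  # 100 toy labeled examples
aug = augment_with_translations(data, "de", langs, fraction=0.1)
print(len(aug))  # 100 originals + 10 sampled x 5 other languages = 150
```

Only the sampled subset is translated, which keeps the API cost and the training-set growth bounded at each hop.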
3.1 Fine-tuning regime with LLRD
Motivated by Yosinski et al. (2014), we apply a layer-wise learning rate decay (LLRD, denoted by $\xi$) to the model parameters depending on their position in the layer stack of the model, based on the discriminative fine-tuning method proposed by Howard and Ruder (2018). LLRD applies higher learning rates to top layers and lower learning rates to bottom layers. The goal is to modify the lower layers, which encode more general information, less than the top layers, which are more specific to the pre-training task. This is accomplished by setting the learning rate of the top layer and using a multiplicative decay rate to decrease the learning rate layer by layer from top to bottom. We split the parameters $\theta$ into groups $\{\theta^{(1)}, \ldots, \theta^{(L)}\}$, where $\theta^{(l)}$ contains the parameters of the $l$-th layer of the pre-trained model. The parameters are updated as follows:

$$\theta_t^{(l)} = \theta_{t-1}^{(l)} - \eta^{(l)} \cdot \nabla_{\theta^{(l)}} J(\theta),$$

where $\eta^{(l)}$ represents the learning rate of the $l$-th layer. We set the learning rate of the top layer to $\eta^{(L)}$ and use $\eta^{(l-1)} = \xi \cdot \eta^{(l)}$ to compute the rates of the layers below.
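A sketch of the LLRD schedule under these definitions: the top layer receives the base rate and each layer below is scaled by the decay $\xi$. The base rate of 2e-5 is an illustrative assumption, not a value from the paper; the 12-layer stack mimics mBERT.

```python
def llrd_learning_rates(layer_names, top_lr=2e-5, decay=0.85):
    """Layer-wise learning rate decay: the top layer receives `top_lr`,
    and each layer below receives the rate of the layer above times `decay`.
    `layer_names` is ordered bottom-to-top (embeddings first)."""
    n = len(layer_names)
    return {name: top_lr * decay ** (n - 1 - i) for i, name in enumerate(layer_names)}

# Toy mBERT-like stack: embeddings, 12 encoder layers, classifier head on top.
stack = ["embeddings"] + [f"encoder.layer.{i}" for i in range(12)] + ["classifier"]
lrs = llrd_learning_rates(stack, top_lr=2e-5, decay=0.85)
print(lrs["classifier"], lrs["embeddings"])  # top layer keeps 2e-5; embeddings get 2e-5 * 0.85**13
```

In practice these per-layer rates would be passed to the optimizer as parameter groups; the sketch only computes the schedule itself.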
We use the Multilingual Amazon Reviews Corpus (MARC) (Keung et al., 2020). This dataset is a large-scale collection of product reviews in 6 languages and from 31 categories. We construct our training sets by extracting reviews for ten common categories: apparel, automotive, beauty, drugstore, grocery, home, kitchen, musical instruments, sports, wireless. The number of reviews per language-category combination is not equal, but to ensure consistency of training examples at each training step, we create two unique training sets of sizes 100 and 150 for each language-category combination. For our experiments, we use reviews from all 6 languages provided in the dataset (Chinese, English, French, German, Japanese and Spanish). We drop the 3-star reviews and bifurcate the rest into two class labels: positive (4-star and 5-star) and negative (1-star and 2-star) sentiment. (We ensure that both final labels contain an equal number of examples of their constituent star ratings; e.g., the negative sentiment class contains an equal number of 1- and 2-star reviews.)
Each incremental training set contains 100 reviews from a particular language-category combination, for example, de-grocery. To ensure class balancing, we sample an equal number of positive and negative records for each training set.
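The label construction and balanced sampling described above can be sketched as follows; the toy reviews and field names are illustrative assumptions, not the MARC schema.

```python
import random

def to_binary_label(stars):
    """Map star ratings to sentiment: 1-2 -> negative, 4-5 -> positive,
    and 3-star reviews are dropped."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # 3-star: discarded

def balanced_sample(reviews, size, seed=0):
    """Draw `size` reviews with equal counts of each retained star rating,
    so both sentiment classes are balanced internally and against each other."""
    rng = random.Random(seed)
    per_star = size // 4  # e.g. 25 each of 1-, 2-, 4- and 5-star for size 100
    sample = []
    for s in (1, 2, 4, 5):
        pool = [r for r in reviews if r["stars"] == s]
        sample.extend(rng.sample(pool, per_star))
    return sample

# Toy data: 30 reviews per star rating.
reviews = [{"stars": s, "text": f"review {s}-{i}"}
           for s in (1, 2, 3, 4, 5) for i in range(30)]
train = balanced_sample(reviews, 100)
print(len(train))  # 100 reviews, no 3-star, 25 of each remaining rating
```

Balancing at the star-rating level (rather than only at the class level) is what the footnoted constraint in the text requires.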
We use the original test-splits for each of the 60 language-category combinations of the MARC data as our test set.
4.2 Translation augmentation
The translations were generated using the Google Translate API. In the current work, we sample a fraction of 0.1 of the training examples. For example, if the training data has 100 records, we translate 10 sampled records into each of the 5 other languages, so the translation-augmented data has 150 records.
4.3 Constructing the sequence
We tested our approach on a large number of incremental fine-tuning steps using data from various language-category combinations. To do that, we created 3 random sequences, each with 50 training sets. We term each incremental fine-tuning step a hop; multiple hops comprise a sequence.
Each training set contains data from a particular language-category combination. To construct the 3 randomized sequences, we first generated all possible language-category combinations and then sampled one combination at a time for each hop. The only constraint on the sampling is that a combination that has already appeared in the sequence cannot be chosen again; the same language with a different category, or the same category with a different language, can still occur. E.g., if de-grocery features once in the sequence it cannot be repeated, even though de-sports or en-grocery remain possible options. The plots in Fig. 2 and 3 show the language-category combination at each hop for all three sequences.
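The sequence construction above amounts to sampling language-category pairs without replacement; a minimal sketch (seeds and ordering are arbitrary assumptions):

```python
import random

LANGS = ["de", "en", "es", "fr", "ja", "zh"]
CATEGORIES = ["apparel", "automotive", "beauty", "drugstore", "grocery",
              "home", "kitchen", "musical_instruments", "sports", "wireless"]

def build_sequence(hops, seed):
    """Sample `hops` language-category combinations without repetition:
    each exact pair appears at most once, but the same language or the
    same category may recur with a different partner."""
    combos = [(lang, cat) for lang in LANGS for cat in CATEGORIES]  # 60 pairs
    return random.Random(seed).sample(combos, hops)

sequences = [build_sequence(hops=50, seed=s) for s in (1, 2, 3)]
print(len(sequences[0]), len(set(sequences[0])))  # 50 hops, all pairs distinct
```

With 60 possible pairs and 50 hops, languages and categories necessarily recur, which is exactly the recurrence the paper sets out to study.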
4.4 Model and training
We use multilingual BERT (cased) as our base model. We run a hyperparameter optimization over a relatively small search space containing values that were most effective in our preliminary experiments. Two types of settings were used for training:
LLRD: 0.75, 0.85. (Our preliminary experiments showed that LLRD values of 0.75 and 0.85 are the most suitable candidates for our scenario. Table 1 briefly compares other values: the models are trained over the sports datasets of all six languages and we compare the $F_1$ scores averaged over the 60 evaluation sets; 0.75 and 0.85 deliver the most consistent performance.)
The best checkpoint from any given stage is chosen for subsequent fine-tuning over the next language dataset. At the first stage, we use the pre-trained mBERT checkpoints released by Devlin et al. (2019) (github.com/google-research/bert/blob/master/multilingual.md). All experiments have been run on a single machine with a 6-core NVIDIA Tesla K80 GPU.
4.5 Experimental setup
We show results for the following four variations: starting from the default incremental fine-tuning approach, we add translation augmentation, LLRD, and the combination of the two.
Sequential fine-tuning (SeqFT): data is $D_t$, default training settings.
Sequential fine-tuning with LLRD (SeqFT-Llrd): data is $D_t$, trained with LLRD-enabled settings.
Translation augmented sequential fine-tuning (SeqFT-Trans): data is the augmented set $\tilde{D}_t$, default training settings.
Translation augmented sequential fine-tuning using LLRD (SeqFT-Trans-Llrd): this is our approach. Data is the augmented set $\tilde{D}_t$, trained with LLRD-enabled settings.
4.6 Evaluation Metrics
We evaluate our proposed approach against the baseline models on the following metrics:
Average hop-wise $F_1$: The $F_1$ scores over each of the 60 test sets are averaged at every fine-tuning hop.
Overall $F_1$: The hop-wise average $F_1$ scores of all stages are averaged to give the overall performance.
Forgetting (F): The average forgetting across languages at the end of sequential fine-tuning. This metric measures the decrease in performance on each language between its peak $F_1$ and its $F_1$ after the final training step of the sequence. We evaluate forgetting by language (F-lang) as well as by category (F-categ).
In-language, in-domain performance (IL/ID): The average $F_1$ scores on all test sets corresponding to the last fine-tuned language-category combination. For example, if the current stage of fine-tuning uses Chinese zh-grocery data, then the in-language performance is the $F_1$ over the zh-grocery test set.
Out-of-language, in-domain performance (OL/ID): The average $F_1$ scores on all test sets corresponding to languages that were not seen in the previous stage of fine-tuning but are of the same domain. For example, if the current stage of fine-tuning uses zh-grocery data, the test sets used to calculate OL/ID performance are English (en-grocery), French (fr-grocery), German (de-grocery), Japanese (jp-grocery) and Spanish (es-grocery).
In-language, out-of-domain performance (IL/OD): The average $F_1$ scores on test sets of the same language as training but corresponding to domains that were not used during training. For example, if the current stage of fine-tuning uses zh-grocery data, the $F_1$ on the Chinese test sets of all domains other than grocery is averaged at each fine-tuning stage.
Out-of-language, out-of-domain performance (OL/OD): The average $F_1$ scores on the test sets corresponding to all language-category combinations except the one used during training.
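The hop-wise average and the forgetting metric defined above can be sketched as follows; the score values are invented for illustration only.

```python
def hopwise_average(scores):
    """Average score across all test sets at a single fine-tuning hop."""
    return sum(scores) / len(scores)

def forgetting(history):
    """Average, over groups (languages for F-lang, categories for F-categ),
    of the drop from each group's peak score to its score after the final hop.
    `history` maps a group name to its list of per-hop scores."""
    drops = [max(h) - h[-1] for h in history.values()]
    return sum(drops) / len(drops)

history = {"de": [0.90, 0.85, 0.80],   # peak 0.90, final 0.80 -> drop 0.10
           "en": [0.88, 0.92, 0.86]}   # peak 0.92, final 0.86 -> drop 0.06
print(forgetting(history))  # ~0.08
```

A forgetting value near zero means final performance stayed close to each group's peak; large values signal the performance collapse discussed in the results.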
We present below a comparison of our approach, SeqFT-Trans-Llrd, with different variations of sequential fine-tuning.
5.1 SeqFT (baseline)
Our proposed approach SeqFT-Trans-Llrd outperforms SeqFT decisively: it dramatically improves overall performance and reduces both forgetting metrics by an order of magnitude, as is evident from Fig. 2 and Table 2. The plots of average $F_1$ (Fig. 2) show that for each of the three sequences, the default approach SeqFT results in catastrophic forgetting. It can happen at different hops: at the 2nd hop in sequence 1, the 17th in sequence 2, and the 23rd in sequence 3, but eventually the $F_1$ drops and never recovers. This highlights the importance of studying sequential fine-tuning over a large number of steps in order to observe these effects. After the performance collapses, we observed that the model classifies almost every example as negative. It is not clear from our results on these three sequences whether a particular language, category, or combination triggers the collapse; this is something we intend to explore in future work. Even in the initial hops before the collapse, SeqFT under-performs our approach. From Table 2, we see that our approach outperforms this baseline on all metrics, with at least a 36-point improvement in overall $F_1$.
5.2 SeqFT-Trans (baseline)
We observe that translation augmentation on its own performs very similarly to the baseline SeqFT: it outperforms the baseline in overall $F_1$ on only one sequence, and its overall $F_1$ is significantly lower than that of our approach. The plots look similar to SeqFT, but we still provide them in Fig. 3. Augmentation seems to delay catastrophic forgetting until hops 6, 15 and 22 in the three sequences, but both approaches eventually suffer catastrophic forgetting, so the performance at the end of the sequence is extremely low. This is reflected in the high values of the F-lang and F-categ metrics in Table 2.
In contrast to SeqFT-Trans, using only the specialized training regime, SeqFT-Llrd shows strong performance. In fact, the main advantage of our approach appears to stem from the optimized training regime with LLRD, since SeqFT-Llrd and SeqFT-Trans-Llrd have comparable performance on many evaluation metrics. For sequence 3, our full approach SeqFT-Trans-Llrd shows slightly higher overall performance than SeqFT-Llrd, while on the other two sequences SeqFT-Llrd has a higher score. However, in terms of the forgetting metrics, our approach outperforms SeqFT-Llrd on two out of three sequences. Also, for sequence 3, the out-of-domain performance of our full approach is higher.
In summary, our approach outperforms the baseline (SeqFT) by a wide margin. Since we use multiple language and category combinations, we show results on metrics based on similarity of the train and test data with respect to language and category. Our observations are consistent across all evaluation metrics. The main performance boost for our approach comes from including LLRD in the training regime. However, our combination of LLRD and translation augmentation slightly outperforms SeqFT-Llrd in terms of both forgetting metrics.
We introduce a sequential fine-tuning approach wherein the language data for fine-tuning is augmented by a subset of translated examples. Our augmentation strategy emulates episodic memory and decreases the reliance on a cache of stored examples from previous stages. We also advocate the use of layer-wise learning rate decay and illustrate its effectiveness in mitigating forgetting. With our results, we show that the proposed approach can outperform joint fine-tuning based methods, in spite of not having access to the complete set of examples from all languages. Crucially, it achieves robust and consistent performance over multiple cross-lingual fine-tuning stages. The trajectories of performance over different languages suggest that the model can continue learning over new data (or languages) for even more stages in the sequence without undergoing a significant reduction in performance. Furthermore, our approach surpasses all baselines when evaluated on in-domain, out-of-domain, in-language and out-of-language performance, showing that the model has strong generalization ability. All in all, we hope our work provides encouragement to the community to pursue similar recipes that facilitate long-term continual learning.
One of the primary limitations of our work is that the analysis covers only three random sequences with only six languages. Diversification of this study with more languages, more random sequences and an even higher number of fine-tuning stages is a strong avenue for future work that we intend to pursue. Additionally, we would also like to extend this study to other cross-lingual tasks to see if the findings are similar.
Another limitation is a lack of experimentation with adapter-based methods. In the future, we would also like to experiment with varying proportions of translated examples with respect to the original training size.
We would also like to extend our work to include a more in-depth study of the underlying linguistic factors that underpin cross-lingual transfer or forgetting. A study of this kind would ideally include, but not be limited to, analyses based on word order, scripts, morphology and syntax.
The authors of this work would like to express their gratitude to Dinesh Karamchandani, for help with setting up the experimentation framework, and Dan Roth for feedback on our approach.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA. Association for Computing Machinery.
- Chaudhry et al. (2019a) Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. 2019a. On tiny episodic memories in continual learning. arXiv:1902.10486 [cs, stat].
- Chaudhry et al. (2019b) Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet Kumar Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. 2019b. Continual learning with tiny episodic memories. CoRR, abs/1902.10486.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Fadaee et al. (2017) Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics.
- French (1999) Robert French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3:128–135.
- Garcia et al. (2021) Xavier Garcia, Noah Constant, Ankur Parikh, and Orhan Firat. 2021. Towards Continual Learning for Multilingual Machine Translation via Vocabulary Substitution. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1184–1192, Online. Association for Computational Linguistics.
- Hayes et al. (2019) Tyler L. Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya, and Christopher Kanan. 2019. Remind your neural network to prevent catastrophic forgetting.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Keung et al. (2020) Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. 2020. The multilingual Amazon reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4563–4568, Online. Association for Computational Linguistics.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
- Liu et al. (2021a) Qi Liu, Matt Kusner, and Phil Blunsom. 2021a. Counterfactual data augmentation for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 187–197, Online. Association for Computational Linguistics.
- Liu et al. (2021b) Zihan Liu, Genta Indra Winata, Andrea Madotto, and Pascale Fung. 2021b. Preserving Cross-Linguality of Pre-trained Models via Continual Learning. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 64–71, Online. Association for Computational Linguistics.
- Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’ Aurelio Ranzato. 2017. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- McCloskey and Cohen (1989) Michael McCloskey and Neil J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24:104–169.
- M’hamdi et al. (2022) Meryem M’hamdi, Xiang Ren, and Jonathan May. 2022. Cross-lingual lifelong learning. arXiv:2205.11152 [cs].
- Mirzadeh et al. (2020) Seyed Iman Mirzadeh, Mehrdad Farajtabar, Razvan Pascanu, and Hassan Ghasemzadeh. 2020. Understanding the role of training regimes in continual learning.
- Ozler et al. (2020) Kadir Bulut Ozler, Kate Kenski, Steve Rains, Yotam Shmargad, Kevin Coe, and Steven Bethard. 2020. Fine-tuning for multi-domain and multi-label uncivil language detection. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 28–33, Online. Association for Computational Linguistics.
- Smith et al. (2017) Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2017. Don’t decay the learning rate, increase the batch size.
- Vu et al. (2022) Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, and Noah Constant. 2022. Overcoming catastrophic forgetting in zero-shot cross-lingual generation.
- Wang et al. (2018) Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018. Switchout: an efficient data augmentation algorithm for neural machine translation.
- Xia et al. (2019) Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. 2019. Generalized data augmentation for low-resource translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5786–5796, Florence, Italy. Association for Computational Linguistics.
- Xu et al. (2021) Haoran Xu, Seth Ebner, Mahsa Yarmohammadi, Aaron Steven White, Benjamin Van Durme, and Kenton Murray. 2021. Gradual fine-tuning for low-resource domain adaptation. In Proceedings of the Second Workshop on Domain Adaptation for NLP, pages 214–221, Kyiv, Ukraine. Association for Computational Linguistics.
- Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.