Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

04/27/2020 ∙ by Sanyuan Chen, et al. ∙ Harbin Institute of Technology

Deep pretrained language models have achieved great success through the paradigm of pretraining first and then fine-tuning. But such a sequential transfer learning paradigm often confronts the catastrophic forgetting problem and leads to sub-optimal performance. To fine-tune with less forgetting, we propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks. Specifically, we propose a Pretraining Simulation mechanism to recall the knowledge from pretraining tasks without data, and an Objective Shifting mechanism to gradually focus the learning on downstream tasks. Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark. Our method also enables BERT-base to achieve better performance than directly fine-tuning BERT-large. Further, we provide the open-source RecAdam optimizer, which integrates the proposed mechanisms into the Adam optimizer, to facilitate the NLP community.


1 Introduction

Deep Pretrained Language Models (LMs), such as ELMo Peters et al. (2018) and BERT Devlin et al. (2019), have significantly altered the landscape of Natural Language Processing (NLP), and a wide range of NLP tasks has been promoted by these pretrained language models. These successes are mainly achieved through Sequential Transfer Learning Ruder (2019): pretrain a language model on large-scale unlabeled data and then adapt it to downstream tasks. The adaptation step is usually conducted in one of two manners: fine-tuning or freezing pretrained weights. In practice, fine-tuning is adopted more widely due to its flexibility Phang et al. (2018); Lan et al. (2019); Peters et al. (2019).

Despite the great success, sequential transfer learning of deep pretrained LMs tends to suffer from catastrophic forgetting during the adaptation step. Catastrophic forgetting is a common problem for sequential transfer learning, and it happens when a model forgets previously learned knowledge and overfits to target domains McCloskey and Cohen (1989); Kirkpatrick et al. (2017). To remedy catastrophic forgetting in transferring deep pretrained LMs, existing efforts mainly explore fine-tuning tricks to forget less. ULMFiT Howard and Ruder (2018) introduced discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing for LM fine-tuning. Lee et al. (2019) reduced forgetting in BERT fine-tuning by randomly mixing pretrained parameters into the downstream model in a dropout-style manner.

Instead of learning pretraining tasks and downstream tasks in sequence, Multi-task Learning learns both of them simultaneously, and thus can inherently avoid the catastrophic forgetting problem. Xue et al. (2019) tackled forgetting in automatic speech recognition by jointly training the model with previous and target tasks. Kirkpatrick et al. (2017) proposed Elastic Weight Consolidation (EWC) to overcome catastrophic forgetting when continually learning multiple tasks, adopting the multi-task learning paradigm. EWC regularizes new task training by constraining the parameters that are important for previous tasks and adapting more aggressively on other parameters. Thanks to its appealing effects on catastrophic forgetting, EWC has been widely applied in various domains, such as game playing Ribeiro et al. (2019), neural machine translation Thompson et al. (2019) and reading comprehension Xu et al. (2019).

However, these multi-task learning methods cannot be directly applied to the sequential transfer regime of deep pretrained LMs. Firstly, multi-task learning methods require the use of pretraining-task data during adaptation, but the pretraining data of LMs is often inaccessible or too large for the adaptation. Secondly, we only care about the performance of the downstream task, while multi-task learning also aims to promote performance on pretraining tasks.

In this paper, we propose a recall and learn mechanism to cope with the forgetting problem of fine-tuning deep pretrained LMs. To achieve this, we take advantage of multi-task learning by adopting LM pretraining as an auxiliary learning task during fine-tuning. Specifically, we propose two mechanisms for the two challenges mentioned above, respectively. As for the challenge of data obstacles, we propose the Pretraining Simulation to achieve multi-task learning without access to the pretraining data. It helps the model to recall previously learned knowledge by simulating the pretraining objective using only the pretrained parameters. As for the challenge of the learning objective difference, we propose the Objective Shifting to balance new task learning and pretrained knowledge recalling. It allows the model to focus gradually on the new task by shifting the multi-task learning objective to the new task learning.

We also provide the Recall Adam (RecAdam) optimizer to integrate the recall and learn mechanism into the Adam optimizer Kingma and Ba (2015). We release the source code of the RecAdam optimizer implemented in PyTorch. It is easy to use and can facilitate the NLP community for better fine-tuning of deep pretrained LMs. Experiments on the GLUE benchmark with the BERT-base model show that the proposed method can significantly outperform the vanilla fine-tuning method. Our method with the BERT-base model can even achieve better results than directly fine-tuning the BERT-large model. In addition, thanks to the effectiveness of pretrained knowledge recalling, we gain better performance by initializing the model with random parameters rather than pretrained parameters. Finally, we achieve state-of-the-art performance on the GLUE benchmark with the ALBERT-xxlarge model.

Our contributions can be summarized as follows: (1) We propose to tackle the catastrophic forgetting problem of fine-tuning the deep pretrained LMs by adopting the idea of multi-task learning and obtain state-of-the-art results on GLUE benchmark. (2) We propose Pretraining Simulation and Objective Shifting mechanisms to achieve multi-task fine-tuning without data of pretraining tasks. (3) We provide the open-source RecAdam optimizer to facilitate deep pretrained LMs fine-tuning with less forgetting.

2 Background

In this section, we introduce two transfer learning settings: sequential transfer learning and multi-task learning. They both aim to improve the learning performance by transferring knowledge across multiple tasks, but apply to different scenarios.

2.1 Sequential Transfer Learning

Sequential transfer learning learns source tasks and target tasks in sequence, and transfers knowledge from source tasks to improve the model's performance on target tasks. It typically consists of two stages: pretraining and adaptation. During pretraining, the model is trained on the source tasks with the loss function $L_S$. During adaptation, the pretrained model is further trained on the target tasks with the loss function $L_T$. The standard adaptation methods include fine-tuning and feature extraction. Fine-tuning updates all the parameters of the pretrained model, while feature extraction regards the pretrained model as a feature extractor and keeps it fixed during the adaptation phase.

Sequential transfer learning has been widely used recently, and the released deep pretrained LMs have achieved great successes on various NLP tasks Peters et al. (2018); Devlin et al. (2019); Lan et al. (2019). While the adaptation of the deep pretrained LMs is very efficient, it tends to suffer from catastrophic forgetting, where the model forgets previously learned knowledge from source tasks when learning new knowledge from target tasks.

2.2 Multi-task Learning

Multi-task Learning learns multiple tasks simultaneously, and improves the models’ performance on all of them by sharing knowledge across these tasks Caruana (1997); Ruder (2017).

Under the multi-task learning paradigm, the model is trained on both source tasks and target tasks with the loss function:

$$Loss = \lambda L_T(\theta) + (1 - \lambda) L_S(\theta) \qquad (1)$$

where $\lambda$ is a hyperparameter balancing these two tasks. Multi-task learning can inherently avoid the catastrophic forgetting problem because the loss on the source tasks $L_S$ is always part of the optimization objective.

To overcome the catastrophic forgetting problem (discussed in § 2.1), can we apply the idea of multi-task learning to the adaptation of deep pretrained LMs? There are two challenges in practice:

  1. We cannot get access to the pretraining data to calculate $L_S$ during adaptation.

  2. The optimization objective of adaptation is $L_T$, while multi-task learning aims to optimize $\lambda L_T + (1 - \lambda) L_S$, i.e., the weighted sum of $L_T$ and $L_S$.

3 Methodology

In this section, we introduce Pretraining Simulation (§ 3.1) and Objective Shifting (§ 3.2) to overcome the two challenges (discussed in § 2.2) respectively. Pretraining Simulation allows the model to learn source tasks without pretraining data, and Objective Shifting allows the model to focus on target tasks gradually. We also introduce the RecAdam optimizer (§ 3.3) to integrate these two mechanisms into the commonly used Adam optimizer.

3.1 Pretraining Simulation

To address the first challenge, the unavailability of pretraining data, we introduce Pretraining Simulation to approximate the optimization objective of the source tasks as a quadratic penalty, which keeps the model parameters close to the pretrained parameters.

Following Elastic Weight Consolidation (EWC; Kirkpatrick et al. 2017; Huszár 2017), we approximate the optimization objective of the source tasks with Laplace's Method and an independence assumption among the model parameters. Since EWC requires pretraining data, we further introduce a stronger independence assumption and derive a quadratic penalty, which is independent of the pretraining data. We introduce the detailed derivation process as follows.

From the probabilistic perspective, the learning objective on the source tasks would be optimizing the negative log posterior probability of the model parameters $\theta$ given the data of the source tasks $D_S$:

$$L_S(\theta) = -\log p(\theta \mid D_S)$$

The pretrained parameters $\theta^*$ can be assumed to be a local minimum of the parameter space, satisfying the equation:

$$\left.\frac{\partial L_S(\theta)}{\partial \theta}\right|_{\theta = \theta^*} = 0$$

Due to the intractability, the optimization objective is locally approximated with Laplace's Method MacKay (2003):

$$L_S(\theta) \approx L_S(\theta^*) + \frac{1}{2}(\theta - \theta^*)^\top H(\theta^*)(\theta - \theta^*)$$

where $H(\theta^*)$ is the Hessian matrix of the optimization objective w.r.t. $\theta$ and evaluated at $\theta^*$. $L_S(\theta^*)$ is a constant term w.r.t. $\theta$, and it can be ignored during optimization.

Since the pretrained model converges on the source tasks, $H(\theta^*)$ can be approximated with the empirical Fisher information matrix $F(\theta^*)$ Martens (2014):

$$H(\theta^*) \approx N \cdot F(\theta^*) + H_{prior}(\theta^*)$$

where $N$ is the number of i.i.d. observations in $D_S$, and $H_{prior}(\theta^*)$ is the Hessian matrix of the negative log prior probability $-\log p(\theta)$.

Because of the computational intractability, EWC approximates $L_S(\theta)$ by using the diagonal of $F(\theta^*)$ and ignoring the prior Hessian matrix $H_{prior}(\theta^*)$:

$$L_S(\theta) \approx \frac{1}{2} \sum_i N F_i \, (\theta_i - \theta^*_i)^2$$

where $F_i$ is the corresponding diagonal Fisher information value of the model parameter $\theta_i$.

Since the pretraining data is unavailable, we further approximate $L_S(\theta)$ with a stronger assumption that each diagonal Fisher information value is independent of the corresponding parameter $\theta_i$:

$$L_S(\theta) \approx \frac{1}{2} \gamma \sum_i (\theta_i - \theta^*_i)^2$$

The final approximated optimization objective of the source tasks is the quadratic penalty between the model parameters and the pretrained parameters, where $\gamma$ is the coefficient of the quadratic penalty.
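The final approximation can be computed directly from the current and pretrained parameters. Below is a minimal pure-Python sketch; the function name and toy values are illustrative, not taken from the paper's released code:

```python
def pretraining_simulation_penalty(theta, theta_star, gamma):
    """Approximate the source-task loss as a quadratic penalty:
    L_S(theta) ~= (gamma / 2) * sum_i (theta_i - theta*_i)^2
    """
    return 0.5 * gamma * sum((t - ts) ** 2 for t, ts in zip(theta, theta_star))

theta_star = [0.5, -1.0, 2.0]  # pretrained parameters (toy values)
theta = [0.6, -1.1, 1.8]       # current parameters during fine-tuning
penalty = pretraining_simulation_penalty(theta, theta_star, gamma=5000.0)
```

Minimizing this term pulls the fine-tuned parameters back toward the pretrained ones, which is how the model "recalls" pretraining knowledge without any pretraining data.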

3.2 Objective Shifting

To address the second challenge, that the optimization objective of multi-task learning is inconsistent with adaptation, we introduce Objective Shifting to allow the objective function to gradually shift to $L_T$ with an annealing coefficient.

We replace the coefficient $\lambda$ in the optimization objective of multi-task learning (as shown in Eq. 1) with the annealing function $\lambda(t)$, where $t$ refers to the update timestep during fine-tuning. The loss function of our method is set to multi-task learning with the annealing coefficient:

$$Loss = \lambda(t) \, L_T(\theta) + (1 - \lambda(t)) \, L_S(\theta)$$

Figure 1: Objective Shifting: we replace the coefficient $\lambda$ with the annealing function $\lambda(t)$. Fine-tuning and multi-task learning can be regarded as special cases of our method ($k \to \infty$ and $k \to 0$, respectively).

Specifically, to better balance multi-task learning and fine-tuning, $\lambda(t)$ is calculated as the sigmoid annealing function Bowman et al. (2016):

$$\lambda(t) = \frac{1}{1 + \exp(-k \cdot (t - t_0))}$$

where $k$ and $t_0$ are the hyperparameters controlling the annealing rate and timesteps.

As shown in Figure 1, at the beginning of the training process, the model mainly learns general knowledge by focusing more on the pretraining tasks. As training progresses, the model gradually focuses on the target tasks and learns more target-specific knowledge while recalling the knowledge of the pretraining tasks. At the end of the training process, the model completely focuses on the target tasks, and the final optimization objective is $L_T$.

Fine-tuning and multi-task learning can be regarded as special cases of our method. When $k \to \infty$, $\lambda(t)$ becomes a step function, and our method reduces to fine-tuning: the model first gets trained on the source tasks with $L_S$, then learns the target tasks with $L_T$. When $k \to 0$, $\lambda(t)$ is a constant function, and our method reduces to multi-task learning: the model learns the source tasks and target tasks simultaneously with the weighted-sum loss function.
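The sigmoid annealing schedule and its two limiting behaviors can be sketched in a few lines of Python (the particular $k$ and $t_0$ values here are illustrative):

```python
import math

def annealing_coefficient(t, k, t0):
    """lambda(t) = 1 / (1 + exp(-k * (t - t0))): shifts the objective
    from recalling L_S (lambda ~ 0) to learning L_T (lambda ~ 1)."""
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

k, t0 = 0.1, 250
early = annealing_coefficient(0, k, t0)     # ~0: objective dominated by L_S
middle = annealing_coefficient(t0, k, t0)   # exactly 0.5: equal weight on L_S and L_T
late = annealing_coefficient(5000, k, t0)   # ~1: objective is essentially L_T
```

As $k$ grows, the schedule approaches a step function (sequential fine-tuning); as $k \to 0$, it stays flat at 0.5 (plain multi-task learning).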

1:  given initial learning rate $\alpha$, momentum factors $\beta_1$, $\beta_2$, $\epsilon$, pretrained parameter vector $\theta^*$, coefficient of quadratic penalty $\gamma$, annealing coefficient $\lambda(t)$ in objective function
2:  initialize timestep $t \leftarrow 0$, parameter vector $\theta_0$, first moment vector $m_0 \leftarrow 0$, second moment vector $v_0 \leftarrow 0$, schedule multiplier $\eta_0$
3:  repeat
4:      $t \leftarrow t + 1$
5:      $\nabla f_t(\theta_{t-1}) \leftarrow \text{SelectBatch}(\theta_{t-1})$    ▷ select batch and return the corresponding gradient
6:      Adam: $g_t \leftarrow \lambda(t) \nabla f_t(\theta_{t-1}) + (1 - \lambda(t)) \gamma (\theta_{t-1} - \theta^*)$;   RecAdam: $g_t \leftarrow \nabla f_t(\theta_{t-1})$
7:      $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$    ▷ here and below all operations are element-wise
8:      $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
9:      $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$    ▷ $\beta_1$ is taken to the power of $t$
10:     $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$    ▷ $\beta_2$ is taken to the power of $t$
11:     $\eta_t \leftarrow \text{SetScheduleMultiplier}(t)$    ▷ can be fixed, decay, or also be used for warm restarts
12:     Adam: $\theta_t \leftarrow \theta_{t-1} - \eta_t \, \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$;   RecAdam: $\theta_t \leftarrow \theta_{t-1} - \eta_t \big( \lambda(t) \, \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + (1 - \lambda(t)) \gamma (\theta_{t-1} - \theta^*) \big)$
13:  until stopping criterion is met
14:  return optimized parameters $\theta_t$
Algorithm 1 Adam and RecAdam

3.3 RecAdam Optimizer

Adam optimizer Kingma and Ba (2015) is commonly used for fine-tuning the deep pretrained LMs. We introduce Recall Adam (RecAdam) optimizer to integrate the quadratic penalty and the annealing coefficient, which are the core factors of the Pretraining Simulation (§ 3.1) and Objective Shifting (§ 3.2) mechanisms respectively, by decoupling them from the gradient updates in Adam optimizer.

Loshchilov and Hutter (2019) observed that L2 regularization and weight decay are not identical for adaptive gradient algorithms such as Adam, and showed that the proposed AdamW optimizer, based on decoupled weight decay, substantially improves Adam's performance both theoretically and empirically.

Similarly, it is necessary to decouple the quadratic penalty and the annealing coefficient when fine-tuning the pretrained LMs with Adam optimizer. Otherwise, both the quadratic penalty and annealing coefficient would be adapted by the gradient update rules, resulting in different magnitudes of the quadratic penalty among the model’s weights.

The comparison between Adam and RecAdam is shown in Algorithm 1, where SetScheduleMultiplier (Line 11) refers to the procedure (e.g., the warm-up technique) that produces the scaling factor of the step size.

Line 6 of Algorithm 1 shows how the quadratic penalty and annealing coefficient would be implemented with the vanilla Adam optimizer. The weighted sum of the gradient of the target task objective function and the gradient of the quadratic penalty gets adapted by the gradient update rules, which leads to inequivalent magnitudes of the quadratic penalty among the model's weights, e.g., weights that tend to have larger gradients would have a larger second moment and be penalized by a relatively smaller amount than other weights.

With the RecAdam optimizer, we decouple the gradient of the quadratic penalty and the annealing coefficient in Line 12 of Algorithm 1. In this way, only the gradient of the target task objective function gets adapted during the optimization steps, and all the weights of the training model are more effectively penalized at the same rate $\gamma$.

Since the RecAdam optimizer is only a one-line modification of the Adam optimizer, it can be easily used by feeding in the additional parameters, including the pretrained parameters and a few hyperparameters of the Pretraining Simulation and Objective Shifting mechanisms.
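To make the decoupling concrete, here is a minimal pure-Python sketch of a single RecAdam-style update over a list of parameters. It follows the update rule described above, but the function signature and hyperparameter values are our own illustrations, not the released PyTorch implementation:

```python
import math

def recadam_step(theta, theta_star, m, v, grad, t,
                 alpha=2e-5, beta1=0.9, beta2=0.999, eps=1e-8,
                 gamma=5000.0, k=0.1, t0=250, eta=1.0):
    """One decoupled update step: Adam's moments adapt only the target-task
    gradient, while the quadratic-penalty pull toward theta* is applied raw."""
    lam = 1.0 / (1.0 + math.exp(-k * (t - t0)))  # annealing coefficient lambda(t)
    new_theta, new_m, new_v = [], [], []
    for th, ts, mi, vi, g in zip(theta, theta_star, m, v, grad):
        mi = beta1 * mi + (1 - beta1) * g         # first moment (target-task grad only)
        vi = beta2 * vi + (1 - beta2) * g * g     # second moment
        m_hat = mi / (1 - beta1 ** t)             # bias correction
        v_hat = vi / (1 - beta2 ** t)
        # Decoupled update: annealed Adam step on L_T plus raw penalty toward theta*.
        th -= eta * (lam * alpha * m_hat / (math.sqrt(v_hat) + eps)
                     + (1 - lam) * gamma * (th - ts))
        new_theta.append(th); new_m.append(mi); new_v.append(vi)
    return new_theta, new_m, new_v

# Early in training (t << t0, lambda ~ 0) the update is dominated by the
# penalty, pulling parameters toward the pretrained values.
theta, _, _ = recadam_step([1.0], [0.0], [0.0], [0.0], [0.5], t=1, gamma=0.1)
```

Note that, as in Line 12 of Algorithm 1, the penalty term never passes through the moment estimates, so every weight is pulled toward $\theta^*$ at the same rate.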

4 Experiments

4.1 Setup

Model

We conduct the experiments with the deep pretrained language models BERT-base Devlin et al. (2019) and ALBERT-xxlarge Lan et al. (2019).

BERT is a deep bi-directional pretrained model based on multi-layer Transformer encoders. It is pretrained on the large-scale corpus with two unsupervised tasks: Masked LM and Next Sentence Prediction, and has achieved significant improvements on a wide range of NLP tasks. We use the BERT-base model with 12 layers, 12 attention heads and 768 hidden dimensions.

ALBERT is the latest deep pretrained LM that achieves state-of-the-art performance on several benchmarks. It improves over BERT with parameter-reduction techniques and a self-supervised loss for sentence-order prediction (SOP). The ALBERT-xxlarge model, with 12 layers, 64 attention heads, a 128-dimensional embedding and 4,096 hidden dimensions, is the current state-of-the-art model released by Lan et al. (2019).

Data

We evaluate our methods on the General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2019).

GLUE is a well-known benchmark focused on evaluating model capabilities for natural language understanding. It includes 9 tasks: Corpus of Linguistic Acceptability (CoLA; Warstadt et al. 2019), Stanford Sentiment Treebank (SST; Socher et al. 2013), Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett 2005), Semantic Textual Similarity Benchmark (STS; Cer et al. 2017), Quora Question Pairs (QQP; Csernai. January 2017), Multi-Genre NLI (MNLI; Williams et al. 2018), Question NLI (QNLI; Rajpurkar et al. 2016), Recognizing Textual Entailment (RTE; Dagan et al. 2005; Szpektor. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009) and Winograd NLI (WNLI; Levesque et al. 2012).

Following previous works Yang et al. (2019); Liu et al. (2019); Lan et al. (2019), we report our single-task single-model results on the dev sets of 8 GLUE tasks, excluding the problematic WNLI dataset (see https://gluebenchmark.com/faq). We report Pearson correlations for STS, Matthews correlations for CoLA, the "match" condition (MNLI-m) for MNLI, and accuracy scores for all other tasks.

Implementation

As discussed in § 3.3, we implement the Pretraining Simulation and Objective Shifting techniques with the proposed RecAdam optimizer. Since the quadratic penalty already recalls the pretrained parameters, our method initializes the model with random values, while vanilla fine-tuning initializes the model with the pretrained parameters.

We fine-tune the BERT-base and ALBERT-xxlarge models with the same hyperparameters as Devlin et al. (2019) and Lan et al. (2019), except for the maximum sequence length, which we set to 128 rather than 512. For the BERT-base model, we set the learning rate to 2e-5 and select the number of training steps to ensure the convergence of vanilla fine-tuning on each target task. We note that we fine-tune for RTE, STS, and MRPC directly from the pretrained LM, while previous works start from an MNLI checkpoint for further performance improvement. As for the hyperparameters of our method, we set the quadratic penalty coefficient $\gamma$ to 5,000, and select the best $t_0$ and $k$ in {100, 250, 500, 1,000} and {0.05, 0.1, 0.2, 0.5, 1} respectively for the annealing coefficient $\lambda(t)$. Following previous works Yang et al. (2019); Liu et al. (2019); Lan et al. (2019), we report the scores of 5 differently-seeded runs for each result.

Model MNLI QQP QNLI SST Avg CoLA STS MRPC RTE Avg Avg
392k 363k 108k 67k 10k 8.5k 5.7k 3.5k 2.5k 10k
BERT-base Devlin et al. (2019) 84.4 - 88.4 92.7 - - - 86.7 - - -
BERT-base (rerun) 84.8 91.4 88.6 93.0 89.5 60.6 89.8 86.5 71.1 77.0 83.2
BERT-base + RecAdam 85.3 91.4 89.1 93.6 89.9 62.4 90.4 87.7 74.4 78.7 84.3
BERT-base (rerun) 85.2 91.4 89.0 93.3 89.7 61.6 89.9 88.7 71.5 77.9 83.8
BERT-base + RecAdam 85.4 91.6 89.4 94.0 90.1 62.6 90.6 88.7 77.3 79.8 85.0
BERT-large Devlin et al. (2019) 86.6 91.3 92.3 93.2 90.9 60.6 90.0 88.0 70.4 77.3 84.1
XLNet-large Yang et al. (2019) 89.8 91.8 93.9 95.6 92.8 63.6 91.8 89.2 83.8 82.1 87.4
RoBERTa-large Liu et al. (2019) 90.2 92.2 94.7 96.4 93.4 68.0 92.4 90.9 86.6 84.5 88.9
ALBERT-xxlarge Lan et al. (2019) 90.8 92.2 95.3 96.9 93.8 71.4 93.0 90.9 89.2 86.1 90.0
ALBERT-xxlarge (rerun) 90.6 92.2 95.4 96.7 93.7 69.5 93.0 91.2 87.4 85.3 89.5
ALBERT-xxlarge + RecAdam 90.5 92.3 95.3 96.8 93.7 72.9 92.9 91.9 89.3 86.8 90.2
ALBERT-xxlarge (rerun) 90.7 92.2 95.4 96.8 93.8 72.1 93.2 91.4 89.9 86.7 90.2
ALBERT-xxlarge + RecAdam 90.6 92.4 95.5 97.0 93.9 75.1 93.0 93.1 91.7 88.2 91.1
Table 1: State-of-the-art single-task single-model results on the dev set of the GLUE benchmark. The number below each task refers to the size of its training set. The average scores of the tasks with large training data (>10k), the tasks with small training data (<10k), and all the tasks are reported separately. We rerun the baseline of vanilla fine-tuning without further pretraining on MNLI. We report median and maximum over 5 runs.

4.2 Results on GLUE

Table 1 shows the single-task single-model results of our RecAdam fine-tuning method comparing to the vanilla fine-tuning method with BERT-base and ALBERT-xxlarge model on the dev set of the GLUE benchmark.

Results with BERT-base

With the BERT-base model, we outperform the vanilla fine-tuning method on 7 out of 8 tasks of the GLUE benchmark and achieve a 1.1% improvement in average median performance.

Especially for the tasks with smaller training data (<10k), our method achieves significant improvements (+1.7% on average) over the vanilla fine-tuning method. Because of the data scarcity, vanilla fine-tuning on these tasks is potentially brittle and relies on the pretrained parameters being reasonably close to an ideal setting for the target task Phang et al. (2018). With the proposed RecAdam method, we achieve better fine-tuning by learning the target tasks while recalling the knowledge of the pretraining tasks.

It is interesting to find that, compared to the median results with the BERT-large model, we can also achieve better results on more than half of the tasks (e.g., +4.0% on RTE, +0.4% on STS, +1.8% on CoLA, +0.4% on SST, +0.1% on QQP) and a better average result (+0.2%) over all the GLUE tasks. Thanks to the reduced catastrophic forgetting realized by RecAdam, we obtain comparable overall performance with far fewer parameters in the pretrained model.

Results with ALBERT-xxlarge

With the state-of-the-art model ALBERT-xxlarge, we outperform the vanilla fine-tuning method on 5 out of 8 tasks of the GLUE benchmark and achieve the state-of-the-art single-task single-model average median performance of 90.2% on the dev set of the GLUE benchmark.

Similar to the results with the BERT-base model, we find that our improvements mostly come from the tasks with smaller training data (<10k): we improve the ALBERT-xxlarge model's median performance on these tasks by +1.5% on average. Also, compared to the results reported by Lan et al. (2019), we achieve similar or better median results on the RTE (+0.1%), STS (-0.1%), and MRPC (+1.0%) tasks without pretraining on the MNLI task.

Overall, we outperform the average median results of the vanilla fine-tuning baseline with the ALBERT-xxlarge model by 0.7%, which is lower than the improvement we gain with the BERT-base model (+1.1%). With its advanced model design and pretraining techniques, ALBERT-xxlarge already achieves significantly better performance on the GLUE benchmark, leaving less room for further improvement.

4.3 Analysis

Method CoLA STS MRPC RTE Avg
vanilla fine-tuning 60.6 89.8 86.5 71.1 77.0
RecAdam + PI 62.0 90.4 87.3 73.6 78.3
RecAdam + RI 62.4 90.4 87.7 74.4 78.7
Table 2: Comparison of different model initialization strategies: pretrained initialization (PI) and Random Initialization (RI). We report median over 5 runs.

Model Initialization

With our RecAdam method based on Pretraining Simulation and Objective Shifting, the model can be initialized with random values, and recall the knowledge of pretraining tasks while learning the new tasks.

It is interesting to see whether the choice of initialization strategy has an impact on the performance of our RecAdam method. Table 2 compares the initialization strategies for RecAdam with the BERT-base model. It shows that RecAdam, with either initialization strategy, outperforms the vanilla fine-tuning method on all four tasks. For the target task STS, the model with pretrained initialization achieves the same result as with random initialization. For the other tasks (e.g., CoLA, MRPC, RTE), randomly initializing the model is the better choice. This is because the model benefits from a larger parameter search space with random initialization. In contrast, with pretrained initialization, the search space is limited to the region around the pretrained model, making it harder for the model to learn the new tasks.

Figure 2: Learning curves obtained by the BERT-base model trained with different objective shifting rates $k$ on the CoLA task: (a) training loss on the target task; (b) knowledge forgetting from the source tasks. We measure knowledge forgetting by the Euclidean distance between the weights of the fine-tuning model and the pretrained model. With smaller $k$, the model achieves less knowledge forgetting from the source tasks while it takes more timesteps to converge on the target task.

Forgetting Analysis

As introduced in § 3.2, we realize multi-task fine-tuning with the Objective Shifting technique, which allows the model's learning objective to shift from the source tasks to the target tasks gradually. The hyperparameter $k$ controls the rate of the objective shifting.

Figure 2 shows the learning curves of our fine-tuning method with different $k$ values, obtained by the BERT-base model trained on the CoLA dataset. As discussed in § 3.2, fine-tuning and multi-task learning can be regarded as the special cases ($k \to \infty$ and $k \to 0$) of our method.

As shown in Figure 2(a), with a larger shifting rate $k$, the model converges quickly on the target task. As $k$ decreases, it takes longer for the model to converge on the target task because of the slower shift from pretrained knowledge recalling to target task learning.

Figure 2(b) shows the pretrained knowledge forgetting during the fine-tuning process. We measure pretrained knowledge forgetting by the Euclidean distance between the weights of the fine-tuning model and the pretrained model. At the very early timesteps, the Euclidean distance drops sharply because of the random initialization and pretrained knowledge recalling. Then the curve rises, with the growth rate slowing down, because of the target task learning. As the objective shifting rate $k$ decreases, the model achieves less forgetting from the pretrained model at the end of fine-tuning.
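The forgetting measure itself is straightforward to compute. A small sketch (the function name and toy vectors are ours, for illustration):

```python
import math

def forgetting_distance(theta_finetuned, theta_pretrained):
    """Euclidean distance between fine-tuned and pretrained weight vectors,
    used as a proxy for how much pretrained knowledge has been forgotten."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(theta_finetuned, theta_pretrained)))

dist = forgetting_distance([1.0, 2.0, 2.0], [1.0, 0.0, 0.0])  # sqrt(8)
```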

Overall, our method provides a bridge between fine-tuning and multi-task learning. With smaller $k$, the model forgets less knowledge from the source tasks but risks not converging completely on the target task. With a good balance between pretrained knowledge recalling and new task learning, our method consistently outperforms vanilla fine-tuning by not only converging on the target tasks but also forgetting less from the source tasks.

5 Related Works

Catastrophic forgetting has been observed as a great challenge in sequential transfer learning, especially in the continual learning paradigm McCloskey and Cohen (1989); French (1999); Goodfellow et al. (2013); Lange et al. (2019). Many methods have been proposed to avoid catastrophic forgetting. Replay-based methods alleviate forgetting by replaying the samples of previous tasks while learning the new task Rebuffi et al. (2017); Shin et al. (2017); Lopez-Paz and Ranzato (2017); Aljundi et al. (2019). Parameter isolation-based methods avoid forgetting by updating a set of parameters for each task and freezing them for the new task Mallya and Lazebnik (2018); Serrà et al. (2018); Rusu et al. (2016); Xu and Zhu (2018); Rosenfeld and Tsotsos (2020). Regularization-based methods recall previous knowledge with an extra regularization term Kirkpatrick et al. (2017); Zenke et al. (2017); Lee et al. (2017); Li et al. (2018); Aljundi et al. (2018); Liu et al. (2018); Li and Hoiem (2018); Jung et al. (2016); Triki et al. (2017); Zhang et al. (2019b). We focus on regularization-based methods in this paper because they do not require storing the pretraining data and are more flexible than the parameter isolation-based methods.

Regularization-based methods can be further divided into data-focused and prior-focused methods. Data-focused methods regularize the new task learning by knowledge distillation from the pretrained model Hinton et al. (2015); Li and Hoiem (2018); Jung et al. (2016); Triki et al. (2017); Zhang et al. (2019b). Prior-focused methods regard the distribution of the pretrained parameters as a prior when learning the new task Kirkpatrick et al. (2017); Zenke et al. (2017); Lee et al. (2017); Li et al. (2018); Aljundi et al. (2018); Liu et al. (2018). We adopt the idea of prior-focused methods because they enable the model to learn general knowledge from the pretrained model's parameters more efficiently. While prior-focused methods, such as EWC Kirkpatrick et al. (2017) and its variants Schwarz et al. (2018); Chaudhry et al. (2018); Liu et al. (2018), do not directly access the pretraining data, they need some pretraining knowledge (e.g., the Fisher information matrix of the source tasks), which is not available in our setting. Therefore, we further approximate the source-task objective as a quadratic penalty which, given the pretrained parameters, is independent of the pretraining data.

Catastrophic forgetting in NLP has attracted increased attention recently Mou et al. (2016); Arora et al. (2019); Chronopoulou et al. (2019). Many approaches have been proposed to overcome the forgetting problem in various domains, such as neural machine translation Barone et al. (2017); Thompson et al. (2019) and reading comprehension Xu et al. (2019). As sequential transfer learning is widely used for NLP tasks Howard and Ruder (2018); Devlin et al. (2019); Liu et al. (2019); Lan et al. (2019); Hou et al. (2019), previous works explore many fine-tuning tricks to reduce catastrophic forgetting in the adaptation of deep pretrained LMs Howard and Ruder (2018); Sun et al. (2019); Lee et al. (2019); Zhang et al. (2019a); Felbo et al. (2017). In this paper, we bring in the idea of multi-task learning, which can inherently avoid catastrophic forgetting, apply it to the fine-tuning process with the Pretraining Simulation and Objective Shifting mechanisms, and achieve consistent improvements with only the deep pretrained LMs available.

6 Conclusion

In this paper, we address catastrophic forgetting in transferring deep pretrained language models by bridging two transfer learning paradigms: sequential fine-tuning and multi-task learning. To cope with the absence of pretraining data during the joint learning of the pretraining tasks, we propose a Pretraining Simulation mechanism to learn the pretraining tasks without data. We then propose the Objective Shifting mechanism to better balance the learning of the pretraining and downstream tasks. Experiments demonstrate the superiority of our method in transferring deep pretrained language models, and we provide the open-source RecAdam optimizer, which integrates the proposed mechanisms into the Adam optimizer, to facilitate better use of deep pretrained language models.

References

  • R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars (2018) Memory aware synapses: learning what (not) to forget. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part III, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Lecture Notes in Computer Science, Vol. 11207, pp. 144–161. External Links: Link, Document Cited by: §5, §5.
  • R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio (2019) Online continual learning with no task boundaries. CoRR abs/1903.08671. External Links: Link, 1903.08671 Cited by: §5.
  • G. Arora, A. Rahimi, and T. Baldwin (2019) Does an LSTM forget more than a CNN? an empirical study of catastrophic forgetting in NLP. In Proceedings of the The 17th Annual Workshop of the Australasian Language Technology Association, ALTA 2019, Sydney, Australia, December 4-6, 2019, M. Mistica, M. Piccardi, and A. MacKinlay (Eds.), pp. 77–86. External Links: Link Cited by: §5.
  • A. V. M. Barone, B. Haddow, U. Germann, and R. Sennrich (2017) Regularization techniques for fine-tuning in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, M. Palmer, R. Hwa, and S. Riedel (Eds.), pp. 1489–1494. External Links: Link, Document Cited by: §5.
  • L. Bentivogli, B. Magnini, I. Dagan, H. T. Dang, and D. Giampiccolo (2009) The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Second Text Analysis Conference, TAC 2009, Gaithersburg, Maryland, USA, November 16-17, 2009, External Links: Link Cited by: §4.1.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, Y. Goldberg and S. Riezler (Eds.), pp. 10–21. External Links: Link, Document Cited by: §3.2.
  • R. Caruana (1997) Multitask learning. Mach. Learn. 28 (1), pp. 41–75. External Links: Link, Document Cited by: §2.2.
  • D. M. Cer, M. T. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, S. Bethard, M. Carpuat, M. Apidianaki, S. M. Mohammad, D. M. Cer, and D. Jurgens (Eds.), pp. 1–14. External Links: Link, Document Cited by: §4.1.
  • A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. S. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XI, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Lecture Notes in Computer Science, Vol. 11215, pp. 556–572. External Links: Link, Document Cited by: §5.
  • A. Chronopoulou, C. Baziotis, and A. Potamianos (2019) An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 2089–2095. External Links: Link, Document Cited by: §5.
  • K. Csernai (January 2017) First quora dataset release: Question pairs. Note: https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs Cited by: §4.1.
  • I. Dagan, O. Glickman, and B. Magnini (2005) The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177–190. Cited by: §4.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.1, §4.1, §4.1, Table 1, §5.
  • W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP 2005, Jeju Island, Korea, October 2005, 2005, External Links: Link Cited by: §4.1.
  • B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann (2017) Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1615–1625. External Links: Link, Document Cited by: §5.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128 – 135. External Links: ISSN 1364-6613, Document, Link Cited by: §5.
  • D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan (2007) The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp. 1–9. Cited by: §4.1.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: §5.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. External Links: Link, 1503.02531 Cited by: §5.
  • Y. Hou, Z. Zhou, Y. Liu, N. Wang, W. Che, H. Liu, and T. Liu (2019) Few-shot sequence labeling with label dependency transfer. arXiv preprint arXiv:1906.08711. Cited by: §5.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. Cited by: §1, §5.
  • F. Huszár (2017) On quadratic penalties in elastic weight consolidation. arXiv preprint arXiv:1712.03847. Cited by: §3.1.
  • H. Jung, J. Ju, M. Jung, and J. Kim (2016) Less-forgetting learning in deep neural networks. CoRR abs/1607.00122. External Links: Link, 1607.00122 Cited by: §5, §5.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §3.3.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1, §1, §3.1, §5, §5.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: A lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942. External Links: Link, 1909.11942 Cited by: §1, §2.1, §4.1, §4.1, §4.1, §4.1, §4.2, Table 1, §5.
  • M. D. Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. G. Slabaugh, and T. Tuytelaars (2019) Continual learning: A comparative study on how to defy forgetting in classification tasks. CoRR abs/1909.08383. External Links: Link, 1909.08383 Cited by: §5.
  • C. Lee, K. Cho, and W. Kang (2019) Mixout: effective regularization to finetune large-scale pretrained language models. arXiv preprint arXiv:1909.11299. Cited by: §1, §5.
  • S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 4652–4662. External Links: Link Cited by: §5, §5.
  • H. J. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference, KR 2012, Rome, Italy, June 10-14, 2012, G. Brewka, T. Eiter, and S. A. McIlraith (Eds.), External Links: Link Cited by: §4.1.
  • X. Li, Y. Grandvalet, and F. Davoine (2018) Explicit inductive bias for transfer learning with convolutional networks. arXiv preprint arXiv:1802.01483. Cited by: §5, §5.
  • Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40 (12), pp. 2935–2947. External Links: Link, Document Cited by: §5, §5.
  • X. Liu, M. Masana, L. Herranz, J. van de Weijer, A. M. López, and A. D. Bagdanov (2018) Rotate your networks: better weight consolidation and less catastrophic forgetting. In 24th International Conference on Pattern Recognition, ICPR 2018, Beijing, China, August 20-24, 2018, pp. 2262–2268. External Links: Link, Document Cited by: §5, §5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §4.1, §4.1, Table 1, §5.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 6467–6476. External Links: Link Cited by: §5.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §3.3.
  • D. J. C. MacKay (2003) Information theory, inference, and learning algorithms. Cambridge University Press. External Links: ISBN 978-0-521-64298-9 Cited by: §3.1.
  • A. Mallya and S. Lazebnik (2018) PackNet: adding multiple tasks to a single network by iterative pruning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7765–7773. External Links: Link, Document Cited by: §5.
  • J. Martens (2014) New perspectives on the natural gradient method. CoRR abs/1412.1193. External Links: Link, 1412.1193 Cited by: §3.1.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1, §5.
  • L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin (2016) How transferable are neural networks in NLP applications?. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, J. Su, X. Carreras, and K. Duh (Eds.), pp. 479–489. External Links: Link, Document Cited by: §5.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §1, §2.1.
  • M. E. Peters, S. Ruder, and N. A. Smith (2019) To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019, I. Augenstein, S. Gella, S. Ruder, K. Kann, B. Can, J. Welbl, A. Conneau, X. Ren, and M. Rei (Eds.), pp. 7–14. External Links: Link, Document Cited by: §1.
  • J. Phang, T. Févry, and S. R. Bowman (2018) Sentence encoders on stilts: supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088. Cited by: §1, §4.2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, J. Su, X. Carreras, and K. Duh (Eds.), pp. 2383–2392. External Links: Link, Document Cited by: §4.1.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5533–5542. External Links: Link, Document Cited by: §5.
  • J. Ribeiro, F. S. Melo, and J. Dias (2019) Multi-task learning and catastrophic forgetting in continual reinforcement learning. arXiv preprint arXiv:1909.10008. Cited by: §1.
  • A. Rosenfeld and J. K. Tsotsos (2020) Incremental learning through deep adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 42 (3), pp. 651–663. External Links: Link, Document Cited by: §5.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098. External Links: Link, 1706.05098 Cited by: §2.2.
  • S. Ruder (2019) Neural transfer learning for natural language processing. Ph.D. Thesis, NUI Galway. Cited by: §1.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. CoRR abs/1606.04671. External Links: Link, 1606.04671 Cited by: §5.
  • J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: A scalable framework for continual learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4535–4544. External Links: Link Cited by: §5.
  • J. Serrà, D. Suris, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4555–4564. External Links: Link Cited by: §5.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 2990–2999. External Links: Link Cited by: §5.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1631–1642. External Links: Link Cited by: §4.1.
  • C. Sun, X. Qiu, Y. Xu, and X. Huang (2019) How to fine-tune bert for text classification?. In China National Conference on Chinese Computational Linguistics, pp. 194–206. Cited by: §5.
  • I. Szpektor (2006) The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, pp. 6–4. Cited by: §4.1.
  • B. Thompson, J. Gwinnup, H. Khayrallah, K. Duh, and P. Koehn (2019) Overcoming catastrophic forgetting during domain adaptation of neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2062–2068. Cited by: §1, §5.
  • A. R. Triki, R. Aljundi, M. B. Blaschko, and T. Tuytelaars (2017) Encoder based lifelong learning. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 1329–1337. External Links: Link, Document Cited by: §5, §5.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §4.1.
  • A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. TACL 7, pp. 625–641. External Links: Link Cited by: §4.1.
  • A. Williams, N. Nangia, and S. R. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 1112–1122. External Links: Link, Document Cited by: §4.1.
  • J. Xu and Z. Zhu (2018) Reinforced continual learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 907–916. External Links: Link Cited by: §5.
  • Y. Xu, X. Zhong, A. J. J. Yepes, and J. H. Lau (2019) Forget me not: reducing catastrophic forgetting for domain adaptation in reading comprehension. arXiv preprint arXiv:1911.00202. Cited by: §1, §5.
  • J. Xue, J. Han, T. Zheng, X. Gao, and J. Guo (2019) A multi-task learning framework for overcoming the catastrophic forgetting in automatic speech recognition. arXiv preprint arXiv:1904.08039. Cited by: §1.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 5754–5764. External Links: Link Cited by: §4.1, §4.1, Table 1.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 3987–3995. External Links: Link Cited by: §5, §5.
  • J. O. Zhang, A. Sax, A. R. Zamir, L. J. Guibas, and J. Malik (2019a) Side-tuning: network adaptation via additive side networks. CoRR abs/1912.13503. External Links: Link, 1912.13503 Cited by: §5.
  • J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. P. Heck, H. Zhang, and C.-C. J. Kuo (2019b) Class-incremental learning via deep model consolidation. CoRR abs/1903.07864. External Links: Link, 1903.07864 Cited by: §5, §5.