Generative Feature Replay with Orthogonal Weight Modification for Continual Learning

05/07/2020 ∙ by Gehui Shen, et al. ∙ Peking University

The ability of intelligent agents to learn and remember multiple tasks sequentially is crucial to achieving artificial general intelligence. Many continual learning (CL) methods have been proposed to overcome catastrophic forgetting, which notoriously impedes the sequential learning of neural networks because the data of previous tasks are unavailable. In this paper we focus on class incremental learning, a challenging CL scenario in which the classes of each task are disjoint and task identity is unknown at test time. For this scenario, generative replay is an effective strategy that generates and replays pseudo data for previous tasks to alleviate catastrophic forgetting. However, it is not trivial to learn a generative model continually for relatively complex data. Building on the recently proposed orthogonal weight modification (OWM) algorithm, which keeps previously learned input-output mappings approximately invariant when learning new tasks, we propose to directly generate and replay features. Empirical results on image and text datasets show that our method consistently improves OWM by a significant margin, while conventional generative replay always has a negative effect. Our method also beats a state-of-the-art generative replay method and is competitive with a strong baseline based on real data storage.




1 Introduction

Deep learning has achieved remarkable levels of performance for AI, exceeding the abilities of human experts on several particular tasks. However, neural networks (NN) are prone to catastrophic forgetting [McCloskey and Cohen1989, French1999] when learning multiple tasks sequentially. This phenomenon results from interference between the knowledge of previous tasks and that of the current task: because the data of previous tasks are unavailable, performance on those tasks degrades significantly. In contrast, humans excel at learning new skills and accumulating knowledge continually throughout their lifespan. Continual learning (CL) [Parisi et al.2019] aims to bridge this gap and has become an important challenge in AI research. It allows intelligent agents to reuse and transfer old knowledge, and it matches the real-world situation in which training data are rarely all available at the same time.

According to whether task identity is provided and whether it must be inferred during test, there are mainly three continual learning scenarios [van de Ven and Tolias2018, Hsu et al.2018] i.e. task incremental learning, domain incremental learning and class incremental learning. Class incremental learning (CIL) is the most challenging scenario in which the classes of each task are disjoint and the model is trained to distinguish classes of all tasks with a shared output layer a.k.a. “single-head”. In this paper we focus on CIL.

CIL corresponds to the problem of incrementally learning new classes of objects, which is widespread in real-world applications. As the data of old classes are unavailable and the class distribution changes continually, the shared output layer can exacerbate the forgetting of previous tasks. Some CL methods [Kirkpatrick et al.2017, Li and Hoiem2018] that perform well in the other two scenarios fail almost completely in CIL [van de Ven and Tolias2018, Hsu et al.2018]. A naive approach to alleviate this problem is to store a subset of real data from previous tasks and replay them to the classifier when learning new tasks [Rebuffi et al.2017, Nguyen et al.2018]. However, this violates the main protocol in CL that only the data of the current task are available. Data privacy concerns also call the practical value of real data replay methods into question. In this paper, we aim to design a CIL method without any real data storage.

Inspired by complementary learning systems (CLS) theory [O’Reilly and Norman2002] about biological mechanisms, the generative replay approach [Shin et al.2017] has made some progress in CIL. Instead of storing the real data, a generative model, such as a generative adversarial network (GAN) [Goodfellow et al.2014b] or a variational autoencoder (VAE) [Kingma and Welling2014], is trained to learn the data distribution of previous tasks. During training, both the real data of the current task and the synthesized data sampled from the generator are fed into the classifier. As the synthesized data approximately represent the distribution of old classes, the classifier can retain the knowledge of previous tasks. While learning the new knowledge, the generator is trained in the same manner to prevent catastrophic forgetting. Although generative replay works well on simple datasets such as MNIST, it is far from solving CIL completely. As pointed out in [Lesort et al.2019], training a generative model in the CL scenario is not as easy as training one jointly. For example, generative replay fails completely when applied to CIFAR10, a real-world image dataset [Lesort et al.2019]. Hu et al. [2019] also find it impractical to replay text data with a generative model.

Recently, the orthogonal weight modification (OWM) algorithm [Zeng et al.2019] has been proposed; it is applicable in all three CL scenarios above and can be considered the state-of-the-art method for CIL. The main idea behind OWM is that, to protect previously learned knowledge, the neural network’s weights are modified only in the direction orthogonal to the subspace spanned by all previous inputs fed into the network. In this way, while training on new inputs, the network keeps the learned input-output mappings unchanged. Zeng et al. [2019] provide an online approximate iterative method to update the direction orthogonal to the input space so that OWM is compatible with mini-batch optimization. Because OWM only introduces an extra modification of the weight gradients during training, no task information is required at test time, so it can be used in CIL.

In this paper, on the basis of the OWM algorithm, we propose to generate and replay features instead of raw data to improve the performance of class incremental learning. Specifically, we apply OWM to the classifier while training a generative model to learn the distribution of the penultimate layer’s features. When training on subsequent tasks, generated pseudo features are paired with the new features from the same layer and fed into the last fully connected (FC) layer. Owing to OWM, the features in each layer are stable, which makes feature replay feasible.

Our motivation is three-fold. Firstly, although OWM preserves the ability to distinguish the classes within one task, data from classes that belong to different tasks are never fed into the classifier simultaneously. The classifier is therefore prone to confusing task identities when classifying over all classes. Replaying data of previous tasks can alleviate this problem. Secondly, as raw data contain many details unrelated to class information, learning to continually generate real-world data is hard. The distribution of high-level features, however, is relatively simple, which lessens the difficulty of training the generator. Thirdly, the output layers of almost all NN-based classifiers are fully connected, so generating the feature of the penultimate layer is a universal strategy. Experimental results on several real-world image and text datasets show the superiority of the proposed method over state-of-the-art CIL methods, including OWM.

2 Related Work

The study of catastrophic forgetting in neural networks originated in the 1980s [McCloskey and Cohen1989, Robins1993]. Under the revival of neural networks, overcoming catastrophic forgetting in continual learning setting has drawn much attention again [Goodfellow et al.2014a, Rusu et al.2016, Kirkpatrick et al.2017]. In this section, we mainly review recent continual learning literature which is closely related to our work. Contemporary continual learning strategies can be divided roughly into four categories which are regularization, task-specific, replay and subspace respectively. It should be noted that some existing CL works propose hybrid approaches incorporating more than one strategy.

The most famous regularization method is elastic weight consolidation (EWC) [Kirkpatrick et al.2017], which adds a weighted L2 penalty term on the NN’s parameters. The weight of the L2 term is given by the Fisher information, which measures the importance of each parameter to previously learned knowledge. Li and Hoiem [2018] propose another type of regularization method, which encourages the current classifier’s output probabilities on old classes to approximate the outputs of the old classifier. Such regularization methods are effective in task and domain incremental learning scenarios; however, when applied to the CIL scenario they fail almost completely [van de Ven and Tolias2018, Hsu et al.2018].

Task-specific methods aim to prevent knowledge interference by establishing task-specific modules for different tasks. The task-specific modules can be hidden units [Masse et al.2018], network parameters [Mallya and Lazebnik2018] or dynamically growing sub-networks [Rusu et al.2016]. This type of strategy is designed for task incremental learning. At test time these methods need the task identity to choose the corresponding task-specific modules, so they are not applicable to class incremental learning.

The replay (also called rehearsal) strategy was initially proposed to relearn a subset of previously learned data when learning the current task [Robins1993]. Some recent works [Rebuffi et al.2017, Wu et al.2019] that store a subset of old data fall into this category, which we call real data replay. Real data replay not only violates the continual learning requirement that old data are unavailable, but also runs against the notion of bio-inspired design. According to CLS theory [O’Reilly and Norman2002], the hippocampus encodes and replays recent experiences to help the memory in the neocortex consolidate. Some evidence suggests that the hippocampus works more like a generative model than a replay buffer, which inspired the proposal of deep generative replay (DGR) [Shin et al.2017]. Generative replay uses the GAN framework to train a generator to learn the distribution of old data. When learning new tasks, the pseudo data produced by the generator are replayed to the classifier. Because GANs are good at approximating distributions, the replayed data reduce the shift in data distribution, especially in the CIL scenario, and thus alleviate catastrophic forgetting. As the generator is also trained continually with replayed data, DGR may break down when it encounters complex real-world data [Hu et al.2019, Lesort et al.2019]. A remedy is to encode the raw data into features with a feature extractor pre-trained on large-scale datasets and to replay the features [Hu et al.2019, Xiang et al.2019]. However, such a pre-trained model is often not easy to obtain. In addition, learning from scratch on moderately large data reflects a CL method’s performance more accurately. In contrast, we are the first to propose a successful method that replays features without pre-training.

The last strategy, which we call subspace methods [He and Jaeger2018, Zeng et al.2019], retains previously learned knowledge by keeping fixed the old input-output mappings that NNs induce. To meet this goal, the gradients are projected onto the subspace that is orthogonal to the inputs of past tasks. He and Jaeger [2018] and Zeng et al. [2019] respectively propose “conceptor-aided backprop” (CAB) and the OWM algorithm, which resort to different mathematical theories to compute the orthogonal subspace approximately. This strategy keeps the features of past tasks stable when learning new tasks, which allows us to conduct generative feature replay.

(a) Training the generator on the $t$-th task.
(b) Training the classifier on the $(t+1)$-th task.
Figure 1: Overview of the proposed method. In each subfigure, the modules in the red dashed box are trained and the others are fixed.

3 Methodology

In this section, we describe our continual learning framework, which is a hybrid approach incorporating the subspace method and the generative replay strategy. As illustrated in Figure 1, our framework is composed of a classifier $C$ and a GAN-based generative model, where $G$ and $D$ are the generator and discriminator respectively [Goodfellow et al.2014b]. We divide $C$ into two parts: the last FC layer $F$ and all previous layers, which are treated as a feature extractor $H$. Our framework is similar to DGR [Shin et al.2017], the initial generative replay method. However, there are two main differences between them: 1) We optimize $C$ using projected gradients computed with the OWM algorithm [Zeng et al.2019]. 2) Building on this, we propose to train a GAN to generate and replay the penultimate layer feature $h$, the output of $H$, to alleviate catastrophic forgetting in the last FC layer.

3.1 Training Classifier with OWM algorithm

Conventionally, for an FC layer with output $W^\top x$, the weight matrix $W$ is updated by gradient descent, and learning the $t$-th task $T_t$ leads to the change $\Delta W = -\eta\, \Delta W^{\mathrm{BP}}$, where $\eta$ is the learning rate and $\Delta W^{\mathrm{BP}}$ represents the gradient computed by back propagation (BP) during training on $T_t$. When testing on a previous task $T_i$ ($i < t$), the FC layer’s output deviates from its optimum value after learning $T_t$, i.e. $(W + \Delta W)^\top x \neq W^\top x$ for an input $x$ of $T_i$. The deviation accumulates across layers and causes catastrophic forgetting on previous tasks.

To overcome this, Zeng et al. [2019] developed the OWM algorithm, which projects $\Delta W^{\mathrm{BP}}$ onto the subspace orthogonal to the inputs of all previous tasks using the projector $P = I - A (A^\top A + \alpha I)^{-1} A^\top$, where the matrix $A = [x_1, \dots, x_{t-1}]$ has the previously trained input vectors as its columns (see footnote 1) and $\alpha$ is a small constant that provides robustness to noise. The gradient is modified with the projector $P$: $\Delta W = -\eta\, P\, \Delta W^{\mathrm{BP}}$. Because $P$ is symmetric and $P x \approx 0$ for any input $x$ in the previous input space, OWM keeps the output approximately unchanged after the gradient descent update:

$$(W + \Delta W)^\top x = W^\top x - \eta\, (\Delta W^{\mathrm{BP}})^\top P\, x \approx W^\top x.$$

Footnote 1: Without loss of generality, here we treat each $x_i$ as a vector for brevity, which means each task has only one input.

This property lets the network maintain previously learned input-output mappings while new tasks are learned. The only extra operation is computing the projector $P$ and the projected gradient before running gradient descent. We can calculate $P$ in an efficient online manner as described in [Zeng et al.2019]:

$$P_i = P_{i-1} - \frac{P_{i-1}\, \bar{x}_i\, \bar{x}_i^\top P_{i-1}}{\alpha + \bar{x}_i^\top P_{i-1}\, \bar{x}_i},$$

where $i$ indexes the mini-batch and $\bar{x}_i$ is the mean of the $i$-th mini-batch’s inputs. Although we use an FC layer for the explanation here, the OWM algorithm can be applied to other NNs such as convolutional neural networks (CNN). See the original paper [Zeng et al.2019] for more details about OWM.
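The projector can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the layer output is `W.T @ x`, that the online recursion starts from the identity matrix, and that one input vector is consumed per step. The dimensions, `alpha`, `lr` and the random stand-in gradient are illustrative choices. Under these assumptions the recursion reproduces the closed-form projector, and a projected update moves the outputs on old inputs far less than a plain gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_old = 8, 3, 4
alpha = 1e-3  # small constant in the projector (illustrative value)

# Hypothetical inputs from previous tasks, one per column of A.
A = rng.normal(size=(d_in, n_old))

# Closed form: P = I - A (A^T A + alpha I)^{-1} A^T.
P_closed = np.eye(d_in) - A @ np.linalg.inv(A.T @ A + alpha * np.eye(n_old)) @ A.T

# Online recursion starting from P = I, consuming one input at a time.
P = np.eye(d_in)
for x in A.T:
    Px = P @ x
    P = P - np.outer(Px, Px) / (alpha + x @ Px)

# Compare a projected update with a plain gradient step on the old inputs.
W = rng.normal(size=(d_in, d_out))     # layer output is W.T @ x
grad = rng.normal(size=(d_in, d_out))  # stand-in for a backprop gradient
lr = 0.1
old_out = W.T @ A
drift_owm = np.abs((W - lr * P @ grad).T @ A - old_out).max()
drift_sgd = np.abs((W - lr * grad).T @ A - old_out).max()
print(f"max output drift: OWM {drift_owm:.2e}, plain SGD {drift_sgd:.2e}")
```

The agreement between the two forms follows from the Sherman-Morrison identity applied one input at a time, which is why no matrix inverse is needed during training.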

3.2 Generative Feature Replay

Although OWM is a state-of-the-art method in CIL scenarios, projected gradients mainly protect the knowledge needed to distinguish the classes within the same task. The classifier may confuse task identities at test time because data from different tasks are never trained together, which is the major difficulty in CIL.

To alleviate this problem, we propose Generative Feature Replay (GFR), which replays the penultimate layer feature instead of the raw data to improve OWM. We use an Auxiliary Classifier GAN (AC-GAN) [Odena et al.2017] as the generative model, which allows conditional generation with labels and has been shown to work better than a vanilla GAN for continual learning [Wu et al.2018]. We refer to the existing form of generative replay as Generative Input Replay (GIR).

In our framework, the whole model comprises a classifier $C$, a generator $G$ and a discriminator $D$. The discriminator gives two outputs: $D^{adv}$ scores whether the input is real, as in vanilla GANs, and $D^{cls}$ predicts the input’s class. The generator $G$ takes a noise vector $z$ and a class label $c$ as inputs. We further divide $C$ into two parts: the last FC layer $F$ and the feature extractor $H$. As $C$ is trained with OWM, given the inputs of previous tasks, each layer’s outputs, including the penultimate layer feature $h = H(x)$, remain stable when learning new tasks continually. Thus we can employ the AC-GAN to generate $h$ for replay, which reduces the difficulty of training the generator compared with GIR.

We implement AC-GAN with the WGAN-GP technique [Gulrajani et al.2017] for more stable training. When training the $t$-th task $T_t$, the generator and the discriminator are trained to minimize loss functions built from an adversarial term $L_{adv}$, a classification term $L_{cls}$ and a gradient penalty term $L_{gp}$.

$L_{adv}$ comes from WGAN and discriminates real features from fake ones. It has three terms, defined over the following distributions: $p_t$ is the real data distribution of the $t$-th task, $p_z$ is the noise distribution, and two label distributions cover the first $t-1$ tasks and the $t$-th task respectively. The first term refers to the real features of data in $T_t$, while the third term corresponds to the fake features generated by the current generator. So that the current generator can produce features of previous tasks, the fake features generated by the old generator are also treated as real features, which corresponds to the second term.

Similarly, $L_{cls}$ also has three parts, each of which is a cross-entropy loss. Here the discriminator and the generator are both trained to minimize the classification loss for all real and fake features. $L_{gp}$ is the gradient penalty term of WGAN-GP [Gulrajani et al.2017], and its coefficient is the same in all experiments. It should be noted that when training the GAN, the feature extractor and the old generator are fixed.
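As a hedged sketch only, the three-term structure described above could be written as follows. The notation is assumed rather than taken from the source: $H$ is the feature extractor, $G_t$ and $G_{t-1}$ are the current and old generators, $D^{adv}$ and $D^{cls}$ are the discriminator’s two outputs, $p_c^{1:t-1}$ and $p_c^{1:t}$ are label distributions over the old classes and over all classes seen so far, $\ell_{ce}$ is the cross-entropy loss and $\lambda$ is the gradient penalty coefficient. The signs, the label distribution used in the third term and the way the terms are combined into per-network objectives are all assumptions.

```latex
\begin{aligned}
L_{adv} &= -\,\mathbb{E}_{x \sim p_t}\big[D^{adv}(H(x))\big]
           \;-\; \mathbb{E}_{z \sim p_z,\, c \sim p_c^{1:t-1}}\big[D^{adv}(G_{t-1}(z, c))\big]
           \;+\; \mathbb{E}_{z \sim p_z,\, c \sim p_c^{1:t}}\big[D^{adv}(G_t(z, c))\big], \\
L_{cls} &= \mathbb{E}_{(x, y) \sim p_t}\big[\ell_{ce}(D^{cls}(H(x)),\, y)\big]
           + \mathbb{E}_{z \sim p_z,\, c \sim p_c^{1:t-1}}\big[\ell_{ce}(D^{cls}(G_{t-1}(z, c)),\, c)\big]
           + \mathbb{E}_{z \sim p_z,\, c \sim p_c^{1:t}}\big[\ell_{ce}(D^{cls}(G_t(z, c)),\, c)\big], \\
L_D &= L_{adv} + L_{cls} + \lambda L_{gp}, \qquad
L_G = -\,\mathbb{E}_{z \sim p_z,\, c \sim p_c^{1:t}}\big[D^{adv}(G_t(z, c))\big] + L_{cls}.
\end{aligned}
```

In this reading the discriminator raises its score on real features and on old-generator features while lowering it on current-generator features, and the generator does the opposite for its own samples; both networks share the classification term.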

The generator trained on $T_t$ is used to generate replayed features of the first $t$ tasks when training the classifier on $T_{t+1}$. The classifier loss has two terms, which correspond to the loss on true data and the loss on replayed features respectively. In the second term, we use the probabilities predicted by the old classifier as soft labels for the replayed features, which allows us to use the distillation loss [Hinton et al.2014] for better performance.
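A minimal NumPy sketch of such a two-term loss is given below. It is an illustration under assumptions, not the authors' code: the two terms are weighted equally, the distillation temperature is 1, and the function and variable names (`classifier_loss`, `logits_replay`, `soft_labels`) are hypothetical. The logits stand in for the last FC layer applied to real features and to generated features.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def classifier_loss(logits_real, labels_real, logits_replay, soft_labels):
    """Cross-entropy on current-task data plus distillation on replayed features.

    Equal weighting of the two terms and temperature 1 are assumptions.
    """
    logp_real = log_softmax(logits_real)
    ce_real = -logp_real[np.arange(len(labels_real)), labels_real].mean()
    distill = -(soft_labels * log_softmax(logits_replay)).sum(axis=1).mean()
    return ce_real + distill

# Toy check with 4 classes: all-zero logits and uniform soft labels.
logits_real = np.zeros((2, 4))       # stand-in for the FC layer on real features
labels_real = np.array([2, 3])       # hard labels of the current task
logits_replay = np.zeros((3, 4))     # stand-in for the FC layer on generated features
soft_labels = np.full((3, 4), 0.25)  # stand-in for the old classifier's probabilities
loss = classifier_loss(logits_real, labels_real, logits_replay, soft_labels)
```

With uniform predictions over 4 classes each term equals $\log 4$, so the toy loss is $2 \log 4$; soft labels from the old classifier replace the hard class labels only for the replayed batch.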

In fact, our feature replay strategy only affects the last FC layer, which is ubiquitous in NN-based classifiers. Thus GFR can be applied universally to different types of input data and NN models without any special modifications. GIR, in contrast, needs a different type of generator for each type of data. For example, a CNN-based GAN cannot be used to generate text data. Some recent works [Wu et al.2019, Hou et al.2019] also improve CIL performance with special treatments of the last FC layer. However, they depend on real data replay and are orthogonal to our work.

Dataset     Type    #Train/#Test    #Class   #Task
SVHN        Image   73257/26032     10       5
CIFAR10     Image   50000/10000     10       5
CIFAR100    Image   50000/10000     100      2/5/10/20
THUCNews    Text    50000/15000     10       5
DBPedia     Text    560000/70000    14       5
Table 1: Details of the five datasets.

Method                                SVHN         CIFAR10      CIFAR100     CIFAR100     CIFAR100     CIFAR100     THUCNews     DBPedia
                                      (5 tasks)    (5 tasks)    (2 tasks)    (5 tasks)    (10 tasks)   (20 tasks)   (5 tasks)    (5 tasks)
EWC [Kirkpatrick et al.2017]          12.25±0.13   18.53±0.11   24.31±0.63   12.53±0.69   7.56±0.25    4.15±0.11    19.88±0.03   15.49±1.42
DGR [Shin et al.2017]                 67.50±0.83   22.39±0.83   36.48±0.61   25.52±0.46   15.23±0.62   9.19±0.43    38.67±5.57   N/A
PGMA [Hu et al.2019]                  -            40.47        -            -            -            -            52.93        69.68
OWM [Zeng et al.2019]                 73.15±0.71   54.18±0.66   42.28±0.35   34.16±0.43   30.54±0.91   27.64±0.42   79.65±0.64   91.45±0.62
OWM+GIR                               73.65±0.94   52.58±1.06   44.13±0.37   32.70±0.50   28.67±0.42   26.30±0.60   72.15±3.19   N/A
OWM+GFR (Proposed)                    75.55±0.54   55.85±0.68   42.58±0.50   35.35±0.15   32.30±0.61   27.84±0.63   81.22±0.25   92.57±0.28
iCaRL (B=200) [Rebuffi et al.2017]    45.96±1.72   46.46±1.46   21.30±0.83   16.62±0.70   11.54±0.51   6.57±0.79    80.74±0.59   91.87±1.17
iCaRL (B=2000) [Rebuffi et al.2017]   67.91±0.84   57.66±0.86   36.09±1.34   30.07±0.98   19.53±1.02   10.37±0.65   87.26±1.12   96.03±0.38
Joint Training (Upper Bound)          92.34        76.87        44.77        44.77        44.77        44.77        96.54        98.77
Table 2: Test accuracy (%) after all tasks are learned. We report the mean and standard error over 5 runs with different seeds. The PGMA results are taken from the original paper. Joint training does not depend on the task split, so its single CIFAR100 value applies to all four CIFAR100 settings. The significance of the difference between OWM and OWM+GFR is assessed with a two-sided t-test.
Figure 2: Test accuracies on all classes of already learned tasks after each task is learned, in all 8 settings. Panels: (a) SVHN, (b) CIFAR10, (c) THUCNews, (d) DBPedia, (e) CIFAR100 (2 tasks), (f) CIFAR100 (5 tasks), (g) CIFAR100 (10 tasks), (h) CIFAR100 (20 tasks).

4 Experiments

To evaluate our method, we conduct experiments in CIL settings, where each task corresponds to a disjoint subset of the classes of the whole dataset and the classifier has only one shared output layer. Our method is called OWM+GFR.

4.1 Datasets, Baselines and Model Settings

We use three image datasets and two text datasets, which are detailed in Table 1. To make a fair comparison, following Hu et al. [2019], we randomly select a subset of the test data as validation data and use the remaining data as test data. For THUCNews/DBPedia, the size of the validation set is 5000/10000, and for the other datasets it is 30% of the original test set. For the CIL scenario, we split all datasets into 5 tasks with an equal number of classes per task, except for DBPedia, where the 5 tasks have 3, 3, 3, 3, and 2 classes respectively. We also establish settings with 2/10/20 tasks on CIFAR100 to further evaluate the performance of our method under different numbers of tasks.
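The task splits above can be sketched in a few lines of Python. This is an illustration, not the authors' preprocessing: the helper name `split_classes` is hypothetical, and assigning contiguous class ids to tasks in order is an assumption, since the source gives only the number of classes per task and not which classes go where.

```python
def split_classes(num_classes, num_tasks):
    """Split class ids 0..num_classes-1 into contiguous, disjoint tasks.

    Earlier tasks receive the extra classes when the split is uneven.
    """
    base, extra = divmod(num_classes, num_tasks)
    tasks, start = [], 0
    for t in range(num_tasks):
        size = base + (1 if t < extra else 0)
        tasks.append(list(range(start, start + size)))
        start += size
    return tasks

# DBPedia has 14 classes split into 5 tasks; CIFAR100 uses 2/5/10/20 tasks.
dbpedia_sizes = [len(task) for task in split_classes(14, 5)]
cifar100_sizes = {n: len(split_classes(100, n)[0]) for n in (2, 5, 10, 20)}
```

For 14 classes and 5 tasks this yields task sizes 3, 3, 3, 3 and 2, matching the DBPedia setting, and for CIFAR100 it yields 50, 20, 10 and 5 classes per task.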

The three image datasets we use are collected from the real world and are challenging for training a generative model in the continual learning scenario. We also choose text datasets because generating text is even harder than generating images. We believe these datasets reflect the effectiveness of the proposed method compared with existing methods.

We use the following baselines for comparison: 1) EWC [Kirkpatrick et al.2017], a representative regularization method; 2) DGR [Shin et al.2017], the classical GIR framework. For a fair comparison on image datasets, we use AC-GAN as the generator, as proposed in Wu et al. [2018]. For text datasets, the generator is SeqGAN [Yu et al.2017], which is designed for generating text; 3) Parameter Generation and Model Adaptation (PGMA) [Hu et al.2019], a state-of-the-art CIL method which integrates a parameter generation strategy with GIR; 4) OWM [Zeng et al.2019], the state-of-the-art subspace method which is the basis of our method; 5) OWM+GIR, the OWM baseline combined with GIR, in which the generator is the same as in the DGR baseline. We also compare with iCaRL [Rebuffi et al.2017], a strong baseline with real data storage, to show the superiority of our method. We evaluate iCaRL with two storage budgets: B=200 and B=2000.

We reimplement all baselines except PGMA. To make the comparison as fair as possible, for image datasets we use a classifier with the same architecture as in [Zeng et al.2019], which has a 3-layer CNN with 64, 128 and 256 filters of size 2×2 and a 3-layer MLP with 1000 hidden units. We double the number of filters on CIFAR100. For text datasets, we use one 1D CNN layer with 1024/200 filters for THUCNews/DBPedia respectively and the same MLP as for image datasets (see footnote 2). We also use the same fixed pre-trained word embedding as in [Hu et al.2019].

Footnote 2: Hu et al. [2019] use pre-trained feature extractors whose architectures are unclear. The feature classifier of PGMA is the same MLP as we use. As their feature extractors are pre-trained on a large external dataset, we think our classifiers are even weaker than theirs.

In our method, the generator $G$ generates the 1000-d penultimate layer feature $h$, and $G$ and $D$ are both 3-layer MLPs for all datasets. In GIR, the $G$ and $D$ for image data have 3 deconvolution and 3 convolution layers respectively; for text data, $G$ is a 1-layer LSTM and $D$ is a 1-layer 1D CNN. We make the number of parameters in $G$ and $D$ comparable for GIR and GFR.

4.2 Main Results

We display the final test accuracies over the whole test datasets after all tasks are learned in Table 2. The test accuracies on all learned tasks after each task is learned are also plotted in Figure 2. It should be pointed out that as the vocabulary size of DBPedia is too large to fit SeqGAN in a 1080Ti GPU with 11G memory, we cannot obtain the results of DGR and OWM+GIR on this dataset.

Table 2 shows that OWM performs much better than the other existing methods, i.e. EWC, DGR and PGMA, in all settings. On the relatively simple SVHN, in which the images are digits, DGR is almost comparable with OWM; however, it works much worse on CIFAR10/100, in which the images are more complex objects, and on THUCNews. From Figure 2 we find that DGR works well after learning the second task, and on SVHN and THUCNews it is even better than OWM. This is because, when training on the second task, the generator has not yet been trained in a CL manner, so it has not suffered catastrophic forgetting. As the number of tasks increases, DGR degrades dramatically. This phenomenon verifies that training a generator continually on complex data can hardly succeed. It is worth noting that on CIFAR100 the accuracy of OWM can even improve after a new task is learned.

We find that the proposed method, OWM+GFR, performs best in 7 of 8 settings among all methods without real data storage. Moreover, in 6 of 8 settings our method improves the OWM baseline by a significant margin. In contrast, combining GIR with OWM always hurts OWM except in the SVHN and 2-task CIFAR100 settings, where DGR is effective. In general, after the first few tasks our method consistently outperforms OWM+GIR, because the sequential training of a generator for input data becomes harder and harder. We think these results demonstrate that GFR can indeed improve OWM and that our method makes a substantial contribution to CIL.

In addition, our method is even comparable to the real data replay baseline iCaRL, which works very well on the simple datasets THUCNews and DBPedia. However, iCaRL is ineffective on CIFAR100, where the large number of classes restricts the ability of iCaRL’s nearest-mean-of-exemplars classifier.

4.3 Error Analysis

In this subsection, we examine how GFR improves OWM’s performance. An important deficiency of OWM is that data from different tasks are never trained together, so inferring task identity is potentially hard for the classifier. We expect GFR to alleviate this problem. To verify this conjecture, we first divide the final classification error into two types: Inter-Task Error and Inner-Task Error. The former corresponds to task identity inference error, while the latter refers to the classification error when task identity is inferred correctly. Formally, they are defined as follows:

$$E_{inter} = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \mathbb{1}\big[t(\hat{y}_i) \neq t(y_i)\big], \qquad E_{inner} = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \mathbb{1}\big[t(\hat{y}_i) = t(y_i) \wedge \hat{y}_i \neq y_i\big],$$

where $\mathcal{D}$ is the test dataset and $\mathbb{1}[\cdot]$ is the indicator function. $y_i$ and $\hat{y}_i$ are the true label and predicted label of the $i$-th test sample respectively, and $t(y_i)$ and $t(\hat{y}_i)$ are the tasks to which $y_i$ and $\hat{y}_i$ belong. We run OWM and OWM+GFR with the same random seed and calculate the above two types of error in all 8 settings. The results are displayed in Table 3.
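The two error types can be computed directly from predictions. The sketch below is illustrative only: the function name `task_errors` and the `class_to_task` mapping are hypothetical, and both errors are assumed to be normalized by the size of the whole test set so that they sum to the overall error rate.

```python
def task_errors(y_true, y_pred, class_to_task):
    """Return (inter_task_error, inner_task_error) as fractions of all samples."""
    n = len(y_true)
    # Wrong task inferred: the predicted class belongs to a different task.
    inter = sum(class_to_task[p] != class_to_task[y]
                for y, p in zip(y_true, y_pred))
    # Right task, wrong class within that task.
    inner = sum(class_to_task[p] == class_to_task[y] and p != y
                for y, p in zip(y_true, y_pred))
    return inter / n, inner / n

# Toy example: classes 0-1 form task 0 and classes 2-3 form task 1.
class_to_task = {0: 0, 1: 0, 2: 1, 3: 1}
y_true = [0, 1, 2, 3]
y_pred = [0, 0, 3, 1]
inter, inner = task_errors(y_true, y_pred, class_to_task)
```

In the toy example one sample is correct, two are confused within their own task and one is assigned to the wrong task, so the inter-task error is 0.25 and the inner-task error is 0.5, which together give the total error of 0.75.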

Setting                Method     Inter-Task Error (%)   Inner-Task Error (%)
SVHN                   OWM        22.03                  5.22
                       OWM+GFR    20.22 (-1.81)          4.01 (-1.21)
CIFAR10                OWM        41.94                  4.00
                       OWM+GFR    39.50 (-2.44)          4.73 (+0.73)
CIFAR100 (2 tasks)     OWM        27.41                  29.80
                       OWM+GFR    29.11 (+1.70)          27.81 (-1.99)
CIFAR100 (5 tasks)     OWM        52.66                  13.59
                       OWM+GFR    51.23 (-1.43)          13.49 (-0.10)
CIFAR100 (10 tasks)    OWM        62.67                  6.91
                       OWM+GFR    60.36 (-2.31)          7.53 (+0.62)
CIFAR100 (20 tasks)    OWM        69.54                  3.21
                       OWM+GFR    69.66 (+0.12)          2.91 (-0.30)
THUCNews               OWM        15.83                  4.25
                       OWM+GFR    14.20 (-1.63)          4.38 (+0.13)
DBPedia                OWM        5.79                   2.78
                       OWM+GFR    4.48 (-1.31)           2.64 (-0.14)

Table 3: Comparison of the two types of error between OWM and OWM+GFR in different settings.

Table 3 shows that, except in the 2-task CIFAR100 setting, Inter-Task Error dominates the classification error of OWM. Furthermore, the improvement of OWM+GFR over OWM comes mainly from Inter-Task Error in 6 of 8 settings, the exceptions being the 2-task and 20-task CIFAR100 settings. Notably, Table 2 shows that in the same 6 settings OWM+GFR outperforms OWM significantly. We think these results support the conjecture that GFR improves OWM by reducing Inter-Task Error.

In the 2-task CIFAR100 setting, each task has 50 classes and Inner-Task Error is at a similar level to Inter-Task Error. When applying GFR, the classifier tends to reduce Inner-Task Error, which makes our method perform similarly to OWM. In the 20-task CIFAR100 setting, too many tasks impede the training of the generator, so GFR can hardly improve OWM. As displayed in Figure 2(h), after training on 10 tasks, the advantage of OWM+GFR over OWM almost vanishes. However, GFR is much more powerful than GIR, which is only applicable to scenarios with 2-3 tasks. We will explore how to make GFR work well in scenarios with more tasks in future work.

Figure 3: Visualization of real features and fake features using t-SNE on (a) SVHN and (b) CIFAR10.

4.4 Visualization of Features

We visualize the generated penultimate layer features as well as the real features to further explain why GFR is effective. For real features, we randomly select 100 samples per class from the test dataset and encode them using the final feature extractor $H$. We also randomly draw 100 samples per class from the conditional generator $G$. We project all features into a joint 2D space using t-SNE. The visualization results are shown in Figure 3. We only plot the features for the CIFAR10 and SVHN datasets due to space limits.

Although the generator never sees the real features of the test datasets during training, we observe that a large part of the generated features cluster near real feature clusters of the same class. Therefore, the generated features provide useful information for the classifier to adjust the decision surfaces for all classes simultaneously. We also observe that some generated feature clusters, such as those in grey and red, are far from the corresponding real feature clusters. These classes are from the first or second task, so this phenomenon is likely attributable to catastrophic forgetting in the generator.

5 Conclusion

In this paper, we focus on class incremental learning, a challenging continual learning scenario. On the basis of the OWM algorithm, we propose to generate and replay the penultimate layer feature instead of input data to alleviate catastrophic forgetting. We think this is the first successful attempt to replay the features of a neural network trained continually. Our method achieves state-of-the-art performance on several image and text datasets without real data storage.


  • [French1999] Robert French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 1999.
  • [Goodfellow et al.2014a] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron C. Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In Proceedings of ICLR, 2014.
  • [Goodfellow et al.2014b] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of NIPS, 2014.
  • [Gulrajani et al.2017] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In Proceedings of NIPS, 2017.
  • [He and Jaeger2018] Xu He and Herbert Jaeger. Overcoming catastrophic interference using conceptor-aided backpropagation. In Proceedings of ICLR, 2018.
  • [Hinton et al.2014] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.
  • [Hou et al.2019] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of CVPR, 2019.
  • [Hsu et al.2018] Yen-Chang Hsu, Yen-Cheng Liu, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. In NeurIPS Continual Learning workshop, 2018.
  • [Hu et al.2019] Wenpeng Hu, Zhou Lin, Bing Liu, Chongyang Tao, Zhengwei Tao, Jinwen Ma, Dongyan Zhao, and Rui Yan. Overcoming catastrophic forgetting for continual learning via model adaptation. In Proceedings of ICLR, 2019.
  • [Kingma and Welling2014] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of ICLR, 2014.
  • [Kirkpatrick et al.2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.
  • [Lesort et al.2019] Timothée Lesort, Hugo Caselles-Dupré, Michaël Garcia Ortiz, Andrei Stoian, and David Filliat. Generative models from the perspective of continual learning. In Proceedings of IJCNN, 2019.
  • [Li and Hoiem2018] Zhizhong Li and Derek Hoiem. Learning without forgetting. TPAMI, 2018.
  • [Mallya and Lazebnik2018] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of CVPR, 2018.
  • [Masse et al.2018] Nicolas Y Masse, Gregory D Grant, and David J Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. PNAS, 2018.
  • [McCloskey and Cohen1989] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. Elsevier, 1989.
  • [Nguyen et al.2018] Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. In Proceedings of ICLR, 2018.
  • [Odena et al.2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of ICML, 2017.
  • [O’Reilly and Norman2002] Randall C O’Reilly and Kenneth A Norman. Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in cognitive sciences, 2002.
  • [Parisi et al.2019] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
  • [Rebuffi et al.2017] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of CVPR, 2017.
  • [Robins1993] Anthony V. Robins. Catastrophic forgetting in neural networks: the role of rehearsal mechanisms. In First New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, 1993.
  • [Rusu et al.2016] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, et al. Progressive neural networks. CoRR, abs/1606.04671, 2016.
  • [Shin et al.2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Proceedings of NIPS, 2017.
  • [van de Ven and Tolias2018] Gido M. van de Ven and Andreas S. Tolias. Three scenarios for continual learning. In NeurIPS Continual Learning workshop, 2018.
  • [Wu et al.2018] Chenshen Wu, Luis Herranz, Xialei Liu, Yaxing Wang, Joost van de Weijer, and Bogdan Raducanu. Memory replay gans: Learning to generate new categories without forgetting. In Proceedings of NeurIPS, 2018.
  • [Wu et al.2019] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of CVPR, 2019.
  • [Xiang et al.2019] Ye Xiang, Ying Fu, Pan Ji, and Hua Huang. Incremental learning using conditional adversarial networks. In Proceedings of ICCV, 2019.
  • [Yu et al.2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In Proceedings of AAAI, 2017.
  • [Zeng et al.2019] Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 2019.