Self-Supervised Learning Aided Class-Incremental Lifelong Learning

06/10/2020 ∙ by Song Zhang, et al. ∙ Peking University

Lifelong or continual learning remains a challenge for artificial neural networks, which must be both stable enough to preserve old knowledge and plastic enough to acquire new knowledge. Previous experience is commonly overwritten, which leads to the well-known issue of catastrophic forgetting, especially in the scenario of class-incremental learning (Class-IL). Recently, many lifelong learning methods have been proposed to avoid catastrophic forgetting. However, models that learn without replay of the input data encounter another problem that has been ignored, which we refer to as prior information loss (PIL). During Class-IL training, as the model has no knowledge about the following tasks, it only extracts the features necessary for the tasks learned so far, whose information is insufficient for joint classification. In this paper, our empirical results on several image datasets show that PIL limits the performance of the current state-of-the-art method for Class-IL, the orthogonal weights modification (OWM) algorithm. Furthermore, we propose to combine self-supervised learning, which can provide effective representations without requiring labels, with Class-IL to partly get around this problem. Experiments show the superiority of the proposed method over OWM, as well as over other strong baselines.




1 Introduction

In recent years, deep neural networks have shown remarkable performance in a wide variety of individual tasks, and even surpass human experts in certain fields. However, humans and animals are better at continually acquiring, fine-tuning and transferring knowledge and skills throughout their lifetime, which benefits from a good balance between synaptic plasticity and stability (Abraham2005Memory, ). In a lifelong learning scenario, an intelligent system requires sufficient plasticity to integrate novel information and sufficient stability to avoid significantly interfering with consolidated knowledge. For machine learning and neural network models, lifelong learning represents a long-standing challenge, since continual acquisition of information from non-stationary data distributions generally leads to catastrophic forgetting (Mccloskey1989Catastrophic, ; Robins1993Catastrophic, ; French1993Catastrophic, ), in which new knowledge overwrites old knowledge, leading to a quick, pronounced drop in performance on previous tasks. Catastrophic forgetting has been a key obstacle to sequential learning in deep neural networks.

In the setting of lifelong learning, only data from the current task is available during training and the tasks are assumed to be clearly separated. In general, there are three main types of lifelong learning scenarios (van2019Three, ; Hsu2018Re, ), based on whether task identity is provided at test time and, if not, whether it must be inferred. They are task-incremental learning (Task-IL), domain-incremental learning (Domain-IL) and class-incremental learning (Class-IL). In this paper, we focus on Class-IL since it is the most challenging one. Class-IL includes the common problem of incrementally learning new classes, in which the classes of each task are disjoint and the model must both solve each task seen so far and infer which task it is presented with; in other words, it must distinguish all the classes.

Recently, there have been multiple attempts to mitigate catastrophic forgetting (coop2013ensemble, ; Goodrich2014Unsupervised, ; gepperth2016bio, ; Fernando2017PathNet, ; Lee2017Lifelong, ; Lopez2017Gradient, ). However, many state-of-the-art methods (Li2016Learning, ; Kirkpatrick2017Overcoming, ; Zenke2017Continual, ) for the other two scenarios cannot handle Class-IL (van2019Three, ; Hsu2018Re, ). Basically, there are three types of strategies that work in the Class-IL scenario: exemplar replay (Rebuffi2016iCaRL, ; Nguyen2017Variational, ), generative replay (shin2017continual, ; Kamra2017Deep, ) and gradient projection (he2018overcoming, ; zeng2019continual, ). Exemplar replay is a simple strategy that suppresses catastrophic forgetting by selecting and storing a subset of the input data under certain constraints. However, such methods violate the purpose of lifelong learning, since data of previous tasks should be unavailable. An alternative is to approximate the data distribution of previous tasks with a separate generative model for replay. The performance of generative replay depends on the complexity of the training data, as the generator is also trained in an incremental manner, which is much more difficult than training it jointly (Timothee2019Generative, ). The last kind of method retains previously learned knowledge by keeping the old input-output mappings stable. Orthogonal Weights Modification (OWM) (zeng2019continual, ) is a typical method of this kind and can be seen as the state-of-the-art method for the Class-IL scenario.

Though OWM showed remarkable performance on different datasets, it is confronted with another problem in addition to catastrophic forgetting, which the other two strategies do not face. Given images of one category, humans can observe and extract multiple meaningful features based on their common sense, which lifelong learning models are not equipped with. Their optimization is driven by the classification objective, which does not demand features unnecessary for the current task, even though they might be required in following tasks. We refer to this phenomenon as prior information loss (PIL), as the model lacks prior knowledge of the features necessary for joint classification. For example, a model can easily distinguish dogs from birds by learning to count legs, but when cats appear in the next task, more attributes are required (Figure 1). However, without replay of previous inputs, the model is unable to mine extra information about previous classes, as their data are no longer available. As training goes on, more features are selected, but they are extracted only for current and following classes, and the missing parts affect the joint classification accuracy.

Figure 1: Relying on features learned in previous tasks, e.g., number of legs (red box), the model might confuse previous classes (e.g., dog) with current ones (e.g., cat). The proxy task of predicting rotation requires modeling distinctive shapes, e.g., the head (blue box), in each task, which helps classification.

To solve the problem adequately, a straightforward way is to combine OWM with generative replay. However, due to its own difficulty, generative replay has a negative effect on OWM (unknown, ). Since there is no opportunity to train on the current data later, we need to extract as many features as possible in an unsupervised way as a backup. In this paper, we propose to exploit self-supervised learning (SSL), which can provide effective representations without labels, as an alternative to partly solve the problem. Specifically, we train the model to classify and to predict self-supervised labels in a multi-task learning manner. As the same representations are shared for predicting the original labels and the self-supervised signals, the knowledge acquired is enriched by that learned via SSL. In following tasks, we have more informative features of previous classes for discrimination, while the improvement in accuracy hinges on the relevance between the originally missing features and the selected self-supervised signals.

Our contributions are two-fold. Firstly, we are the first to identify the problem of prior information loss (PIL), another obstacle besides catastrophic forgetting, which applies to models without input replay, among which we take OWM as the backbone. We also design experiments on different datasets to show its effect, which restricts the performance of OWM by a significant margin. Secondly, we combine self-supervised learning with Class-IL to make up for this margin, which is a simple but effective way to extract more valuable features in each task. Experimental results on several datasets show that the proposed method achieves steady improvement over OWM and performs better than state-of-the-art methods of other strategies.

2 Related Work

2.1 Lifelong Learning

Catastrophic forgetting, first explored in the 1980s and 1990s (Mccloskey1989Catastrophic, ; Robins1993Catastrophic, ; French1993Catastrophic, ), has attracted more and more attention in recent years, and lifelong learning has also been extended to unsupervised settings (rao2019continual, ) and semi-supervised settings (smith2019unsupervised, ). For supervised lifelong learning, many approaches have been presented. While not exhaustive, we roughly divide them into five strategies: task-specific, regularization, exemplar replay, generative replay and gradient projection.

The motivation of task-specific methods is that previously acquired knowledge can be preserved by optimizing only part of the network. Several papers use this strategy, with different approaches for selecting the parts of the network for each task, e.g., dynamically expanding (Rusu2016Progressive, ), randomly selecting (Masse2018Alleviating, ), evolutionary algorithms (Mallya2018PackNet, ) or attention (Serr2018Overcoming, ). However, these methods can only be applied to Task-IL, in which task identity is required to select the corresponding parts during test.

Regularization methods, as well as the following methods, do not need task identity, as the entire network is shared among different tasks. Regularization methods avoid interfering with prior knowledge by adding constraints to the weight updates. By estimating the importance of each weight for previous tasks, they impose different penalties on changes to each weight, as in Elastic Weight Consolidation (EWC) (Kirkpatrick2017Overcoming, ) and Synaptic Intelligence (SI) (Zenke2017Continual, ). While these methods are effective in the Task-IL and Domain-IL scenarios, they show poor performance in the Class-IL scenario (Hsu2018Re, ; van2019Three, ).
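The importance-weighted penalty described above can be sketched as follows. This is a minimal illustration, assuming a quadratic penalty in the style of EWC; the function name `ewc_penalty`, the `strength` parameter and the toy numbers are hypothetical and are not taken from the paper.

```python
def ewc_penalty(params, old_params, importance, strength=1.0):
    """Quadratic penalty that discourages changing weights deemed
    important for earlier tasks (assumed EWC-style form)."""
    if not (len(params) == len(old_params) == len(importance)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for theta, theta_old, weight_importance in zip(params, old_params, importance):
        # Each weight is penalized in proportion to its estimated importance.
        total += weight_importance * (theta - theta_old) ** 2
    return 0.5 * strength * total


# Toy example: the first weight is treated as important, the second is not.
old = [1.0, 1.0]
importance = [10.0, 0.1]
print(ewc_penalty([1.5, 1.0], old, importance))  # moving the important weight
print(ewc_penalty([1.0, 1.5], old, importance))  # moving the unimportant weight
```

In training, such a penalty would be added to the loss of the new task, so that the same displacement costs far more on weights with high importance.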

Though violating the purpose of lifelong learning, exemplar replay methods provide a strong baseline by storing data from previous tasks, called "exemplars". A typical example is iCaRL (Rebuffi2016iCaRL, ), which uses a neural network for feature extraction and the nearest mean of exemplars for classification. It manages an exemplar set with a fixed size and combines it with current inputs.
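Classification by the nearest mean of exemplars can be illustrated with a short sketch. It assumes feature vectors have already been produced by some feature extractor and uses plain Euclidean distance; the helper names and the toy features are hypothetical, and details of iCaRL such as feature normalization and exemplar selection are omitted.

```python
import math


def class_means(exemplars):
    """Map each class label to the mean of its exemplar feature vectors.

    exemplars: dict of label -> list of equal-length feature vectors.
    """
    means = {}
    for label, feats in exemplars.items():
        dim = len(feats[0])
        means[label] = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return means


def nearest_mean_predict(feature, means):
    """Return the label whose exemplar mean is closest to the feature vector."""
    return min(means, key=lambda label: math.dist(feature, means[label]))


# Toy 2-D features for two classes.
exemplars = {
    "dog": [[1.0, 0.0], [0.8, 0.2]],
    "bird": [[0.0, 1.0], [0.2, 0.8]],
}
means = class_means(exemplars)
print(nearest_mean_predict([0.7, 0.3], means))
```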

Generative replay methods gain more support from biological evidence, which suggests that the hippocampus is more than a simple experience replay buffer and that reactivation of the memory traces yields rather flexible outcomes (ramirez2013creating, ). The strategy is to sequentially train a separate generative model that approximates the data distributions of all previous tasks. When presented with a new task, generated samples are interleaved with new data to update the generator and classifier. (shin2017continual, ) proposed a model of deep generative replay (DGR) in the generative adversarial networks (GANs) framework (goodfellow2014generative, ), and the complementary learning systems (CLS) theory (Randall2002Hippocampal, ; Kumaran2016What, ) can also be modelled in this way (Kamra2017Deep, ).

Finally, gradient projection methods try to retain learned knowledge by keeping the mappings trained on different tasks fixed, and learn new ones while avoiding conflicts with them. (zeng2019continual, ) achieves this goal by orthogonal weights modification (OWM), in which weights are only allowed to be modified in the direction orthogonal to the subspace spanned by all previous inputs. OWM shows good ability for overcoming catastrophic forgetting, and exhibits superiority in comparison with other methods in Class-IL scenario.

2.2 Self-Supervised Learning

Self-supervised learning (SSL) is applied to improve representations when labeled data is expensive and impractical to scale up. Recent literature (ji2018invariant, ; oord2018representation, ; hjelm2018learning, ; henaff2019data, ; zhang2019aet, ) has shown that semantic representations can be learned by predicting labels obtained from the input signals without any human annotations. Several proxy tasks with different signals have been proposed, including relative position of image patches (doersch2015unsupervised, ), colorization after shuffle (larsson2017colorization, ), image rotations (gidaris2018unsupervised, ) and so forth (dosovitskiy2015discriminative, ; noroozi2016unsupervised, ).
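As an illustration of how a proxy task obtains labels from the input alone, the sketch below builds rotation-prediction samples from a single image. It is a minimal example that assumes a single-channel image stored as a list of rows and the four rotations 0, 90, 180 and 270 degrees; the function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_proxy_samples(image):
    """Return (rotated_image, label) pairs; label k means a rotation of k * 90 degrees."""
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


# A tiny 2x2 "image"; no human annotation is needed to create the labels.
for rotated, label in rotation_proxy_samples([[1, 2], [3, 4]]):
    print(label, rotated)
```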

While originally focused on unsupervised learning, SSL has also been extended to semi-supervised learning, where many methods have achieved state-of-the-art results (dosovitskiy2015discriminative, ; zhai2019s4l, ). Recently, there have also been many attempts on related tasks, e.g., generative adversarial networks (chen2019self, ). However, when combined with fully-supervised learning, SSL is generally used for data augmentation and has a negative effect on accuracy, though it benefits generality, robustness and uncertainty (hendrycks2019using, ).

Commonly, the original task and the proxy task are combined in a multi-task learning strategy while sharing feature representations. (Lee2019Rethinking, ) proposed not to assign the same label to all augmented samples of the same source. Instead, learning the joint distribution of the original labels and the self-supervised signals of augmented samples, and combining the predictions from different augmented samples, helps improve the performance of supervised learning.

3 Methodology

3.1 Formulation and Analysis

In this paper, we study the problem of Class-IL, which is characterized by a sequence of supervised learning tasks, each with its own dataset. The dataset of each task consists of labeled training samples of the corresponding classes, i.e., every label belongs to the class set of that task. Note that the class sets of different tasks are disjoint and the model is trained sequentially, which means that when presented with a task, only the training set of that task is available. After training on all tasks, the model is evaluated on test data of all classes that have been learned.
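The task structure above can be made concrete with a small sketch. It assumes, purely for illustration, that classes are assigned to tasks in label order and in equal-sized groups; the paper does not state how classes are assigned, and the function names are hypothetical.

```python
def split_classes(num_classes, num_tasks):
    """Partition class labels 0..num_classes-1 into disjoint, equal-sized class sets."""
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    per_task = num_classes // num_tasks
    return [list(range(t * per_task, (t + 1) * per_task)) for t in range(num_tasks)]


def task_subset(samples, task_classes):
    """Keep only the (x, y) pairs whose label belongs to the current task."""
    allowed = set(task_classes)
    return [(x, y) for x, y in samples if y in allowed]


# Ten classes split into five tasks of two classes each.
class_sets = split_classes(10, 5)
print(class_sets)

# During task 1 only samples of classes 2 and 3 are visible to the model.
toy_data = [("img_a", 0), ("img_b", 2), ("img_c", 3), ("img_d", 9)]
print(task_subset(toy_data, class_sets[1]))
```

At test time, by contrast, samples from every class set seen so far are evaluated together.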

Neural networks for the Class-IL scenario all share the same output layer across tasks, as task identity is not provided during test. For convenience of discussion, we divide these models into two parts: the classifier, which corresponds to the last fully-connected layer, and the feature extractor, which corresponds to all previous layers. After training on a task, the output of such a model is the classifier applied to the features produced by the feature extractor, each part using the parameters learned up to that task. From this perspective, previous works on alleviating catastrophic forgetting can be divided into two categories, regardless of their implementation:

  • When presented with a new task, replay-based methods actively modify their representations of previous inputs to distinguish them from the current classes, i.e., the features extracted from previous inputs change after the update. Therefore, the classifier also has to relearn the mapping from feature space to classes, based on the latest feature distribution, sampled from both current real data and exemplars or generated pseudo-data.

  • On the other hand, regularization and subspace methods try to keep the features extracted from previous data stable, i.e., approximately unchanged after the update. However, previous representations were learned to distinguish only the classes seen at that time. Without update, their information is insufficient for the subsequent classification requirements, which we refer to as prior information loss (PIL).

To formalize PIL, consider the universal set of disentangled features that can be extracted from a sample. In joint classification, there may be several different subsets of this universal set that satisfy the task without redundancy, and we refer to them as eligible subsets. Ignoring the capacity limit of the feature extractor, joint training yields a feature set covering at least one eligible subset. With multiple sequential tasks, ideally we would also obtain such a feature set while training on each task and retain it until the end. However, besides catastrophic forgetting, another problem is that the model has no idea about the eligible subsets, although they may be common sense for human beings. When training on the first task, we obtain a feature set that is covered by some eligible subsets. Generally, the two are not equal unless the first task is as difficult as joint classification. Suppose catastrophic forgetting does not occur in the following procedure. When training on task t, more attributes are added to the learned feature set, while those of samples from old classes remain unknown. Without loss of generality, we assume these added attributes also belong to the eligible subset, except the redundant ones. Finally, for each task we obtain a feature set that differs from the eligible subset, and this difference set affects the classification accuracy.

3.2 Combination with self-supervised learning

PIL seems intractable when the universal feature set is large and we can only obtain a limited number of its features. To a certain extent, we can make up for this problem by imposing prior information about which attributes might be important. We propose that self-supervised learning (SSL) (larsson2017colorization, ; gidaris2018unsupervised, ) is a simple but effective way to achieve this goal, since in the training procedure of the Class-IL scenario, representation learning without any information about following tasks can also be seen as a form of unsupervised learning, which is the strength of SSL. For a selected SSL proxy task with a given set of transformations, and given a transformed sample, the model learns to predict which transformation was applied. Supervised classification and self-supervised learning are trained at the same time during each task, in a multi-task manner. They share the same feature extractor, while SSL has a separate output layer with its own parameters. With cross-entropy as the loss function for both outputs, the overall loss combines the classification loss on the original labels with the self-supervised loss on the transformation labels, balanced by a hyper-parameter which decreases when switching to the next task. As the feature extractor is shared between the two parts, the learned features now comprise both those required by supervised learning and those required by self-supervised learning. The improvement in final classification accuracy relies on the intersection between the features learned via SSL and the features that would otherwise be missing. Note that data augmentation is not applied, as there is no need to improve the generalization ability in the case of under-fitting. Besides, we need not alleviate catastrophic forgetting of SSL, as its output layer is not used for prediction during test.
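The multi-task objective can be sketched as below. This is an illustration under stated assumptions rather than the paper's exact formula: the balancing hyper-parameter is assumed to scale the self-supervised term, and the geometric decay in `ssl_weight_for_task` is a hypothetical schedule, since the text only says that the hyper-parameter decreases from one task to the next.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer target index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multitask_loss(class_logits, class_label, ssl_logits, ssl_label, ssl_weight):
    """Supervised cross-entropy plus a weighted self-supervised cross-entropy.

    class_logits come from the classifier head, ssl_logits from the separate
    SSL head; both heads are assumed to share one feature extractor.
    """
    supervised = cross_entropy(class_logits, class_label)
    self_supervised = cross_entropy(ssl_logits, ssl_label)
    return supervised + ssl_weight * self_supervised


def ssl_weight_for_task(task_index, initial_weight=1.0, decay=0.5):
    """Hypothetical schedule: the SSL weight shrinks with each new task."""
    return initial_weight * decay ** task_index


# One sample: 3-way classification plus 4-way rotation prediction.
weight = ssl_weight_for_task(task_index=1)
print(multitask_loss([2.0, 0.5, -1.0], 0, [0.1, 1.2, 0.0, -0.3], 1, weight))
```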

3.3 Training with OWM

We optimize the SSL output layer with gradients computed by back propagation (BP), while the feature extractor and the classifier are optimized with the OWM algorithm (zeng2019continual, ). Here we give a brief introduction to the OWM algorithm. For an FC layer with weight matrix $W$, the key to overcoming catastrophic forgetting in sequential learning is the orthogonal projector defined on its input space for learned tasks. In general, the projector is defined as $P = I - A (A^{\top} A + \alpha I)^{-1} A^{\top}$ for training task $t$. Matrix $A$ consists of all previously trained input vectors, which span the trained input space, as its columns, e.g., $A = [x_1, \dots, x_n]$, and $\alpha$ denotes a small constant for avoiding the ill-conditioning problem in the matrix-inverse operation. Moreover, $P$ can be updated recursively based on the current input and its previous value, which greatly reduces the complexity of calculation.

Given $P$ for the current task and the gradient computed by back propagation $\Delta W^{BP}$, the gradient is modified to $\Delta W = P \, \Delta W^{BP}$. For an input $x$ of previous tasks, which lies in the space spanned by the columns of $A$, we have:

$$\Delta W^{\top} x = (\Delta W^{BP})^{\top} P \, x \approx 0,$$

which means that the output is approximately invariant after training on the new task. For mini-batch training, $P$ can also be updated successively after the current task is completed. Each time we calculate the mean $\bar{x}_i$ of the inputs for batch $i$, then $P$ can be calculated iteratively as follows:

$$P_i = P_{i-1} - k_i \, \bar{x}_i^{\top} P_{i-1},$$

in which $k_i = P_{i-1} \bar{x}_i / (\alpha + \bar{x}_i^{\top} P_{i-1} \bar{x}_i)$ and $P_0$ is the projector used for the current task; after $n$ iterations, we have the projector $P_n$ for the next task. This algorithm can also be extended to convolutional neural networks (CNNs).
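The recursive projector update and the gradient projection can be sketched in plain Python. This is a minimal illustration following the recursive update described for OWM, not the authors' implementation; it assumes a weight matrix of shape inputs × outputs so that the projector multiplies the gradient from the left, and all function names are hypothetical.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_vec(matrix, vector):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def update_projector(P, x_mean, alpha=1e-3):
    """One recursive step: P <- P - k (x^T P), with k = P x / (alpha + x^T P x)."""
    Px = mat_vec(P, x_mean)  # equals (x^T P)^T because P is symmetric
    denom = alpha + sum(a * b for a, b in zip(x_mean, Px))
    k = [v / denom for v in Px]
    n = len(P)
    return [[P[i][j] - k[i] * Px[j] for j in range(n)] for i in range(n)]


def project_gradient(P, grad):
    """Left-multiply a BP gradient (inputs x outputs) by the projector P."""
    n_out = len(grad[0])
    columns = [[row[j] for row in grad] for j in range(n_out)]
    projected = [mat_vec(P, col) for col in columns]
    return [[projected[j][i] for j in range(n_out)] for i in range(len(grad))]


# After "learning" the input direction [1, 0, 0], updates along it are suppressed.
P = update_projector(identity(3), [1.0, 0.0, 0.0])
grad = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in project_gradient(P, grad):
    print(row)
```

In the toy run, the first row of the projected gradient (the part that would change the output for the old input) shrinks to nearly zero, while the rows for unused input directions pass through unchanged.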

4 Experiments

In this section, we conduct several experiments in the Class-IL scenario to show the impact of PIL and the applicability of our proposed method, which is referred to as OWM+SSL in the following text. In Section 4.1, we introduce our datasets, neural network structure and baseline settings. In Section 4.2, we evaluate our method against several state-of-the-art baselines of different strategies. In Section 4.3, we explore the upper bound of OWM by training with distillation of jointly pre-trained representations, to observe its performance without PIL. Finally, in Section 4.4, we explore how different SSL usages and choices of proxy tasks affect the result.

4.1 Settings

We use three image datasets collected from real world:

  • SVHN: digit images from house numbers, contains 32x32 colour images in 10 classes. There are 73257 training images and 26032 test images.

  • CIFAR10: consists of 60000 32x32 colour images in 10 classes of common objects, with 6000 images per class. There are 50000 training images and 10000 test images.

  • CIFAR100: just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.

We conduct Class-IL experiments with 5 tasks for SVHN and CIFAR10, and 2/5/10 tasks for CIFAR100. For each dataset, a subset of the test set is randomly selected for validation, while the rest is used as the final test set. The neural network is composed of a 3-layer CNN with 64, 128 and 256 2×2 filters and a 3-layer MLP with 1000 hidden units. The number of filters is doubled for CIFAR100. We compare the following methods, which share the same neural network architecture:

  • EWC: a classical regularization method.

  • iCaRL: a strong baseline of exemplar replay, with a memory budget of 2000.

  • DGR: a separate AC-GAN (Wu2018Memory, ) is trained for generative replay. The generator and discriminator have 3 deconvolution and 3 convolution layers, respectively. The replayed images are labeled with the most likely class predicted by a copy of the main model stored after training on the previous task.

  • OWM: state-of-the-art method for Class-IL scenario, as well as the basis of our method.

4.2 Classification Results

In this subsection, we report in Table 1 the test accuracies on all the datasets after all tasks are learned. For the proposed OWM combined with SSL (OWM+SSL), we use predicting the rotation of images (gidaris2018unsupervised, ) as the proxy task. As concluded in (Hsu2018Re, ; van2019Three, ), EWC totally fails in the Class-IL scenario. Its correct predictions are almost all from the last task, which means that previously learned knowledge is totally forgotten. In most settings, iCaRL and DGR show comparable results, except on 5-task CIFAR10. On CIFAR100, the performance of both iCaRL and DGR falls faster than that of OWM as the number of tasks increases, since it becomes more difficult for them to approximate the data distribution of all learned classes. Compared with the other baselines, the original OWM shows a clear superiority in all settings.

Method SVHN CIFAR10 CIFAR100 CIFAR100 CIFAR100
(5 tasks) (5 tasks) (2 tasks) (5 tasks) (10 tasks)
EWC (Kirkpatrick2017Overcoming, )
iCaRL (Rebuffi2016iCaRL, )
DGR (shin2017continual, )
OWM (zeng2019continual, )
OWM+SSL (ours)
Table 1: Test accuracies on all datasets after all tasks are learned.

Based on OWM, the proposed OWM+SSL brings steady improvement, which is especially obvious on SVHN and CIFAR10. In these settings, as there are relatively few classes in each task, the phenomenon of PIL is more serious, which can lead to a greater decline in test accuracy. At the same time, the number of classes for joint testing is relatively small, which demands fewer features for discrimination. Thus, the proportion of the supplement from SSL is larger, which contributes to classification.

When it comes to fine-grained classification on CIFAR100, many more detailed attributes are needed, most of which are not necessary for predicting the rotations of images. While a single proxy task helps extract more useful features, it is not enough to cover a considerable proportion of the missing parts. In other words, there is still much room for improvement. To tackle this problem, combining different proxy tasks is perhaps a feasible option.

4.3 Upper Bound of OWM

In this subsection, we explore the upper bound of OWM without PIL. First, we pre-train a model with the same structure in the joint learning scenario and store its feature extractor as the teacher model. Then we train a model in the Class-IL scenario using OWM, and during each task, its feature extractor learns to extract all the necessary features from the teacher by distillation, while its classifier learns to classify. The loss function combines the classification loss with the mean squared error between the features of the model and those of the teacher, balanced by a hyperparameter. In such a setting, the model is able to capture the same features as the teacher, thus there is no PIL. Compared with joint learning, the decline in classification accuracy is attributed to catastrophic forgetting and the information loss of distillation. Here we take it as an approximation of the upper bound of OWM, though the bound should be higher if there were no loss in distillation or if the model could learn these representations by itself.
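The feature-distillation objective can be sketched as follows. This is an illustration under assumptions, not the paper's exact loss: the hyperparameter is assumed to scale the mean squared error term, the features are plain lists of floats, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer target index."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[target]


def mean_squared_error(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(class_logits, label, student_features, teacher_features, weight):
    """Classification loss plus a weighted feature-matching term.

    teacher_features are assumed to come from the frozen, jointly
    pre-trained feature extractor for the same input.
    """
    return (cross_entropy(class_logits, label)
            + weight * mean_squared_error(student_features, teacher_features))


# The feature term vanishes when the student reproduces the teacher's features.
print(distillation_loss([1.0, 0.0], 0, [0.2, 0.4], [0.2, 0.4], weight=10.0))
print(distillation_loss([1.0, 0.0], 0, [0.2, 0.4], [0.0, 1.0], weight=10.0))
```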

Method SVHN CIFAR10 CIFAR100 CIFAR100 CIFAR100
(5 tasks) (5 tasks) (2 tasks) (5 tasks) (10 tasks)
OWM (zeng2019continual, )
Joint Training
Table 2: Results of OWM with feature distillation.

In this experiment, the hyperparameter is set to one value for SVHN and CIFAR10 and to another for CIFAR100 to rescale the losses. The results are shown in Table 2. We report the classification accuracies after joint training, which can be seen as the upper bound for lifelong learning with the same neural network. The jointly trained models are then used as the teacher models for OWM with feature distillation (OWM+FD). There is an evident gap between OWM and OWM+FD in all settings, which shows the significant impact of PIL, while the gap between OWM+FD and the teacher model is less obvious. These results show that OWM has achieved relative success in alleviating catastrophic forgetting but is restricted by PIL to a greater extent, and that SSL helps make up part of it.

Figure 2: Test accuracy on all learned tasks after each task. Panels: (a) SVHN (5 tasks), (b) CIFAR10 (5 tasks), (c) CIFAR100 (2 tasks), (d) CIFAR100 (5 tasks), (e) CIFAR100 (10 tasks).

From the results on CIFAR100, it can be concluded that the disparity between OWM and OWM+FD widens as the number of tasks increases. This is in line with our expectations and consistent with the severity of PIL. We also plot the training curve of each setting in Figure 2, which shows test accuracy on all learned tasks after each task. During training, the gap between OWM and OWM+FD expands gradually, as more missing information is exposed in following tasks. However, the rate of expansion generally falls, as more and more features are extracted and the average amount of missing information decreases. The above results verify our analysis of PIL in Section 3.1.

4.4 Study of SSL

In this section, we explore different ways for SSL to assist Class-IL. Conventionally, SSL is used for data augmentation in supervised learning. This usage is not adopted in our method because data augmentation usually hurts classification performance. Another alternative is inspired by self-supervised data augmentation with aggregation (SDA+AG) (Lee2019Rethinking, ), which improves the result of the original task. As it may not directly apply to lifelong learning, we attempt to modify it as follows:

Besides, the choice of proxy task is decisive for the improvement, as discussed in Section 3.1. On SVHN and CIFAR10, where a single proxy task can make a difference, we compare results with rotation (gidaris2018unsupervised, ) and RGB shuffle (larsson2017colorization, ) as transformations, combined with the above OWM+SSL and with OWM using the modified SDA+AG, referred to as OWM+SAA, as shown in Table 3.

Strategy Transformation SVHN CIFAR10
(5 tasks) (5 tasks)
OWM (zeng2019continual, ) -
OWM+SSL Rotation
OWM+SAA Rotation
Table 3: Results of different methods of SSL.

With each form of SSL strategy, predicting rotation improves accuracy on both datasets, while predicting the shuffle of RGB channels only brings benefits on CIFAR10. This conclusion is in line with our intuition, as the former requires shape features, which overlap with the requirements of classification, while the latter models color, which has no significance for digit recognition and only increases the training burden. When the transformation signal is useful, OWM+SSL achieves better results than base OWM and OWM+SAA, showing that it can maximize the strengths of SSL. However, the negative effect of the added training burden is also enlarged when an unsuitable signal is chosen, e.g., RGB shuffle on SVHN. In comparison, the performance of OWM+SAA is more stable. In the improper setting, its accuracy is approximately equal to that of OWM, which means that the negative effects of SSL and data augmentation can be counteracted by the positive effect of aggregated prediction. However, this stability also restricts the effect of SSL when it is beneficial, compared with OWM+SSL.

5 Conclusion, Future Work and Broader Impact

In this paper, we propose and explore the problem of prior information loss (PIL), which has been ignored but has a significant negative effect on models for continual learning. Our empirical results show that by mitigating this problem, the upper bound of Class-IL can be clearly improved over current approaches. Besides, we combine lifelong learning with self-supervised learning and show that it is an effective way to alleviate PIL. However, there is still considerable room for improvement, as performance relies on the selection of the proxy task. How to design or select different self-supervised learning signals remains a challenge. Moreover, the combination of different signals is also a direction worth exploring, which is essential for solving difficult tasks and approaching the upper bound. Finally, this work does not present any foreseeable societal consequence.


  • (1) Wickliffe C. Abraham and Anthony Robins. Memory retention – the synaptic stability versus plasticity dilemma. Trends in Neurosciences, 28(2):0–78, 2005.
  • (2) Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12154–12163, 2019.
  • (3) Robert Coop, Aaron Mishtal, and Itamar Arel. Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting. IEEE transactions on neural networks and learning systems, 24(10):1623–1634, 2013.
  • (4) Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
  • (5) Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 38(9):1734–1747, 2015.
  • (6) Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. 2017.
  • (7) Robert M. French. Catastrophic interference in connectionist networks: Can it be predicted, can it be prevented? In Advances in Neural Information Processing Systems, 1993.
  • (8) Alexander Gepperth and Cem Karaoguz. A bio-inspired incremental learning architecture for applied perceptual problems. Cognitive Computation, 8(5):924–934, 2016.
  • (9) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • (10) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • (11) Ben Goodrich and Itamar Arel. Unsupervised neuron selection for mitigating catastrophic forgetting in neural networks. In IEEE International Midwest Symposium on Circuits & Systems, 2014.
  • (12) Xu He and Herbert Jaeger. Overcoming catastrophic interference using conceptor-aided backpropagation.
  • (13) Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
  • (14) Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pages 15637–15648, 2019.
  • (15) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • (16) Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. 2018.
  • (17) Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
  • (18) Nitin Kamra, Umang Gupta, and Yan Liu. Deep generative dual memory network for continual learning. 2017.
  • (19) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, and Agnieszka Grabska-Barwinska. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • (20) Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, 2016.
  • (21) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6874–6883, 2017.
  • (22) Hankook Lee, Sung Ju Hwang, and Jinwoo Shin. Rethinking data augmentation: Self-supervision and self-distillation. 2019.
  • (23) Jeongtae Lee, Jaehong Yun, Sungju Hwang, and Eunho Yang. Lifelong learning with dynamically expandable networks. 2017.
  • (24) Timothee Lesort, Hugo Caselles-Dupre, Michael Garcia-Ortiz, Andrei Stoian, and David Filliat. Generative models from the perspective of continual learning. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
  • (25) Zhizhong Li and Derek Hoiem. Learning without forgetting. 2016.
  • (26) David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. 2017.
  • (27) Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • (28) Nicolas Y. Masse, Gregory D. Grant, and David J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 2018.
  • (29) M. McCloskey. Catastrophic interference in connectionist networks: The sequential learning problem. 1989.
  • (30) Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. 2017.
  • (31) Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  • (32) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • (33) Steve Ramirez, Xu Liu, Pei-Ann Lin, Junghyup Suh, Michele Pignatelli, Roger L Redondo, Tomás J Ryan, and Susumu Tonegawa. Creating a false memory in the hippocampus. Science, 341(6144):387–391, 2013.
  • (34) Randall C. O’Reilly and Kenneth A. Norman. Hippocampal and neocortical contributions to memory: advances in the complementary learning systems framework. Trends in Cognitive Sciences, 2002.
  • (35) Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7645–7655, 2019.
  • (36) Sylvestre Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. 2016.
  • (37) Anthony Robins. Catastrophic forgetting in neural networks: the role of rehearsal mechanisms. In New Zealand International Two-stream Conference on Artificial Neural Networks & Expert Systems, 1993.
  • (38) Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. 2016.
  • (39) Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. 2018.
  • (40) Gehui Shen, Song Zhang, Xiang Chen, and Zhi-Hong Deng. Generative feature replay with orthogonal weight modification for continual learning. 2020.
  • (41) Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
  • (42) James Smith and Constantine Dovrolis. Unsupervised progressive learning and the STAM architecture. arXiv preprint arXiv:1904.02021, 2019.
  • (43) Gido M van de Ven and Andreas S Tolias. Three scenarios for continual learning. 2019.
  • (44) Chenshen Wu, Luis Herranz, Xialei Liu, Yaxing Wang, Joost van de Weijer, and Bogdan Raducanu. Memory replay GANs: learning to generate images from new categories without forgetting. 2018.
  • (45) Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
  • (46) Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. 2017.
  • (47) Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476–1485, 2019.
  • (48) Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2547–2555, 2019.