The recent development of deep learning has achieved great success in a broad range of computer vision tasks. However, it is still far away from the goal of artificial general intelligence (AGI). In the real world, artificial intelligence (AI) visual systems are usually exposed to dynamic environments where new visual concepts that need to be learned emerge over time. In order to move artificial visual systems closer to general intelligence, it is essential to endow them with the capability of continual or lifelong learning, which means continually learning over time by accommodating new knowledge while retaining previously learned experience.
In the traditional learning scenario, successful deep neural networks (DNNs) are usually trained using stochastic gradient descent (SGD) optimization in batch mode, where all training data are given at the same time and all classes are known in advance. However, in continual learning scenarios, we have a neural network capable of recognizing some learned visual concepts and want to extend its capability to more visual concepts when given new concept data. In practice, the most pragmatic way to achieve this goal is to merge the data of learned concepts with the data of new concepts and retrain a deep neural network from scratch. Nevertheless, this methodology is quite inefficient, since it requires large storage space and long retraining time, and in some cases the previously learned concept data is no longer available. If we directly train a trained neural network on new concept data, its performance on previously learned concepts will significantly decrease. This phenomenon is known as catastrophic forgetting or catastrophic interference [2, 3]: training a model using only new information, without the old information, can lead to severe performance degradation on previously learned knowledge. Therefore, developing more flexible and intelligent methodologies to overcome the catastrophic forgetting problem and achieve the continual learning goal is considerably significant and meaningful. Furthermore, gradually increasing the recognition ability of neural networks is a critical step in making learning systems move closer to AGI.
Humans usually exhibit a strong ability to continually learn and accumulate knowledge throughout their lifespan. In particular, when they learn new knowledge, the new knowledge does not interfere with previous knowledge. The catastrophic forgetting problem therefore does not occur in human learning systems, which is a crucial characteristic of intelligent behavior. This is due to the special neurophysiological and biological mechanisms of learning and memory in the human brain, which have been widely studied [4, 5]. Among these studies, one important line of work concerns the stability-plasticity dilemma, which refers to the extent to which a learning system must be plastic in order to integrate novel information, yet stable in order not to severely interfere with consolidated knowledge. Lifelong learning in the human brain is mediated by a rich set of neurophysiological principles that regulate the stability-plasticity balance of the different brain areas. Another important line of work is the complementary learning systems (CLS) theory, which illustrates the significant contribution of dual memory systems involving the hippocampus and the neocortex in learning and memory. It suggests that there are specialized mechanisms in the human cognitive system for protecting consolidated knowledge. The hippocampal system rapidly encodes recent experiences and exhibits short-term adaptation, while long-term learning happens in the neocortical system through activation synchronized with multiple replays of the encoded experience. These studies of learning and memory in the human brain have motivated many previous continual learning approaches [7, 8, 9]. Because of the essential difference between the learning mechanisms of biological and artificial neural networks, it is difficult to design one-to-one counterparts of the hippocampal and neocortical systems in artificial neural network learning systems.
However, it is clear that the plasticity-stability balance and a memory retrieval mechanism are indispensable ingredients of human learning.
According to the different forms of deep neural networks, we can categorize the memory in artificial neural networks into three different types: implicit memory, explicit memory and recall memory. The implicit memory refers to the learned neural weights of DNNs. Although a single connection weight may not carry any concrete meaning, the whole set of learned neural weights in a neural network exhibits the functionality for solving a specific task. The implicit memory is embedded in the neural weights and represents a latent form of learned knowledge. The explicit memory refers to the external memory augmenting deep neural networks [10, 11, 12]. Compared with the implicit memory, the explicit memory usually has a definite structure, such as a module of sequential vectors or a matrix. This kind of memory supports explicit write and read operations to store and access information. The recall memory refers to the concrete memory information of previously learned knowledge. This kind of memory can be represented by generative models [13, 14] which can recall pseudo learned information.
In order to make deep neural networks capable of continually learning to recognize more visual concepts, we should consider both the plasticity-stability balance while learning new concepts and how to utilize the different types of memory to alleviate the catastrophic forgetting problem in neural network models. Therefore, in this paper, we propose an incremental concept learning framework based on the characteristics of the different types of artificial neural network memory. It consists of two main components: an incremental concept learning network (ICLNet) and an incremental generative memory recall network (RecallNet).
The ICLNet is a remoulded feedforward neural network. It consists of a trainable feature extractor and a concept memory matrix, which can be regarded as implicit memory and explicit memory respectively. During incremental concept learning, we gradually add new concept vectors to the concept memory matrix to make the neural network capable of recognizing new concepts. Traditionally, the classifier of a deep neural network is a fully connected layer with a softmax activation function trained with the cross entropy loss. However, training in this way leads to large feature variance [15, 16], which results in larger weight variation and harms the implicit memory during continual learning. Hence, we propose a concept contrastive loss for ICLNet which forces the learned features into a compact representation. In this way, the learning capability of the neural network is not limited and its ability to preserve old concepts is improved. Hence, the plasticity-stability dilemma is balanced when continually training neural networks.
The RecallNet aims to continually retrieve old concept memories while learning new concepts, and to incrementally consolidate new concept memories as well as old ones. We train conditional generative adversarial networks (cGANs) with the least square loss as the RecallNet, which is regarded as recall memory. By using cGANs, we can directly recall pseudo samples of old concepts. When training cGANs in incremental concept learning scenarios, we also confront forgetting problems, since existing generative models are still not fully satisfactory. In order to maximally maintain old concept memories while incrementally training cGANs, we propose a balanced online recall strategy. In addition, this strategy reduces the storage cost by not storing pseudo samples and can also help other pseudo-rehearsal methods in continual learning.
II Related Work
In recent years, research into continual learning has become fairly compelling and has gradually gained widespread attention. The catastrophic forgetting effect is still a fundamental limitation for neural networks to continually learn new knowledge. It occurs when the data distribution of new tasks to learn is significantly different from previously observed data. The main reason for the catastrophic forgetting phenomenon is that classical deep neural networks are trained with gradient-based optimization algorithms which adapt the neural weights in the network to fit the current data distribution and loss function. In continual learning scenarios, the data distribution changes over time. When the neural network fits the new data distribution, the previously learned knowledge in the shared representational neural weights of the network is overwritten by new information. Recently, many methods have been proposed for alleviating the catastrophic forgetting problem.
II-A Continual Learning Strategies
From the perspective of learning strategies to alleviate catastrophic forgetting problem, recent methods can be roughly divided into four categories.
Regularization methods. The regularization methods alleviate catastrophic forgetting by imposing different kinds of constraints while updating the neural weights. One kind of regularization method is the elastic weight consolidation (EWC) model and its follow-up methods IMM and R-EWC, which estimate the importance weights of a neural network for previous tasks and impose constraints on the corresponding parameters while learning new tasks. The limitations of this approach are that the shared parameters may contain conflicting constraints for different tasks and that the importance weight matrix must be pre-estimated and stored. Another well known regularization method is distillation. These methods generally use additional objective terms to regularize changes in the mapping function of a neural network. Learning without forgetting (LwF) enforces that the predictions of a neural network on previously learned tasks remain similar while learning new tasks by using knowledge distillation. Follow-up works propose to use distillation on hidden activations, preserved old task data [27, 9] or additional datasets as regularization. However, sequentially training with regularization terms usually limits the learning capability of neural networks to some extent.
Dynamic architecture methods. Dynamic architecture methods change the architecture of neural networks to allocate new neural resources for learning new knowledge. The progressive networks approach proposed an architecture with explicit support for transfer learning across sequences of tasks. It expands the architecture by allocating new network branches with fixed capacity to learn new information. The dynamically expanding network (DEN) can incrementally learn new tasks by increasing the number of trainable parameters. DEN is efficiently trained in an online manner by performing selective retraining, and automatically determines how much network capacity to expand. The fixed expansion layer (FEL) network mitigates forgetting by selectively updating neural weights in the network rather than expanding them during continual learning. Compared with dynamic architecture methods, our approach only dynamically adds new concept vectors to the concept memory matrix.
Rehearsal based methods. The rehearsal based approaches select and store real representative samples from previous tasks' training data, which can maintain information about the past. ICaRL proposed a practical strategy for simultaneously learning a nearest-mean-of-exemplars classifier and a feature representation in the class incremental learning setting. ICaRL uses herding to create a fixed-size representative set of previous task data samples and knowledge distillation to maintain past knowledge. Gradient episodic memory (GEM) and related methods also require part of the old data. The main differences lie in how the representative exemplars are selected and how they are used to help maintain knowledge of past tasks. The limitations of rehearsal based approaches are that they violate the data availability assumption and require additional storage space. The continual learning performance is influenced by the quality and quantity of the stored representative exemplars. Furthermore, in some real world applications, due to privacy and legal concerns, real data may not be allowed to be stored for a long period of time.
Pseudo-rehearsal based methods. The pseudo-rehearsal based methods [7, 8, 20, 9] take inspiration from the CLS theory and make use of promising generative models to learn the data distribution of previous tasks. While continually learning new tasks, the old task knowledge is complemented by memory replay from the learned generative models. Deep generative replay (DGR), Memory Replay GANs and related methods utilize variants of generative adversarial networks (GAN) as generative memory. The deep generative dual memory network (DGM) utilizes a variational autoencoder (VAE) as the long-term memory which stores information of all previously learned tasks. Our approach belongs to the pseudo-rehearsal based methods. In order to maintain the information of generative models while sequentially learning new tasks, we propose a balanced online recall strategy, which is general and can be applied to other pseudo-rehearsal approaches.
II-B Continual Learning Scenarios
Although promising results are reported by many methods, the experimental protocols and continual learning scenarios differ across their evaluations, and different continual learning scenarios have different difficulty levels, yet all of them confront catastrophic forgetting. Therefore, from the perspective of learning scenarios, we can distinguish three distinct continual learning scenarios.
Task incremental learning scenario. In this scenario, a new output-layer should be appended to the neural network while learning a new task. But during inference, the task identity should be provided to select which output-layer to use.
Domain incremental learning scenario. In this scenario, the same output-layer is shared while learning a new task. Therefore, for different learning tasks, they are constrained to have the same number of output units. And during inference, the task identity is also required.
Class incremental learning scenario. In this scenario, the model can incrementally learn to recognize new classes and does not require the task identity during inference. There is only one incremental output-layer which continually increases its number of output units while learning new classes. In this paper, we use the terms incremental concept learning and class incremental learning interchangeably.
In this section, we describe the main components of our approach and explain how we design them. In Sections III-A and III-B, we describe the details of the incremental concept learning network (ICLNet) and the incremental generative memory recall network (RecallNet). In Section III-C, we describe the balanced online recall strategy which helps to improve the continual learning ability of RecallNet. Finally, in Section III-D, we summarize the whole training procedure in an algorithm framework.
III-A Incremental Concept Learning Network
From the perspective of training data in continual learning scenarios, we face two major challenges: (a) the new concept data has a different distribution than the old concept data, and (b) intact old concept data is usually not available while learning new concepts. From the perspective of the training algorithm, we usually use stochastic gradient descent (SGD) to update the neural weights, which adjusts the functionality of the neural network on the current training data. These two aspects are the main reasons why catastrophic forgetting happens when continually training deep neural networks. In order to balance plasticity and stability in deep neural networks and alleviate the catastrophic forgetting problem, we reinterpret DNNs and utilize their implicit and explicit memory.
Classical deep feedforward neural networks for classification tasks usually consist of a trainable feature extractor and a classifier. The output classifier is usually a fully connected layer which produces raw logit values for each category. Followed by a softmax activation layer, the network can predict the probability of each learned category for an input sample. If the final fully connected classifier does not contain biases, it is equivalent to computing the inner product between the input sample feature and each learnable vector in the weight matrix. Since the result of an inner product is unbounded, training a classification network in this way leads the feature extractor to produce features with large intra-class variance [15, 16], as shown in the first row of Figure 2. However, with large variance of the previous concept features, the neural network weights become less stable while learning the non-stationary data distribution of continual learning scenarios. The reason is that larger-variance features cause greater changes in the neural weights while adapting to new concept data, so the old concept knowledge tends to be forgotten more. Previous methods [24, 27, 9] use a distillation loss to keep the learner's ability on previous concepts, but these methods require precomputed soft targets of replayed samples or representative samples, which increases the computational cost.
Because of the previously mentioned concerns, we remould the neural network structure as ICLNet, which is specifically designed for class incremental learning, as shown in Figure 1. We denote the ICLNet with $K$ learned concepts as $F_K$. As illustrated in Figure 1, the ICLNet consists of a trainable feature extractor $f_\theta$ and a dynamically growing concept memory matrix $M \in \mathbb{R}^{K \times d}$. The trainable feature extractor is an ordinary deep neural network, such as a multi-layer convolutional neural network, and its neural weights can be regarded as the implicit memory. Each row vector $m_k$ in the concept memory matrix represents a concept prototype the network has learned; we regard the memory matrix as the explicit memory. When the ICLNet needs to learn new concepts, the concept memory matrix dynamically grows by the corresponding number of concept vectors. Unlike the external memory in NTM, which learns a differentiable controller to read and write memory, we directly update the concept memory matrix via gradients.
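The dynamically growing concept memory matrix can be sketched in a few lines. This is an illustrative NumPy mock-up, not the paper's implementation; the class name, initialization scale, and method names are our own assumptions.

```python
import numpy as np

class ConceptMemory:
    """Minimal sketch of a dynamically growing concept memory matrix.

    Each row is a trainable prototype vector for one learned concept.
    In a real system these rows would be updated by gradients alongside
    the feature extractor.
    """

    def __init__(self, feature_dim):
        self.feature_dim = feature_dim
        # no concepts learned yet: an empty (0, d) matrix
        self.matrix = np.empty((0, feature_dim))

    def add_concepts(self, n_new, rng=None):
        """Append n_new small randomly initialized concept vectors."""
        if rng is None:
            rng = np.random.default_rng(0)
        new_rows = rng.normal(scale=0.01, size=(n_new, self.feature_dim))
        self.matrix = np.vstack([self.matrix, new_rows])

    @property
    def num_concepts(self):
        return self.matrix.shape[0]
```

For example, starting from an empty memory and learning two class incremental tasks of two concepts each would grow the matrix from 0 to 2 to 4 rows.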
In our approach, we compute the distance between a sample feature and each vector in the concept memory matrix to predict the concept of the sample: the closer the feature is to a concept vector, the higher the probability that the sample belongs to the corresponding concept. Specifically, we use the Euclidean distance to measure similarity. The probability that a sample $x_i$ belongs to concept $k$ is computed via a softmax function:

$$p(y = k \mid x_i) = \frac{\exp\!\left(-\lVert f_\theta(x_i) - m_k \rVert_2^2\right)}{\sum_{j=1}^{K} \exp\!\left(-\lVert f_\theta(x_i) - m_j \rVert_2^2\right)}$$

where $k \in \{1, \dots, K\}$, $K$ is the number of learned concepts of ICLNet, and $N$ is the number of training samples. We use the cross entropy loss as our objective function:

$$\mathcal{L}_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y = y_i \mid x_i)$$

where $y_i$ is the groundtruth label of sample $x_i$.
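As a concreteness check, the distance-based softmax classification above can be sketched in NumPy. This is a hedged illustration (array shapes and function names are our own); a real implementation would operate on autograd tensors.

```python
import numpy as np

def concept_probabilities(features, concept_matrix):
    """Softmax over negative squared Euclidean distances.

    features: (N, d) sample features; concept_matrix: (K, d) concept
    vectors. Returns an (N, K) array of concept probabilities.
    """
    # squared distance between every feature and every concept vector
    diff = features[:, None, :] - concept_matrix[None, :, :]   # (N, K, d)
    sq_dist = np.sum(diff ** 2, axis=-1)                       # (N, K)
    logits = -sq_dist
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(features, concept_matrix, labels):
    """Mean negative log-probability of the groundtruth concepts."""
    probs = concept_probabilities(features, concept_matrix)
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
```

A sample whose feature lies near its concept vector and far from the others receives a probability close to 1 for that concept, so the cross entropy loss is near zero.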
In the class incremental learning scenario, we need to train ICLNet on replayed samples combined with new concept data. However, directly training the ICLNet with $\mathcal{L}_{ce}$ on the merged data will still lead to large intra-class feature variance for some learned concepts, as shown in the second row of Figure 2.
In order to reduce the intra-class feature variance of learned concepts, we propose a concept contrastive loss that learns to represent the concept features in a more compact form. The concept contrastive loss constrains each sample feature to stay close to its corresponding concept vector while keeping away from the other concept vectors. The loss function is formulated as follows:

$$\mathcal{L}_{cc} = \frac{1}{N} \sum_{i=1}^{N} \left[ \lVert f_\theta(x_i) - m_{y_i} \rVert_2^2 + \sum_{j \neq y_i} \max\!\left(0,\; \gamma - \lVert f_\theta(x_i) - m_j \rVert_2^2\right) \right]$$

where $\gamma$ is the margin.
Training the ICLNet with $\mathcal{L}_{ce} + \mathcal{L}_{cc}$ keeps the learned sample features near their corresponding concept vectors. Hence, while incrementally learning new concepts in ICLNet, the learned old concept vectors in the concept memory matrix remain almost fixed, and the neural weights of the feature extractor are adjusted to a smaller extent, which preserves old knowledge while learning new concepts.
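A minimal NumPy sketch of a contrastive loss of this pull/push form follows; the exact weighting and margin convention in the paper may differ, so treat this as an illustrative assumption rather than the definitive formulation.

```python
import numpy as np

def concept_contrastive_loss(features, concept_matrix, labels, margin=1.0):
    """Pull each feature toward its own concept vector; push it at least
    `margin` (in squared distance) away from every other concept vector.

    features: (N, d); concept_matrix: (K, d); labels: (N,) int concept ids.
    """
    diff = features[:, None, :] - concept_matrix[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)            # (N, K)
    n, k = sq_dist.shape
    pos = sq_dist[np.arange(n), labels]             # distance to own concept
    mask = np.ones((n, k), dtype=bool)
    mask[np.arange(n), labels] = False              # exclude own concept
    neg = np.maximum(0.0, margin - sq_dist)         # hinge on other concepts
    return float(np.mean(pos + (neg * mask).sum(axis=1)))
```

When a feature sits exactly on its concept vector and far from the others, both terms vanish and the loss is zero, which is the compact representation the loss encourages.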
III-B Incremental Generative Memory Recall Network
In the concept incremental learning scenario, intact old concept data is not available when learning new concepts. We propose to use a generative model that learns the distribution of old concepts as recall memory. Currently, generative adversarial networks are the most promising generative models. In order to generate high quality samples at low computational cost, we choose the generative adversarial network with auxiliary classifier (ACGAN) architecture, which can directly generate samples for specified concepts. There are many candidate objective functions for training GANs, such as the Wasserstein generative adversarial network (WGAN), WGAN with gradient penalty (WGAN-GP) and the least square generative adversarial network (LSGAN). Different from training cGANs on a single dataset, in continual learning scenarios we must continually train cGANs on non-stationary datasets, so the recall memory consolidation speed and stability are very important. However, WGAN and WGAN-GP converge slowly and are hard to tune. Therefore, we choose the least square generative adversarial network (LSGAN), which is more stable and faster, for training the ACGAN.
We denote the incremental generative memory recall network (RecallNet) that can recall $K$ concepts as $G_K$. When learning a new class incremental task (cit), we denote $\theta_G^t$, $\theta_D^t$ and $\theta_C^t$ as the parameters of the generator, the discriminator and the auxiliary classifier respectively within the LS-ACGAN while learning the $t$-th class incremental task. We alternately train the generator and the discriminator with the classifier by solving the adversarial game. The generator optimizes the following objective function:

$$\min_{\theta_G^t} \; \frac{1}{2}\,\mathbb{E}_{z \sim p_z,\, c \sim p_c}\!\left[\left(D(G(z, c)) - 1\right)^2\right] + \mathbb{E}_{z \sim p_z,\, c \sim p_c}\!\left[\mathcal{L}_{cls}\!\left(C(G(z, c)),\, c\right)\right]$$

where $S_t$ is the training set which combines recalled old concept samples and new concept samples, and $p_z$, $p_c$ are the sampling distributions of the noise and the concept labels. Similarly, the discriminator and the auxiliary classifier optimize the following objective function:

$$\min_{\theta_D^t,\, \theta_C^t} \; \frac{1}{2}\,\mathbb{E}_{(x, c) \sim S_t}\!\left[\left(D(x) - 1\right)^2\right] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z,\, c \sim p_c}\!\left[D(G(z, c))^2\right] + \mathbb{E}_{(x, c) \sim S_t}\!\left[\mathcal{L}_{cls}\!\left(C(x),\, c\right)\right]$$
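The least-squares adversarial terms above reduce to simple mean-squared errors on the discriminator scores. The following sketch shows only these adversarial terms (the full LS-ACGAN objectives also add the auxiliary classification loss on both sides); the function names are our own.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push scores on real samples
    toward 1 and scores on generated samples toward 0."""
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push scores on generated samples
    toward 1, i.e. make them look real to the discriminator."""
    return float(np.mean((d_fake - 1.0) ** 2))
```

Because the losses are quadratic in the discriminator output rather than log-based, gradients stay informative even for samples the discriminator classifies confidently, which is one reason LSGAN trains more stably than the original GAN objective.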
III-C Balanced Online Recall
Even though LS-ACGAN can stabilize training and generate decent images, there are still many limitations in the existing techniques. Previous approaches [7, 9, 20] leverage memory replay to create an extended dataset that contains both real new concept data and recalled pseudo samples of old concept data. However, creating an extended dataset occupies more storage space and loses information by keeping only a limited number of stored samples. If we continually train cGANs with such an extended dataset, the long-term memory will gradually be lost over a long sequence of class incremental tasks, because a limited amount of recalled samples of old concepts cannot cover the full data distribution. If RecallNet fails to generate satisfactory images at one class incremental task learning stage, all subsequent class incremental tasks will be affected.
Because of these problems, we propose a balanced online recall strategy. Since we usually train neural networks with a mini-batch stochastic gradient descent algorithm, we can directly recall old concept samples and combine them with samples from the new data to form a training mini-batch. Therefore, we do not need to store the recalled old concept data. Due to the imbalance in the number of concepts between the old memory and the new data, the conditional generator is not able to learn to generate correct samples for the correct labels if we train RecallNet on imbalanced image mini-batches. If we train ICLNet on the imbalanced batches, it also leads to bias problems. This is confirmed by the experimental results. Therefore, for each data batch, the amount of samples per concept should be proportional to the number of concepts in the recall memory and in the new class incremental task. In practice, for implementation simplicity, we propose a balanced sampling method which only requires computing the number of samples for the old concept data and the new concept data. Assume that the training batch size is $B$ and the numbers of old concepts and new concepts are $K_o$ and $K_n$ respectively. We have

$$B_o + B_n = B, \qquad \frac{B_o}{B_n} = \frac{K_o}{K_n}$$

where $B_o$ and $B_n$ are the amounts of old concept samples and new concept samples in a training batch respectively. By solving the simultaneous equations, we get

$$B_o = \frac{K_o}{K_o + K_n}\, B, \qquad B_n = \frac{K_n}{K_o + K_n}\, B$$
Sampling $B_o$ samples from the RecallNet and $B_n$ samples from the new data constitutes a class-balanced training batch. We train both ICLNet and RecallNet using this strategy.
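The proportional split is a one-line computation (with rounding to whole samples, which is an implementation detail we assume here):

```python
def balanced_batch_sizes(batch_size, n_old, n_new):
    """Split a mini-batch proportionally to the concept counts.

    Solves B_o + B_n = B and B_o / B_n = K_o / K_n, rounding the old
    share to an integer; the remainder goes to the new concepts.
    """
    b_old = round(batch_size * n_old / (n_old + n_new))
    return b_old, batch_size - b_old
```

For example, with a batch of 100 after learning 8 concepts and while learning 2 new ones, 80 samples are recalled from old concepts and 20 are drawn from the new data.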
III-D Algorithm Framework
In Algorithm 1, we present the overall training procedure of our approach. When we need to learn a new class incremental task, we have a new dataset containing $K_n$ new concepts, the ICLNet which is capable of recognizing $K_o$ concepts, and the RecallNet which can recall samples of $K_o$ concepts. First, we add $K_n$ new concept vectors to the concept memory matrix of the ICLNet, so that the ICLNet can learn to distinguish $K_o + K_n$ concepts. Next, we use the balanced online recall strategy to obtain balanced training batches to train the ICLNet. Finally, we initialize a new RecallNet which can learn to recall samples of $K_o + K_n$ concepts, and then train the new RecallNet with the balanced online recall strategy to consolidate the currently learned concept memories.
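The three steps above can be sketched as a high-level training loop. Everything here is a hedged mock-up of Algorithm 1: the object interfaces (`add_concepts`, `recall`, `sample`, `train_step`, `fresh_copy`) are hypothetical names introduced for illustration only.

```python
def learn_incremental_task(iclnet, recallnet, new_data, n_old, n_new,
                           batch_size, steps):
    """Sketch of one class incremental task, following Algorithm 1:
    (1) grow ICLNet's concept memory, (2) train ICLNet on balanced
    batches of recalled + new samples, (3) consolidate a fresh
    RecallNet on the same balanced batches."""
    iclnet.add_concepts(n_new)                       # step 1: grow memory
    b_old = round(batch_size * n_old / (n_old + n_new))
    for _ in range(steps):                           # step 2: train ICLNet
        batch = recallnet.recall(b_old) + new_data.sample(batch_size - b_old)
        iclnet.train_step(batch)
    new_recall = recallnet.fresh_copy(n_old + n_new) # step 3: consolidate
    for _ in range(steps):
        batch = recallnet.recall(b_old) + new_data.sample(batch_size - b_old)
        new_recall.train_step(batch)
    return iclnet, new_recall
```

Note that the old RecallNet keeps producing the recalled samples while the new RecallNet is being consolidated, so the pseudo samples are never stored to disk.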
In this section, we perform a variety of experiments to demonstrate the advantages of our approach. We present the implementation details in Section IV-A. To illustrate the class incremental learning procedure, we present a feature visualization in Section IV-B. We compare the continual learning performance with recently proposed related methods in Section IV-C. In Section IV-D, we perform ablation experiments to demonstrate the usefulness of the key components of our approach. We show further experiments in Sections IV-E and IV-F.
IV-A Implementation Details
Benchmark: We use the average incremental accuracy benchmark protocol suggested by  to evaluate the performance. This means that, in class incremental scenarios, after each class incremental task (cit), the classifier is evaluated on the combined test set consisting of both new and old class samples. We evaluate and compare on three different datasets. MNIST, consisting of images of handwritten digits, and Fashion-MNIST, consisting of images of different fashion clothes, are resized in our experiments. SVHN contains cropped digits of house numbers from real-world street images. Each of these datasets contains 10 classes. We split each dataset into 5 class incremental tasks, each of which consists of 2 randomly picked classes. In our experiments, we pick the classes for each task in the original label order for ease of understanding.
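Under this protocol, the reported metric is simply the mean of the accuracies measured after each incremental task, each on the combined test set of all classes seen so far. A minimal sketch (function name is our own):

```python
def average_incremental_accuracy(task_accuracies):
    """Average of the per-task accuracies, where task_accuracies[t] is
    the accuracy on the combined test set of all classes seen up to and
    including class incremental task t."""
    return sum(task_accuracies) / len(task_accuracies)
```

This weights every incremental stage equally, so a method that forgets early classes is penalized at every subsequent evaluation point.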
Network architectures: For MNIST and Fashion-MNIST, the feature extractor in ICLNet consists of 3 convolutional layers with batch normalization followed by a fully connected layer. For SVHN, the feature extractor in ICLNet consists of 6 convolutional layers with batch normalization, which is the structure provided by the PyTorch official examples. Since memory consolidation is more challenging for Fashion-MNIST and SVHN, which contain much more variability than MNIST, we use self-attention and spectral normalization techniques in the RecallNet. The detailed network implementations can be found in the supplemental materials.
IV-B Feature Visualization
In this section, a toy example on the MNIST dataset is presented. We reduce the output dimension of the feature extractor to 2, as well as the dimension of the concept vectors, so we can directly plot the features on a 2-D surface for visualization. We show the class incremental learning procedure of different concepts over 5 class incremental tasks in Figure 2. Three conditions of different classifiers and losses are compared.
In the first row, when we use a traditional fully connected layer as the classifier with the softmax cross entropy loss, we can see that the numerical scale of the features and the intra-class variance gradually increase as the continual learning progresses. In the continual learning scenario, when the neural network incrementally learns new concepts, the large changes in feature scale lead to large network weight changes, thereby increasing the impact of catastrophic forgetting. When we train ICLNet with only the softmax cross entropy loss, the feature scale is reduced but the intra-class feature variance is still large, as shown in the second row. To further reduce the intra-class feature variance, we train ICLNet with both the softmax cross entropy loss and the concept contrastive loss, with results shown in the third row. We can observe that the features belonging to the same concept concentrate around their corresponding concept vectors and keep away from the other concept vectors. This phenomenon verifies the purpose of the concept contrastive loss. Furthermore, it also improves the continual learning performance, as shown in the ablation study in Section IV-D.
IV-C Results and Comparisons
In this section, we perform both quantitative and qualitative evaluations of our approach. We compare with DGR and MeRGAN, since they, like our approach, belong to the pseudo-rehearsal based methods and use GAN-based models for memory recall and consolidation. For fair comparison, we reimplement them in PyTorch using the same network structures as ours, with reference to their official code.
The DGR method was originally evaluated on domain incremental learning problems. Therefore, in order to make the original DGR approach fit the class incremental learning scenario, we make some reasonable modifications. Following , we separately train a feedforward neural network classifier and an unconditional generative adversarial network. The replayed images are labeled with the most likely category predicted by the classifier trained on the previous task. However, this procedure may introduce incorrect labels into the continued training and harm the classifier while continually learning new classes.
The MeRGAN method proposes to use a conditional GAN to replay past data and integrates the classifier into the discriminator of cGANs. It can help to avoid potential classification errors and biased sampling towards recent categories. They use replay alignment techniques to further prevent forgetting. But we compare with the version without replay alignment.
The AuxClassifier method refers to the auxiliary classifier in the AC-GAN of our model, which also provides classification ability in the class incremental learning scenario. Compared with MeRGAN, AuxClassifier does not inherit the AC-GAN parameters from the previous AC-GAN; hence, it lacks the implicit memory of previously learned concepts. However, we use the balanced online recall strategy rather than creating an extended dataset of the current task and recalled samples for training.
The Upper Bound refers to directly training the classifier with the cross entropy loss using the data of all class incremental tasks. We regard it as the performance upper bound for continual learning.
We show the comparison results on the MNIST, Fashion-MNIST and SVHN datasets in Table I. Our results outperform the other pseudo-rehearsal methods. For MNIST, the average precision of our approach over 5 class incremental tasks is close to the upper bound. However, for Fashion-MNIST and SVHN, there is still a large gap to the upper bound.
We visualize the samples of each learned concept recalled from the RecallNet after consolidating memory for each class incremental task in Figure 3. We can see that for MNIST, the RecallNet can generate clear and diverse images even after learning the last incremental task. However, for Fashion-MNIST and SVHN, although we can still recognize the generated samples, they gradually lose detail and variability.
We draw the average incremental accuracy of different methods on different datasets during learning class incremental tasks in Figure 4. We can see that our approach performs better than other pseudo-rehearsal approaches.
IV-D Ablation Study
|Net Type|Recall Strategy|Loss Function|MNIST|Fashion-MNIST|SVHN|
|---|---|---|---|---|---|
|TradNet|Balanced Online Recall|cross entropy|96.45|74.98|66.15|
|ICLNet|Balanced Online Recall|cross entropy|97.24|78.82|69.62|
|ICLNet|Balanced Online Recall|cross entropy + concept contrastive|98.62|80.79|72.20|
|ICLNet|Imbalanced Online Recall|cross entropy + concept contrastive|95.21|73.38|58.45|
|ICLNet|Offline Balanced Recall|cross entropy + concept contrastive|93.24|54.38|43.24|
We perform a series of ablation experiments to demonstrate that the key components of our approach indeed improve class incremental learning performance. The variants are as follows. (i) Which classifier we use: the traditional fully connected layer (TradNet) or the ICLNet structure (ICLNet). (ii) Which memory recall strategy we use: balanced online recall (Balanced Online Recall), imbalanced online recall (Imbalanced Online Recall), or storing recalled samples with balanced sampling (Offline Balanced Recall). (iii) Which loss function we apply: only the softmax cross entropy loss, or combined with the concept contrastive loss. The results in Table II show that each proposed component makes a significant contribution to the class incremental learning tasks. The ICLNet with the concept contrastive loss reduces the feature scale and intra-class variance, which helps alleviate forgetting when recalled samples are not realistic enough. Storing recalled samples loses the information RecallNet accumulates during continual learning and decreases the quality of recalled samples in subsequent tasks. And without balanced sampling, online recall fails to learn to generate samples corresponding to class labels when the numbers of learned and new concepts are extremely imbalanced, which results in failure in continual learning.
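As a rough illustration of how a contrastive objective can concentrate features around concept vectors, the margin-based sketch below pulls each normalized feature toward its own concept vector and pushes it away from the nearest other one. The exact formulation and the margin value are our assumptions for illustration, not the paper's definition of the concept contrastive loss.

```python
import torch
import torch.nn.functional as F

def concept_contrastive_loss(features, labels, concept_matrix, margin=0.5):
    """Margin-based sketch: maximize cosine similarity to the sample's own
    concept vector while suppressing similarity to the hardest other one."""
    f = F.normalize(features, dim=1)           # unit-length features
    c = F.normalize(concept_matrix, dim=1)     # unit-length concept vectors
    sim = f @ c.t()                            # cosine similarities (N, K)
    pos = sim.gather(1, labels.unsqueeze(1)).squeeze(1)
    # mask out each sample's own concept when finding the hardest negative
    neg_mask = torch.ones_like(sim).scatter_(1, labels.unsqueeze(1), 0.0)
    neg = (sim * neg_mask).max(dim=1).values
    return F.relu(margin - pos + neg).mean()
```

When every feature coincides with its own concept vector and the concept vectors are orthogonal, the loss is zero; features attracted to a wrong concept vector are penalized.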
IV-E Learning Dissimilar Concepts
We conduct a concept incremental learning experiment across MNIST and Fashion-MNIST to show that our approach also works when continually learning dissimilar concepts from different datasets. In this experiment, we split the whole training data of MNIST and Fashion-MNIST into four class incremental tasks which contain concepts respectively. The average incremental accuracy of our approach still surpasses the other pseudo-rehearsal based approaches, DGR and MeRGAN, as shown in Figure 6. We also visualize the recalled samples of each learned concept after consolidating memory for each class incremental task in Figure 5. The quality indicates that our approach is capable of maintaining the memory of dissimilar concepts.
IV-F Comparison with a Rehearsal-based Method
The ICLNet can also perform class incremental learning in the rehearsal-based setting, which uses a selected representative sample set. We compare with the well-known rehearsal-based method iCaRL, which selects representative samples for old class incremental tasks and trains with a distillation loss. In our reimplementation, we set the capacity of the representative set and use random exemplar selection rather than exemplar selection by herding, since there is no substantial difference as claimed in . For a fair comparison, we train our ICLNet with the same selected representative sample set. We conduct experiments on MNIST, Fashion-MNIST and SVHN and use the same feature extractor in ICLNet and iCaRL. As shown in Figure 7, the ICLNet achieves better performance than iCaRL.
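The random exemplar selection can be sketched as below: the fixed memory budget is split evenly across the classes seen so far and exemplars are drawn uniformly at random. The helper name and budget handling are illustrative; the actual capacity we used is not restated here.

```python
import random

def build_exemplar_set(images_by_class, capacity, rng=None):
    """Random exemplar selection under a fixed total memory budget,
    split evenly across classes (our stand-in for herding)."""
    rng = rng or random.Random(0)
    per_class = capacity // len(images_by_class)  # equal budget per class
    exemplars = {}
    for c, images in images_by_class.items():
        pool = list(images)
        rng.shuffle(pool)
        exemplars[c] = pool[:per_class]           # keep a random subset
    return exemplars
```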
In this paper, we propose a novel approach to the catastrophic forgetting problem in the class incremental learning scenario. To fit this scenario, we design a readily comprehensible network architecture named ICLNet, which only needs to increase the number of concept vectors in the concept memory matrix when learning new concepts. To alleviate the catastrophic forgetting caused by large changes of neural weights, we propose a concept contrastive loss that helps concentrate features around their corresponding concept vectors and keep them away from other concept vectors. The concept contrastive loss with ICLNet can be interpreted as a kind of regularization approach. Unlike EWC and IMM, which regularize with precomputed importance weights, and distillation methods [24, 9], which regularize with soft targets of samples, we regularize with a remoulded neural network structure and a loss function designed for continual learning. We believe that redesigning the computational architecture of neural networks is a promising way to achieve efficient continual learning in the future. On the other hand, we propose a balanced online recall strategy that makes LS-ACGAN capable of incrementally consolidating high-quality generative memory in the continual learning scenario. Thanks to these two proposed methods, we outperform other pseudo-rehearsal based approaches on MNIST, Fashion-MNIST and SVHN. Our ICLNet also performs better than the well-known rehearsal-based approach iCaRL in the same rehearsal scenario.
The disadvantage of our approach, shared by other pseudo-rehearsal based approaches, is that it depends on the capability of the generative models used as recall memory. This may hinder applying pseudo-rehearsal based continual learning approaches to large-scale datasets. However, generative models are developing rapidly: excellent GANs like BigGAN and SGAN have been proposed, and more progress will be made in the future. Hence, it is still promising to use GANs as recall memory.
In the future, we aim to combine the ICLNet and RecallNet into an integrated model to improve computational efficiency and continual learning ability: the implicit and explicit memory in ICLNet should help consolidate the recall memory in RecallNet, and the information contained in RecallNet should in turn help ICLNet continually learn new concepts.
-  G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual lifelong learning with neural networks: A review,” Neural Networks, vol. 113, pp. 54–71, 2019.
-  M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.
-  J. L. McClelland, B. L. McNaughton, and R. C. O’reilly, “Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory.” Psychological review, vol. 102, no. 3, p. 419, 1995.
-  L. R. Squire, “Memory and brain,” 1987.
-  J. D. Power and B. L. Schlaggar, “Neural plasticity across the lifespan,” Wiley Interdisciplinary Reviews: Developmental Biology, vol. 6, no. 1, p. e216, 2017.
-  M. Mermillod, A. Bugaiska, and P. Bonin, “The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects,” Frontiers in psychology, vol. 4, p. 504, 2013.
-  H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep generative replay,” in Advances in Neural Information Processing Systems, 2017, pp. 2990–2999.
-  N. Kamra, U. Gupta, and Y. Liu, “Deep generative dual memory network for continual learning,” arXiv preprint arXiv:1710.10368, 2017.
-  Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, Z. Zhang, and Y. Fu, “Incremental classifier learning with generative adversarial networks,” arXiv preprint arXiv:1802.00853, 2018.
-  A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” arXiv preprint arXiv:1410.5401, 2014.
-  S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory networks,” in Advances in neural information processing systems, 2015, pp. 2440–2448.
-  J. Wang, L. Zhang, Q. Guo, and Z. Yi, “Recurrent neural networks with auxiliary memory units,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1652–1661, May 2018.
-  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  C. Luo, J. Zhan, X. Xue, L. Wang, R. Ren, and Q. Yang, “Cosine normalization: Using cosine similarity instead of dot product in neural networks,” in International Conference on Artificial Neural Networks. Springer, 2018, pp. 382–391.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision. Springer, 2016, pp. 499–515.
-  Y. LeCun, “The mnist database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
-  H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, 2011.
-  C. Wu, L. Herranz, X. Liu, Y. Wang, J. van de Weijer, and B. Raducanu, “Memory replay gans: learning to generate images from new categories without forgetting,” in Advances in Neural Information Processing Systems, 2018, pp. 5962–5972.
-  J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, p. 201611835, 2017.
-  S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang, “Overcoming catastrophic forgetting by incremental moment matching,” in Advances in Neural Information Processing Systems, 2017, pp. 4652–4662.
-  X. Liu, M. Masana, L. Herranz, J. van de Weijer, A. M. López, and A. D. Bagdanov, “Rotate your networks: Better weight consolidation and less catastrophic forgetting,” 24th International Conference on Pattern Recognition (ICPR), pp. 2262–2268, 2018.
-  Z. Li and D. Hoiem, “Learning without forgetting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS workshop on deep learning and unsupervised feature learning, Montreal, Canada, 2014.
-  H. Jung, J. Ju, M. Jung, and J. Kim, “Less-forgetting learning in deep neural networks,” arXiv preprint arXiv:1607.00122, 2016.
-  S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “iCaRL: Incremental classifier and representation learning,” in Proc. CVPR, 2017.
-  J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C.-C. J. Kuo, “Class-incremental learning via deep model consolidation,” arXiv preprint arXiv:1903.07864, 2019.
-  A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
-  J. Lee, J. Yun, S. Hwang, and E. Yang, “Lifelong learning with dynamically expandable networks,” arXiv preprint arXiv:1708.01547, 2017.
-  R. Coop, A. Mishtal, and I. Arel, “Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1623–1634, Oct 2013.
-  D. Lopez-Paz et al., “Gradient episodic memory for continual learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6467–6476.
-  G. M. van de Ven and A. S. Tolias, “Generative replay with feedback connections as a general strategy for continual learning,” arXiv preprint arXiv:1809.10635, 2018.
-  S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2813–2821.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.
-  H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Robins, “Catastrophic forgetting, rehearsal and pseudorehearsal,” Connection Science, vol. 7, no. 2, pp. 123–146, 1995.
-  K. Javed and F. Shafait, “Revisiting distillation and incremental classifier learning,” arXiv preprint arXiv:1807.02802, 2018.
-  A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
-  M. Lučić, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly, “High-fidelity image generation with fewer labels,” in International Conference on Machine Learning, 2019, pp. 4183–4192.