Overcoming Long-term Catastrophic Forgetting through Adversarial Neural Pruning and Synaptic Consolidation

by Jian Peng, Bo Tang, et al.
Central South University

Enabling a neural network to sequentially learn multiple tasks is of great significance for expanding the applicability of neural networks in realistic human application scenarios. However, as the task sequence grows, the model quickly forgets previously learned skills; we refer to this loss of memory over long sequences as long-term catastrophic forgetting. There are two main causes of long-term forgetting: first, as tasks accumulate, the intersection of the low-error parameter subspaces satisfying these tasks becomes smaller and smaller, or even non-existent; second, errors accumulate in the process of protecting the knowledge of previous tasks. In this paper, we propose an adversarial mechanism in which neural pruning and synaptic consolidation are used to overcome long-term catastrophic forgetting. This mechanism distills task-related knowledge into a small number of parameters and retains old knowledge by consolidating only those parameters, while sparing most parameters to learn subsequent tasks, which not only avoids forgetting but also makes it possible to learn a large number of tasks. Specifically, neural pruning iteratively relaxes the parameter conditions of the current task to expand the common parameter subspace of tasks, and the modified synaptic consolidation strategy comprises two components: a novel measurement that takes network structure information into account is proposed to calculate parameter importance, and an element-wise parameter updating strategy is designed to prevent significant parameters from being overwritten in subsequent learning. We verified the method on image classification, and the results show that our proposed ANPSC approach outperforms state-of-the-art methods. A hyperparameter sensitivity test further demonstrates the robustness of the approach.






I Introduction

Humans can learn consecutive tasks and memorize acquired skills, such as running, biking and reading, throughout their lifetimes. This ability, namely, continual learning, is crucial to the development of artificial general intelligence [pratama2013panfis]. Existing models lack this ability mainly due to catastrophic forgetting, which means that networks forget knowledge learned from previous tasks when learning new ones [McCloskey1989Catastrophic]. To mitigate catastrophic forgetting, a straightforward approach is to retrain the model on previous data mixed with new data; however, this is inefficient for systems with limited storage and a high model update frequency [Li2017Learning]. Rusu et al. [Rusu2016Progressive], Fernando et al. [Fernando2017Pathnet:] and Coop et al. [coop2013ensemble] attempted to reserve task-specific structures, such as layers or modules, for single tasks. Works based on the rehearsal strategy [Lopez-Paz2017Gradient][Rebuffi2013icarl:][robins1993catastrophic][diaz2014incremental] reinforce previous memories by replaying experiences. All of these methods require additional network capacity for retaining previous tasks.

An ideal learning system could sequentially learn tasks without increasing the memory space or the computational cost [bargi2017adon]. Regularization-based methods satisfy these requirements. For instance, elastic parameter updating [Kirkpatrick2016Overcoming][ritter2018online] finds the joint distribution of tasks by protecting parameters with higher importance. However, this approach suffers from insufficient memory when learning long sequences of tasks: it has difficulty finding a common parameter subspace that satisfies the requirements of all tasks, which leaves it entangled between using capacity to memorize previous tasks and using it to learn the current task.

One of the major challenges in long-term learning is that the size of the shared parameter subspace of previous tasks decreases as the number of tasks increases. Existing weight consolidation approaches that search for a common solution face two problems. First, the L2 distance is adopted as the overall measurement index, so the update of each parameter cannot be precisely controlled, which leads to a failure to protect important parameters. Second, the topological properties of the network are an important factor in knowledge representation [courbariaux2016binarized]. Several works [Zenke2017Continual][Aljundi2017Memory] have attempted to calculate the importance of parameters, e.g., via the sensitivity of parameters to tiny perturbations; however, none of them consider the topological relationship between the network structure and the parameters.

In this paper, we propose a novel method, namely, Adversarial Neural Pruning and Synaptic Consolidation (ANPSC), to overcome long-term catastrophic forgetting. We believe that the causes of this problem are the shrinking of the shared parameter subspace and the accumulation of error as new tasks arrive. We address the former via an online neural pruning strategy, which distills the current task into a few parameters; this indirectly expands the common solution space and frees up capacity for subsequent tasks. To tackle the latter, we design a momentum-based weight consolidation policy that protects critical parameters element by element. In addition, we argue that information about the network topology is significant, and we propose using the connectivity of the network to measure the importance of parameters. The main contributions of this paper are as follows:

  1. We analyze the causes of long-term catastrophic forgetting in neural networks and propose an adversarial mechanism of neural pruning and synaptic consolidation to tackle it.

  2. To precisely protect significant parameters from being destroyed, we design a weight update policy that revises the gradient step with momentum.

  3. To take the structural information of networks into account, we propose a novel measure of parameter importance. This measure uses parameter connectivity to abstract the topological characteristics of a network in parameter space in a label-free manner. The experimental results demonstrate that this measure is accurate, centralized and polarized.

  4. We investigated a series of regularization methods for overcoming forgetting. Experimental results show that our method outperforms other mainstream methods and has strong robustness and generalization ability.

II Related Work

In this paper, we focus on methods that do not add network structure. Such methods include model pruning, knowledge distillation and regularization strategies.

II-A Model Pruning and Knowledge Distillation

Parameter pruning methods [LeCun2015Optimal][Hassibi2014Second][smith2018neural] are based on the hypothesis that some parameters have little effect on the model loss after being erased. Thus, the key strategy is to search for the parameters that have minimal influence on the loss. An effective approach for narrowing the representational overlap between tasks is to reduce parameter sharing among tasks under limited network capacity. Another approach is knowledge distillation, which packs the knowledge of a complex network into a lightweight network using the teacher-student model and has also been used to tackle catastrophic forgetting [Li2017Learning].

PackNet [Mallya2017Packnet:] sequentially compresses multiple tasks into a single model by pruning redundant parameters. The dual memory network [kamra2017deep] partially drew on this idea to overcome catastrophic forgetting with an external network. Inspired by model compression, our method utilizes parameter connectivity to establish a soft mask rather than hard pruning based on a binary mask [courbariaux2016binarized]: it does not completely truncate the unimportant parameters but adaptively adjusts them according to subsequent tasks. This shares parameters among multiple tasks, conserves model capacity compared with hard pruning, and incurs lower performance penalties.

II-B Regularization Strategies

Various methods reduce representational overlap among tasks to overcome catastrophic forgetting via regularization, such as weight freezing and weight consolidation. Weight freezing, which was inspired by the distributed encoding of human brain neurons, tries to avoid overlaps between the crucial functional modules of tasks. For instance, Path-Net [Fernando2017Pathnet:] establishes a large neural network and fixes a module of the network to avoid interference from later tasks. The progressive neural network (PNN) [Rusu2016Progressive] and [sun2018concept] allocate separate networks for each task and perform multiple tasks via a progressive expansion strategy. Methods of this type fix the important parameters of a task to prevent the network from forgetting. However, they lack flexibility when facing a long sequence of tasks, and their memory footprint and computational complexity increase linearly with the number of tasks.

Weight consolidation tries to identify parameters that are important for previous tasks and penalize changes to them when training new tasks. The classic method of this family is elastic weight consolidation (EWC) [Kirkpatrick2016Overcoming], which is inspired by the mechanism of synaptic plasticity; SI [Zenke2017Continual] and MAS [Aljundi2017Memory] follow the same principle. EWC updates parameters elastically according to their importance, which it measures by approximating the Fisher information matrix. Methods of this type encode more tasks with lower network capacity and lower computational complexity than Path-Net and PNN. The measurement of parameter importance is crucial. Most methods calculate parameter importance based on parameter sensitivity; however, none of them consider the network's topological properties.
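The weight consolidation idea above can be sketched as a quadratic penalty that pulls each parameter back toward its post-task value in proportion to its importance. The following is a minimal NumPy sketch of such an EWC-style penalty, not the paper's exact formulation; the names `ewc_penalty` and `lam` are ours.

```python
import numpy as np

def ewc_penalty(weights, old_weights, importance, lam=1.0):
    """EWC-style consolidation penalty: each parameter is pulled back
    toward its value after the previous task, weighted by its
    per-parameter importance (e.g., a diagonal Fisher estimate)."""
    return 0.5 * lam * np.sum(importance * (weights - old_weights) ** 2)
```

A parameter with zero importance is free to change; a parameter with large importance contributes heavily to the penalty and is therefore effectively frozen.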

III Methods

III-A Problem Definition

Given a sequence of tasks $\{T_1, \dots, T_n\}$ that are defined by datasets $\{D_1, \dots, D_n\}$, and a neural network defined by parameters $\theta$, the objective of continual learning is to sequentially learn all tasks. To overcome catastrophic forgetting, a classic approach is to find a distribution that fits the data of all tasks from the parameter space of previous tasks (Figure 1.a), namely,

$$\theta^{*} = \arg\min_{\theta} \sum_{t=1}^{n} \mathcal{L}_{t}(\theta; D_{t}).$$

This goal is realized by consolidating important parameters of previous tasks. The cause of long-term catastrophic forgetting is that it is intractable to search for a solution that satisfies all tasks within the intersection of the tasks' parameter subspaces. The fundamental problem is that the shared parameter subspace is either small or does not exist, and the cumulative error of the weight consolidation strategy causes the solution to deviate from the low-error parameter subspace.

Fig. 1: ANPSC for overcoming catastrophic forgetting. a, The classic process of the consolidation regularizer, e.g., EWC, denoted by the red dashed line. EWC obtains the common solution of tasks by finding the solution of the current task within the parameter space of previous tasks. The intersection of the parameter subspaces (denoted by triangles and pentagons) decreases as the number of tasks increases. In addition, the end point may drift away from the true common solution space of the tasks because of the error of the parameter constraints. b, The black dashed arrow represents neural pruning, which yields an approximate solution of the current task using a few parameters, thereby expanding the parameter space of the current learning task. The green solid arrow denotes the revised error, which corrects the consolidation strategy and is more likely to arrive at the true common solution space of the tasks. c, The feasible region of the original model, which is defined by three parameters. d, The feasible region of the model defined by two parameters after neural pruning. This relaxes some nonsignificant parameters, which implicitly expands the feasible region.

III-B Adversarial Solution

To alleviate long-term catastrophic forgetting, two key strategies are employed: one is to expand the overlap of the tasks' parameter subspaces (Figure 1.a, in which the region is denoted by triangles and pentagons), and the other is to protect the parameters more precisely (Figure 1.b). We believe that by approximating the solution of the current task with a subset of parameters while keeping the approximation error low, the parameter constraints can be effectively relaxed and the parameter subspace of the current task expanded (Figure 1.d) compared with the original model (Figure 1.c):

$$\hat{W} \subset W, \qquad \big|\mathcal{L}(\hat{W}) - \mathcal{L}(W)\big| \le \epsilon,$$

where $\hat{W}$ is the subset of the parameters $W$, and $\mathcal{L}(\hat{W})$ is the approximate solution.

To decrease the accumulated error of parameter consolidation, we modify it in two ways: first, a novel weight-wise consolidation approach, namely, momentum updating, is designed. This approach revises the direction of optimization according to the importance of parameters while learning new tasks (Figure 2). Second, as the previous measurements of importance did not consider the structural information that is hidden in the parameters, a novel parameter measure is proposed. This novel approach measures the importance of a parameter according to the state of the connection between two neurons.

Neural pruning abandons as many parameters as possible that are unimportant to the new task, so as to enlarge the shared parameter subspace of tasks, while synaptic consolidation requires that the parameters of old tasks be protected as much as possible from being destroyed. This adversarial mechanism enables the model to compress task knowledge into a small number of highly representative parameters, and to protect that small number of parameters so as to balance the performance of old and new tasks.

III-C Neural Pruning

Most techniques for model pruning are conducted offline [cheng2017quantized]; hence, the pruning operation must be implemented after training, which is inflexible and time-consuming. In addition, this approach requires reusing previous data. In this paper, we selectively prune parameters with low salience to the output of the model during training, implemented in an iterative training-pruning manner. This implicitly distills the previous training phase into fewer parameters during the pruning phase; thus, it can also be considered an online pruning approach. The objective of pruning is defined as:

$$\min_{\hat{W}} \|\hat{W}\|_{0} \quad \text{s.t.} \quad \big|E(\hat{W}, X) - E(W, X)\big| \le \epsilon.$$
When learning a task, we train the parameters on the given training dataset and calculate the salience of the parameters. The salience measures the influence of a parameter on the performance of the model: a higher value corresponds to a larger decrease in performance if the parameter is pruned. Then, we generate a mask over the parameters according to the salience threshold $\beta$ to prevent insignificant parameters from being updated. These parameters are not actually pruned but are reserved for later tasks.
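The masking step can be sketched as follows, assuming the salience values are already computed. This is a minimal NumPy illustration of the idea, not the paper's implementation; `soft_prune_mask` and `masked_sgd_step` are our names.

```python
import numpy as np

def soft_prune_mask(salience, threshold):
    """Mask out parameters whose salience to the current task falls below
    the threshold; they are frozen rather than deleted, so they remain
    available for later tasks (a sketch of the soft-mask idea)."""
    return (salience >= threshold).astype(float)

def masked_sgd_step(weights, grads, salience, threshold, lr=0.1):
    """Apply the gradient only to parameters salient to the current task;
    the masked parameters keep their values for future tasks."""
    return weights - lr * soft_prune_mask(salience, threshold) * grads
```

Because the mask is recomputed each training-pruning iteration, a parameter frozen for one task can be reclaimed by a later task whose salience for it exceeds the threshold.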

We utilize the optimal brain surgeon framework [LeCun2015Optimal][Hassibi2014Second] to measure the salience of a parameter. This approach prunes the parameters that contribute little to the loss. Given a well-trained model with parameters $W$ trained on input $X$ to reduce the error $E(W, X)$, if we set a parameter $w_{q}$ to zero, the resulting change in the error, $\Delta E$, indicates its salience: the larger $\Delta E$ is, the more important $w_{q}$ is. The Taylor expansion of the error is:

$$\Delta E = g^{\top} \Delta W + \frac{1}{2} \Delta W^{\top} H \Delta W + O\big(\|\Delta W\|^{3}\big),$$

where $H$ is the Hessian matrix of the parameters and $g$ represents the gradient on $W$. The gradient is close to zero when the model converges, so the first term on the right-hand side becomes too small to yield a precise value of the error change in response to a parameter perturbation; the second-order approximation is used instead. Therefore, an accurate value can be obtained whether or not the model has converged, which ensures that online pruning is effective throughout the training stage.

The calculation of the Hessian matrix is complex and computation-intensive [xu2015optimization]. In this paper, we introduce the diagonal Fisher information matrix [Pascanu2013Revisiting] to approximate the Hessian matrix. Its main advantage is that its computational complexity is linear in the number of parameters, and it can be computed quickly from gradients. However, the diagonalization may lead to a loss of precision; better results could be obtained with a more accurate Hessian approximation, which usually carries a higher computational burden.
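Combining the diagonal Fisher approximation with the second-order Taylor term gives a cheap per-parameter salience estimate. The following is a sketch under that assumption; the function names are ours, and the true Fisher requires gradients of the log-likelihood rather than arbitrary per-example gradients.

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Diagonal Fisher approximation to the Hessian: the mean squared
    per-example gradient, one value per parameter."""
    return np.mean(per_example_grads ** 2, axis=0)

def second_order_salience(weights, hessian_diag):
    """Estimated error increase when a parameter is zeroed, keeping only
    the second-order Taylor term: s_q = 0.5 * H_qq * w_q^2."""
    return 0.5 * hessian_diag * weights ** 2
```

Both operations are element-wise, so the cost is linear in the number of parameters, matching the complexity claim above.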

Fig. 2: Weight optimization process with momentum. When learning a task, we save the state of the weights as a memory checkpoint of the previous tasks. The actual step, denoted by a blue solid arrow, is composed of a gradient step (SGD) and a momentum step, denoted by a green solid arrow. The momentum step is related to the gradient decay, whose direction is opposite to that of the actual step, and to the memory step, whose direction is toward the memory checkpoint.

III-D Modified Synaptic Consolidation

III-D1 Momentum-based parameter updating.

To ensure that the end point of optimization does not stray far from the previous task when learning a new one, we designed a momentum-based updating policy that revises the gradient direction calculated via stochastic gradient descent. This policy is implemented as follows:

$$\Delta W = -\eta \nabla_{W} \mathcal{L}(W) + m,$$

where $\eta$ is the learning rate and $m$ is the memory momentum.
As illustrated in Figure 2, when the optimization point moves toward a new task, which is analogous to a ball rolling up a hill, three forces are related to its movement. The gradient step of the ball is driven by the force of the target function, which is calculated via classical stochastic gradient descent (SGD). The memory step is driven by the force that keeps the ball from leaving the previous memory checkpoint, which ensures the stability of the learning system. The gradient decay is the resistance, whose direction is opposite that of the actual step; it is the momentum of one parameter. Thus, we define the memory momentum as follows:

$$m_{ij} = \lambda \, \Omega_{ij} \big( w^{*}_{ij} - w_{ij} - \Delta w_{ij} \big),$$

where $\lambda$ is a hyper-parameter whose large values correspond to strong momentum, $\Omega_{ij}$ is the importance of the parameter $w_{ij}$, $w^{*}_{ij}$ is its value at the memory checkpoint, and $-\Delta w_{ij}$ is the gradient-decay component opposing the actual step. The frictional coefficient of the ball serves as an analogy: a parameter with great importance should be prevented from changing further.
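The element-wise interplay of the SGD step and the memory step can be sketched as follows. This is our interpretive sketch of the description above, not the paper's exact update rule; `momentum_consolidation_step` and `lam` are our names, and the gradient-decay component is folded into the importance-scaled pull toward the checkpoint.

```python
import numpy as np

def momentum_consolidation_step(w, grad, w_mem, omega, lr=0.01, lam=1.0):
    """Element-wise sketch of the momentum-revised update: the plain SGD
    step is combined with a memory step that pulls each weight back toward
    its checkpoint w_mem, scaled by its importance omega, so important
    weights barely move while unimportant weights learn freely."""
    sgd_step = -lr * grad
    memory_step = lr * lam * omega * (w_mem - w)  # toward the checkpoint
    return w + sgd_step + memory_step
```

With `omega = 0` the update reduces to plain SGD; with large `omega` the memory step dominates and the weight stays near its checkpoint.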

III-D2 Measuring parameter importance through the connectivity of neurons.

We design a novel method for calculating parameter importance by measuring the magnitude of the change in the target function when the connectivity state of two neurons changes. This method considers the structural knowledge of the model that relates to a task. Similar to the salience of a parameter, we use it to measure the influence of the connectivity of two neurons on the model. Most such measurements require labels, which limits their scope of application. To eliminate the need for labels, we use the information entropy to approximate the error, because the true distribution p and the predicted distribution q are close on a well-trained model. Thus, the parameter connectivity is defined in terms of:

$$\Omega_{ij} = \Big| H\big(q(X; W)\big) - H\big(q(X; W \mid w_{ij} = 0)\big) \Big|,$$

where $H(q)$ is the information entropy of the output distribution $q$. The strategy is to measure the steady state of a learning system using information entropy. We explain this strategy as follows: the output distribution of the model gradually evolves from a random state into a stable state, with decreasing entropy. When the model converges, the system performs stably on the training data, with low entropy and a known output distribution. Therefore, the entropy change is an effective substitute for the loss-function change when measuring the steady state of a learning system.
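The label-free connectivity measure can be sketched as the absolute change in output entropy when a connection is severed. In this sketch a one-layer softmax model stands in for the network, and all names are ours; the paper applies the idea to deep networks.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(p, eps=1e-12):
    """Average Shannon entropy of a batch of output distributions."""
    return float(-np.sum(p * np.log(p + eps), axis=-1).mean())

def connectivity_importance(weights, x, i, j):
    """Label-free importance of the connection from neuron i to neuron j:
    the absolute change in output entropy when that connection is cut."""
    h_full = mean_entropy(softmax(x @ weights))
    cut = weights.copy()
    cut[i, j] = 0.0  # sever the connection between neurons i and j
    h_cut = mean_entropy(softmax(x @ cut))
    return abs(h_cut - h_full)
```

No labels appear anywhere in the computation: the measure depends only on how strongly the connection shapes the model's output distribution.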

Given a set of t+1 tasks, we calculate the importance $\Omega^{t+1}_{ij}$ of the parameter $w_{ij}$ after learning the $(t+1)$-th task, where $i$ and $j$ index the two neurons connected by the parameter.


According to Eq. (5), the direction of the gradient decay is always opposite that of the actual step; thus, we set negative values to zero. After learning task t+1, we sum the importance over the previous tasks to obtain the accumulated value:

$$\Omega_{ij} = \sum_{t'=1}^{t+1} \Omega^{t'}_{ij}.$$
We present our algorithm in Algorithm 1.

Require: W*: old task parameters; W: new task parameters; (X, Y): training data and ground truth for the new task; T: total number of tasks; H(q): information entropy of the output; H: Hessian matrix; β: salience threshold for pruning
1:  for t = 1 to T do
2:     W* ← W                          // update the old task parameters (memory checkpoint)
3:     compute Ω^{t-1} from the entropy change  // importance of the parameters of the t-1 previous tasks
4:     Ω ← Ω + Ω^{t-1}                 // cumulative importance computation
5:     q ← f(X; W)                     // new-task output
6:     update W with the momentum policy       // update the new task parameters
7:     revise the gradients with the memory momentum  // update the gradients
8:     mask parameters whose salience is below β      // online prune parameters
9:  end for
Algorithm 1: Pseudocode for overcoming catastrophic forgetting with ANPSC

IV Experiments and Analysis

IV-A Experimental Setting

We tested the proposed method on four kinds of tasks: image classification with a convolutional neural network (CNN) and a multi-layer perceptron (MLP), a long sequence of incremental classification tasks, a generative task with a variational autoencoder (VAE) model, and a generative adversarial network (GAN).

Data – In the image classification task, permuted MNIST [Srivastava2014Compete] or split MNIST [Lee2015Overcoming] is applied to the MLP. The Cifar10 [Krizhevsky2009Learning], NOT-MNIST [Bulatov2011Notmnist], SVHN [Netzer1989Reading], and STL-10 [Coates2015An] datasets, which are all sets of 32×32-pixel RGB images, are chosen. For long-term incremental learning tasks, Cifar100 [Krizhevsky2009Learning] is used for a medium-scale network model, and Caltech101 [Fei-Fei2006One-shot] is used for a large-scale network model (shown in the supplement). In the generative task, CelebA [Liu2018Large-scale] and anime faces crawled from the web are selected as test data; both databases share the same resolution. For the generative adversarial network, we choose three categories of SVHN [Netzer1989Reading] as a sequence of tasks.

Baseline – We compared our method with state-of-the-art methods, including LWF [Li2017Learning], EWC [Kirkpatrick2016Overcoming], SI [Zenke2017Continual] and MAS [Aljundi2017Memory], and with classic methods, including standard SGD with a single output layer (single-headed SGD), SGD with multiple output layers, SGD with frozen intermediate layers (SGD-F), and SGD with fine-tuned intermediate layers (fine-tuning). We defined multitask joint training with SGD (Joint) [yuan2012visual] as the baseline for evaluating the difficulty of a sequential task.

Evaluation – We utilize the average accuracy (ACC), forward transfer (FWT), and backward transfer (BWT) [Lopez-Paz2017Gradient] to estimate the model performance: (1) ACC, for evaluating the average performance across tasks; (2) FWT, for describing the suppression of former tasks on later tasks; and (3) BWT, for describing the forgetting of previous tasks. Evaluating the difficulty of an individual task against a model obtained by multitask joint training [yuan2012visual] is more objective than evaluating against a model trained on that single task; therefore, we propose a modified version. Given $T$ tasks, we evaluate the previous $t$ tasks after training on the $t$-th task. Denoting the result of task $i$ tested on the task-$t$ model as $R_{t,i}$, and the accuracy on task $i$ through joint learning as $J_{i}$, we use three indicators:

$$\mathrm{ACC} = \frac{1}{T} \sum_{i=1}^{T} R_{T,i}, \qquad \mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \big( R_{T,i} - R_{i,i} \big), \qquad \mathrm{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} \big( R_{i,i} - J_{i} \big).$$
A higher value of ACC corresponds to superior overall performance, and higher values of BWT and FWT correspond to a better trade-off between memorizing previous tasks and learning new ones.
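Given a matrix of per-task test accuracies, the three indicators can be computed as follows. This is a GEM-style sketch consistent with the description above; the paper's modified formulas may differ in detail, and `continual_metrics` is our name.

```python
import numpy as np

def continual_metrics(R, joint_acc):
    """GEM-style indicators from a result matrix R, where R[t, i] is the
    accuracy on task i after training through the (t+1)-th task, and
    joint_acc[i] is task i's accuracy under multitask joint training."""
    T = R.shape[0]
    acc = R[-1].mean()                                            # average final accuracy
    bwt = np.mean([R[-1, i] - R[i, i] for i in range(T - 1)])     # forgetting of old tasks
    fwt = np.mean([R[i, i] - joint_acc[i] for i in range(1, T)])  # vs. the joint baseline
    return acc, bwt, fwt
```

Negative BWT indicates forgetting; negative FWT indicates that earlier tasks suppressed the learning of later ones relative to joint training.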

Training – All models share the same network structure with a dropout layer [Goodfellow2013An]. We initialized all parameters of the MLP with random Gaussian distributions with the same mean and variance, and we applied Xavier initialization for the CNN. We optimized the models with SGD with an initial learning rate of 0.1, 0.01, or 0.001, a decay ratio of 0.96, and a uniform batch size. We trained models with a fixed number of epochs and global hyperparameters for all tasks. We identified the optimal hyperparameters by greedy search. The threshold β is uniformly set to 5%.

TABLE I: Sequential learning on split MNIST.

Method                           FWT(%)   BWT(%)   ACC(%)
SGD                               -0.31   -34.01    61.53
SGD-F                            -18.6    -12.9     84.82
Fine-tuning                       -0.29   -13.9     82.04
EWC [Kirkpatrick2016Overcoming]   -4.99    -6.43    88.75
SI [Zenke2017Continual]           -6.19    -3.51    90.67
MAS [Aljundi2017Memory]           -4.38    -2.08    94.09
LWF [Li2017Learning]              -4.42    -2.04    94.08
Joint [yuan2012visual]               /        /     99.87
Ours                              -0.44    -0.75    98.31

IV-B Experimental Results and Analysis

IV-B1 Sequential learning on split MNIST and permuted MNIST by MLP

We divided the data into 5 subdatasets and trained an MLP with 784-512-256-10 units. In Table I, we present the experimental results on split MNIST. Not all continual learning strategies perform well on all indices. Fine-tuning and SGD perform best on FWT because no free capacity needs to be reserved for subsequent tasks, and some features may be reused to improve the learning of new tasks if the tasks are similar. LWF, MAS and SI perform well on BWT and ACC, and our method achieves the best performance on both indices among all methods except joint learning. We conclude that the jointly trained model learns general features from multiple datasets and hence implicitly benefits from data augmentation. Our results in terms of ACC and FWT rival the best single-index results. In addition, our model suffers the least forgetting as measured by BWT, with only a 1.5% reduction in ACC after learning 10 tasks. Overall, our method outperforms the other eight approaches.

TABLE II: Sequential learning on 10 permuted MNIST tasks.

Method                           FWT(%)   BWT(%)   ACC(%)
SGD                                1.11   -18.05    70.45
SGD-F                            -14.90     0.10    81.99
Fine-tuning                        0.75    -6.21    80.69
EWC [Kirkpatrick2016Overcoming]   -0.98    -2.57    91.97
SI [Zenke2017Continual]           -0.56    -4.40    90.21
MAS [Aljundi2017Memory]           -1.23    -1.61    92.6
LWF [Li2017Learning]               0.67   -24.02    74.15
Joint [yuan2012visual]                /        /    95.05
Ours                               2.33    -3.22    94.51

We evaluate our method on 10 permuted MNIST tasks. In Table II, we present the results of our approach and those of others. As expected, our method performs best on FWT; it even outperforms SGD, which we attribute to some lower-layer features being shared by new tasks and to sufficient capacity relieving the pressure of the capacity demand of new tasks. SGD-F obtains the highest score on BWT because its fixed parameters protect the parameters of previous tasks from being overwritten, at the cost of reduced flexibility in learning new tasks. LWF performs worse on permuted MNIST than on split MNIST despite a satisfactory score there, which may be attributed to the change in the dataset, as discussed above for FWT. Our method performs comparably in terms of ACC.

IV-B2 CNN and image recognition

We test our method on natural image datasets using VGG [Simonyan2014Very] with 9 layers and a batch normalization layer to prevent gradient explosion. We train and test on MNIST, notMNIST, SVHN, STL-10 and Cifar10 sequentially; the datasets have been processed to the same numbers of training images and categories (50,000 and 10, respectively). Overall, our method achieved the best performance in terms of FWT, BWT and ACC. According to Figure 3, our method realizes an FWT that is almost one-third of those of LWF and MAS. Thus, the proposed method alleviates the memory dilemma well, and its test accuracy is close to the baseline. Our method also obtains the top result on BWT; hence, it ensures that the network retains the ability to handle previous tasks. On ACC, our method realized performance comparable to multitask joint training; hence, the network effectively trades off capacity among tasks. The result of fine-tuning is better than that of SGD; thus, using an independent classifier for each task can reduce forgetting. We speculate that this is because the features of tasks at the high layers are highly entangled, and individual classifiers can slightly alleviate this situation.

Fig. 3: Performances of various methods for overcoming catastrophic forgetting on a sequence of image datasets. The method that is based on regularization has an effect starting from EWC, although the effect is limited; MAS and LWF are close. Our method achieves the best performance on all the indicators.
Fig. 4: Performance of ANPSC under various hyperparameter values. The horizontal axis represents the value of the hyper-parameter λ; the vertical axis represents the results in terms of the three indicators. The dotted black line indicates the accuracy baseline.
Fig. 5: Overcoming catastrophic forgetting from the face dataset to the anime dataset using VAE. To guarantee the objectivity of the results, we utilize various data and various network structures. left: The test sample of human faces with a generator from the human face dataset; middle: the test sample of human faces with a generator after training from celebA to the anime face dataset, without using our approach; right: the test sample of human faces with a generator after training from celebA to the anime face dataset using our approach.

IV-B3 Robustness analysis

To test the stability of our method with respect to the hyperparameter λ, we test the method under various values of λ in the above experimental setting. The results show that our method is robust to hyperparameter variation over a range of values. According to Figure 4, when λ is 0.01, the network is almost impervious to the resistance of previous tasks, which means that no capacity is assigned to previous memory; in this case, all three indicators are extremely poor, and the proposed method behaves almost identically to SGD. When λ reaches 0.1, the proposed method achieves relatively satisfactory performance, with substantial improvement on all three indicators. If λ is in the range of 0.5 to 4, the performance is relatively stable, and the proposed method achieves its best performance in this range. As λ continues to rise, the network memorizes too much, which leaves it lacking the capacity to learn new tasks; hence, the performance on new tasks drops below what training from scratch would achieve.

Fig. 6: Performance of C-GAN with multitask joint training, ANPSC and SGD. The images in the first row are the samples that were generated by C-GAN with joint training. The images in the second row are the samples that were generated by C-GAN with SGD. The third row presents the samples that were generated with ANPSC.

IV-B4 Continual learning in VAE

To evaluate the generalization performance of our method, we apply it to a variational autoencoder (VAE). We carry out tasks on human faces and anime faces and resize the samples of the two datasets to the same size of 96×96. We set up a VAE with a conv-conv-fc encoder and an fc-deconv-deconv decoder. We use a separate latent variable to train each single task, which is essential because of the significant difference between the distributions of the two datasets.

We trained models via three approaches: (1) training on the CelebA dataset from scratch; (2) training on CelebA and, subsequently, on the anime face dataset with SGD; and (3) training on CelebA and, subsequently, on the anime face dataset with ANPSC. In Figure 5, we present samples of human faces that were produced by the three models. The results demonstrate that our approach preserves the skill of human face generation well while learning anime faces. The model with ANPSC performs as well as the model that was trained on CelebA alone, whereas the model with SGD loses this ability. This finding indicates that ANPSC generalizes well across MLP, CNN and VAE architectures.
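As a rough sketch of the separate-latent design, the toy NumPy model below routes each task through its own (mu, log-variance) head while sharing the encoder and decoder trunks; the fully connected layers, sizes and names are simplifying assumptions (the actual model is convolutional).

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, D_LAT = 96 * 96, 128, 32   # flattened 96x96 input; sizes are assumptions

def make_head():
    # One (mu, log-variance) head per task: the face and anime distributions
    # differ too much to share a single latent variable.
    return {"W_mu": rng.normal(0, 0.01, (D_HID, D_LAT)),
            "W_lv": rng.normal(0, 0.01, (D_HID, D_LAT))}

W_enc = rng.normal(0, 0.01, (D_IN, D_HID))   # shared encoder trunk
W_dec = rng.normal(0, 0.01, (D_LAT, D_IN))   # shared decoder trunk
heads = {"celeba": make_head(), "anime": make_head()}

def vae_forward(x, task):
    h = np.tanh(x @ W_enc)                                         # encode
    mu, logvar = h @ heads[task]["W_mu"], h @ heads[task]["W_lv"]
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterize
    x_hat = 1.0 / (1.0 + np.exp(-(z @ W_dec)))                     # decode with sigmoid
    recon = np.mean((x - x_hat) ** 2)                              # reconstruction term
    kl = -0.5 * np.mean(1 + logvar - mu ** 2 - np.exp(logvar))     # KL to N(0, I)
    return x_hat, recon + kl

x = rng.uniform(0, 1, (4, D_IN))        # a dummy batch of flattened images
x_hat, loss = vae_forward(x, "celeba")  # route through the CelebA head
print(x_hat.shape, float(loss))
```

Only the per-task heads are duplicated; ANPSC's consolidation would then act on the shared trunks, which is where forgetting between the two tasks can occur.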

IV-B5 Continual learning in GAN

We further apply the ANPSC to a generative adversarial network[goodfellow2014generative]. We assume that the model sequentially learns several datasets; the model should then be able to generate images that belong to any specified dataset. To achieve this goal, we train C-GAN[mirza2014conditional] on SVHN[Netzer1989Reading], because the C-GAN is equipped with a classifier and labels, which can be used to control the generation according to the order of the tasks. To evaluate the performance of ANPSC in terms of long-term memory, we sequentially train a model on the digit classes from digit 0 to digit 9 and separately test the model on the previous 5 tasks. Joint training serves as the performance ceiling, and SGD is used for comparison. Figure 6 presents the results of C-GAN with ANPSC. The model still memorizes most of the knowledge of previous tasks and generates all 5 digits well, similar to joint training. We conclude that ANPSC performs well on the generative adversarial network and is an effective approach for alleviating catastrophic forgetting.

Fig. 7: Distributions of the parameter connectivity. Left: the distribution of the parameter-importance measure for an MLP trained on Permuted-MNIST; middle: the distribution for Vgg9 trained on MNIST; right: the distribution for Vgg9 trained on CIFAR10. The horizontal axis is the connectivity value and the vertical axis is the density; the blue solid line is the result of the Fisher information matrix used in EWC, the orange solid line is the result of MAS, and the green solid line is the result of our method.
Fig. 8: Parameter space similarity and change analysis on Permuted-MNIST sequential tasks. Each red line corresponds to our method, each blue line to fine-tuning and each green line to standard SGD with a single head. (a): overall average accuracy on 6 Permuted-MNIST sequential subtasks; (b): similarity of the parameter space; (c): the parameter variance between the parameters of tasks.
Fig. 9: Visualizations of the connectivity and variance of parameters. The horizontal axis represents the neurons of the output layer, the vertical axis represents the neurons of the input layer, and each element represents the connection between the neurons of the input and output layers. Left: the variance of parameters between two tasks. The colder the color is, the smaller the variance is; right: the connectivity of parameters of the first task. The warmer the color is, the more significant the parameter is.

V Discussion

V-A Analysis of parameter-connectivity

In Figure 7, we plot the distributions of parameter importance that were obtained by the three methods. The results show that a concentrated and polarized importance distribution contributes to overcoming catastrophic forgetting. The left plot shows that the distribution produced by our method is sharply peaked at both low and high importance on the MLP model, and the middle and right plots show similarly polarized results on the CNN model compared with the other methods. Since the distribution is concentrated and polarized across various models and datasets, our method distills previous knowledge into fewer parameters and frees more parameters for learning new tasks.

V-B Parameter space similarity and changing analysis

We conducted six tasks with ANPSC on Permuted-MNIST and analyzed the experimental results in comparison with single-headed SGD and multi-headed fine-tuning:

  1. The evolution of the overall average accuracy is shown in Figure 8(a), which indicates that our method is more stable and achieves more accurate results as the number of tasks increases;

  2. To determine whether the model can efficiently preserve its memory of previous tasks, we utilize the Frechet distance [frechet1906quelques] to measure the similarity of the parameter distributions between the first task and the last task; see Figure 8(b). The F value of our method is far greater than those of the other two methods; hence, our method can effectively control parameter updates according to importance. The F values are greater in deeper layers of the networks; thus, forgetting occurs mainly in deeper layers, and strengthening the protection of parameters in deep layers may help tremendously in tackling catastrophic forgetting;

  3. In Figure 8(c), we utilize the weighted sum of the squares of the differences between the first and the last task to measure the parameter change. The finding that parameters in deeper layers change less shows that the consolidation of the shallow layers is more flexible. In addition, the fluctuation of parameters under our method is much larger than under the other methods. Thus, our method preserves former memories while retaining higher network capacity for learning new tasks.
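The two diagnostics above, the Frechet distance between parameter distributions and the importance-weighted squared parameter change, can be sketched as follows; fitting 1-D Gaussians per layer and the synthetic parameter samples are simplifying assumptions, not the paper's exact computation.

```python
import numpy as np

def frechet_1d(a, b):
    # Frechet distance between 1-D Gaussian fits of two parameter samples:
    # d = sqrt((mu_a - mu_b)^2 + (sigma_a - sigma_b)^2).
    return float(np.sqrt((a.mean() - b.mean()) ** 2 + (a.std() - b.std()) ** 2))

def weighted_change(theta_first, theta_last, omega):
    # Importance-weighted sum of squared parameter differences (cf. Fig. 8(c)).
    return float(np.sum(omega * (theta_last - theta_first) ** 2))

rng = np.random.default_rng(0)
theta_first = rng.normal(0.0, 1.0, 10_000)                # layer weights after task 1
theta_last = theta_first + rng.normal(0.0, 0.1, 10_000)   # weights after the final task
omega = rng.uniform(0.0, 1.0, 10_000)                     # per-parameter importance

print("Frechet distance :", frechet_1d(theta_first, theta_last))
print("weighted change  :", weighted_change(theta_first, theta_last, omega))
```

Computing both quantities per layer, as in Figure 8, is what reveals that the deeper layers change least while carrying the largest distributional shift.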

V-C Visualization analysis

We visualize the negative absolute values of the parameter changes (left of Figure 9) and compare them with the connectivity of the parameters (right). The results demonstrate that our method prevents significant parameters from being updated and fully utilizes nonsignificant parameters to learn new tasks. In Figure 9, the parameters inside the black dotted rectangle of the first row, which are warm-colored in the connectivity map, change little. In contrast, the parameters in the second column of the right picture change substantially because they are unimportant to previous memorization. Thus, our method precisely captures the significant parameters and prevents them from being updated, thereby preventing forgetting.

VI Conclusions and future work

Long-term catastrophic forgetting limits the application of neural networks in practice. In this paper, we analyzed the causes of long-term catastrophic forgetting in neural networks: the shrinking of the shared parameter subspace of tasks and the accumulated error of weight consolidation as tasks arrive. We proposed the adversarial neural pruning and synaptic consolidation (ANPSC) approach to overcome long-term catastrophic forgetting. This approach balances the short-term and long-term profits of a learning model through online weight pruning and a revised weight consolidation. The calculation of parameter saliency is similar to that of the optimal brain surgeon[Hassibi2014Second]; however, our method releases pruned parameters for later tasks instead of discarding them. In addition, we assume that the structural knowledge of the model is significant and measure it via neuron connectivity, which provides a new perspective from which to represent network knowledge. The experimental results demonstrate several advantages of our method:

  1. Efficiency: our approach performs competitively across a variety of datasets and tasks;

  2. Robustness: our approach has low sensitivity to hyperparameters;

  3. Universality: our approach can be extended to generative models.

The evidence suggests that finding an approximate solution for a sequence of tasks is effective in alleviating the memory dilemma. Online neural pruning is not the only approach for achieving this solution; other methods, such as knowledge distillation, are also feasible. We conclude that the concentration and polarization properties of the parameter distribution are significant for overcoming long-term catastrophic forgetting. Protecting some parameters via a measurement based on a single strategy is not entirely effective. We suggest that well-structured constraints for controlling parameter behavior or well-designed patterns of parameter distributions may be crucial to the satisfactory performance of a model in overcoming forgetting. In addition, research on human brain memory provides a potential approach for solving this problem [Hassabis2017Neuro]. The problem of overcoming catastrophic forgetting remains open.

Appendix A Incremental learning

A-A Large-scale dataset from Caltech-101

To evaluate the performance of our method on a larger dataset, we randomly split the Caltech-101 dataset into 4 subsets with 30, 25, 25, and 22 classes and divided each subset into training and validation sets according to a 7:3 ratio. In preprocessing, we resized the images to [224, 224, 3], normalized the pixels into [0, 1] and randomly flipped the images horizontally to augment the data. We employed ResNet-18 as the basic network. Because the categories of the four subsets are not consistent, we added a new separate classifier and a fully connected layer before the classifier for each task. Each new fc layer has 2048 units, and the dropout rate is set to 0.5. The number of epochs and the batch size of every task are 100 and 128, respectively. The initial learning rate is set to 0.001 and decays to 90% of its value every 100 epochs. To prevent overfitting, we randomly select the hyperparameter in the range from 0.5 to 30. Due to the inconsistent numbers of categories in the four subsets, we do not compare our method with SI.

A well-functioning model is expected to be stable under abrupt changes of tasks. To evaluate the stability of the model on unseen tasks, we designed an indicator, namely, SMT, as follows:


where the variance of a single task over the course of sequential learning reflects the performance fluctuations of that task.

A-B Long sequence for CIFAR100

As shown in Figure 11, none of the current methods performs well on large-scale datasets as the number of tasks increases. On the fourth task, the ACC of our method is less than that of SGD-F, but it outperforms EWC, MAS and LWF. In terms of SMT, when the model learns the second and third tasks, our method is outperformed by SGD and MAS. On the fourth task, our method performs better than all the remaining methods in terms of BWT and SMT; hence, our method can preserve the memory of longer task sequences and has higher stability. In terms of FWT, our method outperforms all the methods except MAS. Overall, our method outperforms the state-of-the-art regularization-based methods.

Fig. 10: Performance of C-GAN with ANPSC and training a single task with SGD. The images in the first row are the samples that were generated by C-GAN with ANPSC in various epochs. The images in the second row are the samples that were generated by C-GAN with SGD for a single task in various epochs.

The results in Figure 12 demonstrate that it remains difficult to construct models that are capable of long-term memory, especially on complex tasks. Our method yielded similar overall performance to the other regularization methods; however, SGD-F and fine-tuning outperformed it when the number of learned tasks was large, and LWF almost lost its learning ability. On BWT, our method and MAS achieved better results. SGD-F performed best at preventing forgetting because its weights were completely fixed. LWF shows a higher BWT; however, this figure is meaningless due to the loss of learning ability. On FWT, our method realized the best results; thus, our method has little impact on the learning of new tasks while preserving previous knowledge.

Fig. 11: Performance on the subsets of Caltech-101. The x-axis denotes the sequential tasks trained on ResNet-18, and the y-axis denotes the indicators ACC, SMT, FWT and BWT. We present the negative values of FWT and BWT in the figures.
Fig. 12: Performance in incremental learning on CIFAR100. The x-axis denotes the sequential tasks trained on Vgg9. Each task contains 5 categories. The y-axis denotes the indicators ACC, FWT and BWT.

Appendix B Sequentially generating new categories

We apply the ANPSC to generate new categories sequentially instead of learning them jointly with old categories. To achieve this goal, we train C-GAN[mirza2014conditional] on SVHN[Netzer1989Reading]. We sequentially train the model on digit 0, digit 1 and digit 2 and separately test the model on the 3 tasks. Figure 10 presents the results of C-GAN with ANPSC. The results show that the model generates all 3 digits well, similar to training on a single task.

Appendix C Model compression

We compress the LeNet[lecun1998gradient] that was trained on MNIST. The maximum number of epochs is set to 50, the batch size to 100 and the learning rate to 0.01. We calculate the importance of the parameters after training and prune the insignificant parameters according to an importance threshold. We conduct this procedure 5 times sequentially with various thresholds, with the best values found to be [0.8, 0.7, 0.5, 0.4, 0.1]. In Table III, the experimental results show that the model compressed with ANPSC balances a high compression ratio with low accuracy loss.
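The iterative threshold pruning described above can be sketched as follows; the random importance scores stand in for ANPSC's connectivity measure, and treating each threshold as a per-iteration pruning fraction is an assumption about the procedure.

```python
import numpy as np

def prune_step(weights, importance, prune_frac):
    # Zero out the prune_frac lowest-importance parameters among those that
    # are still alive (nonzero); already-pruned parameters stay pruned.
    alive = weights != 0
    thresh = np.quantile(importance[alive], prune_frac)
    return np.where(alive & (importance < thresh), 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 51200)              # e.g. LeNet's W2 parameter block
imp = np.abs(rng.normal(0, 1, w.size))   # stand-in importance scores

for i, frac in enumerate([0.8, 0.7, 0.5, 0.4, 0.1], start=1):
    w = prune_step(w, imp, frac)
    nonzero = int(np.count_nonzero(w))
    print(f"iter {i}: {nonzero} params kept, {w.size / nonzero:.2f}x compression")
```

As a sanity check on the totals in Table III, 1789568 / 31332 ≈ 57.1x, matching the compression reported at iteration 5.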

Prune iters       Original model   iter 1    iter 2    iter 3    iter 4    iter 5
param W1          800              392       271       184       134       129
param b1          32               7         3         2         2         2
param W2          51200            27923     18582     14182     11593     10220
param b2          64               13        4         2         2         2
param W3          1605632          310500    87135     38756     21978     19565
param b3          512              100       30        15        9         9
param W4          131072           24075     6895      2674      1565      1400
param b4          256              51        16        8         5         5
total params      1789568          363061    112936    55823     35288     31332
compressed times  /                4.93x     15.85x    32.06x    50.71x    57.12x
prune ratio       /                79.71%    93.69%    96.88%    98.03%    98.25%
test acc          98.94%           98.87%    98.87%    98.83%    98.67%    98.55%