Towards Better Plasticity-Stability Trade-off in Incremental Learning: A Simple Linear Connector

The plasticity-stability dilemma is a central problem in incremental learning: plasticity refers to the ability to learn new knowledge, and stability to retaining the knowledge of previous tasks. Because training samples from previous tasks are unavailable, it is hard to balance plasticity and stability. For example, the recent null-space projection methods (e.g., Adam-NSCL) show promising performance at preserving previous knowledge, but such a strong projection also degrades performance on the current task. To achieve a better plasticity-stability trade-off, we show in this paper that simply averaging two independently optimized optima of the network, one obtained with null-space projection for past tasks and one with plain SGD for the current task, attains a meaningful balance between preserving already learned knowledge and granting sufficient flexibility to learn a new task. This simple linear connector also provides a new perspective on, and a means of, controlling the trade-off between plasticity and stability. We evaluate the proposed method on several benchmark datasets. The results indicate that our simple method achieves notable improvement and performs well on both past and current tasks. In short, our method is an extremely simple approach that yields a better-balanced model.

Introduction

In recent years, deep neural networks have shown promising performance on various tasks. In a dynamic world, a deployed model also needs to be updated as new data becomes available. Hence, Incremental Learning (IL) Delange et al. (2021); Mundt et al. (2020), which studies the problem of continually learning from sequential tasks, has received much attention.

The main criterion of IL Douillard et al. (2020); Rebuffi et al. (2017) is that only a little or no data from previous tasks is stored while the model is continually refined as new data becomes available. This restriction is a direct cause of the catastrophic forgetting problem Li et al. (2019), and the plasticity-stability dilemma Chaudhry et al. (2018); Mermillod, Bugaiska, and Bonin (2013) is the more general problem: (1) plasticity: the model should learn the new knowledge of the current task, and (2) stability: it should also preserve the knowledge of previous tasks.

Many algorithms have been proposed to strike a balance between plasticity and stability. An intuitive solution is to store data from previous tasks; the replay methods Isele and Cosgun (2018) use such a strategy. For example, iCaRL Rebuffi et al. (2017) stores a small number of samples per class, and ILCAN Xiang et al. (2019) generates samples to preserve old knowledge. The regularization-based methods add an extra regularization term to the loss function to consolidate previous knowledge; for example, EWC Kirkpatrick et al. (2017) uses the Fisher information to estimate each parameter's importance, and LwF Li and Hoiem (2018) uses a distillation loss to preserve the old model's outputs. The architectural methods Li et al. (2019) adapt the architecture of the deep network; e.g., DER Yan, Xie, and He (2021) freezes the previously learned representation and dynamically expands the network for each new task. The algorithm-based methods design parameter-update rules that preserve the performance of previous tasks. For example, GEM Lopez-Paz and Ranzato (2017) constrains new-task updates so that they do not interfere with previous knowledge. Adam-NSCL Wang et al. (2021) updates network parameters in the null space of all previous tasks and achieves promising performance at remembering previous knowledge. Although Adam-NSCL preserves previous knowledge very well, its strong null-space projection also hurts performance on the current task.

On the other hand, much research has focused on connectivity in neural-network loss landscapes Frankle et al. (2020); Fort and Jastrzebski (2019). A deep network has many local minima, and previous work Choromanska et al. (2015) shows that these local minima give very similar performance. Further, Garipov et al. (2018) found that the minima of two independently trained deep networks can be connected by a path in weight space along which the loss remains low. Frankle et al. (2020) and Wortsman et al. (2021) showed that a high-accuracy linear path connects two minima when the networks share only a few epochs of the initial SGD trajectory.

Inspired by this connectivity of deep neural networks, we first view plasticity and stability as two independent optimization problems. Hence, we create two copies of the network, both initialized with the previously learned model: one for plasticity and one for stability. To preserve previous knowledge, we use an existing null-space projection method, e.g., Adam-NSCL, to update one copy. We then train the other copy on the newly arrived data, considering only the new task. One network thus performs well on previous tasks and the other focuses on the new task.

We then propose a simple linear connector to strike a better balance between these two networks. The interesting finding of this paper is that simply averaging the two networks yields a better trade-off solution and a higher-accuracy model. Hence, this simple linear connector of two networks gives a better plasticity-stability trade-off and provides new insight for understanding, analyzing, and controlling this trade-off.

Preliminaries and Related Methods

Incremental Learning Methods

We review several categories of the existing deep incremental learning methods for plasticity-stability trade-off.

Regularization-based methods: This line of approaches introduces an extra regularization term to balance the trade-off. Depending on whether the regularization term is applied to the model's parameters or to its outputs, these methods can be further divided into structural and functional regularization methods Mundt et al. (2020). Structural regularization methods constrain changes to the model's parameters; for example, EWC Kirkpatrick et al. (2017), SI Zenke, Poole, and Ganguli (2017), MAS Aljundi et al. (2018) and UCL Ahn et al. (2019) explicitly add a regularization term on the network's parameters. The functional regularization methods, also known as distillation-based methods, use the distillation loss between the predictions of the previous model and the current model as the regularization term. Representative works include LwF Li and Hoiem (2018), EBLL Rannen et al. (2017), GD-WILD Lee et al. (2019), etc.

Rehearsal methods: This line of work preserves existing information by replaying data from previous tasks. Some algorithms store a subset of previous data, e.g., iCaRL Rebuffi et al. (2017) and GeppNet Gepperth and Karaoguz (2016). Since storage space is limited, it is important to find a suitable subset of data that approximates the entire data distribution; e.g., SER Isele and Cosgun (2018) focuses on exemplar-selection techniques. Another way to address this limitation is to use generative modelling approaches Xiang et al. (2019) to generate samples of previous tasks. For example, DGR Shin et al. (2017) is a framework combining a deep generative model with a task-solving model.

Architectural methods: These methods modify the underlying architecture to alleviate catastrophic forgetting; e.g., HAT Serrà et al. (2018) proposes task-based binary masks that preserve previous tasks' information, and UCB Ebrahimi et al. (2020) uses uncertainty to identify what to remember and what to change. Dynamic-growth approaches Yan, Xie, and He (2021) have also been proposed; e.g., DEN Yoon et al. (2017) dynamically expands the network capacity when a new task arrives, and Learn-to-Grow Li et al. (2019) modifies the architecture via explicit neural structure learning.

Algorithm-based methods: These methods carefully design the network parameter-update rule so that new-task updates do not interfere with previous tasks. GEM Lopez-Paz and Ranzato (2017) and A-GEM Chaudhry et al. (2018) are two representative works. OWM Zeng et al. (2019) is an orthogonal weight modification method for overcoming catastrophic forgetting.

The most similar work is Adam-NSCL Wang et al. (2021), which uses the null space of all previous data to remember the existing knowledge. The main problem of Adam-NSCL is that such a strong constraint hurts performance on the current task. Our method can be viewed as an extension of Adam-NSCL that achieves a better-balanced model over both previous and current tasks.

Nullspace Projection

Nullspace Projection. Consider a linear model $w$ and a dataset $X = \{x_1, \dots, x_n\}$; the prediction for a data point $x_i$ is the dot product $\hat{y}_i = w^\top x_i$. If we can find an update direction $g$ such that $g^\top x_i = 0$ for all data points, we have $(w + \eta g)^\top x_i = w^\top x_i$, where $\eta$ is any stepsize.

This suggests a simple method for remembering the knowledge of the learned model when we use gradient descent, e.g., SGD, to update it. Formally, to find the null space of a data matrix $X$, let $U_0$ be the set of eigenvectors of $X^\top X$ whose corresponding eigenvalues are zero, that is, $X^\top X U_0 = 0$. With that, we can obtain the projection matrix $P = U_0 U_0^\top$. Hence, we can update the model while remembering the knowledge of $X$ (i.e., keeping $w^\top x_i$ unchanged) as

$$w^{(k+1)} = w^{(k)} - \eta P g_k, \qquad (1)$$

where $k$ is the iteration index and $g_k$ is any gradient.
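
To make Eq. (1) concrete, below is a minimal NumPy sketch, assuming a plain linear model and a toy data matrix; the function name `nullspace_projection` and the synthetic data are illustrative choices, not part of the original method.

```python
import numpy as np

def nullspace_projection(X, tol=1e-10):
    """Projection matrix onto the null space of the data matrix X (rows are samples)."""
    cov = X.T @ X                            # uncentered covariance X^T X
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    U0 = eigvecs[:, eigvals < tol]           # eigenvectors with (near-)zero eigenvalues
    return U0 @ U0.T                         # P = U0 U0^T

# Toy check: a projected update leaves the predictions on X unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))                # 20 samples in 50-d -> non-trivial null space
w = rng.normal(size=50)
g = rng.normal(size=50)                      # an arbitrary gradient direction
P = nullspace_projection(X)
w_new = w - 0.1 * (P @ g)                    # Eq. (1) with stepsize 0.1
print(np.allclose(X @ w, X @ w_new))         # True: the old predictions are preserved
```

The same idea is applied per layer in Adam-NSCL, with the covariance of layer input features playing the role of $X^\top X$.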

Adam-NSCL. Adam-NSCL is also based on the null-space projection and achieves impressive performance on IL tasks. Here, we give a brief review of Adam-NSCL. In the continual learning setting, we have a model $w_{t-1}$ trained on the previous data $\mathcal{D}_{1:t-1}$. Due to privacy issues or storage constraints, $\mathcal{D}_{1:t-1}$ is not available when training the new task. To overcome this problem, Adam-NSCL stores the uncentered feature covariance $\mathcal{X}_{t-1}$ to guarantee stability. It then uses the SVD of the feature covariance to find the null space of $\mathcal{X}_{t-1}$, spanned by the columns of $U_{t-1}$, and the projection matrix is obtained as $P_{t-1} = U_{t-1} U_{t-1}^\top$.

Now, when the new data $\mathcal{D}_t$ is available, the model is updated to learn the new task as

$$w^{(k+1)} = w^{(k)} - \eta P_{t-1} g_k, \qquad (2)$$

where $\eta$ is the stepsize and $g_k$ is the gradient computed only on the new data.

Adam-NSCL preserves previous knowledge very well, but the updates for the new task are limited by the strong null-space projection.

Linear Model Connectivity

Optimizing a neural network involves finding a minimum of a high-dimensional non-convex objective landscape, where some form of stochastic gradient descent (SGD) is used to learn the parameters of the deep network. Since the objective is non-convex, there are many local minima. Given a deep network with an initial weight $w_0$, the weight is iteratively updated, and the learnt weight at epoch $k$ is denoted as $w_k$. Two copies of the network can be trained (e.g., using different data augmentations or projections), producing two optimized weights $w_A$ and $w_B$.

Recently, a lot of work Wortsman et al. (2021); Choromanska et al. (2015); Dinh et al. (2017) has studied the neural-network optimization landscape, and many intriguing phenomena have been found. For example, one interesting observation Draxler et al. (2018); Garipov et al. (2018) is that there exists a connector between two optima: the loss minima are not isolated.

Observation 1 (Connectivity)

Draxler et al. (2018); Garipov et al. (2018) There exists a continuous path between minima of neural network architectures, where each point along this path has a low loss.

To find such a continuous path, e.g., from $w_A$ to $w_B$, Draxler et al. (2018) proposed a method based on the Nudged Elastic Band (NEB) Jonsson, Mills, and Jacobsen (1998) to find a smooth, low-loss nonlinear path. Further, Frankle et al. (2020) showed that two minima can be connected by a low-loss linear path in some cases.

Observation 2 (Linear Connectivity)

Frankle et al. (2020); Wortsman et al. (2021) There exists a linear connector from $w_A$ to $w_B$ when the shared initialization $w_0$ is not random but has been trained to a certain spawn epoch.

This condition is easy to satisfy: when the early optimization trajectory of $w_0$ is shared, the two optima can be connected by a linear path. With the above observations, in this paper we also connect two networks in this way.
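
To illustrate how such linear connectivity can be probed in practice, here is a hedged PyTorch sketch that evaluates accuracy at evenly spaced points on the segment between two trained copies of a network; `model_a`, `model_b`, and the `evaluate` callback are placeholder names, not from the original work.

```python
import copy
import torch

@torch.no_grad()
def interpolate_state_dicts(model_a, model_b, alpha):
    """Element-wise interpolation alpha * w_A + (1 - alpha) * w_B of two state dicts."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    return {k: alpha * v + (1.0 - alpha) * sd_b[k] if v.is_floating_point() else v
            for k, v in sd_a.items()}

def accuracy_along_path(model_a, model_b, evaluate, num_points=11):
    """Evaluate test accuracy at evenly spaced points on the linear path between two optima."""
    probe = copy.deepcopy(model_a)                 # same architecture as both endpoints
    results = []
    for i in range(num_points):
        alpha = i / (num_points - 1)
        probe.load_state_dict(interpolate_state_dicts(model_a, model_b, alpha))
        results.append((alpha, evaluate(probe)))   # evaluate: user-supplied accuracy function
    return results
```

A flat, high-accuracy curve returned by `accuracy_along_path` is exactly the linear connectivity described in Observation 2.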

Method

In this section, we first formulate the incremental learning problem. Let the sequential incremental learning tasks be denoted as $\mathcal{T}_1, \dots, \mathcal{T}_T$, where each task includes a set of disjoint classes. In the $t$-th task, we are only given the $t$-th training dataset $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{n_t}$, where $n_t$ is the number of training samples, together with the previous model $w_{t-1}$. We need to update the previous model to a new model $w_t$ such that two inherent properties are satisfied: 1) stability: the new model should retain the knowledge of previous tasks, and 2) plasticity: it should be able to learn the new knowledge of the $t$-th task.

To solve the plasticity-stability dilemma, we would need ample training data from both the previous tasks and the current task. However, by the definition of incremental learning there is no data from the previous tasks (only the previous model), which is the origin of the dilemma. Specifically, we only have the new data $\mathcal{D}_t$ with which to resume training of the network, using the previous model as initialization. An iterative method, e.g., SGD, is applied to update the network: the larger the number of iterations, the more previous knowledge is forgotten. When retraining finishes, the new model has almost forgotten the old concepts, which is known as catastrophic forgetting. Of course, we can use fewer iterations, knowledge distillation, or other approaches to remember the old knowledge, but then the ability to learn the new task is limited.

In this paper, we aim to design a better plasticity-stability trade-off. We divide continual learning into two stages. In the first stage, we train two independent neural networks that separately target plasticity and stability: one network preserves the knowledge of previous tasks, and the other learns the new knowledge of the current task. In the second stage, we combine the two networks into a single, better-balanced network using a simple linear connector, which achieves notable improvement and also provides a new perspective for studying and controlling the trade-off. In the following, we give the details of the two stages.

Figure 1: Illustration of our method. First, we separately train two networks $w_A$ and $w_B$, where $w_A$ is for remembering the previous tasks and $w_B$ is for learning the new $t$-th task. Then, we use a simple average of the two models to obtain a more balanced model $w_t = \frac{1}{2}(w_A + w_B)$.
Input: A set of sequential learning tasks $\{\mathcal{T}_t\}_{t=1}^T$ and their training datasets $\{\mathcal{D}_t\}_{t=1}^T$; a neural network and learning rate $\eta$
Train the first task to get $w_1$
# compute the null space
Use the model $w_1$ and $\mathcal{D}_1$ to obtain the feature covariance $\mathcal{X}_1$ and the null-space projection matrix $P_1$
for task $t = 2, \dots, T$ do
       # init the two networks
       Let $w_A \leftarrow w_{t-1}$ and $w_B \leftarrow w_{t-1}$
       while not converged do
             Sample a mini-batch from $\mathcal{D}_t$ and compute the gradients $g_A$ and $g_B$
             $w_A \leftarrow w_A - \eta P_{t-1} g_A$   # preserve previous knowledge
             $w_B \leftarrow w_B - \eta g_B$   # learn new knowledge
       end while
       $w_t \leftarrow \frac{1}{2}(w_A + w_B)$   # linear connector
       # compute the null space
       Use the model $w_t$, $\mathcal{D}_t$ and $\mathcal{X}_{t-1}$ to obtain the feature covariance $\mathcal{X}_t$ and the null-space projection matrix $P_t$
end for
Output: $w_T$
Algorithm 1: Linear connector for plasticity-stability trade-off

In the first stage, since preserving previous knowledge and learning new concepts are two conflicting goals, we use two independent neural networks to address them separately; each network focuses on a single goal.

Remembering knowledge of previous tasks

Catastrophic forgetting is the major challenge, since only a small number of samples, or no data at all, from the previous tasks is available when learning the new task. In this paper, we use the state-of-the-art Adam-NSCL to remember the knowledge of previous tasks. Adam-NSCL updates the weights in the null space of the feature covariance matrix of all previous data and achieves excellent performance on previous tasks.

Specifically, we use the previous model $w_{t-1}$ as the initialization of this network. The feature covariance of all previous tasks is $\mathcal{X}_{t-1}$, and the projection matrix of all previous data is $P_{t-1} = U_{t-1} U_{t-1}^\top$, where $U_{t-1}$ is the set of eigenvectors of $\mathcal{X}_{t-1}$ whose eigenvalues are zero. (Please refer to Algorithm 2 in Wang et al. (2021) for more details on obtaining the feature covariance and the projection matrix.)

At iteration $k$, we randomly sample a mini-batch from $\mathcal{D}_t$ and then calculate the gradient $g_k$. The weight is updated as

$$w^{(k+1)} = w^{(k)} - \eta P_{t-1} g_k, \qquad (3)$$

where $w^{(0)} = w_{t-1}$ and $\eta$ is the stepsize. The final optimized model is denoted as $w_A$.

This is simply SGD with a projection, which constrains optimization to the chosen directions. The strong constraint preserves the previous knowledge, but it also limits the ability to learn new tasks.
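
For clarity, the following is a minimal PyTorch sketch of the projected update in Eq. (3); the `projections` dictionary (mapping layer names to projection matrices) is a hypothetical stand-in for the per-layer null-space projections that Adam-NSCL derives from the stored feature covariances, and layers without an entry are updated without projection.

```python
import torch

@torch.no_grad()
def projected_sgd_step(model, projections, lr):
    """One SGD step where each layer's gradient is projected into the null space
    of previous tasks' features before being applied (Eq. (3))."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        grad = param.grad
        P = projections.get(name)                 # (in_dim, in_dim) projection, or None
        if P is not None:
            g = grad.reshape(grad.shape[0], -1)   # flatten to (out_features, in_dim)
            g = g @ P                             # project rows onto the null space
            grad = g.reshape(grad.shape)
        param.add_(grad, alpha=-lr)               # w <- w - lr * P g
```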

Learning new knowledge of current task

Here we update another deep network to learn the new task. If we do not consider preserving the already learnt knowledge, it is a very easy problem.

Specifically, given the $t$-th training dataset $\mathcal{D}_t$ and the previous model $w_{t-1}$ as the initialization, we can simply use SGD or Adam to learn the knowledge of the current task. At iteration $k$, the network is updated as

$$w^{(k+1)} = w^{(k)} - \eta g_k, \qquad (4)$$

where $w^{(0)} = w_{t-1}$. The optimized model is denoted as $w_B$.

Plasticity-Stability Trade-off

Now we have two neural networks, $w_A$ and $w_B$: $w_A$ is the optimum that preserves the previous knowledge, and $w_B$ is the optimum of the current task. As indicated by previous works, the two optima are not isolated; there exist paths between them along which the loss remains low.

In this paper, we find that the two networks can be connected by a linear path along which the accuracy remains high. Formally, the linear connector between $w_A$ and $w_B$ is formulated as

$$w(\alpha) = \alpha\, w_A + (1 - \alpha)\, w_B, \qquad (5)$$

for $\alpha \in [0, 1]$. To achieve a balanced model, we set $\alpha = \frac{1}{2}$, that is, we average the weights of the two networks and obtain the final model as

$$w_t = \frac{1}{2}(w_A + w_B). \qquad (6)$$

We found that this simple model averaging achieves notable improvement. Stochastic weight averaging Izmailov et al. (2018) also averages model weights, and it shows that the averaged model can converge to a wider solution with better generalization. Thus, we simply use the averaged model in this paper.
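
A minimal PyTorch sketch of the averaging step in Eq. (6) might look as follows; it assumes two trained copies of the same architecture and also averages floating-point buffers such as the batch-norm running statistics (see the implementation details below).

```python
import copy
import torch

@torch.no_grad()
def average_models(net_a, net_b):
    """Return a new model whose parameters and floating-point buffers are the
    element-wise mean of net_a and net_b, as in Eq. (6)."""
    fused = copy.deepcopy(net_a)
    sd_a, sd_b = net_a.state_dict(), net_b.state_dict()
    fused_sd = {}
    for key, value in sd_a.items():
        if value.is_floating_point():
            fused_sd[key] = 0.5 * (value + sd_b[key])
        else:
            fused_sd[key] = value            # e.g. num_batches_tracked: keep as-is
    fused.load_state_dict(fused_sd)
    return fused
```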

Linear interpolation:

According to Observation 2, to make the two networks linearly connected, we first use $w_{t-1}$ as the shared initialization of both models. The two networks are then trained in similar manners to reach their optima; the only difference lies in the projection, where one uses the null-space projection and the other uses plain SGD without projection.

The linear connector also provides a simple way to control the balance between forgetting and intransigence by changing the value of $\alpha$. If $\alpha = 1$, our method reduces to Adam-NSCL, which mainly focuses on remembering the knowledge of previous tasks. When $\alpha = 0$, it achieves the best performance on the new task but almost forgets the previous knowledge. Figure 2 shows the performance of the linear combinations with different $\alpha$.

Experimental Results

In this section, we evaluate our model on various incremental learning tasks and compare it with several state-of-the-art baselines. In addition, we conduct an ablation study on the performance of previous and current tasks for various $\alpha$ in Eq. (5), and we also evaluate our model using explicit measures of stability and plasticity.

Datasets

CIFAR-100 Krizhevsky, Hinton et al. (2009) is a dataset of 100 classes of 32×32 images; each class contains 500 images for training and 100 images for testing. TinyImageNet Wu, Zhang, and Xu (2017) contains 120,000 images of 200 classes, downsized to 64×64; each class contains 500 training images, 50 validation images and 50 test images. In this paper, the validation set of TinyImageNet is used for testing, since the labels of the test set are unavailable.

We split each dataset of $C$ classes into $T$ disjoint subsets of $C/T$ classes, so that the training samples of each task come from a disjoint subset of classes. When $T = 10$ on CIFAR-100, we get 10-split-CIFAR-100, with each task covering 10 classes. With $T = 20$ on CIFAR-100 and $T = 25$ on TinyImageNet, we get 20-split-CIFAR-100 and 25-split-TinyImageNet in the same way. In task $t$, we only have access to $\mathcal{D}_t$, and no previous data is stored.
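
As a small illustration of how such class splits can be constructed, the sketch below partitions the label set into disjoint tasks; the helper name and the consecutive-range default are assumptions made for illustration.

```python
import numpy as np

def split_classes(num_classes=100, num_tasks=10, seed=None):
    """Partition class labels into num_tasks disjoint, equally sized subsets.
    With seed=None, the split is simply consecutive label ranges (0-9, 10-19, ...)."""
    labels = np.arange(num_classes)
    if seed is not None:
        labels = np.random.default_rng(seed).permutation(labels)
    return np.split(labels, num_tasks)

tasks = split_classes(100, 10)   # 10-split-CIFAR-100: ten tasks of ten classes each
```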

Implementation Detail

To make a fair comparison, we follow the experimental settings of Adam-NSCL Wang et al. (2021). Specifically, we use ResNet-18 as the backbone network, and each task has its own single-layer linear classifier. When training on a new task, we only update the backbone network and the classifier of the new task, while the classifiers of previous tasks remain unchanged. We use the Adam optimizer, with one initial learning rate for the first task and another for both $w_A$ and $w_B$ in the subsequent tasks. The total number of epochs is 80, and the learning rate is halved at epoch 30 and epoch 60. The batch size is set to 32 for 20-split-CIFAR-100 and to 16 for the other two datasets. For parameters that cannot be updated by gradient descent, e.g., the running statistics of the batch normalization layers, we also average them as in Eq. (6).

Evaluation Protocol

We use Average Accuracy (ACC) to measure how the model performs on all tasks. Let the number of tasks be $T$. After finishing training from task $1$ to task $T$, the accuracy of the final model on the test set of task $i$ is denoted as $a_{T,i}$. ACC is calculated as

$$\mathrm{ACC} = \frac{1}{T} \sum_{i=1}^{T} a_{T,i}, \qquad (7)$$

where $T$ is the total number of tasks. The larger ACC is, the better the model performs. Since ACC is the average accuracy over all tasks, the balance between tasks must be taken into account.

We use Backward Transfer (BWT) Lopez-Paz and Ranzato (2017) to measure how much the model forgets in the continual-learning process. BWT is defined as

$$\mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( a_{T,i} - a_{i,i} \right). \qquad (8)$$

It indicates the average accuracy drop over all previous tasks. The larger BWT is, the less the model forgets. In this paper, we aim at a more balanced model, so ACC and BWT should be considered together. Given the two measures, we first look at ACC: a larger ACC is better. When two methods have the same ACC, we use BWT to see how they trade off stability and plasticity: a smaller (more negative) BWT means the method is good at learning new knowledge but forgets more, while a larger BWT means it forgets less but learns less on new tasks.
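
For reference, here is a small sketch of how ACC and BWT in Eqs. (7) and (8) can be computed from a matrix of per-task test accuracies; the matrix layout is an assumption made for illustration.

```python
import numpy as np

def acc_bwt(acc_matrix):
    """Compute ACC (Eq. 7) and BWT (Eq. 8) from a T x T accuracy matrix,
    where acc_matrix[j, i] is the test accuracy on task i after training
    up to task j (tasks indexed from 0)."""
    acc_matrix = np.asarray(acc_matrix, dtype=float)
    T = acc_matrix.shape[0]
    ACC = acc_matrix[T - 1].mean()                                            # Eq. (7)
    BWT = np.mean(acc_matrix[T - 1, :T - 1] - np.diag(acc_matrix)[:T - 1])    # Eq. (8)
    return ACC, BWT
```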

Results

In this set of experiments, we compare our method with several state-of-the-art baselines: EWC Kirkpatrick et al. (2017), MAS Aljundi et al. (2018), MUC-MAS Liu et al. (2020), SI Zenke, Poole, and Ganguli (2017), LwF Li and Hoiem (2018), InstAParam Chen et al. (2020), GD-WILD Lee et al. (2019), GEM Lopez-Paz and Ranzato (2017), A-GEM Chaudhry et al. (2018), MEGA Guo et al. (2020), OWM Zeng et al. (2019) and Adam-NSCL Wang et al. (2021). All methods use ResNet-18 as the backbone network for a fair comparison.

Methods ACC(%) BWT(%)
EWC 70.77 -2.83
MAS 66.93 -4.03
MUC-MAS 63.73 -3.38
SI 60.57 -5.17
LwF 70.70 -6.27
InstAParam 47.84 -11.92
GD-WILD 71.27 -18.24
GEM 49.48 2.77
A-GEM 49.57 -1.13
MEGA 54.17 -2.19
OWM 68.89 -1.88
Adam-NSCL 73.77 -1.6
Ours 76.83 -4.27
Table 1: Results on 10-split-CIFAR-100. Please note that a larger value of ACC is better. For a balanced model, a moderate value of BWT is better.
Methods ACC(%) BWT(%)
EWC 71.66 -3.72
MAS 63.84 -6.29
MUC-MAS 67.22 -5.72
SI 59.76 -8.62
LwF 74.38 -9.11
InstAParam 51.04 -4.92
GD-WILD 77.16 -14.85
GEM 68.89 -1.2
A-GEM 61.91 -6.88
MEGA 64.98 -5.13
OWM 68.47 -3.37
Adam-NSCL 75.95 -3.66
Ours 78.21 -8.01
Table 2: Results on 20-split-CIFAR-100. A larger value of ACC is better and a moderate value of BWT is better for balanced model.
Methods ACC(%) BWT(%)
EWC 52.33 -6.17
MAS 47.96 -7.04
MUC-MAS 41.18 -4.03
SI 45.27 -4.45
LwF 56.57 -11.19
InstAParam 34.64 -10.05
GD-WILD 42.74 -34.58
A-GEM 53.32 -7.68
MEGA 57.12 -5.90
OWM 49.98 -3.64
Adam-NSCL 58.28 -6.05
Ours 61.05 -9.95
Table 3: Results on 25-split-TinyImageNet.
Figure 2: Accuracy of $w(\alpha)$ for different $\alpha$ after tasks $t=2$ (left), $t=3$ (middle) and $t=4$ (right) on 10-split-CIFAR-100
Figure 3: Accuracy of $w(\alpha)$ for different $\alpha$ after tasks $t=2$ (left), $t=3$ (middle) and $t=4$ (right) on 20-split-CIFAR-100
Figure 4: Accuracy of $w(\alpha)$ for different $\alpha$ after tasks $t=2$ (left), $t=3$ (middle) and $t=4$ (right) on 25-split-TinyImageNet

Table 1, Table 2 and Table 3 show the comparison results. Our method achieves significant improvement in terms of ACC on all three datasets, and the BWT and ACC results together indicate that it achieves a better plasticity-stability trade-off. Detailed analysis is as follows.

10-split-CIFAR-100 The results are shown in Table 1. We can see that our model achieves the best ACC of 76.83%, which is 3.06% higher than the second-best model, Adam-NSCL. The algorithm-based methods, e.g., A-GEM, OWM and Adam-NSCL, have larger BWT values; they tend to remember old information. GD-WILD and InstAParam have smaller BWT values, and they tend to learn new knowledge. Compared to these baselines, the BWT value of our model is -4.27%, which is a more balanced value. It indicates that our model obtains a meaningful balance between previous tasks and the new task.

20-split-CIFAR-100 As shown in Table 2, our model still achieves the best ACC of 78.21%, which is 1.05% better than the second-best model, GD-WILD. Note that GD-WILD stores previous data, and its BWT value is 6.84% worse than ours. Again, our model achieves a relatively balanced BWT value of -8.01%.

25-split-TinyImageNet The results in Table 3 show that our method achieves the best ACC of 61.05%, while the ACC of the second-best model, Adam-NSCL, is 58.28%. Even though Adam-NSCL achieves excellent performance, our method still performs better. The BWT and ACC values indicate that our method not only achieves better performance but also obtains a more balanced model.

In summary, two observations can be made from the results: (1) our method yields the best performance on all datasets; (2) our method achieves a better trade-off between stability and plasticity. Note that the performance (ACC) of an IL model can be decomposed into two parts, stability (BWT) and plasticity; hence, knowing ACC and BWT, we can roughly infer the plasticity. We discuss this further in the following subsections.

Ablation Study

In this set of experiments, we conduct an ablation study on the three benchmark datasets to see the effect of $\alpha$. As indicated by Eq. (5), $\alpha$ controls the mixing ratio of the two independent networks.

For demonstration purposes, we only show three sequential learning tasks; the results of the other tasks are similar. To be specific, when $t = 2$, $\mathcal{T}_1$ is the previous task and $\mathcal{T}_2$ is the current task. The test accuracies on tasks $\mathcal{T}_1$ and $\mathcal{T}_2$ for different values of $\alpha$ are shown in the left panels of Figure 2, Figure 3 and Figure 4. The accuracy on $\mathcal{T}_1$ indicates the ability to preserve old knowledge, and the accuracy on $\mathcal{T}_2$ indicates the ability to learn the new task.

When $t = 3$, the test accuracies of $w(\alpha)$ on tasks $\mathcal{T}_1$, $\mathcal{T}_2$ and $\mathcal{T}_3$ are shown in the middle panels of Figure 2, Figure 3 and Figure 4. Please note that $\mathcal{T}_1$ and $\mathcal{T}_2$ are previous tasks and $\mathcal{T}_3$ is the new task.

When $t = 4$, the test accuracies on $\mathcal{T}_1$, $\mathcal{T}_2$, $\mathcal{T}_3$ and $\mathcal{T}_4$ are shown in the right panels of Figure 2, Figure 3 and Figure 4. Here, $\mathcal{T}_1$, $\mathcal{T}_2$ and $\mathcal{T}_3$ are previous tasks, and $\mathcal{T}_4$ is the new task. Please note that we only have the training samples from $\mathcal{T}_4$.

From Figure 2, Figure 3 and Figure 4, we can see that: 1) the linear paths between $w_A$ and $w_B$ are almost smooth, with no obvious jumps along the paths; for example, in the left panel of Figure 2 the accuracies change gradually as $\alpha$ increases. 2) For 10-split-CIFAR-100 and 20-split-CIFAR-100, the model strikes a good balance on all tasks when $\alpha$ is close to $\frac{1}{2}$. For 25-split-TinyImageNet, although the fused model does not perform best on every task, $\alpha = \frac{1}{2}$ is still the best compromise.

Plasticity-Stability Trade-off Analysis

To better understand our method, we compare it with Adam-NSCL to analyse the plasticity-stability trade-off.

We use BWT as the evaluation measure of stability. Further, we use the Intransigence Measure (IM) Chaudhry et al. (2018) to measure plasticity, which reflects how well the model has learnt the new task. The intransigence for the $k$-th task is calculated as

$$I_k = a_k^{*} - a_{k,k}, \qquad (9)$$

where $a_k^{*}$ denotes the accuracy on the test set of the $k$-th task of a reference model trained with access to all the data seen so far, and $a_{k,k}$ is the accuracy of the incrementally trained model on task $k$ after learning task $k$. The smaller $I_k$ is, the better the model is.
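
A corresponding sketch for the intransigence measure in Eq. (9); the reference accuracies of a jointly trained model are assumed to be available and are not part of the incremental training itself.

```python
import numpy as np

def intransigence(acc_matrix, reference_acc):
    """Compute I_k (Eq. 9) for each task k: reference accuracy minus the accuracy
    on task k right after learning task k (the diagonal of the accuracy matrix)."""
    acc_matrix = np.asarray(acc_matrix, dtype=float)
    reference_acc = np.asarray(reference_acc, dtype=float)
    return reference_acc - np.diag(acc_matrix)    # smaller is better
```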

Table 4, Table 5 and Table 6 show the results of BWT and IM. First, the BWT values of our model are smaller than those of Adam-NSCL, which means that Adam-NSCL has a stronger ability to remember previous knowledge. Second, the IM values of our model are much better than those of Adam-NSCL: the null-space projection of Adam-NSCL degrades performance on the new task. Our method considers both stability and plasticity, and the overall effect yields a higher ACC.

Methods ACC(%) BWT(%) IM(%)
Adam-NSCL 73.77 -1.6 14.50
Ours 76.83 -4.27 7.70
Table 4: BWT and IM on 10-split-CIFAR-100
Methods ACC(%) BWT(%) IM(%)
Adam-NSCL 75.95 -3.66 12.60
Ours 78.21 -8.01 6.40
Table 5: BWT and IM on 20-split-CIFAR-100
Methods ACC(%) BWT(%) IM(%)
Adam-NSCL 58.28 -6.05 10.50
Ours 61.05 -9.95 5.25
Table 6: BWT and IM on 25-split-TinyImageNet

Conclusion

In this paper, we proposed a simple linear connector for incremental learning. The key to its success is a better plasticity-stability trade-off. First, we trained two independent neural networks: one aims to preserve the previous knowledge, and the other learns the new knowledge. We used null-space-projected SGD to learn the first network and plain SGD for the second. Finally, we simply averaged the two networks and achieved a significant improvement. In future work, we aim to find a better way to combine the two networks and to give a theoretical explanation.

References

  • Ahn et al. (2019) Ahn, H.; Cha, S.; Lee, D.; and Moon, T. 2019. Uncertainty-based Continual Learning with Adaptive Regularization. In Advances in Neural Information Processing Systems, volume 32, 4392–4402.
  • Aljundi et al. (2018) Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; and Tuytelaars, T. 2018. Memory Aware Synapses: Learning What (not) to Forget. In Proceedings of the European Conference on Computer Vision (ECCV), 144–161.
  • Chaudhry et al. (2018) Chaudhry, A.; Dokania, P. K.; Ajanthan, T.; and Torr, P. H. 2018. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), 532–547.
  • Chaudhry et al. (2018) Chaudhry, A.; Ranzato, M.; Rohrbach, M.; and Elhoseiny, M. 2018. Efficient Lifelong Learning with A-GEM. In International Conference on Learning Representations.
  • Chen et al. (2020) Chen, H.-J.; Cheng, A.-C.; Juan, D.-C.; Wei, W.; and Sun, M. 2020. Mitigating forgetting in online continual learning via instance-aware parameterization. Advances in Neural Information Processing Systems, 33: 17466–17477.
  • Choromanska et al. (2015) Choromanska, A.; Henaff, M.; Mathieu, M.; Arous, G. B.; and LeCun, Y. 2015. The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, 192–204.
  • Delange et al. (2021) Delange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; and Tuytelaars, T. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
  • Dinh et al. (2017) Dinh, L.; Pascanu, R.; Bengio, S.; and Bengio, Y. 2017. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 1019–1028.
  • Douillard et al. (2020) Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; and Valle, E. 2020. PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning. In European Conference on Computer Vision, 86–102.
  • Draxler et al. (2018) Draxler, F.; Veschgini, K.; Salmhofer, M.; and Hamprecht, F. A. 2018. Essentially No Barriers in Neural Network Energy Landscape. In International Conference on Machine Learning, 1308–1317.
  • Ebrahimi et al. (2020) Ebrahimi, S.; Elhoseiny, M.; Darrell, T.; and Rohrbach, M. 2020. Uncertainty-guided Continual Learning with Bayesian Neural Networks. In ICLR 2020 : Eighth International Conference on Learning Representations.
  • Fort and Jastrzebski (2019) Fort, S.; and Jastrzebski, S. 2019. Large Scale Structure of Neural Network Loss Landscapes. In Advances in Neural Information Processing Systems, volume 32, 6706–6714.
  • Frankle et al. (2020) Frankle, J.; Dziugaite, G. K.; Roy, D.; and Carbin, M. 2020. Linear Mode Connectivity and the Lottery Ticket Hypothesis. In ICML 2020: 37th International Conference on Machine Learning, volume 1, 3259–3269.
  • Garipov et al. (2018) Garipov, T.; Izmailov, P.; Podoprikhin, D.; Vetrov, D. P.; and Wilson, A. G. 2018. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. In 32nd Conference on Neural Information Processing Systems, NeurIPS 2018, volume 31, 8789–8798.
  • Gepperth and Karaoguz (2016) Gepperth, A.; and Karaoguz, C. 2016. A Bio-Inspired Incremental Learning Architecture for Applied Perceptual Problems. Cognitive Computation, 8(5): 924–934.
  • Guo et al. (2020) Guo, Y.; Liu, M.; Yang, T.; and Rosing, T. 2020. Improved Schemes for Episodic Memory based Lifelong Learning Algorithm. In Conference on Neural Information Processing Systems.
  • Isele and Cosgun (2018) Isele, D.; and Cosgun, A. 2018. Selective experience replay for lifelong learning. In AAAI Conference on Artificial Intelligence 2018, 3302–3309.
  • Izmailov et al. (2018) Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D. P.; and Wilson, A. G. 2018. Averaging Weights Leads to Wider Optima and Better Generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 876–885.
  • Jonsson, Mills, and Jacobsen (1998) Jonsson, H.; Mills, G.; and Jacobsen, K. W. 1998. Nudged elastic band method for finding minimum energy paths of transitions. In Classical and Quantum Dynamics in Condensed Phase Simulations, volume 385, 385–404.
  • Kirkpatrick et al. (2017) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N. C.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13): 3521–3526.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
  • Lee et al. (2019) Lee, K.; Lee, K.; Shin, J.; and Lee, H. 2019. Overcoming Catastrophic Forgetting With Unlabeled Data in the Wild. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 312–321.
  • Li et al. (2019) Li, X.; Zhou, Y.; Wu, T.; Socher, R.; and Xiong, C. 2019. Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting. In International Conference on Machine Learning, 3925–3934.
  • Li and Hoiem (2018) Li, Z.; and Hoiem, D. 2018. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12): 2935–2947.
  • Liu et al. (2020) Liu, Y.; Parisot, S.; Slabaugh, G.; Jia, X.; Leonardis, A.; and Tuytelaars, T. 2020. More classifiers, less forgetting: A generic multi-classifier paradigm for incremental learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, 699–716. Springer.
  • Lopez-Paz and Ranzato (2017) Lopez-Paz, D.; and Ranzato, M. 2017. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems, volume 30, 6467–6476.
  • Mermillod, Bugaiska, and Bonin (2013) Mermillod, M.; Bugaiska, A.; and Bonin, P. 2013. The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology, 4: 504–504.
  • Mundt et al. (2020) Mundt, M.; Hong, Y. W.; Pliushch, I.; and Ramesh, V. 2020. A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. arXiv preprint arXiv:2009.01797.
  • Rannen et al. (2017) Rannen, A.; Aljundi, R.; Blaschko, M. B.; and Tuytelaars, T. 2017. Encoder Based Lifelong Learning. In 2017 IEEE International Conference on Computer Vision (ICCV), 1329–1337.
  • Rebuffi et al. (2017) Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; and Lampert, C. H. 2017. iCaRL: Incremental Classifier and Representation Learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5533–5542.
  • Serrà et al. (2018) Serrà, J.; Surís, D.; Miron, M.; and Karatzoglou, A. 2018. Overcoming catastrophic forgetting with hard attention to the task. In The 35th International Conference on Machine Learning (ICML 2018), 4548–4557.
  • Shin et al. (2017) Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual Learning with Deep Generative Replay. In Advances in Neural Information Processing Systems, volume 30, 2990–2999.
  • Wang et al. (2021) Wang, S.; Li, X.; Sun, J.; and Xu, Z. 2021. Training Networks in Null Space of Feature Covariance for Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 184–193.
  • Wortsman et al. (2021) Wortsman, M.; Horton, M.; Guestrin, C.; Farhadi, A.; and Rastegari, M. 2021. Learning Neural Network Subspaces. In ICML 2021: 38th International Conference on Machine Learning.
  • Wu, Zhang, and Xu (2017) Wu, J.; Zhang, Q.; and Xu, G. 2017. Tiny imagenet challenge. Technical Report.
  • Xiang et al. (2019) Xiang, Y.; Fu, Y.; Ji, P.; and Huang, H. 2019. Incremental Learning Using Conditional Adversarial Networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 6619–6628.
  • Yan, Xie, and He (2021) Yan, S.; Xie, J.; and He, X. 2021. DER: Dynamically Expandable Representation for Class Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3014–3023.
  • Yoon et al. (2017) Yoon, J.; Yang, E.; Lee, J.; and ju Hwang, S. 2017. Lifelong Learning with Dynamically Expandable Networks. In Sixth International Conference on Learning Representations.
  • Zeng et al. (2019) Zeng, G.; Chen, Y.; Cui, B.; and Yu, S. 2019. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8): 364–372.
  • Zenke, Poole, and Ganguli (2017) Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual Learning Through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, volume 70, 3987–3995.