Memory Efficient Class-Incremental Learning for Image Classification

08/04/2020, by Hanbin Zhao et al., Zhejiang University

Under memory-resource-limited constraints, class-incremental learning (CIL) usually suffers from the "catastrophic forgetting" problem when updating the joint classification model upon the arrival of newly added classes. To cope with the forgetting problem, many CIL methods transfer the knowledge of old classes by preserving some exemplar samples in a size-constrained memory buffer. To utilize the memory buffer more efficiently, we propose to keep more auxiliary low-fidelity exemplar samples rather than the original real high-fidelity exemplar samples. Such a memory-efficient exemplar preserving scheme makes old-class knowledge transfer more effective. However, the low-fidelity exemplar samples are often distributed in a different domain from the original exemplar samples, that is, there is a domain shift. To alleviate this problem, we propose a duplet learning scheme that seeks to construct domain-compatible feature extractors and classifiers, which greatly narrows the above domain gap. As a result, these low-fidelity auxiliary exemplar samples can moderately replace the original exemplar samples at a lower memory cost. In addition, we present a robust classifier adaptation scheme, which further refines the biased classifier (learned with samples containing distillation label knowledge about old classes) with the help of samples with pure true class labels. Experimental results demonstrate the effectiveness of this work against the state-of-the-art approaches. We will release the code, baselines, and training statistics for all models to facilitate future research.


I Introduction

Recent years have witnessed great progress in incremental learning [64, 2, 5, 48, 1, 43, 10, 68, 17, 18, 30, 12, 67, 15, 45, 58, 65, 42, 7], which has a wide range of real-world applications owing to its capability of continual model learning. To handle a sequential data stream with time-varying new classes, class-incremental learning [49] has emerged as a technique for the resource-constrained classification problem, which dynamically updates the model with the new-class samples as well as a tiny portion of old-class information (stored in a limited memory buffer). In general, class-incremental learning aims to set up a joint classification model simultaneously covering the information from both new and old classes, and usually faces the forgetting problem [13, 14, 41, 46, 50, 35, 53, 31, 27, 9] as new-class samples dominate the training data. To address the forgetting problem, class-incremental learning approaches typically concentrate on two aspects: 1) how to efficiently utilize the limited memory buffer (e.g., select representative exemplar samples from old classes); and 2) how to effectively attach old-class information to the new-class samples (e.g., transfer features from old classes to new classes, or attach a distillation label [4] to each sample using an old-class teacher model [33]). Accordingly, we focus on effective class knowledge transfer and robust classifier updating for class-incremental learning within a limited memory buffer.

Fig. 1: Illustration of resource-constrained class-incremental learning. The model is first trained on the initially available data, and part of those data is then stored in a limited memory. When new data arrive, the samples in the memory are retrieved and used together with the new data to train the network so that it can correctly identify all the classes it has seen.

As for class knowledge transfer, a typical way is to preserve some exemplar samples in a memory buffer whose size is constrained in practice. To maintain a low memory cost of classification, existing approaches [4, 49] usually resort to reducing the number of exemplar samples from old classes, resulting in a drop in learning performance. Motivated by this observation, we attempt to enhance the learning performance with a fixed memory buffer by increasing the number of exemplar samples while moderately reducing their fidelity. Our goal is to build a memory-efficient class-incremental learning scheme with low-fidelity exemplars. However, normal exemplar-based class-incremental learning schemes [49, 4] do not work well with low-fidelity exemplars, because there exists a domain gap between the original exemplar samples and their corresponding low-fidelity versions (with smaller memory sizes). Thus, a specific learning scheme must be designed to update the model while reducing the influence of domain shift. In our duplet learning scheme, when facing the samples of new classes, the low-fidelity exemplar samples are treated as auxiliary samples, resulting in a set of duplet sample pairs in the form of original samples and their corresponding auxiliary samples. Based on such duplet sample pairs, we construct a duplet-driven deep learner that aims to build domain-compatible feature extractors and classifiers to alleviate the domain shift problem. With such a domain-compatible learning scheme, the low-fidelity auxiliary samples are capable of moderately replacing the original high-fidelity samples, allowing more exemplar samples in the fixed memory buffer and thus better learning performance.

After that, the duplet-driven deep learner is carried out over the new-class samples to generate their corresponding distillation label information of old classes, which makes the new-class samples inherit the knowledge of old classes. In this way, the label information on each new-class sample is composed of both distillation labels of old classes and true new-class labels. Hence, the overall classifier is incrementally updated with these two kinds of label information. Since the distillation label information is noisy, the classifier still has a small bias. Therefore, we propose a classifier adaptation scheme to correct the classifier. Specifically, we fix the feature extractor learned with knowledge distillation, and then adapt the classifier over samples with true class labels only (without any distillation label information). The corrected classifier is thus more robust.

In summary, the main contributions of this work are three-fold. First, we propose a novel memory-efficient duplet-driven scheme for resource-constrained class-incremental learning, which innovatively utilizes low-fidelity auxiliary samples for old-class knowledge transfer instead of the original real samples. With more exemplar samples in the limited memory buffer, the proposed learning scheme is capable of learning domain-compatible feature extractors and classifiers, which greatly reduces the influence of the domain gap between the auxiliary data domain and the original data domain. Second, we present a classifier adaptation scheme, which refines the overall biased classifier (after distilling the old-class knowledge into the model) by using pure true class labels for the samples while keeping the feature extractors fixed. Third, extensive experiments over benchmark datasets demonstrate the effectiveness of this work against the state-of-the-art approaches.

The rest of the paper is organized as follows. We first describe the related work in Section II, and then explain the details of our proposed strategy in Section III. In Section IV, we report the experiments that we conducted and discuss their results. Finally, we draw a conclusion and describe future work in Section V.

II Related Work

Recently, there has been a large body of research on incremental learning with deep models [8, 63, 22, 44, 56, 11, 36, 34, 16]. These works can be roughly divided into three (partially overlapping) categories of common incremental learning strategies.

II-A Rehearsal strategies

Rehearsal strategies [6, 49, 4, 3, 62, 23, 60] periodically replay past knowledge to the model using a limited memory buffer, in order to strengthen connections to previously learned memories. Selecting and preserving some exemplar samples of past classes in the size-constrained memory is one strategy to keep the old-class knowledge. A more challenging approach is pseudo-rehearsal with generative models. Some generative replay strategies [19, 54, 29, 61, 57] attempt to keep the domain knowledge of old data with a generative model; however, using only generated samples does not give competitive results.

II-B Regularization strategies

Regularization strategies extend the loss function with loss terms that enable the updated weights to retain past memories. The work in [33] preserves the model accuracy on old classes by encouraging the updated model to reproduce the scores of old classes for each image through a knowledge distillation loss. The strategy in [55] applies the knowledge distillation loss to incremental learning of object detectors. Other strategies [26, 66, 52, 5] use a weighted quadratic regularization loss to penalize moving important weights used for old tasks.

II-C Architectural strategies

Architectural strategies [25, 51, 37, 40, 39, 47, 32, 24] mitigate forgetting by fixing parts of the model's architecture (e.g., layers, activation functions, parameters). PNN [51] combines parameter freezing and network expansion, and CWR [37] builds on PNN with a fixed number of shared parameters.

Fig. 2: Visualization of the low-fidelity auxiliary samples and the corresponding real samples with two different feature extractors by t-SNE. (a): The feature extractor is updated incrementally on the auxiliary data without our duplet learning scheme, and we notice that there is a large gap between the auxiliary data and real data; (b): The feature extractor is updated incrementally on the auxiliary data with our duplet learning scheme, and we observe that the domain gap is reduced.

Our work belongs to the first and second categories. We focus on effective past-class knowledge transfer and robust classifier updating within a limited memory buffer. For effective class knowledge transfer with more exemplar samples, we design a memory-efficient exemplar preserving scheme and a duplet learning scheme that utilizes low-fidelity exemplar samples for knowledge transfer, instead of directly utilizing the original real samples. Moreover, the distillation label information of old classes attached to new-class samples is usually noisy. Motivated by this observation, we further refine the biased classifier with a classifier adaptation scheme.

Notation Definition
The sample set of class
Auxiliary form of the sample set of class
The added data of new classes at the -th learning session
The deep image classification model at the -th learning session
The parameters of
The feature extractor of
The classifier of
The auxiliary exemplar samples of old class stored at the -th learning session
The auxiliary exemplar samples of old classes preserved at previous learning sessions
The mapping function of the encoder
The mapping function of the decoder
TABLE I: Main notations and symbols used throughout the paper.

III Method

III-A Problem Definition

Before presenting our method, we first provide an illustration of the main notations and symbols used hereinafter (as shown in Table I) for a better understanding.

Class-incremental learning assumes that samples from one new class, or a batch of new classes, arrive at a time. For simplicity, we suppose that the sample sets in a data stream arrive in order (i.e. ), and the sample set contains the samples of class (). We consider the time interval from the arrival of the current batch of classes to the arrival of the next batch of classes as a class-incremental learning session [19, 25]. The batch of new-class data added at the -th () learning session is represented as:

(1)

In an incremental learning environment with a limited memory buffer, previous samples of old classes cannot be stored entirely; only a small number of exemplars are selected from them and preserved in memory for old-class knowledge transfer [4, 49]. The memory buffer is dynamically updated at each learning session.

At the -th session, after obtaining the new-class samples, we access the memory to extract the exemplar samples carrying old-class information. Let denote the set of exemplar samples extracted from the memory at the -th session for the old class :

(2)

where is the number of exemplar samples and is the corresponding ground truth label. is the first samples selected from the sorted list of samples of class by herding [59]. And then can be rewritten as:

(3)
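As a concrete illustration of the herding-based exemplar selection referenced above, the sketch below (Python/NumPy) greedily picks, for one class, the samples whose running feature mean best approximates the class mean, as in iCaRL [49]. The function and argument names are ours, and the features are assumed to be extracted by the current feature extractor; this is a sketch rather than the authors' implementation.

import numpy as np

def herding_selection(features, m):
    # features: (n, d) array of feature vectors for one class;
    # m: number of exemplars to keep for this class.
    class_mean = features.mean(axis=0)
    selected = []
    running_sum = np.zeros_like(class_mean)
    for k in range(1, m + 1):
        # running mean that would result from adding each candidate next
        candidate_means = (running_sum[None, :] + features) / k
        dists = np.linalg.norm(candidate_means - class_mean[None, :], axis=1)
        dists[selected] = np.inf          # never pick the same sample twice
        idx = int(np.argmin(dists))
        selected.append(idx)
        running_sum += features[idx]
    return selected                       # indices sorted by herding priority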

The objective is to train a new model which has competitive classification performance on the test set of all the seen classes. represents the deep image classification model at the -th learning session and the parameters of the model are denoted as . The output of is defined as:

(4)

is usually composed of a feature extractor and a classifier . After obtaining the model , the memory buffer is updated and is constructed with the exemplars in and a subset of .

Fig. 3: Illustration of the process of class-incremental learning with our duplet learning scheme and classifier adaptation scheme. We use and () to represent the feature extractor and the classifier, respectively, at the -th learning session. For initialization, is trained from scratch with the set of duplet sample pairs ; is then constructed and stored in the memory in the form of . At the -th learning session, we first train a new feature extractor and a biased classifier on all the seen classes with our duplet learning scheme, and then construct the exemplar auxiliary samples for all the seen classes. Finally, we update the classifier with the classifier adaptation scheme on .

III-B Memory Efficient Class-incremental Learning

Storing more exemplars in the memory is helpful for knowledge transfer [4]. For effective knowledge transfer, we therefore propose a memory-efficient class-incremental learning scheme, which utilizes a larger number of low-fidelity auxiliary exemplar samples to approximately replace the original real exemplar samples.

We present an encoder-decoder structure to transform the original high-fidelity real sample into the corresponding low-fidelity sample (with a smaller memory size):

(5)

where is the mapping function of the encoder and is the mapping function of the decoder. Owing to the loss of fidelity, the auxiliary sample code can be kept at a smaller memory cost than storing the corresponding real sample. We use to represent the memory cost ratio of keeping a low-fidelity sample versus keeping the corresponding high-fidelity real sample ().
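A minimal sketch of one possible encoder-decoder instantiation, the downsampling/upsampling variant used later in the experiments, is given below (PyTorch). The downsampling factor of 2, the bilinear mode, and the function names are illustrative assumptions rather than the paper's exact configuration; the PCA-based variant plays the same role with different encoder and decoder mappings.

import torch.nn.functional as F

def encode(x, factor=2):
    # Encoder mapping: store only a low-resolution code of the real sample.
    # x: (N, C, H, W) image batch.
    return F.interpolate(x, scale_factor=1.0 / factor, mode='bilinear',
                         align_corners=False)

def decode(code, size):
    # Decoder mapping: upsample the stored code back to the input resolution,
    # producing the low-fidelity auxiliary sample that is fed to the network.
    return F.interpolate(code, size=size, mode='bilinear', align_corners=False)

# With a spatial downsampling factor of 2, each auxiliary code costs roughly
# 1/4 of a real sample, so the same buffer can hold about 4x as many exemplars.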

Let denote the set of exemplar auxiliary samples extracted from the memory at the -th session for the old class :

(6)

where is the number of exemplar auxiliary samples. With the fixed size memory buffer, is larger than . We use to represent all the exemplar auxiliary samples extracted from :

(7)

Then, in our memory-efficient class-incremental learning, can be represented as:

(8)

where represents the corresponding real samples of .

At the -th session, we use the exemplars and the newly added data to train the model . More exemplars from are helpful for learning the model. However, the domain gap between the auxiliary samples and their original versions (as shown in Figure 2(a)) often leads to poor performance. To fix this issue, we propose a duplet class-incremental learning scheme in the following subsection.

III-C Duplet Class-incremental Learning Scheme

To reduce the influence of domain shift, we propose a duplet learning scheme, which trains the model using a set of duplet sample pairs in the form of original samples and their corresponding auxiliary samples.

A duplet sample pair is constructed from an auxiliary sample and the corresponding real sample . At the -th learning session (as shown in Figure 3), we construct a set of duplet sample pairs with and , denoted as:

(9)

We train the model with the duplet sample pairs of and the exemplar auxiliary samples of old classes by optimizing the objective function :

(10)

where is the loss term for and is the loss term for .

: For preserving the model performance on old-class auxiliary samples, is defined as:

(11)

where is composed of a classification loss term and a knowledge distillation loss term for one sample, which is defined as:

(12)

The classification loss for one training sample on newly added classes is formulated as:

(13)

where is an indicator function, denoted as:

(14)

is a cross-entropy function, represented as:

(15)

The knowledge distillation loss is defined as:

(16)

Its aim is to make the output of the model close to the distillation class label () produced by the previously learned model on the old classes.

: For the new-class duplet sample pairs, encourages the output of the model on the real sample to be similar to that on the corresponding auxiliary sample, and is defined as:

(17)

In general, we can obtain a domain-compatible feature extractor and a classifier over all the seen classes by optimizing the loss function in Equation (10). The domain gap in Figure 2(a) is greatly reduced by our duplet learning scheme, as shown in Figure 2(b).
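To make the overall objective concrete, the following PyTorch sketch assembles one plausible form of Equation (10): a classification term on the true labels, a distillation term that keeps the old-class outputs close to the previous model's outputs, and a duplet term that pulls the outputs on each real/auxiliary pair together. The temperature, the loss weighting, the use of KL divergence and MSE, and all argument names are our assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def duplet_objective(model, old_model, x_real, x_aux, y_new,
                     x_old_aux, y_old, n_old, T=2.0, lam=1.0):
    # Low-fidelity inputs seen by the classifier: new-class auxiliary samples
    # plus the auxiliary exemplars of old classes stored in memory.
    x_all = torch.cat([x_aux, x_old_aux])
    y_all = torch.cat([y_new, y_old])
    logits = model(x_all)

    # Classification loss on the true class labels (cf. Eqs. (13)-(15)).
    l_cls = F.cross_entropy(logits, y_all)

    # Distillation loss on the old-class outputs (cf. Eq. (16)).
    with torch.no_grad():
        teacher = old_model(x_all)[:, :n_old]
    l_dist = F.kl_div(F.log_softmax(logits[:, :n_old] / T, dim=1),
                      F.softmax(teacher / T, dim=1), reduction='batchmean')

    # Duplet loss: outputs on a real sample and its auxiliary counterpart
    # should agree (cf. Eq. (17)).
    l_dup = F.mse_loss(model(x_real), model(x_aux))

    return l_cls + l_dist + lam * l_dup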

III-D Classifier Adaptation

Our duplet-driven deep learner is carried out over the new-class samples to generate their corresponding distillation label information of old classes through the distillation loss (as defined in Equation (16)), which makes the new-class samples inherit the knowledge of old classes. Since the distillation label knowledge is noisy, we propose a classifier adaptation scheme to refine the classifier over samples with true class labels only (without any distillation label information).

Taking the -th learning session as an example, we can obtain a domain-compatible but biased classifier by optimizing the objective function defined in Equation (10). Here, we fix the learned feature extractor and further optimize the parameters of the classifier using only the true class label knowledge. The optimization uses exclusively the auxiliary samples that are going to be stored in memory . The objective function is formulated as below:

(18)

By minimizing the objective function, the classifier is refined and has better performance for all the seen classes.
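A minimal sketch of this adaptation step is given below (PyTorch): the feature extractor is frozen and only the classifier is refined with a plain cross-entropy loss on the stored auxiliary exemplars and their true labels, in the spirit of Equation (18). The optimizer settings, loop structure, and names are illustrative assumptions.

import torch
import torch.nn.functional as F

def adapt_classifier(feature_extractor, classifier, exemplar_loader,
                     epochs=30, lr=0.01):
    # Freeze the feature extractor learned by the duplet scheme.
    for p in feature_extractor.parameters():
        p.requires_grad_(False)
    feature_extractor.eval()

    opt = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x_aux, y in exemplar_loader:   # auxiliary exemplars, true labels only
            with torch.no_grad():
                feats = feature_extractor(x_aux)
            loss = F.cross_entropy(classifier(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier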

Input: The added data of new classes
Require: The exemplar auxiliary samples and the parameters
1 Obtain the auxiliary samples for new classes from using Equation (5);
2 Initialize , with and respectively;
/* The duplet learning scheme */
3 Obtain the optimal parameters of the feature extractor and a biased classifier by minimizing Equation (10);
4 Obtain the exemplar auxiliary samples of all the seen classes from and (described in Section III-B);
/* The classifier adaptation scheme */
5 Fix the parameters and adapt the biased classifier by minimizing Equation (18) to obtain the optimal parameters ;
Output: The auxiliary exemplar samples and the parameters
Algorithm 1 Training the model at the -th session

Figure 3 illustrates the process of class-incremental learning with our duplet learning scheme and classifier adaptation in detail. For initialization, is trained from scratch with the set of duplet sample pairs ; then is constructed and stored in the memory in the form of . At the -th learning session, we extract the auxiliary samples for previous classes and for new classes. First, we train a domain-compatible feature extractor and a classifier on all the seen classes with our duplet learning scheme (described in Section III-C). Then, the exemplar auxiliary samples for all the seen classes are constructed from and (described in Section III-B). Finally, we update the classifier further with our classifier adaptation scheme (introduced in Section III-D). Algorithm 1 lists the training steps in detail.
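Putting the pieces together, one learning session can be summarized by the Python sketch below, which wires up the components sketched earlier (encode/decode, duplet_objective, herding_selection, adapt_classifier). The full-batch updates, the model attributes (feature_extractor, classifier), and all hyperparameters are simplifying assumptions, not the authors' implementation of Algorithm 1.

import torch

def train_session(model, old_model, x_new, y_new, memory, n_old,
                  budget, r, epochs=70, lr=0.1):
    # memory: dict mapping an old-class id to its stored auxiliary exemplars;
    # budget: buffer size in real-sample units; r: memory cost ratio.
    x_old_aux = torch.cat(list(memory.values()))
    y_old = torch.cat([torch.full((len(v),), c) for c, v in memory.items()])
    x_new_aux = decode(encode(x_new), size=x_new.shape[-2:])        # Eq. (5)

    # 1. Duplet learning over new-class pairs and old-class exemplars (Eq. (10)).
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        loss = duplet_objective(model, old_model, x_new, x_new_aux, y_new,
                                x_old_aux, y_old, n_old)
        opt.zero_grad(); loss.backward(); opt.step()

    # 2. Memory update: with per-sample cost r, each of the seen classes keeps
    #    about m = budget / (r * number of seen classes) auxiliary exemplars.
    n_seen = n_old + len(torch.unique(y_new))
    m = int(budget / (r * n_seen))
    for c in torch.unique(y_new).tolist():
        feats = model.feature_extractor(x_new_aux[y_new == c]).detach().numpy()
        memory[c] = x_new_aux[y_new == c][herding_selection(feats, m)]
    memory = {c: v[:m] for c, v in memory.items()}   # shrink old classes too

    # 3. Classifier adaptation on true labels of the stored exemplars (Eq. (18)).
    exemplar_loader = [(v, torch.full((len(v),), c)) for c, v in memory.items()]
    model.classifier = adapt_classifier(model.feature_extractor,
                                        model.classifier, exemplar_loader)
    return model, memory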

IV Experiments

IV-A Datasets

CIFAR-100 [27] is a labeled subset of the 80 million tiny images dataset for object recognition. This dataset contains 60000 RGB images in 100 classes, with 500 images per class for training and 100 images per class for testing.

ILSVRC [28] is the dataset of the ImageNet Large Scale Visual Recognition Challenge 2012. It contains 1.28 million training images and 50k validation images in 1000 classes.

IV-B Evaluation Protocol

We evaluate our method on the iCIFAR-100 and iILSVRC benchmarks proposed in [49]. On iCIFAR-100, in order to simulate a class-incremental learning process, we train all 100 classes in batches of 5, 10, 20, or 50 classes at a time, which means that 5, 10, 20, or 50 new classes are added at each learning session. After each batch of classes is added, accuracy is computed on the subset of the test data containing only those classes that have been added so far. The results we report are the average accuracy computed without the accuracy of the first learning session, since that session does not represent incremental learning, as described in [4]. On ILSVRC, we use a subset of 100 classes trained in batches of 10 (iILSVRC-small) [49]. For a fair comparison, we adopt the same experimental setup as [49], which randomly selects 50 samples of each ILSVRC class as the test set and uses the rest as the training set. The evaluation protocol is otherwise the same as on iCIFAR-100.
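For clarity, the reported metric can be computed as in the short sketch below; the function name is ours, and the first session is excluded because it involves no previously learned classes.

def average_incremental_accuracy(session_accuracies):
    # session_accuracies[i]: test accuracy over all classes seen up to
    # session i+1; the first session is skipped, following [4].
    return sum(session_accuracies[1:]) / (len(session_accuracies) - 1)

# Example: with 100 classes added 10 at a time there are 10 sessions,
# and the reported number averages the accuracies of sessions 2-10.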

Fig. 4: Illustration of original real samples from CIFAR-100 and their auxiliary sample equivalents with different fidelities.

IV-C Implementation Details

Data Preprocessing

On CIFAR-100, the only preprocessing we apply is the same as in iCaRL [49], including random cropping, data shuffling, and per-pixel mean subtraction. On ILSVRC, the augmentation strategy we use is random cropping and horizontal flipping.

Generating auxiliary samples of different fidelities

We generate auxiliary samples of different fidelities (as shown in Figure 4) with various kinds of encoder-decoder structures (e.g., PCA [21] and downsampling followed by upsampling). The fidelity factors we use (as defined in Section III-B) are as follows: for iCIFAR-100, we use , and with a PCA-based reduction and , with downsampling. For iILSVRC, we conduct the experiments with values of based on downsampling. In our experiments, we use a memory buffer of full samples. Since the sample fidelity and the number of exemplars are negatively correlated, our approach can store samples.
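The sketch below illustrates one plausible PCA-based encoder-decoder of this kind (NumPy): each flattened image is projected onto the top-k principal directions fitted on a set of images, only the k-dimensional code is kept in memory (plus a shared basis and mean), and the low-fidelity auxiliary image is reconstructed at training time. The exact fidelity factors and the precise PCA variant used in the paper are not reproduced here; all names are illustrative.

import numpy as np

def fit_pca_codec(images, k):
    # images: (n, H, W, C) array; k: number of principal components to keep.
    flat = images.reshape(len(images), -1).astype(np.float32)
    mean = flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:k]                     # shared mean and (k, d) basis

def pca_encode(images, mean, basis):
    flat = images.reshape(len(images), -1).astype(np.float32)
    return (flat - mean) @ basis.T          # (n, k) codes stored in memory

def pca_decode(codes, mean, basis, shape):
    recon = codes @ basis + mean            # low-fidelity auxiliary samples
    return recon.reshape((-1,) + shape)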

Training details

For iCIFAR-100, at each learning session we train a 32-layer ResNet [20] using SGD with a mini-batch size of ( duplet sample pairs composed of original samples and the corresponding auxiliary samples) under the duplet learning scheme. The initial learning rate is set to and is divided by after and epochs. We train the network with a weight decay of and a momentum of . For the classifier adaptation, we use the auxiliary exemplar samples for normal training, and the other experimental settings remain the same. We implement our framework with the Theano package and train the network on an NVIDIA TITAN 1080 Ti GPU. For iILSVRC, we train an 18-layer ResNet [20] with an initial learning rate of , divided by after , , and epochs. The rest of the settings are the same as those for iCIFAR-100.

Method          Exemplar    Average Accuracy
iCaRL           Real        60.79%
iCaRL           Auxiliary   50.86%
iCaRL-Hybrid1   Real        55.10%
iCaRL-Hybrid1   Auxiliary   44.86%
Ours.FC         Real        61.67%
Ours.FC         Auxiliary   67.04%
Ours.NCM        Real        61.97%
Ours.NCM        Auxiliary   66.95%
TABLE II: Evaluation of different methods using either auxiliary exemplars or real exemplars. The number of classes added at each session is .
Fidelity Factor   Method              Average Accuracy
PCA               iCaRL-Hybrid1       44.86%
PCA               iCaRL-Hybrid1+DUP   +14.67%
PCA               iCaRL-Hybrid1       42.74%
PCA               iCaRL-Hybrid1+DUP   +17.39%
PCA               iCaRL-Hybrid1       40.90%
PCA               iCaRL-Hybrid1+DUP   +16.88%
Downsampling      iCaRL-Hybrid1       41.88%
Downsampling      iCaRL-Hybrid1+DUP   +16.41%
Downsampling      iCaRL-Hybrid1       40.29%
Downsampling      iCaRL-Hybrid1+DUP   +16.96%
TABLE III: Validation of our duplet learning scheme on iCIFAR-100 with the auxiliary samples of different fidelities. With our duplet learning scheme (DUP), the average accuracy of the class-incremental model is enhanced by more than 10% compared to that of the normal learning scheme in [49] for all cases.

IV-D Ablation Experiments

In this section, we first evaluate different methods using either auxiliary exemplars or real exemplars. We then carry out two ablation experiments to validate our duplet learning and classifier adaptation schemes on iCIFAR-100, and a further ablation experiment to show the effect of the auxiliary samples' fidelity and the per-class auxiliary exemplar data size when updating the model. Finally, we evaluate the methods with memory buffers of different sizes.

IV-D1 Baseline

For the class-incremental learning problem, we consider three baselines: a) LWF.MC [33], which utilizes knowledge distillation for the incremental learning problem; b) iCaRL [49], which first utilized exemplars for old-class knowledge transfer together with a nearest-mean-of-exemplars classification strategy; and c) iCaRL-Hybrid1 [49], which also uses the exemplars but with a neural network classifier (i.e., a fully connected layer).

IV-D2 Using auxiliary exemplars or real exemplars

We evaluate different methods using either auxiliary exemplars or real exemplars, as shown in Table II. For a fair comparison, the memory cost for the auxiliary and the real exemplars is fixed to the same value. Our method is denoted by "Ours.FC" when we utilize the fully connected layer as the classifier, and "Ours.NCM" when utilizing the nearest-mean-of-exemplars classification strategy. Using auxiliary exemplars directly leads to a performance drop for both iCaRL and iCaRL-Hybrid1 because of the large domain gap between the auxiliary data and the real data (shown in Figure 2(a)). For our domain-invariant learning method, the average accuracy when using auxiliary exemplars is about higher than when using real exemplars. This suggests that, under the same memory buffer limitation, using auxiliary exemplars can further improve the performance compared with using real exemplars, as long as the domain drift between them is reduced.

Fidelity Factor   Method                 Average Accuracy
PCA               iCaRL-Hybrid1+DUP      59.53%
PCA               iCaRL-Hybrid1+DUP+CA   +7.51%
PCA               iCaRL-Hybrid1+DUP      60.13%
PCA               iCaRL-Hybrid1+DUP+CA   +3.93%
PCA               iCaRL-Hybrid1+DUP      57.77%
PCA               iCaRL-Hybrid1+DUP+CA   +0.63%
Downsampling      iCaRL-Hybrid1+DUP      58.29%
Downsampling      iCaRL-Hybrid1+DUP+CA   +2.63%
Downsampling      iCaRL-Hybrid1+DUP      57.25%
Downsampling      iCaRL-Hybrid1+DUP+CA   +0.68%
TABLE IV: Validation of our classifier adaptation scheme on iCIFAR-100 with the auxiliary samples of different fidelities. The model's accuracy is improved after updating the classifier further with the classifier adaptation scheme.
Fig. 5: Performance of the model when varying the auxiliary samples' fidelity and the exemplar auxiliary data size. When , the real exemplar samples are kept in memory directly [49].
Fig. 6: Average incremental accuracy on iCIFAR-100 with 10 classes per batch for memory buffers of different sizes (expressed in the number of real exemplar samples). The average accuracy of the model with our scheme is higher than that of iCaRL, LWF.MC, and iCaRL-Hybrid1 in all cases.
Fig. 7: The performance of different methods with incremental learning sessions of 5, 10, 20, and 50 classes on iCIFAR-100. The average accuracy over the incremental learning sessions is shown in parentheses for each method, computed without the accuracy of the first learning session. Our class-incremental learning scheme with auxiliary samples obtains the best results in all cases.

IV-D3 Validation of the duplet learning scheme

We evaluate our duplet learning scheme with auxiliary samples of different fidelities on the iCIFAR-100 benchmark, where new classes are added at each learning session. We utilize the "iCaRL-Hybrid1" method with the auxiliary samples under either the normal learning scheme in [49] or our duplet learning scheme (DUP). As shown in Table III, our scheme significantly improves the performance of the final model for various kinds of low-fidelity auxiliary samples compared with directly training the model. Moreover, the t-SNE [38] analysis in Figure 2(a) shows that there is normally a large gap between the auxiliary data and the real data without our duplet learning scheme. Figure 2(b) illustrates that our duplet learning scheme can actually reduce the domain drift and guarantee the effectiveness of the auxiliary data for preserving the model's performance on old classes.

IV-D4 Validation of the classifier adaptation scheme

We evaluate the performance of the final classifier after applying our classifier adaptation scheme (CA). As shown in Table IV, the classifier adaptation scheme can further improve the classifier's accuracy. For auxiliary samples of different fidelities based on PCA or downsampling, we observe that the performance improvement decreases along with the auxiliary samples' fidelity.

IV-D5 Balance of the auxiliary exemplar samples' size and fidelity

We examine the effect of varying the auxiliary samples' fidelity and the per-class auxiliary exemplar data size while the size of the limited memory buffer remains the same. Specifically, the fidelity decreases as the number of auxiliary samples increases, where the number of auxiliary samples is set to . As shown in Figure 5, the average accuracy of the model first increases and then decreases as the auxiliary samples' fidelity decreases (and their number increases), which means that moderately reducing the samples' fidelity can improve the final model's performance under a limited memory buffer. When the fidelity of an auxiliary sample is reduced too much, the model's performance drops because too much class knowledge is lost.

IV-D6 Fixed memory buffer size

We conduct experiments with memory buffers of different sizes on iCIFAR-100, where the number of classes added at each learning session is 10. The size of the memory buffer is expressed in the number of real exemplar samples. As shown in Figure 6, all of the exemplar-based methods ("Ours.FC", "iCaRL", and "iCaRL-Hybrid1") benefit from a larger memory, which indicates that more samples of old classes are useful for maintaining the performance of the model. The average accuracy of the model with our scheme is higher than that of iCaRL, LWF.MC, and iCaRL-Hybrid1 in all cases.

IV-E State-of-the-Art Performance Comparison

In this section, we evaluate the performance of our proposed scheme on the iCIFAR-100 and iILSVRC benchmarks against the state-of-the-art methods, including LWF.MC [33], iCaRL and iCaRL-Hybrid1 [49], ETE [4], and BiC [60].

Fig. 8: The performance of different methods with incremental learning sessions of 10 classes on iILSVRC-small. The average accuracy over all the incremental learning sessions is shown in parentheses for each method.

For iCIFAR-100, we evaluate incremental learning sessions of 5, 10, 20, and 50 classes; Figure 7 summarizes the results of the experiments. The memory size for all the evaluated methods is the same. We observe that our class-incremental learning scheme with auxiliary samples obtains the best results in all cases. Compared with iCaRL and iCaRL-Hybrid1, we achieve higher accuracy at each learning session. As new-class data arrive, the accuracy of our scheme decreases more slowly than that of ETE and BiC.

For iILSVRC-small, we evaluate the performance of our method with incremental learning sessions of 10 classes; the results are shown in Figure 8. It can also be observed that our scheme obtains the highest accuracy among all methods at each incremental learning session.

V Conclusion

In this paper, we have presented a novel memory-efficient exemplar preserving scheme and a duplet learning scheme for resource-constrained class-incremental learning, which transfers old-class knowledge with low-fidelity auxiliary samples rather than the original real samples. We have also proposed a classifier adaptation scheme for updating the classifier, which refines the biased classifier with samples of pure true class labels. Our scheme has obtained better results than the state-of-the-art methods on several datasets. As part of our future work, we plan to explore a low-fidelity auxiliary sample selection scheme that inherits more class information within a limited memory buffer.

Acknowledgment

The authors would like to thank Xin Qin for their valuable comments and suggestions.

References

  • [1] R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3366–3375. Cited by: §I.
  • [2] E. Belouadah and A. Popescu (2018) DeeSIL: deep-shallow incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 151–157. Cited by: §I.
  • [3] E. Belouadah and A. Popescu (2019) Il2m: class incremental learning with dual memory. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 583–592. Cited by: §II-A.
  • [4] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 233–248. Cited by: §I, §I, §II-A, §III-A, §III-B, §IV-B, §IV-E.
  • [5] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547. Cited by: §I, §II-B.
  • [6] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019) Efficient lifelong learning with a-gem. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §II-A.
  • [7] C. P. Chen, Z. Liu, and S. Feng (2018) Universal approximation capability of broad learning system and its structural variations. IEEE transactions on neural networks and learning systems (TNNLS) 30 (4), pp. 1191–1204. Cited by: §I.
  • [8] C. P. Chen and Z. Liu (2017) Broad learning system: an effective and efficient incremental learning system without the need for deep architecture. IEEE transactions on neural networks and learning systems (TNNLS) 29 (1), pp. 10–24. Cited by: §II.
  • [9] R. Coop, A. Mishtal, and I. Arel (2013) Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting. IEEE transactions on neural networks and learning systems (TNNLS) 24 (10), pp. 1623–1634. Cited by: §I.
  • [10] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2019) Continual learning: a comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383. Cited by: §I.
  • [11] G. Ding, Y. Guo, K. Chen, C. Chu, J. Han, and Q. Dai (2019) DECODE: deep confidence network for robust image classification. IEEE Transactions on Image Processing (TIP) 28 (8), pp. 3752–3765. Cited by: §II.
  • [12] Q. Dong, S. Gong, and X. Zhu (2018) Imbalanced deep learning by minority class incremental rectification. IEEE transactions on pattern analysis and machine intelligence (T-PAMI) 41 (6), pp. 1367–1381. Cited by: §I.
  • [13] R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §I.
  • [14] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2014) An empirical investigation of catastrophic forgetting in gradient-based neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §I.
  • [15] B. Gu, V. S. Sheng, K. Y. Tay, W. Romano, and S. Li (2014) Incremental support vector learning for ordinal regression. IEEE Transactions on Neural networks and learning systems (TNNLS) 26 (7), pp. 1403–1416. Cited by: §I.
  • [16] Y. Guo, G. Ding, J. Han, and Y. Gao (2017) Zero-shot learning with transferred samples. IEEE Transactions on Image Processing (TIP) 26 (7), pp. 3277–3290. Cited by: §II.
  • [17] Y. Hao, Y. Fu, Y. Jiang, and Q. Tian (2019) An end-to-end architecture for class-incremental object detection with knowledge distillation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §I.
  • [18] Y. Hao, Y. Fu, and Y. Jiang (2019) Take goods from shelves: a dataset for class-incremental object detection. In Proceedings of the International Conference on Multimedia Retrieval (ICMR), pp. 271–278. Cited by: §I.
  • [19] C. He, R. Wang, S. Shan, and X. Chen (2018) Exemplar-supported generative reproduction for class incremental learning. In Proceedings of the British Machine Vision Conference (BMVC), pp. 3–6. Cited by: §II-A, §III-A.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778. Cited by: §IV-C.
  • [21] H. Hotelling (1933) Analysis of a complex of statistical variables into principal components.. Journal of educational psychology 24 (6), pp. 417. Cited by: §IV-C.
  • [22] C. Hou and Z. Zhou (2017) One-pass learning with incremental and decremental features. IEEE transactions on pattern analysis and machine intelligence (T-PAMI) 40 (11), pp. 2776–2792. Cited by: §II.
  • [23] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 831–839. Cited by: §II-A.
  • [24] S. Huang, V. François-Lavet, and G. Rabusseau (2019) Neural architecture search for class-incremental learning. arXiv preprint arXiv:1909.06686. Cited by: §II-C.
  • [25] R. Kemker and C. Kanan (2018) Fearnet: brain-inspired model for incremental learning. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §II-C, §III-A.
  • [26] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences (PNAS), pp. 3521–3526. Cited by: §II-B.
  • [27] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §I, §IV-A.
  • [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in neural information processing systems (NeurIPS), pp. 1097–1105. Cited by: §IV-A.
  • [29] F. Lavda, J. Ramapuram, M. Gregorova, and A. Kalousis (2018) Continual classification learning using generative models. arXiv preprint arXiv:1810.10612. Cited by: §II-A.
  • [30] K. Lee, K. Lee, J. Shin, and H. Lee (2019) Overcoming catastrophic forgetting with unlabeled data in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 312–321. Cited by: §I.
  • [31] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Proceedings of the Advances in neural information processing systems (NeurIPS), pp. 4652–4662. Cited by: §I.
  • [32] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong (2019) Learn to grow: a continual structure learning framework for overcoming catastrophic forgetting. International Conference on Machine Learning (ICML). Cited by: §II-C.
  • [33] Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence (T-PAMI) 40 (12), pp. 2935–2947. Cited by: §I, §II-B, §IV-D1, §IV-E.
  • [34] Z. Lin, G. Ding, J. Han, and L. Shao (2017) End-to-end feature-aware label space encoding for multilabel classification with many classes. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 29 (6), pp. 2472–2487. Cited by: §II.
  • [35] X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. Lopez, and A. D. Bagdanov (2018) Rotate your networks: better weight consolidation and less catastrophic forgetting. In Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 2262–2268. Cited by: §I.
  • [36] Y. Liu, F. Nie, Q. Gao, X. Gao, J. Han, and L. Shao (2019) Flexible unsupervised feature extraction for image classification. Neural Networks 115, pp. 65–71. Cited by: §II.
  • [37] V. Lomonaco and D. Maltoni (2017) Core50: a new dataset and benchmark for continuous object recognition. arXiv preprint arXiv:1705.03550. Cited by: §II-C.
  • [38] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research (JMLR) 9 (Nov), pp. 2579–2605. Cited by: §IV-D3.
  • [39] A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 67–82. Cited by: §II-C.
  • [40] D. Maltoni and V. Lomonaco (2018) Continuous learning in single-incremental-task scenarios. arXiv preprint arXiv:1806.08568. Cited by: §II-C.
  • [41] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §I.
  • [42] Y. Nakamura and O. Hasegawa (2016) Nonparametric density estimation based on self-organizing incremental neural network for large noisy data. IEEE transactions on neural networks and learning systems (TNNLS) 28 (1), pp. 8–17. Cited by: §I.
  • [43] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §I.
  • [44] J. Park and J. Kim (2018) Incremental class learning for hierarchical classification. IEEE transactions on cybernetics 50 (1), pp. 178–189. Cited by: §II.
  • [45] A. Penalver and F. Escolano (2012) Entropy-based incremental variational bayes learning of gaussian mixtures. IEEE transactions on neural networks and learning systems (TNNLS) 23 (3), pp. 534–540. Cited by: §I.
  • [46] B. Pfülb and A. Gepperth (2019) A comprehensive, application-oriented study of catastrophic forgetting in dnns. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §I.
  • [47] J. Rajasegaran, M. Hayat, S. Khan, F. S. Khan, and L. Shao (2019) Random path selection for incremental learning. Cited by: §II-C.
  • [48] A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars (2017) Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1320–1328. Cited by: §I.
  • [49] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2001–2010. Cited by: §I, §I, §II-A, §III-A, Fig. 5, §IV-B, §IV-C, §IV-D1, §IV-D3, §IV-E, TABLE III.
  • [50] H. Ritter, A. Botev, and D. Barber (2018) Online structured laplace approximations for overcoming catastrophic forgetting. In Proceedings of the Advances in neural information processing systems (NeurIPS), pp. 3738–3748. Cited by: §I.
  • [51] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §II-C.
  • [52] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. arXiv preprint arXiv:1805.06370. Cited by: §II-B.
  • [53] J. Serrà, D. Surís, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423. Cited by: §I.
  • [54] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 2990–2999. Cited by: §II-A.
  • [55] K. Shmelkov, C. Schmid, and K. Alahari (2017) Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3400–3409. Cited by: §II-B.
  • [56] Y. Sun, K. Tang, Z. Zhu, and X. Yao (2018) Concept drift adaptation by exploiting historical knowledge. IEEE transactions on neural networks and learning systems (TNNLS) 29 (10), pp. 4822–4832. Cited by: §II.
  • [57] G. M. van de Ven and A. S. Tolias (2018) Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635. Cited by: §II-A.
  • [58] Z. Wang, H. Li, and C. Chen (2019) Incremental reinforcement learning in continuous spaces via policy relaxation and importance weighting. IEEE Transactions on Neural Networks and Learning Systems (TNNLS). Cited by: §I.
  • [59] M. Welling (2009) Herding dynamical weights to learn. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1121–1128. Cited by: §III-A.
  • [60] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu (2019) Large scale incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 374–382. Cited by: §II-A, §IV-E.
  • [61] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, Z. Zhang, and Y. Fu (2018) Incremental classifier learning with generative adversarial networks. arXiv preprint arXiv:1802.00853. Cited by: §II-A.
  • [62] Y. Xiang, Y. Fu, P. Ji, and H. Huang (2019) Incremental learning using conditional adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6619–6628. Cited by: §II-A.
  • [63] Y. Xing, F. Shen, and J. Zhao (2015) Perception evolution network based on cognition deepening model—adapting to the emergence of new sensory receptor. IEEE transactions on neural networks and learning systems (TNNLS) 27 (3), pp. 607–620. Cited by: §II.
  • [64] J. Xu and Z. Zhu (2018) Reinforced continual learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 907–916. Cited by: §I.
  • [65] Z. Yang, S. Al-Dahidi, P. Baraldi, E. Zio, and L. Montelatici (2019) A novel concept drift detection method for incremental learning in nonstationary environments. IEEE transactions on neural networks and learning systems (TNNLS). Cited by: §I.
  • [66] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: §II-B.
  • [67] H. Zhang, X. Xiao, and O. Hasegawa (2013) A load-balancing self-organizing incremental neural network. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 25 (6), pp. 1096–1105. Cited by: §I.
  • [68] J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C. J. Kuo (2019) Class-incremental learning via deep model consolidation. arXiv preprint arXiv:1903.07864. Cited by: §I.