Few-Shot Class-Incremental Learning via Feature Space Composition

06/28/2020 · Hanbin Zhao et al. · Zhejiang University

As a challenging problem in machine learning, few-shot class-incremental learning asynchronously learns a sequence of tasks, acquiring new knowledge from new tasks (with limited new samples) while keeping the learned knowledge from previous tasks (whose samples are discarded). In general, existing approaches resort to one unified feature space for balancing old-knowledge preserving and new-knowledge adaptation. With a limited embedding capacity of feature representation, the unified feature space often makes the learner suffer from semantic drift or overfitting as the number of tasks increases. Motivated by this, we propose a novel few-shot class-incremental learning pipeline based on a composite representation space, which makes old-knowledge preserving and new-knowledge adaptation mutually compatible by feature space composition (enlarging the embedding capacity). The composite representation space is generated by integrating two space components (i.e. a stable base knowledge space and a dynamic lifelong-learning knowledge space) in terms of distance metric construction. With the composite feature space, our method performs remarkably well on the CUB200 and CIFAR100 datasets, outperforming the state-of-the-art algorithms by a large margin (e.g. 10.58% on CUB200).

Code: FSCIL_Feature_Space_Composition — the implementation of the paper "Few-Shot Class-Incremental Learning via Feature Space Composition".


1 Introduction

In recent years, continual learning De Lange et al. (2019); Parisi et al. (2019) (i.e. incremental learning or lifelong learning) has received considerable attention due to its capability of continual model learning in a wide range of real-world applications. In principle, continual learning aims at enabling a learner to acquire new knowledge from new data while preserving the learned knowledge from previous data. Continual learning is usually conducted under the task-incremental or the class-incremental learning scenario. This paper focuses on the latter, which is more difficult since the task identity is unavailable at inference time. In practice, the knowledge from new tasks is often represented by a rather limited number of samples. To meet this practical demand, few-shot class-incremental learning Tao et al. (2020) has attracted a lot of attention; it typically involves the learning components of the base task (i.e., the first task, with large-scale training samples) and the new tasks (with limited samples). In this learning scenario, the learner strives to dynamically maintain a unified discriminative feature space that represents the new knowledge while preserving the discriminative information from the base task. As a result, it is often caught in a dilemma between forgetting old knowledge (after a sequence of new tasks) and overfitting new samples (caused by limited samples). In this paper, we therefore focus on building an effective representation space for few-shot class-incremental learning that well balances old-knowledge preserving and new-knowledge adaptation.

In general, existing works address the class-incremental learning problem from the following three perspectives: 1) architectural strategies that add and remove network components Yoon et al. (2018); Li et al. (2019); Hung et al. (2019); 2) rehearsal strategies that store past samples (or other information about old tasks) Chaudhry et al. (2019); Lopez-Paz and Ranzato (2017); Shin et al. (2017); Aljundi et al. (2019); and 3) regularization strategies that regularize the updating of the network's parameters with the constraint of the learned knowledge Lee et al. (2017); Li and Hoiem (2017); Nguyen et al. (2018); Ritter et al. (2018). In principle, the aforementioned approaches adopt a common strategy: they maintain only one unified feature space, seeking to make old-knowledge preserving and new-knowledge adaptation mutually compatible. In the unified feature space (with a limited embedding capacity of feature representation), the learning directions for old-knowledge preserving and new-knowledge adaptation are usually inconsistent with each other (and sometimes even contradictory). With an asynchronous updating strategy (for old data and new data), the unified feature space for class-incremental learning tends to embed more information about new data as the number of tasks increases; consequently, its embedding capacity for old data is greatly compressed. In the case of few-shot class-incremental learning, with only a limited number of new-task samples provided, the learned feature space is prone to suffer from semantic drift (i.e. catastrophic forgetting) or overfitting Tao et al. (2020); French (1999); Goodfellow et al. (2014); McCloskey and Cohen (1989); Pfülb and Gepperth (2019).

Motivated by the above observation, we propose a novel composite representation space for few-shot class-incremental learning. The proposed composite feature space is composed of two components: 1) a base knowledge space; and 2) a lifelong-learning knowledge space. The base knowledge space embeds the discriminative feature information of the base task. We keep it invariant, thereby providing stable and consistent features for all data samples throughout the entire learning process. In comparison, the lifelong-learning knowledge space corresponds to dynamically updated feature extractors for continually acquiring new knowledge with the arrival of new task data. Based on these two space components, we design a composite space by simple yet effective feature space concatenation. As a result, the composite feature space is capable of adaptively embedding new knowledge through the lifelong-learning component while maintaining a stable and informative feature representation through the base component. In this way, the embedding capacity of the feature representation is significantly increased by the stable base feature extractor (for consistent features) together with the lifelong-learning feature extractors (for new-knowledge adaptation). Furthermore, a distance metric can be constructed over the composite feature space. From extensive experiments, we find that even an extremely simple uniform metric strategy results in a dramatic performance improvement.

In summary, the main contributions of this work lie in the following three aspects: 1) We formulate the problem of few-shot class-incremental learning from a novel perspective of feature space composition, and build a novel composite representation space that aims at well balancing old-knowledge preserving and new-knowledge adaptation. 2) We present a simple yet effective feature space composition strategy that converts the space composition problem into distance metric construction for different space components with respect to different continual learning stages. 3) Extensive experiments over all the datasets demonstrate the effectiveness of our approach, which outperforms state-of-the-art approaches by a large margin on both CUB200 and CIFAR100.

2 Related Work

Recently, there has been a large body of research in continual learning De Lange et al. (2019); Parisi et al. (2019); Li et al. (2020); Lee et al. (2020); Adel et al. (2020); von Oswald et al. (2020); Kurle et al.; Titsias et al. (2019); Ebrahimi et al. (2020); Zeng et al. (2019). These works can be categorized into three major families: 1) architectural strategies, 2) rehearsal strategies, and 3) regularization strategies. Architectural strategies Yoon et al. (2018); Li et al. (2019); Hung et al. (2019); Mallya and Lazebnik (2018); Mallya et al. (2018); Serrà et al. (2018); Rusu et al. (2016); Aljundi et al. (2017); Rajasegaran et al. (2019) keep the learned knowledge from previous tasks and acquire new knowledge from the current task by manipulating the network architecture, e.g. parameter masking or network pruning. Rehearsal strategies Chaudhry et al. (2019); Lopez-Paz and Ranzato (2017); Shin et al. (2017); Aljundi et al. (2019); Zhai et al. (2019); Wu et al. (2018); Rebuffi et al. (2017); Castro et al. (2018); He et al. (2018); Hou et al. (2019); Wu et al. (2019); Liu et al. (2020) replay old-task information when learning the new task; the past knowledge is memorized by storing old tasks' exemplars or by modeling old tasks' data distributions via generative models. Regularization strategies Lee et al. (2017); Li and Hoiem (2017); Nguyen et al. (2018); Ritter et al. (2018); Kirkpatrick et al. (2017); Zenke et al. (2017); Liu et al. (2018); Aljundi et al. (2018); Dhar et al. (2019) alleviate forgetting by regularization loss terms that enable the updated network parameters to retain past knowledge. Continual learning is usually conducted under the task-incremental or the class-incremental learning scenario. This paper considers continual learning in the difficult class-incremental scenario, where the task identity is unavailable at inference time. Few-shot class-incremental learning Tao et al. (2020) is a more practical and challenging problem, where only a few samples are available for each new task. The aforementioned approaches resort to one unified lifelong-learning feature space for balancing old-task knowledge preserving and new-task knowledge adaptation. With old samples discarded and only a few new samples available, we find that the embedding capacity of one unified feature space is limited. To enlarge the embedding capacity, we propose a composite feature space for few-shot class-incremental learning.

3 Method

3.1 Few-Shot Class-Incremental Learning

In a class-incremental learning setup, a network learns several tasks continually and each task contains a batch of new classes Tao et al. (2020); Rajasegaran et al. (2019); Yu et al. (2020). The time interval from the arrival of the current task to that of the next task is considered as a class-incremental learning session Kemker and Kanan (2018). We suppose the training set for the $t$-th task is $\mathcal{D}^{(t)}$, which arrives at the $t$-th session. $\mathcal{C}^{(t)}$ is the set of classes of the $t$-th task, and we consider the generally studied case where there is no overlap between the classes of different tasks: $\mathcal{C}^{(i)} \cap \mathcal{C}^{(j)} = \emptyset$ for $i \neq j$. At the $t$-th session, we only have access to the data $\mathcal{D}^{(t)}$, and the model, with a feature extractor and a classifier, is trained on it.

In a few-shot class-incremental learning setup, each $\mathcal{D}^{(t)}$ ($t > 1$) contains only a few samples, except the training set of the first task (termed the base task), $\mathcal{D}^{(1)}$, which instead has a large number of training samples and classes. For each new task, if the number of classes is $N$ and the number of training samples per class is $K$, the setting is denoted as $N$-way $K$-shot class-incremental learning Tao et al. (2020). Our goal is to obtain a model which performs well on all the seen tasks.
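For concreteness, the following is a minimal sketch of how $N$-way $K$-shot incremental sessions can be carved out of a labeled dataset under this protocol; the `Session` container, function name, and split logic are illustrative assumptions, not the paper's code.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Session:
    """One incremental session: a batch of new classes with their training samples."""
    classes: list
    train_samples: list = field(default_factory=list)  # (sample, label) pairs

def build_fscil_sessions(samples_by_class, base_classes, n_way, k_shot, seed=0):
    """Split classes into a large base session plus disjoint N-way K-shot sessions.

    samples_by_class: dict mapping class label -> list of training samples.
    base_classes:     number of classes in the base task (all samples kept).
    """
    rng = random.Random(seed)
    classes = sorted(samples_by_class)
    rng.shuffle(classes)

    # Base task: many classes, all of their training samples.
    base = Session(classes[:base_classes])
    for c in base.classes:
        base.train_samples += [(x, c) for x in samples_by_class[c]]
    sessions = [base]

    # New tasks: disjoint batches of n_way classes, k_shot samples per class.
    rest = classes[base_classes:]
    for i in range(0, len(rest), n_way):
        sess = Session(rest[i:i + n_way])
        for c in sess.classes:
            sess.train_samples += [(x, c) for x in rng.sample(samples_by_class[c], k_shot)]
        sessions.append(sess)
    return sessions
```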

Utilizing an embedding network as the feature extractor has been shown to yield superior performance for class-incremental learning Yu et al. (2020), and is adopted in our work; this framework is introduced in Section 3.2. Our feature space composition method for few-shot class-incremental learning is elaborated in Section 3.3, and an overview of our method is shown in Figure 1.

Figure 1: Illustration of our method. At the 1st session, a base embedding model is initially trained on the large-scale training set of the base task. At the $t$-th ($t > 1$) learning session, the embedding model is first updated on the data of the $t$-th task; we then compose the base feature space with the lifelong-learning feature space obtained from the updated embedding network, and finally use the composite feature space for classification.

3.2 Class-Incremental Learning with Embedding Networks

Previous methods mainly formulate class-incremental learning with two kinds of image classification frameworks. The first is composed of a feature extractor and a softmax classifier, trained jointly with a cross-entropy loss. The other is composed of an embedding network for feature extraction and a nearest class mean (NCM) classifier Rebuffi et al. (2017); Yu et al. (2020), where only the embedding network is learnable. Using a softmax classifier for class-incremental learning is challenging and becomes harder for longer task sequences, as discussed in detail in Yu et al. (2020); we therefore focus on the latter framework in this paper.

Embedding networks training.

Embedding networks map a given sample to an informative representation space where distance represents the semantic discrepancy between samples Bromley et al. (1994). To this end, a metric learning loss is utilized to ensure that the distance between similar instances is small, and that between dissimilar instances is larger than a margin. In the context of class-incremental learning, the embedding space is associated with the problem of minimizing the following objective function:

$$\mathcal{L} = \mathcal{L}_{m} + \gamma \, \mathcal{L}_{r}, \tag{1}$$

where $\mathcal{L}_{m}$ is the metric loss term and $\mathcal{L}_{r}$ is the regularization term for retaining past knowledge Li and Hoiem (2017); Kirkpatrick et al. (2017); Aljundi et al. (2018); Yu et al. (2020). The triplet loss Wang et al. (2014) is often adopted as the metric learning loss $\mathcal{L}_{m}$:

$$\mathcal{L}_{m} = \max\big(0,\; d(x_{a}, x_{p}) - d(x_{a}, x_{n}) + m\big), \tag{2}$$

where $d(x_{a}, x_{p})$ and $d(x_{a}, x_{n})$ are the Euclidean distances between the embeddings of the anchor $x_{a}$ and the positive instance $x_{p}$ and the negative instance $x_{n}$, respectively, and $m$ denotes the margin. We denote the embedding of a given sample $x$ as $e(x)$ in the rest of this paper.
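As a concrete reference, here is a minimal PyTorch sketch of the training objective in Equations (1)–(2). The regularization term is instantiated here as a simple penalty keeping the current embeddings close to those of the frozen previous model; this particular form of $\mathcal{L}_{r}$, along with the network, the weight `gamma`, and the triplet loader, are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from torch import nn

def train_session(model, old_model, loader, gamma=1.0, margin=0.5, lr=1e-4, epochs=10):
    """One incremental session of embedding-network training (Eq. 1-2).

    model:     embedding network being updated at the current session
    old_model: frozen copy from the previous session (None at the base session)
    loader:    yields (anchor, positive, negative) input triplets
    gamma:     weight of the knowledge-retention regularizer (assumed form)
    """
    triplet = nn.TripletMarginLoss(margin=margin, p=2.0)   # Eq. (2)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for xa, xp, xn in loader:
            ea, ep, en = model(xa), model(xp), model(xn)
            loss = triplet(ea, ep, en)                     # metric loss L_m
            if old_model is not None:
                with torch.no_grad():
                    ea_old = old_model(xa)
                # L_r: keep current embeddings close to the old ones
                loss = loss + gamma * F.mse_loss(ea, ea_old)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```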

Classification with the embedding space.

With a trained embedding network, an NCM classifier Rebuffi et al. (2017); Mensink et al. (2013) is utilized for classification, which is defined as:

$$y^{*} = \arg\min_{c} \, d\big(e(x), \mu_{c}\big), \tag{3}$$

where $d$ is the distance metric (e.g. Euclidean distance) and $\mu_{c}$ is the prototype of class $c$ Rebuffi et al. (2017), i.e. the class embedding mean, defined as:

$$\mu_{c} = \frac{1}{n_{c}} \sum_{i} \mathbb{1}[y_{i} = c] \, e(x_{i}), \tag{4}$$

where $n_{c}$ is the number of training samples of class $c$ and $\mathbb{1}[\cdot]$ equals $1$ if its argument is true and $0$ otherwise.
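A minimal NCM classifier along the lines of Equations (3)–(4) can be written as follows; this is an illustrative sketch, not the paper's code.

```python
import torch

def class_prototypes(embeddings, labels):
    """Eq. (4): per-class mean of the training embeddings.

    embeddings: (n, d) tensor; labels: (n,) integer tensor.
    Returns (num_classes, d) prototypes and the sorted class ids.
    """
    classes = labels.unique(sorted=True)
    protos = torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])
    return protos, classes

def ncm_predict(query, protos, classes):
    """Eq. (3): assign each query to the class with the nearest prototype."""
    dists = torch.cdist(query, protos)   # (m, num_classes) Euclidean distances
    return classes[dists.argmin(dim=1)]
```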

Figure 2: (Best viewed in color.) Visualization of samples in “the lifelong-learning feature space” or “the composite feature space” by t-SNE. Samples of ten classes are from two tasks and each class is represented by one color. (a): Samples in the lifelong-learning feature space at an early session and a later session; (b): Samples in the composite feature space at an early session and a later session.

3.3 Feature Space Composition

For few-shot class-incremental learning with an embedding network and an NCM classifier, the goal is to obtain a high-quality embedding space which represents samples of all the seen tasks well. Specifically, a high-quality feature space is expected to: 1) provide consistent features for all the seen samples and prevent overfitting while being updated asynchronously with old-task and new-task data, and 2) fit the new-task data well with only a small number of new samples. In this section, we give a detailed analysis of learning with a lifelong-learning feature space, and then introduce an effective exploration based on space composition.

Learning with a lifelong-learning space.

Since the data of old tasks is unavailable and the number of new samples is small, the lifelong-learning feature space is prone to overfit the new tasks and easily becomes ill-posed. As shown in Figure 2-(a), with the arrival of new tasks, the samples of different classes, which are well separated at an early session, come to overlap with each other at a later session. This indicates that the discriminability of the lifelong-learning feature space degrades after several sessions: samples that were well discriminated at an early session can no longer be distinguished later.

To this end, we propose to introduce a feature space that keeps the consistency and discriminability of features throughout the learning process. In the few-shot class-incremental learning setting, since the samples of the base task are adequate, such a space, referred to as the base feature space, can be preserved and used at each session to improve consistency.

Motivation for feature space composition.

Figure 3: Comparison of accuracy for the current task on MiniImageNet.

Using the base feature space alone for classification at all sessions yields worse results, as shown in Figure 3. A large performance gap on the data of the current task exists between the base feature space and the lifelong-learning one, indicating that the lifelong-learning space is important for fitting the new samples. Therefore, we propose to obtain a high-quality feature space at each session by composing the base feature space and the lifelong-learning feature space, as shown in Figure 1.

Classification with the composite space.

We use $g(\cdot, \cdot)$ to denote the composition function (e.g., for a naive implementation, a simple concatenation operation). The composite feature for a sample $x$ is denoted as $e_{comp}(x) = g\big(e_{b}(x), e_{l}(x)\big)$, where $e_{b}(x)$ denotes the embedding in the base feature space and $e_{l}(x)$ denotes that in the lifelong-learning feature space (trained at the current $t$-th session). We conduct classification in this composite feature space, which is defined as:

$$y^{*} = \arg\min_{c} \, d_{M}\big(e_{comp}(x), \mu_{c}\big), \tag{5}$$
$$d_{M}(u, v) = (u - v)^{\top} M \, (u - v), \tag{6}$$

where $M$ is a metric matrix. For a simple formulation, $M$ can be a diagonal matrix, thereby indicating the importance of the two parts of features from the different feature spaces; all features are weighted equally if $M$ is an identity matrix. After feature space composition, we obtain a high-quality feature space, as shown in Figure 2-(b).
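The following sketch illustrates the simple concatenation-based composition with a diagonal metric, in the spirit of Equations (5)–(6); the scalar weighting via `lam` anticipates the simplest strategy discussed below, and all names are illustrative assumptions.

```python
import torch

def composite_embed(x, base_net, lifelong_net):
    """Concatenate the frozen base embedding with the current lifelong embedding."""
    with torch.no_grad():
        eb = base_net(x)       # stable base feature space
    el = lifelong_net(x)       # dynamically updated feature space
    return torch.cat([eb, el], dim=-1)

def composite_ncm_predict(query, protos, classes, lam=0.5):
    """Eq. (5)-(6) with M = diag(lam * I, (1 - lam) * I).

    query/protos are composite features; the first half of each vector comes
    from the base space, the second half from the lifelong-learning space.
    """
    d = query.shape[-1] // 2
    w = torch.cat([torch.full((d,), lam), torch.full((d,), 1.0 - lam)])
    diff = query.unsqueeze(1) - protos.unsqueeze(0)   # (m, C, 2d)
    dists = (w * diff.pow(2)).sum(dim=-1)             # weighted squared distance
    return classes[dists.argmin(dim=1)]
```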

Complementarity between the base space and the lifelong-learning space.

We here briefly show that the two spaces are complementary, even with the simplest implementation strategy for space composition, where $M$ is constructed as $M = \mathrm{diag}\big(\lambda I, (1 - \lambda) I\big)$ with a scalar $\lambda \in [0, 1]$ ($I$ is an identity matrix whose dimension is half of $M$'s). $\lambda = 1$ means using only the base feature space, and $\lambda = 0$ means using only the lifelong-learning feature space at the current session. The change of accuracy with respect to $\lambda$ is shown in Figure 4. Using the base feature space alone achieves the lowest accuracy, since it contains limited new-task knowledge. The performance of using the lifelong-learning feature space independently is also lower than that of the composite space, because of the limited number of new samples. The performance peaks at an intermediate value of $\lambda$, which indicates the complementarity. More sophisticated forms of the metric matrix $M$ can also be constructed (e.g., in a data-driven manner); the discussion and analysis of another space composition strategy is detailed in Section 4.4.
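As a usage example, the sweep over $\lambda$ behind Figure 4 can be reproduced with the `composite_ncm_predict` helper sketched above; the validation tensors here are assumed to be precomputed composite features and prototypes.

```python
import torch

def sweep_lambda(val_query, val_labels, protos, classes, steps=11):
    """Evaluate the composite NCM classifier for several values of lambda
    and return the best weighting, illustrating the complementarity of the
    base (lambda=1) and lifelong-learning (lambda=0) spaces."""
    best_lam, best_acc = 0.0, 0.0
    for lam in torch.linspace(0.0, 1.0, steps):
        pred = composite_ncm_predict(val_query, protos, classes, lam=lam.item())
        acc = (pred == val_labels).float().mean().item()
        if acc > best_acc:
            best_lam, best_acc = lam.item(), acc
    return best_lam, best_acc
```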

4 Experiments and Results

4.1 Datasets

CIFAR100 Krizhevsky and Hinton (2009) is a labeled subset of the 80 million tiny images dataset for object recognition. It contains 60,000 RGB images of size 32×32 in 100 classes, with 500 images per class for training and 100 images per class for testing. CUB200-2011 Wah et al. (2011) contains 5,994 images for training and 5,794 for testing, over 200 bird categories. It was originally designed for fine-grained image classification. MiniImageNet is a subset of ImageNet-1k Vinyals et al. (2016) widely utilized in few-shot learning. It contains 100 classes; each class contains 500 images for training and 100 images for testing. ImageNet-Subset contains 100 randomly chosen classes from ImageNet Deng et al. (2009).

Method                        Learning session                                                             Our relative
                              1      2      3      4      5      6      7      8      9      10     11     improvement
iCaRL Rebuffi et al. (2017)   68.68  52.65  48.61  44.16  36.62  29.52  27.83  26.26  24.01  23.89  21.16  +32.00
EEIL Castro et al. (2018)     68.68  53.63  47.91  44.20  36.30  27.46  25.93  24.70  23.95  24.13  22.11  +31.05
NCM Hou et al. (2019)         68.68  57.12  44.21  28.78  26.71  25.66  24.62  21.52  20.12  20.06  19.87  +33.29
TOPIC Tao et al. (2020)       68.68  62.49  54.81  49.99  45.25  41.40  38.35  35.36  32.22  28.31  26.28  +26.88
SDC Yu et al. (2020)          72.29  68.22  61.94  61.32  59.83  57.30  55.48  54.20  49.99  48.85  42.58  +10.58
Ours                          72.29  68.59  64.73  63.45  61.03  59.13  58.24  56.89  55.45  54.54  53.16
Table 1: Comparison results on CUB200-2011 with ResNet18 using the 10-way 5-shot few-shot class-incremental learning setting. Columns 1–11 report the accuracy (%) at each learning session; the last column is our relative improvement at the final session.
Figure 5: Comparison results on CIFAR100 and miniImageNet with ResNet18 using the 5-way 5-shot few-shot class-incremental learning setting.

4.2 Implementation Details

Evaluation protocol:

We conduct experiments under the few-shot class-incremental setting. Following Tao et al. (2020), we evaluate our method on three datasets (CIFAR100, miniImageNet and CUB200-2011) with similar evaluation protocols. For the CIFAR100 and miniImageNet datasets, we choose 60 and 40 classes for the base task and the new tasks, respectively, and adopt the 5-way 5-shot setting, leading to 9 training sessions in total. For CUB200-2011, we adopt the 10-way 5-shot setting, splitting 100 classes into 10 new learning sessions while the base task has the remaining 100 classes. For all datasets, we construct the training set of each learning session by randomly selecting training samples per class from the original dataset, and the test set is the same as the original one. After training on a new batch of classes, we evaluate the trained model on the test samples of all seen classes. To show the generality of our method, we also conduct experiments in the common class-incremental learning scenario Rajasegaran et al. (2019); Rebuffi et al. (2017). Following Yu et al. (2020), we conduct a six-task scenario on CUB200-2011 and an eleven-task scenario on ImageNet-Subset.

Training details:

We implement our models with PyTorch and use Adam Kingma and Ba (2014) for optimization. In the few-shot class-incremental learning setting, following Tao et al. (2020), we use ResNet18 He et al. (2016) as the backbone network. The base model is trained with the same strategy as in Tao et al. (2020) for a fair comparison. For each new task, we fine-tune our models with a small learning rate for a few epochs on CUB200-2011, CIFAR100 and miniImageNet. In the common class-incremental learning setting, we follow Yu et al. (2020) and use ResNet18 as the backbone for both ImageNet-Subset and CUB200-2011; the training strategy is the same as in Yu et al. (2020). We use an embedding network as the feature extractor, and its output dimension is kept fixed across sessions. For each new task on ImageNet-Subset, the model is likewise trained with a small learning rate for a few epochs. For data augmentation, we use standard random cropping and flipping as in Tao et al. (2020).

4.3 Comparison to State-of-the-Art Methods

In this section, we compare our proposed method in the few-shot class-incremental learning scenario with existing state-of-the-art methods, including iCaRL Rebuffi et al. (2017), EEIL Castro et al. (2018), NCM Hou et al. (2019), TOPIC Tao et al. (2020) and SDC Yu et al. (2020). The first four methods use a softmax classifier and the last one uses a nearest class mean classifier, which is why they achieve different accuracies at the first task (the base task). Table 1 reports the test accuracy of different methods on the CUB200-2011 dataset. Compared with the others, our method achieves a clear superiority (more than 10% over the strongest competitor at the final session). As shown in Figure 5, our method beats all other methods at each encountered learning session on the CIFAR100 and miniImageNet datasets. As continual learning proceeds, the superiority of our method becomes more significant.

4.4 Empirical Analysis

The effect of the number of training samples.

Figure 6: Comparison results under different few-shot class-incremental learning settings on CIFAR100.

In order to examine the effect of the number of training samples, we evaluate our method under several few-shot settings with different numbers of shots per class. As shown in Figure 6, the performance of our method increases as the number of training samples increases. It can also be noticed that the number of samples matters more at later training sessions, and the performance gap grows more rapidly when the number of samples gets larger. Specifically, the gap between the smaller-shot settings appears mainly at the later sessions, while that between the larger-shot settings can be noticed as early as the third session.

Method                 Learning session                                                             Our relative
                       1      2      3      4      5      6      7      8      9      10     11     improvement
SDC Yu et al. (2020)   72.29  68.22  61.94  61.32  59.83  57.30  55.48  54.20  49.99  48.85  42.58  +11.18
Ours-simple            72.29  68.59  64.73  63.45  61.03  59.13  58.24  56.89  55.45  54.54  53.16  +0.60
Ours-pca               72.29  69.73  64.29  63.53  61.26  60.00  57.89  57.78  56.31  55.78  53.76
Table 2: Comparison results on CUB200-2011 with different strategies for feature space composition using the 10-way 5-shot class-incremental learning setting; the last column is the relative improvement of Ours-pca at the final session.

Different strategies for feature space composition.

We evaluate our method with another, more sophisticated strategy for feature space composition, as shown in Table 2. While the simplest composition strategy is denoted by "Ours-simple", the sophisticated version, which first concatenates the base feature space and the lifelong-learning space and then reduces the dimension of the composite space by principal component analysis (PCA), is denoted by "Ours-pca". Specifically, we compute a PCA transformation matrix $P_{b}$ with the features of the base-task samples (obtained by the base model). Another transformation matrix $P_{l}$ can be learned incrementally Hall et al. (1998); Weng et al. (2003) with samples from the subsequent tasks. We then reduce the dimension to, e.g., half of the original dimension $d$, so that $P_{b}$ and $P_{l}$ are both of size $\frac{d}{2} \times d$. For concatenated features in the composite space, the overall transformation matrix can be constructed as $P = \mathrm{diag}(P_{b}, P_{l})$. Given a feature $e_{comp}(x)$ and a center $\mu_{c}$ for class $c$, the dimension-reduced data are $P \, e_{comp}(x)$ and $P \mu_{c}$, respectively, and the classification can be formulated as:

$$y^{*} = \arg\min_{c} \, \big\| P \, e_{comp}(x) - P \mu_{c} \big\|_{2}. \tag{7}$$

Compared with Equation (5), the matrix $P^{\top} P$ can be considered as a specific data-driven metric matrix $M$; likewise, the transformation $P$ can be viewed as part of the composition function $g$, so this strategy is also well formulated in our framework. The results are shown in Table 2. We keep $P_{l}$ fixed to the base-task estimate in the few-shot scenario, because so few new-task samples cannot yield a reasonable estimate of $P_{l}$.
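A minimal NumPy sketch of this PCA-based composition (Equation (7)) follows; the half-dimension reduction and the reuse of the base-task components for both halves mirror the few-shot setting described above, and all names are illustrative assumptions.

```python
import numpy as np

def pca_matrix(feats, out_dim):
    """Top-`out_dim` principal directions of `feats` ((n, d) array), as rows."""
    centered = feats - feats.mean(axis=0)
    # Rows of vt are principal directions sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:out_dim]                                   # (out_dim, d)

def build_block_transform(base_feats, out_dim):
    """P = diag(P_b, P_l); here P_l is reused from the base task (few-shot case)."""
    pb = pca_matrix(base_feats, out_dim)
    pl = pb.copy()  # assumed: too few new samples to estimate P_l reliably
    d = base_feats.shape[1]
    p = np.zeros((2 * out_dim, 2 * d))
    p[:out_dim, :d] = pb
    p[out_dim:, d:] = pl
    return p

def pca_ncm_predict(query, protos, p):
    """Eq. (7): nearest class mean in the PCA-reduced composite space."""
    q, m = query @ p.T, protos @ p.T                      # project features and centers
    dists = np.linalg.norm(q[:, None, :] - m[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```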

Method                 Learning session                          Our relative
                       1      2      3      4      5      6      improvement
SDC Yu et al. (2020)   85.09  76.75  71.40  67.45  66.36  63.72  +1.07
Ours                   85.09  77.65  72.19  67.79  67.41  64.79
Table 3: Comparison results on CUB200-2011 with ResNet18 in the six-task class-incremental learning setting.

Method                 Learning session                                                             Our relative
                       1      2      3      4      5      6      7      8      9      10     11     improvement
SDC Yu et al. (2020)   78.72  73.35  67.87  66.62  63.77  61.15  57.43  56.49  52.91  50.08  49.66  +3.64
Ours                   78.72  76.84  72.80  68.55  66.40  64.11  61.43  59.18  57.18  54.15  53.30
Table 4: Comparison results on ImageNet-Subset with ResNet18 in the eleven-task class-incremental learning setting.
Figure 7: Comparison of average forgetting on CUB200-2011 in the 10-way 5-shot setting.

Average forgetting of previous tasks.

The forgetting of previous tasks is estimated with the average forgetting measure Yu et al. (2020); Chaudhry et al. (2018). We illustrate the forgetting curves of our method across learning sessions on CUB200-2011 in Figure 7. We observe that our method achieves better performance compared with SDC, especially at the later learning sessions. These results indicate the stability of our composite feature space against the continuous arrival of new tasks.
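For reference, average forgetting in the sense of Chaudhry et al. (2018) can be computed from the matrix of per-session accuracies as sketched below; `acc[i, j]` denotes the accuracy on task $j$ after training session $i$, and the code is an illustrative implementation of the standard definition rather than the paper's evaluation script.

```python
import numpy as np

def average_forgetting(acc):
    """Average forgetting after the last session (Chaudhry et al., 2018).

    acc: (T, T) array where acc[i, j] is the accuracy on task j after
         training session i (entries with j > i are unused).
    For each old task j, forgetting is the gap between the best accuracy
    it ever reached and its accuracy after the final session.
    """
    acc = np.asarray(acc, dtype=float)
    t = acc.shape[0]
    drops = [acc[:t - 1, j].max() - acc[t - 1, j] for j in range(t - 1)]
    return float(np.mean(drops))

# Toy usage: three sessions, accuracy on earlier tasks decays over time.
acc = np.array([[0.80, 0.00, 0.00],
                [0.70, 0.75, 0.00],
                [0.60, 0.65, 0.72]])
print(average_forgetting(acc))  # ((0.80-0.60) + (0.75-0.65)) / 2 = 0.15
```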

Results in class-incremental learning scenario.

Our method also performs well in the common class-incremental learning scenario, as shown in Table 3 and Table 4. Due to the adequate training samples in the subsequent tasks, the model is able to learn a better representation than in the few-shot setting, and therefore the improvement is not as large. Compared with SDC, the average accuracy of our method is higher at all learning sessions on the CUB200-2011 dataset and drops more slowly as the number of learning sessions increases. On the ImageNet-Subset dataset, our method surpasses SDC by a large margin at all learning sessions.

5 Conclusion

In this paper, we have proposed a novel few-shot class-incremental learning scheme based on a composite representation space, which harmonizes old-knowledge preserving and new-knowledge adaptation by feature space composition. The composite feature space is built by combining a stable base knowledge feature space and a lifelong-learning feature space in terms of distance metric construction. Comprehensive experimental results demonstrate that the proposed approach outperforms state-of-the-art approaches by a large margin.

References

  • T. Adel, H. Zhao, and R. E. Turner (2020) Continual learning with adaptive weights (claw). In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars (2018) Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154. Cited by: §2, §3.2.
  • R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, and L. Page-Caccia (2019) Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, pp. 11849–11860. Cited by: §1, §2.
  • R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: §3.2.
  • F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 233–248. Cited by: §2, §4.3, Table 1.
  • A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547. Cited by: §4.4.
  • A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019) Efficient lifelong learning with a-gem. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2019) Continual learning: a comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383. Cited by: §1, §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §4.1.
  • P. Dhar, R. V. Singh, K. Peng, Z. Wu, and R. Chellappa (2019) Learning without memorizing. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.
  • S. Ebrahimi, M. Elhoseiny, T. Darrell, and M. Rohrbach (2020) Uncertainty-guided continual learning with bayesian neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §1.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2014) An empirical investigation of catastrophic forgetting in gradient-based neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
  • P. M. Hall, A. D. Marshall, and R. R. Martin (1998) Incremental eigenanalysis for classification.. In BMVC, Vol. 98, pp. 286–295. Cited by: §4.4.
  • C. He, R. Wang, S. Shan, and X. Chen (2018) Exemplar-supported generative reproduction for class incremental learning.. In BMVC, pp. 98. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §4.2.
  • S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2, §4.3, Table 1.
  • C. Hung, C. Tu, C. Wu, C. Chen, Y. Chan, and C. Chen (2019) Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems, pp. 13647–13657. Cited by: §1, §2.
  • R. Kemker and C. Kanan (2018) Fearnet: brain-inspired model for incremental learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §3.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.2.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §2, §3.2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.
  • R. Kurle, B. Cseke, A. Klushyn, P. van der Smagt, and S. Günnemann (2020) Continual learning with bayesian neural networks for non-stationary data. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pp. 4652–4662. Cited by: §1, §2.
  • S. Lee, J. Ha, D. Zhang, and G. Kim (2020) A neural dirichlet process mixture model for task-free continual learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong (2019) Learn to grow: a continual structure learning framework for overcoming catastrophic forgetting. International Conference on Machine Learning (ICML). Cited by: §1, §2.
  • Y. Li, L. Zhao, K. Church, and M. Elhoseiny (2020) Compositional continual language learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §1, §2, §3.2.
  • X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. Lopez, and A. D. Bagdanov (2018) Rotate your networks: better weight consolidation and less catastrophic forgetting. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2262–2268. Cited by: §2.
  • Y. Liu, A. Liu, Y. Su, B. Schiele, and Q. Sun (2020) Mnemonics training: multi-class incremental learning without forgetting. arXiv preprint arXiv:2002.10211. Cited by: §2.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §1, §2.
  • A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 67–82. Cited by: §2.
  • A. Mallya and S. Lazebnik (2018) Packnet: adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka (2013) Distance-based image classification: generalizing to new classes at near-zero cost. IEEE transactions on pattern analysis and machine intelligence 35 (11), pp. 2624–2637. Cited by: §3.2.
  • C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2018) Variational continual learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §1, §2.
  • B. Pfülb and A. Gepperth (2019) A comprehensive, application-oriented study of catastrophic forgetting in dnns. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1.
  • J. Rajasegaran, M. Hayat, S. H. Khan, F. S. Khan, and L. Shao (2019) Random path selection for continual learning. In Advances in Neural Information Processing Systems, pp. 12648–12658. Cited by: §2, §3.1, §4.2.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2, §3.2, §3.2, §4.2, §4.3, Table 1.
  • H. Ritter, A. Botev, and D. Barber (2018) Online structured laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pp. 3738–3748. Cited by: §1, §2.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §2.
  • J. Serrà, D. Surís, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423. Cited by: §2.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999. Cited by: §1, §2.
  • X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, and Y. Gong (2020) Few-shot class-incremental learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §1, §1, §2, §3.1, §3.1, §4.2, §4.2, §4.3, Table 1.
  • M. K. Titsias, J. Schwarz, A. G. d. G. Matthews, R. Pascanu, and Y. W. Teh (2019) Functional regularisation for continual learning with gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §4.1.
  • J. von Oswald, C. Henning, J. Sacramento, and B. F. Grewe (2020) Continual learning with hypernetworks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. technical report cns-tr-2011-001. California Institute of Technology. Cited by: §4.1.
  • J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §3.2.
  • J. Weng, Y. Zhang, and W. Hwang (2003) Candid covariance-free incremental principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (8), pp. 1034–1040. Cited by: §4.4.
  • C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu, et al. (2018) Memory replay gans: learning to generate new categories without forgetting. In Advances In Neural Information Processing Systems, pp. 5962–5972. Cited by: §2.
  • Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu (2019) Large scale incremental learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.
  • J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2018) Lifelong learning with dynamically expandable networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • L. Yu, B. Twardowski, X. Liu, L. Herranz, K. Wang, Y. Cheng, S. Jui, and J. van de Weijer (2020) Semantic drift compensation for class-incremental learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §3.1, §3.1, §3.2, §3.2, §4.2, §4.2, §4.3, §4.4, Table 1, Table 2, Table 3, Table 4.
  • G. Zeng, Y. Chen, B. Cui, and S. Yu (2019) Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence 1 (8), pp. 364–372. Cited by: §2.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. International Conference on Machine Learning (ICML). Cited by: §2.
  • M. Zhai, L. Chen, F. Tung, J. He, M. Nawhal, and G. Mori (2019) Lifelong gan: continual learning for conditional image generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2759–2768. Cited by: §2.