FSCIL_Feature_Space_Composition
The implementation of paper "Few-Shot Class-Incremental Learning via Feature Space Composition"
As a challenging problem in machine learning, few-shot class-incremental learning asynchronously learns a sequence of tasks, acquiring new knowledge from new tasks (with limited new samples) while keeping the learned knowledge from previous tasks (with old samples discarded). In general, existing approaches resort to one unified feature space for balancing old-knowledge preserving and new-knowledge adaptation. With a limited embedding capacity of feature representation, the unified feature space often makes the learner suffer from semantic drift or overfitting as the number of tasks increases. With this motivation, we propose a novel few-shot class-incremental learning pipeline based on a composite representation space, which makes old-knowledge preserving and new-knowledge adaptation mutually compatible by feature space composition (enlarging the embedding capacity). The composite representation space is generated by integrating two space components (i.e., a stable base knowledge space and a dynamic lifelong-learning knowledge space) in terms of distance metric construction. With the composite feature space, our method performs remarkably well on the CUB200 and CIFAR100 datasets, outperforming the state-of-the-art algorithms by a large margin (e.g., 10.58% over the best competitor on CUB200 at the final session).
In recent years, continual learning De Lange et al. (2019); Parisi et al. (2019) (i.e., incremental learning or lifelong learning) has received considerable attention due to its capability of continual model learning, with a wide range of real-world applications. In principle, continual learning aims at enabling a learner to acquire new knowledge from new data while preserving the learned knowledge from previous data. Continual learning is usually conducted under the task-incremental or the class-incremental learning scenario. This paper focuses on the latter, which is more difficult since the task identity is unavailable at inference time. In practice, the knowledge from new tasks is often represented by a rather limited number of samples. To meet this practical demand, few-shot class-incremental learning Tao et al. (2020) has attracted a lot of attention; it typically involves learning from a base task (i.e., the first task with large-scale training samples) and from new tasks (with limited samples). In this learning scenario, the learner strives to dynamically maintain a unified discriminative feature space that represents the new knowledge while preserving the discriminative information from the base task. As a result, it is often caught in a dilemma between forgetting old knowledge (after a sequence of new tasks) and overfitting to new samples (caused by limited samples). In this paper, we therefore focus on building an effective representation space for few-shot class-incremental learning that balances old-knowledge preserving and new-knowledge adaptation.
In general, existing works address the class-incremental learning problem from the following three perspectives: 1) architectural strategies that add and remove components Yoon et al. (2018); Li et al. (2019); Hung et al. (2019); 2) rehearsal strategies that store past samples (or other old-task information) Chaudhry et al. (2019); Lopez-Paz and Ranzato (2017); Shin et al. (2017); Aljundi et al. (2019); and 3) regularization strategies that regularize the updating of the network parameters under the constraint of the learned knowledge Lee et al. (2017); Li and Hoiem (2017); Nguyen et al. (2018); Ritter et al. (2018). In principle, the aforementioned approaches adopt a common strategy: they maintain only one unified feature space, seeking to make old-knowledge preserving and new-knowledge adaptation mutually compatible. In the unified feature space (with a limited embedding capacity of feature representation), the learning directions for old-knowledge preserving and new-knowledge adaptation are usually inconsistent with each other (and sometimes even contradictory). With an asynchronous updating strategy (for old data and new data), the unified feature space for class-incremental learning tends to embed more information on new data as the number of tasks increases. Consequently, its embedding capacity for old data is greatly compressed. In the case of few-shot class-incremental learning, with only a limited number of new-task samples provided, the learned feature space is prone to suffer from semantic drift (i.e., catastrophic forgetting) or overfitting Tao et al. (2020); French (1999); Goodfellow et al. (2014); McCloskey and Cohen (1989); Pfülb and Gepperth (2019).
Motivated by the above observation, we propose a novel composite representation space for few-shot class-incremental learning. The proposed composite feature space is composed of two components: 1) a base knowledge space; and 2) a lifelong-learning knowledge space. The base knowledge space embeds the discriminative feature information of the base task. We keep it invariant, thereby providing stable and consistent features for all data samples throughout the entire learning process. In comparison, the lifelong-learning knowledge space corresponds to dynamically updated feature extractors for continually acquiring new knowledge with the arrival of new task data. Based on the above two space components, we build the composite space by a simple yet effective feature space concatenation. As a result, the composite feature space is capable of adaptively embedding the new knowledge through the lifelong-learning component while effectively maintaining a stable and informative feature representation through the base component. In this way, the embedding capacity of feature representation is significantly increased by the stable base feature extractor for consistent features as well as the lifelong-learning feature extractors for new-knowledge adaptation. Furthermore, a distance metric can be constructed for the composite feature space. From extensive experiments, we find that even an extremely simple uniform metric strategy results in a dramatic performance improvement.
The main contributions of this work are summarized in the following three aspects: 1) We formulate the problem of few-shot class-incremental learning from a novel perspective of feature space composition, and build a novel composite representation space that aims at balancing old-knowledge preserving and new-knowledge adaptation. 2) We present a simple yet effective feature space composition strategy that converts the space composition problem into distance metric construction for different space components with respect to different continual learning stages. 3) Extensive experiments over all the datasets demonstrate the effectiveness of our approach, which outperforms state-of-the-art approaches by a large margin. Our method yields clear gains on CUB200 and CIFAR100 (e.g., a 10.58% improvement over the best competitor on CUB200 at the final session).
Recently, there has been a large body of research in continual learning De Lange et al. (2019); Parisi et al. (2019); Li et al. (2020); Lee et al. (2020); Adel et al. (2020); von Oswald et al. (2020); Kurle et al.; Titsias et al. (2019); Ebrahimi et al. (2020); Zeng et al. (2019). These works can be categorized into three major families: 1) architectural strategies, 2) rehearsal strategies, and 3) regularization strategies. Architectural strategies Yoon et al. (2018); Li et al. (2019); Hung et al. (2019); Mallya and Lazebnik (2018); Mallya et al. (2018); Serrà et al. (2018); Rusu et al. (2016); Aljundi et al. (2017); Rajasegaran et al. (2019) keep the learned knowledge from previous tasks and acquire new knowledge from the current task by manipulating the network architecture, e.g., parameter masking and network pruning. Rehearsal strategies Chaudhry et al. (2019); Lopez-Paz and Ranzato (2017); Shin et al. (2017); Aljundi et al. (2019); Zhai et al. (2019); Wu et al. (2018); Rebuffi et al. (2017); Castro et al. (2018); He et al. (2018); Hou et al. (2019); Wu et al. (2019); Liu et al. (2020) replay old-task information when learning the new task; the past knowledge is memorized by storing old-task exemplars or by modeling old-task data distributions via generative models. Regularization strategies Lee et al. (2017); Li and Hoiem (2017); Nguyen et al. (2018); Ritter et al. (2018); Kirkpatrick et al. (2017); Zenke et al. (2017); Liu et al. (2018); Aljundi et al. (2018); Dhar et al. (2019) alleviate forgetting via regularization loss terms that encourage the updated network parameters to retain past knowledge. Continual learning is usually conducted under the task-incremental or the class-incremental learning scenario. This paper considers continual learning in the difficult class-incremental learning scenario, where the task identity is unavailable at inference time. Few-shot class-incremental learning Tao et al. (2020) is a more practical and challenging problem, where only a small number of samples are available for new tasks. The aforementioned approaches resort to one unified lifelong-learning feature space for balancing old-task knowledge preserving and new-task knowledge adaptation. With old samples discarded and only a few new samples available, the embedding capacity of one unified feature space is limited. To enlarge the embedding capacity, we propose a composite feature space for few-shot class-incremental learning.
In a class-incremental learning setup, a network learns several tasks continually and each task contains a batch of new classes Tao et al. (2020); Rajasegaran et al. (2019); Yu et al. (2020). The time interval from the arrival of the current task to that of the next task is considered as a class-incremental learning session Kemker and Kanan (2018). We suppose the training set for the $t$-th task is $D^{(t)}$, which arrives at the $t$-th session. $C^{(t)}$ is the set of classes of the $t$-th task, and we consider the generally studied case where there is no overlap between the classes of different tasks: $C^{(i)} \cap C^{(j)} = \emptyset$ for $i \neq j$. At the $t$-th session, we only have access to the data $D^{(t)}$, and the model, consisting of a feature extractor and a classifier, is trained on it.
In a few-shot class-incremental learning setup, each $D^{(t)}$ (for $t > 1$) only contains a few samples, except the training set of the first task (termed the base task), $D^{(1)}$, which has a large number of training samples and classes. For each new task, if the number of classes is $N$ and the number of training samples per class is $K$, the setting is denoted as $N$-way $K$-shot class-incremental learning Tao et al. (2020). Our goal is to obtain a model which performs well on all the seen tasks.
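To make the session protocol concrete, the following minimal Python sketch builds per-session class lists and samples $K$ training images per class for the incremental sessions. The class counts correspond to the CUB200 10-way 5-shot split used later in the experiments; the helper and variable names are illustrative, not part of the released code.

```python
import random

# Illustrative N-way K-shot session protocol: 100 base classes, then 10
# incremental sessions of 10 new classes each, with K=5 images per new class.
NUM_CLASSES, BASE_CLASSES, N_WAY, K_SHOT = 200, 100, 10, 5

classes = list(range(NUM_CLASSES))
base_classes = classes[:BASE_CLASSES]
new_classes = classes[BASE_CLASSES:]

# Session 0 is the base task; sessions 1..10 each introduce N_WAY disjoint classes.
sessions = [base_classes] + [
    new_classes[i:i + N_WAY] for i in range(0, len(new_classes), N_WAY)
]

def sample_few_shot(images_by_class, session_classes, k=K_SHOT):
    """Pick K training images per class for an incremental session."""
    return {c: random.sample(images_by_class[c], k) for c in session_classes}

print(len(sessions))  # 11 learning sessions in total, matching the CUB200 setup
```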
Utilizing an embedding network as the feature extractor for class-incremental learning has superior performance Yu et al. (2020) and is adopted in our work, which will be introduced in Section 3.2. Our feature space composition method for few-shot class-incremental learning will be elaborated in Section 3.3 and an overview of our method is shown in Figure 1.
Previous methods mainly formulate class-incremental learning with two kinds of image classification frameworks. Typically, the first one is composed of a feature extractor and a softmax classifier which is trained jointly with a cross-entropy loss. The other is composed of an embedding network for feature extraction and a nearest class mean (NCM) classifier
Rebuffi et al. (2017); Yu et al. (2020), where only the embedding network is learnable. As discussed in detail in Yu et al. (2020), using a softmax classifier for class-incremental learning is challenging and becomes harder for longer task sequences, and therefore we focus on the latter framework in this paper.

Embedding networks map a given sample to an informative representation space where distance represents the semantic discrepancy between samples Bromley et al. (1994). To this end, a metric learning loss is utilized to ensure that the distance between similar instances is small, and that between dissimilar instances is larger than a margin. In the context of class-incremental learning, the embedding space is learned by minimizing the following objective function:
$$\mathcal{L} = \mathcal{L}_m + \gamma \, \mathcal{L}_r \qquad (1)$$

where $\mathcal{L}_m$ is the metric loss term, $\mathcal{L}_r$ is the regularization term for retaining past knowledge Li and Hoiem (2017); Kirkpatrick et al. (2017); Aljundi et al. (2018); Yu et al. (2020), and $\gamma$ is a trade-off weight. The triplet loss Wang et al. (2014) is often adopted as the metric learning loss $\mathcal{L}_m$:
$$\mathcal{L}_m = \max\big(0,\; d_{ap} - d_{an} + m\big) \qquad (2)$$

where $d_{ap}$ and $d_{an}$ are the Euclidean distances between the embedding of the anchor and the embeddings of the positive instance and the negative instance, respectively, and $m$ denotes the margin. Note that we denote the embedding of a given sample $x$ as $f(x)$ in the rest of this paper.
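As an illustration, the metric loss in Equation (2) corresponds directly to PyTorch's built-in triplet margin loss. A minimal sketch follows; the toy embedding network, input size, and margin value are placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Triplet loss of Eq. (2): max(0, d_ap - d_an + m); p=2 gives Euclidean distance.
triplet_loss = nn.TripletMarginLoss(margin=0.3, p=2)  # margin m=0.3 is a placeholder

# Toy stand-in for the ResNet18 embedding network used in the paper.
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))

anchor = embed(torch.randn(8, 3, 32, 32))    # f(x_a)
positive = embed(torch.randn(8, 3, 32, 32))  # f(x_p), same class as the anchor
negative = embed(torch.randn(8, 3, 32, 32))  # f(x_n), different class

loss = triplet_loss(anchor, positive, negative)  # averaged over the batch
loss.backward()
```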
With a trained embedding network, an NCM classifier Rebuffi et al. (2017); Mensink et al. (2013) is utilized for classification, which is defined as:
$$y^{*} = \arg\min_{c} \; d\big(f(x), \mu_c\big) \qquad (3)$$

where $d$ is the distance metric (e.g., the Euclidean distance) and $\mu_c$ is the prototype of class $c$ Rebuffi et al. (2017), i.e., the class embedding mean, defined as:

$$\mu_c = \frac{1}{n_c} \sum_{i} \mathbb{1}[y_i = c] \, f(x_i) \qquad (4)$$

where $n_c$ is the number of training samples of class $c$ and $\mathbb{1}[\cdot]$ equals 1 if its argument is true and 0 otherwise.
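A minimal sketch of the NCM classifier in Equations (3)-(4) over precomputed embeddings is given below; tensor shapes and function names are illustrative.

```python
import torch

def class_prototypes(embeddings, labels, num_classes):
    """Eq. (4): per-class mean of the embeddings (prototypes mu_c).
    Assumes every class has at least one training sample."""
    protos = torch.zeros(num_classes, embeddings.size(1))
    for c in range(num_classes):
        protos[c] = embeddings[labels == c].mean(dim=0)
    return protos

def ncm_predict(query_embeddings, prototypes):
    """Eq. (3): assign each query to the class with the nearest prototype (Euclidean)."""
    dists = torch.cdist(query_embeddings, prototypes, p=2)  # [num_query, num_classes]
    return dists.argmin(dim=1)

# Toy usage with random stand-in embeddings f(x_i).
feats = torch.randn(100, 512)
labels = torch.randint(0, 10, (100,))
protos = class_prototypes(feats, labels, num_classes=10)
preds = ncm_predict(torch.randn(5, 512), protos)
```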
For few-shot class-incremental learning with an embedding network and an NCM classifier, the goal is to obtain a high-quality embedding space that represents samples of all the seen tasks well. Specifically, a high-quality feature space is expected to: 1) provide consistent features for all the seen samples and prevent overfitting while being updated asynchronously with old-task and new-task data, and 2) fit the new-task data well with only a few new samples. In this section, we give a detailed analysis of learning with a single lifelong-learning feature space, and then introduce an effective alternative based on space composition.
Since the data of old tasks is unavailable and only a few new samples are provided, the lifelong-learning feature space is prone to overfit the new tasks and easily becomes ill-posed. As shown in Figure 2-(a), with the arrival of new tasks, the samples of different classes, which are well separated at an early session, come to overlap with each other at later sessions. This indicates that the discriminability of the lifelong-learning feature space degrades after several sessions: samples that can be discriminated well at an early session are no longer distinguishable later on.
To this end, we propose to introduce a feature space that keeps the consistency and discriminability of features throughout the learning process. In the few-shot class-incremental learning setting, since the samples of the base task are adequate, such a space, referred to as the base feature space, can be preserved and used at each session to improve consistency.
However, using the base feature space alone for classification at all sessions also yields inferior results, as shown in Figure 3. A large performance gap on the data of the current task exists between the base feature space and the lifelong-learning one, indicating that the lifelong-learning space is important for fitting the new samples. Therefore, we propose to obtain a high-quality feature space at each session by composing the base feature space and the lifelong-learning feature space, as shown in Figure 1.
We use $g(\cdot, \cdot)$ to denote the composition function (e.g., for a naive implementation, a simple concatenation operation). The composite feature for a sample $x$ is denoted as $g\big(f_b(x), f_t(x)\big)$, where $f_b(x)$ denotes the embedding in the base feature space and $f_t(x)$ denotes the embedding in the lifelong-learning feature space (trained at the current $t$-th session). We conduct classification in this composite feature space, which is defined as:
$$y^{*} = \arg\min_{c} \; d_M\big(g(f_b(x), f_t(x)),\; \mu_c\big) \qquad (5)$$

$$d_M(u, v) = (u - v)^{\top} M \,(u - v) \qquad (6)$$

where $M$ is a metric matrix. For a simple formulation, $M$ can be a diagonal matrix, thereby indicating the importance of the two parts of features from the different feature spaces, and all features are weighted equally if $M$ is an identity matrix. After feature space composition, we obtain a high-quality feature space, as shown in Figure 2-(b).

We here briefly show that the two spaces are complementary, even with the simplest implementation strategy for space composition, where $M$ is constructed as $M = \mathrm{diag}\big(\lambda I,\; (1-\lambda) I\big)$ with a scalar $\lambda \in [0, 1]$ ($I$ is an identity matrix with dimension half of $M$'s). $\lambda = 1$ corresponds to using only the base feature space and $\lambda = 0$ to using only the lifelong-learning feature space at the current session. The change of accuracy with respect to $\lambda$ is shown in the corresponding ablation figure. Clearly, using only the base feature space achieves the lowest accuracy since it contains limited new-task knowledge. The performance of using the lifelong-learning feature space alone is also lower than that of the composite space, because of the limited new samples. The performance peaks at an intermediate value of $\lambda$, which indicates the complementarity. More sophisticated forms of the metric matrix $M$ can also be constructed (e.g., in a data-driven manner); the discussion and analysis of another space composition strategy is detailed in Section 4.4.
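Under the block-diagonal metric above, the composite-space NCM decision reduces to a weighted sum of per-space squared distances. The PyTorch sketch below illustrates this under that assumption; the function and variable names are ours, and `lam` plays the role of the scalar $\lambda$.

```python
import torch

def composite_distance(x, protos_base, protos_lll, f_base, f_lll, lam=0.5):
    """Sketch of Eqs. (5)-(6) with M = diag(lam*I, (1-lam)*I): compute the
    embeddings in both spaces and weight the squared Euclidean distances."""
    zb, zl = f_base(x), f_lll(x)                    # base / lifelong embeddings
    d_base = torch.cdist(zb, protos_base) ** 2      # distances in the base space
    d_lll = torch.cdist(zl, protos_lll) ** 2        # distances in the lifelong space
    return lam * d_base + (1.0 - lam) * d_lll       # lam=1: base only; lam=0: lifelong only

def composite_ncm_predict(x, protos_base, protos_lll, f_base, f_lll, lam=0.5):
    return composite_distance(x, protos_base, protos_lll, f_base, f_lll, lam).argmin(dim=1)

if __name__ == "__main__":
    f_base = torch.nn.Linear(3 * 32 * 32, 256)   # stand-in for the frozen base extractor
    f_lll = torch.nn.Linear(3 * 32 * 32, 256)    # stand-in for the current lifelong extractor
    x = torch.randn(4, 3 * 32 * 32)
    pb, pl = torch.randn(10, 256), torch.randn(10, 256)  # per-space class prototypes
    print(composite_ncm_predict(x, pb, pl, f_base, f_lll, lam=0.5))
```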
CIFAR100 Krizhevsky and Hinton (2009) is a labeled subset of the 80 million tiny images dataset for object recognition. It contains 60,000 32x32 RGB images in 100 classes, with 500 images per class for training and 100 images per class for testing. CUB200-2011 Wah et al. (2011) contains 5,994 images for training and 5,794 images for testing, covering 200 bird categories; it was originally designed for fine-grained image classification. MiniImageNet is a subset of ImageNet-1k Vinyals et al. (2016) commonly used in few-shot learning. It contains 100 classes, each with 500 images for training and 100 images for testing. ImageNet-Subset contains 100 randomly chosen classes from ImageNet Deng et al. (2009).

Table 1: Test accuracy (%) on CUB200-2011 across the 11 learning sessions; the last column is our relative improvement over each method at the final session.

| Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Our relative improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iCaRL Rebuffi et al. (2017) | 68.68 | 52.65 | 48.61 | 44.16 | 36.62 | 29.52 | 27.83 | 26.26 | 24.01 | 23.89 | 21.16 | +32.00 |
| EEIL Castro et al. (2018) | 68.68 | 53.63 | 47.91 | 44.20 | 36.30 | 27.46 | 25.93 | 24.70 | 23.95 | 24.13 | 22.11 | +31.05 |
| NCM Hou et al. (2019) | 68.68 | 57.12 | 44.21 | 28.78 | 26.71 | 25.66 | 24.62 | 21.52 | 20.12 | 20.06 | 19.87 | +33.29 |
| TOPIC Tao et al. (2020) | 68.68 | 62.49 | 54.81 | 49.99 | 45.25 | 41.40 | 38.35 | 35.36 | 32.22 | 28.31 | 26.28 | +26.88 |
| SDC Yu et al. (2020) | 72.29 | 68.22 | 61.94 | 61.32 | 59.83 | 57.30 | 55.48 | 54.20 | 49.99 | 48.85 | 42.58 | +10.58 |
| Ours | 72.29 | 68.59 | 64.73 | 63.45 | 61.03 | 59.13 | 58.24 | 56.89 | 55.45 | 54.54 | 53.16 | - |
We conduct experiments under the few-shot class-incremental setting. Following Tao et al. (2020), we evaluate our method on three datasets (CIFAR100, miniImageNet and CUB200-2011) with similar evaluation protocols. For the CIFAR100 and miniImageNet datasets, we choose 60 and 40 classes for the base task and the new tasks, respectively, and adopt the 5-way 5-shot setting, leading to 9 training sessions in total. For CUB200-2011, we adopt the 10-way 5-shot setting, splitting 100 classes across the new learning sessions while the base task contains the remaining 100 classes, leading to 11 training sessions in total. For all datasets, we construct the training set of each incremental session by randomly selecting the few-shot training samples per class from the original dataset, and the test set is the same as the original one. After training on a new batch of classes, we evaluate the trained model on the test samples of all seen classes. To show the generality of our method, we also conduct experiments in the common class-incremental learning scenario Rajasegaran et al. (2019); Rebuffi et al. (2017). Following Yu et al. (2020), we use a six-task scenario on CUB200-2011 and an eleven-task scenario on ImageNet-Subset.
We implement our models with PyTorch and use Adam Kingma and Ba (2014) for optimization. In the few-shot class-incremental learning setting, following Tao et al. (2020), we use ResNet18 He et al. (2016) as the backbone network. The base model is trained with the same strategy as in Tao et al. (2020) for a fair comparison. For each new task, we fine-tune the model with a small learning rate for a fixed number of epochs on CUB200-2011, CIFAR100 and miniImageNet. In the common class-incremental learning setting, we follow Yu et al. (2020) and use ResNet18 as the backbone for ImageNet-Subset and CUB200-2011, with the same training strategy as Yu et al. (2020). We use an embedding network as the feature extractor. For each new task on ImageNet-Subset, the model is likewise trained with a small learning rate for a fixed number of epochs. For data augmentation, we use standard random cropping and flipping as in Tao et al. (2020).

In this section, we compare our proposed method in the few-shot class-incremental learning scenario with existing state-of-the-art methods, including iCaRL Rebuffi et al. (2017), EEIL Castro et al. (2018), NCM Hou et al. (2019), TOPIC Tao et al. (2020) and SDC Yu et al. (2020). The first four methods use a softmax classifier and the last one uses a nearest class mean classifier, and therefore they achieve different accuracy on the first task (the base task). Table 1 reports the test accuracy of different methods on the CUB200-2011 dataset. Compared with the others, our method achieves a clear superiority (more than 10% over the strongest baseline at the final session). As shown in Figure 5, our method beats all other methods at each encountered learning session on the CIFAR100 and miniImageNet datasets. As continual learning proceeds, the superiority of our method becomes more significant.
In order to examine the effect of the number of training samples, we evaluate our method with different numbers of shots. As shown in Figure 6, the performance of our method increases as the number of training samples increases. It can also be noticed that the number of samples matters more at later training sessions, and the performance gap grows more rapidly when the number of samples gets larger: the gap between the settings with fewer shots appears mainly at later sessions, while the gap between the settings with more shots can already be noticed at the third session.
Table 2: Comparison of feature space composition strategies on CUB200-2011 (test accuracy, %); the last column is the relative improvement of Ours-pca over each method at the final session.

| Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Relative improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDC Yu et al. (2020) | 72.29 | 68.22 | 61.94 | 61.32 | 59.83 | 57.30 | 55.48 | 54.20 | 49.99 | 48.85 | 42.58 | +11.18 |
| Ours-simple | 72.29 | 68.59 | 64.73 | 63.45 | 61.03 | 59.13 | 58.24 | 56.89 | 55.45 | 54.54 | 53.16 | +0.60 |
| Ours-pca | 72.29 | 69.73 | 64.29 | 63.53 | 61.26 | 60.00 | 57.89 | 57.78 | 56.31 | 55.78 | 53.76 | - |
We evaluate our method with another, more sophisticated strategy for feature space composition, as shown in Table 2. While the simplest composition strategy is denoted by "Ours-simple", the sophisticated version, which first concatenates the base feature space and the lifelong-learning space and then reduces the dimension of the composite space by principal component analysis (PCA), is denoted by "Ours-pca". Specifically, we compute a PCA transformation matrix $P_b$ from the features of base-task samples (obtained by the base model). Another transformation matrix $P_t$ can be learned incrementally Hall et al. (1998); Weng et al. (2003) from the samples of subsequent tasks. We then reduce the dimension to, e.g., half of the original, so that $P_b$ and $P_t$ each project one half of the composite feature. For the concatenated features in the composite space, the overall transformation matrix can be constructed block-diagonally as $P = \mathrm{diag}(P_b, P_t)$. Given a composite feature $u$ and a prototype $\mu_c$ for class $c$, the dimension-reduced data are $P^{\top} u$ and $P^{\top} \mu_c$, respectively, and the classification can be formulated as:

$$y^{*} = \arg\min_{c} \; \big\lVert P^{\top} u - P^{\top} \mu_c \big\rVert_2^2 \qquad (7)$$
Compared with Equation (5), the matrix $P P^{\top}$ can be considered as a specific data-driven metric matrix $M$. Also, the transformation matrix $P$ can be viewed as a part of the composition function $g$, so this strategy is also well formulated in our framework. The results are shown in Table 2. In the few-shot scenario we keep the transformation learned on the base task and do not update it with the incremental data, because so few new-task samples cannot estimate a reasonable transformation.
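For completeness, a minimal sketch of this PCA-based composition is given below, assuming scikit-learn's PCA and random stand-in feature arrays; names and dimensions are illustrative, not the paper's exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base_feats = rng.normal(size=(1000, 512))  # stand-in for base-space embeddings
lll_feats = rng.normal(size=(1000, 512))   # stand-in for lifelong-space embeddings
d = base_feats.shape[1]

pca_base = PCA(n_components=d // 2).fit(base_feats)  # P_b, fitted on base-task features
pca_lll = PCA(n_components=d // 2).fit(lll_feats)    # P_t; kept fixed in the few-shot case

def reduce_composite(zb, zl):
    """Project each part of the composite feature, then concatenate (block-diagonal P)."""
    return np.concatenate([pca_base.transform(zb), pca_lll.transform(zl)], axis=1)

def pca_ncm_predict(zb, zl, proto_b, proto_l):
    q = reduce_composite(zb, zl)                             # reduced query features
    p = reduce_composite(proto_b, proto_l)                   # reduced class prototypes
    dists = ((q[:, None, :] - p[None, :, :]) ** 2).sum(-1)   # squared Euclidean, Eq. (7)
    return dists.argmin(axis=1)
```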
Table 3: Test accuracy (%) on CUB200-2011 in the common (six-task) class-incremental learning scenario; the last column is our relative improvement at the final session.

| Method | 1 | 2 | 3 | 4 | 5 | 6 | Our relative improvement |
|---|---|---|---|---|---|---|---|
| SDC Yu et al. (2020) | 85.09 | 76.75 | 71.40 | 67.45 | 66.36 | 63.72 | +1.07 |
| Ours | 85.09 | 77.65 | 72.19 | 67.79 | 67.41 | 64.79 | - |
Table 4: Test accuracy (%) on ImageNet-Subset in the common (eleven-task) class-incremental learning scenario; the last column is our relative improvement at the final session.

| Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Our relative improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDC Yu et al. (2020) | 78.72 | 73.35 | 67.87 | 66.62 | 63.77 | 61.15 | 57.43 | 56.49 | 52.91 | 50.08 | 49.66 | +3.64 |
| Ours | 78.72 | 76.84 | 72.80 | 68.55 | 66.40 | 64.11 | 61.43 | 59.18 | 57.18 | 54.15 | 53.30 | - |
The forgetting of previous tasks is estimated with the average forgetting measure Yu et al. (2020); Chaudhry et al. (2018). We illustrate the forgetting curves of our method across the learning sessions on CUB200-2011 in Figure 7. Our method achieves better performance than SDC, especially at the later learning sessions. These results indicate the stability of our composite feature space against the continuous arrival of new tasks.
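For reference, a common formulation of the average forgetting measure from Chaudhry et al. (2018), written here in our own notation (the exact symbols are an assumption; $a_{l,j}$ denotes the test accuracy on task $j$ evaluated after learning session $l$):

$$F_k = \frac{1}{k-1} \sum_{j=1}^{k-1} \max_{l \in \{1, \dots, k-1\}} \big( a_{l,j} - a_{k,j} \big)$$

i.e., the forgetting of a task is the gap between its best accuracy so far and its current accuracy, averaged over all previously learned tasks.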
Our method also performs well in the common class-incremental learning scenario, as shown in Table 3 and Table 4. Due to the adequate training samples in the subsequent tasks, the model is able to learn a better representation than in the few-shot setting, and therefore the improvement is not as large. Compared with SDC, the accuracy of our method is higher at all learning sessions on the CUB200-2011 dataset and drops more slowly as the number of learning sessions increases. On the ImageNet-Subset dataset, our method surpasses SDC by a large margin at all learning sessions.
In this paper, we have proposed a novel few-shot class-incremental learning scheme based on a composite representation space, which harmonizes old-knowledge preserving and new-knowledge adaptation by feature space composition. The composite feature space is built through the collaboration of a stable base knowledge feature space and a lifelong-learning feature space in terms of distance metric construction. Comprehensive experimental results demonstrate that the proposed approach outperforms other state-of-the-art approaches by a large margin.
Aljundi, R., et al. (2018). Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139-154.
Bromley, J., et al. (1994). Signature verification using a "Siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737-744.
Lee, S.-W., et al. (2017). Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pp. 4652-4662.
Wang, J., et al. (2014). Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).