Semantic-aware Knowledge Distillation for Few-Shot Class-Incremental Learning

03/06/2021 ∙ by Ali Cheraghian, et al. ∙ CSIRO, Monash University, Australian National University

Few-shot class-incremental learning (FSCIL) is the problem of learning new concepts gradually when only a few examples per concept are available to the learner. Because so few training examples are available, the techniques developed for standard incremental learning cannot be applied verbatim to FSCIL. In this work, we introduce a distillation algorithm to address FSCIL and propose to make use of semantic information during training. To this end, we use word embeddings as semantic information, which are cheap to obtain and which facilitate the distillation process. Furthermore, we propose a method based on an attention mechanism over multiple parallel embeddings of visual data to align visual and semantic vectors, which reduces issues related to catastrophic forgetting. Via experiments on the MiniImageNet, CUB200, and CIFAR100 datasets, we establish new state-of-the-art results by outperforming existing approaches.

1 Introduction

Figure 1: (a) Knowledge distillation as described in [8107520] does not work for few-shot class-incremental learning [Tao_2020_CVPR], since adding new tasks appends new trainable weights to the network in addition to the base weights. (b) The impact of using only a few instances of novel classes. As a few samples are not sufficient to learn the new parameters, the network becomes biased towards the base classes, overfits to the few examples of the novel classes, and does not separate them well from the base classes. (c) Our semantically guided network does not add new parameters when new classes are added incrementally. We only include the word vectors of the new tasks in addition to those of the base classes and keep fine-tuning the base network. (d) As a result, the knowledge distillation process helps the network remember the base training, generalize to novel classes, and find well-separated representations of the classes.

In a real-world scenario, we may not have access to information about all possible classes when the system is first trained. It is more realistic to assume that we will obtain class-specific data incrementally as time goes by. In such a scenario, we therefore require that our model can be adapted to newly available information without hampering its performance on what has been learnt so far. Although this is a natural task for human beings, it is difficult for an intelligent machine because of catastrophic forgetting [MCCLOSKEY1989109]: a trained model tends to forget old tasks when learning new information. There are three different streams of work in the literature addressing such an incremental or continual learning paradigm [DBLP:journals/corr/abs-1904-07734]. Firstly, task-incremental learning divides all classes into different tasks, where each task contains a few classes, and then learns each task individually. The task labels of the test instances are made available during testing, which means the model does not need to predict the correct label among all classes but only among the classes that are defined for a specific task. Secondly, domain-incremental learning does not reveal the task label at test time, but the model always solves the current task at hand without inferring the true class label. Thirdly, class-incremental learning predicts the class label among all classes at test time, as the outputs of all tasks are merged into one unified classifier without access to the task label. As this is the most realistic of the three settings, it is the one we are interested in in this paper. Furthermore, in many applications, new tasks (sets of novel classes) come with only a few examples per class, making class-incremental learning even more challenging. This setting is called few-shot class-incremental learning (FSCIL) [Tao_2020_CVPR]. The main challenges in FSCIL are catastrophic forgetting of already acquired knowledge and overfitting of the network to novel classes. Challenges of that nature are addressed by the work on knowledge distillation in [44873]. However, [Tao_2020_CVPR] showed that knowledge distillation is not the preferred approach for FSCIL due to class imbalance in the few-shot scenario and the performance trade-off between novel classes and base classes. In this paper, we propose an augmented knowledge distillation approach suitable for the case of few-shot incremental learning.

In order to apply knowledge distillation to novel tasks, the scores of the previously trained model are needed, as well as many instances of the new classes to be learned. Those new instances help to learn the new trainable weights that are added while learning novel tasks. For incremental learning with few-shot data, we can preserve the previous scores but cannot provide enough samples to learn the corresponding weights for the novel classes. For this reason, knowledge distillation [Tao_2020_CVPR] becomes a difficult problem in our case. To address this issue, we take advantage of semantic word vectors (word2vec [Mikolov_NIPS_2013] or GloVe [Jeffrey_Glove_2014]), which provide a semantic representation for each class as auxiliary knowledge. Inspired by the literature on zero-shot learning, given an image as input, we estimate the semantic word vector of the input instead of directly predicting its class label. Then, we measure the similarity of the predicted word vector with the word vectors of the set of possible class labels, followed by a softmax layer applied to the similarity values to get the final class scores. One key benefit of this approach is that adding new classes while training on novel tasks does not come with new weights to train, because the model predicts fixed-length word vectors as an intermediate representation. No matter how many classes are present during fine-tuning, the network continues with its previous task of estimating word vectors. In this new set-up, the distillation loss can easily accommodate new classes. One challenge of this approach is to obtain a good alignment of the visual and semantic word vectors of few-shot instances. To address this issue, we use automatically assigned superclass information to train multiple embedding modules in parallel after the backbone network. The set of superclasses is obtained from the semantic word vector representations of the base task and is then held fixed for the novel classes that follow. We learn an embedding for each superclass during training such that each embedding sees only the classes of its own superclass. Hence, given a novel class, there is a selection of embeddings, each of which may be more or less suited to it. We employ an attention module to merge the outputs of the multiple embeddings and a loss to train the alignment appropriately with few-shot instances. This helps the network avoid both overfitting to the few-shot instances and becoming biased towards the base classes. Fig. 1 describes the key differences between conventional works and our method. With our proposed approach, we beat the current state-of-the-art [Tao_2020_CVPR] on the MiniImageNet, CUB200, and CIFAR100 datasets thanks to the combined effect of the auxiliary semantic information from word vectors and knowledge distillation.
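The prediction step described above (estimate a word vector, compare it with the class word vectors, apply a softmax) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy three-dimensional word vectors, and the use of cosine similarity at this point are all hypothetical choices made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(predicted_vector, class_vectors):
    """Score every known class by similarity to the predicted word vector."""
    names = list(class_vectors)
    sims = [cosine(predicted_vector, class_vectors[n]) for n in names]
    return dict(zip(names, softmax(sims)))

# Toy word vectors (hypothetical values, for illustration only).
class_vectors = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
predicted = [0.9, 0.1, 0.0]  # stand-in for the network's output
probs = classify(predicted, class_vectors)

# Adding a novel class only adds a word vector; no new weights are created.
class_vectors["zebra"] = [0.0, 0.0, 1.0]
probs_after = classify(predicted, class_vectors)
```

In this sketch, growing the label set changes only the dictionary of class vectors, which mirrors the claim that new classes do not require new trainable weights.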

In summary, the contributions of this paper are: (1) A semantically-guided knowledge distillation approach for few-shot class-incremental learning using semantic word vectors, (2) A new visual-semantic alignment strategy for few-shot class-incremental learning using automatically assigned superclass annotations, (3) Extensive experiments validating the approach on MiniImageNet, CUB200, and CIFAR100 while achieving new state-of-the-art results.

2 Related work

Incremental learning: Incremental learning means learning from a sequence of data that appears over time. In the literature [DBLP:journals/corr/abs-1904-07734], incremental learning techniques are categorized into three groups: task-incremental learning [Chaudhry_2018_ECCV, riemer2018learning, v.2018variational], domain-incremental learning [pmlr-v70-zenke17a, NIPS2017_6892], and class-incremental learning [8100070, castro:hal-01849366, Hou_2019_CVPR, Wu_2019_CVPR]. In this paper, we are only concerned with the third category, class-incremental learning, as we consider a unified output where the task label is not available at test time. Rebuffi et al. [8100070] keep an “episodic memory” of the samples and incrementally adapt a nearest-neighbor classifier to the novel tasks. Castro et al. [castro:hal-01849366] proposed an end-to-end incremental learning method in which a knowledge distillation loss is used to keep information about previously seen classes and a classification loss is used to learn the new classes. Hou et al. [Hou_2019_CVPR] introduced an approach for incrementally learning a unified classifier that reduces the imbalance between old and new classes by using cosine similarity, which removes the bias in the classifier. Wu et al. [Wu_2019_CVPR] proposed a method for large-scale incremental learning in which the bias in the output of the model is corrected with the help of a linear model.

Few-shot incremental learning: There are not many works in the literature addressing the FSCIL setting. Tao et al. [Tao_2020_CVPR] proposed this setting for the first time. They utilize a neural gas (NG) network to learn and maintain the topology of the feature manifold produced by the various classes. More specifically, they alleviate the forgetting of old classes by stabilizing the topology of the NG, and they enhance representation learning for few-shot novel classes by expanding and adapting the NG to the novel training samples. There is another category in the literature, called dynamic few-shot learning (DFSL) [8578557, xtranet], which is similar to FSCIL. Some works call this setting incremental few-shot learning, but for clarity we call it DFSL in this work. The only difference is that DFSL has only two tasks, whereas FSCIL contains several tasks. The first task, which contains many training samples, is called the base task, and the second task, which contains only a few training samples, is called the novel task. Gidaris et al. [8578557] proposed an attention-based method that generates a classifier for the novel task from the classifier of the base task. Ren et al. [ren19incfewshot] note that recurrent back-propagation can back-propagate through the optimization process and helps the learning of novel tasks. Yoon et al. [xtranet] proposed a method that obtains a task-adaptive representation for novel tasks from the information provided by the base task via an attention module.

Knowledge distillation: Knowledge distillation is a well-known procedure that is employed in incremental learning to address catastrophic forgetting. The distillation loss was initially introduced to convey knowledge between separate neural networks [44873]. Later, Li [8107520] used a distillation loss to preserve the knowledge of the old tasks while learning the new ones with a classification loss. Shmelkov et al. [Shmelkov_2017_ICCV] proposed a method where the embedding and the classifier are trained together without the need to keep samples of the training data. Castro et al. [Castro_2018_ECCV] introduced an end-to-end method that consists of a classification loss for learning novel tasks and a distillation loss to retain information about the old tasks. Zhang et al. [9093365] introduced an approach that trains an individual network for the novel classes and then merges this new network with the network trained on the previous classes using a double distillation objective. Zhao et al. [9156766] first employed knowledge distillation to maintain the discrimination among old classes. Next, to further keep the balance between old and new classes, they introduced a method to refine the bias weights in the FC layer after the regular training. While it has been shown in [Tao_2020_CVPR] that regular knowledge distillation does not work well in the FSCIL setting, in this work we offer a method that enables the use of knowledge distillation for this purpose.

3 Method

3.1 Problem Formulation

Suppose there is a sequence of tasks, each with its own training data and its own set of classes. Additionally, a fixed-dimensional semantic class embedding (a word vector) is available during training for each class label in each task. Every training sample of a task therefore consists of an image, its associated ground-truth label, and the semantic representation associated with that label. There are many training samples in the first task, termed the base task. However, in the following tasks, termed the novel tasks, there are only a few training samples (5 shots per class) for each class. It is essential to mention that the class sets of all tasks are disjoint, i.e., no class appears in more than one task. The objective of our work is to incrementally train a model with a unified output, while only the training samples of the current task are available in the current session. At test time, we expect the model trained on the current task to predict the output for the current task and all previous tasks, where the task label is not available.

3.2 Knowledge Distillation

Figure 2: A simplified version of our proposed architecture for knowledge distillation. In this design, the input image x is forwarded through the backbone to extract a feature representation g. Then, the extracted feature g is mapped into the semantic domain by a mapping module to form an estimate of the semantic vector.

Knowledge distillation is a common approach [8100070, castro:hal-01849366, Hou_2019_CVPR] for addressing catastrophic forgetting in incremental learning. Even though it has shown promising results on incremental learning, it cannot be applied directly to few-shot class-incremental learning (FSCIL) because of the imbalanced data and the trade-off between base and novel classes [Tao_2020_CVPR]. In this paper (see Fig. 2), we show how to successfully take advantage of knowledge distillation in the FSCIL setting. To illustrate our proposed method for knowledge distillation, we use a simplified version of our proposed architecture. In this scheme, the input image x goes into the backbone, which is trained only on the first task, where many training samples are available, and is kept frozen on the other tasks. The output of the backbone is a feature representation g. Next, the mapping network projects the feature representation g into the semantic domain, where the projected feature is aligned with its associated semantic representation. Let us assume that we want to train the model on a new task with a given number of classes, after a given number of classes has been seen in the previous tasks. We also keep one representation (prototype) for each class of the former tasks in a small memory, where each prototype is the average of all training samples of that class projected into the semantic domain. The aim is to map the input image into the semantic domain by a function that comprises all trainable parameters of the backbone and the mapping module. We use the cosine similarity between the projected feature and the semantic representation as the class score. Given the output of the classifier before adding the novel task and the output of the classifier after adding the novel task, the distillation loss is defined as,

(1)

where the temperature-scaled output is,

(2)

Here, the temperature scalar is set to 2 for all experiments, and the loss is averaged over the number of samples in the current task and the memory.
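As a rough illustration of the quantities described above, the sketch below computes a temperature-softened cross-entropy between the scores of the model before and after a novel task is added. It assumes the standard distillation form from the knowledge distillation literature; the exact equations of this paper are not reproduced here, and the function names and toy scores are hypothetical.

```python
import math

def softened(scores, temperature=2.0):
    """Softmax of scores divided by a temperature (2 in the experiments)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(old_scores, new_scores, temperature=2.0):
    """Mean cross-entropy between softened old and new outputs.

    old_scores / new_scores: one list of class scores per sample, taken from
    the model before / after the novel task is added (assumed to be
    restricted to the previously seen classes).
    """
    total = 0.0
    for old, new in zip(old_scores, new_scores):
        p_old = softened(old, temperature)
        p_new = softened(new, temperature)
        total += -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))
    return total / len(old_scores)

# Toy similarity scores for two previously seen classes (hypothetical values).
unchanged = distillation_loss([[0.9, 0.1]], [[0.9, 0.1]])
drifted = distillation_loss([[0.9, 0.1]], [[0.1, 0.9]])
```

The loss is smallest when the new model reproduces the old model's softened outputs, so minimizing it discourages drift on the old classes.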

Remark.

In this paper, we show that knowledge distillation can be used for few-shot class incremental learning. To this end, we use a semantic word vector in our pipeline as additional information.

Additionally, we employ a cross-entropy loss as the classification loss,

(3)

Then, the total loss is defined as a weighted combination of the classification loss and the distillation loss, where two weighting coefficients control the effect of each loss in the final loss.

This approach does not add new parameters during the incremental learning stage. Instead, it only fine-tunes the existing parameters with the newly available data. This helps the distillation process and learning without forgetting. The approach also reduces the data imbalance problem of few-shot learning. The network may not have had the opportunity to see the novel objects; however, newly discovered novel objects may very well share semantic properties with objects it has already seen among the base classes. For example, if zebra is a novel class, the network has not had any opportunity to see zebras, but many typical zebra attributes such as ‘has tail’ and ‘has stripes’ have been seen by the system in other base-class animals. This helps the network understand zebras while reducing its dependence on data.

3.3 Multiple embeddings for few shot tasks

Figure 3: Our proposed architecture, designed specifically for tasks with only a few training samples. The input image x goes into the backbone, which generates a global representation g. Then, the feature g is fed into several embedding modules. The attention module merges their outputs to generate the final representation e.

The main challenge with tasks that contain only a few training samples is overfitting, as it is difficult to learn the distribution of a task sufficiently well from only a few training samples. To address this, in our proposed method (see Fig. 3), we generate multiple embeddings, each of which is designed specifically for a group of classes. We use the semantic word vectors to separate the classes into several groups. We design these groups based on the classes of the first task, because the first task contains more classes with many training samples. After that, we train each embedding with a cross-entropy loss based on the superclass labels obtained in the previous stage, using the training data of the first task, where many training samples are available for each class. As a result, each embedding observes only part of the entire set of classes and becomes an expert on a particular superclass. For the novel tasks, we assign a superclass label to each class based on the clusters obtained in the previous stage.

Superclass: The number of embeddings is defined by the superclass knowledge obtained from the semantic word vector space. In the semantic space, there is a class embedding s for each class. We apply k-means clustering to the semantic representations of the classes of the first, large task. In this way, similar classes fall into the same category (see Fig. 4). After applying k-means clustering, we assign a superclass label to each class. For the other tasks, which have only a few training samples, we use the cluster centers obtained for the first task to assign superclass labels to their classes. To assign a superclass label to a novel class, we simply choose the cluster center with the minimum Euclidean distance to the semantic vector of the novel class.
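The superclass construction can be sketched as follows: cluster the base-class word vectors with k-means, then give each novel class the label of its nearest cluster center. This is a minimal illustration rather than the authors' code; the naive initialization (the first k vectors), the fixed iteration count, and the two-dimensional toy vectors are assumptions made to keep the example short and deterministic.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(vector, centers):
    """Index of the cluster center closest to the vector."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(vector, centers[i]))

def kmeans(vectors, k, iterations=20):
    """Plain k-means; centers start at the first k vectors (an assumption)."""
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centers)].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers

# Toy base-class word vectors (hypothetical values).
base = {"cat": [1.0, 0.1], "car": [0.1, 1.0],
        "dog": [0.9, 0.2], "truck": [0.2, 0.9]}
centers = kmeans(list(base.values()), k=2)
superclass = {name: nearest(vec, centers) for name, vec in base.items()}

# A novel class reuses the base cluster centers; no re-clustering is done.
zebra_superclass = nearest([0.8, 0.3], centers)
```

Keeping the centers fixed after the base task means the number of embedding modules never changes as novel classes arrive.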

Figure 4: In order to construct superclasses, we use k-means clustering on the semantic word vectors of the classes in the first task.

Figure 5: The proposed architecture. In this design, the image x is forwarded to the backbone to obtain the global feature representation g. The feature g is fed into several embedding modules to get a representation based on the superclass information obtained from the semantic word vector space. After this stage, the feature vector f is obtained by concatenating the global and embedding representations. In the next stage, f is projected via a mapping module from the visual space into the semantic space to align the visual information with its associated semantic representation.

Attention: The embedding representation is the weighted average of the outputs of all private (embedding) modules, where the weights are produced by a neural network that acts as an attention module. The weights must sum to 1 so that the representation stays invariant to the number of private modules. The final embedding representation is then,

(4)

where the attention weights are,

(5)

Here, the parameters of the attention network are trainable.
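The merging step can be illustrated with a short sketch. It assumes that the attention network produces one raw score per embedding module and that a softmax normalizes those scores so that the weights sum to 1; the raw scores are passed in directly here, and all names and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def merge_embeddings(embeddings, attention_scores):
    """Weighted average of the embedding-module outputs.

    embeddings: one output vector per embedding module.
    attention_scores: one raw score per module (in the paper these would
    come from a trainable attention network; here they are given).
    """
    weights = softmax(attention_scores)
    dim = len(embeddings[0])
    merged = [sum(w * e[d] for w, e in zip(weights, embeddings))
              for d in range(dim)]
    return merged, weights

# Two toy embedding outputs with equal attention scores.
merged, weights = merge_embeddings([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With equal scores the result reduces to a plain average, and a larger score for one module pulls the merged vector towards that module's output.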

function TRAIN
     Set the hyperparameters
     Train the backbone on the first task using a cross-entropy loss
     Apply k-means clustering on the base semantic vectors and assign a superclass label to each base class
     Train the embedding modules (Fig. 3) on the base task using superclass labels as cluster identity with a cross-entropy loss
     (Train the final architecture (Fig. 5) on the base task)
     for each training iteration on the base task do
         Calculate the classification loss using Eq 3
         Calculate the attention loss using Eq 6
         Combine the losses into the total loss
         Backpropagate and update the trainable parameters
     end for
     Update the memory with UPDATEMEMORY
     for each novel task, in order, do
         Assign superclass labels to the classes of the task using the cluster centers obtained for the base task
         for each training iteration on the task do
              Calculate the classification loss using Eq 3
              Calculate the distillation loss using Eq 1
              Calculate the attention loss using Eq 6
              Combine the losses into the total loss using Eq 7
              Backpropagate and update the trainable parameters
         end for
         Update the memory with UPDATEMEMORY
     end for
end function
function UPDATEMEMORY
     for each class in the task do
         Calculate a prototype for the class by averaging all of its training samples when projected into the semantic domain
         Add the prototype to the memory
     end for
     return the memory
end function
Algorithm 1 Training procedure of the proposed method
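The memory update at the end of Algorithm 1 can be sketched as below: one prototype per class, obtained by averaging that class's samples after projection into the semantic domain. This is an assumed, simplified form; the function signature, the dictionary-based memory, and the toy projected vectors are hypothetical and stand in for the real model outputs.

```python
def update_memory(memory, projected_samples, labels):
    """Store one prototype per class: the mean of its projected samples.

    memory: dict mapping class label -> prototype vector (updated in place).
    projected_samples: vectors already mapped into the semantic domain.
    labels: class label of each projected sample.
    """
    by_class = {}
    for vector, label in zip(projected_samples, labels):
        by_class.setdefault(label, []).append(vector)
    for label, vectors in by_class.items():
        memory[label] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return memory

# Toy projected samples for two classes (hypothetical values).
memory = update_memory({}, [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]],
                       ["a", "a", "b"])
# memory["a"] is [2.0, 1.0] and memory["b"] is [0.0, 4.0]
```

Because only one vector per class is kept, the memory grows with the number of classes rather than with the number of training samples.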

Training: We train the attention module so that the final representation e becomes similar to the output of the embedding module that corresponds to the associated superclass label. To this end, we use the following loss function,

(6)

where the loss is averaged over the number of samples in the task. This approach helps prevent the network from overfitting to the few novel-class samples. Multiple embeddings, each specialized on related classes belonging to the same superclass, describe the novel instances. Combining the multiple embedding features with the global feature enables strong generalization when classifying both base and novel classes.

3.4 Model Overview

The final proposed architecture is shown in Fig. 5. The input image x goes into the backbone, which is pretrained on the first task with a cross-entropy loss, to generate the global feature representation g. When training on the other tasks, the backbone network is kept frozen to prevent overfitting to classes with only a few training samples. Then, the extracted feature g is fed into the embedding modules, which are trained on the first task based on the superclass information and are updated for the novel tasks. The outputs of all embedding modules are fused based on the weights generated by an attention module. Then, the global feature g is concatenated with the feature e generated from the embedding modules to form the feature f. In the next stage, the feature f is projected from the visual domain to the semantic domain by the mapping module, where it is aligned with its associated semantic representation. In order to train our proposed architecture, we use the following loss function,

(7)

where three weighting coefficients control the effect of each term in the final loss function. The pseudo code of the training procedure is shown in Algorithm 1.

4 Experiments

This section contains two parts. In the first part, we evaluate our method on FSCIL [Tao_2020_CVPR], and we conduct a set of ablation studies to investigate the recommended approach. Next, we investigate the dynamic few-shot learning [8578557] (DFSL) setting to demonstrate the capability of our proposed method in a different setting.

4.1 Experiments on FSCIL

Method                       Session
                             1      2      3      4      5      6      7      8      9      10     11
iCaRL [8100070]              68.68  52.65  48.61  44.16  36.62  29.52  27.83  26.26  24.01  23.89  21.16
EEIL [castro:hal-01849366]   68.68  53.63  47.91  44.20  36.30  27.46  25.93  24.70  23.95  24.13  22.11
NCM [Hou_2019_CVPR]          68.68  57.12  44.21  28.78  26.71  25.66  24.62  21.52  20.12  20.06  19.87
AL-MML [Tao_2020_CVPR]       68.68  62.49  54.81  49.99  45.25  41.40  38.35  35.36  32.22  28.31  26.28
Ours                         68.23  60.45  55.70  50.45  45.72  42.90  40.89  38.77  36.51  34.87  32.96
Table 1: CUB200 results with ResNet18 based on the 10-way 5-shot setting.

Datasets: We evaluate our proposed method on three well-known datasets, MiniImageNet [NIPS2016_6385], CUB200 [WahCUB_200_2011], and CIFAR100 [CIFAR-100]. MiniImageNet contains 100 classes, where each class includes 500 training samples and 100 testing samples. The size of each image is 84×84. CUB200 contains 200 fine-grained classes, separated into 6000 training images and 6000 testing images. The image size in this dataset is 224×224. CIFAR100 includes 100 classes, where each class has 600 images, separated into 500 training images and 100 test images. Each image has a size of 32×32. In this paper, we follow the setting proposed by [Tao_2020_CVPR]. In this setting, for MiniImageNet and CIFAR100, 60 and 40 classes are selected as the base and novel classes, respectively. For the novel classes, a 5-way 5-shot setting is considered. There are nine sessions for the MiniImageNet and CIFAR100 datasets (1 base session + 8 novel sessions). For the CUB200 dataset, a 10-way 5-shot setting is considered, where 100 classes are selected as base classes and the remaining 100 classes are split into 10 sessions.
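The session layout follows from simple arithmetic: with 100 classes, 8 novel sessions, and 5 classes per novel session, 100 − 8 × 5 = 60 classes remain for the base session. The sketch below builds such a split; the function name and the use of consecutive class indices are hypothetical, and the actual class ordering used in the benchmark is not specified here.

```python
def build_sessions(num_classes, num_base, ways):
    """Split class indices into one base session and N-way novel sessions."""
    class_ids = list(range(num_classes))
    sessions = [class_ids[:num_base]]
    for start in range(num_base, num_classes, ways):
        sessions.append(class_ids[start:start + ways])
    return sessions

# MiniImageNet / CIFAR100: 60 base classes, then 5-way novel sessions.
mini_sessions = build_sessions(num_classes=100, num_base=60, ways=5)

# CUB200: 100 base classes, then 10-way novel sessions.
cub_sessions = build_sessions(num_classes=200, num_base=100, ways=10)
```

At test time in session t, the model is evaluated on the union of the classes from sessions 1 to t, which is why accuracy is reported per session.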

Semantic Features: We utilize unsupervised word vectors accumulated from an unannotated text corpus as a class semantic embedding. For MiniImageNet, CUB200, and CIFAR100, we employed 1000, 400, and 300 dimensional word2vec [Mikolov_NIPS_2013], respectively. In the ablation study section, we also use the 300 dimensional GloVe [Jeffrey_Glove_2014] for the CUB200 dataset.

Implementation details: We employed ResNet18 [He_2016_CVPR] as the backbone, where the features of the input image are taken from the final pooling layer with 512 dimensions. The backbone is trained on the first (base) task and kept fixed for the following tasks. For each embedding module, we used a fully connected layer with 512 dimensions. Moreover, we used a few fully connected layers for the attention module. Finally, for the mapping module, we used three fully connected layers whose numbers of hidden units are 512, 728, and the dimension of the semantic word vector, where all layers have a ReLU function. In all experiments, we use the Adam optimizer [Article40], with the learning rate and batch size set to 0.001 and 128, respectively. The number of embedding modules for MiniImageNet, CUB200, and CIFAR100 is 3, 5, and 3, respectively. The three weighting coefficients of the loss in Eq. 7 are set to 0.7, 1.1, and 0.6, respectively, for all datasets. Since we use semantic word vectors in our pipeline, we get a slightly different result on the first task in comparison to the setting proposed by [Tao_2020_CVPR].

4.1.1 Results

In this part, we compare our proposed method with the state-of-the-art [8100070, castro:hal-01849366, Hou_2019_CVPR, Tao_2020_CVPR] in the 5/10-way 5-shot setting. Figure 6 presents our results on MiniImageNet and CIFAR100 using the 5-way 5-shot setting. Moreover, we show the results on the CUB200 dataset in Table 1. Overall, on these three datasets, our method beats all state-of-the-art methods. As additional new tasks arrive, the advantage of our approach over the other methods grows. More specifically, on MiniImageNet, we obtain 39.04% accuracy in the last session, while the second-best method (TOPIC) achieves 24.42%, which demonstrates that our method surpasses the state-of-the-art by a large margin (more than 14 percentage points). On CIFAR100, our method reaches an absolute accuracy of 34.80%, while the accuracy of the second-best method (TOPIC) is 29.37%. Also, on CUB200, our approach achieves 32.96% in the last session, which is superior to the other approaches.

Figure 6: Results on MiniImageNet and CIFAR100 based on the 5-way 5-shot FSCIL setting.

4.1.2 Ablation study

The impact of the loss functions and the embedding modules: In this part, we report the individual effects of the distillation loss and the attention loss in the total loss. As can be seen in Fig. 7, where only the accuracy of the last session is shown, the distillation loss is more effective than the attention loss: the distillation loss helps the model remember the previously seen tasks, while the attention loss helps the model generate a richer feature representation for the novel tasks, which have only a few training samples.

We also evaluate the effect of multiple embedding modules in our proposed algorithm. To understand the influence of these in our design, we utilise a baseline architecture which does not include the multiple embedding module and has only the backbone . As presented in Fig. 7, our proposed method exceeds the baseline, which indicates the value of the embedding module inside our architecture.

Figure 7: (Left) The influence of the distillation and attention losses, and (Right) the impact of using multiple embeddings.

Figure 8: The results of our method on the base task (left), the novel tasks (middle), and their harmonic mean (right) in different incremental sessions.

Impact of different backbones and semantic information: In this ablation study, we evaluate the influence of an alternative semantic word vector, GloVe. We also conduct this analysis using two backbones, ResNet18 and ResNet101, to understand the effect of different backbones. The accuracy of our approach is evaluated on test samples of the base task and the novel tasks. The base accuracy is the performance of our model on the first (base) task, and the novel accuracy is the performance on the test samples of the current task and all previous novel tasks. To evaluate the contribution of the base and novel instances to the final accuracy, we also report the Harmonic Mean (HM) of the base and novel accuracies, computed as,

HM = (2 × base accuracy × novel accuracy) / (base accuracy + novel accuracy)    (8)

The base accuracy, the novel accuracy, and HM are shown in Fig. 8. As can be seen, the combination of GloVe and ResNet101 forgets less of the base classes and also learns the novel classes better. As a result, it provides a greater HM value. This investigation reveals that the choice of semantic vector matters in our suggested pipeline. Additionally, a deeper backbone (ResNet101) is useful in FSCIL, as it generates more valuable feature representations. We present our results from session 2 to session 11 because there is no novel task in session 1.
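A short worked example shows why the harmonic mean is reported alongside the two accuracies: it rewards balance between base and novel performance. The function name and the accuracy values below are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

# Two hypothetical models with the same arithmetic mean (40%).
imbalanced = harmonic_mean(60.0, 20.0)  # 2 * 60 * 20 / 80 = 30.0
balanced = harmonic_mean(40.0, 40.0)    # 2 * 40 * 40 / 80 = 40.0
```

A model that keeps base accuracy high by ignoring the novel classes is therefore penalized, even though its plain average looks the same.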

Method                 1-shot accuracy   1-shot gap   5-shot accuracy   5-shot gap
Imprint [8578708]      41.34 ± 0.54%     -23.79%      46.34 ± 0.54%     -25.25%
LwoF [8578557]         49.65 ± 0.64%     -14.47%      59.66 ± 0.55%     -12.35%
AA [ren19incfewshot]   54.95 ± 0.30%     -11.84%      63.04 ± 0.30%     -10.66%
XtarNet [xtranet]      55.28 ± 0.33%     -13.13%      66.86 ± 0.31%     -10.34%
Ours                   74.00 ± 0.15%     -10.43%      84.03 ± 0.25%     -7.11%
Table 2: miniImageNet 64+5-way results

4.2 Experiments on DFSL

To further evaluate our proposed method, we apply our design to the DFSL setting. As discussed in the related work section, DFSL is similar to FSCIL; the only difference is that DFSL has only two tasks, whereas FSCIL has a longer sequence of tasks. In this section, we use the setting introduced by [8578557]. In this experiment, we use miniImageNet, which is split into two tasks, a base task and a novel task. The base task consists of 64 classes, while the novel task has five classes that are randomly selected in each episode from 20 classes. The classes in the base and novel tasks are disjoint. In each episode, the training (support) set is built by selecting 1 or 5 examples per novel class, representing a 1-shot or 5-shot scenario. The test (query) set consists of examples of the base and novel tasks. In this experiment, we use ResNet12 as adopted in [mishra2018a] as the backbone. Forgetting in the DFSL setting is calculated as the gap between joint and individual performance for the base and novel classes. The individual performance for the base (novel) task is calculated when only the base (novel) classifier is used. One gap is computed for the base task and one for the novel task, and the average of these two gaps, reported next to each accuracy in Table 2, indicates the amount of forgetting.

We compare our method with state-of-the-art methods in Table 2. As can be seen, our method outperforms all other methods on joint accuracy and forgetting by a large margin on both the 1-shot and 5-shot settings.
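The forgetting measure used in this comparison can be sketched as the average of two gaps, one for the base classes and one for the novel classes. The sign convention (joint minus individual, so that a drop shows up as a negative number, as in Table 2), the function name, and the accuracy values are assumptions made for illustration.

```python
def forgetting_gap(joint_base, individual_base, joint_novel, individual_novel):
    """Average gap between joint and individual accuracy (in percent).

    Joint accuracy is measured over base and novel classes together;
    individual accuracy uses only the base (or only the novel) classifier.
    """
    gap_base = joint_base - individual_base
    gap_novel = joint_novel - individual_novel
    return (gap_base + gap_novel) / 2

# Hypothetical accuracies, not taken from the paper.
gap = forgetting_gap(joint_base=60.0, individual_base=70.0,
                     joint_novel=50.0, individual_novel=56.0)
# gap_base = -10.0, gap_novel = -6.0, so the average is -8.0
```

Under this convention, a value closer to zero means that evaluating base and novel classes jointly costs less accuracy.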

5 Conclusion

We proposed a semantic-aware knowledge distillation method for few-shot class incremental learning (FSCIL). Due to a limited amount of training data for the novel classes, the knowledge distillation technique as previously used did not work well in this problem domain. In this paper, using auxiliary information from class semantics (word vectors), we propose a new FSCIL method where knowledge distillation can indeed perform learning without forgetting. Moreover, we offer an attention mechanism based on multiple embedding representations of visual data to describe the novel classes that also demonstrates better generalization. Three well-known datasets, MiniImageNet, CUB200, and CIFAR100, are used to show that class semantics can be a useful source of information for knowledge distillation. We outperform state-of-the-art methods of FSCIL by a large margin.

References