In this paper we are interested in two practically important learning scenarios, namely generalized few-shot learning (GFSL) [gidaris2018dynamic, ren2019incremental, qi2018imprinted, shi2019relational] and incremental few-shot learning (IFSL) [tao2020few, chen2021incremental]. In both scenarios it is possible to learn a performant classifier for a set of base classes for which many training samples exist. For the novel classes, however, only a few training samples are available, which makes novel class learning challenging. Additionally, in both generalized and incremental few-shot learning it is important to prevent catastrophic forgetting of the base classes during novel class learning. Last but not least, classifier calibration across classes has to be addressed due to the imbalance in the amount of training samples. While previous work focuses on addressing a subset of these challenges [gidaris2018dynamic, ren2019incremental, kumar2019protogan, tao2020few, chen2021incremental], in this paper we aim to address all three.
To this end, we propose a three-phase framework to explicitly address these challenges. The first phase is devoted to general representation learning as in previous work [gidaris2018dynamic, ren2019incremental, wang2018low]. Here, we utilize a large base dataset for pretraining and obtain high performance for base classification. In the second phase we concentrate on learning the novel classes. In contrast to prior work, we pay special attention to training a calibrated classifier for the novel classes while simultaneously preventing catastrophic forgetting of the base classes. More specifically, we propose a base-normalized cross entropy that amplifies the softmax output of the novel classes to overcome the bias towards the base classes, and simultaneously force the model to preserve previous knowledge via explicit weight constraints. In the third phase we address the problem of calibrating the overall model across base and novel classes. In Fig. 1 we show how the model develops during all three phases by plotting the test accuracy of base and novel classes in the separate and joint spaces. The contributions of this work are as follows:
(1) A framework that explicitly addresses the problems of generalized few-shot learning by balancing novel class learning, base class forgetting, and calibration across both in three phases;
(2) A base-normalized cross entropy to overcome the bias of the model towards the base classes, in combination with weight constraints to mitigate the forgetting problem in the second phase;
(3) An extensive study evaluating the proposed framework on images and videos, showing state-of-the-art results for generalized and incremental few-shot learning.
2 Related Work
Generalized Few-Shot Learning (GFSL) Many approaches for few-shot learning (FSL) rely on a meta-learning paradigm to quickly adapt a method to new underrepresented samples [snell2017prototypical, vinyals2016matching, finn2017maml, sung2018learning, mishra2018snail]. Such models can be hard to extend to a generalized setup since they do not explicitly learn classification of base classes and do not consider extreme imbalance. Some recent works on few-shot learning additionally examine a generalized setup [ye2020set2set, luo2019global], showing a significant drop in performance in the joint space.
Early works [hariharan2017low, wang2018low] on GFSL propose to hallucinate extra samples based on intra-class variations of base classes. Later, Gidaris et al. [gidaris2018dynamic] propose an attention-based weight generator for few-shot classes and promote cosine normalization to unify recognition of base and novel classes. Concurrently, Qi et al. [qi2018imprinted] propose weight imprinting, which is also based on the idea of cosine normalization. The technique is widely used to avoid the explicit calibration of magnitudes [hou2019learning, ren2019incremental, shi2019relational, mittal2021essentials, gidaris2019generating, rebuffi2017icarl]. In our work we exploit the bias in base classification weights to give the model impetus to learn novel classes in the joint space. Ren et al. [ren2019incremental] propose to use an attention attractor network [zemel2001localist] to regularize the learning of novel classes with respect to accumulated base attractor vectors. The above works are based on meta-learning frameworks and can consequently be dependent on the number of novel classes. In contrast, Shi et al. [shi2019relational] propose a graph-based framework that models the relationship between all classes and can be trained end-to-end. GFSL receives attention in the video domain as well. Previous works propose to enlarge the training data for the few novel classes by means of a generative adversarial network [kumar2019protogan] or to retrieve similar data from a large-scale external dataset [xian2020generalized].
Incremental Few-Shot Learning (IFSL) Class incremental learning (CIL) [li2017learningwf, liu2020mnemonics, castro2018end, hou2019learning, rebuffi2017icarl] addresses the problem of a continuously growing classification space, where each set of novel classes extends the previously observed classes. The major problem of incremental learning, catastrophic forgetting [mccloskey1989catastrophic], is caused by limiting access to already seen data, while each novel class is provided with a large training set. Due to the ample number of training samples, on the one hand, several works [mittal2021essentials, ahn2020ssil, zhao2020continual] propose to separate the softmax classification into several subspaces to balance learning. On the other hand, some works address it with bias removal techniques [wu2019large, belouadah2019il2m, Belouadah_2020_scail], by training a small set of additional bias-correction parameters [wu2019large], using a dual memory [belouadah2019il2m], or with post-processing [Belouadah_2020_scail]. In contrast, we benefit from the joint space to overcome the deficiency of data and to learn a stronger classifier for novel classes without introducing any additional parameters.
A new task that combines FSL and CIL is incremental few-shot learning (IFSL) [chen2021incremental, tao2020few, ayub2020cognitively]. GFSL can be seen as a subproblem of IFSL with only one incremental learning set of novel classes. Tao et al. [tao2020few] propose to preserve the topology of the feature manifold via a neural gas network. Chen et al. [chen2021incremental] use a non-parametric method based on learning vector quantization in a deep embedding space to avoid imbalanced weights in a parametric classifier. In this work we approach the problem from the perspective of classic parametric classification models [he2016deep, krizhevsky2012imagenet] that have recently been shown to be effective in FSL [tian2020rethink] and CIL [prabhu2020gdumb].
3 Method
In the following we introduce the general setting and the motivation of our method based on four separate performance measures introduced below. We then discuss the second phase training as the most crucial part for achieving strong performance on both novel and base classes, followed by the third phase training. Finally, we discuss how to generalize our method to incremental few-shot learning.
Setting: In both generalized and incremental few-shot learning we have a set of base classes with many training samples. Additionally, we have one or several sets of novel classes with only few training samples. In generalized few-shot learning we have just one set of novel classes, while in incremental few-shot learning we have a sequence of such sets. In the following, to keep the notation simple, we discuss our approach based on a single set of novel classes, whereas in incremental learning the approach is applied to the sequence of such sets of novel classes. Note that incremental few-shot learning in our work is the same as the few-shot class incremental learning in [chen2021incremental, tao2020few].
Performance measures and approach: In few-shot learning we are interested in achieving the best performance for both base and novel classes simultaneously. Therefore, in order to monitor performance for both sets of classes, we consider four different measures (see Fig. 3). First, we denote the classification performance of base samples in the space of only base classes by $a^B_B$, and the classification performance of novel samples in the space of only novel classes by $a^N_N$. More importantly, we are interested in the performance in the joint (J) space, where both base and novel classes are accounted for simultaneously. For this we consider the performance of base and novel samples separately in the joint space, that is $a^B_J$ and $a^N_J$. We prefer these two measures over a single joint performance in the joint space due to the imbalance in the number of classes [hou2019learning, belouadah2019il2m, wu2019large] (e.g. 64 base vs. 5 novel).
These measures are directly related to the three challenges mentioned above: novel class learning is measured by $a^N_N$ and $a^N_J$, catastrophic forgetting by $a^B_B$ and $a^B_J$, while calibration is related to $a^N_J$ and $a^B_J$. While ideally we would like to address all measures simultaneously, we found this to be difficult in practice. Instead, during the first phase of our framework we optimize for $a^B_B$. In the second phase, during novel class learning, we aim for a calibrated classifier for the novel classes and thus optimize for both $a^N_N$ and $a^N_J$, instead of only $a^N_N$ as in standard few-shot learning. Simultaneously, we aim to prevent catastrophic forgetting by an additional weight regularization that keeps $a^B_B$ high (see Fig. 1). In the third and last phase we aim to calibrate across novel and base classes and thus optimize for both $a^N_J$ and $a^B_J$.
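The difference between the separate-space and joint-space measures reduces to which logits are allowed to compete at prediction time. A minimal sketch (function names and the toy logits below are our own illustration, not part of the paper):

```python
import numpy as np

def subspace_accuracy(logits, labels, class_idx):
    # Restrict the argmax to a subset of columns, e.g. only the novel
    # classes, to obtain the separate-space accuracy.
    sub = logits[:, class_idx]
    pred = np.asarray(class_idx)[sub.argmax(axis=1)]
    return float((pred == labels).mean())

def joint_accuracy(logits, labels):
    # All base and novel classes compete directly.
    return float((logits.argmax(axis=1) == labels).mean())
```

A novel sample can be classified correctly inside the novel subspace yet still lose to a dominant base logit in the joint space; this is precisely the calibration gap the second and third phases address.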
Model parameters: We denote the backbone parameters as $\theta_1$, $\theta_2$, and $\theta_3$ for the first, the second, and the third phase respectively. As classifier we use a linear classification layer without bias that we train on top of the backbone. Practically, during the second phase we introduce an additional classification layer for the novel classes. To evaluate performance in the joint space we concatenate the outputs of the two classifiers before the softmax normalization. We denote by $z_c$ the output logit of the model for the classification into class $c$. We train $\theta_1$ on a large dataset of base classes to obtain a good representation. For the second and the third phase we initialize the backbone with the parameters of the preceding phase and fine-tune it on the set corresponding to that phase.
3.1 Second Phase - Novel Class Training
Base-Normalized Cross Entropy ($L_{BNCE}$) Recently, Tian et al. [tian2020rethink] showed that in few-shot learning competitive classification on novel classes can be achieved given good representations, using standard cross entropy without meta-learning or prototypes. We follow this idea, take the pretrained model from the first phase, and fine-tune it with a new classification layer for the novel classes. The standard way to fine-tune the model on the training set $D_N$ that includes the novel classes $C_N$ is

$L_{CE} = -\sum_{x \in D_N} \sum_{c \in C_N} y_c \log \frac{e^{z_c}}{\sum_{c' \in C_N} e^{z_{c'}}},$

where $z_c$ is the logit of the corresponding class $c$, and $y_c$ equals one if $x$ belongs to class $c$ and zero otherwise.
One problem here is that even if the model is capable of learning the new classes well in the separate space ($a^N_N$), there is no guarantee that this performance is replicated in the joint space ($a^N_J$). By training two disjoint classifiers we learn classification weights that satisfy the classification problem either on the novel classes or on the base classes, but so far the model does not learn any correlation between base classification weights and novel classification weights.
To this end, we propose to provide the model with information about the base class distribution in the joint space using readily available information. Specifically, for each novel training sample we compute logits not only for the novel classes, but also for the base classes (note that in the second phase the base class classifier is kept fixed). We use these logits to compute classification scores with the softmax function, such that the normalization of each score includes the base class logits, which initially prevail in the sum:

$L_{BNCE} = -\sum_{x \in D_N} \sum_{c \in C_N} y_c \log \frac{e^{z_c}}{\sum_{c' \in C_N \cup C_B} e^{z_{c'}}}.$

With this normalization, the model learns output probabilities for the novel classes directly in the joint space, and specifically increases the magnitudes of the novel class logits with respect to the base class logits. This allows for good classification accuracy on the novel classes ($a^N_N$) in the second phase and at the same time helps to match this accuracy in the joint space ($a^N_J$). Note that we do not use any base class training samples during this learning phase and we keep the weights of the base classifier fixed.
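The base-normalized cross entropy can be sketched as a standard cross entropy whose softmax denominator is enlarged by the frozen base logits. A minimal numpy illustration under our own naming (the paper does not prescribe this implementation):

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable CE; the softmax normalizes over the given logits only.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(labels)), labels].mean())

def base_normalized_ce(novel_logits, base_logits, labels):
    # The softmax denominator additionally contains the frozen base class
    # logits, so novel probabilities are formed directly in the joint space.
    # Labels index the novel classes, which occupy the first columns of the
    # concatenated logits.
    joint = np.concatenate([novel_logits, base_logits], axis=1)
    return cross_entropy(joint, labels)
```

Early in training the base logits dominate the denominator, so the loss pushes the novel logits to grow relative to them, exactly the amplification described above.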
Knowledge Preservation After the first phase the model performs well on the base classes and we aim to keep this capability. In FSL [Dhillon2020Abaselinefew, raghu2019rapid] and IL [li2017learningwf, rebuffi2017icarl] multiple works show that adaptive representations, i.e. fine-tuning the parameters of the representation, can be beneficial for learning novel classes. In IL, the typical way to preserve knowledge from base classes [castro2018end, lee2019overcoming, wu2019large, douillard2020podnet, li2017learningwf, rebuffi2017icarl] is knowledge distillation (KD) [hinton2015distilling], which is applied in the form of a KL-divergence between the base class logits of the adapted and the old model.
As an alternative to make the network retain its previous knowledge, we propose to utilise explicit weight constraints (WC) of the model with respect to the old model from the first phase. We formulate it as a penalty over the adaptive parameters of the representation [li2017learningwf, kirkpatrick2017overcoming, evgeniou2004regularized]:

$L_{WC} = \lVert \theta_2 - \theta_1 \rVert_2^2,$

where $\theta_2$ denotes the adaptive parameters of the backbone, excluding classification parameters, during the second phase and $\theta_1$ are the parameters of the model after base pretraining. The above constraint forces the model to keep the representation learned on base samples, but still allows it to adjust the weights of the representation to better fit the novel classes while not diverging much from the old model. The overall loss for the second phase is thus:

$L = L_{BNCE} + \lambda L_{WC},$

where $\lambda$ controls the strength of the knowledge preservation.
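The combined second-phase objective can be sketched in a few lines; the helper names and the placeholder value of lambda are our own, the actual value is a tuned hyperparameter:

```python
import numpy as np

def weight_constraint(theta_2, theta_1):
    # Squared L2 distance between current backbone parameters and the
    # frozen first-phase parameters (classifier weights are excluded).
    return float(sum(((p2 - p1) ** 2).sum() for p2, p1 in zip(theta_2, theta_1)))

def second_phase_loss(bnce_value, theta_2, theta_1, lam=0.1):
    # lam is the lambda of the text; 0.1 is an arbitrary placeholder.
    return bnce_value + lam * weight_constraint(theta_2, theta_1)
```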
3.2 Third Phase - Joint Calibration
Balanced Replay The first and the second phase account for the performance on the base classes ($a^B_B$) and for the novel class learning in both spaces ($a^N_N$ and $a^N_J$). For the third phase, due to the difference in the number of training samples for base and novel classes and the preservation of $a^B_B$ during the second phase, the model should be able to obtain good base performance in the joint space ($a^B_J$) as well. Empirically we found that $a^B_J$ can drop drastically during the second phase, but since the base class performance $a^B_B$ is kept, we can recover good performance in the third phase.
To achieve a balanced performance in the joint space of base and novel classes we apply the replay technique that is common in incremental learning [rebuffi2017icarl, liu2020mnemonics, rajasegaran2020itaml, lee2019overcoming]. Specifically, we randomly draw base training samples only once, one per class, and join these samples with the novel training data. Consequently, our method requires the least possible memory [rebuffi2017icarl] to store exemplars of base classes, an essential consideration for incremental learning.
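The exemplar sampling can be sketched as follows; the data layout and function name are our own illustration:

```python
import random

def build_replay_set(base_data, novel_data, seed=0):
    # base_data: dict mapping class id -> list of stored samples
    # novel_data: list of (sample, class id) pairs
    # One exemplar per base class is drawn once and joined with all
    # novel training data to form the balanced third-phase set.
    rng = random.Random(seed)
    exemplars = [(rng.choice(samples), cls)
                 for cls, samples in sorted(base_data.items())]
    return exemplars + list(novel_data)
```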
We continue training the model on this balanced dataset in the joint space. Due to the initially strong bias towards the base classes from the first phase ($a^B_B$), the model can quickly improve its performance for base classes in the joint space ($a^B_J$), while at the same time overwriting the novel class performance ($a^N_J$) at least partially.
3.3 From Generalized to Incremental Learning
The main difference between GFSL [ren2019incremental, shi2019relational] and IFSL [tao2020few, chen2021incremental] is the number of few-shot tasks. So far we considered the case of GFSL, which can be regarded as the first two tasks of the incremental setting: the training of the base classes corresponds to task one, the training of the novel classes to task two. As the current framework addresses the joint generalized problem in three phases (base classes, novel classes, and joint classes), we can easily extend it to more tasks by repeating the novel class training. Specifically, for each new few-shot task we apply the second phase to learn a good joint classification for the current classes.
To evaluate the performance in the current joint space we finalize the training with the last, recuperation phase. To this end, we keep exemplars from the base classes and from the novel classes of the different tasks and perform training with the base-normalized cross-entropy loss and weight constraints as before. Thus, whenever we need to evaluate the joint performance on all classes, we apply the third phase.
For example, to report accuracy after five tasks, we learn the representation parameters from the base classes in the first task, then apply the second phase sequentially for the second, third, fourth, and fifth tasks, each time enlarging the classifier by the number of new classes in the task. After the fifth task we apply the third phase, balanced replay, where for each few-shot task we use all available data, and one sample per class for the data from the first task. During testing, the performance of the model is evaluated on a set that contains all previously seen classes.
tiered-ImageNet (left block: 5w1s, right block: 5w5s; columns per setting: $a^N_N$, $a^B_B$, $a^N_J$, $a^B_J$, hm, am)

|LCwoF (ours) lim|58.32|72.75|47.16|55.07|50.81|51.12|73.63|71.82|62.23|59.94|61.06|61.09|
|DFSL [gidaris2018dynamic] (c)|56.83|70.15|41.32|58.04|48.27|49.68|72.82|70.03|59.27|58.68|58.97|58.98|
|LCwoF (ours) lim|60.78|79.89|53.78|62.89|57.39|57.84|77.65|79.96|68.58|64.53|66.49|66.55|
|LCwoF (ours) unlim|61.15|80.10|53.33|62.99|57.75|58.16|77.88|80.09|67.17|66.59|66.88|66.88|
4 Experimental Results
This section validates our proposed LCwoF-framework. First, we compare our method to the previous state-of-the-art work on both GFSL and IFSL in Section 4.1. Then we analyze each phase and the components separately to show the importance and connections of each to the improved performance in Section 4.2.
Datasets: mini-ImageNet [vinyals2016matching] is a 100-class subset of ImageNet [deng2009imagenet]. For FSL we follow [ren2019incremental] and use a 64-12-24 split that corresponds to base-val-novel classes; for IFSL we follow [chen2021incremental, tao2020few] with a split of 60 and 40 classes for base training and incremental few-shot testing, with 5 classes per novel set. tiered-ImageNet [ren2018meta] is a larger subset of ImageNet [deng2009imagenet] with categorical splits for base, validation, and novel classes. Here, each high-level category (e.g. dog, which includes different breeds) belongs to only one of the splits. mini-Kinetics [xian2020generalized] is a 100-class subset of the Kinetics video classification dataset [Kay2017Kinetics]. We use the splits from [xian2020generalized]. UCF101 [Soomro2012UCF101] is a video dataset with 101 classes in total. We follow [kumar2019protogan] with a split of 50-51 for base and novel classes for FSL on videos. We additionally introduce and evaluate a more challenging division of the dataset.
Implementation details For the FSL experiments on mini-ImageNet and tiered-ImageNet we employ the same ResNet12 with DropBlock [ghiasi2018dropblock] as backbone and pretrain it on the base classes for 500 epochs with an SGD optimizer with momentum, with a learning rate (lr) of 1e-3 that is decayed by 0.1 at epochs 75, 150, and 300. For the second and the third phase we use an lr of 1e-2 and 1e-3 respectively for the classification layers and an lr decayed by 0.1 for the backbone parameters. For the IFSL experiments we use ResNet18 and follow the same pretraining steps as above. We use different architecture choices for GFSL and IFSL to remain comparable to previous works after the base pretraining. For the second phase we always train the model for 150 epochs, while for the third phase we use the validation set to choose the number of epochs for each dataset. For videos we pre-extract features with a C3D model [tran2015learning] pretrained on the large-scale Sports-1M [karpathy2014large] dataset. We apply average pooling over the temporal domain to obtain one feature vector per video. As backbone we use a 2-layer MLP. We also clip gradients at value 100 for these experiments. More details can be found in the supplement.

Evaluation We evaluate the proposed framework primarily with respect to the harmonic mean (hm) [schonfeld2019generalized, shi2019relational, kumar2019protogan], computed between base and novel performance in the joint space. Additionally, we report performance of base and novel classes in their respective subspaces ($a^B_B$ and $a^N_N$), in the joint space ($a^B_J$ and $a^N_J$), and the arithmetic mean over the joint space (am) as in [ren2019incremental]. Extended tables for all datasets are in the supplement. 5w1s and 5w5s denote 5 novel classes with 1 and 5 training samples per class respectively. For the state-of-the-art comparison we average over 600 episodes [shi2019relational, luo2019global, Dhillon2020Abaselinefew], for all other experiments over 100 episodes. unlim denotes access to the entire base training set, whereas for the lim setup we use a small subset. All specifications are in the supplement.
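The two aggregate measures are simple to state; a sketch (the toy numbers in the usage are our own illustration):

```python
def harmonic_mean(a_novel_joint, a_base_joint):
    # hm is high only when base and novel joint accuracies are balanced;
    # it collapses towards 0 as soon as one of the two does.
    if a_novel_joint + a_base_joint == 0:
        return 0.0
    return 2 * a_novel_joint * a_base_joint / (a_novel_joint + a_base_joint)

def arithmetic_mean(a_novel_joint, a_base_joint):
    # am can still look decent when one of the two accuracies has collapsed.
    return (a_novel_joint + a_base_joint) / 2
```

For instance, a model with joint accuracies of 0 and 90 still obtains am = 45 but hm = 0, which is why hm is the primary measure for balanced performance.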
4.1 Comparison to state-of-the-art
Generalized Few-Shot Learning We compare our performance on image and video benchmarks: mini-ImageNet, tiered-ImageNet, Kinetics, and UCF101 in Tables 1, 2, 3, and 4 respectively. For mini-ImageNet and tiered-ImageNet, we train the respective backbones from scratch on the base classes. For Kinetics and UCF101, we pre-extract video features as in [kumar2019protogan, xian2020generalized] and then train a shallow MLP model on the base classes.
On mini-ImageNet in Table 1 we provide an evaluation in the separate and joint spaces for the 5w1s and 5w5s setups. For both backbones, conv4 [shi2019relational] and ResNet, we achieve significant improvements over state-of-the-art results in terms of hm. Here, we can observe that previous methods drop in performance on both novel ($a^N_J$) and base ($a^B_J$) classes, whereas we address the problem by explicitly balancing forgetting, learning, and calibration and achieve better performance.
On tiered-ImageNet, Table 2, we can observe a similar pattern and achieve strong improvements. Here, even with more base classes, we are able to calibrate the performance between novel and base classes.
We further evaluate the performance of the proposed idea on two video datasets. Our results on Kinetics, shown in Table 3, and UCF101, shown in Table 4, show that the proposed framework is able to perform well on the pre-extracted features. For UCF101 we present results on two different splits for training and testing. In Table 4 the first two lines correspond to the splits provided by [kumar2019protogan] and thus can be compared directly. The second part of the table shows the evaluation for the setup with the original UCF train/test split as defined in [Soomro2012UCF101].
Note that on both image datasets we obtain significant gains in performance when applying the unlim sampling strategy, while on Kinetics and UCF101 there is a slight decrease in comparison to lim. We speculate that this happens due to the fixed feature pre-extraction, whereas on images we train the models on raw images. Across all datasets, settings, and architectures we consistently achieve significantly better performance than previous work.
mini-Kinetics 5w1s / 5w5s (recovered rows; columns: $a^N_J$, $a^B_J$, hm)

|LCwoF (ours) lim|54.41|91.41|68.22|
|LCwoF (ours) lim|50.78|70.72|59.11|
|LCwoF (ours) unlim|49.12|69.98|57.72|
Incremental Few-Shot Learning We compare our method with the current few-shot class incremental methods in Table 5. As in the previous experiments we use the hm accuracy, computed between base (first set) and novel classes. We provide more detail on the performance of each task in the supplement, specifically showing the performance of base and novel classes separately, as well as the standard accuracy over all classes in the joint space. Our method notably outperforms other methods in the field because we directly address the balance between the performance on base and novel classes. We show that we obtain higher novel accuracy in the joint space for every task.
4.2 Ablation Studies
Here we investigate the influence of several components of our framework and the impact of various hyperparameters.
Calibration and forgetting during the second phase In this subsection we analyse the influence of the base-normalized cross entropy as a technique to address the calibration problem, as well as the influence of knowledge preservation to address the forgetting issue. In Fig. 4 we show the behaviour of the training process during the second and the third phase with and without base-normalized cross entropy ($L_{BNCE}$) and knowledge preservation via explicit weight constraints ($L_{WC}$). Removing both elements, as shown in the top left sub-figure, results in a drastic drop in $a^B_B$ and $a^B_J$ that prohibits the model from quickly recovering during the replay (third) phase, since all previous knowledge is lost. The bottom left sub-figure shows the impact of $L_{BNCE}$. With $L_{BNCE}$ the model easily achieves good novel performance in the joint space ($a^N_J$) that matches $a^N_N$, but here both $a^B_B$ and $a^B_J$ drop during the second phase training, which again prevents recuperation. Compared to that, the top right sub-figure includes base knowledge preservation, which keeps both $a^B_B$ and $a^B_J$ relatively high and further facilitates complete recovery for the base classes. But without $L_{BNCE}$, novel learning in the joint space ($a^N_J$) suffers during both the second and the third phase. The bottom right sub-figure shows the performance if both $L_{BNCE}$ and knowledge preservation are used. The model is able to keep a certain performance of $a^B_J$ during the second phase while achieving high accuracy on the novel classes ($a^N_N$ and $a^N_J$). During the third phase we calibrate the model with the replay technique and achieve a good balance between the two disjoint sets of classes.
Knowledge Preservation One important factor of the method is that we aim to retain the knowledge that the model obtained in the first phase, specifically the $a^B_B$ performance, during the second phase. In this section we evaluate two different methods to achieve this objective, comparing the proposed explicit weight constraints (WC) with knowledge distillation (KD), which is formulated via a KL-divergence. Knowledge distillation is a common technique to preserve knowledge in incremental learning [castro2018end, douillard2020podnet, lee2019overcoming, li2017learningwf, rebuffi2017icarl, wu2019large], where abundant training data is available for new classes.
In Table 6 we evaluate the performance over 100 episodes on mini-ImageNet for the 5w1s and 5w5s settings. Comparing the WC and KD knowledge preservation techniques, we find that KD marginally outperforms WC if more data is available, as in the 5w5s setting, whereas plain weight constraints are more effective in the lowest data regime with 1 training sample per class (5w1s). Additionally, we evaluate KD when including 1000 additional unlabeled images from the validation set during the second phase for the loss computation, denoted accordingly in Table 6. We find that this helps to improve KD in the 5w1s setting, but it still stays below WC. Applying both techniques at the same time does not give an improvement.
Impact of $\lambda$ In this section we study the influence of $\lambda$ on the loss, i.e. how to preserve more knowledge and drop less on the base classes ($a^B_B$), but also how the model behaviour changes if we force it to preserve even more. In Fig. 5 we plot the accuracy after training in the second phase, before the replay phase, for different $\lambda$. In Fig. 5 we fill with gray color the areas that allow us to reach our two objectives for the second phase, i.e., achieving good performance on the novel classes in the joint space ($a^N_J$) and keeping good accuracy in the base space ($a^B_B$). Specifically, a higher $\lambda$ helps to further keep $a^B_B$ during novel learning, while $a^N_J$ starts decreasing for higher values. The figure shows that the onset of this decrease depends on the number of available training samples: the more training data, the more base class knowledge we can keep by choosing a higher $\lambda$.
Early Stopping: Second Phase We observe that during the second phase $a^B_B$ usually starts dropping even when $a^N_J$ has already reached a reasonable accuracy, as can e.g. be seen in the bottom right subplot of Figure 4. We therefore also report the best performance that can be achieved with different $\lambda$ during the second phase in Table 7. Note that $\lambda$ influences the contribution of the knowledge preservation part $L_{WC}$. Thus, $a^B_B$ will drop faster if a lower $\lambda$ is chosen. By adjusting $\lambda$, we find that early stopping can also be helpful during the second phase. It shows that in this case a balanced model performance can be reached with a higher $\lambda$, and that we can achieve high performance even without the third replay phase.
Impact of batch normalization We use batch norm layers in the model, which capture statistics from the base classes during the first phase. In the second phase our data is highly limited, thus we keep the batch norm statistics fixed during further training. Table 8 shows that performance drops by more than 1 point when the model instead tries to accumulate new statistics from 1 training sample per class and to adapt its parameters accordingly.
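Freezing batch norm amounts to always normalizing with the stored running statistics instead of per-batch statistics. A minimal sketch of the frozen behaviour (the class is our own illustration, not the paper's implementation):

```python
import numpy as np

class FrozenBatchNorm:
    # Normalizes with running statistics accumulated on the base classes
    # during the first phase; batch statistics of the few novel samples
    # are deliberately ignored, as they would be far too noisy.
    def __init__(self, running_mean, running_var, eps=1e-5):
        self.mean = running_mean
        self.var = running_var
        self.eps = eps

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)
```

In a framework such as PyTorch the same effect is typically obtained by keeping the BN layers in eval mode during fine-tuning.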
Impact of cosine normalization One common strategy to unify the magnitudes of base and novel classifiers is cosine normalization of the embedding and the weights [hou2019learning, ren2019incremental, shi2019relational, mittal2021essentials, gidaris2019generating, rebuffi2017icarl]. We experiment with such a setup (Table 9, lines 1 & 3) for our framework and find a decline in performance for both $a^N_J$ and $a^B_J$. Note that the performance of the model after the first phase on $a^B_B$ is the same as without cosine normalization. But when we attempt to match $a^N_N$ and $a^N_J$ during the second phase, we find that the constrained magnitudes of the logits due to the normalization restrict the performance and do not allow us to achieve our second phase objectives.
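To illustrate why the logit magnitudes are constrained under this scheme, a cosine classifier can be sketched as follows (the function name and the scale value are our own, not the paper's):

```python
import numpy as np

def cosine_logits(features, weights, scale=10.0):
    # Both embeddings and class weights are L2-normalized, so raw logits
    # are bounded by the scale regardless of the weight magnitudes;
    # scale=10 is a placeholder temperature.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    return scale * f @ w.T
```

Because rescaling the classifier weights cannot change these logits, the model loses the ability to grow the novel logits relative to the base ones, which is exactly the mechanism our base-normalized cross entropy relies on.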
Analysis of the classification layer By default, we conduct all our experiments with a linear classification layer without a bias term (Table 9, lines 2 & 3). We therefore assess the performance of the model with and without bias. We find that during the first phase the performance is the same, but that during the second and the third phase it is beneficial to omit the bias, giving a boost of about 0.7 points in hm.
This paper addresses major challenges in generalized few-shot and incremental few-shot learning with our three-phase framework. First, we learn a powerful representation by training a model on base classes. In the second phase, concerned with novel class learning, we employ a base-normalized cross entropy that calibrates the novel class classifiers to overcome the bias towards base classes. Additionally, during that phase we preserve knowledge about the base classes via weight constraints. In the third phase, to achieve calibrated classifiers across both base and novel classes, we employ balanced replay. We show that each phase of the framework allows us to study and address the essential problems of the task explicitly. We evaluate our proposed framework on four benchmark image and video datasets and achieve state-of-the-art performance across all settings. This work can be seen as a first step towards more explicitly addressing calibration, learning, and knowledge preservation jointly, to further improve deep learning for imbalanced settings beyond the ones addressed in this paper.
Appendix A Additional ablation of the phases
In this section we discuss possible variations of our proposed framework and their influence on the performance. Specifically, we discuss the necessity of the second phase and the duration of the second phase. We also inspect the influence of knowledge preservation on the performance after the third phase and the impact of weight decay regularization.
A.1 The second phase
Skip the second phase
In the proposed work, we address the problem of generalized few-shot learning with a three-phase framework. During the second phase we aim to improve novel class learning and to mitigate catastrophic forgetting of the base classes. In Fig. 6 we show the development of the performance when we skip the second phase and proceed directly with the third phase.
During the third, joint calibration phase, the training set consists of base (one sample per class) and all novel training samples.
The performance of the base classes in the joint space and the separate space stays at a high level even with few training samples.
While the separate novel performance can reach high values during the third phase, novel class learning in the joint space suffers from a strong bias towards the base classes (the red curve in the figure stays low). This shows that our second phase for explicit novel class learning in the joint space gives a significant boost to the overall performance in the joint space.
Skip the second phase and keep batch ratio during the third phase We further evaluate the performance of the model without the second phase but with an adapted third phase. Specifically, we ensure a consistent batch-wise ratio between novel and base classes during the third phase. The results in Table 10 not only show better performance than trivially skipping the second phase but also outperform the previous state-of-the-art [ren2019incremental]. Our proposed three-phase framework nonetheless performs better on the novel classes.
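Such a fixed batch-wise ratio can be realized with a simple sampler that pairs each chunk of novel samples with a random draw of base samples. A minimal sketch, where the function name and the per-batch counts are illustrative assumptions:

```python
import random

def ratio_batches(novel_indices, base_indices, novel_per_batch, base_per_batch):
    """Yield batches with a fixed novel:base ratio. Novel samples are
    consumed once per epoch without replacement; base samples are
    re-drawn at random for every batch."""
    novel = list(novel_indices)
    random.shuffle(novel)
    for start in range(0, len(novel), novel_per_batch):
        chunk = novel[start:start + novel_per_batch]
        yield chunk + random.sample(list(base_indices), base_per_batch)
```

Keeping the ratio constant prevents any single batch from being dominated by the far more numerous base samples.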
Interleave the second and the third phases In order to shed light on the question whether one should separate the second and the third phases as proposed, in Table 11 we instead interleave the second and the third phases. In particular, we alternate between training on novel classes only (for a fixed number of epochs) and balanced replay (for 1 epoch). Phase alternation proves to be an effective alternative, although the consecutive execution still performs best.
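The interleaved variant can be written as an epoch-level schedule; a minimal sketch (the function name and the period counts are illustrative assumptions):

```python
def interleaved_schedule(novel_epochs, num_periods):
    """Alternate `novel_epochs` epochs of novel-only training with one
    epoch of balanced replay, repeated for `num_periods` periods."""
    schedule = []
    for _ in range(num_periods):
        schedule += ["novel"] * novel_epochs + ["replay"]
    return schedule
```

In contrast, our proposed framework runs all novel-only epochs first and all replay epochs afterwards.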
a.2 Number of epochs of the second phase
For the evaluation, we train the model for a fixed number of epochs during the second phase for all datasets and setups. Fig. 7 shows similar behaviour when we apply a smaller number of epochs (30) during the second phase compared to a longer second phase (150 epochs). Due to the negligible differences we use the initially chosen value of 150 epochs throughout the main paper.
a.3 High values of the weight-constraint strength
In Fig. 4 of the main paper we show the range of weight-constraint strengths appropriate to achieve a good balanced performance for the 5w1s and 5w5s setups. We argue that the constraint strength should not prevent novel class learning while preserving base performance in the base class space. In Fig. 4 we exclude too low values since empirically we found a decrease in the performance on the base classes after the third phase. Table 12 presents the performance of the model with different constraint strengths after the third phase. A higher strength helps to better preserve knowledge of the base classes, while it hinders novel class learning in the joint space.
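The weight constraint can be realized as a quadratic penalty that pulls the current parameters towards their values after the first phase; a minimal sketch, assuming flat parameter lists and an illustrative function name:

```python
def weight_constraint_penalty(params, anchor_params, strength):
    """Quadratic penalty: strength * ||theta - theta_anchor||^2.
    A larger strength preserves base-class knowledge more strongly
    but leaves less room for novel class learning."""
    return strength * sum((p - a) ** 2 for p, a in zip(params, anchor_params))
```

The penalty is added to the training loss during the second phase, so the strength directly trades off knowledge preservation against plasticity.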
a.4 Impact of weight decay regularization
While we already constrain the parameters of the model via the weight constraints, the question arises if and to what extent we additionally need standard weight decay as regularization on the model parameters. As shown in Fig. 8, the contribution of the regularization term remains minor: it neither helps nor harms the performance.
a.5 10w1s and 20w1s
We evaluate our approach on additional setups to directly compare to knowledge preservation methods, similarly to Table 6 in the main paper. Table 13 confirms our finding that with little data, knowledge distillation (KD) [hinton2015distilling, rebuffi2017icarl] performs worse than our weight constraints.
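For reference, the knowledge distillation baseline penalizes the deviation of the current (student) predictions from the predictions of the frozen model before novel class learning (teacher); a minimal sketch of the standard formulation [hinton2015distilling], where the temperature value is an illustrative assumption:

```python
import math

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Knowledge distillation loss: KL divergence between the
    temperature-softened teacher and student distributions."""
    def softmax(logits, t):
        exps = [math.exp(z / t) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

With only few novel samples, this distillation signal preserves base knowledge less effectively than our explicit weight constraints.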
Appendix B Extended implementation details
This section covers additional implementation details. We first specify the exact augmentation for images and then discuss the evaluation for the generalized and incremental setups. Our framework is built with the PyTorch library and will be made publicly available.
Augmentation For training on images we apply standard augmentation: random resizing followed by random cropping to the size 84x84 and a random horizontal flip. We also use color jittering that randomly changes brightness, contrast, and saturation. For testing we first resize an image to 92x92 and then apply a central 84x84 crop.
Evaluation For each image and video dataset we use 15 test samples per class for both base and novel classes. We train two parametric classifiers for base and novel classes respectively. To evaluate the performance in the joint space we concatenate the two logit vectors and predict the class by applying the argmax operator over the concatenated output vector.
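The joint-space prediction step can be sketched as follows (the function name is an illustrative assumption):

```python
def predict_joint(base_logits, novel_logits):
    """Concatenate base and novel logit vectors and return the argmax
    index in the joint space (base classes first, then novel classes)."""
    joint = list(base_logits) + list(novel_logits)
    return max(range(len(joint)), key=joint.__getitem__)
```

Since the two classifiers are trained separately, this concatenation is exactly where miscalibrated logit magnitudes between base and novel classes would bias the prediction.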
lim / unlim For each dataset we conduct experiments with the two following setups: lim denotes limited access to the base training data during the third phase, whereas for unlim we allow the model unlimited access to the base training samples during the third phase. During the third phase we target a balanced training set, thus the replay set consists of all novel samples and an equal number of base samples. The notation corresponds to the standard notation for few-shot learning [chen2021incremental, gidaris2018dynamic, kumar2019protogan], e.g. 5w1s denotes 5 novel classes with 1 training sample per class. lim: for each episode we draw the replay set at random only once before the third phase and reuse that replay set in each epoch of that episode. unlim: for each episode we draw random base training samples for the replay set anew before each epoch.
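The difference between the two setups can be sketched as a single replay-set generator; a minimal sketch, assuming sample lists as inputs and illustrative names:

```python
import random

def build_replay_set(novel_samples, base_samples, redraw_each_epoch, num_epochs):
    """Yield one balanced replay set per epoch: all novel samples plus
    an equally sized random subset of base samples. With
    redraw_each_epoch=False (lim) the base subset is drawn once and
    reused; with True (unlim) it is drawn anew before every epoch."""
    fixed = random.sample(base_samples, len(novel_samples))
    for _ in range(num_epochs):
        base_part = (random.sample(base_samples, len(novel_samples))
                     if redraw_each_epoch else fixed)
        yield list(novel_samples) + base_part
```

In both cases every epoch of the third phase sees a perfectly balanced training set.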
b.1 Generalized few-shot learning setup
Episodic training is a common way to evaluate few-shot learning methods; in the following we detail the differences to the standard training protocol [gidaris2018dynamic, ren2019incremental, ye2020set2set]. Generally, the few-shot setup is formulated in N-way K-shot notation: each few-shot episode consists of N novel classes with K training samples per class. In the generalized setup each episode includes the base classes for classification along with the novel classes. Each dataset includes more novel test classes than are drawn per episode, e.g. in mini-ImageNet there are 20 classes in the test set whereas we evaluate on 5w1s and 5w5s. To this end, following standard practice [gidaris2018dynamic, ren2019incremental, ye2020set2set] we evaluate the performance as the average over 600 episodes, where for each episode we:
randomly draw the novel classes from the novel test split
randomly draw the training samples for each novel class
apply training framework to the current training data
randomly draw 15 samples per class to test the framework
reset the framework to its initial state and discard the episode's training data
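The steps above can be sketched as a single evaluation loop, where `run_episode` stands in for training on the drawn support samples and testing on 15 samples per class (all names are illustrative assumptions):

```python
import random

def episodic_eval(test_classes, n_way, num_episodes, run_episode):
    """Average accuracy over few-shot episodes: each episode draws
    n_way novel classes at random and delegates training and testing
    to `run_episode`; no state is carried across episodes, which
    implements the reset step."""
    scores = [run_episode(random.sample(test_classes, n_way))
              for _ in range(num_episodes)]
    return sum(scores) / num_episodes
```

Averaging over 600 such episodes yields the numbers reported in our tables.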
b.2 Incremental few-shot learning setup
We refer to recent work [delange2021continual, prabhu2020gdumb] for an extensive overview and taxonomy of incremental (continual) learning. In our work we aim at class-incremental learning, where all seen classes should be classified in the joint space. Another popular choice in continual learning is task-incremental learning, with the objective of achieving high accuracy in the disjoint spaces; in our notation this corresponds to high base performance and high novel performance separately. Such a formulation of the task is easier than achieving a jointly balanced, high performance. The usual way to evaluate class-incremental learning [liu2020mnemonics, tao2020few, chen2021incremental] is to continuously measure the performance of the model in the growing joint space. The first task (sometimes called session) evaluates the performance on the base classes only. Following the few-shot N-way K-shot notation, the second task increases the joint space by N classes, the third task increases it by N classes again, and so on, up to 9 tasks in mini-ImageNet, namely 100 classes in total. We follow [tao2020few, chen2021incremental] and use the same division into tasks, training and test samples.
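For illustration, the growth of the joint label space can be written down explicitly; a sketch assuming 60 base classes and 5-way increments, the mini-ImageNet counts in the split of [tao2020few, chen2021incremental]:

```python
def task_class_counts(num_base, n_way, num_tasks):
    """Number of classes in the joint space after each task: the first
    task covers only the base classes, each later task adds n_way."""
    return [num_base + n_way * t for t in range(num_tasks)]
```

The evaluation after task t is thus over an increasingly imbalanced joint space of base and accumulated novel classes.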
b.3 UCF101 splits
As mentioned in Section 4 of the main paper, we introduce a novel split for UCF101. We observe that the performance is close to saturation with the previously introduced split by Dwivedi [kumar2019protogan], which we report in Table 17. We do not change the division into base and novel classes; instead, we filter out videos that share the same group [Soomro2012UCF101] between train and test splits. When comparing results on the previous split and our novel split we indeed see a drop in performance, indicating that the proposed split corresponds to a harder task. Accordingly, the joint performance also drops. We will make the novel split publicly available.
Appendix C Extended tables
In Table 14 we summarize Tables 18, 19 and 20 by reporting performance with different metrics after all 9 tasks of incremental few-shot learning. In the supplement we report both a base biased and a balanced hm variant of our framework. base biased stands for the model that shows higher accuracy on the base classes, whereas balanced hm indicates a more balanced performance between the disjoint sets, which we control by the number of epochs of the third phase. The discrepancy is caused by the difference in the number of base and novel classes and the initial bias of the network towards the base classes, due to the larger number of training samples and the subsequent knowledge preservation. Therefore, in Table 14 the joint performance mainly depends on the performance on the base classes, e.g. the Joint training method shows the highest base and joint accuracy among all methods with 61.89 and 43.38 points respectively. All other methods that achieve a high base performance (60.44 for IDLVQ, 59.64 for IW) accordingly reach a high performance on the joint set of base and novel classes (41.84 for IDLVQ, 41.26 for IW). At the same time these methods perform poorly on the novel classes (15.62 for Joint, 13.94 for IDLVQ, 13.69 for IW). On the contrary, our framework explicitly addresses novel class learning in the joint space via base-normalized cross entropy and thus surpasses all previous methods on novel classes by more than 10 points, reaching 27.65 points. Our base biased model outperforms previous state-of-the-art models by a large margin on novel classes and harmonic mean, and sets a new benchmark on the joint classes. With balanced hm we show that a better balance can be achieved in terms of novel accuracy and harmonic mean, while base and joint accuracy accordingly decrease.
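The harmonic mean (hm) used throughout these tables combines the base and novel accuracies and is dominated by the lower of the two; for instance, for our base biased model it pairs 55.98 (base) with 23.12 (novel):

```python
def harmonic_mean(acc_base, acc_novel):
    """hm = 2ab/(a+b); a low novel accuracy drags the mean down even
    when the base accuracy is high."""
    return 2 * acc_base * acc_novel / (acc_base + acc_novel)
```

This is why methods that are strongly biased towards base classes score poorly on hm despite a high joint accuracy.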
| Method | base | novel | joint | hm |
|---|---|---|---|---|
| LCwoF (base biased) | 55.98 | 23.12 | 42.84 | 32.73 |
| LCwoF (balanced hm) | 47.73 | 27.65 | 39.70 | 35.02 |
tiered-ImageNet 5w1s:

| Method | base (sep) | novel (sep) | base (joint) | novel (joint) | hm | mean |
|---|---|---|---|---|---|---|
| DFSL [gidaris2018dynamic] (c) | 59.52 | 47.53 | 47.32 | 36.10 | 40.96 | 41.71 |
| LCwoF (ours) lim | 64.71 | 70.55 | 57.13 | 60.39 | 58.71 | 58.76 |
| LCwoF (ours) unlim | 64.67 | 70.59 | 57.54 | 60.09 | 58.78 | 58.82 |

tiered-ImageNet 5w5s:

| Method | base (sep) | novel (sep) | base (joint) | novel (joint) | hm | mean |
|---|---|---|---|---|---|---|
| DFSL [gidaris2018dynamic] (c) | 75.89 | 47.98 | 67.94 | 39.08 | 49.61 | 53.51 |
| LCwoF (ours) lim | 79.72 | 70.58 | 69.05 | 63.44 | 66.12 | 66.25 |
| LCwoF (ours) unlim | 80.02 | 70.59 | 70.20 | 63.01 | 66.41 | 66.61 |
mini-Kinetics 5w1s:

| Method | base (sep) | novel (sep) | base (joint) | novel (joint) | hm | mean |
|---|---|---|---|---|---|---|
| DFSL [gidaris2018dynamic] (c) | 65.35 | 56.04 | 50.81 | 44.51 | 47.45 | 47.66 |
| LCwoF (ours) lim | 55.97 | 64.84 | 47.51 | 50.84 | 49.12 | 49.18 |
| LCwoF (ours) unlim | 55.39 | 65.01 | 46.26 | 51.94 | 48.93 | 49.10 |

mini-Kinetics 5w5s:

| Method | base (sep) | novel (sep) | base (joint) | novel (joint) | hm | mean |
|---|---|---|---|---|---|---|
| DFSL [gidaris2018dynamic] (c) | 81.11 | 56.46 | 70.29 | 46.31 | 55.83 | 58.30 |
| LCwoF (ours) lim | 74.76 | 65.06 | 63.65 | 54.55 | 58.75 | 59.10 |
| LCwoF (ours) unlim | 73.77 | 65.18 | 65.40 | 52.70 | 58.37 | 59.05 |
| Method | base (sep) | novel (sep) | base (joint) | novel (joint) | hm | mean |
|---|---|---|---|---|---|---|
| LCwoF (ours) lim | 55.98 | 84.16 | 50.78 | 70.72 | 59.11 | 60.75 |
| LCwoF (ours) unlim | 54.35 | 82.33 | 49.12 | 69.98 | 57.72 | 59.55 |
Harmonic mean (hm) per task:

| Method | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|---|---|---|---|---|---|---|---|---|---|
| LCwoF (base biased) | - | 25.56 | 30.59 | 27.29 | 28.08 | 29.91 | 27.97 | 30.30 | 32.73 |
| LCwoF (balanced hm) | - | 41.24 | 38.96 | 39.08 | 38.67 | 36.75 | 35.47 | 34.71 | 35.02 |

Base class accuracy per task:

| Method | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|---|---|---|---|---|---|---|---|---|---|
| LCwoF (base biased) | 64.45 | 63.53 | 62.07 | 61.55 | 60.85 | 59.26 | 58.25 | 57.23 | 55.98 |
| LCwoF (balanced hm) | 64.45 | 57.33 | 53.31 | 52.87 | 51.38 | 48.25 | 47.60 | 47.51 | 47.73 |

Novel class accuracy per task:

| Method | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|---|---|---|---|---|---|---|---|---|---|
| LCwoF (base biased) | - | 16.00 | 20.30 | 17.53 | 18.25 | 20.00 | 18.40 | 20.60 | 23.12 |
| LCwoF (balanced hm) | - | 32.20 | 30.70 | 31.00 | 31.12 | 29.68 | 28.27 | 27.34 | 27.65 |

Joint accuracy per task:

| Method | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|---|---|---|---|---|---|---|---|---|---|
| LCwoF (base biased) | 64.45 | 59.88 | 56.10 | 52.75 | 50.20 | 47.71 | 44.97 | 43.74 | 42.84 |
| LCwoF (balanced hm) | 64.45 | 55.40 | 50.08 | 48.49 | 46.28 | 42.78 | 41.16 | 40.08 | 39.70 |