Coarse-To-Fine Incremental Few-Shot Learning

Unlike fine-tuning a model pre-trained on a large-scale dataset of preset classes, class-incremental learning (CIL) aims to recognize novel classes over time without forgetting pre-trained classes. However, a given model will be challenged by test images with finer-grained classes, e.g., a basenji is at most recognized as a dog. Such images form a new training set (i.e., support set) from which the incremental model is expected to learn to recognize a basenji (i.e., query) as a basenji next time. This paper formulates such a hybrid natural problem of coarse-to-fine few-shot (C2FS) recognition as a CIL problem named C2FSCIL, and proposes a simple, effective, and theoretically-sound strategy, Knowe: to learn, normalize, and freeze a classifier's weights from fine labels, once an embedding space has been learned contrastively from coarse labels. Besides, as CIL aims at a stability-plasticity balance, new overall performance metrics are proposed. In that sense, on CIFAR-100, BREEDS, and tieredImageNet, Knowe outperforms all recent relevant CIL/FSCIL methods that are tailored to the new problem setting for the first time.


1 Introduction

Product visual search is normally driven by a deep model pre-trained on a large-scale private image-set, while at inference it needs to recognize consumer images at a finer granularity. Such a model is expected to evolve on the fly Mai et al. (2021) as it is used, because fine-tuning (FT) it for each set of novel classes yields a growing number of separately retrained models and is thus inefficient. This expectation is also generally valid for vision-driven autonomous systems or intelligent agents. For example, a self-driving car needs to gradually grow its perception capabilities as it runs on the road.

Figure 1: Catastrophic forgetting when FT-ing a coarsely-trained model on fine samples presently available w/o freezing any weight. We pre-set 10 sessions from CIFAR-100 Krizhevsky (2009). There is a fine-class accuracy from the 1st session and yet no coarse-class accuracy as all samples are with fine labels.

As shown in Fig. 1, we are interested in such a coarse-to-fine recognition problem that fits the class-incremental learning (CIL) setting. Moreover, fine classes appear asynchronously, which again fits CIL. It is also a few-shot learning problem, as there is no time to collect abundant samples per new class. We name such an incremental few-shot learning problem Coarse-to-Fine Few-Shot Class-Incremental Learning (C2FSCIL), and aim to propose a method that can evolve a generic model to both avoid catastrophic forgetting of source-blind coarse classes (see also Fig. 1) and prevent over-fitting the new few-shot fine-grained classes.

However, what exactly is the knowledge? Incremental learning (IL) aims for the learning model to adapt to new data without forgetting its existing knowledge. Such forgetting is called catastrophic forgetting, a concept in connectionist networks M.French (1999); Kirkpatrick et al. (2017): it occurs when the new weight vector is completely inappropriate as a solution for the originally learned pattern. In deep learning (DL), knowledge distillation (KD) is one of the most effective approaches to IL, yet there is no consensus about what exactly the knowledge is in deep networks. Could it similarly be the weight vectors?

Is a coarsely-learned embedding space generalizable? We aim to achieve a superior performance at both the coarse and fine granularity. However, considering the diversity of fine labels, it is infeasible to train a comprehensive fine-grained model beforehand. Instead, can a model be trained, using coarsely-labeled samples, to classify finely-labeled samples with accuracy comparable to that of a model trained with fine labels Fotakis et al. (2021)? Our hypothesis is yes; then, the next question is how to pre-train a generalizable base model? How to explore a finer embedding space from coarse labels? Namely, what type of knowledge is useful for fine classes and how can we learn and preserve them Cha et al. (2021)?

Can we balance old knowledge and current learning? (a.k.a., solving the stability–plasticity dilemma Mermillod et al. (2013); Wang et al. (2021)). We aim to remember cues of both the pre-trained base classes and fine classes in the previous few-shot sessions. Our hypothesis is yes, and our preference is a linear classifier, as it is flexible, undemanding of data, efficient to train, and simple for derivation. The next question is then how a linear classifier can evolve the model effectively with a few shots and yet with a balanced performance. As presumed, if the knowledge is weights, then freezing weights retains knowledge while updating weights evolves the model.

To answer those questions, we will first fine-tune a coarse model to test our hypothesis. Motivated to make CIL as simple as fine-tuning, our contributions are four-fold.

  1. We propose a new problem and empirical insights for incrementally learning coarse-to-fine with a few shots.

  2. We propose to learn, normalize, and freeze weights, a simple process (Knowe, pronounced as ’now’) that can effectively solve the problem once we have a base model contrastively learned from coarse labels.

  3. We theoretically analyze why Knowe is a valid solver.

  4. We propose a way to measure balanced performance.

2 Related Problems

2.1 Catastrophic Forgetting (CF)

To learn over time (i.e., sequential learning), it is suggested in McCloskey and Cohen (1989); M.French (1999) that neural networks can be limited by catastrophic forgetting (CF), just as the Perceptron is unable to solve XOR. Knowledge forgetting, also called catastrophic forgetting/interference, concerns a learner’s memory (e.g., LSTM) and is a result of the stability–plasticity dilemma regarding how to design a model that is sensitive to, but not radically disrupted by, new input McCloskey and Cohen (1989); M.French (1999). Often, maintaining plasticity results in forgetting old classes while maintaining stability prevents the model from learning new classes, which may be caused by a single set of shared weights M.French (1999).

2.2 Weakly-Supervised Learning

Judging from the fine-class stage (Fig. 1 middle to right), if we combine a pre-training set and the support set as a holistic training set, then few-shot fine-grained recognition using a model pre-trained on coarse samples is similar to weakly-supervised learning, and specifically learning from coarse labels Bukchin et al. (2021); Fotakis et al. (2021); Xu et al. (2021); Yang et al. (2021), e.g., C2FS Bukchin et al. (2021). Ristin et al. investigate how coarse labels can be used to recognize sub-categories using random forests Ristin et al. (2015) (say, NCM Ristin et al. (2014)).

2.3 Open-World Recognition

Judging from the coarse-class stage Bendale and Boult (2015) (see the left side of Fig. 1), CIL Rebuffi et al. (2017) can be dated back to the support vector machine Kuzborskij et al. (2013) and random forest Ristin et al. (2014, 2015), where a new class can be added as a new node, and is now seen as a progressive case of continual/lifelong learning Delange et al. (2021); Mai et al. (2021), where CF is a challenge as data are hidden. The topology structure is also favored in DL Tao et al. (2020b, a). Few-shot learning (FSL) measures models’ ability to quickly adapt to new tasks Tian et al. (2020) and has a flavor of CIL considering novel classes in the support set, e.g., DFSL Gidaris and Komodakis (2018), IFSL Ren et al. (2019), FSCIL Tao et al. (2020b); Dong et al. (2021); Zhang et al. (2021), and so on.

2.4 Uniqueness of Proposed Problem

Different from existing settings Dong et al. (2021); Tao et al. (2020b); Zhang et al. (2021) that focus on remembering the pre-trained base classes only, our setting requires remembering the knowledge gained in both the base coarse and previous fine sessions. We add finer classes instead of new classes at the same granularity. Our setting requires a balance between coarse and fine performance unexplored by existing works, as shown in Fig. 2.

Method Class hierarchy Few-shot Learning Incremental Learning
LwF Gidaris and Komodakis (2018)
CEC Zhang et al. (2021)
ANCOR Bukchin et al. (2021)
IIRC Abdelsalam et al. (2021)
C2FSCIL (Ours)
Table 1: Comparison of settings with related works.
Figure 2: The stability-plasticity trade-off. Top-right is FT w/o IL; bottom-left represents most IFSL methods; bottom-right is our approach; top-left does not apply. (CIFAR-100)

3 State of the Art (SOTA)

3.1 Incremental Learning (IL)

IL allows a model to be continually updated on new data without forgetting, instead of training a model once on all data. There are two settings: class-IL Masana et al. (2020) and task-IL Delange et al. (2021).

They share main approaches, such as regularization and rehearsal methods. Regularization methods prevent the drift of consolidated weights and optimize network parameters for the current task, e.g., parameter control in EWC Kirkpatrick et al. (2017). CIL is our focus and aims at learning a classifier that maintains a good performance on all classes seen in different sessions.

In addition, Li and Hoiem first introduce KD Hinton et al. (2015) to the IL literature in LwF Li and Hoiem (2016) by modifying a cross-entropy loss to retain the knowledge in the original model. Recent works focus on retaining old-class samples to compute the KD loss. For example, iCaRL Rebuffi et al. (2017) learns both features and strong classifiers by combining KD and feature learning, e.g., NME.

3.2 Operating Weights for IL

The IL literature since 2017 has seen various weight operations (op. for short) in the sense of consolidation (e.g., EWC Kirkpatrick et al. (2017)), aligning Zhao et al. (2020); He et al. (2021), normalization Zhao et al. (2020); Zhu et al. (2021), standardization Belouadah et al. (2020), regularization Kirkpatrick et al. (2017); Pan et al. (2020), aggregation Liu et al. (2021b), calibration Singh et al. (2020), rectification Singh et al. (2021), transfer Lee et al. (2017); Liu et al. (2020), sharing Riemer et al. (2019), masking Mallya et al. (2018), imprinting Qi et al. (2018), picking Hung et al. (2019), scaling Belouadah and Popescu (2020), merging Lee et al. (2020), pruning Mallya and Lazebnik (2018), quantization Shi et al. (2021), weight importance Jung et al. (2020), assignment Hu et al. (2021), restricting weights to be positive Zhao et al. (2020), constraining weight changes Kukleva et al. (2021), and so on.

3.3 Few-Shot Learning (FSL)

The prosperity of DL has pushed large-scale supervised learning to be, so far, the most popular learning paradigm. However, FSL is human-like learning Wang and Yao (2019) in the case of only a few samples Shu et al. (2018); Bendre et al. (2020). For example, Finn et al. propose Model-Agnostic Meta Learning (MAML) to train a model that can quickly adapt to a new task using only a few samples and training iterations Finn et al. (2017). Prototypical Network learns a metric space in which classification can be performed by computing distances to prototype representations of each class Snell et al. (2017). Ren et al. propose a meta-learning model, the Attention Attractor Network, which regularizes the learning of novel classes Ren et al. (2019). It is shown that decoupling the embedding learner and classifier is feasible Zhang et al. (2021). Tian et al. demonstrate that using a good learned embedding model can be more effective than meta learning Tian et al. (2020).

3.4 Incremental Few-Shot Learning (IFSL)

In the IFSL Ren et al. (2019) or similarly FSCIL Tao et al. (2020b) setting, samples in the incremental session are relatively scarce, different from conventional CIL. While IFSL is based on meta learning, IFSL and DFSL Gidaris and Komodakis (2018) both utilize attention. In FSCIL, a model named TOPIC is proposed, which contains a single neural gas (NG) network to learn feature-space topologies as knowledge, and adjusts the NG to preserve stability and enhance adaptation. In Dong et al. (2021), Dong et al. propose an exemplar relation KD-IL framework to balance the tasks of old-knowledge preserving and new-knowledge adaptation as done in Wu et al. (2021). CEC Zhang et al. (2021) is proposed to separate the classifier from the embedding learner, and uses a graph attention network to propagate context cues between classifiers for adaptation. In Hou et al. (2019), Hou et al. address the imbalance between old and new classes by cosine normalization Wang et al. (2017); Gidaris and Komodakis (2018); Hou et al. (2019).

3.5 Uniqueness of Proposed Approach

Different from state-of-the-art approaches to IFSL, we do not follow rehearsal methods; namely, our model learns without memorizing samples Dhar et al. (2019). Retaining samples is often practically infeasible, say, when learning on the fly Mai et al. (2021). Even if there is memory for storing previous samples, there is often a budget, buffer, or queue. Thus, we aim to examine the extreme case of knowledge forgetting, and design IFSL methods for this upper-bound case. For example, although in Kukleva et al. (2021) they do not use any base-class training samples and keep the weights of the base classifier frozen, they still use previous samples in their third phase.

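Figure 3: C2FSCIL and basic idea. In the base session we train on $\mathcal{D}_0$ to get $\Theta_0$. Per incremental session, $\Theta_t$ is trained on the $N$-way $K$-shot support set $\mathcal{S}_t$ based on $\Theta_{t-1}$, and then tested on any class seen in either $\mathcal{D}_0$ or $\mathcal{S}_1, \dots, \mathcal{S}_t$.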

4 A New Problem C2FSCIL and Our Insights

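Given a model parameterized by $\Theta$ and pre-trained on $\mathcal{D}_0 = \{(x_i, y_i)\}$ where $y_i \in \mathcal{Y}_0$, a set of coarse labels, we have a stream of $N$-way $K$-shot support sets $\mathcal{S}_1, \dots, \mathcal{S}_T$ where $\mathcal{S}_t = \{(x_i, y_i)\}_{i=1}^{N \times K}$ and $y_i \in \mathcal{Y}_t$, a set of fine-grained labels. Then, we adapt our model to $\mathcal{S}_t$ over time and update the parameter set from $\Theta_0$ all the way to $\Theta_T$, as shown levelwise in Fig. 3.

For testing, we also have a stream of query sets $\mathcal{Q}_1, \dots, \mathcal{Q}_T$ where $\mathcal{Q}_t = \{(x_i, y_i)\}$ and $y_i \in \bigcup_{s=0}^{t} \mathcal{Y}_s$, which is the generalized union of all label sets till the $t$-th session.

Notably, $\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset$, $\forall i \neq j$. We assume no sample can be retained (unlike rehearsal methods) and the CIL stage only includes (sub-classes of) base classes. At the $t$-th session, only the support set $\mathcal{S}_t$ can be used for training.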
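The bookkeeping of label sets in this setting can be illustrated with a minimal sketch. It assumes the CIFAR-100 configuration reported later (20 coarse classes, 100 fine classes, 10-way sessions); the helper names `build_sessions` and `query_label_space` and the placeholder class ids are hypothetical and only serve the illustration.

```python
# Hypothetical helpers; class ids are placeholders, not real dataset labels.
def build_sessions(num_coarse, num_fine, ways):
    """Split fine class ids into disjoint sessions of `ways` classes each."""
    coarse = [f"coarse_{c}" for c in range(num_coarse)]
    fine = [f"fine_{j}" for j in range(num_fine)]
    sessions = [fine[i:i + ways] for i in range(0, num_fine, ways)]
    return coarse, sessions


def query_label_space(coarse, sessions, t):
    """Labels a query may carry at session t: base classes plus fine classes seen so far."""
    seen = list(coarse)
    for session_labels in sessions[:t]:
        seen.extend(session_labels)
    return seen


coarse, sessions = build_sessions(num_coarse=20, num_fine=100, ways=10)
# Size of the query label space per session, from the base session (t = 0) onwards.
sizes = [len(query_label_space(coarse, sessions, t)) for t in range(len(sessions) + 1)]
print(sizes)
```

The label space only grows: no class is ever removed, while only the labels of the current session are available for training.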

Figure 4: Ablation study of contrastive learning when fine-tuning ResNet12 w/o IL. Left: w/o; right: w/. (CIFAR-100)
Figure 5: Ablation study of freezing embedding-weights for fine-tuning a contrastive model. Left: when not freezing classifier-weights. Right: when freezing them. (CIFAR-100)

4.1 Embeddings need to be contrastively learned

As shown in Fig. 4, straightforward training on coarse labels does not help the subsequent FSL on fine labels much, while contrastive learning self-supervised by the fine cues does help, as seen from now_acc. Thus, a coarsely-trained embedding can be generalizable.

Fig. 5-left shows that freezing embedding-weights outperforms not freezing them. It implies that the embedding space is generalizable without any update, and that, if classifier-weights are not frozen, freezing embedding-weights helps.

4.2 Freezing weights helps, surely for classifiers

However, Fig. 5-right implies that, if classifier-weights are frozen, then freezing embedding-weights does not help.

Comparing the left with the right of Fig. 5, we find that freezing classifier-weights (right) outperforms not doing so (left), whether embedding-weights are frozen (circle) or not (triangle).

4.3 Weights need to be normalized

As shown in Fig. 6, samples of classes seen in the 1st session are totally classified to classes seen in the 2nd session, while only samples of the present classes can be correctly classified. Plotting the weight norms, we find that they grow, and we propose a conjecture implying a need for normalization. Please see our analysis in the Appendix.

Conjecture 1 (FC weights grow over time). Let $\|\mathbf{W}_t\|_F$ denote the Frobenius norm of the weight matrix formed by all weight vectors in the FC layer for new classes in the $t$-th session. With training converged and norm outliers ignored, it holds that $\|\mathbf{W}_t\|_F < \|\mathbf{W}_{t+1}\|_F$.
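Why growing weight norms call for normalization can be seen on a toy example. The sketch below uses made-up 2-D vectors (all numbers are assumptions chosen for illustration): a query from an old class is better aligned with the old weight vector, yet the raw inner product prefers the new class simply because its weight norm is larger.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy feature of a sample from an OLD class (values are illustrative only).
feature = [1.0, 0.2]
w_old = [1.0, 0.0]  # well aligned with the feature, small norm
w_new = [2.0, 3.0]  # poorly aligned, but with a much larger norm

raw = {"old": dot(w_old, feature), "new": dot(w_new, feature)}
cos = {"old": cosine(w_old, feature), "new": cosine(w_new, feature)}

pred_raw = max(raw, key=raw.get)  # decision by raw inner product
pred_cos = max(cos, key=cos.get)  # decision by cosine similarity
print(pred_raw, pred_cos)
```

With these numbers the raw logits are 1.0 (old) versus 2.6 (new), so the inner-product rule assigns the query to the new class, whereas the cosine rule keeps it in the old class.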


4.4 The balance needs to be measured

As already shown in the bottom-right sub-plot of Fig. 2, it is possible to avoid the collapse of coarse-class accuracy, and slow down the accuracy drop of all the previous classes, while still maintaining a high accuracy on present classes.

Old knowledge and current learning can be balanced, which can be achieved not only on CIFAR-100 but also more generally on BREEDS Santurkar et al. (2020). Fig. 7 shows the balanced performance on its various subsets. In order to better measure how good the balance is, we also need new overall metrics.

Figure 6: 10-way 5-shot confusion matrix (left) and visualization of the norm of raw weights (mid-right) in the last layer for old/new classes. As each session can only access labels of the present classes, a linear classifier will have a larger weight for the current classes’ neurons, inducing the queries of previous classes to be likely assigned into current classes’ region (left) in the embedding space. (CIFAR-100)

(a) living17
(b) nonliving26
(c) entity13
(d) entity30
Figure 7: Reaching a balance on BREEDS. More in Sec. 6.

5 A New Approach: Know-weight (Knowe)

5.1 Learning Embedding-Weights Contrastively

Now, we elaborate on the Model of Fig. 3 about how we train a generalizable base embedding space Tian et al. (2020); Liu et al. (2021a).

We follow ANCOR Bukchin et al. (2021) to use MoCo He et al. (2020) as the backbone, and keep two network streams each of which contains a backbone with the last-layer FC replaced by a Multi-Layer Perceptron (MLP). The hidden layers of the two streams’ MLPs output intermediate $q$ and $k$, respectively. Given coarse labels, the total loss is defined as $L = L_{con} + L_{CE}$ where

$$L_{con} = -\log \frac{\exp(q_i^\top k_i / \tau)}{\exp(q_i^\top k_i / \tau) + \sum_{n \neq i,\, y_n = y_i} \exp(q_i^\top k_n / \tau)}$$

and $L_{CE}$ is the standard cross-entropy loss that captures the inter-class cues. We also use angular normalization Bukchin et al. (2021) to improve their synergy. Note that $i$ and $n$ index samples, $\tau$ is a temperature parameter, and $k_n$ denotes the intermediate output of the $n$-th sample, a negative sample in the same class as the $i$-th sample, a positive sample, so as to capture intra-class cues (fine cues) and reduce unnecessary noise for the subsequent fine-grained classification Xu et al. (2021). $L_{con}$ will be small when $q_i$ is similar to $k_i$ and different from $k_n$.
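The behaviour described above (a small loss when the query is similar to its positive key and different from same-class negatives) can be sketched for a single query. This is an InfoNCE-style illustration under assumptions, not the training code: the function name, the L2 normalization of all vectors, and the temperature value 0.2 are choices made here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def intra_class_contrastive_loss(q, k_pos, k_negs, tau=0.2):
    """Pull q towards k_pos and push it away from negatives of the same coarse class."""
    q = l2_normalize(q)
    pos = math.exp(dot(q, l2_normalize(k_pos)) / tau)
    neg = sum(math.exp(dot(q, l2_normalize(k)) / tau) for k in k_negs)
    return -math.log(pos / (pos + neg))


q = [1.0, 0.0]
# Positive key close to the query, negatives far from it: low loss.
aligned = intra_class_contrastive_loss(q, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive key far from the query, one negative close to it: high loss.
confused = intra_class_contrastive_loss(q, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(aligned < confused)
```

Because the negatives share the coarse label of the query, minimizing this term separates samples within a coarse class, which is where the fine cues come from.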

5.2 Normalizing Classifier-Weights

In the last layer, we set the bias term to $0$. For a sample $x$, once the $j$-th neuron has its output logit $o_j = w_j^\top x$ ready, a Softmax activation function is applied to convert $o_j$ to a probability $p_j$ so that we can classify $x$. ($\top$ is transpose)

However, such an inner-product linear classification often favors new classes Hou et al. (2019). Instead, we compute the logit using the normalized inner-product Wang et al. (2017) (a.k.a. cosine similarity, cosine normalization Gidaris and Komodakis (2018); Hou et al. (2019)) as $o_j = \bar{w}_j^\top \bar{x}$ where the $\ell_2$-normalized $\bar{w}_j = w_j / \|w_j\|_2$ and $\bar{x} = x / \|x\|_2$, and then apply Softmax to the rescaled logit as

$$p_j = \frac{\exp(o_j / \tau)}{\sum_{c} \exp(o_c / \tau)}$$

where $j$ is the class index and $\tau$ is a temperature parameter that rescales the Softmax distribution, as $o_j$ is in the range of $[-1, 1]$.
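The role of the temperature can be checked numerically. Since cosine logits are bounded, a plain Softmax over them is nearly flat; dividing by a small temperature sharpens it. The sketch below assumes toy 2-D weights and the temperature values 1.0 and 0.1, none of which come from the experiments; `cosine_softmax` is a hypothetical name.

```python
import math


def cosine_softmax(x, weights, tau):
    """Softmax over cosine similarities between x and each class weight, rescaled by tau."""
    x_norm = math.sqrt(sum(a * a for a in x))
    logits = []
    for w in weights:
        w_norm = math.sqrt(sum(a * a for a in w))
        logits.append(sum(a * b for a, b in zip(w, x)) / (w_norm * x_norm))
    exps = [math.exp(o / tau) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 0.2]
weights = [[1.0, 0.0], [2.0, 3.0], [0.0, -1.0]]  # illustrative class weights

flat = cosine_softmax(x, weights, tau=1.0)   # bounded logits give a flat distribution
sharp = cosine_softmax(x, weights, tau=0.1)  # small temperature gives a peaked one
print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
```

Both settings pick the same class; the temperature only changes how confident the distribution is, which matters for the gradient of the cross-entropy loss.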

5.3 Freezing Memorized Classifier-Weights

As shown in Fig. 3, in the $t$-th incremental session, the task is similar to FSL where a support set $\mathcal{S}_t$ is offered to train a model to be evaluated on a query set $\mathcal{Q}_t$. However, FSL only evaluates the classification accuracy of the classes that appear in the support set $\mathcal{S}_t$. In our setting, the query set $\mathcal{Q}_t$ contains base classes and all classes in previous support sets. As shown in Fig. 5, whether or not freezing embedding-weights helps, it does not hurt. We therefore freeze them, hoping to reduce model complexity and avoid over-fitting.

As past samples are not retained, we store the classifier-weights per session to implicitly retain the label information by augmenting a weight matrix $W$: in the $t$-th session, the weight vectors of the new classes, each in $\mathbb{R}^d$ where $d$ is the feature dimension, are appended to the weight vectors stored in the previous sessions.

In the $t$-th session, we minimize the following regularized cross-entropy loss on the support set $\mathcal{S}_t$:


where $\mathbb{1}(\cdot)$ is the indicator function and $p_j$ is the output probability (i.e., Softmax of logits $o_j$) of the $j$-th class.
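How stored weights can be frozen while only the new-class weights are trained may be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual implementation: the class name `FrozenCosineClassifier`, per-sample SGD, the temperature 0.1, the learning rate, and the initial vectors are all hypothetical, and no extra regularization term is included.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


class FrozenCosineClassifier:
    """Hypothetical sketch: one weight vector per class; old vectors are frozen."""

    def __init__(self, tau=0.1):
        self.tau = tau
        self.weights = []  # one vector per class, in order of arrival
        self.frozen = []   # True for classes learned in finished sessions

    def add_classes(self, init_vectors):
        # Freeze everything stored so far, then append the new class vectors.
        self.frozen = [True] * len(self.weights)
        self.weights += [list(v) for v in init_vectors]
        self.frozen += [False] * len(init_vectors)

    def probs(self, feature):
        f = unit(feature)
        logits = [dot(unit(w), f) / self.tau for w in self.weights]
        m = max(logits)
        exps = [math.exp(o - m) for o in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def sgd_step(self, feature, label, lr=0.1):
        """One cross-entropy step on a support sample; frozen weights are skipped."""
        f = unit(feature)
        p = self.probs(feature)
        for j, w in enumerate(self.weights):
            if self.frozen[j]:
                continue
            w_hat, norm = unit(w), math.sqrt(dot(w, w))
            cos = dot(w_hat, f)
            coeff = (p[j] - (1.0 if j == label else 0.0)) / (self.tau * norm)
            # Gradient of the rescaled cosine logit w.r.t. the raw weight vector.
            grad = [coeff * (fi - cos * wi) for fi, wi in zip(f, w_hat)]
            self.weights[j] = [wi - lr * g for wi, g in zip(w, grad)]


clf = FrozenCosineClassifier()
clf.add_classes([[1.0, 0.0], [0.0, 1.0]])  # earlier session: two classes
old = [list(w) for w in clf.weights]
clf.add_classes([[0.5, 0.5]])              # current session: one new class
for _ in range(50):
    clf.sgd_step([1.0, 1.2], label=2)      # a support sample of the new class
p = clf.probs([1.0, 1.2])
print(clf.weights[:2] == old, p.index(max(p)))
```

The old weight vectors still take part in the Softmax, so the new class is learned relative to them, but they receive no update and therefore cannot drift.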

5.4 Theoretical Guarantee for Stability-Plasticity

We extend definitions in Wang et al. (2021) to set the base of our analysis. Please see also proofs in the Appendix.

Definition 1 (Stability Decay). For the same input sample, let denote the output logits of the -th neuron in the last layer in the -th session. After the loss reaches the minimum, we define the degree of stability as .

Definition 2 (Relative Stability). Given models and , if , then we say is more stable than .

Assuming embedding-weights are frozen, then we have:

Proposition 1 (Normalizing or freezing weights improves stability; doing both improves the most). Given , if we only normalize weights of a linear FC classifier, we obtain ; if we only freeze them, we obtain ; if we do both, we obtain . Then, and .

Our second claim is about normalization for plasticity.

Proposition 2 (Weights normalized, plasticity remains). To train our FC classifier, if we denote the loss as where is normalized, the weight update at each step as , and the learning rate as , then we have .

Notably, freezing the weights does not affect plasticity.

5.5 New Overall Performance Measures

In this section, we evaluate the model after each session with the query set $\mathcal{Q}_t$, and report the Top-1 accuracy. The base session only contains coarse labels, and thus is evaluated by the coarse-grained classification accuracy $A_c$. We evaluate $A_c$, the fine-grained accuracy $A_f$, and the total accuracy $A_{total}$ per incremental session, except the last session when only fine labels are available and $A_c$ is not evaluated. We average $A_{total}$ over all sessions, with session $0$ being the base session and $T$ the number of incremental sessions, to obtain an overall performance score as

$$\bar{A} = \frac{1}{T+1} \sum_{t=0}^{T} A_{total}^{(t)}$$

Inspired by Belouadah and Popescu (2020), we define the fine-class forgetting rate


and the forgetting rate for the base coarse class as


With them, we can evaluate the model with an overall measure to represent the catastrophic forgetting rate as


where is the number of incremental sessions; is the number of appeared fine classes until the -th session, and is fine-class total number; and are the accuracy of coarse and fine classes per session, respectively.
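The overall performance score can be checked against the reported numbers. The sketch below averages the per-session total accuracy of Knowe on living17 (sessions 0 to 7, copied from Table 3); the function name `overall_score` is hypothetical, and the forgetting-rate measures are not covered here.

```python
def overall_score(total_acc_per_session):
    """Mean of the per-session total accuracy, base session included."""
    return sum(total_acc_per_session) / len(total_acc_per_session)


# Knowe on living17, sessions 0..7, copied from Table 3.
knowe_living17 = [94.21, 63.63, 50.88, 43.82, 42.84, 40.29, 47.75, 53.53]
score = overall_score(knowe_living17)
print(round(score, 2))  # 54.62, the overall score listed for Knowe in Table 3
```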

6 Experiments

6.1 Datasets and Results

CIFAR-100 contains 60,000 32x32 images from 100 fine classes, each of which has 500 training images and 100 test images Krizhevsky (2009). They can be grouped into 20 coarse classes, each of which includes 5 fine classes, e.g., trees contains maple, oak, pine, palm, and willow. The 100 fine classes are divided into ten 10-way 5-shot incremental sessions.


BREEDS is derived from ImageNet with class hierarchy re-calibrated by Santurkar et al. (2020) and contains subsets named living17, nonliving26, entity13, and entity30. They have 17, 26, 13, 30 coarse classes and 4, 4, 20, 8 fine classes per coarse class, respectively, with images sized at 224x224. See also Table 2.


tieredImageNet (tIN) is a subset of ImageNet and contains 608 classes Ren et al. (2018) that are grouped into 34 high-level super-classes to ensure that the training classes are distinct enough from the test classes semantically. The train/val/test sets have 20, 6, 8 coarse classes and 351, 97, 160 fine classes, respectively, with images sized at 84x84.

Table 2 summarizes our performance. Fig. 7 shows our separated accuracy on BREEDS, and Fig. 8 visualizes confusion matrices to show the evolving of per-class accuracy.

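| Dataset | coarse# | fine# | total# | sessions | way/shot | queries | overall score | forgetting rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-100 | 20 | 100 | 120 | 10 | 10/5 | 15 | 38.50 | 0.42 |
| living17 | 17 | 68 | 85 | 7 | 10/1 | 15 | 54.62 | 0.33 |
| nonliving26 | 26 | 104 | 130 | 11 | 10/1 | 15 | 48.41 | 0.25 |
| entity13 | 13 | 260 | 273 | 13 | 20/1 | 15 | 41.45 | 0.38 |
| entity30 | 30 | 240 | 270 | 8 | 30/1 | 15 | 47.79 | 0.32 |
| tieredImageNet | 20 | 351 | 371 | 10 | 36/5 | 15 | 33.24 | 0.39 |

Table 2: Dataset setting and performance. # is class num.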
Figure 8: Confusion matrices of Knowe tested on living17.

6.2 Implementation Details

We use ResNet-50 on BREEDS, ‘-12’ on CIFAR-100 and ‘-12’ on tIN, train except FC using ANCOR, use SGD with a momentum , as well as set weight decay to 5e-4, batch size to , to , and to . The learning rate is for , and is for ,, etc. for epochs.

6.3 Ablation Study

Impact of base contrastive learning. Fig. 4 already illustrates its benefit for a simple model without any weight operation. As shown in Fig. 9(a), Knowe also obtains a better performance than not using MoCo in Knowe’s base, which verifies that the contrastively-learned base model helps fine-grained recognition. Starting from almost the same fine accuracy in the 2nd session, the gap between w/ MoCo and w/o MoCo increases, as the former stably outperforms the latter on current classes. It verifies that the former can learn more fine knowledge than the latter. Given that there are only a few fine-class samples, the extra fine-grained knowledge is likely from the contrastively-learned base model.

(a) Base contrastive learning?
(b) Freezing embedding-weights?
(c) Normalizing classifier-weights?
(d) Freezing classifier-weights?
Figure 9: 4-factor ablation study on living17, separated acc.

Impact of freezing embedding-learner weight (decoupling). It has been verified in Sec. 4.1 that, if classifier-weights are neither frozen nor normalized, freezing the embedding-weights helps. We have Conjecture 2: $\neg A \wedge \neg B \rightarrow C$, where $A$ is a premise that classifier-weights are normalized, $B$ is another that classifier-weights are frozen, and $C$ is a conclusion that freezing embedding-weights improves Knowe’s performance. However, Fig. 9(b) illustrates that, for Knowe, freezing embedding-weights induces only a slightly better performance than not freezing them. It implies that, if classifier-weights are normalized and frozen, then freezing the embedding-weights does not help ($A \wedge B \rightarrow \neg C$), which is shown by small changes of the overall performance score and the forgetting rate in Table 3.

Impact of normalizing classifier-weight. Fig. 6 has already shown that, with a linear classifier, the weight norms of new classes totally surpass the weight norms of previous classes, which biases the linear classifier towards new classes (i.e., any sample of a previous class can be classified as a new class). That implies a need for normalizing the classifier-weights. As shown in Fig. 9(c), when we freeze weights of previous classes and only tune the weights of new classes without normalization, the model performs stably worse than Knowe with normalization, which verifies that normalizing classifier-weights plays a positive role.

Impact of freezing memorized classifier-weights. As shown in Fig. 9(d), there is severe CF of both fine and coarse knowledge when not freezing the weights of previously-seen classes, which implies that little knowledge is retained. Although embedding-weights are frozen and classifier-weights are normalized, the coarse knowledge is totally forgotten. It implies that, if classifier-weights are normalized and yet not frozen, freezing the embedding-weights does not help. This can be explained by the fact that fine-tuning on a few samples normally induces little change to the embedding-weights and yet great change to classifier-weights. Moreover, the model without freezing classifier-weights performs much worse than Knowe, which freezes previous weights. The gap of the fine accuracy increases over time and is larger than the gap of the present accuracy. It implies that they also differ in the performance of previous fine classes, which is the CF of learned fine knowledge.

More about freezing embedding-weights. In Sec. 4.2, we know that, if classifier-weights are frozen, freezing embedding-weights does not help. Thus, we have Conjecture 3: freezing embedding-weights does not help if and only if classifier-weights are either normalized or frozen. Please see also the Appendix for the analysis.

Overall finding. A decent now_acc seems to be a condition for weight freezing and normalization to be effective.

Method Contr. learn. Decoupled Normalization Frozen Total accuracy per session
0 1 2 3 4 5 6 7 8
(a) Base w/o MoCo 93.18 33.04 26.37 31.08 29.51 35.10 34.71 37.84 N/A 40.10 0.50
(b) FT w/ weight op. 94.21 63.14 47.45 40.10 41.47 34.80 40.59 43.53 N/A 50.66 0.35
(c) Knowe w/o norm. 94.50 17.84 14.02 22.26 21.28 24.71 26.77 24.80 N/A 30.77 0.57
(d) FT last layer 94.21 12.06 11.28 12.26 12.26 12.55 12.65 9.51 N/A 22.09 0.66
LwF+ Li and Hoiem (2016) 94.50 61.47 44.61 27.45 19.12 11.28 6.37 4.22 N/A 33.63 0.51
ScaIL Belouadah and Popescu (2020) 94.50 38.63 25.59 31.08 30.29 35.10 37.84 41.08 N/A 41.76 0.48
Weight Align+ Zhao et al. (2020) 94.50 50.98 37.94 38.43 37.06 35.20 39.80 43.24 N/A 47.14 0.40
Subsp. Reg.+ Akyürek et al. (2021) 94.50 59.41 39.51 33.43 29.31 25.59 27.84 26.47 N/A 42.01 0.40
Knowe (Ours) 94.21 63.63 50.88 43.82 42.84 40.29 47.75 53.53 N/A 54.62 0.33
ANCOR Bukchin et al. (2021) 94.50 11.86 11.18 12.35 11.77 12.55 10.78 9.02 N/A 21.75 0.66
Jt. train. (upp. bd.) 94.21 63.63 58.53 52.26 46.28 47.75 36.96 42.75 N/A 55.29 0.25
LwF+ Li and Hoiem (2016) 89.48 65.03 48.69 22.72 9.36 6.03 4.61 2.86 3.33 28.01 0.47
ScaIL Belouadah and Popescu (2020) 89.48 39.25 25.50 22.44 23.69 25.75 30.81 32.08 35.25 36.03 0.48
Weight Align+ Zhao et al. (2020) 89.48 47.36 37.06 31.72 30.56 32.28 34.11 36.39 37.06 41.78 0.42
Subsp. Reg.+ Akyürek et al. (2021) 89.48 42.39 28.94 20.86 16.14 16.44 16.75 16.17 16.06 29.25 0.48
Knowe (Ours) 87.90 63.22 49.22 37.75 34.78 36.25 38.03 40.08 42.83 47.79 0.32
ANCOR Bukchin et al. (2021) 89.48 8.67 8.28 9.50 6.83 8.75 9.53 8.19 8.69 17.55 0.61
Jt. train. (upp. bd.) 87.90 63.22 56.56 53.72 47.36 44.78 41.61 38.06 36.75 52.22 0.20
Table 3: Ablation study of 4 factors and comparison with others on BREEDS living17 (top) and entity30 (bottom). Best seen on computer.
Method Contr. learn. Decoupled Normalization Frozen 0 1 2 3 4 5 6 7 8 9 10 11 12 13
LwF+ Li and Hoiem (2016) 86.94 65.51 58.14 44.17 22.76 14.36 9.68 6.92 5.90 5.19 5.32 3.40 N/A N/A 27.36 0.38
ScaIL Belouadah and Popescu (2020) 86.94 36.09 24.10 21.47 23.27 23.65 27.95 31.80 34.23 36.09 37.76 38.14 N/A N/A 35.12 0.43
Weight Align.+ Zhao et al. (2020) 86.94 61.41 46.03 40.00 35.77 34.10 35.96 33.21 35.51 36.60 37.56 37.76 N/A N/A 43.40 0.29
Subsp. Reg.+ Akyürek et al. (2021) 86.94 63.59 52.56 42.95 35.96 31.41 28.01 26.15 23.27 19.68 19.36 20.19 N/A N/A 37.51 0.25
Knowe (Ours) 86.23 65.90 53.08 46.80 42.82 38.91 41.22 39.10 40.06 41.80 42.44 42.63 N/A N/A 48.41 0.25
ANCOR Bukchin et al. (2021) 86.94 5.83 6.03 6.92 5.90 6.60 7.63 7.05 7.05 7.50 7.44 2.63 N/A N/A 13.13 0.61
Jt. train. (upp. bd.) 86.23 65.90 60.51 59.04 53.53 53.85 46.73 46.60 43.85 36.67 37.31 36.80 N/A N/A 52.25 0.16
LwF+ Li and Hoiem (2016) 92.03 59.10 43.64 18.49 10.49 6.82 3.59 2.54 3.10 2.56 2.10 2.23 1.77 1.54 17.86 0.52
ScaIL Belouadah and Popescu (2020) 92.03 37.10 13.92 13.36 14.87 18.36 21.72 23.28 24.33 27.62 29.59 31.54 32.36 34.08 29.58 0.49
Weight Align+ Zhao et al. (2020) 92.03 36.74 24.15 20.51 22.31 24.82 26.41 26.85 27.26 31.49 32.26 35.28 36.72 37.69 33.89 0.46
Subsp. Reg.+ Akyürek et al. (2021) 92.03 52.72 28.95 15.92 12.08 10.82 10.90 11.49 12.05 12.03 11.77 11.72 12.54 14.36 22.10 0.45
Knowe (Ours) 91.35 66.90 45.69 35.54 30.56 29.21 30.10 29.95 30.85 33.74 35.36 38.54 40.26 42.21 41.45 0.38
ANCOR Bukchin et al. (2021) 92.03 5.36 5.67 5.49 5.18 6.51 5.82 4.80 5.39 6.28 5.36 5.13 5.26 5.62 11.71 0.57
Jt. train. (upp. bd.) 91.35 66.90 57.54 49.92 50.59 48.64 47.69 44.41 41.72 39.13 39.62 40.72 38.49 37.26 49.57 0.24
Table 4: Comparison with others on BREEDS nonliving26 (top) and entity13 (bottom). Bold is the best, slanted is 2nd. Best seen on computer.
Method Contr. learn. Decoupled Normalization Frozen 0 1 2 3 4 5 6 7 8 9 10
LwF+ Li and Hoiem (2016) 78.39 41.87 28.00 23.80 14.93 10.53 8.00 8.80 6.47 7.33 6.73 21.35 0.51
ScaIL Belouadah and Popescu (2020) 78.39 14.47 14.13 18.07 21.00 25.20 26.20 31.87 32.60 36.53 38.20 30.61 0.52
Weight Align.+ Zhao et al. (2020) 78.39 13.20 14.13 18.20 21.20 24.60 26.93 32.33 32.60 38.93 38.46 30.82 0.53
Subsp. Reg.+ Akyürek et al. (2021) 78.39 41.47 31.80 32.87 26.73 25.73 25.27 26.73 24.27 25.73 24.00 33.00 0.43
Knowe (Ours) 72.07 36.00 28.13 30.27 32.20 31.20 30.93 36.33 39.27 43.20 43.93 38.50 0.42
ANCOR Bukchin et al. (2021) 78.39 7.93 7.13 8.27 7.80 8.60 6.40 7.53 6.93 8.20 8.33 14.14 0.59
Jt. train. (upp. bd.) 72.07 36.00 37.07 40.27 40.13 41.33 38.60 41.13 40.47 41.40 43.47 42.90 0.33
LwF+ Li and Hoiem (2016) 87.64 69.36 13.88 4.22 4.05 4.03 3.02 2.74 1.44 1.05 1.06 17.50 0.55
ScaIL Belouadah and Popescu (2020) 87.64 48.51 33.12 26.15 22.66 22.77 23.42 22.72 23.38 25.17 26.65 32.93 0.40
Weight Align.+ Zhao et al. (2020) 87.64 25.13 18.63 18.37 20.08 22.20 24.22 24.73 26.71 29.00 30.45 29.74 0.48
Subsp. Reg.+ Akyürek et al. (2021) 87.64 49.73 32.06 24.35 20.95 20.76 20.84 21.12 21.79 23.15 24.31 31.52 0.42
Knowe (Ours) 76.15 48.24 30.60 25.60 22.34 23.48 24.79 24.69 27.65 30.26 31.87 33.24 0.39
ANCOR Bukchin et al. (2021) 87.64 7.10 6.69 6.55 6.36 6.57 6.42 6.55 6.55 6.40 5.17 13.82 0.61
Jt. train. (upp. bd.) 76.15 48.24 39.89 34.09 32.21 30.85 28.81 29.86 28.57 28.74 29.06 36.95 0.32
Table 5: Comparison with others on CIFAR-100 (top table) and tieredImageNet (bottom table). ’+’ means improvement. Best seen on computer.
Figure 10: Separated accuracy comparison on all datasets. Top-down: total, coarse, fine; red is Knowe, grey is joint training.

6.4 Performance Comparison and Analysis

Tables 3, 4, and 5 and Fig. 10 compare Knowe with SOTA FSCIL/IL methods, including LwF Li and Hoiem (2016), ScaIL Belouadah and Popescu (2020), Weight Aligning Zhao et al. (2020), and Subspace Regularizers (Sub. Reg.) Akyürek et al. (2021). Joint training is non-IL and serves, in principle, as an upper bound on accuracy.

Overall metrics: average accuracy and forgetting rate. As presented in Tables 3, 4, and 5, Knowe has the smallest forgetting rate and the largest average accuracy on all datasets. On both metrics, Weight Aligning ranks 2nd on BREEDS, Sub. Reg. ranks 2nd on CIFAR-100, and ScaIL ranks 2nd on tieredImageNet (tIN). The two metrics are consistent with each other. LwF often has poor numbers, which suggests that, with no samples retained, knowledge distillation (KD) does not help.
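As a small worked check of the average-accuracy column, the sketch below recomputes it for three CIFAR-100 rows of Table 5. It assumes that average accuracy is the arithmetic mean of the per-session total accuracies; that definition is an assumption here, although it reproduces the reported values (33.00, 38.50, 42.90) to two decimals. The forgetting rate is not recomputed, since its formula is not restated in this section. The names `cifar100_rows` and `average_accuracy` are illustrative only.

```python
# Per-session total accuracies (%) copied from the CIFAR-100 rows of Table 5.
cifar100_rows = {
    "Subsp. Reg.+": [78.39, 41.47, 31.80, 32.87, 26.73, 25.73,
                     25.27, 26.73, 24.27, 25.73, 24.00],
    "Knowe": [72.07, 36.00, 28.13, 30.27, 32.20, 31.20,
              30.93, 36.33, 39.27, 43.20, 43.93],
    "Joint training": [72.07, 36.00, 37.07, 40.27, 40.13, 41.33,
                       38.60, 41.13, 40.47, 41.40, 43.47],
}


def average_accuracy(per_session):
    """Arithmetic mean over sessions (assumed definition of the metric)."""
    return sum(per_session) / len(per_session)


# Round to two decimals, as in the table.
averages = {name: round(average_accuracy(acc), 2)
            for name, acc in cifar100_rows.items()}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Under this assumed definition, a method that drops sharply after session 0 but recovers in later sessions (as Knowe does on CIFAR-100) can still obtain a higher average than a method that starts higher and keeps decaying.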

Total accuracy per session decreases over time, yet more and more slowly, for Knowe and SOTA methods. However, the best-performing methods decrease first and then rise, because the proportion of fine classes in the query set gets higher and their accuracy plays a leading role in the total accuracy. Knowe is the best, with a strong rising trend, which best satisfies the aim of CIL and suggests that Knowe will continue to perform well when more sessions are added (Table 4). Sub. Reg. and Weight Align. often have the 2nd-best numbers (both freeze weights); ScaIL and LwF occasionally do.

Coarse class accuracy unavoidably decreases over time (see Fig. 10), while Knowe and SOTA methods slow down the decay at comparable rates. As IL methods, Weight Aligning, ScaIL, and LwF do not totally forget knowledge, although they do not operate on weights as Knowe does. As a non-IL approach, ANCOR totally forgets old knowledge from the 1st session because it fine-tunes on the few fine shots without any extra operation to retain coarse knowledge. Joint training on all fine classes seen so far is non-IL and, in principle, should bound the fine-class performance. Interestingly, it also suffers less from coarse accuracy decay, the rate of which is much lower (Fig. 10). Its decay can have a different cause: the imbalance between the increasing fine classes and the existing coarse classes. Knowe’s performance is very competitive and indeed bounded by joint training.

Fine-class total accuracy normally decreases over time, yet more and more slowly, for Knowe and SOTA methods (Fig. 10), and stays in a similar range for most methods: Knowe often stays the highest, ScaIL and Weight Aligning are in the middle, Sub. Reg. often stays at a low level, and LwF and ANCOR are consistently the worst. Knowe is the most balanced, while Sub. Reg. is biased towards stability, which is its drawback. Joint training does not necessarily bound the accuracy, possibly due to the few shots.

7 Conclusion

In this paper, we present a challenging new problem, new metrics, insights, and an approach that solves it well in the sense of getting more balanced performance than the state-of-the-art approaches.

While it is not new to freeze or normalize weights, we are unaware of them previously being presented as a principled approach (to CIL) that is as simple as fine-tuning. It makes pre-trained big models more useful for finer-grained tasks.

For C2FSCIL with a linear classifier, weights seem to be the knowledge. However, how generic are our findings in practice? Can they be applied to general FSCIL? If yes, we are more comfortable with that answer, but then how does a class hierarchy make a difference? Future work will include examining those questions, non-linear classifiers, and so on.


  • [1] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NeurIPS, Cited by: §3.1.
  • [2] M. Abdelsalam, M. Faramarzi, S. Sodhani, and S. Chandar (2021) IIRC: Incremental Implicitly-Refined Classification. In CVPR, Cited by: Table 1.
  • [3] A. F. Akyürek, E. Akyürek, D. Wijaya, and J. Andreas (2021) Subspace regularizers for few-shot class incremental learning. ArXiv preprint:2110.07059. Cited by: §6.4, Table 3, Table 4, Table 5.
  • [4] E. Belouadah, A. Popescu, and I. Kanellos (2020) Initial classifier weights replay for memoryless class incremental learning. ArXiv preprint:2008.13710. Cited by: §3.2.
  • [5] E. Belouadah and A. Popescu (2020) ScaIL: Classifier weights scaling for class incremental learning. In IEEE/CVF Winter Conference on Applications of Computer Vision, Cited by: §3.2, §5.5, §6.4, Table 3, Table 4, Table 5.
  • [6] A. Bendale and T. Boult (2015) Towards open world recognition. In CVPR, Cited by: §2.3.
  • [7] N. Bendre, H. T. Marín, and P. Najafirad (2020) Learning from few samples: a survey. ArXiv preprint:2007.15484. Cited by: §3.3.
  • [8] G. Bukchin, E. Schwartz, K. Saenko, O. Shahar, R. Feris, R. Giryes, and L. Karlinsky (2021) Fine-grained angular contrastive learning with coarse labels. In CVPR, Cited by: §2.2, Table 1, §5.1, Table 3, Table 4, Table 5.
  • [9] H. Cha, J. Lee, and J. Shin (2021) Co2L: contrastive continual learning. In ICCV, Cited by: §1.
  • [10] M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.3, §3.1.
  • [11] P. Dhar, R. V. Singh, K. Peng, Z. Wu, and R. Chellappa (2019) Learning without memorizing. In CVPR, Cited by: §3.5.
  • [12] S. Dong, X. Hong, X. Tao, X. Chang, and X. Wei (2021) Few-shot class-incremental learning via relation knowledge distillation. In AAAI, Cited by: §2.3, §2.4, §3.4.
  • [13] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, Cited by: §3.3.
  • [14] D. Fotakis, A. Kalavasis, V. Kontonis, and C. Tzamos (2021) Efficient algorithms for learning from coarse labels. In 34th Annual Conference on Learning Theory, Cited by: §1, §2.2.
  • [15] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In CVPR, Cited by: §2.3, Table 1, §3.4, §5.2.
  • [16] C. He, R. Wang, and X. Chen (2021) A tale of two cils: the connections between class incremental learning and class imbalanced learning, and beyond. In CVPR Workshops, Cited by: §3.2.
  • [17] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §5.1.
  • [18] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In CVPR, Cited by: §3.4, §5.2.
  • [19] X. Hu, K. Tang, C. Miao, X. Hua, and H. Zhang (2021) Distilling causal effect of data in class-incremental learning. In CVPR, Cited by: §3.2.
  • [20] S. C. Y. Hung, C. Tu, C. Wu, C. Chen, Y. Chan, and C. Chen (2019) Compacting, picking and growing for unforgetting continual learning. In NeurIPS, Cited by: §3.2.
  • [21] S. Jung, H. Ahn, S. Cha, and T. Moon (2020) Continual learning with node-importance based adaptive group sparse regularization. In NeurIPS, Cited by: §3.2.
  • [22] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: §1, §3.1, §3.2.
  • [23] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. University of Toronto: Technical Report. Cited by: Figure 1, §6.1.
  • [24] A. Kukleva, H. Kuehne, and B. Schiele (2021) Generalized and incremental few-shot learning by explicit learning and calibration without forgetting. In ICCV, Cited by: §3.2, §3.5.
  • [25] I. Kuzborskij, F. Orabona, and B. Caputo (2013) From n to n+1: multiclass transfer incremental learning. In CVPR, Cited by: §2.3.
  • [26] J. Lee, H. G. Hong, D. Joo, and J. Kim (2020) Continual learning with extended kronecker-factored approximate curvature. In CVPR, Cited by: §3.2.
  • [27] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In NIPS, Cited by: §3.2.
  • [28] Z. Li and D. Hoiem (2016) Learning without forgetting. In ECCV, Cited by: §3.1, §6.4, Table 3, Table 4, Table 5.
  • [29] C. Liu, Y. Fu, C. Xu, S. Yang, J. Li, C. Wang, and L. Zhang (2021) Learning a few-shot embedding model with contrastive learning. In AAAI, Cited by: §5.1.
  • [30] Y. Liu, B. Schiele, and Q. Sun (2021) Adaptive aggregation networks for class-incremental learning. In CVPR, Cited by: §3.2.
  • [31] Y. Liu, Y. Su, A. Liu, B. Schiele, and Q. Sun (2020) Mnemonics training: multi-class incremental learning without forgetting. In CVPR, Cited by: §3.2.
  • [32] R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3. Cited by: §1, §2.1.
  • [33] Z. Mai, R. Li, J. Jeong, D. Quispe, H. Kim, and S. Sanner (2021) Online continual learning in image classification: an empirical survey. ArXiv preprint:2101.10423. Cited by: §1, §2.3, §3.5.
  • [34] A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In ECCV, Cited by: §3.2.
  • [35] A. Mallya and S. Lazebnik (2018) Packnet: adding multiple tasks to a single network by iterative pruning. In CVPR, Cited by: §3.2.
  • [36] M. Masana, X. Liu, B. Twardowski, M. Menta, A. D. Bagdanov, and J. van de Weijer (2020) Class-incremental learning: survey and performance evaluation on image classification. ArXiv preprint:2010.15277. Cited by: §3.1.
  • [37] M. McCloskey and N. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. The Psychology of Learning and Motivation 24, pp. 109–164. Cited by: §2.1.
  • [38] M. Mermillod, A. Bugaiska, and P. Bonin (2013) The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology 4. Cited by: §1.
  • [39] P. Pan, S. Swaroop, A. Immer, R. Eschenhagen, R. E. Turner, and M. E. Khan (2020) Continual deep learning by functional regularisation of memorable past. In NeurIPS, Cited by: §3.2.
  • [40] H. Qi, M. Brown, and D. G. Lowe (2018) Low-shot learning with imprinted weights. In CVPR, Cited by: §3.2.
  • [41] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: Incremental Classifier and Representation Learning. In CVPR, Cited by: §2.3, §3.1.
  • [42] M. Ren, R. Liao, E. Fetaya, and R. S. Zemel (2019) Incremental few-shot learning with attention attractor networks. In NeurIPS, Cited by: §2.3, §3.3, §3.4.
  • [43] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. In ICLR, Cited by: §6.1.
  • [44] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, Cited by: §3.2.
  • [45] M. Ristin, J. Gall, M. Guillaumin, and L. V. Gool (2015) From categories to subcategories: large-scale image classification with partial class label refinement. In CVPR, Cited by: §2.2, §2.3.
  • [46] M. Ristin, M. Guillaumin, J. Gall, and L. V. Gool (2014) Incremental Learning of NCM Forests for Large-Scale Image Classification. In CVPR, Cited by: §2.2, §2.3.
  • [47] S. Santurkar, D. Tsipras, and A. Madry (2020) BREEDS: benchmarks for subpopulation shift. ArXiv preprint:2008.04859. Cited by: §4.4, §6.1.
  • [48] Y. Shi, L. Yuan, Y. Chen, and J. Feng (2021) Continual learning via bit-level information preserving. In CVPR, Cited by: §3.2.
  • [49] J. Shu, Z. Xu, and D. Meng (2018) Small sample learning in big data era. ArXiv preprint:1808.04572. Cited by: §3.3.
  • [50] P. Singh, P. Mazumder, P. Rai, and V. P. Namboodiri (2021) Rectification-based knowledge retention for continual learning. In CVPR, Cited by: §3.2.
  • [51] P. Singh, V. K. Verma, P. Mazumder, L. Carin, and P. Rai (2020) Calibrating cnns for lifelong learning. In NeurIPS, Cited by: §3.2.
  • [52] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NeurIPS, Cited by: §3.3.
  • [53] X. Tao, X. Chang, X. Hong, X. Wei, and Y. Gong (2020) Topology-preserving class-incremental learning. In ECCV, Cited by: §2.3.
  • [54] X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, and Y. Gong (2020) Few-shot class-incremental learning. In CVPR, Cited by: §2.3, §2.4, §3.4.
  • [55] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020) Rethinking few-shot image classification: a good embedding is all you need?. In ECCV, Cited by: §2.3, §3.3, §5.1.
  • [56] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille (2017) NormFace: l2 hypersphere embedding for face verification. In ACM Conference on Multimedia, Cited by: §3.4, §5.2.
  • [57] S. Wang, X. Li, J. Sun, and Z. Xu (2021) Training networks in null space of feature covariance for continual learning. In CVPR, Cited by: Appendix B, §1, §5.4.
  • [58] Y. Wang and Q. Yao (2019) Few-shot learning: a survey. ArXiv preprint:1904.05046. Cited by: §3.3.
  • [59] G. Wu, S. Gong, and P. Li (2021) Striking a balance between stability and plasticity for class-incremental learning. In ICCV, Cited by: §3.4.
  • [60] Y. Xu, Q. Qian, H. Li, R. Jin, and J. Hu (2021) Weakly supervised representation learning with coarse labels. In ICCV, Cited by: §2.2, §5.1.
  • [61] J. Yang, H. Yang, and L. Chen (2021) Towards cross-granularity few-shot learning: coarse-to-fine pseudo-labeling with visual-semantic meta-embedding. In ACM Conference on Multimedia, Cited by: §2.2.
  • [62] C. Zhang, N. Song, G. Lin, Y. Zheng, P. Pan, and Y. Xu (2021) Few-shot incremental learning with continually evolved classifiers. In CVPR, Cited by: §2.3, §2.4, Table 1, §3.3, §3.4.
  • [63] B. Zhao, X. Xiao, G. Gan, B. Zhang, and S. Xia (2020) Maintaining discrimination and fairness in class incremental learning. In CVPR, Cited by: §3.2, §6.4, Table 3, Table 4, Table 5.
  • [64] F. Zhu, X. Zhang, C. Wang, F. Yin, and C. Liu (2021) Prototype augmentation and self-supervision for incremental learning. In CVPR, Cited by: §3.2.

Appendix A Introduction

In this analysis, we decouple the embedding learner from the classifier (a linear FC layer), freeze the weights of the embedding learner, and use the conventional Softmax cross-entropy loss. Unlike a conventional FC layer, we freeze the weights of the neurons corresponding to previously-seen classes.
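The setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are synthetic stand-ins for the output of a frozen embedding learner, the re-normalization after each step is one simple way to keep the weights normalized, and all names (`train_new_rows`, `n_old`, `n_new`) are hypothetical.

```python
import numpy as np


def normalize_rows(W):
    """L2-normalize each class weight vector (one row per class)."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def train_new_rows(W, n_old, feats, labels, lr=0.1, epochs=50):
    """Softmax cross-entropy SGD that only updates rows n_old.. of W.

    Rows [0, n_old) belong to previously-seen classes and stay frozen.
    The trainable rows are re-normalized after every step (an assumed,
    simple way to keep the classifier weights normalized).
    """
    W = W.copy()
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = softmax(W @ x)
            grad = np.outer(p, x)      # d(loss)/dW for softmax cross-entropy
            grad[y] -= x
            grad[:n_old] = 0.0         # freeze previously-seen classes
            W -= lr * grad
            W[n_old:] = normalize_rows(W[n_old:])
    return W


rng = np.random.default_rng(0)
dim, n_old, n_new = 8, 3, 2
W0 = normalize_rows(rng.normal(size=(n_old + n_new, dim)))
feats = rng.normal(size=(10, dim))               # synthetic frozen-embedding features
labels = rng.integers(n_old, n_old + n_new, 10)  # few shots of new classes only
W1 = train_new_rows(W0, n_old, feats, labels)

# With the embedding and the old rows frozen, old-class logits cannot change.
x_test = rng.normal(size=dim)
old_logits_before = W0[:n_old] @ x_test
old_logits_after = W1[:n_old] @ x_test
print(np.allclose(old_logits_before, old_logits_after))
```

In this sketch the logits of previously-seen classes for a fixed input are identical before and after the new session, which is the intuition behind the stability notions defined in the following appendices.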

Appendix B Prior Art

We base our analysis on two definitions from Wang et al. (2021). As we only analyze the last layer, we drop the layer index .

Definition A (Stability). When the model is being trained in the -th session, in each session should lie in the null space of the uncentered feature covariance matrix , namely, if holds, then is stable at the -th session’s -th step.

Note that is the classification-layer’s weight vector, is the change of , indexes the session, and indexes the training step. where in is the input features of classification-layer on -th session using classification-layer’s weight trained on -th session. We call it the absolute stability where the equality condition is strict.

Definition B (Plasticity). Assume that the network is being trained in the -th session, and denotes the parameter update generated by Gradient Descent for training at step . If holds, then preserves plasticity at the -th session’s -th step.

Notably, if the inequality condition holds, then the ’s loss decreases, which is the essence, and thus is learning.

Appendix C Our Extension of Stability

Definition 1 (Stability Decay). For the same input sample, let denote the output logits of the -th neuron in the last layer in the -th session. After the loss reaches the minimum, we define the degree of stability as


Definition 2 (Relative Stability). Given models and , if , then we say is more stable than .

Appendix D Our Proof of Proposition 1

Proposition 1 (Normalizing or freezing weights improves stability; doing both improves the most). Given , if we only normalize weights of a linear FC classifier, we obtain ; if we only freeze them, we obtain ; if we do both, we obtain . Then, and .

Proof. (1) Stability Degree of model .

It is assumed that the training for all sessions will reach minimum loss. For the training sample in -th session, the probability that belongs to superclass is one, i.e., and . According to , the following conditions are satisfied,


After training of -th session has reached the minimum loss, , then,


(2) Stability Degree of model .

Under the same conditions above, the following conditions are satisfied according to ,


After training of -th session has reached minimum loss, , then the following holds:


(3) Stability Degree of model .

Compared with , model freezes weights of neurons corresponding to previously-seen classes. After training of -th session has reached its minimum loss, , where in order to offset the influence of , then,


(4) Stability Degree of model .

Compared with , model freezes weights of neurons corresponding to previously-seen classes. After training of -th session has reached its minimum loss, , then,


Comparing the stability degree of different models, we have and is the most stable.

Appendix E Our Proof of Proposition 2

Proposition 2 (Weights normalized, plasticity remains). To train our FC classifier, if we denote the loss as where is normalized, the weight update at each step as , and the learning rate as , then we have


Proof. For a sample whose feature vector is , the output of the -th neuron in linear FC layer is denoted as


The probability of sample belonging to -th class is


And the loss of training is denoted as


where denotes the label of sample . Denote the weights update of the -th neuron in linear FC layer as , then


According to , we have


By denoting , according to Taylor’s theorem, we have


where when . Therefore, there exists such that


With , we have , and thus for all . Therefore, the weight update is a descent direction.
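For orientation, the standard argument behind this kind of proof can be written out as below. The notation is assumed here rather than taken from the source ($x$ is the feature vector, $w_i$ the weight vector of the $i$-th neuron, $y$ the label, $\eta > 0$ the learning rate), and the sketch treats the weights as the free variable without modelling the normalization step explicitly.

```latex
% Assumed notation; a sketch of the usual descent-direction argument,
% not a reproduction of the original equations.
\begin{align}
z_i &= w_i^\top x, \qquad
p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \qquad
\mathcal{L}(w) = -\log p_y, \\
\Delta w_i &= -\eta \, \frac{\partial \mathcal{L}}{\partial w_i}
            = \eta \, \bigl(\mathbb{1}[i = y] - p_i\bigr) \, x, \\
\langle \nabla \mathcal{L}(w), \Delta w \rangle
           &= -\eta \, \lVert \nabla \mathcal{L}(w) \rVert^2 \le 0, \\
\mathcal{L}(w + \Delta w)
           &= \mathcal{L}(w) - \eta \, \lVert \nabla \mathcal{L}(w) \rVert^2 + o(\eta).
\end{align}
```

The last line is the first-order Taylor expansion: whenever the gradient is non-zero, a sufficiently small $\eta$ makes the loss strictly decrease.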

Appendix F Our Analysis of Conjecture 1

Consider a conventional linear FC layer with neither weight normalization nor freezing of the weights of previously-seen classes. Let denote a weight vector, where indexes the classes. When the sample’s ground-truth label is , we have


where is the feature vector of a training sample.

Conjecture 1 (FC weights grow over time). Let denote the Frobenius norm of the weight matrix formed by all weight vectors in the FC layer for new classes in the -th session. With training converged and norm outliers ignored, it holds that .

Analysis. For a conventional linear FC layer, the output of the neural network directly determines the probability of the class to which the sample belongs. So we use to represent the reward () or penalty () for different neurons after sample with label is trained, where is the output of the -th neuron and is the learning rate. Then, we have


For a sample with superclass label and subclass label , when we train sample only with label and reach a relatively good state in -th session, we will get and . When we train sample only with label in other sessions and reach a relatively good state, the penalty for the superclass of sample will be much larger than for other classes, while the reward for the subclass of sample will be much larger too. Therefore, if belongs to previously-seen classes, will hold most of the time during training. Thus, previously-seen classes keep being penalized during gradient descent. As a result, the weights of previously-seen classes tend to be smaller than those of the newly added classes. Because we train new classes in stages and reach a relatively good state (say, the training loss converges to a small value) in every session, the FC weights grow piecewise over time. Consequently, the model is biased towards new classes.
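The reward/penalty pattern in this analysis can be illustrated with one step of plain softmax cross-entropy gradient descent. The sketch below assumes the standard update $\Delta w_i = \eta\,(\mathbb{1}[i = y] - p_i)\,x$, which may differ in notation from the original equations; the data are random and the function name `sgd_weight_update` is hypothetical.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def sgd_weight_update(W, x, y, lr):
    """Per-neuron update of one softmax cross-entropy SGD step.

    Row i equals lr * (1[i == y] - p_i) * x: a reward for the labelled
    class and a penalty for every other class.
    """
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return lr * np.outer(onehot - p, x)


rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))   # 5 classes; rows 0..3 stand for previously-seen classes
x = rng.normal(size=4)        # feature of a sample labelled with the new class 4
dW = sgd_weight_update(W, x, y=4, lr=0.1)

# Change of each logit for this sample caused by the step.
logit_change = dW @ x
print(np.sign(logit_change))
```

Only the labelled (new) class sees its logit increase; every other class, including all previously-seen ones, is pushed down. Repeating such steps over sessions that contain only new-class labels is what drives the bias towards new classes described above.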

Appendix G Our Analysis of Conjecture 2 and 3

Since Conjecture 2 and Conjecture 3 are drawn from empirical observations, the following derivations are conditioned on those observations always being true. We therefore present an analysis rather than a proof.

Conjecture 2&3 (Sufficient & necessary condition of no impact of freezing embedding-weights). where
: classifier-weights are normalized,
: classifier-weights are frozen,
: freezing embedding-weights improves the performance

As the name hints, Conjecture 2&3 is an integration of Conjecture 2 and Conjecture 3. Since is the contrapositive proposition of , they have the same truth value. Since , we have . Furthermore, given , we have , which means that is sufficient (if) and necessary (only if) for . Namely, freezing embedding-weights does not help if and only if classifier-weights are normalized or frozen. In the following, we analyze Conjecture 2 and 3, respectively.

Conjecture 2 (the ’only if’ part).

Analysis. Although Conjecture 2 is a direct formulation of the corresponding observation, we will analyze it in a general sense. We have four propositions that are all true according to our empirical observations:
4⃝  .

They share a similar composition pattern, and thus we can summarize them as Table 6.

Table 6: Compound propositions.

Let us make a realization of the general propositions
: classifier-weights are normalized,
: classifier-weights are frozen,
: freezing embedding-weights improves the performance, respectively. We want to construct a common proposition that makes all four cases true. Namely, we need to solve for a composite proposition that satisfies the truth table with 1⃝, 2⃝, 3⃝, 4⃝ ordered top-down.

Table 7: A truth table that is not completed.

Note that is iff is and is . Therefore, we want ’s truth value of the line never to be . Given the value pairs of and , the only way to make that happen is to let be , which is a solution that satisfies all four cases, and thus is always true.

Table 8: The truth table is realized.

Namely, we have , which is exactly Conjecture 2, , with a change of notations.

Conjecture 3 (the ’if’ part). .
Analysis. Given propositions 2⃝, 3⃝, 4⃝, we will combine them and derive a logically-equivalent premise.


Similarly, we can derive 2⃝ 3⃝ 4⃝ as

With the premise replaced, we have
2⃝ 3⃝ 4⃝ ,

Given that 2⃝, 3⃝, 4⃝ are all always true, it holds that is always true. Namely, we have .
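The propositional reasoning in this appendix can be checked mechanically with a truth table. The sketch below rests on an assumed reading of the four observations, since their formulas are not legible here: with N = "classifier-weights are normalized", F = "classifier-weights are frozen", and I = "freezing embedding-weights improves the performance", it takes them to be (not N and not F) implies I, and the three cases (N and not F), (not N and F), (N and F) each implying not I. The function names are hypothetical.

```python
from itertools import product


def implies(a, b):
    """Material implication a -> b."""
    return (not a) or b


def four_cases(n, f, i):
    """Conjunction of the four case-wise observations (assumed form)."""
    return (
        implies((not n) and (not f), i)      # neither normalized nor frozen
        and implies(n and (not f), not i)    # normalized only
        and implies((not n) and f, not i)    # frozen only
        and implies(n and f, not i)          # both
    )


def combined_conjecture(n, f, i):
    """(N or F) holds if and only if I does not hold."""
    return (n or f) == (not i)


# Enumerate all 8 truth assignments and compare the two formulas.
rows = [(n, f, i, four_cases(n, f, i), combined_conjecture(n, f, i))
        for n, f, i in product([False, True], repeat=3)]
equivalent = all(a == b for *_, a, b in rows)
print(equivalent)
```

Under this reading, the conjunction of the four observations is logically equivalent to the combined "if and only if" statement, so establishing the 'only if' and 'if' parts separately covers the same ground as the four cases together.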