Subspace Regularizers for Few-Shot Class Incremental Learning

Afra Feyza Akyürek, et al.
Boston University

Few-shot class incremental learning – the problem of updating a trained classifier to discriminate among an expanded set of classes with limited labeled data – is a key challenge for machine learning systems deployed in non-stationary environments. Existing approaches to the problem rely on complex model architectures and training procedures that are difficult to tune and re-use. In this paper, we present an extremely simple approach that enables the use of ordinary logistic regression classifiers for few-shot incremental learning. The key to this approach is a new family of subspace regularization schemes that encourage weight vectors for new classes to lie close to the subspace spanned by the weights of existing classes. When combined with pretrained convolutional feature extractors, logistic regression models trained with subspace regularization outperform specialized, state-of-the-art approaches to few-shot incremental image classification by up to 22% on the miniImageNet dataset. Because of its simplicity, subspace regularization can be straightforwardly extended to incorporate additional background information about the new classes (including class names and descriptions specified in natural language); these further improve accuracy by up to 2%. Our results demonstrate that simple geometric regularization of class representations offers an effective tool for continual learning.




1 Introduction

Figure 1: Few-shot class incremental learning: (a) A base classifier is trained on a large dataset. (b) This classifier is extended to also discriminate among a set of new classes with a small number of labeled examples. (c) Models are evaluated on a test set that includes all seen classes. This paper focuses on extremely simple, regularization-based approaches to FSCIL, with and without side information from natural language: (i) We regularize novel classifier weights along the shortest direction to the subspace spanned by base classifier weights. (ii) We regularize novel classifiers by pulling them toward a weighted average of base classifiers, where the weights are calculated from the similarity between novel and base class names or one-sentence descriptions. (iii) We learn a linear mapping between word labels and classifier weights of the base classes; we then project the novel label white wolf and regularize the novel classifier weight toward the projection.

Standard approaches to classification in machine learning assume a fixed training dataset and a fixed set of class labels. But for many real-world classification problems, these assumptions are unrealistic. Classifiers must sometimes be updated on-the-fly to recognize new concepts (e.g. new skills in personal assistants or new road signs in self-driving vehicles), while training data is sometimes unavailable for reuse (e.g. due to privacy regulations, Lesort et al. 2019; McClure et al. 2018; or storage and retraining costs, Bender et al. 2021). Development of models that support few-shot class-incremental learning (FSCIL), in which classifiers’ label sets can be easily extended with small numbers of new examples and no retraining, is a key challenge for machine learning systems deployed in the real world (Masana et al., 2020).

As a concrete example, consider the classification problem depicted in Fig. 1. A model, initially trained on a large set of examples from several base classes (snorkel, arctic fox, meerkat; Fig. 1a), must subsequently be updated to additionally recognize two novel classes (white wolf and poncho; Fig. 1b), and ultimately distinguish among all five classes (Fig. 1c). Training a model to recognize the base classes is straightforward: for example, we can jointly optimize the parameters of a feature extractor $f_\phi$ (perhaps a convolutional network with parameters $\phi$) and a linear classification layer with weights $W = \{w_c\}$ to maximize the regularized likelihood of (image, label) pairs from the dataset in Fig. 1a:

$$\max_{\phi, W} \; \sum_{(x, y) \in \mathcal{D}^{(0)}} \log p(y \mid x; \phi, W) \;-\; \lambda \, \mathcal{R}(\phi, W) \qquad (1)$$

where $p(y \mid x; \phi, W) \propto \exp(w_y^\top f_\phi(x))$ and $\mathcal{R}$ is an ordinary (e.g. $\ell_2$) parameter regularizer.
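As a concrete sketch of this objective, the following snippet computes the regularized log-likelihood of a linear (softmax) classifier over pre-extracted features. Names and the $\ell_2$ choice of regularizer are illustrative assumptions, not the authors' code.

```python
import numpy as np

def regularized_log_likelihood(features, labels, W, lam=1e-3):
    """Eq. 1 sketch: log-likelihood of a linear softmax classifier
    minus an l2 penalty on the classifier weights.

    features: (n, d) outputs of the feature extractor f_phi.
    labels:   (n,) integer class labels.
    W:        (n_classes, d) classifier weight matrix.
    """
    logits = features @ W.T                          # (n, n_classes)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_lik = log_probs[np.arange(len(labels)), labels].sum()
    return log_lik - lam * np.sum(W ** 2)
```

Maximizing this quantity over both $\phi$ (through the features) and $W$ corresponds to ordinary classifier training in the base session.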
But how can this model be updated to additionally recognize the classes in Fig. 1b, with only a few examples of each new class and no access to the original training data?

Naïvely continuing to optimize Eq. 1 on (x, y) pairs drawn from the new dataset will cause several problems. Because the new dataset contains no positive examples of the base classes, performance on base classes will suffer due to catastrophic forgetting (Goodfellow et al., 2013), while performance on novel classes will likely be poor as a result of overfitting (Anderson and Burnham, 2004).

As a consequence, most past work on FSCIL has focused on alternative approaches that use non-standard prediction architectures (e.g., Tao et al., 2020) or optimize non-likelihood objectives (e.g., Yoon et al., 2020; Ren et al., 2019). This divergence between approaches to standard and incremental classification has its own costs: state-of-the-art approaches to FSCIL are complicated, requiring nested optimizers, complex data structures, and numerous hyperparameters. When improved representation learning and optimization techniques are developed for standard classification problems, it is often unclear how to apply them in the incremental setting.

In this paper, we turn the standard approach to classification into a surprisingly effective tool for FSCIL. Specifically, we show that both catastrophic forgetting and overfitting can be reduced by introducing an additional subspace regularizer (related to those studied by Agarwal et al. 2010 and Kirkpatrick et al. 2017) that encourages novel class weight vectors to lie close to the subspace spanned by the weights of the base classes. On its own, the proposed subspace regularizer produces ordinary linear classifiers that achieve state-of-the-art results on FSCIL, improving over existing work on multiple tasks and datasets.

Because of its simplicity, this regularization approach can be easily extended to incorporate additional information about relationships between base and novel classes. Using language data as a source of background knowledge about classes, we describe a variation of our approach, which we term semantic subspace regularization, that pulls weight vectors toward particular convex combinations of base classes that capture their semantic similarity to existing classes. This further improves accuracy by up to 2% over simple subspace regularization across multiple tasks. These results suggest that FSCIL and related problems may not require specialized machinery to solve, and that simple regularization approaches can solve the problems that result from limited access to training data for both base and novel classes.[1]

[1] Code will be made publicly available.

2 Background

A long line of research has focused on the development of automated decision-making systems that support online expansion of the set of concepts they can recognize and generate. An early example (closely related to our learning-from-definitions experiment in Section 5) appears in the classic SHRDLU language grounding environment (Winograd, 1972): given the definition a steeple is a small triangle on top of a tall rectangle, SHRDLU acquires the ability to answer questions containing the novel concept steeple. Recent work in machine learning describes several versions of this problem featuring more complex perception or control:

Few-shot and incremental learning

Few-shot classification problems test learners’ ability to distinguish among a fixed set of classes using only a handful of labeled examples per class (Scheirer et al., 2012). In addition to these examples, most effective approaches to few-shot learning rely on additional data for pre-training (Tian et al., 2020) or meta-learning (Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017). One peculiarity of this evaluation paradigm is that, even when pre-trained, models are evaluated only on new (few-shot) classes, and are free to update their parameters in ways that cause them to perform poorly on pre-training tasks. As noted by past work, a more realistic evaluation of models’ ability to rapidly acquire new concepts should consider their ability to discriminate among both new concepts and old ones, a problem usually referred to as few-shot class-incremental learning (FSCIL) (Tao et al., 2020).[2]

[2] Variants of this problem have gone by numerous names in past work, including generalized few-shot learning (Schönfeld et al., 2019), dynamic few-shot learning (Gidaris and Komodakis, 2018), or simply incremental few-shot learning (Ren et al., 2019; Chen and Lee, 2021).

FSCIL requires learners to incrementally acquire novel classes with few labeled examples while retaining high accuracy on previously learned classes. It combines the most challenging aspects of class-incremental learning (Rebuffi et al., 2017), task-incremental learning (Delange et al., 2021), and rehearsal-based learning (Rolnick et al., 2019; Chaudhry et al., 2019), three related problems with much stronger assumptions about the kind of information available to learners. Existing approaches to this problem prioritize either adaptation to novel classes (Ren et al., 2019; Yoon et al., 2020; Chen and Lee, 2021; Cheraghian et al., 2021) or reduction of forgetting in old classes (Tao et al., 2020).

Learning class representations

Even prior to the widespread use of deep representation learning approaches, the view of classification as a problem of learning class representations motivated a number of approaches to multi-class and multi-task learning (Evgeniou and Pontil, 2007; Agarwal et al., 2010). In few-shot and incremental learning settings, many recent approaches have also focused on the space of class representations. Qi et al. (2018) initialize novel class representations using the average features of few-shot samples. Others (Gidaris and Komodakis, 2018; Yoon et al., 2020) train a class representation predictor via meta-learning, and Tao et al. (2020) impose topological constraints on the manifold of class representations as new representations are added. Alternatively, Chen and Lee (2021) model the visual feature space as a Gaussian mixture and use the cluster centers in a similarity-based classification scheme.

Our approach is related to Ren et al. (2019), who propose a nested optimization framework to learn auxiliary parameters for every base and novel class to influence the novel weights via regularization; we show that these regularization targets can be derived geometrically without the need for an inner optimization step. Also related is the work of Barzilai and Crammer (2015), which synthesizes the novel weights as linear combinations of base weights; we adopt a regularization approach that allows learning of class representations that are not strict linear combinations of base classes.

Learning with side information from language

The use of background information from other modalities (especially language) to bootstrap learning of new classes is widely studied (Frome et al., 2013; Radford et al., 2021; Reed et al., 2016)—particularly in the zero-shot learning and generalized zero-shot learning where side information is the only source of information about the novel class (Chang et al., 2008; Larochelle et al., 2008; Akata et al., 2013; Pourpanah et al., 2020). Specialized approaches exist for integrating side information in few-shot learning settings (Schwartz et al., 2019; Cheraghian et al., 2021).

3 Problem Formulation

We follow the notation in Tao et al. (2020) for FSCIL: assume a stream of learning sessions, each associated with a labeled dataset $\mathcal{D}^{(t)}$. Every $\mathcal{D}^{(t)}$ consists of a support set $\mathcal{S}^{(t)}$ (used for training) and a query set $\mathcal{Q}^{(t)}$ (used for evaluation). We will refer to the classes represented in $\mathcal{D}^{(0)}$ as base classes; as in Fig. 1a, we will assume that it contains a large number of examples for every class. $\mathcal{D}^{(1)}$ (and subsequent datasets) introduce novel classes (Fig. 1b). Let $C(\mathcal{X})$ denote the set of classes expressed in a set of examples $\mathcal{X}$; we will write $C^{(t)} = C(\mathcal{D}^{(t)})$ and $C^{(\le t)} = \bigcup_{t' \le t} C^{(t')}$ for convenience. The learning problem we study is incremental in the sense that each support set contains only new classes ($C(\mathcal{S}^{(t)}) \cap C^{(\le t-1)} = \varnothing$),[3] while each query set evaluates models on both novel classes and previously seen ones ($C(\mathcal{Q}^{(t)}) = C^{(\le t)}$). It is few-shot in the sense that for $t \ge 1$, $\mathcal{S}^{(t)}$ is small (containing 1–5 examples per class for all datasets studied in this paper). Given an incremental learning session $t$, the goal is to fine-tune the existing classifier with the limited training data from the novel classes such that the classifier performs well in classifying all classes learned thus far.

[3] This is the original setup established by Tao et al. (2020). We will also present experiments in which we retain one example per class for memory replay, following Chen and Lee (2021).

FSCIL with a single session

Prior to Tao et al. (2020), a simpler version of the multi-session FSCIL problem was proposed by Qi et al. (2018), in which there is only a single incremental learning session after the pre-training stage, i.e. $T = 1$. This version, which we call single-session FSCIL, has been extensively studied by previous work (Qi et al., 2018; Gidaris and Komodakis, 2018; Ren et al., 2019; Yoon et al., 2020). The problem formulation is the same as above with $T = 1$: a feature extractor is trained on samples from $\mathcal{D}^{(0)}$, then fine-tuned on $\mathcal{S}^{(1)}$, and finally evaluated on samples with classes in $C^{(0)} \cup C^{(1)}$.

4 Approach

Our approach to FSCIL consists of two steps. In the base session, we jointly train a feature extractor and classification layer on base classes (Section 4.1). In subsequent (incremental learning) sessions, we freeze the feature extractor and update only the classification layer using regularizers that (1) stabilize representations of base classes, and (2) bring the representations of new classes close to existing ones (Sections 4.2-4.4).

4.1 Feature Extractor Training

As in Eq. 1, we begin by training an ordinary classifier comprising a non-linear feature extractor $f_\phi$ and a linear decision layer with parameters $W = \{w_c : c \in C^{(0)}\}$. We choose $\phi$ and $W$ to maximize:

$$\max_{\phi, W} \; \sum_{(x, y) \in \mathcal{D}^{(0)}} \log p(y \mid x; \phi, W) \;-\; \lambda \, \mathcal{R}(\phi, W) \qquad (2)$$

As discussed in Section 5, all experiments in this paper implement $f_\phi$ as a convolutional neural network. In subsequent loss formulations we refer to the extracted features $f_\phi(x)$ as $\bar{x}$.

4.2 Fine-tuning

Along with the estimated $\phi$, feature extractor training yields weight vectors only for the base classes $c \in C^{(0)}$. Given an incremental learning dataset $\mathcal{D}^{(t)}$, we introduce new weight vectors $w_c$ for every $c \in C^{(t)}$ and optimize

$$\sum_{(x, y) \in \mathcal{S}^{(t)}} \log p(y \mid \bar{x}; W) \;-\; \alpha \, \mathcal{R}_{\mathrm{old}} \;-\; \beta \, \mathcal{R}_{\mathrm{new}} \qquad (3)$$

with respect to $W$ alone. Eq. 3 features two new regularization terms, $\mathcal{R}_{\mathrm{old}}$ and $\mathcal{R}_{\mathrm{new}}$. $\mathcal{R}_{\mathrm{old}}$ limits the extent to which fine-tuning can change parameters for classes that have already been learned:

$$\mathcal{R}_{\mathrm{old}} = \sum_{c \in C^{(\le t-1)}} \lVert w_c - w_c^{(t-1)} \rVert^2 \qquad (4)$$

where $w_c^{(t-1)}$ denotes the value of the corresponding variable at the end of session $t-1$. (For example, $w_c^{(0)}$ refers to the weights for base class $c$ prior to fine-tuning, i.e. after session 0.) As shown in Section 5.2, using $\mathcal{R}_{\mathrm{old}}$ alone, and setting $\beta = 0$, is a surprisingly effective baseline; however, performance can be improved by appropriately regularizing the new parameters as described below.
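The old-class penalty above can be sketched in a few lines (an illustrative helper with hypothetical names, assuming weights are stored per class in dictionaries):

```python
import numpy as np

def r_old(weights, weights_prev, old_classes):
    """Eq. 4 sketch: sum of squared drifts of previously learned weights.

    weights, weights_prev: dicts mapping class id -> weight vector, where
    weights_prev holds the values frozen at the end of the last session.
    """
    return sum(
        float(np.sum((weights[c] - weights_prev[c]) ** 2)) for c in old_classes
    )
```

During fine-tuning this quantity is subtracted (scaled by a coefficient) from the support-set log-likelihood, anchoring old class weights near their previous values.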

Variant: Memory

Following past work (Chen and Lee, 2021) which performs incremental learning while retaining a small “memory” of previous samples, we explore an alternative baseline approach in which we augment the support set $\mathcal{S}^{(t)}$ in Eq. 3 with a memory $\mathcal{M}^{(t)}$ containing one stored example for each previously seen class in $C^{(\le t-1)}$. We sample only 1 example per previous class and reuse the same example in subsequent sessions.
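A minimal sketch of this memory construction (an illustrative helper, not the authors' code; the sampling details are assumptions, with a fixed seed so the same example is reused across sessions):

```python
import random

def build_memory(previous_examples, seen_classes, seed=0):
    """Retain exactly one stored (input, label) example per seen class.

    previous_examples: iterable of (x, y) pairs from earlier sessions.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in previous_examples:
        by_class.setdefault(y, []).append(x)
    # One example per class; the fixed rng makes the choice repeatable.
    return {c: rng.choice(by_class[c]) for c in seen_classes}
```

The resulting examples are simply appended to the support set when optimizing Eq. 3.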

4.3 Method 1: Subspace Regularization

Past work on other multitask learning problems has demonstrated the effectiveness of constraining parameters for related tasks to be similar (Jacob et al., 2008), to lie on the same manifold (Agarwal et al., 2010), or even on the same linear subspace (Evgeniou and Pontil, 2007). Moreover, Schönfeld et al. (2019) showed that a shared latent feature space for all classes is useful for class-incremental classification. Features independently learned for novel classes from small numbers of examples are likely to capture spurious correlations (unrelated to the true causal structure of the prediction problem) as a result of dataset biases (Arjovsky et al., 2019). In contrast, we expect most informative semantic features to be shared across multiple classes: indeed, cognitive research suggests that in humans’ early visual cortex, representations of different objects occupy a common feature space (Kriegeskorte et al., 2008). Therefore, regularizing novel weights toward the space spanned by base class weight vectors encourages new class representations to depend on semantic rather than spurious features, and encourages features for all tasks to lie in the same universal subspace.

We apply this intuition to FSCIL via a simple subspace regularization approach. Given a parameter vector $w_c$ for an incremental class $c$ and base class parameters $\{w_{c'} : c' \in C^{(0)}\}$, we first compute the subspace target $\mathrm{proj}_B(w_c)$ for each class. We then compute the distance between $w_c$ and this target and define:

$$\mathcal{R}_{\mathrm{new}} = \sum_{c \in C^{(t)}} \lVert w_c - \mathrm{proj}_B(w_c) \rVert^2 \qquad (5)$$

where $\mathrm{proj}_B(w_c)$ is the projection of $w_c$ onto the space spanned by the base weights:

$$\mathrm{proj}_B(w_c) = B B^\top w_c \qquad (6)$$

and $B$ contains the orthonormal basis vectors of the subspace spanned by the initial set of base weights $\{w_{c'}^{(0)}\}$. ($B$ can be found using a QR decomposition of the matrix of base class weight vectors, as described in the appendix.)
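The projection step can be sketched with NumPy's reduced QR decomposition (an illustrative implementation with our own variable names, assuming the base weight vectors are linearly independent):

```python
import numpy as np

def subspace_basis(base_weights):
    """Orthonormal basis for the span of the base class weight vectors.

    base_weights: (n_base, d) matrix whose rows are base class weights.
    Returns B of shape (d, n_base) with orthonormal columns.
    """
    # Reduced QR of the (d, n_base) matrix of stacked weight columns:
    # the columns of Q span the same subspace as the base weights.
    Q, _ = np.linalg.qr(base_weights.T)
    return Q

def subspace_regularizer(novel_weight, B):
    """Squared distance from a novel weight vector to the base subspace."""
    projection = B @ (B.T @ novel_weight)  # proj_B(w) = B B^T w
    return float(np.sum((novel_weight - projection) ** 2))
```

Any component of the novel weight lying inside the base subspace is unpenalized; only the orthogonal residual contributes to the loss.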

Previous work that leverages subspace regularization for multitask learning assumes that data from all tasks are available from the beginning (Argyriou et al., 2007; Agarwal et al., 2010). Our approach to subspace regularization removes this assumption, enabling tasks (in this case, novel classes) to arrive incrementally and predictions to be made cumulatively over all classes seen thus far.

4.4 Method 2: Semantic Subspace Regularization

The constraint in Eq. 5 makes explicit use of geometric information about base classes, pulling novel weights toward the base subspace. However, it provides no information about where within that subspace the weights for a new class should lie. In most classification problems, classes have names consisting of natural language words or phrases; these names often contain a significant amount of information relevant to the classification problem of interest. (Even without having ever seen a white wolf, a typical English speaker can guess that a white wolf is more likely to resemble an arctic fox than a snorkel.) These kinds of relations are often captured by embeddings of class labels (or more detailed class descriptions) (Pennington et al., 2014).

When available, this kind of information about class semantics can be used to construct an improved subspace regularizer by encouraging new class representations to lie close to a convex combination of base classes weighted by their semantic similarity. We replace the subspace projection in Eq. 5 with a semantic target $\hat{w}_c$ for each class. Letting $e_c$ denote a semantic embedding of the class $c$, we compute:

$$\hat{w}_c = \sum_{c' \in C^{(0)}} \alpha_{c c'} \, w_{c'}^{(0)}, \qquad \alpha_{c c'} = \frac{\exp(e_c^\top e_{c'})}{\sum_{c'' \in C^{(0)}} \exp(e_c^\top e_{c''})} \qquad (7)$$
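A sketch of the semantic target computation, assuming dot-product similarities passed through a softmax (the exact similarity function and the temperature parameter are assumptions for illustration):

```python
import numpy as np

def semantic_target(novel_emb, base_embs, base_weights, temperature=1.0):
    """Convex combination of base weights, weighted by embedding similarity.

    novel_emb:    (e,) embedding of the novel class label.
    base_embs:    (n_base, e) embeddings of base class labels.
    base_weights: (n_base, d) base classifier weights.
    """
    sims = base_embs @ novel_emb / temperature
    sims -= sims.max()                       # numerical stability
    alphas = np.exp(sims) / np.exp(sims).sum()  # softmax -> convex weights
    return alphas @ base_weights             # (d,) semantic target
```

Because the softmax weights are non-negative and sum to one, the target always lies inside the convex hull of the base class weights.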
Embeddings can be derived from multiple sources: in addition to the class names discussed above, a popular source of side information for zero-shot and few-shot learning problems is detailed textual descriptions of classes; we evaluate both label and description embeddings in Section 5.

Baseline: Linear Mapping

While the approach described in Eq. 7 combines semantic information and label subspace information, a number of previous studies in vision and language have also investigated the effectiveness of directly learning a mapping from the space of semantic embeddings to the space of class weights (Das and Lee, 2019; Socher et al., 2013; Pourpanah et al., 2020; Romera-Paredes and Torr, 2015). Despite the pervasiveness of this idea in other domains, this is the first time we are aware of it being explored for FSCIL. We extend our approach to incorporate this past work by learning a linear map $M$ between the embedding space and the weight space containing the $w_c$:

$$M = \arg\min_{M'} \sum_{c \in C^{(0)}} \lVert M' e_c - w_c^{(0)} \rVert^2 \qquad (8)$$

then set

$$\hat{w}_c = M e_c \qquad (9)$$
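The map in Eq. 8 is an ordinary least-squares problem over the base classes, and can be sketched as follows (function names are our own):

```python
import numpy as np

def learn_linear_map(base_embs, base_weights):
    """Least-squares map M from embedding space to classifier-weight space.

    Solves min_M sum_c ||M e_c - w_c||^2, i.e. a linear regression with base
    embeddings as inputs and base weights as regression targets.
    base_embs: (n_base, e); base_weights: (n_base, d). Returns M: (d, e).
    """
    # lstsq solves base_embs @ X = base_weights for X of shape (e, d);
    # the desired map is its transpose.
    X, *_ = np.linalg.lstsq(base_embs, base_weights, rcond=None)
    return X.T

def linear_map_target(M, novel_emb):
    """Eq. 9: regularization target for a novel class embedding."""
    return M @ novel_emb
```

At incremental time, the novel label's embedding is simply pushed through the fitted map to obtain its regularization target.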
Concurrent work by Cheraghian et al. (2021) also leverages side information for FSCIL; they learn a mapping from image space onto the label space to directly produce predictions in the label space. We provide comparisons in Section 5.

5 Experiments

Given a classifier trained on an initial set of base classes, our experiments aim to evaluate the effect of subspace regularization (1) on the learning of new classes, and (2) on the retention of base classes. To evaluate the generality of our method, we evaluate using two different experimental paradigms that have been used in past work: a multi-session experiment in which new classes are continuously added and the classifier must be repeatedly updated, and a single-session setup ($T = 1$) in which new classes arrive only once. We use SGD as our optimizer to train all models. Details about experiment setups and results are discussed below. Additional details may be found in the appendix.

Figure 2: Multi-session FSCIL accuracy (%) results on miniImageNet. In the first session (session 0), there are a total of 60 base classes. Every session following the first one introduces 5 novel classes with 5 labeled samples from each. Each session reports accuracy over all classes seen thus far. The weighted average is the combination of novel and base accuracies weighted by the number of classes in each category. Error bars show standard deviations (95% CI). In accordance with Chen and Lee (2021), we preserve only one sample per class from previous classes and append them to the support set during fine-tuning (+M variant). Regularization-based approaches (subspace regularization, semantic subspace regularization, and linear mapping) consistently outperform previous benchmarks on average.

5.1 Multi-Session

We follow the setup established in Tao et al. (2020) and described in Section 3: we first train a ResNet (He et al., 2016) from scratch as the feature extractor on a large number of examples from base classes to obtain an initial classifier. We then observe a new batch of examples and produce a new classifier with updated weights. Finally, we evaluate the classifier according to top-1 accuracy on base and novel samples, as well as their weighted average (Tao et al., 2020; Chen and Lee, 2021). We use the miniImageNet dataset (Vinyals et al., 2016; Russakovsky et al., 2015) for our multi-session evaluation; miniImageNet contains 100 classes with 600 samples per class.

Session                      1               2               8
Memory                      -M     +M       -M     +M       -M     +M
Tao et al. (2020)           50.1   –        45.2   –        24.4   –
Chen and Lee (2021)         –      59.9     –      55.9     –      41.8
Cheraghian et al. (2021)*   –      58.0     –      53.0     –      39.0
Fine-tuning                 61.8   67.7     49.9   62.9     26.5   48.1
Subspace Reg.               71.7   72.9     66.9   67.8     46.8   48.8
 Linear Mapping             72.6   73.2     67.1   68.0     46.9   50.0
 Semantic Subspace Reg.     73.8   73.9     68.4   69.0     47.6   49.7
Table 1: Multi-session FSCIL weighted average accuracy (%) on miniImageNet, using a setup identical to Fig. 2 but distinguishing memory settings. We report average results over 10 random splits of the data for incremental sessions 1, 2 and 8. ±M indicates whether 1 sample per class is kept in memory to further regularize forgetting. Our fine-tuning baseline is already superior to previous results. In both memory settings, our regularizers substantially outperform the respective benchmarks across all sessions 1–8. *Results are estimated from the plot in the respective work. Bold indicates the highest.

Feature extractor training

In session 0, we randomly select 60 classes as base classes and use the remaining 40 classes as novel classes. Reported results are averaged across 10 random splits of the data (Fig. 2). We use a ResNet-18 model identical to the one described in Tian et al. (2020). Following Tao et al. (2020), we use 500 labeled samples per base class to train our feature extractor and 100 for testing.

Incremental evaluation

Again following Tao et al. (2020), we evaluate for a total of 8 incremental sessions on miniImageNet. In each session $t \ge 1$, we sample 5 novel classes for training and 5 samples from each class. Hence, at the last session ($t = 8$), evaluation involves the entire set of 100 miniImageNet classes. We use GloVe embeddings (Pennington et al., 2014) for label embeddings in Eq. 7.


Fig. 2 and Table 1 show the results of multi-session experiments with and without memory. Session 0 indicates base class accuracy after feature extractor training. We compare subspace and language-guided regularization (linear mapping and semantic subspace reg.) to simple fine-tuning (a surprisingly strong baseline). We also compare our results to three recent benchmarks: Tao et al. (2020), Chen and Lee (2021) and Cheraghian et al. (2021).[4] This comparison is inexact: our feature extractor performs substantially better than those of Tao et al. (2020) and Chen and Lee (2021). Despite extensive experiments (see appendix) on various versions of ResNet-18 (He et al., 2016), we were unable to identify a training procedure that reproduced the reported accuracy for session 0: all model variants investigated achieved 80%+ validation accuracy.

[4] Chen and Lee (2021) and Cheraghian et al. (2021) do not provide a codebase, and Tao et al. (2020) does not provide an implementation of the main TOPIC algorithm in their released code. We therefore report published results rather than a reproduction.

When models are evaluated on combined base and novel accuracy, subspace regularization outperforms previous approaches (by 22% (-M) and 7% (+M) at session 8); when semantic information about labels is available, linear mapping and semantic subspace regularization outperform Cheraghian et al. (2021) (Table 1). Evaluating only base sample accuracies (Fig. 2b), semantic subspace reg. outperforms the others; compared to regularization-based approaches, fine-tuning is subject to catastrophic forgetting, though the regularizer on old weights is still useful in limiting forgetting (Table 3 in appendix). The method of Chen and Lee (2021) follows a similar trajectory to our regularizers, but at a much lower accuracy (Fig. 2a). In Fig. 2c, a high degree of forgetting on base classes under fine-tuning allows higher accuracy on novel classes—though not enough to improve average performance (Fig. 2a). By contrast, subspace regularizers strike a good balance between plasticity and stability (Mermillod et al., 2013). In Table 1, storing as few as a single example per old class substantially helps to reduce forgetting. Results from linear mapping and semantic subspace regularization are close, with semantic subspace regularization performing roughly 1% better on average. The two approaches offer different trade-offs between base and novel accuracies: semantic subspace regularization is more competitive on base classes, and linear mapping on novel ones.

5.2 Single Session

                                                 miniImageNet                      tieredImageNet
                                                 1-shot          5-shot           1-shot          5-shot
Model                                            Acc.    Δ       Acc.    Δ        Acc.    Δ       Acc.    Δ
Imprinted Networks (Qi et al., 2018)             41.34  -23.79%  46.34  -25.25%   40.83  -22.29%  53.87  -17.18%
LwoF (Gidaris and Komodakis, 2018)               49.65  -14.47%  59.66  -12.35%   53.42  -9.59%   63.22  -7.27%
Attention Attractor Networks (Ren et al., 2019)  54.95  -11.84%  63.04  -10.66%   56.11  -6.11%   65.52  -4.48%
XtarNet (Yoon et al., 2020)                      56.12  -13.62%  69.51  -9.76%    61.37  -1.85%   69.58  -1.79%
Fine-tuning                                      58.56  -12.14%  66.54  -13.77%   64.42  -7.23%   72.59  -6.88%
Subspace Regularization                          58.38  -12.30%  68.88  -10.74%   64.39  -7.23%   73.03  -6.16%
 Linear Mapping                                  58.87  -12.83%  69.68  -10.40%   64.55  -7.31%   73.10  -6.16%
 Semantic Subspace Reg. (w/ description)         59.09  -12.38%  68.46  -11.70%   64.49  -7.14%   72.94  -6.29%
 Semantic Subspace Reg. (w/ label)               58.70  -12.24%  69.75  -10.48%   64.75  -7.22%   73.51  -6.08%
Table 2: miniImageNet 64+5-way and tieredImageNet 200+5-way single-session results. We follow previous work in reporting the unweighted average of accuracies over base and novel samples rather than a weighted average. In addition to accuracy, we report a quantity labeled Δ by Ren et al. (2019): the averaged gap between individual accuracies and joint accuracies of base and novel samples. Lower values of Δ are better. Bold numbers are not significantly different from the best result in each column under a paired t-test (after Bonferroni correction). All results are averaged across 2000 runs.

In this section we describe the experiment setup for the single-session evaluation, and compare our approach to the state-of-the-art XtarNet (Yoon et al., 2020), as well as Ren et al. (2019), Gidaris and Komodakis (2018) and Qi et al. (2018). We evaluate our models in 1-shot and 5-shot settings.[5]

[5] Unlike in the preceding section, we were able to successfully reproduce the XtarNet model. Our version gives better results on the miniImageNet dataset but worse results on the tieredImageNet dataset; for fairness, we thus report results for our version of XtarNet on miniImageNet and previously reported numbers on tieredImageNet. For other models, we show accuracies reported in previous work.

miniImageNet and tieredImageNet

For miniImageNet single-session experiments, we follow the splits provided by Yoon et al. (2020). Out of 100 classes, 64 are used in session 0, 20 in session 1, and the remaining 16 for development. Following Yoon et al. (2020), we use ResNet-12 (a smaller version of the model described in Section 5.1). tieredImageNet (Ren et al., 2018) contains a total of 608 classes, of which 351 are used in session 0 and 160 are reserved for session 1. The remaining 97 are used for development. While previous work (Ren et al., 2019; Yoon et al., 2020) separates 151 of the 351 classes for meta-training, we pool all 351 for feature extractor training. We train the same ResNet-18 described in Section 5.1. Additional details regarding learning rate scheduling, optimizer parameters and other training configurations may be found in the appendix.

Incremental evaluation

We follow Yoon et al. (2020) for evaluation: 2000 incremental datasets (“episodes”) are independently sampled from the testing classes, and average accuracies over all episodes are reported with 95% confidence intervals. At every episode, the query set is resampled from both base and novel classes with equal probability, for both miniImageNet and tieredImageNet. We again fine-tune the weights until convergence. We do not reserve samples from base classes; thus the only training samples during incremental evaluation come from the novel classes. We use the same resources as before for label embeddings, and Sentence-BERT embeddings (Reimers and Gurevych, 2019) for class descriptions, which are retrieved from WordNet (Miller, 1995).


We report aggregate results for the 1-shot and 5-shot settings of miniImageNet and tieredImageNet (Table 2). Compared to previous work specialized for the single-session setup, without a straightforward way to extend to multi-session (Ren et al., 2019; Yoon et al., 2020), even our simple fine-tuning baseline performs well on both datasets, outperforming the previous state of the art in three out of four settings in Table 2. The addition of subspace and semantic regularization improves performance in all but the tieredImageNet 1-shot setting. Semantic subspace regularizers match or outperform linear label mapping. Subspace regularization outperforms fine-tuning in 5-shot settings and matches it in 1-shot. In addition to accuracy, we report a quantity labeled Δ by Ren et al. (2019). Δ serves as a measure of catastrophic forgetting, with the caveat that it can be minimized by a model that achieves a classification accuracy of 0 on both base and novel classes. We find that our approaches result in approximately the same Δ on miniImageNet and a worse Δ on tieredImageNet relative to previous work.

6 Analysis and Limitations

Figure 3: Simple fine-tuning (top) vs. subspace regularization (bottom) predictions without memory across the first four incremental sessions of miniImageNet. The x- and y-axes show predictions and gold labels ranging from 0 to 79, where the first 60 are base classes; the number of classes grows by 5 every session, from 60 up to 80. Brighter colors indicate more frequent predictions. Note that simple fine-tuning is biased toward the most recently learned classes (top row), whereas adding subspace regularization on the novel weights remedies this bias, yielding fairer predictions across all classes.

What does regularization actually do?

Without subspace regularization, fine-tuning yields predictions biased toward the most recently learned classes (top of Fig. 3). Our experiments show that preserving the base weights while regularizing the novel weights gives significant improvements over ordinary fine-tuning (bottom of Fig. 3), resulting in fairer predictions over all classes and reduced catastrophic forgetting. In Table 1, Semantic Subspace Reg. achieves 73.8% and 47.6% accuracy in the 1st and 8th sessions, whereas fine-tuning achieves 61.8% and 26.5%, respectively, even without any memory, suggesting that regularization ensures better retention of accuracy. While a trade-off between the accuracies of base and novel classes is inevitable given the nature of the classification problem, the proposed regularizers strike a good balance between the two.
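As a concrete illustration of the mechanism, the following sketch (ours, not the authors' code; the shapes, learning rate, and regularization strength are assumptions) shows how a gradient step on the subspace penalty shrinks the component of the novel weights orthogonal to the span of the base weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.normal(size=(60, 640))   # frozen base classifier weights (60 classes, 640-d features)
W_novel = rng.normal(size=(5, 640))   # trainable weights for 5 novel classes

# Orthonormal basis of the subspace spanned by the base weights (via QR).
Q, _ = np.linalg.qr(W_base.T)         # Q: (640, 60), columns orthonormal

def subspace_penalty(W, Q):
    """Total squared distance of the rows of W to the base-weight subspace."""
    resid = W - (W @ Q) @ Q.T         # component orthogonal to the subspace
    return float(np.sum(resid ** 2))

before = subspace_penalty(W_novel, Q)
# One gradient step on the penalty alone; in training this term is added to
# the ordinary cross-entropy loss with a single extra hyperparameter.
grad = 2.0 * (W_novel - (W_novel @ Q) @ Q.T)
W_novel -= 0.1 * grad                 # lr = 0.1 (illustrative only)
```

Each step scales the orthogonal residual by a constant factor, so repeated steps pull the novel weights toward the subspace without touching their in-subspace component.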

What are the limitations of the proposed regularization scheme?

Our approach targets only errors that originate in the final layer of the model—while a convolutional feature extractor is used, the parameters of this feature extractor are fixed, and we have focused on FSCIL as a linear classification problem. Future work might extend these approaches to incorporate fine-tuning of the (nonlinear) feature extractor itself while preserving performance on all classes in the longer term.

7 Conclusions

We have described a family of regularization-based approaches to few-shot class-incremental learning, drawing connections between incremental learning and the broader multi-task and zero-shot learning literature. The proposed regularizers are extremely simple: they involve only one extra hyperparameter, require no additional training steps or model parameters, and are easy to understand and implement. Despite this simplicity, our approach enables ordinary classification architectures to achieve state-of-the-art results on the doubly challenging problem of few-shot incremental image classification across multiple datasets and problem formulations.


  • A. Agarwal, S. Gerber, and H. Daume (2010) Learning multiple tasks using manifold regularization. In Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta (Eds.), Vol. 23. Cited by: §1, §2, §4.3, §4.3.
  • Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2013) Label-embedding for attribute-based classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 819–826. Cited by: §2.
  • D. Anderson and K. Burnham (2004) Model selection and multi-model inference. Second. NY: Springer-Verlag 63 (2020), pp. 10. Cited by: §1.
  • A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying (2007) A spectral regularization framework for multi-task structure learning. In NIPS, Vol. 1290, pp. 1296. Cited by: §4.3.
  • M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §4.3.
  • A. Barzilai and K. Crammer (2015) Convex multi-task learning by clustering. In Artificial Intelligence and Statistics, pp. 65–73. Cited by: §2.
  • E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. Cited by: §1.
  • M. Chang, L. Ratinov, D. Roth, and V. Srikumar (2008) Importance of semantic representation: dataless classification. In AAAI, Vol. 2, pp. 830–835. Cited by: §2.
  • A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato (2019) On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486. Cited by: §2.
  • K. Chen and C. Lee (2021) Incremental few-shot learning via vector quantization in deep embedded space. In International Conference on Learning Representations, Cited by: Table 4, §2, §2, §4.2, Figure 2, §5.1, §5.1, §5.1, Table 1, footnote 2, footnote 3, footnote 4.
  • A. Cheraghian, S. Rahman, P. Fang, S. K. Roy, L. Petersson, and M. Harandi (2021) Semantic-aware knowledge distillation for few-shot class-incremental learning. arXiv preprint arXiv:2103.04059. Cited by: Table 5, §2, §2, §4.4, §5.1, §5.1, Table 1, footnote 4.
  • D. Das and C. G. Lee (2019) Zero-shot image recognition using relational matching, adaptation and calibration. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §4.4.
  • M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • A. Evgeniou and M. Pontil (2007) Multi-task feature learning. Advances in neural information processing systems 19, pp. 41. Cited by: §2, §4.3.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §2.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) DeViSE: a deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26. Cited by: §2.
  • S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375. Cited by: §2, §3, §5.2, Table 2, footnote 2.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §D.3, Table 6, Appendix D, §5.1, footnote 4.
  • L. Jacob, F. Bach, and J. Vert (2008) Clustered multi-task learning: a convex formulation. In Advances in Neural Information Processing Systems, Cited by: §4.3.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1.
  • N. Kriegeskorte, M. Mur, D. A. Ruff, R. Kiani, J. Bodurka, H. Esteky, K. Tanaka, and P. A. Bandettini (2008) Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60 (6), pp. 1126–1141. Cited by: §4.3.
  • H. Larochelle, D. Erhan, and Y. Bengio (2008) Zero-data learning of new tasks. In AAAI, Vol. 1, pp. 3. Cited by: §2.
  • T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. Diaz Rodriguez (2019) Continual learning for robotics: definition, framework, learning strategies, opportunities and challenges. Information Fusion 58. External Links: Document Cited by: §1.
  • M. Masana, X. Liu, B. Twardowski, M. Menta, A. D. Bagdanov, and J. van de Weijer (2020) Class-incremental learning: survey and performance evaluation. arXiv preprint arXiv:2010.15277. Cited by: §1.
  • P. McClure, C. Y. Zheng, J. Kaczmarzyk, J. Rogers-Lee, S. Ghosh, D. Nielson, P. Bandettini, and F. Pereira (2018) Distributed weight consolidation: a brain segmentation case study. In NeurIPS, Cited by: §1.
  • M. Mermillod, A. Bugaiska, and P. Bonin (2013) The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology 4, pp. 504. Cited by: §5.1.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §5.2.
  • N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141. Cited by: Appendix D.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.4, §5.1.
  • F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, and X. Wang (2020) A review of generalized zero-shot learning methods. arXiv preprint arXiv:2011.08641. Cited by: §2, §4.4.
  • H. Qi, M. Brown, and D. G. Lowe (2018) Low-shot learning with imprinted weights. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5822–5830. Cited by: §2, §3, §5.2, Table 2.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §2.
  • S. Rebuffi, A. Kolesnikov, and C. H. Lampert (2017) ICaRL: incremental classifier and representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §2.
  • S. Reed, Z. Akata, H. Lee, and B. Schiele (2016) Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 49–58. Cited by: §2.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. Cited by: §5.2.
  • M. Ren, R. Liao, E. Fetaya, and R. Zemel (2019) Incremental few-shot learning with attention attractor networks. In Advances in Neural Information Processing Systems, Cited by: §D.4, §D.5, Appendix D, §1, §2, §2, §3, §5.2, §5.2, §5.2, Table 2, footnote 2.
  • M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676. Cited by: Appendix A, §D.5, §5.2.
  • D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019) Experience replay for continual learning. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §2.
  • B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In International conference on machine learning, pp. 2152–2161. Cited by: §4.4.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: Appendix A, §5.1.
  • W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult (2012) Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence 35 (7), pp. 1757–1772. Cited by: §2.
  • E. Schönfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §4.3, footnote 2.
  • E. Schwartz, L. Karlinsky, R. Feris, R. Giryes, and A. M. Bronstein (2019) Baby steps towards few-shot learning with multiple semantics. arXiv preprint arXiv:1906.01905. Cited by: §2.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §2.
  • R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, and A. Y. Ng (2013) Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, Cited by: §4.4.
  • X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, and Y. Gong (2020) Few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12183–12192. Cited by: §1, §2, §2, §2, §3, §3, §5.1, §5.1, §5.1, §5.1, Table 1, footnote 3, footnote 4.
  • Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020) Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539. Cited by: §D.1, §D.3, Table 6, Appendix D, Appendix D, §2, §5.1.
  • L. N. Trefethen and D. Bau III (1997) Numerical linear algebra. Vol. 50, Siam. Cited by: §E.1.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. arXiv preprint arXiv:1606.04080. Cited by: Appendix A, §2, §5.1.
  • T. Winograd (1972) Understanding natural language. Cognitive psychology 3 (1), pp. 1–191. Cited by: §2.
  • S. W. Yoon, D. Kim, J. Seo, and J. Moon (2020) XtarNet: learning to extract task-adaptive representation for incremental few-shot learning. In International Conference on Machine Learning, pp. 10852–10860. Cited by: §D.4, §D.5, Appendix D, §E.3, §1, §2, §2, §3, §5.2, §5.2, §5.2, §5.2, Table 2.

Appendix A Code and Datasets

Code will be made publicly available. We use the miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018) datasets, both of which are subsets of the ImageNet dataset (Russakovsky et al., 2015). Terms of use and licenses are available through the respective sources.

Appendix B Analysis of Catastrophic Forgetting

Model 0 1 2 3 4 5 6 7 8
Semantic Subspace Reg. 80.37 73.76 68.36 64.07 60.36 56.27 53.10 50.45 47.55
Semantic Subspace Reg. no 80.37 71.69 63.81 55.99 49.99 44.06 38.83 36.76 33.25
Fine-tuning 80.37 61.77 49.93 40.45 34.04 31.63 28.43 27.91 26.54
Fine-tuning no 80.37 62.39 53.89 46.56 39.73 32.92 27.00 23.95 20.39
Table 3: miniImageNet weighted average results across multiple sessions for the -M setting, showcasing the usefulness of the ablated component in reducing catastrophic forgetting across multiple sessions. Higher accuracies are highlighted, and results are averages over 10 random splits. The component is useful in combination with Semantic Subspace Reg. for all sessions, and more helpful in the long run for Fine-tuning. Note that Semantic Subspace Reg. consistently outperforms Fine-tuning regardless of its use.

Appendix C Results in Tabular Form

In Table 4 and Table 5, we present the multi-session results from the main paper in tabular form.

Session 0 1 2 3 4 5 6 7 8
Chen and Lee (2021) 64.77 59.87 55.93 52.62 49.88 47.55 44.83 43.14 41.84
Fine-tuning 80.37 67.69 62.91 59.52 56.87 54.37 51.92 50.26 48.13
Subspace Reg. 80.37 72.90 67.81 63.26 60.18 56.74 53.94 51.29 48.83
 Linear Mapping 80.37 73.24 67.96 64.50 61.28 57.68 54.64 52.25 50.00
 Semantic Subspace Reg. 80.37 73.92 69.00 65.10 61.73 58.12 54.98 52.21 49.65
Table 4: miniImageNet +M results across multiple sessions in tabular form. The initial number of base classes is 60, and 5 new classes are introduced at every session. Results are on the test set, which grows with the increasing number of classes. In the last session we evaluate over all 100 classes.
Model 0 1 2 3 4 5 6 7 8
Tao et al. (2020) 61.31 50.09 45.17 41.16 37.48 35.52 32.19 29.46 24.42
Cheraghian et al. (2021)* 62.00 58.00 52.00 49.00 48.00 45.00 42.00 40.00 39.00
Fine-tuning 80.37 61.77 49.93 40.45 34.04 31.63 28.43 27.91 26.54
Subspace Regularization 80.37 71.69 66.94 62.53 58.90 55.00 51.94 49.76 46.79
 Linear Mapping 80.37 72.65 67.11 63.47 59.82 55.44 51.42 49.64 46.90
 Semantic Subspace Reg. 80.37 73.76 68.36 64.07 60.36 56.27 53.10 50.45 47.55
Table 5: miniImageNet -M results across multiple sessions in tabular form. The initial number of base classes is 60, and 5 new classes are introduced at every session. Results are on the test set, which grows with the increasing number of classes. The last session is evaluated over all 100 classes. *The entries for Cheraghian et al. (2021) are rough estimates read off the plot in their published work.

Appendix D Details of Feature Extractor Training

We use the exact ResNet described in Tian et al. (2020); the differences compared to the standard ResNet (He et al., 2016) are: (1) Each block (a collection of convolutional layers) is composed of three convolutional layers instead of two. (2) The number of blocks in ResNet-12 is 4 instead of the standard 6, so the total number of convolutional layers is the same. (3) Filter sizes are [64, 160, 320, 640] rather than [64, 128, 256, 512], though the total number of filters is comparable since Tian et al. (2020) use fewer blocks. (4) DropBlock is applied at the end of the final blocks.

Tian et al. (2020) provide a full visualization in their appendix, and their code repository, on which we base our own codebase, is easy to browse. We observe that previous work often uses slightly modified versions of the standard ResNet. Ren et al. (2019) use ResNet-10 (Mishra et al., 2017) and ResNet-18 for miniImageNet and tieredImageNet, respectively. XtarNet (Yoon et al., 2020) is originally based on slightly modified versions of ResNet-12 and ResNet-18, which we replaced with our version; this improved their results on miniImageNet but not on tieredImageNet, so we report the improved results for miniImageNet and their original results for tieredImageNet in the main paper.

d.1 Default Settings

Unless otherwise indicated, we use the following default settings of Tian et al. (2020) in our feature extractor training. We use the SGD optimizer with a learning rate starting at 0.05, decayed by a factor of 0.1 at epochs 60 and 80; we train for a total of 100 epochs. Weight decay is 5e-4, momentum is 0.9, and batch size is 64. As transformations on training images, we use random crops of 84x84 with padding 8, color jitter (brightness=0.4, contrast=0.4, saturation=0.4), and horizontal flips. For each run, we sample 1000 images from base classes and 25 images from each novel class. Our classifier does not have a bias term.
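The learning-rate schedule above can be written as a small helper (a sketch; the function name and signature are ours):

```python
def learning_rate(epoch, base_lr=0.05, milestones=(60, 80), gamma=0.1):
    """Step schedule from the default settings: start at 0.05 and multiply
    by 0.1 at each milestone epoch (60 and 80, out of 100 total)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

The same schedule is commonly expressed with a multi-step scheduler in deep learning frameworks; the helper just makes the decay points explicit.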

d.2 Multi-Session miniImageNet

We re-sample the set of 60 base classes 10 times with different seeds and train ten ResNet-18 models. Each class has 500 training images. We follow the default settings for training.

d.3 Multi-Session Comparison to Standard ResNet-18

In Table 6 we provide validation set results for two variants of ResNet-18, Tian et al. (2020) and He et al. (2016), across ten different seeds. The results show that using Tian et al. (2020) does not confer an unfair advantage over work that used He et al. (2016).

Seed 1 Seed 2 Seed 3 Seed 4 Seed 5 Seed 6 Seed 7 Seed 8 Seed 9 Seed 10 Mean
Our ResNet-18 (Tian et al., 2020) 84.833 79.167 83.200 81.300 81.267 78.933 82.033 82.067 81.800 82.367 81.6967
Standard ResNet-18 (He et al., 2016) 83.333 80.100 83.867 81.333 80.967 79.100 81.833 82.500 81.167 81.567 81.5767
Table 6: miniImageNet validation set accuracy with two slightly different ResNet-18 architectures, as listed in Appendix D. Overall performance is comparable.

d.4 Single-Session miniImageNet

We follow the default hyperparameters in Section D.1, except that for training, validation, and testing we use the exact splits provided by Ren et al. (2019), also used by Yoon et al. (2020). There are 64 base, 16 validation, and 20 testing classes (totaling 100). Training data consists of 600 images per base class. Dataset statistics are delineated in the appendix of Ren et al. (2019), and downloadable splits are available here, courtesy of Ren et al. (2019).

d.5 Single-Session tieredImageNet

tieredImageNet was first introduced by Ren et al. (2018). As above, we use the default parameters, except that we train for a total of 60 epochs, decaying the initial learning rate of 0.05 by a factor of 0.1 at epochs 30 and 45. Again, we use the same data as previous work, available at the same link above. tieredImageNet is split into 351, 97, and 160 classes. Past work that uses meta-learning (Ren et al., 2019; Yoon et al., 2020) further splits the 351 training classes into 200 and 151 classes, where the latter set is used for meta-learning. We pool all 351 for feature extractor training. At the end of feature extractor training, we keep only the classifier weights for the first 200 classes, to adhere to the 200+5-class evaluation scheme of past work.

Appendix E Details of Incremental Evaluation

e.1 QR Decomposition for Subspace Regularization

To compute an orthonormal basis for the subspace spanned by the base classifier weights, we use the QR decomposition (Trefethen and Bau III, 1997):
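The decomposition can be written as follows (our reconstruction of the elided equation; the symbol names are assumptions):

```latex
\mathbf{W}_{\text{base}}^{\top} = \mathbf{Q}\mathbf{R},
\qquad
\operatorname{proj}_{\mathcal{S}}(\mathbf{w}) = \mathbf{Q}\mathbf{Q}^{\top}\mathbf{w},
```

where the rows of W_base are the base weight vectors, the columns of Q form an orthonormal basis for their span S, R is upper triangular, and the projection operator gives the component of any weight vector w lying in the subspace.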


e.2 Multi-Session

For testing, we sample 1000 images from base classes and 25 images from each novel class. Testing images from a given class stay the same across sessions. Harmonic mean results take into account the ratio of base classes to novel classes in a given session. In this setting, previous work defines no explicit development set (with classes disjoint from train and test), so we use the first incremental learning session (containing 5 novel classes) as our development set.

Default settings

We use the same transformations as in Section D.1 on the training images. We stop fine-tuning when the loss does not change by more than 0.0001 for at least 10 epochs. We use the SGD optimizer. We repeat each experiment 10 times and report the average accuracy with a 95% confidence interval in the paper.
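The stopping criterion can be sketched as a small predicate (our reading of the rule; the function name is ours):

```python
def should_stop(losses, tol=1e-4, patience=10):
    """Stop when the loss has not changed by more than `tol` over at
    least the last `patience` epochs."""
    if len(losses) < patience + 1:
        return False
    recent = losses[-(patience + 1):]
    return max(recent) - min(recent) <= tol
```

A training loop would append the epoch loss to `losses` and break once `should_stop(losses)` returns True.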

Simple Fine-tuning

We use a learning rate of 0.002 and no learning rate decay. Weight decay is set to 5e-3. In order to limit the change in weights, we use different regularization coefficients for base and previously learned novel classes: 0.2 for the former and 0.1 for the latter. We rely on the default settings otherwise.

Subspace Regularization

Differently from simple fine-tuning, we use a weight decay of 5e-4. There is an additional parameter in this setting, controlling how strongly the novel weights are pulled toward the subspace, which we set to 1.0.

Semantic Subspace Regularization

Differently from simple subspace regularization, there is a temperature parameter in the softmax operation used to compute the semantic targets, which we set to 3.0.
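Concretely, the temperature softmax might be used as follows (a hypothetical sketch: only the temperature value comes from the text, while the formula, function, and dimensions are our assumptions):

```python
import numpy as np

def semantic_targets(novel_emb, base_emb, base_weights, temperature=3.0):
    """Turn label-embedding similarities between novel and base classes
    into a temperature softmax over base classes, and use the resulting
    distribution to mix base weight vectors into a regularization target
    for each novel class (our sketch of the mechanism)."""
    logits = (novel_emb @ base_emb.T) / temperature  # (n_novel, n_base)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ base_weights                      # (n_novel, feat_dim)
```

Because each target is a convex combination of base weight vectors, it always lies inside the base-weight subspace.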

Linear mapping regularization

The same parameters as in subspace regularization are used. We formulate the mapping as a linear layer with bias and use gradient descent to train its parameters.
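A minimal version of such a trained linear map might look as follows (a sketch under assumed dimensions and step size, not the authors' implementation; gradient descent fits a linear layer with bias from label embeddings to classifier weights):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 100))  # hypothetical label embeddings for 60 base classes
W = rng.normal(size=(60, 128))    # base classifier weights to be predicted

# Linear layer with bias, trained by plain gradient descent on the
# mean squared error between predicted and actual classifier weights.
A = np.zeros((100, 128))
b = np.zeros(128)
lr = 1e-3
for _ in range(200):
    err = emb @ A + b - W              # (60, 128) residual
    A -= lr * emb.T @ err / len(emb)   # gradient of 0.5 * mean squared error
    b -= lr * err.mean(axis=0)
```

After training, `emb_new @ A + b` would give a regularization target for a novel class from its label embedding `emb_new`.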

e.3 Single-Session

In our 1-shot experiments, we stop at a maximum of 1000 epochs unless fine-tuning converges earlier. We sample 2000 episodes, each including 5 novel classes with 1-5 samples from each, and report average accuracy. For testing, base and novel samples have equal weight in the average, per previous work (Yoon et al., 2020). The SGD optimizer is used. For description similarity we use Sentence-BERT's stsb-roberta-large.

miniImageNet Settings

We use the same set of transformations on the training images as described in Section D.1. We first describe the details of the 1-shot setting, in which we set the maximum number of epochs to 1000. For simple fine-tuning we use a learning rate of 0.003 and a weight decay of 5e-3. In Semantic Subspace Reg., we set the temperature to 1.5. In both Semantic Subspace Reg. and linear mapping, weight decay is 5e-4. In subspace regularization, weight decay is set to 5e-5. Description similarity follows the same setup as Semantic Subspace Reg.

In the 5-shot setting, we set weight decay to 5e-3 and the learning rate to 0.002. Subspace regularization, Semantic Subspace Reg., and linear mapping share one value of the regularization coefficient, while description similarity uses a different one.

Figure 4: Classifier weight space when subspace regularization is applied in the miniImageNet single-session 5-shot setting. The first two principal components are shown according to PCA. Red labels indicate novel classes while black labels indicate base classes. The green crosses indicate the projection of each novel class weight onto the base subspace. Note that unlike label/description similarity and linear mapping, the subspace target is dynamic: it changes according to its corresponding novel weights and vice versa.
Figure 5: Classifier weight space when Semantic Subspace Reg. is applied in the miniImageNet single-session 5-shot setting. The first two principal components are shown according to PCA. Red labels indicate novel classes while black labels indicate base classes. The green crosses indicate the semantic target of each novel class. Note that semantic targets are static: they do not change during fine-tuning. Notably, the semantic target for theater curtain falls close to the class representation of the base class stage, dragging the novel weight for theater curtain toward it. The same dynamic is visible for the novel class crate and the base class barrel.

tieredImageNet Settings

In the 1-shot setting, fine-tuning uses a learning rate of 0.003. Semantic Subspace Reg. uses a learning rate of 0.005 and a weight decay of 5e-3. Subspace regularization and linear mapping follow their own coefficient settings.

In the 5-shot setting, for simple fine-tuning we set the learning rate to 0.001 and weight decay to 5e-3; the regularization coefficient for Semantic Subspace Reg. differs from that of the other methods.

Appendix F Visualizations

In Fig. 4 and Fig. 5 we depict principal components of classifier weights as well as semantic or subspace targets for novel weights.

Appendix G Compute

We use a single 32 GB V100 NVIDIA GPU for all our experiments.