Beyond Simple Meta-Learning: Multi-Purpose Models for Multi-Domain, Active and Continual Few-Shot Learning

by   Peyman Bateni, et al.

Modern deep learning requires large-scale, extensively labelled datasets for training. Few-shot learning aims to alleviate this issue by learning effectively from few labelled examples. Previously proposed few-shot visual classifiers assume that the feature manifold, where classifier decisions are made, has uncorrelated feature dimensions and uniform feature variance. In this work, we address the limitations arising from this assumption by proposing a variance-sensitive class of models that operates in a low-label regime. The first method, Simple CNAPS, employs a hierarchically regularized Mahalanobis-distance based classifier combined with a state-of-the-art neural adaptive feature extractor to achieve strong performance on the Meta-Dataset, mini-ImageNet and tiered-ImageNet benchmarks. We further extend this approach to a transductive learning setting, proposing Transductive CNAPS. This transductive method combines a soft k-means parameter refinement procedure with a two-step task encoder to achieve improved test-time classification accuracy using unlabelled data. Transductive CNAPS achieves state-of-the-art performance on Meta-Dataset. Finally, we explore the use of our methods (Simple and Transductive) for "out of the box" continual and active learning. Extensive experiments on large-scale benchmarks illustrate the robustness and versatility of this relatively simple class of models. All trained model checkpoints and the corresponding source code have been made publicly available.




Code Repositories

Source codes for "Improved Few-Shot Visual Classification" (CVPR 2020), "Enhancing Few-Shot Image Classification with Unlabelled Examples" (WACV 2022) and "Beyond Simple Meta-Learning: Multi-Purpose Models for Multi-Domain, Active and Continual Few-Shot Learning" (TPAMI 2022 - in submission)

1 Introduction

Deep neural networks have facilitated transformative advances in machine learning [37, 85, 35, 32, 43, 34, 68, 70, 30, 79]. However, much of their success depends on training using large-scale, extensively-labelled datasets, and when an exhaustive set of training examples is not available, performance degrades significantly. Few-shot learning [20, 96, 94, 8] seeks to address this issue by developing architectures for learning effectively from few labelled instances. For a novel task, a few-shot visual classifier is derived using a small number of labelled “support” images per category, and is then evaluated on a set of unlabelled “query” images.

(a) Euclidean
(b) Mahalanobis
(c) Transductive
(d) Active Learning
(e) Continual Learning
Fig. 1: Two-dimensional illustrations of task-adapted support image features are shown in the top row. When using Euclidean distance (a) as the metric, each cluster is assumed to have identity covariance, leading to incorrect classification of query examples. Mahalanobis distance (b) resolves this issue. In the transductive setting (c), all query examples are labelled at once, permitting semi-supervised refinement of clusters. During active learning (d), methods can iteratively request labels from a pool of unlabelled examples. In continual learning (e), models see a new task at each iteration and aim to perform well on it without forgetting previous tasks.

Existing few-shot classification approaches can be divided into two categories. First, there are approaches based on nearest-neighbour classification [92], applied either directly in the feature space [40, 42, 77] or after mapping feature vectors onto a natural language space [25]. The second group consists of methods that distill examples into class-wise vector prototypes that are either learned [29, 72] or mathematically derived by mean-pooling the examples [83]. The prototypes are usually defined in the feature space or the natural language space (e.g. word2vec [98]). Most research in this area has focused on learning non-linear mappings (typically deep multi-layer neural networks) from images to a high-dimensional vector space, namely the embedding space.

A pre-specified metric within the embedding space is then used for final nearest-class classification (such as the cosine similarity between the feature embedding of the query image and the class embedding). Significant work has focused on effective task-conditioned adaptation of these non-linear mappings at different levels of granularity [22, 72, 67, 59]. Recently, Conditional Neural Adaptive Processes (CNAPS) [72] achieved high few-shot visual classification accuracy through the use of sparse FiLM [62] layers for partial network adaptation, preventing the over-fitting problems that arise from adapting the entire embedding neural network using few support examples.

Overall, far less attention has been given to the metric used to compute distances for classification within the embedding space. Snell et al. [83] study the underlying distance function in order to justify the use of sample means as prototypes. They argue that Bregman divergences [3] are the theoretically sound family of metrics for the few-shot setting, but only utilize a single instance within this family: the squared Euclidean distance, which they find to perform better than the more traditional cosine metric. However, the choice of the Euclidean metric involves two assumptions: 1) that feature dimensions are uncorrelated and 2) that they have uniform variance. In addition, the Euclidean distance is insensitive to the distribution of within-class samples with respect to their prototype, and recent results [61, 83] suggest that this is problematic. Modelling this distribution (in the case of [3] using extreme value theory) is, as we observe, key to achieving better accuracy.
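As a toy illustration of panels (a) and (b) in Fig. 1, the sketch below (with made-up numbers, not taken from the paper) shows a query point that Euclidean distance assigns to the wrong class because it ignores within-class spread, while Mahalanobis distance recovers the intuitive label:

```python
import numpy as np

def sq_euclidean(x, mu):
    d = x - mu
    return float(d @ d)

def sq_mahalanobis(x, mu, cov):
    d = x - mu
    return float(d @ np.linalg.inv(cov) @ d)

# Class A: widely spread along the first feature dimension.
mu_a, cov_a = np.array([0.0, 0.0]), np.diag([9.0, 0.25])
# Class B: a tight cluster.
mu_b, cov_b = np.array([4.0, 0.0]), np.diag([0.25, 0.25])

query = np.array([2.5, 0.0])

# Euclidean distance ignores the spread and picks B ...
eucl_pick = "A" if sq_euclidean(query, mu_a) < sq_euclidean(query, mu_b) else "B"
# ... while Mahalanobis distance accounts for it and picks A.
maha_pick = "A" if sq_mahalanobis(query, mu_a, cov_a) < sq_mahalanobis(query, mu_b, cov_b) else "B"
```

Under these covariances the query lies well within class A's elongated spread, so the Mahalanobis decision differs from the Euclidean one.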

Starting from this intuition, we develop our first method, the “Simple CNAPS” architecture, which achieves a 6.1% improvement over CNAPS [72] while removing 788,485 parameters (3.2% of the total) from the original CNAPS architecture, replacing them with fixed, closed-form, deterministic covariance estimation and Mahalanobis distance computations. We find, surprisingly, that we are able to generate useful high-dimensional estimates of covariance even in the few-shot classification setting, where the number of available support examples per class is in theory far too small to estimate the required class-specific covariances.

In a standard few-shot learning setting, the classifier is adapted using labelled examples in the support set. Performance can be improved further by exploiting additional unlabelled support data (semi-supervised few-shot learning) [69], or examples in the query set (transductive few-shot learning) [52, 40]. Existing transductive few-shot methods reason about unlabelled examples by performing k-means clustering with Euclidean distance [69] or message passing in graph convolutional networks [52, 40]. These methods, while improving performance, lack the expressiveness of the Mahalanobis-distance based classification framework used in Simple CNAPS. Furthermore, they primarily focus on transductive adaptation within the classification space, disregarding transductive adaptation of the feature extractor altogether.

Fig. 2: Overview of research on few-shot image classification, organized by image feature extractor adaptation scheme (vertical axis) versus final classification methodology (horizontal axis).

Motivated by these observations, we develop a transductive variant of the Simple CNAPS architecture. In this variant, we infer labels for query set examples, which allows us to make use of these examples in class parameter estimates. The resulting architecture, namely “Transductive CNAPS”, extends Simple CNAPS with a transductive two-step task-encoder, and an iterative soft k-means procedure for refining class parameter estimates (mean and covariance). We demonstrate that this approach achieves state of the art performance on Meta-Dataset [90]. In addition, Transductive CNAPS achieves notable performance on mini-ImageNet [83] and tiered-ImageNet [69] benchmarks.

Furthermore, Requeima et al. [72] recently proposed to evaluate pre-trained few-shot classifiers for active learning and continual learning “out of the box”, without any additional problem-specific fine-tuning. Following their work, we also explore Simple and Transductive CNAPS [5, 4, 6] in the context of “out of the box” active learning and continual learning. In the active learning setting, a meta-trained few-shot classifier is presented with a set of unlabelled examples, from which it can acquire the label for a single example at each iteration. The goal is to select examples that maximally improve performance on a separate set of test examples. We evaluate both methods without any additional training specific to active learning, and show that uncertainty-driven selection strategies outperform random selection in our models, demonstrating their ability to produce effective measures of uncertainty.

In the continual learning problem domain, a few-shot learning method is presented with a new classification task at each iteration. The goal is to continuously learn to classify accurately on these new tasks while maintaining good accuracy on the previous tasks. Continual learning requires this to be achieved without explicitly saving every data point to memory. In our work, we propose and explore three continual learning strategies within the framework of our methods.

Fig. 3: Overview of neural adaptive feature extraction in Simple, Transductive and “original” CNAPS (figure adapted from prior work).
Our contributions: (1) We describe “Simple CNAPS”, a neural adaptive few-shot classifier with a regularized Mahalanobis-distance based classifier that achieves strong performance on Meta-Dataset. (2) We extend our work to transductive few-shot learning by developing “Transductive CNAPS”, which augments our first method with a two-step transductive set-encoder and an iterative soft k-means procedure for refinement of class parameters. This method achieves state-of-the-art (SoTA) performance on Meta-Dataset. (3) In addition to grounding our work in probabilistic mixture models, we provide an extensive discussion and interpretation of both architectures as Riemannian metric learners. This analysis is consistent with our empirical results. (4) We evaluate both methods on “out of the box” active learning using three active label acquisition strategies. We also modify Simple and Transductive CNAPS for “out of the box” continual learning, proposing three approaches to continual estimation of the task encoding that achieve competitive performance. (5) Finally, through extensive experiments and ablations, we study the importance of our design choices.

2 Related Work

2.1 Few-Shot Learning with Labelled Data

Past research on few-shot classification [96] can be differentiated along two major axes: 1) how images are transformed into vectorized embeddings, and 2) how “distances” are computed between vectors in order to assign labels. This is illustrated in Figure 2.

Siamese networks [42], an early approach to few-shot learning, employed a shared feature extractor to produce vector embeddings for both the support and the query images. Classification was then done by picking the smallest weighted L1 distance between query and labelled image embeddings. Relation networks [87], and recent GCNN variants [40, 77], extended this by parameterizing and learning the classification metric using a Multi-Layer Perceptron (MLP). Matching networks [92] learned distinct feature extractors for support and query images, which were then used to compute cosine similarities for classification.

The feature extractors used by these models were, notably, not adapted to test-time classification tasks. It has since become established that adapting the feature extractor to new tasks at test time is generally beneficial. Transfer-learning approaches [103] did this by fine-tuning the feature extraction network using the task-specific support images, but found limited success due to overfitting to the generally very few support examples. MAML [22] (and its various extensions [56, 59, 67]) mitigated this problem by learning a set of meta-parameters that enable the feature extractor to be adapted to new tasks, given few support examples, using only a few gradient descent steps.

The two supervised few-shot learning approaches most directly related to this work are CNAPS [72] (and the related TADAM [61]) and Prototypical Networks [83]. CNAPS is a few-shot adaptive visual classifier based on conditional neural processes (CNPs) [27]. It is a high-performing approach for few-shot image classification [72] that uses a pre-trained feature extractor augmented with FiLM layers [62] that are adapted for each task using the support images specific to that task. CNAPS uses a dot-product distance in a final linear classifier; the parameters of which are also adapted at test-time to each new task (see Section 4.1).

Prototypical Networks [83] employ a non-adaptive feature extractor with a simple mean-pooling operation to form class “prototypes” from support instances in a meta-learned feature space. They employ episodic training to learn this feature extractor: at each training iteration, a few-shot task is provided with a “support” set for adaptation and a “query” set for evaluation and for calculating a classification loss used for gradient descent. Squared Euclidean distances to the prototypes are computed and subsequently used for nearest-class classification. This choice of distance metric was motivated by the theoretical properties of Bregman divergences [3], a family of functions of which the squared Euclidean distance is a member. These properties establish a correspondence between the use of the squared Euclidean distance in a softmax classifier and performing density estimation. Expanding on the work of Snell et al. [83], we exploit similar properties of the squared Mahalanobis distance as a Bregman divergence [3] to draw theoretical connections to Bregman soft-clustering. We additionally provide an extensive alternative theoretical grounding of both methods in Riemannian metric learning.
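The prototype-based classification rule described above can be sketched in a few lines of numpy (an illustrative re-implementation, not the authors' code; feature extraction is abstracted away and all names are ours):

```python
import numpy as np

def prototype_classify(support_feats, support_labels, query_feats, n_classes):
    """Prototypical-Networks-style classification: prototypes are class means
    of support features; logits are negative squared Euclidean distances,
    normalized with a softmax."""
    protos = np.stack([support_feats[support_labels == k].mean(axis=0)
                       for k in range(n_classes)])
    # (n_query, n_classes) matrix of squared distances to each prototype.
    d2 = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)
```

Each query is thus assigned to its nearest class mean, with the softmax turning distances into class probabilities.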

Our work differs from CNAPS [72] and Prototypical Networks [83] in the following ways. First, while CNAPS has demonstrated the importance of adapting the feature extractor to a specific task, we show that adapting the classifier is actually unnecessary to obtain good performance. Second, we demonstrate that an improved choice of Bregman divergence can significantly impact accuracy. Specifically, we demonstrate that regularized class-specific covariance estimation from task-specific adapted feature vectors allows for use of the Mahalanobis distance for classification, achieving significant improvements in performance and establishing state of the art in the case of Transductive CNAPS.

Our methods also relate to recent works [23, 100] that model per-class clusters with Gaussian mean and covariance estimates. Fort et al. [23] extended Prototypical Networks to use the Mahalanobis distance by incorporating class-wise diagonal covariance estimates generated by a learned network. Unlike their work, our methods produce per-class closed-form full covariance estimates within a conditionally neural-adapted feature space, all trained end-to-end, making them especially effective empirically. Yang et al. [100] produce class-wise Gaussian distributions within a fixed pre-trained feature space. These distributions are first fine-tuned using mixtures of Gaussian statistics from visually similar classes. They are then used to sample new support examples within the feature space, resulting in an augmented support set that is used to train a logistic regression classifier for the task at hand. Although we similarly regularize the class-wise covariance estimates using the task-level covariance, our methods do not retain any Gaussian statistics about previous tasks/classes, relying solely on the support examples provided. Furthermore, we adapt the underlying feature space through an end-to-end learned adaptation procedure, resulting in empirically useful higher-dimensional mean and covariance estimates from the very few available support examples. Lastly, we use the Mahalanobis distance within the distribution space itself to perform classification, incorporating inter/intra-class variances within the decision boundaries.

2.2 Few-Shot Learning with Unlabelled Data

The use of unlabelled instances for transductive few-shot learning has been the subject of study by several methods [40, 52, 69]. EGNN [40] employs a graph convolutional edge-labelling network for iterative propagation of labels from support to query instances. Similarly, TPN [52] learns a graph construction module for neural propagation of soft labels between elements of the query set. These methods rely on neural parameterizations of distance in the feature space. TEAM [64] performs inference on query examples using an episode-wise, transductively adapted task-specific metric. Song et al. [84] use a cross attention network with an iterative transductive approach to augment the support set using the query examples.

Fig. 4: Transductive CNAPS extends the Mahalanobis-distance based classifier in Simple CNAPS through transductive soft k-means clustering of the visual space.

The closest approach to our transductive work, namely Transductive CNAPS, is that of Ren et al. [69]. Their method extends Prototypical Networks [83] by incorporating a single additional soft-labelled weighted estimation of class prototypes. Transductive CNAPS, on the other hand, differs in three major ways. First, we produce soft-label estimates of both the mean and covariance. Second, we use an expectation-maximization (EM) inspired algorithm that performs a dynamic number of soft-label updates, depending on the task at hand. Lastly, we employ a neural-adaptive procedure for feature extraction that is conditioned on a two-step learned transductive task representation, as opposed to a fixed feature extractor. This novel task-representation encoder is responsible for significant performance gains on out-of-domain tasks.


2.3 Active Learning

Active learning is a major research paradigm in machine learning that focuses on requesting labels for unlabelled examples such that performance gain is maximized. Specifically, active learning aims to make data-labelling part of the learning process itself, such that samples are chosen for labelling by the model.

Existing methods focus on three major label selection strategies. First, there are uncertainty-based methods [9, 39, 49, 66, 81, 89], where sample selection is performed using class-wise probabilities as a measure of model uncertainty on the samples. Second are diversity-based methods [10, 26, 31, 58], where selection is performed such that diversity among categories and their labelled examples is maximized. Third, approaches have also been proposed [24, 73, 80] that use expected model change as the criterion for sample selection, where unlabelled examples are ranked on the basis of the expected parametric change to the model. Dataset-independent sample selection based on uncertainty, where the uncertainty of any unlabelled example does not depend on the rest of the set, can lead to sampling bias. However, focusing on strategies that promote diversity may result in limited performance gain relative to the number of labels obtained. Motivated by these observations, hybrid instance selection strategies [2, 82, 102] have been proposed that use mixtures of uncertainty-based, diversity-based and expected-update criteria for active learning, leveraging each of their strengths while minimizing their weaknesses.

Although active learning has been the subject of much study, “out of the box” use of few-shot classifiers for active learning has only been explored recently [72, 63]. Requeima et al. demonstrate that CNAPS [72], when used for “out of the box” active learning, is able to outperform Prototypical Networks [83]. The margin of improvement is even greater when uncertainty-based approaches to label acquisition are employed, as opposed to random selection. Pezeshkpour et al. [63] study the effects of various active few-shot learning strategies within the context of Simple CNAPS and Prototypical Networks, ultimately concluding that better strategies need to be developed for effective active few-shot learning.

2.4 Continual Learning

Continual learning focuses on developing methods that can learn to perform new tasks without “forgetting” previous ones. In particular, it studies the problem of learning from an infinite stream of tasks, where the goal is to gradually acquire knowledge and use it for future learning without losing existing knowledge. Most previous works propose different criteria and strategies for gradually updating parts of the learned network as new tasks, and by consequence new labelled examples, are seen. Zenke et al. [104] propose Synaptic Intelligence (SI), a network of synapses with complex three-dimensional state spaces tracking past and current parameter values. Online estimates of each synapse's “importance” are employed to consolidate important synapses by preventing them from future learning, while allocating unimportant synapses to learn from future tasks. EWC [41] slows down learning on select parameters based on how important they are to previously seen tasks, thus preserving the learned knowledge. VCL [57] fuses online variational inference (VI) with Monte Carlo VI to train deep discriminative and generative models in complex continual learning settings. Chaudhry et al. propose RWalk [13], a generalized and more efficient EWC with a theoretically grounded KL-divergence based perspective that achieves superior accuracy on a range of continual learning image classification benchmarks.

CNAPS [72] presents the first instance of “out of the box” use of few-shot image classifiers for continual learning. Here, the few-shot classifier is first trained on Meta-Dataset [90] and then used for continual learning without any additional training. Class means from old and new tasks are combined through context-weighted estimates. This is performed with adaptation of the feature space on the current task only, resulting in feature manifolds that differ from one another across tasks. We refer to this strategy as the “Moving Encoding”, and explain it in detail when discussing continual learning in Simple and Transductive CNAPS.

3 Problem Definition

Fig. 5: Overview of the transductive task encoder used for task adaptation in Transductive CNAPS.

Following previous work [83, 5, 72, 22], we focus on a few-shot learning setting where a distribution over image classification tasks is available for episodic training. Each task consists of a support set S of labelled images and a query set Q of unlabelled images; the objective is to predict labels for these query examples, given the (typically small) support set. Each query image x* in Q has a corresponding ground-truth label y* available at training time. A model is trained by minimizing, over parameters θ shared across tasks, the expected query-set classification loss over tasks: the negative log-probability −log p(y* | x*, S, θ) for Simple CNAPS, and the joint negative log-probability −log p(y*_1, …, y*_M | x*_1, …, x*_M, S, θ) to train Transductive CNAPS. Note that the dependence on all of Q in the transductive case allows for joint prediction of labels for the query set, all at once. At test time, a separate distribution of tasks generated from previously unseen images and classes is used to evaluate performance. Let us also define shot as the number of support examples per class, and way as the number of classes within the task.
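The episodic structure above can be made concrete with a toy sketch: a synthetic Gaussian "task" generator (shapes and names are ours, purely illustrative) plus the query-set cross-entropy objective that episodic training minimizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(way=3, shot=5, n_query=4, dim=8):
    """Draw a toy episodic task: `way` classes, `shot` labelled support
    examples per class, and `n_query` query examples per class."""
    centers = rng.normal(size=(way, dim))
    def draw(n):
        xs = np.concatenate([c + 0.1 * rng.normal(size=(n, dim)) for c in centers])
        ys = np.repeat(np.arange(way), n)
        return xs, ys
    (sx, sy), (qx, qy) = draw(shot), draw(n_query)
    return sx, sy, qx, qy

def query_loss(probs, qy):
    """Cross-entropy on the query set: the objective minimized, in
    expectation over tasks, during episodic training."""
    return float(-np.mean(np.log(probs[np.arange(len(qy)), qy] + 1e-12)))
```

A 3-way, 5-shot task here yields a 15-example support set and a 12-example query set; any task-adapted classifier's query probabilities can be scored with `query_loss`.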

4 Methods

4.1 CNAPS

Conditional Neural Adaptive Processes (CNAPS) [72] consist of two modules: a feature extractor and a linear classifier, which are each task-dependent. Adaptation is performed by meta-trained network adaptation modules that condition on the support set.

The method uses a ResNet18 [34] architecture (Figure 3) as the feature extractor. This ResNet18 is trained separately, prior to episodic training of the adaptation networks. Within each residual block, Feature-wise Linear Modulation (FiLM) layers are inserted to compute a scale factor and shift for each output channel, using block-specific adaptation networks that are conditioned on a task encoding. The task encoding consists of the mean-pooled feature vectors of the support examples produced by a separate, but end-to-end learned, convolutional neural network (CNN). This produces a task-adapted feature extractor f(·; S) (which implicitly depends on the support set S) that maps support/query images onto the corresponding adapted feature space. We write f(S) and f(Q) to denote versions of the support and query sets in which each image x is mapped into its 512-dimensional feature vector representation f(x; S).

Classification in CNAPS is performed by a task-adapted linear classifier, where the class probabilities for a query image x* are computed as softmax(W f(x*; S) + b). The classification weights W and biases b are produced by a classifier adaptation network ψ, where for each class k in the task, the corresponding row of classification weights is produced by ψ from the class mean μ_k. The class mean μ_k is obtained by mean-pooling the feature vectors of the support examples of class k extracted by the adapted feature extractor f(·; S).
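The FiLM-style channel-wise adaptation can be sketched as follows. This is a minimal numpy illustration with a random linear stand-in for the learned adaptation network; names, shapes, and the linear map are our assumptions, not the paper's architecture.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel.
    `features` is (batch, channels); gamma/beta come from an adaptation
    network conditioned on the task encoding."""
    return gamma[None, :] * features + beta[None, :]

def task_encoding(support_feats):
    """CNAPS-style task encoding: mean-pool encoded support examples."""
    return support_feats.mean(axis=0)

# A random linear "adaptation network" standing in for the learned one.
rng = np.random.default_rng(0)
W_g, W_b = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

support = rng.normal(size=(10, 4))
enc = task_encoding(support)
gamma, beta = 1.0 + W_g @ enc, W_b @ enc   # scale initialized around identity
adapted = film(rng.normal(size=(6, 4)), gamma, beta)
```

With gamma = 1 and beta = 0 the layer is the identity, so the pre-trained extractor is recovered when no adaptation is applied.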

4.2 Simple CNAPS

Fig. 6: Transductive CNAPS’ soft k-means Mahalanobis-distance based clustering procedure. First, cluster parameters are initialized using the support examples. Then, during cluster update iterations, the query examples are assigned class probabilities as soft labels and subsequently, both soft-labelled query examples and labelled support examples are used to estimate new cluster parameters.

In Simple CNAPS, we employ the same ResNet18 for feature extraction with the same adaptation module. However, because of the novel classification architecture we use, the adaptation module is trained to do something different than it does in CNAPS. This choice, as in CNAPS, allows for task-dependent adaptation of the feature extractor. Unlike CNAPS, we undertake the classification step in Simple CNAPS by computing a softmax of a squared Mahalanobis distance (note that we explicitly omit the conventional scaling coefficient, as we experimentally verified that including it decreases performance),

p(y* = k | x*, S, θ) = softmax(−d_k(f(x*; S), μ_k)),  where  d_k(u, v) = (u − v)^T (Q_k)^{−1} (u − v),    (1)

of the feature vector f(x*; S) relative to each class k, by estimating a mean μ_k and a regularized covariance Q_k in the adapted feature space, using the support instances,

μ_k = (1/N_k) · sum_i 1[y_i = k] f(x_i; S),  Q_k = λ_k Σ_k + (1 − λ_k) Σ + β I,  λ_k = N_k / (N_k + 1).    (2)

Here 1[·] is the indicator function and N_k = sum_i 1[y_i = k] is the number of examples with class k in the support set S. The ratio λ_k balances a task-conditional sample covariance Σ and a class-conditional sample covariance Σ_k,

Σ_k = (1/N_k) · sum_i 1[y_i = k] (f(x_i; S) − μ_k)(f(x_i; S) − μ_k)^T,
Σ = (1/N) · sum_i (f(x_i; S) − μ)(f(x_i; S) − μ)^T,    (3)

where μ is the task-level mean. When few support examples are available for a particular class, λ_k is small, and the estimate Q_k is regularized towards the task-level covariance Σ. As the number of support examples for the class increases, the estimate tends towards the class-conditional covariance Σ_k. Additionally, a regularizer β I (β is fixed in our experiments) is added to ensure invertibility.
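The Simple CNAPS classification rule — class means, hierarchically regularized covariances Q_k = λ_k Σ_k + (1 − λ_k) Σ + β I with λ_k = N_k/(N_k + 1), and softmaxed negative squared Mahalanobis distances — can be sketched in numpy as below. Feature extraction is abstracted away, the covariance normalization and `beta=1.0` are illustrative choices, and this is our re-implementation rather than the released code.

```python
import numpy as np

def simple_cnaps_classify(support_f, support_y, query_f, n_classes, beta=1.0):
    """Mahalanobis-distance classification under hierarchically
    regularized covariance estimates (a Simple-CNAPS-style sketch)."""
    dim = support_f.shape[1]
    task_mu = support_f.mean(axis=0)
    diffs = support_f - task_mu
    task_cov = diffs.T @ diffs / len(support_f)      # task-level covariance
    logits = np.empty((len(query_f), n_classes))
    for k in range(n_classes):
        xs = support_f[support_y == k]
        n_k = len(xs)
        mu_k = xs.mean(axis=0)
        d = xs - mu_k
        class_cov = d.T @ d / n_k                    # class-level covariance
        lam = n_k / (n_k + 1.0)                      # lambda_k = N_k / (N_k + 1)
        Q_k = lam * class_cov + (1 - lam) * task_cov + beta * np.eye(dim)
        delta = query_f - mu_k
        # Squared Mahalanobis distance of each query to class k.
        logits[:, k] = -np.einsum('ij,jk,ik->i', delta, np.linalg.inv(Q_k), delta)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)
```

With few shots per class, `lam` is small and `Q_k` leans on the task-level covariance, which is the hierarchical regularization described above.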

4.3 Transductive CNAPS

Transductive CNAPS extends Simple CNAPS by making use of the query set, both for feature adaptation and classification. First, the task encoder is extended to incorporate both a support-set embedding g_S and a query-set embedding g_Q such that

g_S = (1/K) · sum_k (1/N_k) · sum_i 1[y_i = k] e(x_i),  g_Q = (1/|Q|) · sum_j e(x*_j),

where e is a learned CNN and K is the number of classes. The support embedding g_S is formed by an average of the encoded support examples, with weighting inversely proportional to their class counts to prevent bias from class imbalance. The query embedding g_Q uses simple mean-pooling; both embeddings g_S and g_Q are invariant to permutations of the corresponding support/query instances. We then process g_S and g_Q, in that order, through two steps of a Long Short-Term Memory (LSTM) network to generate a final transductive task representation to be used for adaptation. This process is visualized in Figure 5.
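A minimal sketch of this two-step encoder is below, using a from-scratch LSTM cell; the single weight matrix, dimensions, and names are hypothetical stand-ins for the learned, end-to-end trained components.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One step of a minimal LSTM cell; W maps [x; h] to the four gates."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def transductive_task_encoding(support_enc, support_y, query_enc, W):
    """Two-step transductive task encoder sketch: a class-balanced support
    embedding and a mean-pooled query embedding are fed, in order, through
    two steps of the same LSTM; the final hidden state is the task
    representation."""
    counts = np.bincount(support_y)[support_y].astype(float)
    # Inverse-class-count weighting = average of per-class means.
    g_support = (support_enc / counts[:, None]).sum(0) / len(np.unique(support_y))
    g_query = query_enc.mean(axis=0)
    d = support_enc.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    h, c = lstm_step(g_support, h, c, W)
    h, c = lstm_step(g_query, h, c, W)
    return h
```

Both pooled embeddings are permutation-invariant by construction; the LSTM only sees their fixed two-step order.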

Second, we can interpret Simple CNAPS as a form of “supervised clustering” in feature space; each cluster (corresponding to a class k) is parameterized with a centroid μ_k and a metric induced by Q_k, and we interpret (1) as class assignment probabilities based on the distance to each centroid. With this viewpoint in mind, a natural extension to consider is to use the estimates of the class assignment probabilities on unlabelled instances to refine the class parameters μ_k and Q_k in a soft k-means framework based on per-cluster Mahalanobis distances [55]. In this framework, as shown in Figure 6, we alternate between computing updated assignment probabilities using (1) on the query set and using those assignment probabilities to compute updated class parameters.

We define D as the disjoint union of the support set S and the query set Q. For each element of D, which we index by i, we define responsibilities w_ik in terms of the class predictions when the element is part of the query set, and in terms of the label when it is part of the support set,

w_ik = p(y_i = k | x_i, S, θ) if x_i ∈ Q,  w_ik = 1[y_i = k] if x_i ∈ S.    (8)

Using these responsibilities, we can incorporate the unlabelled samples from the query set by defining weighted estimates μ'_k and Σ'_k:

μ'_k = (1/N'_k) · sum_i w_ik f(x_i; S),    (9)

where N'_k = sum_i w_ik defines the effective number of examples assigned to class k. The covariance estimates Σ'_k and Σ' are

Σ'_k = (1/N'_k) · sum_i w_ik (f(x_i; S) − μ'_k)(f(x_i; S) − μ'_k)^T,    (10)
Σ' = (1/|D|) · sum_i (f(x_i; S) − μ')(f(x_i; S) − μ')^T,    (11)

where μ' is the task-level mean.

1:procedure compute_query_labels(S, Q)
2:     Initialize responsibilities w_ik, for i ranging over the support and query sets, using only the labelled support examples
3:     for iter = 1, …, max_iters do (the first iteration is equivalent to Simple CNAPS)
4:         Compute class parameters according to update equations (9)-(11)
5:         Compute class weights w_ik using the class parameters according to (8)
6:         break if the most probable class for each query example hasn't changed
7:     end for
8:     return class probabilities w_ik for i corresponding to Q
9:end procedure
Algorithm 1 Iterative Refinement in Transductive CNAPS

These update equations are weighted versions of the original Simple CNAPS estimators from Section 4.2, and reduce to them exactly in the case of an empty query set.

Algorithm 1 summarizes the soft k-means procedure based on these updates. We initialize class weights using only the labelled support data. We use those weights to compute class parameters, then compute updated weights using both the support and query sets. At this point, the weights associated with the query set are the same class probabilities as estimated by Simple CNAPS. In Transductive CNAPS, however, we repeat this procedure iteratively until we either reach a maximum number of iterations or class assignments stop changing.
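A simplified, runnable sketch of this refinement loop follows. For brevity it refines centroids only, under plain squared Euclidean distances, whereas Transductive CNAPS also refines covariances and uses per-cluster Mahalanobis distances; all names are illustrative.

```python
import numpy as np

def soft_assign(feats, centroids):
    """Softmax over negative squared distances to each centroid."""
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    logits = -d2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def refine(support_f, support_y, query_f, n_classes, max_iters=10):
    """Soft k-means refinement in the spirit of Algorithm 1: support
    responsibilities stay fixed at their one-hot labels; query
    responsibilities are re-estimated each iteration and folded into
    weighted class-mean updates."""
    r_support = np.eye(n_classes)[support_y]           # fixed one-hot labels
    all_f = np.concatenate([support_f, query_f])
    # Initialization uses only the labelled support data.
    centroids = np.stack([support_f[support_y == k].mean(0) for k in range(n_classes)])
    prev = None
    for _ in range(max_iters):
        r_query = soft_assign(query_f, centroids)
        labels = r_query.argmax(axis=1)
        if prev is not None and np.array_equal(labels, prev):
            break                                      # assignments stopped changing
        prev = labels
        r = np.concatenate([r_support, r_query])       # responsibilities over S and Q
        centroids = (r.T @ all_f) / r.sum(axis=0)[:, None]
    return soft_assign(query_f, centroids)
```

The first pass reproduces the support-only classifier; subsequent passes pull the centroids toward confidently-labelled query examples.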

Unlike the transductive task-encoder, this second extension in Transductive CNAPS, namely the soft k-means iterative estimation of class means and covariances, is used at test time only. During training, a single estimate is produced for both mean and covariance using only the support examples. As we discuss further in Section VI, this choice empirically performs better. See Figure 4 for a high-level comparison of classification in Simple CNAPS vs. Transductive CNAPS.

4.4 Active Learning

Following past work [71], our main objective is to evaluate whether Simple and Transductive CNAPS perform better with uncertainty-based active label acquisition methods, as opposed to random selection, when used for “out of the box” active learning.

Here, both models are first trained on a large few-shot learning benchmark, namely Meta-Dataset [90]. Then, without any further training, they are evaluated on active learning, where, given a set of unlabelled images, a label can be acquired at each iteration. We consider three standard label acquisition methods: random selection and two uncertainty-based approaches, specifically predictive entropy [17] and variation ratios [17].

For uncertainty-based label acquisition, unlabelled examples are first processed by the adapted network to produce class probabilities. These class probabilities are then used to rank examples. In predictive entropy, the entropy of the class probabilities is calculated and examples with maximal predictive entropy are selected for label acquisition. In variation ratios, a score is calculated for each example as the maximum probability assigned to any of the classes; examples are then ranked by this score, and the lowest-scoring examples are selected for label acquisition.
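The three acquisition criteria above can be sketched as a small scoring routine (illustrative only; `rank_for_acquisition` is a hypothetical helper, not part of the released code):

```python
import numpy as np

def rank_for_acquisition(class_probs, method="entropy"):
    """Rank unlabelled examples for label acquisition from predicted
    class probabilities (one row per example, rows summing to 1).
    Returns indices ordered from first-to-acquire to last."""
    p = np.asarray(class_probs)
    if method == "entropy":
        # Higher predictive entropy = more uncertain = acquire first.
        scores = -(p * np.log(p + 1e-12)).sum(axis=1)
        return np.argsort(-scores)
    if method == "variation_ratios":
        # Lower maximum class probability = more uncertain = acquire first.
        return np.argsort(p.max(axis=1))
    if method == "random":
        return np.random.permutation(len(p))
    raise ValueError(method)
```

For the two uncertainty criteria the orderings often agree, but they can differ on multi-class distributions where entropy is spread across several low-probability classes.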

4.5 Continual Learning

To perform “out of the box” continual learning, we modify Simple and Transductive CNAPS in two major ways. After both methods have been episodically meta-trained on a large few-shot learning benchmark (e.g. Meta-Dataset [90]), each architecture is deployed in a continual learning setting where, at each time step $t$, a new task consisting of a support set $\mathcal{S}_t$ and a query set $\mathcal{Q}_t$ is presented.

Fig. 7: “Out of the box” active learning performance of Simple CNAPS and Transductive CNAPS on CIFAR10 and select OMNIGLOT languages. For each model, we compare 3 label acquisition methods. Results on other OMNIGLOT languages can be found in the Appendix.

Each task at time $t$ can consist of both previously seen classes and entirely novel categories. For novel classes, we produce and save class-wise estimates of mean and covariance using equations 1 and 3 for Simple CNAPS and their corresponding transductive extensions for Transductive CNAPS. For previously seen classes, weighted estimates are produced for both means and covariances based on the number of class examples present in the task and the number of class instances previously seen,

$$\mu_k^t = \frac{n_k^t\, \hat\mu_k^t + m_k\, \mu_k^{t-1}}{n_k^t + m_k}, \qquad \Sigma_k^t = \frac{n_k^t\, \hat\Sigma_k^t + m_k\, \Sigma_k^{t-1}}{n_k^t + m_k},$$

where $n_k^t$ indicates the number of support examples for class $k$ within the new task seen at time $t$ and $m_k$ is the number of support instances seen for class $k$ prior to the task. $\mu_k^{t-1}$ and $\Sigma_k^{t-1}$ specify the saved mean and covariance estimates prior to the current task, whereas $\hat\mu_k^t$ and $\hat\Sigma_k^t$ are task parameters estimated using only the task at time $t$.
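The count-weighted merge described above can be sketched as follows (hypothetical helper; note this is a plain convex combination of the saved and task-level statistics, and a more careful covariance merge would also account for the shift between the old and new means):

```python
import numpy as np

def merge_class_stats(mu_old, cov_old, m_old, mu_task, cov_task, n_task):
    """Count-weighted merge of saved class statistics with estimates
    from the newest task.

    m_old:  support examples seen for this class before the task.
    n_task: support examples for this class within the new task.
    Returns the merged mean, merged covariance, and updated count.
    """
    total = m_old + n_task
    mu = (m_old * mu_old + n_task * mu_task) / total
    cov = (m_old * cov_old + n_task * cov_task) / total
    return mu, cov, total
```

With `n_task = 0` the saved statistics are returned unchanged, and with `m_old = 0` the class is treated as novel, matching the two cases in the text.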

The second modification to both methods for continual learning involves the continuous adaptation of the feature extractor $f_\theta$. This is primarily accomplished by developing strategies for continually updating the task-encoding. Let us denote this task-encoding at time $t$ as $g^t$. We consider three update strategies.

First, we use the “Moving Encoding”, where $g^t = g(\mathcal{S}_t)$ for Simple CNAPS and $g^t = g(\mathcal{S}_t \cup \mathcal{Q}_t)$ for Transductive CNAPS. That is, the encoding is set to the task-encoding of the most recent task. This naturally results in different feature space manifolds at different time steps, while saved class representations are based on the manifolds at the time of their respective tasks.

Second, we focus on the case of fixing the feature space to the task-encoding of the first task. We refer to this variation as “First Encoding”, where $g^t = g(\mathcal{S}_1)$ for Simple CNAPS and $g^t = g(\mathcal{S}_1 \cup \mathcal{Q}_1)$ for Transductive CNAPS. This ensures that all class-wise and query embeddings are generated within the same feature manifold. However, that manifold has been task-adapted for maximal performance on only the first task observed during continual learning.

Third, we consider the “Averaging Encoding” strategy, with

$$g^t = \frac{1}{t}\, g(\mathcal{S}_t) + \frac{t-1}{t}\, g^{t-1}$$

(with $g(\mathcal{S}_t \cup \mathcal{Q}_t)$ in place of $g(\mathcal{S}_t)$ for Transductive CNAPS), a $1/t$-scaled weighted convex combination of the current task-encoding and the task-encoding up to the previous time step. This update strategy strikes a hybrid balance between maintaining feature space manifolds that do not vary considerably between time steps and partially adapting the feature space to new tasks as they arrive.
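The “Averaging Encoding” rule amounts to a running average of task encodings; a minimal sketch (hypothetical helper name; encodings are assumed to support scalar arithmetic, e.g. NumPy arrays or floats):

```python
def averaged_encoding(g_prev, g_task, t):
    """Running average of task encodings at time step t (1-indexed):
    weight the newest task's encoding by 1/t and the previous running
    average by (t - 1)/t."""
    if t == 1:
        return g_task                 # first task defines the encoding
    return (1.0 / t) * g_task + ((t - 1.0) / t) * g_prev
```

Unrolling the recursion shows that $g^t$ is exactly the arithmetic mean of the $t$ task encodings seen so far, which is why the encoding stabilizes as more tasks arrive.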

5 Theoretical Motivation

(a) Multi-H Simple CNAPS + Moving Encoding
(b) Multi-H Simple CNAPS + First Encoding
(c) Multi-H Simple CNAPS + Averaging Encoding
(d) Single-H Simple CNAPS + Moving Encoding
(e) Single-H Simple CNAPS + First Encoding
(f) Single-H Simple CNAPS + Averaging Encoding
Fig. 8: Task-wise classification accuracy of “out of the box” continual learning on MNIST. Here, consecutive pairs of digits are grouped together as one task and iteratively shown to the models. The x-axis specifies the iteration in the continual learning process, while the coloured bars signify classification accuracy on the query test set of each task.

5.1 Relationship to Bregman Soft Clustering

The procedure in Algorithm 1 resembles the Bregman clustering algorithms of Banerjee et al. [3]. Specifically, the updates to soft assignments in Equation 1 are the semi-supervised equivalent of those in Bregman soft clustering, in which the divergence is the Mahalanobis distance

$$d_{Q_k}(x, \mu_k) = (x - \mu_k)^\top Q_k^{-1} (x - \mu_k).$$

However, Algorithm 1 differs in that it updates both $\mu_k$ and $Q_k$ at each iteration, rather than just $\mu_k$.

In general, any (regular) exponential family can be associated with a Bregman divergence and vice versa, which gives rise to a correspondence between EM-based clustering and Bregman soft clustering algorithms [3]. Standard Bregman soft clustering corresponds to EM in which the likelihood is a Gaussian with unknown mean and a known covariance $Q$ that is shared across clusters. The case where the covariance is unknown corresponds to Gaussian mixture models (GMMs), but the associated Bregman divergence is not simply the Mahalanobis distance in this case.

The updates for $\mu_k$ and $Q_k$ in Algorithm 1 are equivalent to those in a GMM that incorporates regularization for the covariances. However, GMM clustering differs in the calculation of the assignment probabilities,

$$p(y_i = k \mid x_i) \propto \pi_k\, |Q_k|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i - \mu_k)^\top Q_k^{-1}(x_i - \mu_k)\right).$$

These probabilities incorporate a term $\pi_k$, which defines a prior probability of assignments to a cluster, and a term $|Q_k|^{-1/2}$ (a log-determinant $\log|Q_k|$ in the logits), which reflects the fact that GMMs employ a likelihood with unknown covariance.

In short, our clustering procedure employs an update to soft assignments that is similar to that of Bregman soft clustering, but updates to $\mu_k$ and $Q_k$ that are similar to those in a (regularized) Gaussian Mixture Model (GMM). In Section VI we demonstrate through ablations that this combination of updates improves empirical performance relative to baselines that perform GMM-based clustering.
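To make the contrast concrete, here is a small numerical sketch of the two assignment rules (illustrative, not the authors' code; our classifier's softmax also omits the 1/2 factor in the exponent, which does not affect which class is preferred):

```python
import numpy as np

def mahalanobis_assign(x, mus, covs):
    """Soft assignments from negative squared Mahalanobis distances only,
    as in our clustering updates: no log-determinant, no mixing prior."""
    logits = np.array([
        -(x - m) @ np.linalg.inv(c) @ (x - m) for m, c in zip(mus, covs)])
    z = np.exp(logits - logits.max())
    return z / z.sum()

def gmm_assign(x, mus, covs, pis):
    """Gaussian-mixture responsibilities: the same distances, but with
    the log|Q_k| normalizer and the mixing prior pi_k included."""
    d = len(x)
    logp = np.array([
        np.log(pi) - 0.5 * (np.log(np.linalg.det(c))
                            + (x - m) @ np.linalg.inv(c) @ (x - m)
                            + d * np.log(2 * np.pi))
        for m, c, pi in zip(mus, covs, pis)])
    z = np.exp(logp - logp.max())
    return z / z.sum()
```

For a point equidistant (in Euclidean terms) from a tight cluster and a wide cluster, the distance-only rule favours the wide cluster (smaller Mahalanobis distance), while the GMM rule can favour the tight cluster because the log-determinant penalizes large covariances.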

5.2 Connections with Riemannian Metric Learning

By dropping the log-determinant of the class covariances $\log|Q_k|$, we lose the ability to interpret the model as a Gaussian mixture model, since each mixture component is no longer a normalized conditional density $p(x \mid y = k)$. We show that by postulating that the feature space is best described as a Riemannian manifold, our relative class scores (1) approximate the squared geodesic distance between a test point $x$ and the centroid $\mu_k$.

(a) Transductive
(b) Non-Transductive
Fig. 9: Class recall averaged between classes across Meta-Dataset.

The geometry of a Riemannian manifold is defined by a local metric tensor $M(z)$ (a positive-definite matrix defined for each point $z$) in a data or feature space $\mathbb{R}^d$ (more general topologies are not considered in this discussion). The metric tensor defines the geometry of the underlying space; in particular, we can employ it to define a notion of length. The distance along a path $\gamma : [0, 1] \to \mathbb{R}^d$ is computed in terms of this metric tensor via the arclength functional:

$$L(\gamma) = \int_0^1 \sqrt{\dot\gamma(t)^\top M(\gamma(t))\, \dot\gamma(t)}\; dt.$$
From this, we can derive a global distance (the geodesic distance) between points (at least in the $\mathbb{R}^d$ case) as the length of the shortest path between $x$ and $y$:

$$d(x, y) = \inf_{\gamma:\, \gamma(0) = x,\ \gamma(1) = y} L(\gamma).$$
The arclength functional is difficult to analyze, but we can instead analyze the related energy functional [12],

$$E(\gamma) = \int_0^1 \dot\gamma(t)^\top M(\gamma(t))\, \dot\gamma(t)\; dt.$$
Both $L$ and $E$ yield the same local minimizers; these are called geodesics, and are the equivalent of straight lines within the geometry defined by $M$.
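As a quick sanity check on the energy functional, when the metric tensor is constant along a straight-line path, the (discretized) energy reduces exactly to a squared Mahalanobis distance $(x - \mu)^\top M (x - \mu)$; a short numerical sketch:

```python
import numpy as np

def straight_line_energy(x, mu, metric_fn, n_steps=1000):
    """Midpoint-rule discretization of the energy functional E(gamma)
    along the straight-line path gamma(t) = (1 - t) * mu + t * x.
    metric_fn maps a point z to the metric tensor M(z)."""
    dt = 1.0 / n_steps
    v = x - mu                         # path velocity (constant for a line)
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * dt             # midpoint of each sub-interval
        gamma_t = (1 - t) * mu + t * x
        total += v @ metric_fn(gamma_t) @ v * dt
    return total
```

With a position-dependent `metric_fn` (e.g. the partition-of-unity interpolation discussed below), the same routine gives a numerical upper bound on the squared geodesic distance.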

In metric learning, our goal is to estimate the metric tensor from data. This is an underdetermined task, since its only constraints are smoothness and positive definiteness. To reduce the space of metric tensors under consideration, we treat the class centroids $\mu_k$ as local inducing points for a metric that is locally (near $\mu_k$) the Mahalanobis metric defined by $Q_k^{-1}$, and model the global metric tensor as a smooth interpolation of these local Mahalanobis metrics:

$$M(z) = \sum_k \eta_k(z)\, Q_k^{-1}.$$
Here, $\{\eta_k\}$ is a smooth partition of unity, which satisfies

$$\eta_k(z) \ge 0, \qquad \sum_k \eta_k(z) = 1, \qquad \eta_k(\mu_j) = \delta_{jk}.$$
The existence of such functions is guaranteed [47].

Even with these simplifying assumptions, the global geodesic distance is extremely challenging to compute. Since the geodesic distance is the minimum path length over all paths, we can upper bound it by the length of a specific path [65]:

$$d(x, \mu_k) \le L(\gamma_k), \qquad \gamma_k(t) = (1 - t)\, \mu_k + t\, x,$$

where $\gamma_k$ is a straight-line (in the coordinate space) interpolation. The time derivative of $\gamma_k$ is easily computed to be $\dot\gamma_k(t) = x - \mu_k$. The corresponding energy functional upper bounds the squared distance,

$$d(x, \mu_k)^2 \le E(\gamma_k) = \int_0^1 (x - \mu_k)^\top M(\gamma_k(t))\, (x - \mu_k)\; dt.$$
For purposes of classification, computing the exact distance to each centroid is not necessary; we only need to reason about the difference in squared distance between a test point $x$ and two class centroids $\mu_k$ and $\mu_{k'}$, which corresponds to their relative (log) probabilities. Substituting the energy along the straight-line paths yields

$$E(\gamma_k) - E(\gamma_{k'}) = \int_0^1 v_k^\top M(\gamma_k(t))\, v_k \; dt - \int_0^1 v_{k'}^\top M(\gamma_{k'}(t))\, v_{k'} \; dt,$$

where $v_k = x - \mu_k$ and $v_{k'} = x - \mu_{k'}$ denote the straight-line path velocities. We can write each integrand in terms of the path parameter $t$; as $t$ increases, we simultaneously grow the path inward from each class centroid towards $x$ and outward from $x$ to the centroids.

A first-order Taylor expansion of this around $t = 0$ causes the first-order correction to drop out, and the remaining terms, evaluated at $t = 0$ where $M(\mu_k) = Q_k^{-1}$, yield the difference of squared Mahalanobis distances $v_k^\top Q_k^{-1} v_k - v_{k'}^\top Q_{k'}^{-1} v_{k'}$. The higher-order terms can be controlled to some extent by forcing the partition functions to be flat near the class centroids.

This allows us to think of the Simple CNAPS classifier logit function (the squared Mahalanobis distance to the class centroids with per-class metrics) as a coarse approximation of the (squared) geodesic distance between a query test point and a class centroid; this coarse approximation is improved slightly by noting that other low-order terms that could be considered drop out when examining the difference in squared geodesic distance between a test point and two class centroids.

There are a few implications of this viewpoint, which point to directions for future work. First, the performance gains that we see from using per-class Mahalanobis metrics suggest that modeling the local geometry of the adapted features is important. By taking the manifold viewpoint more seriously, we could consider more principled geometric algorithms such as manifold-regularized SVMs [7] to further exploit the local geometry of the adapted feature space. Second, while the inverse weighted sample covariance works in practice as a local metric, and makes intuitive sense, we could consider techniques specifically designed for estimating local Riemannian metrics [33] to provide better (or at least more principled) estimates of our local metrics $Q_k^{-1}$.

6 Experiments

6.1 Few-Shot Learning Benchmarks

Meta-Dataset [90] is a few-shot visual classification benchmark consisting of 10 widely used datasets: ILSVRC-2012 (ImageNet) [74], Omniglot [45], FGVC-Aircraft (Aircraft), CUB-200-2011 (Birds) [93], Describable Textures (DTD) [16], QuickDraw [38], FGVCx Fungi (Fungi) [78], VGG Flower (Flower) [60], Traffic Signs (Signs) [36] and MSCOCO [50]. Consistent with past work [72, 5], we train our model on the official training splits of the first 8 datasets and use the test splits to evaluate in-domain performance. We use the remaining two datasets, as well as three external benchmarks, namely MNIST [46], CIFAR10 [44] and CIFAR100 [44], for out-of-domain evaluation.

Task generation in Meta-Dataset follows a complex procedure. A task is an instance of a classification problem that can be of different ways (number of classes), and individual classes can be of varying shots (number of samples per class) even within the same task. For example, an instance of an MNIST classification task may require classifying the digits ‘1’, ‘2’ and ‘4’ with 7, 2 and 9 training support instances respectively; this constitutes a 3-way task with 7-, 2- and 9-shots. In task formation, the task way is first sampled uniformly between 5 and 50, and that many classes are selected at random from the corresponding class/dataset split. Then, for each class, 10 instances are sampled at random and used as query examples for the class, while of the remaining images for the class, a shot is sampled uniformly from [1, 100] and that many images are selected at random as support examples, with the total support set size capped at 500.
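The sampling procedure above can be sketched as follows (simplified and illustrative: the dataset-specific constraints of [90] are omitted, and the 500-example support budget is enforced here by naive truncation rather than constrained sampling):

```python
import random

def sample_task(class_images, max_way=50, max_shot=100,
                n_query=10, max_support=500):
    """Sketch of Meta-Dataset-style task sampling.

    class_images maps class id -> list of images. Returns a support
    set and a query set as lists of (image, class) pairs.
    """
    # A class needs more than n_query images so at least 1 support
    # example remains after the queries are set aside.
    eligible = [c for c, imgs in class_images.items()
                if len(imgs) > n_query]
    way = random.randint(5, min(max_way, len(eligible)))
    classes = random.sample(eligible, way)
    support, query = [], []
    for c in classes:
        imgs = random.sample(class_images[c], len(class_images[c]))  # shuffle
        query += [(img, c) for img in imgs[:n_query]]    # 10 queries per class
        remaining = imgs[n_query:]
        shot = random.randint(1, min(max_shot, len(remaining)))
        support += [(img, c) for img in remaining[:shot]]
    return support[:max_support], query
```

Because the shot is sampled per class, the same task can mix 1-shot and high-shot classes, exactly the imbalance the text describes.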

Additional dataset-specific constraints are enforced, as discussed in Section 3.2 of [90], and since some datasets have fewer than 50 classes and fewer than 100 images per class, the overall way and shot distributions resemble Poisson distributions, where most tasks have fewer than 10 classes and most classes have fewer than 10 support examples (see Appendix-A.1). Following [5] and [72], we first train our ResNet18 feature extractor on the Meta-Dataset-defined training split of ImageNet following the procedure in Appendix-A.3. The ResNet18 parameters are then kept fixed while we train the adaptation network on a total of 110K sampled tasks using Episodic Training [83, 22] (see Appendix-A.3).

Mini/tiered-ImageNet [92, 69] are two benchmarks for few-shot learning. Both datasets employ subsets of ImageNet [74] with a total of 100 classes and 60K images in mini-ImageNet and 608 classes and 779K images in tiered-ImageNet. Unlike Meta-Dataset, tasks across these datasets have pre-defined shots and ways that are uniform across every task in the specified setting.

Following [59, 52, 83], we report performance on the 1/5-shot 5/10-way settings across both datasets with 10 query examples per class. We first train the ResNet18 on the training set of the corresponding benchmark at hand following the procedure noted in Appendix-A.4. We also consider a more feature-rich ResNet18 trained on the larger ImageNet dataset. However, we exclude classes and examples from test sets of mini/tiered-ImageNet to address potential class/example overlap issues, resulting in 825 classes and 1,055,494 images remaining. Then, with the ResNet18 parameters fixed, we train episodically for 20K tasks (see Appendix-A.2 for details).

6.2 Results

In-Domain Accuracy (%) Out-of-Domain Accuracy (%) Avg Rank
Model ImageNet Omniglot Aircraft Birds DTD QuickDraw Fungi Flower Signs MSCOCO MNIST CIFAR10 CIFAR100 In Out All
RelationNet [86] 30.9±0.9 86.6±0.8 69.7±0.8 54.1±1.0 56.6±0.7 61.8±1.0 32.6±1.1 76.1±0.8 37.5±0.9 27.4±0.9 - - - 10.5 11.0 10.6
MatchingNet [91] 36.1±1.0 78.3±1.0 69.2±1.0 56.4±1.0 61.8±0.7 60.8±1.0 33.7±1.0 81.9±0.7 55.6±1.1 28.8±1.0 - - - 10.1 8.5 9.8
MAML [21] 37.8±1.0 83.9±1.0 76.4±0.7 62.4±1.1 64.1±0.8 59.7±1.1 33.5±1.1 79.9±0.8 42.9±1.3 29.4±1.1 - - - 9.2 10.5 9.5
ProtoNet [83] 44.5±1.1 79.6±1.1 71.1±0.9 67.0±1.0 65.2±0.8 64.9±0.9 40.3±1.1 86.9±0.7 46.5±1.0 39.9±1.1 - - - 8.2 9.5 8.5
ProtoMAML [90] 46.5±1.1 82.7±1.0 75.2±0.8 69.9±1.0 68.3±0.8 66.8±0.9 42.0±1.2 88.7±0.7 52.4±1.1 41.7±1.1 - - - 7.1 8.0 7.3
CNAPS [71] 52.3±1.0 88.4±0.7 80.5±0.6 72.2±0.9 58.3±0.7 72.5±0.8 47.4±1.0 86.0±0.5 60.2±0.9 42.6±1.1 92.7±0.4 61.5±0.7 50.1±1.0 6.6 6.0 6.4
BOHB-E [76] 55.4±1.1 77.5±1.1 60.9±0.9 73.6±0.8 72.8±0.7 61.2±0.9 44.5±1.1 90.6±0.6 57.5±1.0 51.9±1.0 - - - 6.4 4.0 5.9
TaskNorm [11] 50.6±1.1 90.7±0.6 83.8±0.6 74.6±0.8 62.1±0.7 74.8±0.7 48.7±1.0 89.6±0.5 67.0±0.7 43.4±1.0 92.3±0.4 69.3±0.8 54.6±1.1 4.7 4.8 4.8
SUR [19] 56.3±1.1 93.1±0.5 85.4±0.7 71.4±1.0 71.5±0.8 81.3±0.6 63.1±1.0 82.8±0.7 70.4±0.8 52.4±1.1 94.3±0.4 66.8±0.9 56.6±1.0 3.1 2.6 2.9
URT [51] 55.7±1.0 94.4±0.4 85.8±0.6 76.3±0.8 71.8±0.7 82.5±0.6 63.5±1.0 88.2±0.6 69.4±0.8 52.2±1.1 94.8±0.4 67.3±0.8 56.9±1.0 1.7 2.8 2.2
Simple CNAPS 58.6±1.1 91.7±0.6 82.4±0.7 74.9±0.8 67.8±0.8 77.7±0.7 46.9±1.0 90.7±0.5 73.5±0.7 46.2±1.1 93.9±0.4 74.3±0.7 60.5±1.0 3.4 3.0 3.2
Transductive CNAPS 58.8±1.1 93.9±0.4 84.1±0.6 76.8±0.8 69.0±0.8 78.6±0.7 48.8±1.1 91.6±0.4 76.1±0.7 48.7±1.0 95.7±0.3 75.7±0.7 62.9±1.0 2.1 1.6 1.9

TABLE I: Few-shot classification on Meta-Dataset, MNIST, and CIFAR10/100. Error intervals correspond to 95% confidence intervals, and bold values indicate statistically significant SoTA performance. Average rank is obtained by ranking methods on each dataset and averaging the ranks.

mini-ImageNet Acc (%) tiered-ImageNet Acc (%)
5-way 10-way 5-way 10-way
Model T? 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
MAML [22] BN 48.7 63.1 31.3 46.9 51.7 70.3 34.4 53.3
MAML+ [52] Yes 50.8 66.2 31.8 48.2 53.2 70.8 34.8 54.7
Reptile [59] No 47.1 62.7 31.1 44.7 49.0 66.5 33.7 48.0
Reptile+BN [59] BN 49.9 66.0 32.0 47.6 52.4 71.0 35.3 52.0
ProtoNet [83] No 46.1 65.8 32.9 49.3 48.6 69.6 37.3 57.8
RelationNet [87] BN 51.4 67.0 34.9 47.9 54.5 71.3 36.3 58.0
TPN [52] Yes 55.5 69.8 38.4 52.8 59.9 73.3 44.8 59.4
AttWeightGen [28] No 56.2 73.0 - - - - - -
TADAM [61] No 58.5 76.7 - - - - - -
Simple CNAPS No 53.2 70.8 37.1 56.7 63.0 80.0 48.1 70.2
Transductive CNAPS Yes 55.6 73.1 42.8 59.6 65.9 81.8 54.6 72.5
LEO [75] No 61.8 77.6 - - 66.3 81.4 - -
MetaOptNet [48] No 62.6 78.6 - - 66.0 81.6 - -
MetaBaseline [15] No 63.2 79.3 - - 68.6 83.7 - -
FEAT [101] No 66.8 82.0 - - 70.8 84.8 - -
SimpleShot [95] No 62.8 80.0 - - 71.3 86.6 - -
RFS [88] No 64.8 82.1 - - 71.5 86.0 - -
FRN [97] No 66.5 82.8 - - 71.2 86.0 - -
DeepEMD [105] No 68.8 84.1 - - 74.3 87.0 - -
S2M2 [54] No 64.9 83.2 - - 73.7 88.6 - -
LR+DC [99] No 68.6 82.9 - - 78.2 89.9 - -
TABLE II: Few-shot visual classification results on 1/5-shot 5/10-way tasks on mini/tiered-ImageNet. See Table VII for error intervals.
mini-ImageNet Acc (%) tiered-ImageNet Acc (%)
5-way 10-way 5-way 10-way
Model T? 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
RS-FSL [1] No 65.3 - - - - - - -
AmdimNet [14] No 76.8 91.0 - - - - - -
Simple CNAPS + FETI No 77.4 90.3 63.5 83.1 71.4 86.0 57.1 78.5
Transductive CNAPS + FETI Yes 79.9 91.5 68.5 85.9 73.8 87.7 65.1 80.6
TABLE III: Few-shot visual classification results on 1/5-shot 5/10-way tasks on mini/tiered-ImageNet with additional training data. For our models, “FETI” indicates that the feature extractor used was trained on ImageNet [74] excluding classes in the test splits of mini/tiered-ImageNet. See Table VIII for error intervals.

Evaluation on Meta-Dataset: In-domain, out-of-domain and overall rankings on Meta-Dataset are shown in Table I. Following [72], we pretrain the feature extractor on the Meta-Dataset specified training split of the ImageNet subset of Meta-Dataset (See Requeima et al.[72]-C.1.1 for ResNet18 training details). Note that this excludes any examples or classes present in the test/validation sets of the Meta-Dataset’s ImageNet split. As shown in Table I, Transductive CNAPS is able to establish state-of-the-art performance with an overall rank of 1.9 while Simple CNAPS ranks fourth with an average rank of 3.2.

Evaluation on mini/tiered-ImageNet: We consider two feature extractor training settings on these benchmarks. First, we use the feature extractor trained on the corresponding training split of mini/tiered-ImageNet. As shown in Table II, on tiered-ImageNet, Transductive CNAPS achieves the best accuracy on both 10-way settings, with Simple CNAPS ranking second-best, as compared to other previous work that reports results on these settings. On the 5-way tiered-ImageNet settings, however, despite outperforming a number of major baselines, both methods trail a number of more recent works [48, 15, 101, 95, 88, 97, 105, 54, 99] in classification accuracy. It is also interesting to note that the performance of Simple and Transductive CNAPS generally ranks higher on the tiered-ImageNet benchmark than on mini-ImageNet. We attribute this difference to the fact that mini-ImageNet only provides 38,400 training examples, compared to the 448,695 examples provided by tiered-ImageNet. This results in a lower performing ResNet-18 feature extractor (which is trained in a traditional supervised manner).

This hypothesis leads us to consider a second evaluation (denoted by “FETI”, for “Feature Extractor Trained with ImageNet”, in Table III). In this model, we train the feature extractor with a much larger subset of ImageNet, which has been carefully selected to prevent any possible overlap (in examples or classes) with the test sets of mini/tiered-ImageNet. Both Simple and Transductive CNAPS are able to take advantage of the more example-rich feature extractor, resulting in substantially better performance across the board, even as compared to other baselines that employ additional data (such as RS-FSL which pre-trains on the full ImageNet). Furthermore, Transductive CNAPS outperforms Simple CNAPS by a large margin, even when using the same example-rich feature extractor; this demonstrates that leveraging additional query set information yields gains.

Performance vs. Class Shot: In Figure 9, we examine the relationship between class recall (i.e. accuracy among query examples belonging to the class itself) and the number of support examples in the class (shot). As shown, Simple CNAPS outperforms CNAPS with as few as four support examples, indicating its ability to produce useful estimates of covariance from few examples. In addition, Transductive CNAPS is very effective when the class shot is below 10, showing large average recall improvements, especially at the 1-shot level. However, as the class shot increases beyond 10, performance drops compared to Simple CNAPS. This suggests that soft k-means learning of cluster parameters can be effective when very few support examples are available; conversely, in high-shot classes, transductive updates can act as distractors.

In-Domain Accuracy (%) Out-of-Domain Accuracy (%) Avg Acc.
Simple CNAPS + Metric ImageNet Omniglot Aircraft Birds DTD QuickDraw Fungi Flower Signs MSCOCO MNIST CIFAR10 CIFAR100 In Out All
Negative Dot Product 48.0±1.1 83.5±0.9 73.7±0.8 69.0±1.0 66.3±0.6 66.5±0.9 39.7±1.1 88.6±0.5 53.9±0.9 32.5±1.0 86.4±0.6 57.9±0.8 38.8±0.9 66.9 53.9 61.9
Cosine Similarity 51.3±1.1 89.4±0.7 80.5±0.8 70.9±1.0 69.7±0.7 72.6±0.9 41.9±1.0 89.3±0.6 65.4±0.8 41.0±1.0 92.8±0.4 69.5±0.8 53.6±1.0 70.7 64.5 68.3
Absolute Distance (L1) 53.6±1.1 90.6±0.6 81.0±0.7 73.2±0.9 61.1±0.7 74.1±0.8 47.0±1.0 87.3±0.6 66.4±0.8 44.7±1.0 88.0±0.5 70.0±0.8 57.9±1.0 71.0 65.4 68.8
Squared Euclidean (L2) 53.9±1.1 90.9±0.6 81.8±0.7 73.1±0.9 64.4±0.7 74.9±0.8 45.8±1.0 88.8±0.5 68.5±0.7 43.4±1.0 91.6±0.5 70.5±0.7 57.3±1.0 71.7 66.3 69.6
Squared Mahalanobis 58.6±1.1 91.7±0.6 82.4±0.7 74.9±0.8 67.8±0.8 77.7±0.7 46.9±1.0 90.7±0.5 73.5±0.7 46.2±1.1 93.9±0.4 74.3±0.7 60.5±1.0 73.8 69.7 72.2
TABLE IV: Performance of various metric ablations on Meta-Dataset. Error intervals indicate 95% confidence intervals, and bold values indicate statistically significant state of the art performance.
In-Domain Accuracy (%) Out-of-Domain Accuracy (%) Avg Acc.
Method ImageNet Omniglot Aircraft Birds DTD QuickDraw Fungi Flower Signs MSCOCO MNIST CIFAR10 CIFAR100 In Out All
GMM 45.3±1.0 88.0±0.9 80.8±0.8 71.4±0.8 61.1±0.7 70.7±0.8 42.9±1.0 88.1±0.6 68.9±0.7 37.2±0.9 91.4±0.5 64.5±0.7 46.6±0.9 68.5 61.7 65.9
GMM-EM 52.3±1.0 92.0±0.5 84.3±0.6 75.2±0.8 64.3±0.7 72.6±0.8 44.6±1.0 90.8±0.5 71.4±0.7 44.7±0.9 93.0±0.4 71.1±0.7 56.4±0.9 72.0 67.3 70.2
Transductive+ 53.3±1.1 92.3±0.5 81.2±0.7 75.0±0.8 72.0±0.7 74.8±0.8 45.1±1.0 92.1±0.4 71.0±0.8 44.0±1.1 95.9±0.3 71.1±0.7 57.3±1.1 73.2 67.9 71.2
FEOT 57.3±1.1 90.5±0.7 82.9±0.7 74.8±0.8 67.3±0.8 76.3±0.8 47.7±1.0 90.5±0.5 75.8±0.7 47.1±1.1 94.9±0.4 74.3±0.8 61.2±1.0 73.4 70.7 72.4
COT 58.8±1.1 95.2±0.3 84.0±0.6 76.4±0.7 68.5±0.8 77.8±0.7 49.7±1.0 92.7±0.4 70.8±0.7 47.3±1.0 94.2±0.4 75.2±0.7 61.2±1.0 75.4 69.7 73.2
Simple 58.6±1.1 91.7±0.6 82.4±0.7 74.9±0.8 67.8±0.8 77.7±0.7 46.9±1.0 90.7±0.5 73.5±0.7 46.2±1.1 93.9±0.4 74.3±0.7 60.5±1.0 73.8 69.7 72.2
Transductive 58.8±1.1 93.9±0.4 84.1±0.6 76.8±0.8 69.0±0.8 78.6±0.7 48.8±1.1 91.6±0.4 76.1±0.7 48.7±1.0 95.7±0.3 75.7±0.7 62.9±1.0 75.2 71.8 73.9
TABLE V: Performance of various ablations of Transductive and Simple CNAPS on Meta-Dataset. Error intervals indicate 95% confidence intervals, and bold values indicate statistically significant state of the art performance.
MNIST (%) CIFAR10 (%) CIFAR100 (%)
Method Multi Single Multi Single Multi Single
SI [104] 99.3 57.6 - - 73.2 22.8
EWC [41] 99.3 55.8 - - 72.8 23.1
VCL [57] 98.5±0.4 - - - - -
RWalk [13] 99.3 82.5 - - 74.2 34.0
CNAPS [71] + Moving Encoding 98.9±0.2 80.9±0.9 - - 76.0±0.5 37.2±0.6
Simple CNAPS + Moving Encoding 83.8±3.2 19.2±0.2 83.0±3.0 18.1±1.2 67.7±0.8 12.1±2.0
Simple CNAPS + First Encoding 96.7±0.3 69.5±1.0 87.4±0.8 41.9±1.0 69.0±0.7 34.2±0.8
Simple CNAPS + Averaging Encoding 95.2±0.9 22.1±2.2 86.0±1.8 27.7±4.3 68.6±0.7 25.9±2.8
Transductive CNAPS + First Encoding 90.7±1.4 9.3±0.3 86.0±0.5 10.0±0.3 66.7±0.6 1.0±0.1
Transductive CNAPS + Averaging Encoding 88.7±0.5 9.2±0.3 85.7±0.7 9.9±0.3 66.8±0.9 1.0±0.1
TABLE VI: “Out of the box” continual learning performance on MNIST, CIFAR10 and CIFAR100. Tasks here are generated with 100 shots per category.

Classification-Time Soft K-means Clustering: We use soft k-means iterative updates of means and covariance at test-time only. It is natural to consider training the feature adaptation network end-to-end through the soft k-means transduction procedure. We provide this comparison in the bottom-half of Table V, with “Transductive+” denoting this variation. Iterative updates during training result in an average accuracy decrease of 2.5%, which we conjecture to be due to training instabilities caused by applying this iterative algorithm early in training on noisy features.

Transductive Feature Extraction vs. Classification: Our approach extends Simple CNAPS in two ways: improved adaptation of the feature extractor using a transductive task-encoding, and the soft k-means iterative estimation of class means and covariances. We perform two ablations, “Feature Extraction Only Transductive” (FEOT) and “Classification Only Transductive” (COT), to independently assess the impact of these extensions. The results are presented in Table V. As shown, both extensions outperform Simple CNAPS. The transductive task-encoding is especially effective on out-of-domain tasks, whereas the soft k-means learning of class parameters boosts accuracy on in-domain tasks. Transductive CNAPS is able to leverage the best of both worlds, allowing it to achieve statistically significant gains over Simple CNAPS overall.

Comparison to Gaussian Mixture Models: We consider two GMM-based ablations of our method where the log-determinant is introduced into the weight updates (using a uniform class prior). Note that with the exception of how these ablations produce class probabilities (Equation 1 vs. Equations 18/19), all other aspects of the GMM-based ablations remain identical to Simple and Transductive CNAPS. Results are shown in Table V, where GMM and GMM-EM correspond to the GMM-based ablations of Simple and Transductive CNAPS respectively. As demonstrated, the GMM-based variations result in a notable 4-8% loss in overall accuracy.

Metric Ablation: To test the significance of our choice of Mahalanobis distance, we substitute it within our architecture with other distance metrics: absolute difference (L1), squared Euclidean (squared L2), cosine similarity and negative dot-product. Performance comparisons are shown in Table IV. We observe that using the Mahalanobis distance results in the best in-domain, out-of-domain, and overall average performance.
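The metric variants compared in Table IV can be sketched as alternative logit functions against class centroids (illustrative helper, not the ablation code itself; distances are negated so that a larger logit always means a closer, more likely class):

```python
import numpy as np

def class_logits(x, mus, covs=None, metric="sq_mahalanobis"):
    """Per-class logits for one query embedding x under the metric
    variants of the ablation. mus: (k, d) centroids; covs: per-class
    covariance matrices (used only for the Mahalanobis variant)."""
    mus = np.asarray(mus)
    diffs = x - mus
    if metric == "neg_dot":
        return mus @ x                                   # plain dot product
    if metric == "cosine":
        return (mus @ x) / (np.linalg.norm(mus, axis=1)
                            * np.linalg.norm(x) + 1e-12)
    if metric == "l1":
        return -np.abs(diffs).sum(axis=1)                # absolute difference
    if metric == "sq_l2":
        return -(diffs ** 2).sum(axis=1)                 # squared Euclidean
    if metric == "sq_mahalanobis":
        return np.array([-(dv @ np.linalg.inv(c) @ dv)
                         for dv, c in zip(diffs, covs)])
    raise ValueError(metric)
```

Note that squared Euclidean is the special case of squared Mahalanobis with identity covariances, which is exactly the uncorrelated, unit-variance assumption the paper relaxes.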

Active Learning: We present active learning results on CIFAR10 and OMNIGLOT in Figure 7. As shown, Transductive CNAPS and Simple CNAPS both benefit from uncertainty-based label acquisition thanks to having well-calibrated measures of uncertainty. It is interesting to note that the margin of performance gain over random selection is larger for Simple CNAPS. This may suggest that Transductive CNAPS already exploits much of the unlabelled information through transductive learning. It may also imply that its measure of uncertainty is less calibrated, as soft-label class parameter estimates can be noisier. We also compare both methods against fixed feature extractor (“FixedFE”) baselines. In these baselines, the adaptation networks are turned off and the pre-trained ResNet18 is used to produce feature vectors without any task-specific adaptation. As demonstrated, this results in lower performance across the board, signifying the role of the conditional adaptation networks. Furthermore, the gap between Transductive and Simple FixedFE is considerably smaller, indicating that the transductive feature extractor adaptation module in Transductive CNAPS is responsible for a considerable proportion of its performance gain over Simple CNAPS.

Continual Learning: We evaluate continual learning on MNIST, CIFAR10, and CIFAR100 in both multi-head and single-head settings. In the multi-head setting, each task is assigned a separate classification head that focuses on that particular task’s categories; in our work, this corresponds to computing distances only to the classes present in the task. In the single-head case, by contrast, we evaluate with respect to all current and previously seen classes.
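The distinction can be sketched as a simple masking of logits (hypothetical helper; the logits here would be, e.g., negative squared Mahalanobis distances to all saved class means):

```python
import numpy as np

def predict(logits_all, task_classes=None):
    """Single-head vs multi-head evaluation.

    logits_all: scores over every class seen so far.
    task_classes: if given (multi-head), only these class indices
    compete; otherwise (single-head) all seen classes compete.
    """
    logits = np.asarray(logits_all, dtype=float)
    if task_classes is not None:                  # multi-head: mask others
        mask = np.full(logits.shape, -np.inf)
        mask[..., list(task_classes)] = 0.0
        logits = logits + mask
    return logits.argmax(axis=-1)
```

This makes explicit why single-head evaluation is strictly harder: a class from an old task can be out-scored by a class from a newer task, which is impossible under the multi-head mask.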

As noted in Table VI, while neither variation achieves state of the art results on MNIST and CIFAR100, we observe that the task-encoding is extremely important in achieving competitive performance. We explore this further in Figure 8. As shown, when using “Moving Encoding”, the performance on previous tasks is completely lost in the single-head setting and significantly reduced in the multi-head case. This suggests that Simple CNAPS and Transductive CNAPS produce vastly different manifolds depending on the task at hand, and previously estimated class parameters may no longer be valid or useful in these new feature spaces. This is further supported by the evidence we see when using “First Encoding”: not only is the best performance achieved here, but accuracy on previously learned classes is substantially greater even in the single-head setting. Unsurprisingly, the best performance is always seen on the query set of the first task, as the feature space has been adapted to maximize accuracy on that task. This shows the importance of adapting the feature space to new tasks, which in part motivates the use of the “Averaging Encoding” variation as a balance between the two. However, as we see empirically, the differences in the feature space manifolds generated with the averaged encoding still result in a major loss of performance in the single-head setting, although in the multi-head setting, performance matching that of “First Encoding” is achieved. It is interesting to note that in the single-head “Averaging Encoding” variation, previous-task performance tends to improve as more tasks are seen. This is in part because, with more tasks, the average encoding becomes more stable and less likely to change substantially with a new task, thus providing a more stable feature space across subsequent tasks. Overall, “out of the box” continual learning remains an open question, even within the specific context of Simple and Transductive CNAPS models.

7 Discussion

We propose two meta-learned few-shot image classification approaches, Simple and Transductive CNAPS, that focus on effective estimation of class-wise cluster parameters, including covariance. These models achieve high accuracy, with the latter establishing state of the art performance on the Meta-Dataset, mini-ImageNet and tiered-ImageNet benchmarks. We further study our methods in the active and continual learning paradigms, proposing extensions for undertaking such tasks.

Future research on better continual learning of the task-wise feature manifolds in both models could be beneficial in boosting performance. Additional research on generating support examples through learnable augmentations, better cluster refinement algorithms, and further theoretical understanding of the Mahalanobis-based classifiers in our methods are among the many interesting avenues for potentially fruitful future work.

8 Acknowledgments

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs (CRC) Program, the Canada CIFAR AI Chairs Program, Compute Canada, Intel, and DARPA under its D3M and LWLL programs. Additionally, this material is based upon work supported by the United States Air Force under Contract No. FA8750-19-C-0515.


  • [1] M. Afham, S. Khan, M. H. Khan, M. Naseer, and F. S. Khan (2021) Rich semantics improve few-shot learning. External Links: 2104.12709 Cited by: TABLE VIII, TABLE III.
  • [2] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal (2019) Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv: Learning. Cited by: §2.3.
  • [3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh (2005) Clustering with bregman divergences. Journal of Machine Learning Research 6 (Oct), pp. 1705–1749. Cited by: §1, §2.1, §5.1, §5.1.
  • [4] P. Bateni, J. Barber, J. van de Meent, and F. Wood (2022-01) Enhancing few-shot image classification with unlabelled examples. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2796–2805. Cited by: Fig. 3, §1.
  • [5] P. Bateni, R. Goyal, V. Masrani, F. Wood, and L. Sigal (2020) Improved few-shot visual classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §A.3, TABLE VIII, §1, §3, §6.1, §6.1.
  • [6] P. Bateni (2021) On label-efficient computer vision: building fast and effective few-shot image classifiers. Ph.D. Thesis, University of British Columbia. External Links: Document Cited by: §1.
  • [7] M. Belkin, P. Niyogi, and V. Sindhwani (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples.. Journal of Machine Learning Research 7 (11). Cited by: §5.2.
  • [8] A. Bellet, A. Habrard, and M. Sebban (2013) A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709. Cited by: §1.
  • [9] W. H. Beluch, T. Genewein, A. Nurnberger, and J. M. Kohler (2018) The power of ensembles for active learning in image classification. pp. 9368–9377. Cited by: §2.3.
  • [10] M. Bilgic and L. Getoor (2009) Link-based active learning. Cited by: §2.3.
  • [11] J. Bronskill, J. Gordon, J. Requeima, S. Nowozin, and R. Turner (2020) TaskNorm: rethinking batch normalization for meta-learning. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119. Cited by: TABLE I.
  • [12] M. P. d. Carmo (1992) Riemannian geometry. Birkhäuser. Cited by: §5.2.
  • [13] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. S. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In European Conference on Computer Vision (ECCV), pp. 556–572. External Links: ISBN 978-3-030-01252-6 Cited by: §2.4, TABLE VI.
  • [14] D. Chen, Y. Chen, Y. Li, F. Mao, Y. He, and H. Xue (2019) Self-supervised learning for few-shot image classification. arXiv preprint arXiv:1911.06045. Cited by: TABLE VIII, TABLE III.
  • [15] Y. Chen, Z. Liu, H. Xu, T. Darrell, and X. Wang (2021) Meta-baseline: exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: TABLE VII, §6.2, TABLE II.
  • [16] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3606–3613. Cited by: §6.1.
  • [17] D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1996-03) Active learning with statistical models. J. Artif. Int. Res. 4 (1), pp. 129–145. External Links: ISSN 1076-9757 Cited by: §4.4.
  • [18] C. Doersch, A. Gupta, and A. Zisserman (2020) CrossTransformers: spatially-aware few-shot transfer. In Advances in neural information processing systems, Cited by: TABLE IX, §B.4.
  • [19] N. Dvornik, C. Schmid, and J. Mairal (2020) Selecting relevant features from a multi-domain representation for few-shot classification. External Links: 2003.09338 Cited by: TABLE I.
  • [20] A. R. Feyjie, R. Azad, M. Pedersoli, C. Kauffman, I. B. Ayed, and J. Dolz (2020) Semi-supervised few-shot learning for medical image segmentation. External Links: 2003.08462 Cited by: §1.
  • [21] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of Machine Learning Research (PMLR), Cited by: TABLE IX, TABLE I.
  • [22] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), Cited by: §A.3, TABLE VII, §1, §2.1, §3, §6.1, TABLE II.
  • [23] S. Fort (2017) Gaussian prototypical networks for few-shot learning on omniglot. CoRR abs/1708.02735. Cited by: §2.1.
  • [24] A. Freytag, E. Rodner, and J. Denzler (2014) Selecting influential examples: active learning with expected model output changes. pp. 562–577. Cited by: §2.3.
  • [25] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov (2013) DeViSE: a deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121–2129. Cited by: §1.
  • [26] Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. CoRR. Cited by: §2.3.
  • [27] M. Garnelo, D. Rosenbaum, C. J. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and S. M. A. Eslami (2018) Conditional neural processes. International Conference on Machine Learning (ICML). Cited by: §2.1.
  • [28] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) abs/1804.09458. Cited by: TABLE VII, TABLE II.
  • [29] S. Gidaris and N. Komodakis (2019) Generating classification weights with GNN denoising autoencoders for few-shot learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • [30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Vol. 27, pp. 2672–2680. Cited by: §1.
  • [31] Y. Guo (2010) Active instance sampling via matrix partition. pp. 802–810. Cited by: §2.3.
  • [32] G. Guz, P. Bateni, D. Muglich, and G. Carenini (2020) Neural RST-based evaluation of discourse coherence. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 664–671. Cited by: §1.
  • [33] S. Hauberg, O. Freifeld, and M. J. Black (2012) A geometric take on metric learning. In Advances in Neural Information Processing Systems, pp. 2024–2032. Cited by: §5.2.
  • [34] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. . Cited by: §1, §4.1.
  • [35] MD. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga (2019-02) A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. 51 (6). External Links: Document, ISSN 0360-0300 Cited by: §1.
  • [36] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel (2013) Detection of traffic signs in real-world images: the german traffic sign detection benchmark. In International Joint Conf. on Neural Networks (IJCNN), pp. 1–8. Cited by: §6.1.
  • [37] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, and R. Qu (2019) A survey of deep learning-based object detection. CoRR abs/1907.09408. Cited by: §1.
  • [38] J. Jongejan, H. Rowley, T. Kawashima, J. Kim, and N. Fox-Gieg (2016) The quick, draw!-ai experiment.(2016). Cited by: §6.1.
  • [39] J. A. Joshi, F. Porikli, and N. Papanikolopoulos (2010) Multi-class batch-mode active learning for image classification. Robotics and Automation, pp. 1873–1878. Cited by: §2.3.
  • [40] J. Kim, T. Kim, S. Kim, and C. D. Yoo (2019) Edge-labeling graph neural network for few-shot learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §1, §2.1, §2.2.
  • [41] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2016-12) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, pp. . External Links: Document Cited by: §2.4, TABLE VI.
  • [42] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §1, §2.1.
  • [43] A. Krizhevsky, I. Sutskever, and G. Hinton (2012) ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems 25, pp. . External Links: Document Cited by: §1.
  • [44] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report . Cited by: §6.1.
  • [45] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §6.1.
  • [46] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Cited by: §6.1.
  • [47] J. M. Lee (2013) Smooth manifolds. In Introduction to Smooth Manifolds, pp. 1–31. Cited by: §5.2.
  • [48] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019-06) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE VII, §6.2, TABLE II.
  • [49] D. D. Lewis and W. A. Gale (1994) A sequential algorithm for training text classifiers. pp. 3–12. Cited by: §2.3.
  • [50] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §6.1.
  • [51] L. Liu, W. Hamilton, G. Long, J. Jiang, and H. Larochelle (2020) A universal representation transformer layer for few-shot image classification. External Links: 2006.11702 Cited by: TABLE I.
  • [52] Y. Liu, J. Lee, M. Park, S. Kim, and Y. Yang (2019) Learning to propagate labels: transductive propagation network for few-shot learning. International Conference on Learning Representations (ICLR). Cited by: TABLE VII, §1, §2.2, §6.1, TABLE II.
  • [53] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §6.1.
  • [54] P. Mangla, N. Kumari, A. Sinha, M. Singh, B. Krishnamurthy, and V. N. Balasubramanian (2020) Charting the right manifold: manifold mixup for few-shot learning. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2218–2227. Cited by: TABLE VII, §6.2, TABLE II.
  • [55] I. Melnykov and V. Melnykov (2014) On k-means algorithm with the use of mahalanobis distances. Statistics & Probability Letters 84, pp. 88–95. Cited by: §4.3.
  • [56] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) Meta-learning with temporal convolutions. CoRR abs/1707.03141. Cited by: §2.1.
  • [57] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2018) Variational continual learning. In International Conference on Learning Representations (ICLR), Cited by: §2.4, TABLE VI.
  • [58] T. H. Nguyen and A. Smeulders (2004) Active learning using pre-clustering. ICML, pp. 79–79. Cited by: §2.3.
  • [59] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. CoRR abs/1803.02999. Cited by: TABLE VII, §1, §2.1, §6.1, TABLE II.
  • [60] M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In IEEE Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §6.1.
  • [61] B. Oreshkin, P. Rodríguez López, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731. Cited by: TABLE VII, §1, §2.1, TABLE II.
  • [62] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §2.1.
  • [63] P. Pezeshkpour, Z. Zhao, and S. Singh (2020) On the utility of active instance selection for few-shot learning. In NeurIPS Workshop on Human And Model in the Loop Evaluation and Training Strategies, Cited by: §2.3.
  • [64] L. Qiao, Y. Shi, J. Li, Y. Wang, T. Huang, and Y. Tian (2019) Transductive episodic-wise adaptive metric for few-shot learning. In IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2.2.
  • [65] D. Ramanan and S. Baker (2010) Local distance functions: a taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (4), pp. 794–806. Cited by: §5.2.
  • [66] H. Ranganathan, H. Venkateswara, S. Chakraborty, and S. Panchanathan (2017) Deep active learning for image classification. pp. 3934–3938. Cited by: §2.3.
  • [67] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1.
  • [68] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. . External Links: Document Cited by: §1.
  • [69] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. International Conference on Learning Representations (ICLR). Cited by: §B.6, §1, §1, §2.2, §2.2, §6.1.
  • [70] S. Ren, K. He, R. Girshick, and J. Sun (2015-06) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39, pp. . External Links: Document Cited by: §1.
  • [71] J. Requeima, J. Gordon, J. Bronskill, S. Nowozin, and R. E. Turner (2019) Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems, pp. 7957–7968. Cited by: §4.4, TABLE I, TABLE VI.
  • [72] J. Requeima, J. Gordon, J. Bronskill, S. Nowozin, and R. E. Turner (2019) Fast and flexible multi-task classification using conditional neural adaptive processes. Advances in Neural Information Processing Systems. Cited by: §A.3, §1, §1, §1, §2.1, §2.1, §2.3, §2.4, §3, §4.1, §6.1, §6.1, §6.2.
  • [73] N. Roy and A. McCallum (2001) Toward optimal active learning through monte carlo estimation of error reduction. International Conference on Machine Learning. Cited by: §2.3.
  • [74] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §A.4, TABLE VIII, §6.1, §6.1, TABLE III.
  • [75] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. International Conference on Learning Representations (ICLR). Cited by: TABLE VII, TABLE II.
  • [76] T. Saikia, T. Brox, and C. Schmid (2020) Optimized generic feature learning for few-shot classification across domains. External Links: 2001.07926 Cited by: TABLE IX, §B.4, TABLE I.
  • [77] V. G. Satorras and J. B. Estrach (2018) Few-shot learning with graph neural networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1.
  • [78] B. Schroeder and Y. Cui (2018) FGVCx fungi classification challenge 2018. Cited by: §6.1.
  • [79] A. Ścibior, V. Lioutas, D. Reda, P. Bateni, and F. Wood (2021) Imagining the road ahead: multi-agent trajectory prediction via differentiable simulation. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Vol. , pp. 720–725. External Links: Document Cited by: §1.
  • [80] B. Settles, M. Craven, and S. Ray (2007) Multiple-instance active learning. pp. 1289–1296. Cited by: §2.3.
  • [81] H. S. Seung, M. Opper, and H. Sompolinsky (1992) Query by committee. pp. 287–294. Cited by: §2.3.
  • [82] C. Shui, F. Zhou, C. Gagne, and B. Wang (2019) Deep active learning: unified and principled method for query and training.. arXiv: Learning. Cited by: §2.3.
  • [83] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, Cited by: §A.3, TABLE VII, TABLE IX, §1, §1, §1, §2.1, §2.1, §2.1, §2.2, §2.3, §3, §6.1, §6.1, TABLE I, TABLE II.
  • [84] X. Song, Y. Dai, D. Zhou, L. Liu, W. Li, H. Li, and R. Yang (2020) Channel attention based iterative residual learning for depth map super-resolution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2.
  • [85] M. Sornam, K. Muthusubash, and V. Vanitha (2017) A survey on image classification and activity recognition using deep convolutional neural network architecture. In International Conference on Advanced Computing (ICoAC), Vol. . External Links: Document, ISSN Cited by: §1.
  • [86] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1199–1208. Cited by: TABLE IX, TABLE I.
  • [87] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE VII, §2.1, TABLE II.
  • [88] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020) Rethinking few-shot image classification: a good embedding is all you need?. arXiv preprint arXiv:2003.11539. Cited by: TABLE VII, §6.2, TABLE II.
  • [89] S. Tong and D. Koller (2002) Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2 (1), pp. 45–66. Cited by: §2.3.
  • [90] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, and H. Larochelle (2020) Meta-dataset: a dataset of datasets for learning to learn from few examples. International Conference on Learning Representations (ICLR). Cited by: §A.1, §A.3, TABLE IX, §1, §2.4, §4.4, §4.5, §6.1, §6.1, TABLE I.
  • [91] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: TABLE IX, TABLE I.
  • [92] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §1, §2.1, §6.1.
  • [93] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §6.1.
  • [94] W. Wang, V. W. Zheng, H. Yu, and C. Miao (2019-01) A survey of zero-shot learning: settings, methods, and applications. ACM Trans. Intell. Syst. Technol. 10 (2), pp. 13:1–13:37. External Links: Document, ISSN 2157-6904 Cited by: §1.
  • [95] Y. Wang, W. Chao, K. Q. Weinberger, and L. van der Maaten (2019) SimpleShot: revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623. Cited by: TABLE VII, §6.2, TABLE II.
  • [96] Y. Wang and Q. Yao (2019) Few-shot learning: A survey. CoRR abs/1904.05046. Cited by: §1, §2.1.
  • [97] D. Wertheimer, L. Tang, and B. Hariharan (2021-06) Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8012–8021. Cited by: TABLE VII, §6.2, TABLE II.
  • [98] C. Xing, N. Rostamzadeh, B. N. Oreshkin, and P. O. Pinheiro (2019) Adaptive cross-modal few-shot learning. Advances in Neural Information Processing Systems. Cited by: §1.
  • [99] S. Yang, L. Liu, and M. Xu (2021) Free lunch for few-shot learning: distribution calibration. In International Conference on Learning Representations (ICLR), Cited by: TABLE VII, §6.2, TABLE II.
  • [100] S. Yang, L. Liu, and M. Xu (2021) Free lunch for few-shot learning: distribution calibration. In International Conference on Learning Representations (ICLR), Cited by: §2.1.
  • [101] H. Ye, H. Hu, D. Zhan, and F. Sha (2020) Few-shot learning via embedding adaptation with set-to-set functions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8808–8817. Cited by: TABLE VII, §6.2, TABLE II.
  • [102] C. Yin, B. Qian, S. Cao, X. Li, J. Wei, Q. Zheng, and I. Davidson (2017) Deep similarity-based batch mode active learning with exploration-exploitation. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 575–584. Cited by: §2.3.
  • [103] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. Advances in Neural Information Processing Systems. Cited by: §2.1.
  • [104] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning (ICML), pp. 3987–3995. Cited by: §2.4, TABLE VI.
  • [105] C. Zhang, Y. Cai, G. Lin, and C. Shen (2020-06) DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE VII, §6.2, TABLE II.

Appendix A Benchmarks and Training

A.1 Meta-Dataset

A brief description of the sampling procedure used in the Meta-Dataset setting is already provided in Section 6.1. This sampling procedure, however, comes with additional specifications that are uniform across all tasks (such as count enforcing) and dataset-specific details such as considering the class hierarchy in ImageNet tasks. The full algorithm for sampling is outlined in [90], and we refer the interested reader to Section 3.2 in [90] for complete details. This procedure results in a task distribution where most tasks have fewer than 10 classes and each class has fewer than 20 support examples. The task frequency relative to the number of classes is presented in Figure 11(a), and the class frequency as compared to the class shot is presented in Figure 11(b). The query set contains between 1 and 10 (inclusive) examples per class for all tasks; fewer than 10 query examples occur only when there are not enough total images to support 10 query examples.

A.2 mini/tiered-ImageNet

Task sampling across both mini-ImageNet and tiered-ImageNet first starts by defining a constant number of ways and shots that will be used for each generated task. For a k-shot, w-way problem setting, w classes are first sampled from the dataset with uniform probability. Then, for each sampled class, k of the class images are sampled with uniform probability and used as the support examples for that class. In addition, 10 query images (distinct from the support images) are sampled per class.
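The sampling procedure above can be sketched as follows. This is a minimal illustration under simplified assumptions: the function name is hypothetical, and `dataset` is assumed to be a mapping from class label to a list of images, rather than the actual benchmark loaders.

```python
import random

def sample_episode(dataset, way, shot, n_query=10):
    """Sample a k-shot, w-way episode as described above (illustrative).

    `dataset` maps class label -> list of images. Chooses `way` classes
    uniformly, then per class draws `shot` support images and `n_query`
    query images, with support and query sets disjoint.
    """
    classes = random.sample(sorted(dataset), way)
    support, query = {}, {}
    for c in classes:
        # Draw shot + n_query distinct images, then split them.
        images = random.sample(dataset[c], shot + n_query)
        support[c] = images[:shot]
        query[c] = images[shot:]
    return support, query
```

Drawing `shot + n_query` images in one call guarantees the support/query disjointness required by the protocol.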

Fig. 10: Evaluating Transductive CNAPS on Meta-Dataset with different minimum and maximum numbers of steps. Reported performance is averaged over five runs.
(a) Number of Tasks vs. Ways
(b) Number of Classes vs. Shots
Fig. 11: Test-time task-way and class-shot frequency graphs. As shown, most tasks have fewer than 10 classes (way) and most classes have fewer than 20 support examples (shot).
                                        mini-ImageNet Accuracy (%)                  tiered-ImageNet Accuracy (%)
                                        5-way                 10-way                5-way                 10-way
Model                     Transductive  1-shot     5-shot     1-shot     5-shot    1-shot     5-shot     1-shot     5-shot
MAML [22]                 BN            48.7±1.8   63.1±0.9   31.3±1.1   46.9±1.2  51.7±1.8   70.3±1.7   34.4±1.2   53.3±1.3
MAML+ [52]                Yes           50.8±1.8   66.2±1.8   31.8±0.4   48.2±1.3  53.2±1.8   70.8±1.8   34.8±1.2   54.7±1.3
Reptile [59]              No            47.1±0.3   62.7±0.4   31.1±0.3   44.7±0.3  49.0±0.2   66.5±0.2   33.7±0.3   48.0±0.3
Reptile+BN [59]           BN            49.9±0.3   66.0±0.6   32.0±0.3   47.6±0.3  52.4±0.2   71.0±0.2   35.3±0.3   52.0±0.3
ProtoNet [83]             No            46.1±0.8   65.8±0.7   32.9±0.5   49.3±0.4  48.6±0.9   69.6±0.7   37.3±0.6   57.8±0.5
RelationNet [87]          BN            51.4±0.8   67.0±0.7   34.9±0.5   47.9±0.4  54.5±0.9   71.3±0.8   36.3±0.6   58.0±0.6
TPN [52]                  Yes           55.5±0.8   69.8±0.7   38.4±0.5   52.8±0.4  59.9±0.9   73.3±0.7   44.8±0.6   59.4±0.5
AttWeightGen [28]         No            56.2±0.9   73.0±0.6   -          -         -          -          -          -
TADAM [61]                No            58.5±0.3   76.7±0.3   -          -         -          -          -          -
Simple CNAPS              No            53.2±0.9   70.8±0.7   37.1±0.5   56.7±0.5  63.0±1.0   80.0±0.8   48.1±0.7   70.2±0.6
Transductive CNAPS        Yes           55.6±0.9   73.1±0.7   42.8±0.7   59.6±0.5  65.9±1.0   81.8±0.7   54.6±0.8   72.5±0.6
LEO [75]                  No            61.8±0.1   77.6±0.1   -          -         66.3±0.1   81.4±0.1   -          -
MetaOptNet [48]           No            62.6±0.6   78.6±0.5   -          -         66.0±0.7   81.6±0.5   -          -
MetaBaseline [15]         No            63.2±0.2   79.3±0.2   -          -         68.6±0.3   83.7±0.2   -          -
FEAT [101]                No            66.8±0.2   82.0±0.1   -          -         70.8±0.2   84.8±0.2   -          -
SimpleShot [95]           No            62.8±0.2   80.0±0.1   -          -         71.3±0.2   86.6±0.2   -          -
RFS [88]                  No            64.8±0.6   82.1±0.4   -          -         71.5±0.7   86.0±0.5   -          -
FRN [97]                  No            66.5±0.2   82.8±0.1   -          -         71.2±0.2   86.0±0.2   -