A Baseline for Few-Shot Image Classification

09/06/2019 · by Guneet S. Dhillon, et al. · Amazon and University of Pennsylvania

Fine-tuning a deep network trained with the standard cross-entropy loss is a strong baseline for few-shot learning. When fine-tuned transductively, this outperforms the current state-of-the-art on standard datasets such as Mini-Imagenet, Tiered-Imagenet, CIFAR-FS and FC-100 with the same hyper-parameters. The simplicity of this approach enables us to demonstrate the first few-shot learning results on the Imagenet-21k dataset. We find that using a large number of meta-training classes results in high few-shot accuracies even for a large number of test classes. We do not advocate our approach as the solution for few-shot learning, but simply use the results to highlight limitations of current benchmarks and few-shot protocols. We perform extensive studies on benchmark datasets to propose a metric that quantifies the "hardness" of a test episode. This metric can be used to report the performance of few-shot algorithms in a more systematic way.


1. Introduction

Figure 1. Are we making progress? State-of-the-art few-shot learning methods have enjoyed steady, if limited, improvements in mean accuracy. The boxes show the 25-75% quantiles of the accuracy, while the notches indicate the median and the 95% confidence interval of the median for the 1-shot 5-way few-shot protocol on the Mini-ImageNet [1] dataset. Whiskers denote 1.5× the inter-quartile range, which captures 99.3% of the probability mass for a normal distribution. The error in the estimate of the median (notches in the box plot), however, does not fully reflect the standard deviation of the accuracy. That the latter is so large suggests that this progress may be illusory, especially considering that none of these methods outperform the simple transductive fine-tuning baseline discussed in this paper (rightmost).

As image classification systems begin to tackle more and more classes, the cost of annotating a massive number of images and the difficulty of procuring images of rare categories increase proportionally. This has fueled interest in few-shot learning, where only a few samples per class are available for training. Fig. 1 displays a snapshot of the state-of-the-art. We estimated this plot using published numbers for the estimate of the mean accuracy, the 95% confidence interval of this estimate, and the number of few-shot episodes. For MAML [2] and MetaOpt SVM [3], we use the number of episodes in the authors' GitHub implementations. Numerical values for these accuracies are given in Table 1.

Based on Fig. 1, the field appears to be progressing steadily, albeit slowly. However, the variance of the estimate of the mean accuracy is not the same as the variance of the accuracy. The former can be zero (e.g., asymptotically for an unbiased estimator), yet the latter could be arbitrarily large. The variance of the accuracies in Fig. 1 is extremely large. This suggests that progress in the past few years may be less significant than it seems. To compound the problem, many algorithms report results using different models for different numbers of ways (classes) and shots (number of labeled samples per class), with aggressive hyper-parameter optimization spanning many orders of magnitude.¹ Our goal is to develop a simple baseline for few-shot learning, one that does not require specialized training depending on the number of ways or shots, nor hyper-parameter tuning for different tasks.

¹ For instance, [4] tune specifically for different few-shot protocols, with parameters changing by up to six orders of magnitude; [5] uses a different query shot for different few-shot protocols.

The simplest baseline we can think of is to pre-train a model on the meta-training dataset using the standard cross-entropy loss; this model is called the “backbone” and it can be fine-tuned on the few-shot dataset. Although this approach is basic and has been considered before [1, 6, 7], it has gone unnoticed that it outperforms many sophisticated few-shot algorithms. Indeed, with a small twist of performing fine-tuning transductively, this baseline outperforms all state-of-the-art algorithms on all standard benchmarks and few-shot protocols (cf. Table 1).

Our contribution is then to develop a transductive fine-tuning baseline for few-shot learning that can handle low-shot fine-tuning, e.g., our approach works even for a single labeled example and a single test datum per class. It employs well-understood softmax and cross-entropy training for ordinary classification and allows us to exploit effective regularization and acceleration techniques from the recent literature [8]. Our baseline outperforms the state-of-the-art on a variety of benchmark datasets such as Mini-ImageNet [1], Tiered-ImageNet [9], CIFAR-FS [10] and FC-100 [5], all with the same hyper-parameters. Current approaches to few-shot learning are hard to scale to large datasets. We report the first few-shot learning results on the Imagenet-21k dataset [11] which contains 14.2 million images across 21,814 classes. The rare classes in Imagenet-21k form a natural benchmark for few-shot learning.

The success of this baseline should not be understood as us suggesting that this is the right way of performing few-shot learning. We believe that sophisticated meta-training, understanding of taxonomies and meronomies, transfer learning, and domain adaptation are necessary for effective few-shot learning. The performance of the baseline, however, indicates that we need to interpret existing results² with a grain of salt, and be wary of methods that tailor to the benchmark. To facilitate that, we propose a metric to quantify the hardness of episodes and a way to systematically report performance for different few-shot protocols.

² For instance, [1, 12] use different versions of Mini-ImageNet; [5] report results for a backbone pre-trained on the training set while [13] use both the training and validation sets; [6] use full-sized images from the parent Imagenet-1k dataset [11]; [2, 14, 7, 5, 4] all use different architectures for the backbone, of varying sizes, which makes it difficult to disentangle the effect of their algorithmic contributions.

2. Problem definition and related work

We first introduce some notation and formalize the few-shot image classification problem. Let $x$ and $y$ denote an image and its ground-truth label respectively. The training and test datasets are $D_{tr} = \{(x_i, y_i)\}_{i=1}^{N_{tr}}$ and $D_{te} = \{(x_i, y_i)\}_{i=1}^{N_{te}}$ respectively, where $y_i \in C$ for some set of classes $C$. The training and test datasets are disjoint. In the few-shot learning literature, training and test datasets are commonly referred to as support and query datasets respectively, and are collectively called a few-shot episode. The number of ways, or classes, is $|C|$. The set of samples $D_{tr}^c = \{(x, y) \in D_{tr} : y = c\}$ is the support of class $c$ and its cardinality $|D_{tr}^c|$ is the (non-zero) support shot (more generally referred to as shot). The set of samples $D_{te}^c = \{(x, y) \in D_{te} : y = c\}$ is the query of class $c$ and its cardinality $|D_{te}^c|$ is the query shot. The goal is to learn a function $F$ that exploits the training set to predict the label of a test datum $x$, where $(x, y) \in D_{te}$, by

$$\hat{y}(x) = F(x; D_{tr}). \qquad (1)$$

Typical approaches for supervised learning replace $D_{tr}$ above with a statistic $\theta = \theta(D_{tr})$ that is, ideally, sufficient to classify the training data, as measured by, say, the cross-entropy loss

$$\theta(D_{tr}) \in \operatorname*{argmin}_\theta \; \frac{1}{N_{tr}} \sum_{(x, y) \in D_{tr}} -\log p_\theta(y \mid x), \qquad (2)$$

where $p_\theta(\cdot \mid x)$ is the probability distribution on the set of classes $C$ as predicted by the model in response to an input $x$. When presented with a test datum, the classification rule is typically chosen to be of the form

$$\hat{y}(x) = \operatorname*{argmax}_{y \in C} \; p_{\theta(D_{tr})}(y \mid x), \qquad (3)$$

where the training set is represented by the parameters (weights) $\theta$. This form of the classifier entails a loss of generality unless $\theta(D_{tr})$ is a minimal sufficient statistic of $D_{tr}$, which is of course never the case, especially given few labeled data in $D_{tr}$. However, it conveniently separates the training and inference phases, so we never have to revisit the training set. This is desirable in ordinary image classification, where the training set consists of millions of samples, but not in few-shot learning. We therefore adopt the more general form of $F$ in Eq. 1.

If we call the test datum $x$, then we can obtain the general form of the classifier by

$$\hat{y}(x) \in \operatorname*{argmin}_{y \in C} \; \min_{\theta} \left( \frac{1}{N_{tr}} \sum_{(x_i, y_i) \in D_{tr}} -\log p_\theta(y_i \mid x_i) \; - \; \log p_\theta(y \mid x) \right). \qquad (4)$$

In addition to the training (support) set, one typically also has a meta-training set $D_m = \{(x_i, y_i)\}_{i=1}^{N_m}$ with classes $C_m$ disjoint from $C$. The goal of meta-training is to use $D_m$ to infer the parameters of the few-shot learning model: $\theta_m \in \operatorname*{argmin}_\theta L_m(\theta; D_m)$, where $L_m$ is a meta-training loss that depends on the specific method.

2.1. Related work

2.1.1. Few-shot learning

The meta-training loss is designed to make few-shot training efficient [15, 16, 17, 18]. This approach partitions the problem into a base-level that performs standard supervised learning and a meta-level that accrues information from the base-level. Two main approaches have emerged to do so.

Gradient-based approaches: These approaches treat the updates of the base-level as a learnable mapping [19]. This mapping can be learnt using temporal models [20, 12], or one can back-propagate the gradient across the base-level updates [21, 2]. It is however challenging to perform this dual or bi-level optimization, respectively. These approaches have not been shown to be competitive on large datasets. A recent line of work learns the base-level in closed-form using simpler models such as SVMs [10, 3] which restricts the capacity of the base-level although it alleviates the optimization problem.

Metric-based approaches: A large majority of state-of-the-art algorithms are metric-based meta-learners. These techniques learn an embedding over the meta-training tasks that can be used to compare [22, 23] or cluster [1, 14] query samples. A number of recent results build upon this idea with increasing levels of sophistication in how they learn the embedding [1, 5, 7], creating exemplars from the support set and picking a metric for the embedding [24, 7, 25]. There are numerous hyper-parameters and design choices involved in implementing these approaches which makes it hard to evaluate them systematically [6].

2.1.2. Transductive learning

This approach is more efficient at using few labeled data than supervised learning [26, 27, 28]. The idea is to use information from the test datum to restrict the hypothesis space while searching for the classifier at test time. This search can be bootstrapped using a model trained on $D_m$ and $D_{tr}$. Our approach is closest to this line of work. We use the backbone trained on the meta-training set $D_m$ and initialize a classifier using the support set $D_{tr}$. Both the classifier and the backbone are then fine-tuned to adapt to the new test datum $x$.

In the few-shot learning context, recent papers such as [29, 30] are motivated by transductive learning and exploit the unlabeled query samples. The former updates batch-normalization parameters using the query samples, while the latter uses label propagation to estimate the labels of all the query samples at once.

2.1.3. Semi-supervised learning

We penalize the entropy of the predictions on the query samples at test time. This is a simple technique from the semi-supervised learning literature and is closest to [31]. Modern augmentation techniques such as [32, 33, 34] or graph-based approaches [35] can also be used with our approach; we used the entropic penalty for the sake of simplicity. Semi-supervised few-shot learning is typically formulated as having access to extra unlabeled data during meta-training or few-shot training [36, 9]. Note that this differs from our approach, which uses the unlabeled query samples for transductive learning.

2.1.4. Initialization for fine-tuning

We use recent ideas from the deep metric learning literature [37, 6, 38, 39] to initialize the fine-tuning for the backbone. These works connect the softmax cross-entropy loss with cosine distance and are discussed further in Section 3.1.

3. Approach

The simplest form of meta-training is pre-training with the cross-entropy loss, which yields

$$\theta_m \in \operatorname*{argmin}_\theta \; \frac{1}{N_m} \sum_{(x, y) \in D_m} -\log p_\theta(y \mid x) \; + \; R(\theta), \qquad (5)$$

where the second term denotes a regularizer, say weight decay $R(\theta) = \frac{\lambda}{2} \lVert \theta \rVert_2^2$. The model predicts logits $z_\theta(x)$ for $x$, and the distribution $p_\theta(\cdot \mid x)$ is computed from these logits using the softmax operator. The loss in Eq. 5 is typically minimized by stochastic gradient descent-based algorithms.
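To make the pre-training step concrete, the following is a minimal PyTorch sketch of one epoch of Eq. 5. It is an illustration under our own naming (backbone, loader, optimizer), not the authors' released code; the weight-decay regularizer is folded into the optimizer as is standard.

```python
import torch
import torch.nn.functional as F

def pretrain_epoch(backbone, loader, optimizer):
    """One epoch of Eq. 5: cross-entropy on the meta-training set D_m.
    The weight-decay regularizer R(theta) is folded into the optimizer."""
    backbone.train()
    for images, labels in loader:
        logits = backbone(images)               # z_theta(x): one logit per meta-training class
        loss = F.cross_entropy(logits, labels)  # -log p_theta(y | x), averaged over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Illustrative optimizer matching Section 4 (the weight-decay value is not given here):
# optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1, momentum=0.9,
#                             nesterov=True, weight_decay=...)
```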

If few-shot training is performed according to the general form (4), then the optimization is identical to the above, and amounts to fine-tuning a pre-trained model on a different set of classes. However, the architecture needs to be modified to account for the new classes and initialization needs to be done carefully to make the process efficient.

3.1. Support-based initialization

Given a pre-trained model (backbone) trained on $D_m$, we append a new fully-connected “classifier” layer that takes the logits of the backbone as input and predicts the labels in $C$. For a support sample $(x, y) \in D_{tr}$, denote the logits of the backbone by $z(x)$. We denote the weights and biases of the classifier by $W$ and $b$ respectively, and the $c^{\text{th}}$ row of $W$ and $b$ by $w_c$ and $b_c$ respectively. The ReLU non-linearity is denoted by $\sigma(\cdot) = \max(0, \cdot)$. The combined parameters of the backbone and the classifier are $\Theta$.

If the classifier's logits are $W \sigma(z(x)) + b$, the first term in the cross-entropy loss would be the cosine distance between $w_y$ and $\sigma(z(x))$ if both were normalized to unit $\ell_2$ norm and the bias $b_y$ were zero. This suggests

$$w_c = \frac{\sigma(z(x))}{\lVert \sigma(z(x)) \rVert_2} \quad \text{for } (x, c) \in D_{tr}, \qquad b_c = 0, \qquad (6)$$

as candidates for initialization of the classifier. It is easy to see that such an initialization maximizes the cosine similarity between the features $\sigma(z(x))$ and the weights $w_c$. For multiple support samples per class, we take the Euclidean average of the features for each class in $C$ before the normalization in Eq. 6.

We normalize the logits of the backbone to further exploit the connection of the softmax loss with the cosine distance. The logits of the classifier are thus given by

$$W \, \frac{\sigma(z(x))}{\lVert \sigma(z(x)) \rVert_2} + b. \qquad (7)$$

Note that we have added a ReLU non-linearity between the backbone and the classifier, before the normalization. This completes the support-based initialization. All the parameters are trainable in the fine-tuning phase.
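A minimal PyTorch sketch of the support-based initialization of Eqs. 6 and 7 is given below, assuming a backbone that returns its logits; the function names and tensor layout are our own illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def support_based_init(backbone, support_x, support_y, num_ways):
    """Initialize the new classifier from the support set (Eq. 6).

    Returns (W, b): one row of W per few-shot class, built from the
    normalized, ReLU'd backbone logits averaged over that class's support.
    Wrap W and b in nn.Parameter (or a Linear layer) to fine-tune them.
    """
    z = F.relu(backbone(support_x))          # sigma(z(x)): shape (N_support, num_meta_classes)
    W = torch.zeros(num_ways, z.shape[1])
    for c in range(num_ways):
        z_c = z[support_y == c].mean(dim=0)  # Euclidean average over the class's support
        W[c] = z_c / z_c.norm()              # normalize to unit norm
    b = torch.zeros(num_ways)                # zero biases
    return W, b

def classifier_logits(backbone, x, W, b):
    """Classifier logits of Eq. 7: scores on ReLU'd, normalized backbone logits."""
    feat = F.relu(backbone(x))
    feat = feat / feat.norm(dim=1, keepdim=True)
    return feat @ W.t() + b
```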

Remark 1 (Relation to weight imprinting).

The support-based initialization is motivated by previous papers [39, 37, 7, 38]. In particular, the last one uses a similar technique, with minor differences, to expand the size of the final layer of the backbone for low-shot transfer learning. The authors call their technique “weight imprinting” because the row $w_c$ can be thought of as a template for the new class $c$. In our case, we are only interested in performing well on the few-shot classes and can imprint simply by appending a new fully-connected layer.

Remark 2 (Using logits of the backbone instead of features as input to the classifier).

A natural way to adapt the backbone to predict new classes is to re-initialize its final fully-connected layer. We instead append a new fully-connected layer after the logits of the backbone. This is motivated by [40], who show that activations of all layers are entangled without any class-specific clusters. In contrast, logits of the backbone are peaked on the correct class, and therefore well-clustered, if the backbone has a low training loss. They are thus a cleaner input to the classifier as compared to the features. We explore this choice via an experiment in Appendix C.

3.2. Transductive fine-tuning

In Eq. 4, we have assumed that there is a single test (query) sample and training is performed with the support data. However, we can also process multiple query samples together. The data term in the loss in Eq. 4 does not change; it is simply summed over all query samples and minimized over all unknown labels. However, we wish to introduce a regularizer, in view of the slim chance that optimizing the data term yields a sufficient statistic. We seek solutions that yield models with a peaked posterior, i.e., a low Shannon entropy $H(p_\Theta(\cdot \mid x)) = -\sum_{y \in C} p_\Theta(y \mid x) \log p_\Theta(y \mid x)$. The transductive fine-tuning phase therefore solves for

$$\Theta^* \in \operatorname*{argmin}_{\Theta} \; \frac{1}{N_{tr}} \sum_{(x, y) \in D_{tr}} -\log p_\Theta(y \mid x) \; + \; \frac{1}{N_{te}} \sum_{x \in D_{te}} H\big(p_\Theta(\cdot \mid x)\big). \qquad (8)$$

Note that the data-fitting term uses the labeled support samples whereas the regularizer uses the unlabeled query samples. The two terms are imbalanced: the support shot can be as small as 1 while the query shot can be arbitrary, which results in a large variance in the estimate of the former and a small variance in that of the latter, or vice versa. To allow finer control over this imbalance, one can use a coefficient for the entropic term and/or a temperature in the softmax distribution of the query samples. Tuning these hyper-parameters per dataset and few-shot protocol leads to uniform improvements of 1-2% in the numerical results in Section 4. However, we wish to keep in line with our goal of developing a simple baseline and refrain from optimizing these hyper-parameters; we set them equal to 1 for all experiments on benchmark datasets.
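The following is a minimal PyTorch sketch of one epoch of transductive fine-tuning according to Eq. 8, using the two alternating weight updates described in Section 4 (one for the cross-entropy term on the support set, one for the entropy term on the query set). The names and the Adam usage example are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def transductive_epoch(model, support_x, support_y, query_x, optimizer):
    """One epoch of transductive fine-tuning (Eq. 8): a cross-entropy update
    on the labeled support set followed by a Shannon-entropy update on the
    unlabeled query set."""
    model.train()

    # Update 1: data-fitting term, -log p(y | x) averaged over the support set.
    ce = F.cross_entropy(model(support_x), support_y)
    optimizer.zero_grad()
    ce.backward()
    optimizer.step()

    # Update 2: entropic regularizer, H(p(. | x)) averaged over the query set.
    log_p = F.log_softmax(model(query_x), dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return ce.item(), entropy.item()

# Illustrative usage with Adam, as in Section 4 (learning rate value omitted here):
# optimizer = torch.optim.Adam(model.parameters(), lr=...)
# for _ in range(25):
#     transductive_epoch(model, support_x, support_y, query_x, optimizer)
```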

4. Experimental results

This section shows results of transductive fine-tuning on benchmark datasets in few-shot learning, namely Mini-ImageNet [1], Tiered-ImageNet [9], CIFAR-FS [10] and FC-100 [5]. We also show large-scale experiments on the Imagenet-21k dataset [11] in Section 4.2. Along with the analysis in Section 4.3, these help us design a metric that measures the hardness of an episode in Section 4.4. We sketch key points of the experimental setup here; see Appendix A for details.

Pre-training: We use various backbone architectures for our experiments: conv(64)_4 [1, 14], ResNet-12 [41, 5, 3], WRN-28-10 and WRN-16-4 [13, 4, 42]. Experiments on conv(64)_4 and ResNet-12 are in Appendix C. These networks are trained using standard data augmentation, the cross-entropy loss with label smoothing [43] of 0.1, mixup regularization [44] of 0.25, SGD with a batch size of 256, Nesterov's momentum of 0.9, weight decay, and no dropout. Batch-normalization [45] is used, but its parameters are excluded from weight decay [46]. We use cyclic learning rates [47] and half-precision distributed training on 8 Nvidia V100 GPUs [48, 49] to reduce training time.

Meta-training dataset: Some papers in the literature use only the training classes as the meta-training set and report few-shot results, while others use both training and validation classes for meta-training. For completeness we report results using both methodologies; the former is denoted as (train) while the latter is denoted as (train + val) in Table 1.

Fine-tuning: We perform fine-tuning on one GPU in full precision for 25 epochs with a fixed learning rate using Adam [50], without any regularization. We do not use mini-batches but make two weight updates in each epoch: one for the cross-entropy term using the support samples and one for the Shannon entropy term using the query samples (cf. Eq. 8).

Hyper-parameters: We used images from Imagenet-1k belonging to the training classes of Mini-ImageNet as the validation set for pre-training the backbone for Mini-ImageNet. We used the validation set of Mini-ImageNet to choose hyper-parameters for fine-tuning. All hyper-parameters are kept constant for experiments on benchmark datasets, namely Mini-ImageNet, Tiered-ImageNet, CIFAR-FS and FC-100.

Evaluation: Few-shot episodes contain classes that are sampled uniformly from classes in the test sets of the respective datasets; support and query samples are further sampled uniformly for each class; the query shot is fixed to 15 for all experiments unless noted otherwise. All networks are evaluated over 1,000 few-shot episodes unless noted otherwise. To enable easy comparison with existing literature, we report an estimate of the mean accuracy and the 95% confidence interval of this estimate.
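As a concrete illustration of this evaluation protocol, here is a minimal sketch of how a single few-shot episode could be sampled. The `dataset_by_class` mapping (test class to its list of samples) and the relabeling of classes to 0..way-1 are our own assumptions, not the authors' code.

```python
import random

def sample_episode(dataset_by_class, num_ways=5, support_shot=1, query_shot=15):
    """Sample one few-shot episode: `num_ways` test classes chosen uniformly,
    then `support_shot` + `query_shot` samples drawn uniformly per class."""
    classes = random.sample(sorted(dataset_by_class), num_ways)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        samples = random.sample(dataset_by_class[cls], support_shot + query_shot)
        support += [(x, episode_label) for x in samples[:support_shot]]
        query += [(x, episode_label) for x in samples[support_shot:]]
    return support, query
```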

All experiments in Sections 4.4 and 4.3 use the (train + val) setting, pre-training on both the training and validation data of the corresponding datasets.

Mini-ImageNet Tiered-ImageNet CIFAR-FS FC-100
Algorithm Architecture 1-shot (%) 5-shot (%) 1-shot (%) 5-shot (%) 1-shot (%) 5-shot (%) 1-shot (%) 5-shot (%)
Matching networks [1] conv 46.6 60
LSTM meta-learner [12] conv 43.44 ± 0.77 60.60 ± 0.71
Prototypical Networks [14] conv 49.42 ± 0.78 68.20 ± 0.66
MAML [2] conv 48.70 ± 1.84 63.11 ± 0.92
R2D2 [10] conv 51.8 ± 0.2 68.4 ± 0.2 65.4 ± 0.2 79.4 ± 0.2
TADAM [5] ResNet-12 58.5 ± 0.3 76.7 ± 0.3 40.1 ± 0.4 56.1 ± 0.4
Transductive Propagation [51] conv 55.51 ± 0.86 69.86 ± 0.65 59.91 ± 0.94 73.30 ± 0.75
Transductive Propagation [51] ResNet-12 59.46 75.64
MetaOpt SVM [3] ResNet-12 62.64 ± 0.61 78.63 ± 0.46 65.99 ± 0.72 81.56 ± 0.53 72.0 ± 0.7 84.2 ± 0.5 41.1 ± 0.6 55.5 ± 0.6
Support-based initialization (train) WRN-28-10 56.17 ± 0.64 73.31 ± 0.53 67.45 ± 0.70 82.88 ± 0.53 70.26 ± 0.70 83.82 ± 0.49 36.82 ± 0.51 49.72 ± 0.55
Fine-tuning (train) WRN-28-10 57.73 ± 0.62 78.17 ± 0.49 66.58 ± 0.70 85.55 ± 0.48 68.72 ± 0.67 86.11 ± 0.47 38.25 ± 0.52 57.19 ± 0.57
Transductive fine-tuning (train) WRN-28-10 65.73 ± 0.68 78.40 ± 0.52 73.34 ± 0.71 85.50 ± 0.50 76.58 ± 0.68 85.79 ± 0.50 43.16 ± 0.59 57.57 ± 0.55
Activation to Parameter (train + val) [13] WRN-28-10 59.60 ± 0.41 73.74 ± 0.19
LEO (train + val) [4] WRN-28-10 61.76 ± 0.08 77.59 ± 0.12 66.33 ± 0.05 81.44 ± 0.09
MetaOpt SVM (train + val) [3] ResNet-12 (wider) 64.09 ± 0.62 80.00 ± 0.45 65.81 ± 0.74 81.75 ± 0.53 72.8 ± 0.7 85.0 ± 0.5 47.2 ± 0.6 62.5 ± 0.6
Support-based initialization (train + val) WRN-28-10 58.47 ± 0.66 75.56 ± 0.52 67.34 ± 0.69 83.32 ± 0.51 72.14 ± 0.69 85.21 ± 0.49 45.08 ± 0.61 60.05 ± 0.60
Fine-tuning (train + val) WRN-28-10 59.62 ± 0.66 79.93 ± 0.47 66.23 ± 0.68 86.08 ± 0.47 70.07 ± 0.67 87.26 ± 0.45 43.80 ± 0.58 64.40 ± 0.58
Transductive fine-tuning (train + val) WRN-28-10 68.11 ± 0.69 80.36 ± 0.50 72.87 ± 0.71 86.15 ± 0.50 78.36 ± 0.70 87.54 ± 0.49 50.44 ± 0.68 65.74 ± 0.60
Table 1. Few-shot accuracies on benchmark datasets for 5-way few-shot episodes, reported as mean accuracy ± 95% confidence interval. The notation conv denotes a small multi-layer CNN (e.g., conv(64)_4 is a 4-layer CNN with 64 channels in each layer). Best results in each column are shown in bold. The notation (train + val) indicates that the backbone was pre-trained on both training and validation classes of the datasets; the backbone is trained only on the training classes when not indicated. The authors in [3] use a wider ResNet-12, which we denote as ResNet-12 (wider).

4.1. Results on benchmark datasets

Table 1 shows the results of transductive fine-tuning on benchmark datasets using standard few-shot protocols. We see that this simple baseline is uniformly better than state-of-the-art algorithms. We also include results for support-based initialization, which involves only pre-training and initializing a classifier for the few-shot episode (cf. Section 3.1), and for fine-tuning, which involves using only cross-entropy loss on the support samples (the first term in Eq. 8).

The support-based initialization is sometimes better than or comparable to state-of-the-art algorithms. The few-shot literature has gravitated towards larger backbones recently [4]. Our results indicate that, for large backbones, even standard cross-entropy pre-training and support-based initialization work well. A similar observation was also made by [7].

For the 1-shot 5-way setting, fine-tuning using only the support examples leads to minor improvement over the initialization, and sometimes marginal degradation. However, for the 5-shot 5-way setting non-transductive fine-tuning is better than the state of the art.

In both (train) and (train + val) settings, transductive fine-tuning leads to 2-7% improvement for 1-shot 5-way setting over the state of the art for all datasets. It results in an increase of 1.5-4% for the 5-shot 5-way setting except for the Mini-ImageNet dataset, where the performance is matched. This suggests that the use of the unlabeled query samples is vital for the low-shot setting.

For the Mini-ImageNet, CIFAR-FS and FC-100 datasets using additional data from the validation set to pre-train the backbone results in 2-8% improvements on the few-shot episodes; the improvement is smaller for Tiered-ImageNet. This suggests that having more training classes leads to improved few-shot performance as a consequence of a better embedding. This observation is further corroborated by our experiments on the Imagenet-21k dataset in Section 4.2.

4.2. Large-scale few-shot learning

The Imagenet-21k dataset [11] with 14.2M images across 21,814 classes is an ideal large-scale few-shot learning benchmark due to its high class imbalance. The simplicity of our approach allows us to present the first few-shot learning results on this large dataset. We take the 7,491 most frequent classes, having more than 1,000 images each, as the meta-training set and the next 13,007 classes, with at least 10 images each, for constructing few-shot datasets. As compared to Imagenet-21k, the largest few-shot image classification dataset used in the current literature is Tiered-ImageNet, which consists of 351, 97 and 160 classes, each with about 1,300 images, in the training, validation and test sets respectively. Appendix B provides more details of the setup.

Algorithm Model Shot 5-way 10-way 20-way 40-way 80-way 160-way
Support-based initialization WRN-28-10 1 87.20 ± 1.72 78.71 ± 1.63 69.48 ± 1.30 60.55 ± 1.03 49.15 ± 0.68 40.57 ± 0.42
Transductive fine-tuning WRN-28-10 1 89.00 ± 1.86 79.88 ± 1.70 69.66 ± 1.30 60.72 ± 1.04 48.88 ± 0.66 40.46 ± 0.44
Support-based initialization WRN-28-10 5 95.73 ± 0.84 91.00 ± 1.09 84.77 ± 1.04 78.10 ± 0.79 70.09 ± 0.71 61.93 ± 0.45
Transductive fine-tuning WRN-28-10 5 95.20 ± 0.94 90.61 ± 1.03 84.21 ± 1.09 77.13 ± 0.82 68.94 ± 0.75 60.11 ± 0.48
Table 2. Accuracy (%) on the few-shot data of Imagenet-21k, reported as mean accuracy ± 95% confidence interval. The confidence intervals are large because we compute statistics only over 80 few-shot episodes, so as to test for a large number of ways.

Table 2 shows the mean accuracy of transductive fine-tuning evaluated over 80 few-shot episodes on Imagenet-21k. The accuracy is extremely high as compared to the corresponding results in Table 1, even for a large way. E.g., the 1-shot 5-way accuracy on Tiered-ImageNet is 72.87 ± 0.71%, while it is 89.00 ± 1.86% here. This indicates that pre-training with a large number of classes may be an effective strategy to build large-scale few-shot learning systems. The large improvement in few-shot accuracy comes only at the cost of a tolerably longer pre-training time: we pre-trained for about 14 hours on Tiered-ImageNet with 8 GPUs and about 40 hours on Imagenet-21k. The inference time is the same for both.

The improvements of transductive fine-tuning are minor for Imagenet-21k because the accuracies are extremely high even at initialization. We noticed a slight degradation of the accuracy due to transductive fine-tuning at high ways because the entropic term in Eq. 8 is much larger than the cross-entropy loss. The experiments for Imagenet-21k therefore scale down the entropic term and forego the ReLU before the input to the classifier in Eqs. 6 and 7. We find that the difference in accuracy between support-based initialization and transductive fine-tuning, after this change, is small for high ways.

4.3. Analysis

This section presents a comprehensive analysis of transductive fine-tuning on the Mini-ImageNet, Tiered-ImageNet and Imagenet-21k datasets.

Figure 2. Mean accuracy of transductive fine-tuning for different query shot, way and support shot. Panel (a) shows the mean accuracy (with 95% confidence intervals) and suggests that a larger query shot helps if the support shot is low; this effect is minor for Tiered-ImageNet. The accuracy for a query shot of 1 is high because transductive fine-tuning can specialize the network to the single query sample; this specialization is possible when there are few query samples. Panel (b) shows that the mean accuracy degrades logarithmically with the way for fixed support shot and a query shot of 15; both Tiered-ImageNet and Imagenet-21k follow this trend with a similar slope. Panel (c) suggests that the mean accuracy improves logarithmically with the support shot (1, 2, 5, 10) for fixed way and a query shot of 15. The trends in panels (b) and (c) suggest thumb rules for building few-shot systems.

Robustness of transductive fine-tuning to query shot: Fig. 2(a) shows the effect of changing the query shot on the mean accuracy. For the 1-shot 5-way setting, the entropic penalty in Eq. 8 helps as the query shot increases. This effect is minor in the 5-shot 5-way setting as more labeled data is used in the transductive fine-tuning phase. We observe that one query shot is enough to benefit from transductive fine-tuning. The 1-shot 5-way accuracy with one query shot is 66.94 ± 1.55%, which is significantly better than without the transductive loss (59.62 ± 0.66% in Table 1) and already higher than competing approaches. Fig. 2(a) points to an interesting phenomenon where transductive fine-tuning achieves a relatively higher accuracy with a query shot of 1 because the network adapts to this one query sample.

Performance for different way and support shot: State-of-the-art algorithms are often tuned specifically to each few-shot protocol, which makes it difficult to judge real-world performance. Indeed, a meta-learned few-shot system should be able to robustly handle different test scenarios. Figs. 2(b) and 2(c) show the performance of transductive fine-tuning with changing way and support shot. The mean accuracy changes logarithmically with the way and the support shot, which provides thumb rules for building few-shot systems.

Computational complexity: There is no free lunch and our advocated baseline has its limitations. It performs gradient updates during the fine-tuning phase, which is significantly slower than metric-based approaches at inference time. Specifically, transductive fine-tuning is about 300× slower (20.8 vs. 0.07 seconds) for a 1-shot 5-way episode with a query shot of 15 as compared to a prototypical network [14] with the same backbone. The latency factor reduces with a higher support shot. Interestingly, for a single query shot, the former takes 4 seconds vs. 0.07 seconds. This is a more reasonable factor of about 50×, especially considering that the mean accuracy of the former is 66.2% compared to about 58% for the latter in our implementation. Experiments in Appendix C suggest that freezing the backbone is much faster but gives worse mean accuracy; using a smaller backbone partially compensates for the latency with some degradation of accuracy. A number of recent approaches such as [51, 4] also perform test-time processing and are expected to be slow.

4.4. A proposal for reporting few-shot classification performance

As discussed in Section 1, we need better metrics to report the performance of few-shot algorithms. There are two main issues: (i) the standard deviation of the few-shot accuracy across different sampled episodes for a given algorithm, dataset and few-shot protocol is very high (cf. Fig. 1), and (ii) the use of different models and hyper-parameters for different few-shot protocols makes evaluating algorithmic contributions difficult (cf. Table 1). This section takes a step towards resolving these issues.

Hardness of an episode: Classification performance on a few-shot episode is determined by the relative location of the features corresponding to the query samples. If query samples belonging to different classes have similar features, one expects the accuracy of the classifier to be low. On the other hand, if the features are far apart, the classifier can distinguish easily between the classes to obtain a high accuracy. The following definition characterizes this intuition.

For training data $D_{tr}$ and test data $D_{te}$, we define the hardness $\Omega_{D_{tr}}(D_{te})$ as the average log-odds of a test datum being classified incorrectly by the prototypical loss [14]. More precisely,

$$\Omega_{D_{tr}}(D_{te}) = \frac{1}{N_{te}} \sum_{(x, y) \in D_{te}} \log \frac{1 - p(y \mid x)}{p(y \mid x)}, \qquad (9)$$

where $p(\cdot \mid x)$ is a unit-temperature softmax distribution with logits $W \varphi(x)$ for a query sample $x$; here $W$ is the weight matrix constructed using Eq. 6 and $\varphi(x)$ is the embedding of $x$ normalized to unit $\ell_2$ norm. We imagine $\varphi$ to be the normalized embedding computed using any rich-enough feature generator, say a deep network trained for standard image classification.

Note that $\Omega_{D_{tr}}(D_{te})$ does not depend on the few-shot learner and gives a measure of how difficult the classification problem is for any few-shot episode, using a generic feature extractor.
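A minimal PyTorch sketch of the hardness metric in Eq. 9 is shown below, using normalized embeddings from a generic feature extractor and a unit-temperature softmax over class prototypes built as in Eq. 6; the names are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def episode_hardness(embed, support_x, support_y, query_x, query_y, num_ways):
    """Hardness of an episode (Eq. 9): average log-odds of a query sample
    being misclassified by a softmax over prototype-based logits.

    `embed` is any generic feature extractor (e.g., an ImageNet-trained
    network); it need not be the few-shot learner being evaluated.
    """
    def normalize(v):
        return v / v.norm(dim=-1, keepdim=True)

    phi_s = normalize(embed(support_x))   # normalized support embeddings
    phi_q = normalize(embed(query_x))     # normalized query embeddings

    # Class prototypes: per-class mean of support embeddings, normalized (Eq. 6).
    W = torch.stack([normalize(phi_s[support_y == c].mean(dim=0))
                     for c in range(num_ways)])

    log_p = F.log_softmax(phi_q @ W.t(), dim=1)  # unit-temperature softmax of logits W * phi(x)
    log_p_true = log_p.gather(1, query_y.view(-1, 1)).squeeze(1)
    p_true = log_p_true.exp()

    # log((1 - p) / p), averaged over the query set
    return (torch.log1p(-p_true) - log_p_true).mean().item()
```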

Figure 3. Comparing the accuracy of transductive fine-tuning (solid lines) vs. support-based initialization (dotted lines) for different datasets, ways (5, 10, 20, 40, 80 and 160) and support shots (1 and 5). E.g., the ellipse contains accuracies of different 5-shot 10-way episodes for Imagenet-21k. Abscissae are computed using Eq. 9 and a ResNet-152 [52] network trained for standard image classification on the Imagenet-1k dataset. Markers indicate the accuracy of transductive fine-tuning on few-shot episodes; markers for support-based initialization are hidden to avoid clutter. The shape of the markers denotes different ways; ways increase from left to right (5, 10, 20, 40, 80 and 160). The size of the markers denotes the support shot (1 and 5); it increases from the bottom to the top. Regression lines are drawn for each algorithm and dataset by combining the episodes of all few-shot protocols. This plot is akin to a precision-recall curve and allows comparing two algorithms, or models, for different test scenarios. The area in the first quadrant under the fitted regression lines is 295 vs. 284 (CIFAR-FS), 167 vs. 149 (FC-100), 208 vs. 194 (Mini-ImageNet), 280 vs. 270 (Tiered-ImageNet) and 475 vs. 484 (Imagenet-21k) for transductive fine-tuning and support-based initialization respectively.

Fig. 3 demonstrates how to use the hardness metric. Few-shot accuracy degrades linearly with hardness. Performance across all hardness levels can thus be estimated simply by testing at two different ways. We advocate selecting few-shot learning hyper-parameters using the area under the fitted curves as a metric, instead of tuning them specifically for each few-shot protocol. The advantage of such a test methodology is that it predicts the performance of the model across multiple few-shot protocols systematically.

Different algorithms can be compared directly, e.g., transductive fine-tuning (solid lines) and support-based initialization (dotted lines). For instance, the former leads to large improvements on easy episodes, while the performance is similar for hard episodes, especially for Tiered-ImageNet and Imagenet-21k.

The high standard deviation of accuracy of few-shot learning algorithms in Fig. 1 can be seen as the spread of the cluster corresponding to each few-shot protocol, e.g., the ellipse denotes 5-shot 10-way protocol for Imagenet-21k. It is the nature of few-shot learning that episodes have very different hardness even if the way and shot are fixed. Episodes within the ellipse lie on a (different) line which indicates that hardness is a good indicator of accuracy.

Fig. 3 also shows that, due to fewer test classes, CIFAR-FS, FC-100 and Mini-ImageNet have less diversity in the hardness of episodes, while Tiered-ImageNet and Imagenet-21k allow sampling of both very hard and very easy episodes. For a given few-shot protocol, the hardness of episodes in the former three is almost the same as that of the latter two datasets. This indicates that CIFAR-FS, FC-100 and Mini-ImageNet may be good benchmarks for applications with few classes.

The hardness metric in Eq. 9 naturally builds upon existing ideas in metric-based few-shot learning, namely [14]. We propose it as a means to evaluate few-shot learning algorithms uniformly across different few-shot protocols for different datasets; ascertaining its efficacy and comparisons to other metrics will be part of future work.

5. Discussion

Our aim is to provide grounding to the practice of few-shot learning. The current literature is in the spirit of increasingly sophisticated approaches for modest improvements in mean accuracy using inadequate evaluation methodology. This is why we set out to establish a baseline, namely transductive fine-tuning, and a systematic evaluation methodology, namely the hardness metric. We would like to emphasize that our advocated baseline, namely transductive fine-tuning, is not novel and yet performs better than existing algorithms on all standard benchmarks. This is indeed surprising and indicates that we need to take a step back and re-evaluate the status quo in few-shot learning. We hope to use the results in this paper as guidelines for the development of new algorithms.

References

Appendix A Setup

Datasets: We use the following datasets for our benchmarking experiments.

  • The Mini-ImageNet dataset [1] is a subset of Imagenet-1k [11] and consists of 84 × 84 sized images with 600 images per class. There are 64 training, 16 validation and 20 test classes. There are multiple versions of this dataset in the literature; we obtained the dataset from the authors of [7] (https://github.com/gidariss/FewShotWithoutForgetting).

  • The Tiered-ImageNet dataset [9] is a larger subset of Imagenet-1k with 608 classes split as 351 training, 97 validation and 160 testing classes, each with about 1,300 images of size 84 × 84. This dataset ensures that training, validation and test classes do not have a semantic overlap and is a potentially harder few-shot learning dataset.

  • We also consider two smaller CIFAR-100 [53] derivatives, both with 32 × 32 sized images. The first is the CIFAR-FS dataset [10], which splits classes randomly into 64 training, 16 validation and 20 test classes with 600 images in each. The second is the FC-100 dataset [5], which splits CIFAR-100 into 60 training, 20 validation and 20 test classes with minimal semantic overlap, containing 600 images per class.

During meta-training, we either use only the training set classes, denoted by (train), or both the training and validation set classes, denoted by (train + val). We use the test sets to construct the few-shot episodes.

Architecture and training procedure: We use a wide residual network [42] with a widening factor of 10 and a depth of 28, which we denote as WRN-28-10. The smaller networks conv(64)_4, ResNet-12 and WRN-16-4 are used for ablation studies. All networks are trained using SGD with a batch size of 256, Nesterov's momentum set to 0.9, no dropout, and weight decay. We use two cycles of learning rate annealing [47], of 40 and 80 epochs for all datasets except Imagenet-21k, which uses cycles of 8 and 16 epochs. The learning rate is set to 0.1 at the beginning of each cycle and decayed with a cosine schedule [54]. We use data parallelism across 8 GPUs and half-precision training using techniques from [49, 48].
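For illustration, the two-cycle cosine-annealed learning rate described above can be sketched as follows; restarting at the peak rate each cycle and the floor learning rate are our assumptions, since the final value is not specified here.

```python
import math

def cyclic_cosine_lr(epoch, cycle_lengths=(40, 80), lr_max=0.1, lr_min=0.0):
    """Learning rate at a given epoch for back-to-back cosine cycles.

    Each cycle restarts at lr_max and decays towards lr_min with a cosine
    shape; lr_min is a placeholder parameter."""
    for length in cycle_lengths:
        if epoch < length:
            t = epoch / length
            return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
        epoch -= length
    return lr_min  # after the last cycle

# Example: learning rates for the 40 + 80 epoch schedule
# lrs = [cyclic_cosine_lr(e) for e in range(120)]
```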

We use the following regularization techniques that have been discovered in the non-few-shot, standard image classification literature [8] for pre-training the backbone.

  • Mixup [44]: This augments data by a linear interpolation between input images and their one-hot labels. If $(x_1, y_1)$ and $(x_2, y_2)$ are two samples, mixup creates a new sample $\lambda x_1 + (1 - \lambda) x_2$ with label $\lambda e_{y_1} + (1 - \lambda) e_{y_2}$; here $e_y$ is the one-hot vector with a non-zero entry at index $y$, and $\lambda$ is sampled from a $\text{Beta}(\alpha, \alpha)$ distribution for a hyper-parameter $\alpha$.

  • Label smoothing [43]: When using a softmax operator, the logits can increase or decrease in an unbounded manner, causing numerical instabilities while training. Label smoothing sets the target probability of a class to $1 - \epsilon$ if it is the ground-truth label and to $\epsilon / (K - 1)$ otherwise, for a small constant $\epsilon$ and number of classes $K$. The ratio between the largest and smallest output neuron is thus fixed, which helps large-scale training.

  • We exclude the batch-normalization [45] parameters from weight-decay [46].

We set $\epsilon$ = 0.1 for the label smoothing cross-entropy loss and $\alpha$ = 0.25 for mixup regularization in all our experiments.
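A minimal PyTorch sketch of these two regularizers, under our own naming and with the hyper-parameter values above as defaults, is given below; it is an illustration, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def mixup(x, y, num_classes, alpha=0.25):
    """Mixup [44]: convex combination of a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    y_onehot = F.one_hot(y, num_classes).float()
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

def smoothed_targets(y_onehot, eps=0.1):
    """Label smoothing [43]: move eps of the probability mass from the
    true class to the remaining classes."""
    num_classes = y_onehot.size(1)
    return y_onehot * (1 - eps) + (1 - y_onehot) * eps / (num_classes - 1)

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft (mixed or smoothed) targets."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```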

Fine-tuning hyper-parameters: We used 1-shot 5-way episodes on the validation set of Mini-ImageNet to manually tune hyper-parameters. Fine-tuning is done for 25 epochs with a fixed learning rate using Adam [50]. Adam is used here as it is more robust to large changes in the magnitude of the loss and gradients, which occur when the number of classes in the few-shot episode (the way) is large. We do not use any regularization (weight decay, mixup, dropout, or label smoothing) in the fine-tuning phase. These hyper-parameters are kept constant on all benchmark datasets.

All fine-tuning and evaluation is performed on a single GPU in full-precision. We update the parameters sequentially by computing the gradient of the two terms in the transductive fine-tuning loss function independently. This updates both the weights of the model and the batch-normalization parameters.

Data augmentation: Input images are normalized using the mean and standard deviation computed on Imagenet-1k. Our data augmentation consists of left-right flips with probability 0.5, padding the image with 4 px, and brightness and contrast changes of 40%. The augmentation is kept the same for both meta-training and fine-tuning. We explored augmentation using affine transforms of the images but found that it has a minor effect with no particular trend in the numerical results.
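For reference, the augmentation described above could be written with torchvision roughly as follows; the standard Imagenet-1k normalization constants, the 84-pixel crop size (Mini-/Tiered-ImageNet), and pairing the 4 px padding with a random crop are our assumptions.

```python
from torchvision import transforms

# Standard Imagenet-1k normalization constants (assumed; the text only says
# "mean and standard-deviation computed on Imagenet-1k").
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # left-right flips with probability 0.5
    transforms.RandomCrop(84, padding=4),                   # 4 px padding (random crop assumed); use 32 for CIFAR
    transforms.ColorJitter(brightness=0.4, contrast=0.4),   # ~40% brightness and contrast changes
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```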

Evaluation procedure: The few-shot episode contains classes that are uniformly sampled from the test classes. Support and query samples are further uniformly sampled for each class. The query shot is fixed to 15 for all experiments unless noted otherwise. We evaluate all networks over 1000 episodes unless noted otherwise. For ease of comparison, we report the mean accuracy and the 95% confidence interval of the estimate of the mean accuracy.

Appendix B Setup for Imagenet-21k

Figure 4. Imagenet-21k is a highly imbalanced dataset. The most frequent class has about 3K images while the rarest class has a single image.

The blue region in Fig. 4 denotes our training set with 7,491 classes. The green region shows 13,007 classes with at least 10 images each, and is the test set. We do not use the red region consisting of 1,343 classes with less than 10 images each. We train the same backbone WRN-28-10 with the same procedure as that in Appendix A on 84 × 84 resized images, albeit for only 24 epochs. Since we use the same hyper-parameters as the other benchmark datasets, we did not create validation sets for meta-training or the few-shot fine-tuning phases. We create few-shot episodes from the test set in the same way as Appendix A. We evaluate using fewer few-shot episodes (80) on this dataset because we would like to demonstrate the performance across a large number of different ways.

Appendix C Ablation experiments

This section contains additional experiments and analysis, complementing Section 4.3.

Figure 5. t-SNE [55] embedding of the logits for 1-shot 5-way few-shot episode of Mini-ImageNet. Colors denote the ground-truth labels; crosses denote the support samples; circles denote the query samples; translucent markers and opaque markers denote the embeddings before and after transductive fine-tuning respectively. Even though query samples are far away from their respective supports in the beginning, they move towards the supports by the end of transductive fine-tuning. Logits of support samples are relatively unchanged which suggests that the support-based initialization is effective.

c.1. Transductive fine-tuning changes the embedding dramatically

Fig. 5 demonstrates this effect. The logits for query samples are far from those of their respective support samples, and metric-based loss functions, e.g., those of prototypical networks [14], would incur a high loss on this episode; indeed, the accuracy after the support-based initialization is 64%. Logits for the query samples change dramatically during transductive fine-tuning, and a majority of the query samples cluster around their respective supports. The accuracy of this episode after transductive fine-tuning is 73.3%. This suggests that modifying the embedding using the query samples is crucial to obtaining good performance on new classes. This example also demonstrates that the support-based initialization is effective: logits of the support samples are relatively unchanged during the transductive fine-tuning phase.

c.2. Using features of the backbone as input to the classifier

Instead of re-initializing the final fully-connected layer of the backbone to classify new classes, we simply append the classifier on top of it. We implemented the former, more common, approach and found that it achieves an accuracy of 64.20 ± 0.65% and 81.26 ± 0.45% for 1-shot 5-way and 5-shot 5-way respectively on Mini-ImageNet, while the accuracy on Tiered-ImageNet is 67.14 ± 0.74% and 86.67 ± 0.46% for 1-shot 5-way and 5-shot 5-way respectively. These numbers are significantly lower for the 1-shot 5-way protocol on both datasets compared to their counterparts in Table 1. However, the 5-shot 5-way accuracy is marginally higher in this experiment than that in Table 1. As noted in Remark 2, logits of the backbone are well-clustered, and that is why they work better for low-shot scenarios.

c.3. Freezing the backbone restricts performance

The previous observation suggests that the network changes a lot in the fine-tuning phase. Freezing the backbone restricts the changes in the network to the classifier alone. As a consequence, the accuracy with a frozen backbone is 58.38 ± 0.66% and 75.46 ± 0.52% on Mini-ImageNet and 67.06 ± 0.69% and 83.20 ± 0.51% on Tiered-ImageNet for 1-shot 5-way and 5-shot 5-way respectively. While the 1-shot 5-way accuracies are much lower than their counterparts in Table 1, the gap in the 5-shot 5-way scenario is smaller.

c.4. Large backbone vs. small backbone

The expressive power of the backbone plays an important role in the efficacy of fine-tuning. We observed that a WRN-16-4 architecture (2.7M parameters) performs worse than WRN-28-10 (36M parameters). The former obtains 63.28 ± 0.68% and 77.39 ± 0.5% accuracy on Mini-ImageNet and 69.04 ± 0.69% and 83.55 ± 0.51% accuracy on Tiered-ImageNet for the 1-shot 5-way and 5-shot 5-way protocols respectively. While these numbers are comparable to those of state-of-the-art algorithms, they are lower than their counterparts for WRN-28-10 in Table 1. This suggests that a larger network is effective in learning richer features from the meta-training classes, and fine-tuning is effective in taking advantage of this to further improve performance on samples belonging to few-shot classes.

The above experiment also helps understand why fine-tuning has not been noticed as a strong baseline for few-shot learning before: the baseline architectures were smaller. This observation was partially made by [6]. They, however, only concluded that gaps between meta-learning and fine-tuning-based approaches are reduced with a larger backbone. We take this point further and show in Table 1 that fine-tuning a large backbone can be better than meta-learning approaches even for the same backbone. The support-based initialization is another reason why our approach is so effective. With this initialization, we need to fine-tune the classifier, transductively or not, only for a few epochs (25 in our experiments).

c.5. Latency with a smaller backbone

The WRN-16-4 architecture (2.7M parameters) is much smaller than WRN-28-10 (36M parameters) and transductive fine-tuning on the former is much faster. As compared to our implementation of [14] with the same backbone, WRN-16-4 is 20-70× slower (0.87 vs. 0.04 seconds for a query shot of 1, and 2.85 vs. 0.04 seconds for a query shot of 15) for the 1-shot 5-way scenario. The latency with respect to metric-based approaches is thus smaller for WRN-16-4. Compare this to the computational complexity experiment in Section 4.3.

As discussed in Section C.4, the accuracy of WRN-16-4 is 63.28 ± 0.68% and 77.39 ± 0.5% for 1-shot 5-way and 5-shot 5-way on Mini-ImageNet respectively. As compared to this, our implementation of [14] using a WRN-16-4 backbone obtains 57.29 ± 0.40% and 75.34 ± 0.32% accuracies for the same settings; the former number in particular is significantly worse than its transductive fine-tuning counterpart.

c.6. Using mixup during pre-training

Mixup improves the few-shot accuracy by about 1%: the accuracy for WRN-28-10 trained without mixup is 67.06 ± 0.71% and 79.29 ± 0.51% on Mini-ImageNet for 1-shot 5-way and 5-shot 5-way respectively.

c.7. Comparisons against backbone architectures in the current literature

Mini-ImageNet Tiered-ImageNet CIFAR-FS FC-100
Algorithm Architecture 1-shot (%) 5-shot (%) 1-shot (%) 5-shot (%) 1-shot (%) 5-shot (%) 1-shot (%) 5-shot (%)
MAML [2] conv 48.70 ± 1.84 63.11 ± 0.92
Matching networks [1] conv 46.6 60
LSTM meta-learner [12] conv 43.44 ± 0.77 60.60 ± 0.71
Prototypical Networks [14] conv 49.42 ± 0.78 68.20 ± 0.66
Transductive Propagation [51] conv 55.51 ± 0.86 69.86 ± 0.65 59.91 ± 0.94 73.30 ± 0.75
Support-based initialization (train) conv 50.69 ± 0.63 66.07 ± 0.53 58.42 ± 0.69 73.98 ± 0.58 61.77 ± 0.73 76.40 ± 0.54 36.07 ± 0.54 48.72 ± 0.57
Fine-tuning (train) conv 49.43 ± 0.62 66.42 ± 0.53 57.45 ± 0.68 73.96 ± 0.56 59.74 ± 0.72 76.37 ± 0.53 35.46 ± 0.53 49.43 ± 0.57
Transductive fine-tuning (train) conv 50.46 ± 0.62 66.68 ± 0.52 58.05 ± 0.68 74.24 ± 0.56 61.73 ± 0.72 76.92 ± 0.52 36.62 ± 0.55 50.24 ± 0.58
R2D2 [10] conv 51.8 ± 0.2 68.4 ± 0.2 65.4 ± 0.2 79.4 ± 0.2
TADAM [5] ResNet-12 58.5 ± 0.3 76.7 ± 0.3 40.1 ± 0.4 56.1 ± 0.4
Transductive Propagation [51] ResNet-12 59.46 75.64
Support-based initialization (train) ResNet-12 54.21 ± 0.64 70.58 ± 0.54 66.39 ± 0.73 81.93 ± 0.54 65.69 ± 0.72 79.95 ± 0.51 35.51 ± 0.53 48.26 ± 0.54
Fine-tuning (train) ResNet-12 56.67 ± 0.62 74.80 ± 0.51 64.45 ± 0.70 83.59 ± 0.51 64.66 ± 0.73 82.13 ± 0.50 37.52 ± 0.53 55.39 ± 0.57
Transductive fine-tuning (train) ResNet-12 62.35 ± 0.66 74.53 ± 0.54 68.41 ± 0.73 83.41 ± 0.52 70.76 ± 0.74 81.56 ± 0.53 41.89 ± 0.59 54.96 ± 0.55
MetaOpt SVM [3] ResNet-12 62.64 ± 0.61 78.63 ± 0.46 65.99 ± 0.72 81.56 ± 0.53 72.0 ± 0.7 84.2 ± 0.5 41.1 ± 0.6 55.5 ± 0.6
Support-based initialization (train) WRN-28-10 56.17 ± 0.64 73.31 ± 0.53 67.45 ± 0.70 82.88 ± 0.53 70.26 ± 0.70 83.82 ± 0.49 36.82 ± 0.51 49.72 ± 0.55
Fine-tuning (train) WRN-28-10 57.73 ± 0.62 78.17 ± 0.49 66.58 ± 0.70 85.55 ± 0.48 68.72 ± 0.67 86.11 ± 0.47 38.25 ± 0.52 57.19 ± 0.57
Transductive fine-tuning (train) WRN-28-10 65.73 ± 0.68 78.40 ± 0.52 73.34 ± 0.71 85.50 ± 0.50 76.58 ± 0.68 85.79 ± 0.50 43.16 ± 0.59 57.57 ± 0.55
Support-based initialization (train + val) conv 52.77 ± 0.64 68.29 ± 0.54 59.08 ± 0.70 74.62 ± 0.57 64.01 ± 0.71 78.46 ± 0.53 40.25 ± 0.56 54.53 ± 0.57
Fine-tuning (train + val) conv 51.40 ± 0.61 68.58 ± 0.52 58.04 ± 0.68 74.48 ± 0.56 62.12 ± 0.71 77.98 ± 0.52 39.09 ± 0.55 54.83 ± 0.55
Transductive fine-tuning (train + val) conv 52.30 ± 0.61 68.78 ± 0.53 58.81 ± 0.69 74.71 ± 0.56 63.89 ± 0.71 78.48 ± 0.52 40.33 ± 0.56 55.60 ± 0.56
Support-based initialization (train + val) ResNet-12 56.79 ± 0.65 72.94 ± 0.55 67.60 ± 0.71 83.09 ± 0.53 69.39 ± 0.71 83.27 ± 0.50 43.11 ± 0.58 58.16 ± 0.57
Fine-tuning (train + val) ResNet-12 58.64 ± 0.64 76.83 ± 0.50 65.55 ± 0.70 84.51 ± 0.50 68.11 ± 0.70 85.19 ± 0.48 42.84 ± 0.57 63.10 ± 0.57
Transductive fine-tuning (train + val) ResNet-12 64.50 ± 0.68 76.92 ± 0.55 69.48 ± 0.73 84.37 ± 0.51 74.35 ± 0.71 84.57 ± 0.53 48.29 ± 0.63 63.38 ± 0.58
MetaOpt SVM (train + val) [3] ResNet-12 (wider) 64.09 ± 0.62 80.00 ± 0.45 65.81 ± 0.74 81.75 ± 0.53 72.8 ± 0.7 85.0 ± 0.5 47.2 ± 0.6 62.5 ± 0.6
Activation to Parameter (train + val) [13] WRN-28-10 59.60 ± 0.41 73.74 ± 0.19
LEO (train + val) [4] WRN-28-10 61.76 ± 0.08 77.59 ± 0.12 66.33 ± 0.05 81.44 ± 0.09
Support-based initialization (train + val) WRN-28-10 58.47 ± 0.66 75.56 ± 0.52 67.34 ± 0.69 83.32 ± 0.51 72.14 ± 0.69 85.21 ± 0.49 45.08 ± 0.61 60.05 ± 0.60
Fine-tuning (train + val) WRN-28-10 59.62 ± 0.66 79.93 ± 0.47 66.23 ± 0.68 86.08 ± 0.47 70.07 ± 0.67 87.26 ± 0.45 43.80 ± 0.58 64.40 ± 0.58
Transductive fine-tuning (train + val) WRN-28-10 68.11 ± 0.69 80.36 ± 0.50 72.87 ± 0.71 86.15 ± 0.50 78.36 ± 0.70 87.54 ± 0.49 50.44 ± 0.68 65.74 ± 0.60
Table 3. Few-shot accuracies on benchmark datasets for 5-way few-shot episodes, reported as mean accuracy ± 95% confidence interval. The notation conv denotes a small multi-layer CNN (e.g., conv(64)_4 is a 4-layer CNN with 64 channels in each layer). The rows are sorted by the backbone architecture. Best results in each column for a given backbone architecture are shown in bold. The notation (train + val) indicates that the backbone was trained on both training and validation data of the datasets; the backbone is trained only on the training set when not indicated. The authors in [3] use a wider ResNet-12, which we denote as ResNet-12 (wider).

We include experiments using conv(64)_4 [1, 14], ResNet-12 [41, 5, 3] and WRN-16-4 [42] in Table 3, in addition to WRN-28-10 in Section 4, in order to facilitate comparisons of the proposed baseline across different architectures. Our results are comparable to or better than existing results for a given backbone architecture, except for those of [51], which uses a graph-based transduction algorithm, for conv(64)_4 on Mini-ImageNet and Tiered-ImageNet. In line with our goal of simplicity, we kept the hyper-parameters for pre-training and fine-tuning the same as the ones used for WRN-28-10 (cf. Sections 3 and 4).

Appendix D Frequently asked questions

  1. Why has it not been noticed yet that this simplistic approach works so well?

    We believe there are two main reasons:

    • There are papers in the few-shot learning literature [1, 14] that have explored nearest-neighbor techniques as baselines (our support-based initialization is directly comparable to these). Fine-tuning as a baseline was used in [6]. These papers, however, used small backbone architectures, e.g., the conv(64)_4 architecture, a 4-layer CNN with 64 channels in each layer, has only 113K parameters and obtained a low accuracy even for standard cross-entropy training on the training set of Mini-ImageNet (about 75% vs. 82% for WRN-28-10). Our experiments show that simply using a larger backbone works; a similar observation was also made by [7, 6]. We obtained very good accuracies with much smaller architectures; see Sections C.4 and C.7 for results using the conv(64)_4, ResNet-12 and WRN-16-4 architectures.

    • Given that only a few labeled support samples are provided in the few-shot setting, initializing the classifier becomes important. The support-based initialization (cf. Section 3.1), motivated by the deep metric learning literature [39, 37, 7, 38], classifies the support samples correctly for a support shot of 1 (this may not be true for higher shots). This initialization, as opposed to initializing the weights of the classifier randomly, was critical to performance.

  2. Transductive fine-tuning works better than existing algorithms because you use a bigger backbone. You should compare on the same architectures as the existing algorithms for a fair comparison.

    The current literature is in the spirit of increasingly sophisticated approaches for modest performance gains, often with different architectures (cf. Table 1). This is why we set out to establish a baseline. Our simple baseline, namely standard cross-entropy training followed by standard or transductive fine-tuning, is comparable to or better than existing approaches. The backbone we have used is common in the recent few-shot learning literature [4, 13] (cf. Table 1). This indicates that we should take results on existing benchmarks with a grain of salt.

    We have included experiments using the smaller WRN-16-4 architecture in Section C.4 and commonly used architectures in Section C.7.

  3. Fine-tuning for few-shot learning is not novel.

    Transductive fine-tuning is our advocated baseline for few-shot classification. It is not novel. And yet, it performs better than existing algorithms on all few-shot protocols with fixed hyper-parameters. We emphasize that this indicates the need to re-interpret existing results on benchmarks and re-evaluate the status quo in the literature.

  4. The baseline advocated in this paper has a very high latency at inference time, this is not practical.

    Our goal is to establish a systematic baseline for accuracy, which might help judge the accuracy of few-shot learning algorithms in the future. The question of test-time latency is indeed important but we have not focused on it in this paper. Section C.5 provides results using a smaller backbone where we see that the WRN-16-4 network is about 20-70x slower than metric-based approaches employing the same backbone while having significantly better accuracy. The latencies with WRN-28-10 are larger (see the computational complexity section in Section 4.3) but with a bigger advantage in terms of accuracy.

  5. Transductive fine-tuning does not make sense in the online setting when query samples are shown in a sequence.

    Transductive fine-tuning can be performed even with a single test datum. Indeed, the network can specialize itself completely to classify this one datum. We explore a similar scenario in Section 4.3 and Fig. 2(a), which discuss the performance of transductive fine-tuning with a query shot of 1 (this means 5 query samples, one from each class, for 5-way evaluation). Note that the loss function in Eq. 8 leverages multiple query samples when available. It does not require that the query samples be balanced in terms of their ground-truth classes. In particular, the loss function in Eq. 8 is well-defined even for a single test datum. For concerns about latency, see the previous question.

  6. Why is having the same hyper-parameters for different few-shot protocols so important?

    A practical few-shot learning algorithm cannot assume full knowledge of the few-shot scenario at meta-training time: a model trained for 5-shot 5-way episodes might be faced with 1-shot 5-way few-shot episodes. Current algorithms do not handle this scenario well. One would have to use a bag of models, each trained for a different few-shot scenario, and run one of these models depending upon the few-shot scenario. A single model which can handle any test scenario is thus desirable.

  7. Is this over-fitting to the test datum?

    No, the label of the test datum is not used in the loss function.

  8. Can you give some intuition about the hardness metric? How did you come up with the formula?

    The hardness metric is motivated by the prototypical loss [14]. This is a clustering loss on the feature space where the labeled support samples form the centers of the clusters. The special form, namely $\log \frac{1 - p(y \mid x)}{p(y \mid x)}$, allows an interpretation as the log-odds of misclassification. We used this form because it is sensitive to the number of few-shot classes (cf. Fig. 3). Similar metrics can also be used, but they come with a few caveats. Note that it is easier for $p(y \mid x)$ to be large for a small way because the normalization constant in the softmax has fewer terms; for a large way, $p(y \mid x)$ could be smaller. This effect is better captured by our metric.

  9. How does Fig. 3 look for algorithm X, Y, Z?

    We compared two algorithms in Fig. 3, namely transductive fine-tuning and support-based initialization. Section 4.4 and the caption of Fig. 3 explain how the former is better. We will consider adding comparisons to other algorithms to this plot.