Few-shot Image Classification: Just Use a Library of Pre-trained Feature Extractors and a Simple Classifier

01/03/2021, by Arkabandhu Chowdhury, et al.

Recent papers have suggested that transfer learning can outperform sophisticated meta-learning methods for few-shot image classification. We take this hypothesis to its logical conclusion, and suggest the use of an ensemble of high-quality, pre-trained feature extractors for few-shot image classification. We show experimentally that a library of pre-trained feature extractors combined with a simple feed-forward network learned with an L2-regularizer can be an excellent option for solving cross-domain few-shot image classification. Our experimental results suggest that this simpler sample-efficient approach far outperforms several well-established meta-learning algorithms on a variety of few-shot tasks.

1 Introduction

There has been a lot of recent interest in few-shot image classification (Fei-Fei et al., 2006; Lee et al., 2019; Li et al., 2019; Finn et al., 2017, 2018; Kim et al., 2018; Nichol and Schulman, 2018; Lee and Choi, 2018). Various papers have explored different formulations of the problem, but in one general formulation, we are given a data set D of (image, label) pairs sampled from a distribution P. The goal is to devise a method that uses D to learn a function f that is itself a few-shot learner. The few-shot learner f takes as input a new labeled data set D' consisting of a small set of samples from a new distribution P'. f then returns a classification function g, which is targeted at classifying samples from the distribution P'.

The process of learning f is often referred to as meta-learning in the literature. Learning to classify samples from P' is a “few shot” problem because D' is very small, perhaps having only one example for each class produced by P'. In the most difficult and general variant of the few-shot problem (which we consider in this paper), P' has no known relationship to P, and D' is not available at the time we learn f. Thus, this is “cross-domain” few-shot learning. In cross-domain learning, the only information we have about P' is the set D'. We cannot, for example, choose any hyperparameters controlling the learning of f or g using information that has not been extracted from D or D'.

Our goal is to devise a learner f that works well, out-of-the-box, on virtually any new distribution P'. We argue that in such a scenario, developing novel meta-learning methods to learn f from scratch on a given D may not be the most productive direction of inquiry. Because no information about P' is available at training time, it makes sense to choose a D that has many different types of images, so it is likely to contain some classes that resemble those produced by P', whatever form this distribution takes.

Fortunately, in computer image classification, the standard benchmark data set is now ILSVRC2012, a 1000-class version of the full ImageNet (Russakovsky et al., 2015). ILSVRC2012 consists of a wide variety of images, and it has become quite standard for researchers who design and train new image classifiers to publish classifiers trained on ILSVRC2012. Such published artifacts represent thousands of hours of work by researchers who are well-versed in the “black art” of wringing every last percent of accuracy out of a deep CNN. We argue that rather than trying to develop new meta-learning methods, it may be more productive to simply fix D = ILSVRC2012, and attempt to leverage all of the effort that has gone into learning deep CNNs over ILSVRC2012, by using those CNNs as the basis for the few-shot learner f. As other, even more wide-ranging and difficult benchmark data sets become prevalent (such as the full, 20,000+ class ImageNet), high-quality classifiers trained using those data sets may be preferred.

We first show that it is possible to use any of a number of published, high-quality, deep CNNs, learned over ILSVRC2012, as the basis for a few-shot learner that significantly outperforms state-of-the-art methods. The way to do this is embarrassingly simple: remove the classifier from the top of the deep CNN, fix the weights of the remaining deep feature extractor, and replace the classifier with a simple MLP that is trained using L2 regularization to prevent over-fitting. We call these “library-based” learners because they are based on standard, published feature extractors.
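As a concrete illustration, here is a minimal sketch (our own, not code released with the paper) of how a published ILSVRC2012-trained CNN from torchvision can be turned into a frozen feature extractor by dropping its classification head; the model choice and the dummy batch are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Turn a published ILSVRC2012-trained CNN into a frozen feature extractor
# by dropping its classification head.
backbone = models.resnet18(pretrained=True)
backbone.fc = nn.Identity()              # drop the 1000-way ImageNet classifier
backbone.eval()
for p in backbone.parameters():          # freeze the feature extractor
    p.requires_grad = False

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)  # stand-in for a batch of support images
    embeddings = backbone(images)         # ResNet18 yields 512-dim embeddings
print(embeddings.shape)                   # torch.Size([4, 512])
```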

Next, we ask: if virtually any high-quality, published deep CNN can be used to produce a state-of-the-art, few-shot learner, can we produce an even higher-quality few-shot learner by combining together many high-quality, deep CNNs? We call such a learner a “full library” learner.

Then, we note that other researchers have suggested the utility of re-using high-quality, pre-trained feature extractors for few-shot image classification. In particular, the authors of the “Big Transfer” paper (Kolesnikov et al., 2019) argue that a very large network trained on a huge data set (the JFT-300M data set (Sun et al., 2017), with 300 million images) can power an exceptionally accurate transfer-learning-based few-shot learner. Unfortunately, the authors have not made their largest, JFT-300M-trained network public, and so we cannot experiment with it directly. However, they have made public several versions of their “Big Transfer” network trained on the full, 20,000+ class ImageNet benchmark. Interestingly, we find that “big” may not be as important as “diverse”: a single few-shot classifier comprised of many different high-quality, ILSVRC2012-trained deep CNNs seems to be a better option than a single few-shot classifier built on top of any of the Google-trained CNNs.

Finally, we investigate why a full library learner works so well. We postulate two reasons for this, and give some evidence to back our hypothesis. First, having a very large number of features does not seem to be a problem for few-shot learning; even with a single example per class and nearly 14,000 features, the learning process seems to be able to pick out the same features that are selected when training on a much more complete data set. Second, there seems to be strength in diversity, in the sense that different deep CNNs appear to be useful for different tasks, which may explain why no single feature extractor can outperform a larger library of them.

2 High Accuracy of Library-Based Learners

2.1 Designing a Library-Based Learner

We begin by considering the exceedingly simple idea at the heart of this paper: what if we eschew advanced meta-learning methods, and instead simply use a very high-quality deep CNN pulled from a library, trained on the ILSVRC2012 data set, as the basis for a few-shot classifier? How would such a few-shot learner compare to the advanced few-shot learners proposed in the literature?

Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
5-way, 1-shot
Worst 41.6 0.8 50.8 0.9 77.2 0.9 59.1 0.9 55.5 0.8 53.0 0.9 57.3 0.9 79.7 0.8
RN34 DN121 RN152 DN169 RN152 DN201 RN101 RN18
Best 46.2 1.0 61.2 0.9 86.5 0.7 65.1 0.9 66.6 0.9 56.6 0.9 62.8 0.9 83.5 0.8
DN161 RN152 DN121 RN101 DN201 DN121 RN18 DN161
20-way, 1-shot
Worst 20.1 0.3 27.8 0.4 56.2 0.5 38.0 0.4 29.7 0.3 31.7 0.4 33.2 0.5 62.4 0.5
RN101 DN121 RN101 RN18 RN101 RN101 RN101 RN101
Best 24.3 0.3 36.4 0.4 69.1 0.5 42.5 0.4 38.5 0.4 33.9 0.5 39.5 0.5 70.0 0.5
DN161 RN152 DN121 RN152 DN201 DN161 DN201 DN161
40-way, 1-shot
Worst 14.2 0.2 19.6 0.2 47.3 0.3 28.9 0.2 22.2 0.2 23.7 0.3 26.4 0.3 53.1 0.3
RN34 RN18 RN152 RN18 RN152 RN34 RN152 RN34
Best 17.4 0.2 27.2 0.3 61.6 0.3 33.2 0.3 29.5 0.2 26.8 0.3 31.2 0.3 62.8 0.3
DN161 RN152 DN201 DN152 DN201 DN161 DN201 DN161
Table 1: Accuracy obtained using library deep CNNs for few-shot learning.

Specifically, we are given a high-quality, pre-trained deep CNN from a library of pre-trained networks; we take the CNN as-is, but remove the topmost layers used for classification. This results in a function that takes as input an image and returns an embedding. We then use that embedding in an elementary fashion: we feed it into a multi-layer perceptron with a single hidden layer; a softmax is then used to produce the final classification. Given a few-shot classification problem, two weight matrices W1 and W2 are learned; the first connecting the embedding to the hidden layer, the second connecting the hidden layer to the softmax. To prevent over-fitting during training, simple L2 regularization is used on the weight matrices.
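A minimal sketch of such a classification head follows, assuming PyTorch; the hidden size, learning rate, epoch count, and L2 weight shown are placeholders rather than the validated hyperparameters, and the L2 penalty is applied through the optimizer's weight_decay.

```python
import torch
import torch.nn as nn

class FewShotHead(nn.Module):
    """Embedding -> hidden layer -> softmax over the n_way classes."""
    def __init__(self, in_dim, hidden, n_way):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden)   # embedding to hidden layer
        self.W2 = nn.Linear(hidden, n_way)    # hidden layer to class logits

    def forward(self, x):
        return self.W2(torch.relu(self.W1(x)))  # softmax is folded into the loss

def fit_head(support_feats, support_labels, n_way,
             hidden=1024, lr=1e-3, epochs=100, l2=0.1):
    head = FewShotHead(support_feats.shape[1], hidden, n_way)
    # weight_decay imposes the L2 penalty on the learned matrices
    opt = torch.optim.Adam(head.parameters(), lr=lr, weight_decay=l2)
    loss_fn = nn.CrossEntropyLoss()            # log-softmax + negative log-likelihood
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(support_feats), support_labels)
        loss.backward()
        opt.step()
    return head
```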

2.2 Evaluation

To evaluate this very simple few-shot learner, we first identify nine, high-quality deep CNNs with published models, trained on ILSVRC2012: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152 (all of the ResNet implementations are the ones from the original ResNet designers (He et al., 2016)), DenseNet121, DenseNet161, DenseNet169, and DenseNet201 (all of the DenseNet implementations are also from the original designers (Huang et al., 2017)).

To produce an “out-of-the-box” few-shot learner that can be used on any (very small) training set without additional data or knowledge of the underlying distribution, we perform an extensive hyper-parameter search, using the Caltech-UCSD Birds 200 data set (Welinder et al., 2010) as a validation data set, and then use the best hyperparameters from that data set in all of our experiments. Hyperparameters considered were the learning rate, the number of training epochs, the L2 regularization penalty weight, the number of neurons in the MLP hidden layer, and whether to drop the hidden layer altogether. A separate hyper-parameter search was used for 5-way, 20-way, and 40-way classification. Interestingly, it was difficult to discern any meaningful pattern in the best-performing hyperparameters (except for the fact that dropping the hidden layer did not significantly help performance, but did occasionally hurt performance significantly on the validation set; as a result, we always used a hidden layer).
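The search itself can be as simple as the grid sketched below; evaluate_on_cub is a hypothetical callback that scores a configuration on sampled Caltech-UCSD Birds tasks, and the grid values are illustrative, not the ones actually searched.

```python
import itertools

grid = {
    "lr": [1e-2, 1e-3],
    "epochs": [100, 200, 300],
    "l2": [0.1, 0.2, 0.5],
    "hidden": [None, 512, 1024, 2048],   # None means "drop the hidden layer"
}

def search(evaluate_on_cub):
    # evaluate_on_cub(config) -> mean few-shot accuracy over sampled CUB tasks
    best_cfg, best_acc = None, -1.0
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        acc = evaluate_on_cub(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg   # frozen and reused, unchanged, on every test data set
```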

We then test the resulting few-shot learners (one per deep CNN) on eight different data sets: FGVC-Aircraft (Maji et al., 2013), FC100 (Oreshkin et al., 2018), Omniglot (Lake et al., 2015), Texture (Cimpoi et al., 2014), Traffic Sign (Houben et al., 2013), FGCVx Fungi (Schroeder and Cui, 2018), Quick Draw (Jongejan et al., 2016), and VGG Flower (Nilsback and Zisserman, 2008). To evaluate a few-shot learner on a data set, for an “m-way, k-shot” classification problem, we randomly select m different classes, and then randomly select k images from each class for training; the remainder are used for testing. As this is “out-of-the-box” learning, no validation is allowed.
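The sampling protocol can be sketched as follows (our own illustration): given the labeled pool for one benchmark data set, an m-way, k-shot task keeps k images per sampled class as support and leaves the rest as queries.

```python
import random
from collections import defaultdict

def sample_task(items, m, k):
    """items: list of (image, label) pairs for one benchmark data set."""
    by_class = defaultdict(list)
    for img, lbl in items:
        by_class[lbl].append(img)
    classes = random.sample(list(by_class), m)      # choose m classes
    support, query = [], []
    for new_lbl, c in enumerate(classes):
        imgs = by_class[c][:]
        random.shuffle(imgs)
        support += [(img, new_lbl) for img in imgs[:k]]   # k training images
        query += [(img, new_lbl) for img in imgs[k:]]     # remainder for testing
    return support, query
```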

We performed this evaluation for m in {5, 20, 40} and k in {1, 5}. Due to space constraints, the full results are presented as supplementary material, and we give only a synopsis here. In Table 1, we show the best and worst accuracy achieved across the nine learners, for each of the eight data sets, for k = 1 and m in {5, 20, 40}.

To give the reader an idea of how this accuracy compares to the state-of-the-art, we compare these results with a number of few-shot learners from the literature. We compare against Baseline and Baseline++ (Chen et al., 2019), MAML (Finn et al., 2017), MatchingNet (Vinyals et al., 2016), ProtoNet (Snell et al., 2017), RelationNet (Sung et al., 2018), Meta-transfer (Sun et al., 2019), and FEAT (Ye et al., 2020). When a deep CNN classifier must be chosen for any of these methods, we use a ResNet18. For methods that require a pre-trained CNN (FEAT and Meta-transfer), we use the ResNet18 trained by the ResNet designers (He et al., 2016). Lest the reader be concerned that we chose the worst option (ResNet18), we point out that of the library-based few-shot learners, on the 5-way, 5-shot problem ResNet18 gave the best accuracy out of all of the ResNet-based learners for three of the data sets (see Table 5). Further, these competitive methods tend to be quite expensive to train—MAML, for example, requires running gradient descent over a gradient descent—and for such a method, the shallower ResNet18 is a much more reasonable choice than the deeper models (even using a ResNet18, we could not successfully train first-order MAML for 5-shot, 40-way classification, due to memory constraints).

For each competitive method, we follow the same procedure as was used for the library-based few-shot classifiers: any training that is necessary is performed on the ILSVRC2012 data set, and hyperparameter validation is performed using the Caltech-UCSD Birds 200 data set. Each method is then used without further tuning on the remaining eight data sets.

A comparison of each of these competitive methods with the best and worst-performing library-based learners on the 5-way, 5-shot learning problem is shown in Table 2; comparisons on the 20-way, 5-shot and 40-way, 5-shot problems are shown in Tables 3 and 4. A more complete set of results is in the supplementary material.

2.3 Discussion

There are a few key takeaways from these results. The best library-based learner always beat every one of the other methods tested, and the gap only grows as the number of ways increases. In fact, for the 20-way problem, the worst library-based learner always beat all of the other methods tested. The gap can be quite dramatic, especially on the 20-way problem. The best non-transfer-based few-shot learner (MAML, ProtoNet, RelationNet, and MatchingNet fall in this category) was far worse than even the worst library-based learner: 39% accuracy for ProtoNet vs. 59% accuracy for a classifier based on ResNet18 on Texture, and 48% accuracy for RelationNet vs. 57% accuracy for a classifier based on ResNet101 on Quick Draw. There have been a large number of non-transfer-based methods proposed in the literature (with a lot of focus on improving MAML in particular (Finn et al., 2017, 2018; Kim et al., 2018; Nichol and Schulman, 2018; Lee and Choi, 2018)), but the gap between such methods and the library-based classifiers is so large that it may be difficult to span.

We also note that of the non-library methods, Meta-transfer, Baseline, and FEAT were clearly the best.

Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
Baseline 47.6 0.7 66.9 0.7 96.5 0.2 62.5 0.7 82.1 0.7 57.6 1.7 75.6 0.7 90.9 0.5
Baseline++ 40.9 0.7 59.8 0.8 90.6 0.4 59.9 0.7 79.6 0.8 54.1 1.7 68.4 0.7 83.1 0.7
MAML 33.1 0.6 62.0 0.8 82.6 0.7 56.9 0.8 67.4 0.9 48.3 1.8 77.5 0.8 78.0 0.7
MatchingNet 33.5 0.6 59.4 0.8 89.7 0.5 54.7 0.7 73.7 0.8 55.7 1.7 70.4 0.8 74.2 0.8
ProtoNet 41.5 0.7 64.7 0.8 95.5 0.3 62.9 0.7 75.0 0.8 53.1 1.8 74.9 0.7 86.7 0.6
RelationNet 37.5 0.7 64.8 0.8 91.2 0.4 60.0 0.7 68.6 0.8 58.7 1.8 71.9 0.7 80.6 0.7
Meta-transfer 46.2 0.7 75.7 0.8 93.5 0.4 70.5 0.7 80.0 0.8 66.1 0.8 77.7 0.7 90.5 0.6
FEAT 46.2 0.9 60.5 0.8 85.3 0.7 70.7 0.7 70.5 0.8 67.3 1.7 69.9 0.7 92.1 0.4
Worst library- 61.0 0.9 71.9 0.8 94.0 0.4 79.3 0.6 78.8 0.7 77.1 0.8 77.8 0.7 95.3 0.4
based RN34 DN121 RN152 RN18 RN152 RN34 RN152 RN34
Best library- 66.0 0.9 80.0 0.6 96.7 0.2 83.4 0.6 85.3 0.7 79.1 0.7 81.8 0.6 96.8 0.3
based DN161 RN152 DN201 DN161 DN201 DN121 DN201 DN161
Table 2: Comparing competitive methods with the simple library-based learners, on the 5-way, 5-shot problem.
Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
Baseline 24.2 0.3 40.0 0.4 87.5 0.3 37.0 0.3 59.9 0.4 32.5 0.8 52.8 0.4 76.7 0.4
Baseline++ 18.4 0.3 33.8 0.3 76.2 0.2 34.8 0.3 55.3 0.4 28.2 0.8 45.5 0.4 64.0 0.4
MAML 11.8 0.2 25.7 0.3 46.5 0.4 21.9 0.3 27.0 0.3 17.3 0.7 30.7 0.3 32.9 0.3
MatchingNet 11.9 0.2 31.6 0.3 64.6 0.6 31.6 0.3 46.5 0.4 28.1 0.8 41.2 0.4 53.7 0.5
ProtoNet 22.1 0.3 38.9 0.4 88.1 0.2 38.9 0.3 46.9 0.4 33.0 0.9 33.0 0.9 70.9 0.4
RelationNet 17.1 0.3 39.1 0.4 79.7 0.3 32.1 0.3 41.9 0.4 27.8 0.8 47.5 0.4 62.5 0.4
Meta-transfer 19.1 0.3 48.0 0.4 75.3 0.4 45.5 0.4 52.0 0.4 38.6 0.5 52.6 0.4 74.0 0.4
FEAT 23.1 0.5 34.4 0.5 66.4 0.6 47.5 0.6 43.4 0.6 43.9 0.8 47.0 0.6 80.5 0.5
Worst library- 37.5 0.4 47.1 0.4 84.3 0.3 58.7 0.4 55.9 0.4 56.1 0.5 57.2 0.4 86.8 0.3
based RN18 RN18 RN101 RN18 RN152 RN34 RN101 RN101
Best library- 44.6 0.4 58.8 0.4 92.0 0.2 65.1 0.4 66.0 0.4 60.8 0.5 63.9 0.4 91.6 0.2
based DN161 RN152 DN201 DN161 DN201 DN161 DN161 DN161
Table 3: Comparing competitive methods with the simple library-based learners, on the 20-way, 5-shot problem.
Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
Baseline 17.0 0.2 29.4 0.3 80.0 0.3 27.1 0.2 49.1 0.3 24.0 0.5 41.5 0.3 68.3 0.3
Baseline++ 12.3 0.2 24.4 0.3 67.0 0.3 25.1 0.3 44.1 0.3 19.7 0.5 35.2 0.3 53.9 0.3
MatchingNet 8.6 0.2 22.1 0.3 59.1 0.4 23.3 0.3 37.1 0.3 19.0 0.5 31.5 0.3 46.7 0.4
ProtoNet 13.8 0.2 28.0 0.3 80.0 0.3 29.6 0.3 39.7 0.3 23.6 0.5 42.6 0.4 61.6 0.3
RelationNet 10.7 0.2 26.1 0.3 77.0 0.3 18.7 0.2 29.6 0.3 18.3 0.5 40.6 0.3 47.0 0.3
Meta-transfer 9.7 0.2 29.9 0.2 48.1 0.3 29.8 0.2 33.3 0.2 12.2 0.3 31.6 0.2 55.4 0.3
FEAT 16.2 0.3 27.1 0.4 58.5 0.4 36.8 0.3 37.3 0.3 32.9 0.6 35.6 0.4 74.0 0.4
Worst library- 28.4 0.2 37.3 0.2 79.9 0.3 48.4 0.2 47.2 0.2 46.6 0.3 49.8 0.3 81.4 0.2
based RN34 RN18 RN152 RN18 RN152 RN34 RN152 RN34
Best library- 35.9 0.2 48.2 0.3 89.4 0.2 55.4 0.2 57.5 0.2 52.1 0.3 55.5 0.3 88.9 0.2
based DN161 RN152 DN201 DN161 DN201 DN161 DN201 DN161
Table 4: Comparing competitive methods with the simple library-based learners, on the 40-way, 5-shot problem.

Baseline, for example, consistently outperformed Baseline++, in contradiction to the general findings of the designers of those two methods (Chen et al., 2019); however, for cross-domain experiments, Chen et al. (2019) also found Baseline to do a better job than Baseline++. We note that Meta-transfer, Baseline, and FEAT are the three methods that use the pre-trained ResNet18 without modification. This tends to support the hypothesis at the heart of this paper: starting with a state-of-the-art feature extractor, trained by experts, may be the most important decision in designing a few-shot learner.

3 A Simple Full Library Classifier

3.1 Extreme Variation in Few-Shot Quality

There is not a clear pattern to which of the library-based classifiers tends to perform best, and which tends to perform worst. Consider the complete set of results, over all of the library-based few-shot learners, for the 5-way, 5-shot problem, shown in Table 5. For “out-of-the-box” use, where no validation data are available, it is very difficult to see any sort of pattern that might help with picking a particular library-based classifier. The DenseNet variations sometimes do better than ResNet (on the Aircraft data set, for example), but sometimes they do worse than the ResNet variations (on the FC100 data set, for example). And within a family, it is unclear which library-based CNN to use. As mentioned before, ResNet18 provides the best ResNet-based few-shot learner for three of the data sets, but it forms the basis of the worst ResNet-based learner on two of the data sets.

Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
DenseNet121 64.7 0.9 71.9 0.8 96.7 0.3 82.1 0.6 85.0 0.7 79.1 0.7 81.3 0.7 96.0 0.4
DenseNet161 66.0 0.9 73.7 0.7 96.6 0.3 83.4 0.6 83.9 0.7 78.4 0.8 81.3 0.6 96.8 0.3
DenseNet169 63.6 0.9 73.5 0.7 95.0 0.3 82.3 0.6 84.6 0.7 78.4 0.8 80.6 0.7 96.1 0.3
DenseNet201 62.6 0.9 75.1 0.7 96.7 0.6 83.2 0.6 85.3 0.7 78.0 0.7 81.8 0.6 96.5 0.3
ResNet18 61.2 0.9 72.1 0.8 95.4 0.3 79.3 0.6 83.2 0.7 77.7 0.7 81.7 0.6 95.3 0.4
ResNet34 61.0 0.9 76.2 0.7 94.9 0.3 82.5 0.6 81.3 0.7 77.1 0.7 79.7 0.6 95.3 0.4
ResNet50 62.3 0.9 73.9 0.8 94.3 0.4 83.2 0.6 79.4 0.8 77.9 0.8 78.1 0.8 95.6 0.4
ResNet101 62.4 0.9 79.2 0.7 95.5 0.3 82.8 0.6 81.3 0.7 78.6 0.8 78.6 0.7 95.8 0.3
ResNet152 61.7 1.0 80.0 0.6 94.0 0.4 83.0 0.6 78.8 0.7 79.0 0.8 77.8 0.7 95.6 0.3
Table 5: Complete results for library-based few-shot learners on the 5-way, 5-shot problem.

3.2 Combining Library-Based Learners

Given the relatively high variance in accuracies obtained using the library deep CNNs across various data sets, it is natural to ask: is it perhaps possible to use all of these library feature extractors in concert with one another, to develop a few-shot learner that consistently does as well as (or even better than) the best library CNN?

Given the lack of training data in few-shot learning, the first idea one might consider is some simple variant of ensembling: given a few-shot learning problem, simply train a separate neural network on top of each of the deep CNNs, and then, at classification time, either use a majority vote (hard ensembling) or average the class scores (soft ensembling).
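The two variants can be sketched as follows, assuming each backbone already has its own trained head producing logits for the query images.

```python
import torch

def hard_ensemble(logits_per_model):
    # logits_per_model: list of [n_query, n_way] tensors, one per library CNN
    votes = torch.stack([l.argmax(dim=1) for l in logits_per_model])  # [n_models, n_query]
    return votes.mode(dim=0).values          # majority vote per query image

def soft_ensemble(logits_per_model):
    probs = torch.stack([l.softmax(dim=1) for l in logits_per_model])
    return probs.mean(dim=0).argmax(dim=1)   # average class scores, then predict
```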

Another option is to take all of the library deep CNNs together and view them as a single feature extractor. Using the nine models considered thus far in this paper in this way results in a total of 13,984 features. We then train an MLP on top of this combined representation, using L2 regularization. Again using the Caltech-UCSD Birds 200 data set for validation, we perform a hyperparameter search to build few-shot learners for the 5-way, 20-way, and 40-way classification problems.
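Concretely, the combined feature extractor just concatenates the per-backbone embeddings; a sketch, assuming the frozen backbones and the fit_head helper from Section 2:

```python
import torch

def full_library_features(backbones, images):
    # nine frozen library CNNs -> one 13,984-dimensional feature vector per image
    with torch.no_grad():
        return torch.cat([b(images) for b in backbones], dim=1)

# feats = full_library_features(backbones, support_images)
# head = fit_head(feats, support_labels, n_way)
```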

We test both of these options over the eight test sets, and give a synopsis of the results in Table 6. We find that across all 24 test cases (eight data sets, three classification tasks), the best single learner was never able to beat the best method that used all nine deep CNNs. All of the tested methods had similar performance on the 5-way classification task, though the best single learner was generally a bit worse than the other methods.

Where the difference becomes more obvious is on the classification tasks with more categories. For the 20-way and 40-way problems, the two ensemble-based methods had a small but consistent drop in accuracy, and building a single network on top of all nine deep CNNs is clearly the best option. This may be somewhat surprising; for the 40-way problem, the hyperparameter search on the Caltech-UCSD Birds 200 data set settled on a single network with 1024 neurons in a hidden layer; this means that more than 14 million parameters must be learned over just 200 images. Clearly, the neural network is massively over-parameterized, and yet the accuracies obtained are remarkable, with over 90% accuracy on two of the data sets.

We close the section by noting that we did try other, non-neural learners on top of the nine library CNNs (such as a random forest) and found that the results were quite inferior to the MLP tested here.

Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
5-way, 5-shot
Full Library 68.9 0.9 79.1 0.8 97.5 0.3 85.3 0.6 85.8 0.7 81.2 0.8 84.2 0.6 97.4 0.3
Hard Ensemble 67.8 0.9 79.9 0.7 97.8 0.2 85.4 0.5 85.2 0.7 82.1 0.7 83.5 0.6 97.7 0.2
Soft Ensemble 68.4 0.9 80.5 0.6 98.0 0.2 85.7 0.6 85.2 0.7 82.0 0.7 84.1 0.5 97.9 0.2
Best Single 66.0 0.9 80.0 0.6 96.7 0.2 83.4 0.6 85.2 0.7 79.1 0.7 81.8 0.6 96.8 0.3
20-way, 5-shot
Full Library 49.5 0.4 61.6 0.4 95.4 0.2 68.5 0.4 70.4 0.4 65.5 0.5 69.4 0.4 94.3 0.2
Hard Ensemble 46.6 0.4 60.1 0.4 94.5 0.2 67.8 0.4 67.8 0.4 64.4 0.4 67.9 0.4 93.5 0.2
Soft Ensemble 47.5 0.4 60.7 0.4 94.9 0.2 68.2 0.4 68.3 0.4 64.4 0.4 68.8 0.4 93.7 0.2
Best Single 44.6 0.4 58.8 0.4 92.0 0.2 65.1 0.4 66.0 0.4 60.8 0.5 63.9 0.4 91.6 0.2
40-way, 5-shot
Full Library 41.2 0.3 51.8 0.2 93.2 0.1 59.3 0.2 62.7 0.2 57.6 0.3 60.8 0.3 91.9 0.2
Hard Ensemble 38.0 0.3 50.2 0.2 92.1 0.1 58.5 0.2 59.6 0.2 55.6 0.3 59.3 0.3 90.6 0.2
Soft Ensemble 39.0 0.3 51.2 0.3 92.5 0.1 59.0 0.2 60.2 0.2 56.5 0.3 60.1 0.3 91.1 0.2
Best Single 35.9 0.2 48.2 0.3 89.4 0.2 55.4 0.2 57.5 0.2 52.1 0.3 55.5 0.3 88.9 0.2
Table 6: Accuracy obtained using all nine library CNNs as the basis for a few-shot learner.

4 Data vs. Diversity: Who Wins?

The results from the last section clearly show that an MLP learned on top of a library of high-quality, deep CNNs makes for an excellent, general-purpose, few-shot learner. This leaves open a question: when basing a few-shot learner on a fairly simple, transfer-based approach, which is more important, diversity or size? That is, is it better to have a large variety of deep CNNs to combine together, each of which is trained on a smaller data set, or is it preferable to base a few-shot learner on a single deep CNN that has been trained on a larger and more diverse data set?

To answer this question, we compare the single MLP built upon all nine of the library deep CNNs with an MLP built upon a high-quality deep CNN that was itself constructed upon an even larger data set: the full ImageNet data set, with more than 20,000 categories. High-quality, publicly available deep CNNs for this data set are rare, but Google recently released a set of deep CNNs trained on the full ImageNet, specifically for use in transfer learning (Kolesnikov et al., 2019). We consider three of their deep CNNs, each of which is a ResNet: BiT-ResNet-101-3 (“BiT” stands for “Big Transfer”; “101-3” denotes a ResNet101 with three times the width of a standard ResNet), BiT-ResNet-152-4, and BiT-ResNet-50-1.

For each of these Big Transfer models, we perform a full hyperparameter search using the Caltech-UCSD Birds data set for validation, as we did for each of the deep CNNs trained on ILSVRC2012. Interestingly, the Google models tend to do better with a much larger L2 regularization weight (0.5 to 0.7) than the deep CNNs trained on ILSVRC2012 (which typically performed best on the validation set using a weight of around 0.1). Results are shown in Table 7.

The headline finding is that the single model utilizing a library of nine ILSVRC2012-trained CNNs is the best overall. It did not always perform the best on each data set; in fact, on four of the data sets (FC100, Texture, Fungi, and VGG Flower), at least one of the Google Big Transfer models outperformed the single model consisting of nine ILSVRC2012-trained CNNs.

However, in each of those cases, the performance was comparable across models, except, perhaps, for the VGG Flower data set, where the best Big Transfer models always obtained more than 99% accuracy. On the other data sets (Aircraft, Omniglot, Traffic Sign, and Quick Draw), the combined model far outperformed any of the Big Transfer models. The gap was often significant, and the combined model outperformed the best Big Transfer-based model by an average of more than 11% for the 40-way task.

It was also interesting that while the Big Transfer models were generally better than the ILSVRC2012-trained library CNNs, this varied across data sets. On the Aircraft and Omniglot data sets, for example, even the best individual ILSVRC2012-trained library CNNs outperformed the Big Transfer models.

All of this taken together seems to suggest that, when building a transfer-based few-shot learner, having a large library of deep CNNs is at least as important as, and likely more important than, having access to a CNN that was trained on a very large data set.

5 Why Does This Work?

In this section of the paper, we turn our attention to asking the question: why does using a full library of pre-trained feature extractors seem to work so well?

5.1 Few-Shot Fine-Tuning Is Surprisingly Easy

One key reason is that fine-tuning appears to be very easy with a library of pre-trained features. Consider the following simple experiment, designed to test whether the number of training points has much effect on the learned model.

We choose a large number of 40-way problems over all eight of our benchmark data sets, and train two classifiers for each. Both classifiers use all of the 13,984 features provided by the nine library feature extractors. However, the first classifier is trained as a one-shot classifier, and the second classifier is trained using all of the available data in the data set. Our goal is to see whether the sets of learned weights have a strong correspondence across the two learners; if they do, it is evidence that the number of training points has a relatively small effect. Note that in a neural network with a hidden layer consisting of hundreds or thousands of neurons, there are likely to be a large number of learned weight settings that are of equal quality; in fact, simply permuting the neurons in the hidden layer results in a model that is functionally identical, but that has very different weight matrices. Thus, we do not use a hidden layer in either classifier, and instead use a softmax directly on top of an output layer that linearly combines the input features. Both the one-shot and full-data classifiers were learned without regularization.

Aircraft FC100 Omniglot Texture Traffic Fungi QDraw Flower
5-way, 5-shot
Full Library 68.9 0.9 79.1 0.8 97.5 0.3 85.3 0.6 85.8 0.7 81.2 0.8 84.2 0.6 97.4 0.3
BiT-ResNet-101-3 54.0 1.1 78.6 0.8 82.5 1.2 82.0 0.9 69.2 0.9 81.2 1.2 63.7 1.1 99.6 0.1
BiT-ResNet-152-4 59.5 1.0 80.9 0.7 94.2 0.5 85.4 0.6 73.3 0.8 82.5 0.9 74.8 0.8 99.7 0.1
BiT-ResNet-50-1 61.9 1.2 79.0 0.8 87.2 1.1 84.2 0.6 75.6 1.0 82.5 0.8 71.5 0.8 99.3 0.2
20-way, 5-shot
Full Library 49.5 0.4 61.6 0.4 95.4 0.2 68.5 0.4 70.4 0.4 65.5 0.5 69.4 0.4 94.3 0.2
BiT-ResNet-101-3 35.8 0.4 60.4 0.4 87.8 0.3 69.6 0.4 51.1 0.4 68.4 0.5 57.0 0.4 99.3 0.1
BiT-ResNet-152-4 33.5 0.4 63.4 0.4 85.4 0.4 70.9 0.4 49.2 0.4 68.1 0.5 52.6 0.5 99.5 0.1
BiT-ResNet-50-1 39.6 0.4 60.9 0.4 83.9 0.4 66.4 0.4 53.5 0.4 68.7 0.4 55.0 0.4 99.1 0.1
40-way, 5-shot
Full Library 41.2 0.3 51.8 0.2 93.2 0.1 59.3 0.2 62.7 0.2 57.6 0.3 60.8 0.3 91.9 0.2
BiT-ResNet-101-3 24.6 0.3 49.6 0.2 56.4 0.8 61.5 0.2 40.2 0.2 60.3 0.3 28.9 0.5 99.0 0.1
BiT-ResNet-152-4 25.4 0.2 53.0 0.3 81.0 0.3 58.6 0.2 40.0 0.2 53.9 0.4 44.8 0.3 98.8 0.1
BiT-ResNet-50-1 33.0 0.3 48.8 0.3 84.6 0.2 60.0 0.2 46.9 0.2 59.2 0.3 48.3 0.3 99.0 0.1
Table 7: Comparing a few-shot learner utilizing the full library of nine ILSVRC2012-trained deep CNNs with the larger CNNs trained on the full ImageNet.

Over each of the eight benchmark data sets, for each feature, we take the norm of all of the 40 weights associated with the feature; this serves as an indication of the importance of the feature. We compute the Pearson correlation coefficient for each of the 13,984 norms obtained using the models learned for the 1-shot and full data classifiers. These correlations are shown in Table 8.
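A sketch of this measurement follows; the text does not specify which norm is used, so the sketch assumes the Euclidean norm, and W_one_shot / W_full_data stand for the weight matrices of the two trained linear classifiers.

```python
import torch

def feature_importance(weight):
    # weight: [n_way, n_features] matrix of a linear (no hidden layer) classifier
    return weight.norm(dim=0)              # one importance score per input feature

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return ((a * b).sum() / (a.norm() * b.norm())).item()

# corr = pearson(feature_importance(W_one_shot), feature_importance(W_full_data))
```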

What we see is a remarkably high correlation between the two sets of learned weights; above 80% correlation in every case, except for the Traffic Sign data set. However, even in the case of the Traffic Sign data set, there was a weak correlation (the Traffic Sign data is a bit of an outlier in other ways as well, as we will see subsequently).

This would seem to indicate that there is a strong signal with only one data point per class, as the learned weights do not differ much when they are learned using the whole data set. This may be one explanation for the surprising accuracy of the full library few-shot learner. Of course, the strong correlation may gloss over significant differences, and more data does make a significant difference (consider the difference between the one-shot accuracies in Table 1 and the five-shot accuracies in Table 2.). But even one image per class seems to give a lot of information to a full library few-shot learner.

Aircraft Birds FC100 Fungi Omniglot Quick Draw Texture Traffic VGG Flower
40-way, 1-shot vs. 40-way, all data
Correlation 0.95 0.88 0.89 0.86 0.93 0.92 0.80 0.18 0.87
Table 8: Correlation between weight learned using one example per class, and the full data.

5.2 Different Problems Utilize Different Features

Another reason for the accuracy obtained by the full library method may be that different problem domains seem to utilize different sets of features, and different feature extractors. Having a large library ensures that some features relevant to any given task are always present.

To investigate this, we construct a large number of 40-way training tasks over each of the various data sets using all available data, and for each, we learn a network without a hidden layer on top of all 13,984 features obtained using the library of deep CNNs. Again, we compute the norm of the set of weights associated with each of the features. This time, however, for each problem we then consider the features whose norms are in the top 20%; these could be considered the features most important to solving the classification task.

For each of the (data set, data set) pairs, we compute the average Jaccard similarity of these top-feature sets. Since each set consists of 20% of the features, if each set of features were chosen completely at random, we would expect a Jaccard similarity of (0.2 × 0.2)/(0.2 + 0.2 − 0.2 × 0.2) = 1/9 ≈ 0.11. Anything greater indicates the sets of features selected are positively correlated; anything lower indicates a negative correlation. Results are in Figure 1. For each of the nine CNNs, we also compute the fraction of each CNN’s features that are in the top 20% when a full library classifier is constructed. These percentages are in Figure 2.
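The overlap statistic can be sketched as below (our own illustration, reusing the per-feature norms from the previous experiment); two random 20% subsets would give a Jaccard similarity of about 1/9.

```python
import torch

def top_feature_set(weight, frac=0.2):
    norms = weight.norm(dim=0)                   # importance of each of the 13,984 features
    k = int(frac * norms.numel())
    return set(norms.topk(k).indices.tolist())   # indices of the top 20% of features

def jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)
```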

The data in these two plots, along with the previous results, seem to tell a consistent story. First, there appears to be surprisingly little correspondence between data sets in terms of the set of features that are chosen as important. The largest Jaccard value in Figure 1 is less than 0.5 (observed between FC100 and Texture), which shows, in our opinion, a relatively weak correspondence. The Traffic Sign data set had an average Jaccard similarity with the other eight data sets that was even lower than the value that would be expected under a purely random feature selection regime.

One might speculate that the lack of correspondence across data sets is evidence for the hypothesis that different tasks tend to use different features, which would explain why it is so effective to use an entire library of deep CNNs for few-shot learning.

Figure 1: Jaccard similarity of sets of most important features for the various (data set, data set) pairs.
Figure 2: Percent of each deep CNN’s features that appear in the top 20% of features, on each data set.

Also note that in Figure 2, we tend to see that different deep CNNs contribute “important” features at very different rates, depending on the particular few-shot classification task. This also seems to be evidence that diversity is important. In general, the DenseNets’ features are preferred over the ResNets’ features, but this is not universal, and there is a lot of variation. Note that for the Traffic Sign data set, the ResNet features are strongly preferred. It may not be an accident that the three data sets where the selection of features from library CNNs shows the most diversity in Figure 2 (Traffic Sign, Quick Draw, and Omniglot) are the three data sets where the classifier built on top of all nine library CNNs has the largest advantage compared to the few-shot classifiers built on top of the single, “Big Transfer”-based deep CNN. This may be further evidence that the benefit of having a wide variety of features available to a few-shot learner is of key importance.

6 Related Work

Few-shot learning for image processing tasks has received a lot of research attention (Wang et al., 2018; Oreshkin et al., 2018; Rusu et al., 2018). While most of the research has focused on developing new and sophisticated methods for learning a few-shot learner from a data set , there have recently been a few papers that, like this work, have suggested that very simple, transfer-based methods may be the most promising direction.

In the Introduction to the paper, we mentioned Google’s “Big Transfer” paper (Kolesnikov et al., 2019), which is closest in spirit to this work. In that paper, the authors argued that perhaps the best approach to few-shot learning is to concentrate on using a high-quality feature extractor rather than a sophisticated meta-learning approach. We agree with this, but also give evidence that training a huge model on a massive data set may not be the only key to few-shot image classification. We found that a library containing a wide diversity of high-quality deep CNNs can lead to substantially better accuracy than a single massive CNN, even when that CNN is trained on a much larger data set. Tian et al. (2020) make a similar argument to the Big Transfer authors, observing that a simple transfer-based approach can outperform a sophisticated meta-learner, though they do not investigate the importance of the size of the training data, nor the importance of diversity in the feature extractors.

Chen et al. (2019) were among the first researchers to point out that simple, transfer-based methods may outperform sophisticated few-shot learners. They proposed Baseline, which uses a simple linear classifier on top of a pre-trained deep CNN, and Baseline++, which uses a distance-based classifier. Dhillon et al. (2019) also point out the utility of transfer-based methods, and propose a transductive fine-tuner; however, such a method relies on having an appropriately large number of relevant, unlabeled images to perform the transductive learning. Sun et al. (2019) also consider transfer learning, but where transfer is realized via shifting and scaling of the classification parameters.

Dvornik et al. (2019) also consider the idea of ensembling for few-shot learning, although their idea is to simultaneously train a number of diverse deep CNNs as feature extractors during meta-learning, adding penalty terms that encourage both diversity and conformity during learning. We instead propose simply using a set of existing deep CNNs, trained by different groups of engineers.

7 Conclusions

We have examined the idea of using a library of deep CNNs as a basis for a high-quality few-shot learner. We have shown that a learner built on top of a high-quality deep CNN can have remarkable accuracy, and that a learner built upon an entire library of CNNs can significantly outperform a few-shot learner built upon any one deep CNN. The accuracy observed was remarkable. On a suite of eight cross-domain problems, the full library learner achieved more than 50% accuracy on a 40-way, 5-shot problem for seven of the data sets, and greater than 90% accuracy on two of them. Further, we have shown that a diversity of CNNs in the library may be even more important than training those library CNNs on a very large data set.

While we conjecture that it will be difficult to do better than using a library of high-quality deep CNNs as the basis for a few-shot learner, there are key unanswered questions. Does the accuracy continue to improve as the library is made larger? Further, it may be expensive to feed a test image through a series of nine (or more) deep CNNs; can the computational cost of the library-based approach be reduced? And finally, are there better methods to facilitate transfer than learning a simple MLP?

Appendix

1 Description of Data

A Data sets

ILSVRC2012 (Russakovsky et al., 2015), Figure A1(a). A data set of natural images from 1,000 diverse categories, the most commonly used ImageNet data set, primarily released for the ‘Large Scale Visual Recognition Challenge’. We use the ILSVRC-2012 version, as the original data set for the classification challenge has not been modified since. The data set has a little more than 1.2 million (1,281,167 to be precise) images, with each class containing between 732 and 1,300 images.

CUB-200-2011 Birds (Welinder et al., 2010), Figure A1(b). A data set for fine-grained classification of 200 different bird species, an extended version of the CUB-200 data set. The total number of images in the data set is 11,788, with roughly 60 images per class.

FGVC-Aircraft (Maji et al., 2013), Figure A1(c). A data set of aircraft images spanning 102 model variants, with 10,200 total images and 100 images per class.

FC100 (Oreshkin et al., 2018), Figure A1(d). A data set curated for few-shot learning based on the popular CIFAR100 (Krizhevsky et al., 2009); it includes 100 classes and 600 32×32 color images per class. It offers a more challenging scenario with lower image resolution.

Omniglot (Lake et al., 2015), Figure A1(e). A data set of images of 1,623 handwritten characters from 50 different alphabets. We consider each character as a separate class. The total number of images is 32,460, with 20 examples per class.

Texture (Cimpoi et al., 2014), Figure A1(f). A data set consisting of 5,640 images organized according to 47 categories inspired by human perception. There are 120 images for each category. Image sizes range between 300×300 and 640×640, and the images contain at least 90% of the surface representing the category attribute.

Traffic Sign (Houben et al., 2013), Figure A1(g). The German Traffic Sign Recognition Benchmark, consisting of more than 50,000 images across 43 classes.

FGCVx Fungi (Schroeder and Cui, 2018), Figure A1(h). A data set of wild mushroom species which have been spotted and photographed by the general public in Denmark, containing over 100,000 images across 1,394 classes.

Quick Draw (Jongejan et al., 2016), Figure A1(i). A data set of 50 million drawings across 345 categories. We take the simplified drawings, which are 28×28 gray-scale bitmaps aligned to the center of the drawing’s bounding box. Considering the size of the full data set, we randomly sample 1,000 images from each category.

VGG Flower (Nilsback and Zisserman, 2008), Figure A1(j). A data set consisting of 8,189 images across 102 flower categories commonly occurring in the United Kingdom. There are between 40 and 258 images in each category.

B Data Preparation

For all the data sets, we resize each image to 256×256 and then crop a 224×224 region from the center (except Quick Draw, which is already aligned to the center of the drawing’s bounding box, so we directly resize Quick Draw images to 224×224).
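A sketch of this preprocessing using torchvision transforms is shown below; the normalization statistics are the usual ImageNet values and are an assumption, since the text does not state them.

```python
from torchvision import transforms

standard_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Quick Draw bitmaps are already centered, so they are resized directly to 224x224.
quickdraw_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```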

C Test Protocol

In this work, for all the methods, the training process is performed solely on the ILSVRC data set. For our library-based methods, this is followed by hyperparameter validation on the CUB birds data set. After that, each method is tested on the remaining eight data sets without further tuning.

To be more specific, for the library-based methods we only use the pre-trained (on the ILSVRC data set) models, while for the meta-learning-based methods, we randomly split ILSVRC into a base (training) set of 900 classes for meta-training and a validation set of the remaining 100 classes.

In order to evaluate a few-shot learner on a data set, for an “m-way, k-shot” classification problem, we randomly select m different classes, and then randomly select k images from each class for training (equivalent to ‘support’ images in the meta-learning literature). We then randomly select images from the rest of each class for testing (equivalent to ‘query’ images in the meta-learning literature). We perform this evaluation for m in {5, 20, 40} and k in {1, 5}.

For the library-based methods, the query size is set to 15 (except for the FGCVx Fungi data set). For the meta-learning-based methods, due to GPU memory constraints, for each class in a task we used 15 query images for 5-way, 10 query images for 10-way, and 5 query images for 40-way problems, with the only exception being the Fungi data set.

(a) ILSVRC-2012
(b) CUB-200-2011 Birds
(c) Aircraft
(d) FC100
(e) Omniglot
(f) Texture
(g) Traffic Sign
(h) Fungi
(i) Quick Draw
(j) VGG Flower
Appendix, Figure A1: Images sampled from the data sets used in our experiments.

The Fungi data set has several classes with a very small number of images (6 being the minimum). Hence, for the Fungi data set, we use 5 query images per class for 1-shot problems and 1 query image per class for 5-shot problems. Finally, for every problem, we report the mean over 600 randomly generated test tasks, along with 95% confidence intervals.

2 Hyper-parameters for Library-Based Learners

In order to achieve an “out-of-the-box” few-shot learner that can be used on any (very small) training set without additional data or knowledge of the underlying distribution, an extensive hyper-parameter search is performed on the validation data set (CUB birds). The hyper-parameters found to be best are then applied in the few-shot training and testing phases for each of our library-based methods. The hyper-parameters we consider include the learning rate, the number of training epochs, the L2-regularization penalty constant, the number of neurons in the MLP hidden layer, and whether to drop the hidden layer altogether. A separate hyper-parameter search is used for 5-way, 20-way, and 40-way 1-shot classification; 5-shot problems use the same hyper-parameters as 1-shot ones. Experiments suggest that dropping the hidden layer did not significantly help performance, but did occasionally hurt performance significantly on the validation set; as a result, we always use a hidden layer. We train all our methods using the Adam optimizer. The hyper-parameter details can be found in Table A6.

3 Competitive Methods

Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
Baseline 36.6 0.7 40.3 0.8 65.2 1.0 42.9 0.7 57.6 0.9 35.9 0.9 49.0 0.9 67.9 0.9
Baseline++ 33.9 0.7 39.3 0.7 59.8 0.7 40.8 0.7 58.8 0.8 35.4 0.9 46.6 0.8 58.6 0.9
MAML 26.5 0.6 39.4 0.8 50.7 1.0 38.1 0.9 45.6 0.8 34.8 1.0 46.0 1.0 54.7 0.9
MatchingNet 29.9 0.6 38.2 0.8 51.7 1.0 39.3 0.7 55.2 0.8 38.1 0.9 50.2 0.8 51.8 0.9
ProtoNet 31.8 0.6 40.9 0.8 79.2 0.8 42.0 0.8 54.4 0.9 36.7 0.9 55.8 1.0 59.1 0.9
RelationNet 31.2 0.6 46.3 0.9 69.6 0.9 41.0 0.8 54.6 0.8 36.8 1.0 52.5 0.9 55.5 0.9
Meta-transfer 30.4 0.6 57.6 0.9 78.9 0.8 50.1 0.8 62.3 0.8 45.8 1.0 58.4 0.9 73.2 0.9
FEAT 33.2 0.7 42.1 0.8 69.8 0.9 51.8 0.9 49.0 0.8 46.9 1.0 53.1 0.8 75.3 0.9
Worst library- 40.9 0.9 50.8 0.9 77.2 0.8 59.0 0.9 55.5 0.8 53.0 0.9 57.3 0.9 79.7 0.8
based RN18 DN121 RN152 DN169 RN152 DN201 RN101 RN18
Best library- 46.1 1.0 61.2 0.9 86.5 0.6 65.1 0.9 66.6 0.9 56.6 0.9 62.8 0.9 83.2 0.8
based DN161 RN152 DN121 RN101 DN201 DN121 RN18 DN201
Table A1: Comparing competitive methods with the simple library-based learners, on the 5-way, 1-shot problem.
Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
Baseline 11.4 0.2 15.4 0.3 38.9 0.6 17.6 0.3 27.8 0.4 13.1 0.3 21.9 0.4 38.8 0.4
Baseline++ 10.1 0.2 15.8 0.3 39.2 0.4 18.1 0.3 31.5 0.3 13.7 0.3 22.5 0.3 33.1 0.4
MAML 7.6 0.1 17.2 0.3 29.1 0.4 14.8 0.2 19.4 0.3 11.5 0.3 19.7 0.3 21.8 0.3
MatchingNet 7.7 0.1 17.0 0.3 44.6 0.5 19.8 0.3 26.2 0.3 16.1 0.4 26.7 0.4 29.9 0.4
ProtoNet 11.5 0.2 20.1 0.3 58.8 0.5 20.0 0.3 29.5 0.4 16.2 0.4 30.7 0.4 40.7 0.4
RelationNet 11.0 0.2 18.7 0.3 51.3 0.5 18.6 0.3 28.5 0.3 15.9 0.4 29.1 0.4 35.5 0.4
Meta-transfer 12.8 0.2 30.9 0.4 58.6 0.5 27.2 0.4 35.8 0.4 23.1 0.4 35.2 0.4 52.0 0.5
FEAT 15.1 0.3 18.9 0.4 51.1 0.6 28.9 0.5 31.7 0.4 25.7 0.4 28.7 0.5 56.5 0.6
Worst library- 20.1 0.3 27.8 0.4 56.2 0.5 38.0 0.4 29.7 0.3 31.7 0.4 33.3 0.5 62.4 0.5
based RN101 DN121 RN101 RN18 RN101 RN101 RN101 RN101
Best library- 24.3 0.3 36.4 0.4 69.1 0.4 42.5 0.4 38.5 0.4 33.9 0.5 39.5 0.5 70.0 0.5
based DN161 RN152 DN121 DN152 DN201 DN161 DN201 DN161
Table A2: Comparing competitive methods with the simple library-based learners, on the 20-way, 1-shot problem.
Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
Baseline 6.9 0.2 9.7 0.1 27.1 0.4 11.2 0.1 20.0 0.3 8.2 0.2 14.8 0.3 28.7 0.3
Baseline++ 6.1 0.1 10.1 0.1 29.9 0.3 12.1 0.2 23.2 0.2 8.7 0.2 15.9 0.2 24.2 0.3
MatchingNet 5.1 0.1 11.6 0.2 32.5 0.4 15.0 0.2 19.3 0.2 11.0 0.2 19.0 0.3 26.9 0.3
ProtoNet 7.2 0.2 13.8 0.2 47.2 0.4 14.4 0.2 21.6 0.3 11.3 0.2 22.9 0.3 31.0 0.3
RelationNet 6.2 0.2 13.8 0.2 45.2 0.4 11.4 0.2 18.2 0.2 10.7 0.2 21.1 0.3 28.1 0.3
Meta-transfer 7.6 0.2 20.6 0.2 37.5 0.3 19.2 0.2 24.4 0.2 10.2 0.2 22.2 0.2 40.4 0.3
FEAT 10.1 0.2 12.9 0.3 35.7 0.4 21.8 0.3 24.9 0.3 18.0 0.3 19.4 0.3 47.6 0.4
Worst library- 14.2 0.2 19.6 0.2 47.3 0.3 28.8 0.2 22.2 0.2 23.7 0.3 26.4 0.3 53.1 0.3
based RN34 RN18 RN152 RN18 RN152 RN34 RN152 RN34
Best library- 17.4 0.2 27.2 0.3 61.6 0.3 33.2 0.3 29.5 0.2 26.8 0.3 31.2 0.3 62.8 0.3
based DN161 RN152 DN201 DN152 DN201 DN161 DN201 DN161
Table A3: Comparing competitive methods with the simple library-based learners, on the 40-way, 1-shot problem.

For methods that require a pre-trained CNN (FEAT and Meta-transfer), we use the ResNet18 from the PyTorch model library, pre-trained on ILSVRC (Russakovsky et al., 2015), as the backbone. We follow the hyper-parameter settings from (Sun et al., 2019) and (Ye et al., 2020) and perform meta-training on the ILSVRC data set before testing on the eight other data sets. To be more specific, the meta-training settings are as follows:

Meta-transfer Learning
The base-learner is optimized by batch gradient descent with the learning rate of and gets updated every 50 steps. The meta learner is optimized by Adam optimizer with an initial learning rate of , and decaying 50% every 1,000 iterations until .

FEAT

The vanilla stochastic gradient descent with Nesterov acceleration is applied. The initial rate is set to and decreases every 40 steps with a decay rate of and momentum of 0.9. The learning rate for the scheduler is set to 0.5.

Baselines
We train the two pre-training-based methods, Baseline and Baseline++ (Chen et al., 2019), following the hyper-parameters suggested by the original authors, except that we train them on ILSVRC data as opposed to mini-ImageNet (Ravi and Larochelle, 2016). During the training stage, we train for 50 epochs with a batch size of 16. In the original paper, the authors trained for 400 epochs on the base set of mini-ImageNet, consisting of 64 classes. Mini-ImageNet has 600 images per class, whereas ILSVRC has an average of around 1,200 images per class. So, the total number of batches trained in our baselines is roughly (1,000 × 1,200 × 50)/16 ≈ 3.8 million, as opposed to (64 × 600 × 400)/16 = 960,000 in the original paper.

Metric-Learning Methods and MAML
For the three most popular metric-learning methods, MatchingNet (Vinyals et al., 2016), ProtoNet (Snell et al., 2017) and RelationNet (Sung et al., 2018), again we follow the implementation and hyper-parameters provided by (Chen et al., 2019).

All the metric-learning methods and MAML (Finn et al., 2017) are trained using the Adam optimizer. For MAML, the inner learning rate is kept at the value suggested by the original authors. Following (Chen et al., 2019), we make the following modifications: for MatchingNet, we use an FCE classification layer without fine-tuning and also multiply the cosine similarity by a constant scalar; for RelationNet, we replace the L2 norm with a softmax layer to expedite training; for MAML, we use a first-order approximation of the gradient for memory efficiency. Even then, we could not train MAML for 40-way problems on a single GPU due to memory shortage. Hence, we drop MAML from the 40-way experiments.

During the meta-training stage of all these methods, we train 150,000 episodes for 5-way 1-shot, 5-way 5-shot, and 20-way 1-shot problems, and 50,000 episodes for all other problems on the base split of ILSVRC. Here, an episode refers to one meta-learning task that includes training on the ‘support’ images and testing on the ‘query’ images. The stopping episode number is chosen based on when the validation accuracy stops improving significantly. In the original paper, the authors trained 60,000 episodes for 1-shot tasks and 40,000 episodes for 5-shot tasks; we observed that those numbers are too low for certain problems in terms of validation accuracy, and hence we allow more episodes. We select the model with the best accuracy on the validation set of ILSVRC for meta-testing on the other data sets.

4 Complete Results

Here we conduct additional experiments which are not reported in our main paper.

Single Library Learners vs. Competitive Methods
Tables A1, A2, and A3 show the performance comparison between the single library learners and the competitive methods, including Baseline, Baseline++, MAML, MatchingNet, ProtoNet, RelationNet, Meta-transfer Learning, and FEAT. The comparison is conducted on 1-shot problems under the 5-way, 20-way, and 40-way settings.

Aircraft FC100 Omniglot Texture Traffic Fungi QDraw Flower
5-way, 1-shot
Full Library 44.9 0.9 60.9 0.9 88.4 0.7 68.4 0.8 66.2 0.9 57.7 1.0 65.4 0.9 86.3 0.8
BiT-ResNet-101-3 42.2 1.0 58.5 0.9 76.2 1.1 63.6 0.9 47.4 0.8 63.7 0.9 54.9 0.9 98.0 0.3
BiT-ResNet-152-4 40.5 0.9 61.0 0.9 78.4 0.9 65.6 0.8 48.4 0.8 62.2 1.0 55.4 0.9 97.9 0.3
BiT-ResNet-50-1 45.0 1.0 61.9 0.9 79.4 0.9 68.5 0.9 56.1 0.8 61.6 1.0 54.0 0.9 98.5 0.2
20-way, 1-shot
Full Library 25.2 0.3 36.8 0.4 76.4 0.4 46.3 0.4 40.0 0.4 37.6 0.5 44.2 0.5 75.5 0.5
BiT-ResNet-101-3 19.9 0.3 34.5 0.4 53.4 0.6 44.2 0.4 24.6 0.3 41.6 0.5 32.0 0.4 95.7 0.2
BiT-ResNet-152-4 18.2 0.3 37.2 0.4 51.0 0.6 44.3 0.4 23.6 0.3 40.3 0.5 29.0 0.4 95.0 0.2
BiT-ResNet-50-1 22.6 0.4 36.1 0.4 58.1 0.5 45.7 0.4 28.3 0.3 39.2 0.4 30.0 0.4 95.4 0.2
40-way, 1-shot
Full Library 18.6 0.2 27.3 0.2 67.7 0.3 37.0 0.3 30.3 0.2 29.3 0.3 34.5 0.3 67.6 0.3
BiT-ResNet-101-3 12.7 0.2 24.5 0.2 19.7 0.4 34.4 0.3 15.9 0.2 30.7 0.3 13.5 0.2 91.9 0.2
BiT-ResNet-152-4 12.4 0.2 26.9 0.2 30.8 0.5 35.8 0.3 16.0 0.2 30.7 0.3 18.7 0.3 91.4 0.2
BiT-ResNet-50-1 15.8 0.2 27.5 0.2 47.6 0.4 37.4 0.3 19.7 0.2 30.6 0.3 21.7 0.2 93.4 0.2
Table A4: Comparing a few-shot learner utilizing the full library of nine ILSVRC2012-trained deep CNNs with the larger CNNs trained on the full ImageNet.
Aircraft FC100 Omniglot Texture Traffic Fungi Quick Draw VGG Flower
5-way, 1-shot
Full Library 44.9 0.9 60.9 0.9 88.4 0.7 68.4 0.8 66.2 0.9 57.7 1.0 65.4 0.9 86.3 0.8
Hard Ensemble 45.0 0.9 60.0 0.9 88.4 0.6 67.9 0.9 65.2 0.8 58.1 0.9 64.7 1.0 84.9 0.8
Soft Ensemble 44.2 0.9 61.0 0.9 88.2 0.6 67.4 0.9 63.2 0.8 57.1 1.0 65.2 0.9 86.3 0.7
Best Single 46.1 1.0 61.2 0.9 86.5 0.6 65.1 0.9 66.6 0.9 56.6 0.9 62.8 0.9 83.2 0.8
20-way, 1-shot
Full Library 25.2 0.3 36.8 0.4 76.4 0.4 46.3 0.4 40.0 0.4 37.6 0.5 44.2 0.5 75.5 0.5
Hard Ensemble 23.9 0.3 36.3 0.4 73.4 0.4 45.7 0.4 38.2 0.4 36.4 0.4 42.7 0.4 73.3 0.5
Soft Ensemble 24.2 0.3 36.4 0.4 73.8 0.4 46.3 0.5 37.7 0.4 37.1 0.4 43.4 0.5 74.3 0.5
Best Single 24.3 0.3 36.4 0.4 69.1 0.4 42.5 0.4 38.5 0.4 33.9 0.5 39.5 0.5 70.0 0.5
40-way, 1-shot
Full Library 18.6 0.2 27.3 0.2 67.7 0.3 37.0 0.3 30.3 0.2 29.3 0.3 34.5 0.3 67.6 0.3
Hard Ensemble 17.7 0.2 27.3 0.2 65.8 0.3 36.3 0.3 29.0 0.2 28.7 0.3 33.7 0.3 66.1 0.3
Soft Ensemble 18.1 0.2 27.6 0.3 66.2 0.3 37.0 0.3 29.0 0.2 29.1 0.2 34.1 0.3 67.4 0.3
Best Single 17.4 0.2 27.2 0.3 61.6 0.3 33.2 0.3 29.5 0.2 26.8 0.3 31.2 0.3 62.8 0.3
Table A5: Accuracy obtained using all nine library CNNs as the basis for a few-shot learner.
Number of Hidden Learning Regularization
Epochs Size Rate Constant
5-way, 1-shot and 5-shot
DenseNet121 200 1024 0.2
DenseNet161 100 1024 0.2
DenseNet169 300 1024 0.5
DenseNet201 100 512 0.5
ResNet18 200 512 0.2
ResNet34 100 1024 0.2
ResNet50 300 2048 0.1
ResNet101 100 512 0.1
ResNet152 300 512 0.1
Full Library 300 1024 0.1
BiT-ResNet-101-3 300 4096 0.7
BiT-ResNet-152-4 300 2048 0.7
BiT-ResNet-50-1 200 2048 0.5
20-way, 1-shot and 5-shot
DenseNet121 100 1024 0.2
DenseNet161 100 512 0.1
DenseNet169 300 512 0.1
DenseNet201 200 1024 0.1
ResNet18 200 2048 0.1
ResNet34 100 1024 0.1
ResNet50 100 1024 0.1
ResNet101 200 2048 0.2
ResNet152 100 512 0.2
Full Library 100 512 0.1
BiT-ResNet-101-3 300 2048 0.5
BiT-ResNet-152-4 300 1024 0.5
BiT-ResNet-50-1 100 2048 0.9
40-way, 1-shot and 5-shot
DenseNet121 100 2048 0.1
DenseNet161 100 512 0.1
DenseNet169 100 512 0.2
DenseNet201 100 1024 0.1
ResNet18 100 512 0.1
ResNet34 100 2048 0.2
ResNet50 100 512 0.1
ResNet101 100 512 0.1
ResNet152 100 1024 0.1
Full Library 100 1024 0.1
BiT-ResNet-101-3 300 512 0.7
BiT-ResNet-152-4 200 1024 0.5
BiT-ResNet-50-1 300 1024 0.5
Table A6: Hyper-parameter settings for different backbones in 5, 20, and 40 ways.

Full Library vs. Google BiT Methods
Table A4 shows the performance comparison between the full library method and the Google BiT methods on 1-shot problems under the 5-way, 20-way, and 40-way settings. The full library method utilizes a library of nine ILSVRC-trained CNNs, while each Google BiT method uses a single deep CNN trained on the full ImageNet.

Full Library vs. Hard and Soft Ensemble Methods
Table A5 compares the performance of the full library method with the hard and soft ensemble methods on 1-shot problems under the 5-way, 20-way, and 40-way settings.

References

  • W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. arXiv preprint arXiv:1904.04232. Cited by: §2.2, §2.3, §3, §3, §3, §6.
  • M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). Cited by: §A.
  • G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto (2019) A baseline for few-shot image classification. arXiv preprint arXiv:1909.02729. Cited by: §6.
  • N. Dvornik, C. Schmid, and J. Mairal (2019) Diversity with cooperation: ensemble methods for few-shot classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3723–3731. Cited by: §6.
  • L. Fei-Fei, R. Fergus, and P. Perona (2006) One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 (4), pp. 594–611. Cited by: §1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §1, §2.2, §2.3, §3.
  • C. Finn, K. Xu, and S. Levine (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527. Cited by: §1, §2.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.2, §2.2.
  • S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel (2013) Detection of traffic signs in real-world images: the german traffic sign detection benchmark. In The 2013 international joint conference on neural networks (IJCNN), pp. 1–8. Cited by: §A, §2.2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.2.
  • J. Jongejan, H. Rowley, T. Kawashima, J. Kim, and N. Fox-Gieg (2016) The quick, draw!-ai experiment. Mount View, CA, accessed Feb 17 (2018), pp. 4. Cited by: §A, §2.2.
  • T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn (2018) Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836. Cited by: §1, §2.3.
  • A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby (2019) Big transfer (bit): general visual representation learning. arXiv preprint arXiv:1912.11370. Cited by: §1, §4, §6.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §A.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §A, §2.2.
  • K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665. Cited by: §1.
  • Y. Lee and S. Choi (2018) Gradient-based meta-learning with learned layerwise metric and subspace. arXiv preprint arXiv:1801.05558. Cited by: §1, §2.3.
  • H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang (2019) Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10. Cited by: §1.
  • S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. CoRR abs/1306.5151. External Links: Link, 1306.5151 Cited by: §A, §2.2.
  • A. Nichol and J. Schulman (2018) Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999. Cited by: §1, §2.3.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §A, §2.2.
  • B. Oreshkin, P. Rodríguez López, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. Advances in Neural Information Processing Systems 31, pp. 721–731. Cited by: §A, §2.2, §6.
  • S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. Cited by: §3.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §A, §1, §3.
  • A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960. Cited by: §6.
  • B. Schroeder and Y. Cui (2018) Fgvcx fungi classification challenge 2018. Note: https://github.com/visipedia/fgvcx_fungi_comp Cited by: §A, §2.2.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §2.2, §3.
  • C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pp. 843–852. Cited by: §1.
  • Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019) Meta-transfer learning for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 403–412. Cited by: §2.2, §3, §6.
  • F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §2.2, §3.
  • Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020) Rethinking few-shot image classification: a good embedding is all you need?. arXiv preprint arXiv:2003.11539. Cited by: §6.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §2.2, §3.
  • Y. Wang, R. Girshick, M. Hebert, and B. Hariharan (2018) Low-shot learning from imaginary data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7278–7286. Cited by: §6.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §A, §2.2.
  • H. Ye, H. Hu, D. Zhan, and F. Sha (2020) Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8808–8817. Cited by: §2.2, §3.