A Closer Look at Few-shot Classification

04/08/2019 ∙ by Wei-Yu Chen, et al. ∙ Georgia Institute of Technology ∙ Carnegie Mellon University ∙ Virginia Polytechnic Institute and State University ∙ National Taiwan University

Few-shot classification aims to learn a classifier that recognizes classes unseen during training from limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the mini-ImageNet and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability of few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.


Code Repositories

CloserLookFewShot

source code to ICLR'19, 'A Closer Look at Few-shot Classification'



1 Introduction

Deep learning models have achieved state-of-the-art performance on visual recognition tasks such as image classification. This strong performance, however, relies heavily on training a network with abundant labeled instances with diverse visual variations (e.g., thousands of examples for each new class, even with pre-training on a large-scale dataset with base classes). The human annotation cost as well as the scarcity of data in some classes (e.g., rare species) significantly limit the ability of current vision systems to learn new visual concepts efficiently. In contrast, the human visual system can recognize new classes from extremely few labeled examples. It is thus of great interest to learn to generalize to new classes with a limited number of labeled examples for each novel class.

The problem of learning to generalize to unseen classes during training, known as few-shot classification, has attracted considerable attention Vinyals et al. (2016); Snell et al. (2017); Finn et al. (2017); Ravi & Larochelle (2017); Sung et al. (2018); Garcia & Bruna (2018); Qi et al. (2018). One promising direction to few-shot classification is the meta-learning paradigm where transferable knowledge is extracted and propagated from a collection of tasks to prevent overfitting and improve generalization. Examples include model initialization based methods Ravi & Larochelle (2017); Finn et al. (2017), metric learning methods Vinyals et al. (2016); Snell et al. (2017); Sung et al. (2018), and hallucination based methods Antoniou et al. (2018); Hariharan & Girshick (2017); Wang et al. (2018). Another line of work Gidaris & Komodakis (2018); Qi et al. (2018) also demonstrates promising results by directly predicting the weights of the classifiers for novel classes.

Limitations.

While many few-shot classification algorithms have reported improved performance over the state-of-the-art, there are two main challenges that prevent us from making a fair comparison and measuring the actual progress. First, the discrepancy of the implementation details among multiple few-shot learning algorithms obscures the relative performance gain. The performance of baseline approaches can also be significantly under-estimated (e.g., training without data augmentation). Second, while the current evaluation focuses on recognizing novel classes with limited training examples, these novel classes are sampled from the same dataset. The lack of domain shift between the base and novel classes makes the evaluation scenarios unrealistic.

Our work.

In this paper, we present a detailed empirical study to shed new light on the few-shot classification problem. First, we conduct consistent comparative experiments to compare several representative few-shot classification methods on common ground. Our results show that using a deep backbone shrinks the performance gap between different methods in the setting of limited domain differences between base and novel classes. Second, by replacing the linear classifier with a distance-based classifier as used in Gidaris & Komodakis (2018); Qi et al. (2018), the baseline method is surprisingly competitive with current state-of-the-art meta-learning algorithms. Third, we introduce a practical evaluation setting where there exists domain shift between base and novel classes (e.g., sampling base classes from generic object categories and novel classes from fine-grained categories). Our results show that sophisticated few-shot learning algorithms do not provide performance improvement over the baseline under this setting. By making the source code and model implementations with a consistent evaluation setting publicly available, we hope to foster future progress in the field (footnote 1: https://github.com/wyharveychen/CloserLookFewShot).

Our contributions.

  1. We provide a unified testbed for several different few-shot classification algorithms for a fair comparison. Our empirical evaluation results reveal that the use of a shallow backbone commonly used in existing work leads to favorable results for methods that explicitly reduce intra-class variation. Increasing the model capacity of the feature backbone reduces the performance gap between different methods when domain differences are limited.

  2. We show that a baseline method with a distance-based classifier surprisingly achieves competitive performance with the state-of-the-art meta-learning methods on both mini-ImageNet and CUB datasets.

  3. We investigate a practical evaluation setting where base and novel classes are sampled from different domains. We show that current few-shot classification algorithms fail to address such domain shifts and are inferior even to the baseline method, highlighting the importance of learning to adapt to domain differences in few-shot learning.

2 Related Work

Given abundant training examples for the base classes, few-shot learning algorithms aim to learn to recognize novel classes with a limited number of labeled examples. Much effort has been devoted to overcoming the data efficiency issue. In the following, we discuss representative few-shot learning algorithms organized into three main categories: initialization based, metric learning based, and hallucination based methods.

Initialization based methods

tackle the few-shot learning problem by “learning to fine-tune”. One approach aims to learn a good model initialization (i.e., the parameters of a network) so that the classifiers for novel classes can be learned with a limited number of labeled examples and a small number of gradient update steps Finn et al. (2017, 2018); Nichol & Schulman (2018); Rusu et al. (2019). Another line of work focuses on learning an optimizer. Examples include the LSTM-based meta-learner for replacing the stochastic gradient descent optimizer Ravi & Larochelle (2017) and the weight-update mechanism with an external memory Munkhdalai & Yu (2017). While these initialization based methods are capable of achieving rapid adaptation with a limited number of training examples for novel classes, our experiments show that these methods have difficulty in handling domain shifts between base and novel classes.

Distance metric learning based methods

address the few-shot classification problem by “learning to compare”. The intuition is that if a model can determine the similarity of two images, it can classify an unseen input image with the labeled instances Koch et al. (2015). To learn sophisticated comparison models, meta-learning based methods make their prediction conditioned on the distance or metric to a few labeled instances during the training process. Examples of distance metrics include cosine similarity Vinyals et al. (2016), Euclidean distance to class-mean representation Snell et al. (2017), CNN-based relation module Sung et al. (2018), ridge regression Bertinetto et al. (2019), and graph neural network Garcia & Bruna (2018). In this paper, we compare the performance of three distance metric learning methods. Our results show that a simple baseline method with a distance-based classifier (without training over a collection of tasks/episodes as in meta-learning) achieves competitive performance with respect to other sophisticated algorithms.

Besides meta-learning methods, both Gidaris & Komodakis (2018) and Qi et al. (2018) develop methods similar to our Baseline++ (described later in Section 3.2). The method in Gidaris & Komodakis (2018) learns a weight generator to predict the novel class classifier using an attention-based mechanism (cosine similarity), while Qi et al. (2018) directly use novel class features as their weights. Our Baseline++ can be viewed as a simplified architecture of these methods. Our focus, however, is to show that simply reducing intra-class variation in a baseline method using the base class data leads to competitive performance.

Hallucination based methods

directly deal with data deficiency by “learning to augment”. This class of methods learns a generator from data in the base classes and uses the learned generator to hallucinate new novel class data for data augmentation. One type of generator aims at transferring appearance variations exhibited in the base classes. These generators either transfer variance in base class data to novel classes Hariharan & Girshick (2017), or use GAN models Antoniou et al. (2018) to transfer the style. Another type of generator does not explicitly specify what to transfer, but is directly integrated into a meta-learning algorithm to improve the classification accuracy Wang et al. (2018). Since hallucination based methods often work together with other few-shot methods (e.g., hallucination based and metric learning based methods used together), which complicates the comparison, we do not include these methods in our comparative study and leave them for future work.

Domain adaptation

techniques aim to reduce the domain shifts between source and target domain Pan et al. (2010); Ganin & Lempitsky (2015), as well as novel tasks in a different domain Hsu et al. (2018). Similar to domain adaptation, we also investigate the impact of domain difference on few-shot classification algorithms in Section 4.4. In contrast to most domain adaptation problems, where a large amount of data is available in the target domain (either labeled or unlabeled), our problem setting provides only very few examples in the new domain. Very recently, the method in Dong & Xing (2018) addresses the one-shot novel category domain adaptation problem, where in the testing stage both the domain and the category to classify are changed. Similarly, our work highlights the limitations of existing few-shot classification algorithms in handling domain shift. To put these problem settings in context, we provide a detailed comparison of the setting differences in appendix A1.

3 Overview of Few-shot classification Algorithms

Figure 1: Baseline and Baseline++ few-shot classification methods. Both the baseline and baseline++ method train a feature extractor and classifier with base class data in the training stage. In the fine-tuning stage, we fix the network parameters in the feature extractor and train a new classifier with the given labeled examples in novel classes. The baseline++ method differs from the baseline model in the use of cosine distances between the input feature and the weight vector for each class that aims to reduce intra-class variations.

In this section, we first outline the details of the baseline model (Section 3.1) and its variant (Section 3.2), followed by describing representative meta-learning algorithms (Section 3.3) studied in our experiments. Given abundant base class labeled data $\mathbf{X}_b$ and a small amount of novel class labeled data $\mathbf{X}_n$, the goal of few-shot classification algorithms is to train classifiers for novel classes (unseen during training) with few labeled examples.

3.1 Baseline

Our baseline model follows the standard transfer learning procedure of network pre-training and fine-tuning. Figure 1 illustrates the overall procedure.

Training stage.

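We train a feature extractor $f_\theta$ (parametrized by the network parameters $\theta$) and the classifier $C(\cdot \mid \mathbf{W}_b)$ (parametrized by the weight matrix $\mathbf{W}_b \in \mathbb{R}^{d \times c}$) from scratch by minimizing a standard cross-entropy classification loss $L_{pred}$ using the training examples in the base classes $\mathbf{x}_i \in \mathbf{X}_b$. Here, we denote the dimension of the encoded feature as $d$ and the number of output classes as $c$. The classifier $C(\cdot \mid \mathbf{W}_b)$ consists of a linear layer $\mathbf{W}_b^\top f_\theta(\mathbf{x}_i)$ followed by a softmax function $\sigma$.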

Fine-tuning stage.

To adapt the model to recognize novel classes in the fine-tuning stage, we fix the pre-trained network parameters $\theta$ in our feature extractor $f_\theta$ and train a new classifier $C(\cdot \mid \mathbf{W}_n)$ (parametrized by the weight matrix $\mathbf{W}_n$) by minimizing $L_{pred}$ using the few labeled examples (i.e., the support set) in the novel classes $\mathbf{X}_n$.
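The fine-tuning stage can be illustrated with a minimal NumPy sketch. It assumes the support-set features have already been produced by the frozen feature extractor, and it trains only a new linear softmax classifier on them. The function names, the plain SGD update, the learning rate, and the toy features are assumptions made for illustration; only the 100 iterations and batch size of 4 come from the implementation details in Section 4.1.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def finetune_linear_classifier(feats, labels, n_classes,
                               iters=100, batch_size=4, lr=0.01, seed=0):
    """Train a new weight matrix (d x n_classes) on frozen features.

    Plain mini-batch SGD on the cross-entropy loss; the optimizer and
    learning rate are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((feats.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(iters):
        idx = rng.choice(len(feats), size=batch_size, replace=False)
        probs = softmax(feats[idx] @ W)
        # Gradient of the mean cross-entropy with respect to W.
        W -= lr * feats[idx].T @ (probs - onehot[idx]) / batch_size
    return W

# Toy 5-way 5-shot support set: hypothetical 16-d features clustered per class.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 5)
centers = 3.0 * np.eye(5, 16)
feats = centers[labels] + 0.1 * rng.normal(size=(25, 16))

W_n = finetune_linear_classifier(feats, labels, n_classes=5)
train_acc = float(((feats @ W_n).argmax(axis=1) == labels).mean())
```

Because the feature extractor is never updated here, the only trainable quantity is the small weight matrix, which is what keeps this step cheap even with very few labeled examples.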

3.2 Baseline++

In addition to the baseline model, we also implement a variant of the baseline model, denoted as Baseline++, which explicitly reduces intra-class variation among features during training. The importance of reducing intra-class variations of features has been highlighted in deep metric learning Hu et al. (2015) and few-shot classification methods Gidaris & Komodakis (2018).

The training procedure of Baseline++ is the same as that of the original Baseline model except for the classifier design. As shown in Figure 1, we still have a weight matrix $\mathbf{W}_b \in \mathbb{R}^{d \times c}$ of the classifier in the training stage and a $\mathbf{W}_n$ in the fine-tuning stage in Baseline++. The classifier design, however, is different from the linear classifier used in the Baseline. Take the weight matrix $\mathbf{W}_b$ as an example. We can write the weight matrix $\mathbf{W}_b$ as $[\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_c]$, where each class has a $d$-dimensional weight vector. In the training stage, for an input feature $f_\theta(\mathbf{x}_i)$ where $\mathbf{x}_i \in \mathbf{X}_b$, we compute its cosine similarity to each weight vector $[\mathbf{w}_1, \ldots, \mathbf{w}_c]$ and obtain the similarity scores $[s_{i,1}, s_{i,2}, \ldots, s_{i,c}]$ for all classes, where $s_{i,j} = f_\theta(\mathbf{x}_i)^\top \mathbf{w}_j / (\|f_\theta(\mathbf{x}_i)\| \, \|\mathbf{w}_j\|)$. We can then obtain the prediction probability for each class by normalizing these similarity scores with a softmax function. Here, the classifier makes a prediction based on the cosine distance between the input feature and the learned weight vectors representing each class. Consequently, training the model with this distance-based classifier explicitly reduces intra-class variations. Intuitively, the learned weight vectors $[\mathbf{w}_1, \ldots, \mathbf{w}_c]$ can be interpreted as prototypes (similar to Snell et al. (2017); Vinyals et al. (2016)) for each class, and the classification is based on the distance of the input feature to these learned prototypes. The softmax function prevents the learned weight vectors from collapsing to zeros.
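The cosine-similarity classifier described above can be sketched in a few lines of NumPy. The function names and the fixed `scale` constant are assumptions for illustration (Section 4.1 mentions a learnable scalar instead); the score itself is the cosine between a feature vector and each per-class weight vector, followed by a softmax.

```python
import numpy as np

def cosine_scores(features, W):
    """Cosine similarity between each feature (rows of `features`, shape (n, d))
    and each class weight vector (columns of `W`, shape (d, c))."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = W / np.linalg.norm(W, axis=0, keepdims=True)
    return f @ w  # shape (n, c), every entry lies in [-1, 1]

def cosine_softmax(features, W, scale=10.0):
    """Class probabilities from scaled cosine scores.

    `scale` is a hypothetical fixed constant that widens the [-1, 1] range
    before the softmax.
    """
    s = scale * cosine_scores(features, W)
    s = s - s.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

# Two 2-d features and two class weight vectors (columns of W).
features = np.array([[1.0, 0.0], [0.0, 2.0]])
W = np.array([[2.0, 0.0], [0.0, 1.0]])
scores = cosine_scores(features, W)
probs = cosine_softmax(features, W)
```

Since both the features and the weight vectors are normalized, rescaling either one leaves the scores unchanged; only direction matters, which is why the weight vectors behave like class prototypes.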

We clarify that the network design in Baseline++ is not our contribution. The concept of distance-based classification has been extensively studied in Mensink et al. (2012) and recently has been revisited in the few-shot classification setting Gidaris & Komodakis (2018); Qi et al. (2018).

3.3 Meta-learning algorithms

Figure 2: Meta-learning few-shot classification algorithms. The meta-learning classifier $M(\cdot \mid \mathbf{S})$ is conditioned on the support set $\mathbf{S}$. (Top) In the meta-training stage, the support set $\mathbf{S}_b$ and the query set $\mathbf{Q}_b$ are first sampled from $N$ random classes, and the parameters in $M(\cdot \mid \mathbf{S}_b)$ are then trained to minimize the $N$-way prediction loss $L_{N\text{-way}}$. In the meta-testing stage, the adapted classifier $M(\cdot \mid \mathbf{S}_n)$ can predict novel classes with the support set in the novel classes $\mathbf{S}_n$. (Bottom) The design of $M(\cdot \mid \mathbf{S})$ in different meta-learning algorithms.

Here we describe the formulations of meta-learning methods used in our study. We consider three distance metric learning based methods (MatchingNet Vinyals et al. (2016), ProtoNet Snell et al. (2017), and RelationNet Sung et al. (2018)) and one initialization based method (MAML Finn et al. (2017)). While meta-learning is not clearly defined, Vinyals et al. (2016) considers a few-shot classification method as meta-learning if the prediction is conditioned on a small support set $\mathbf{S}$, because it makes the training procedure explicitly learn to learn from a given small support set.

As shown in Figure 2, meta-learning algorithms consist of a meta-training and a meta-testing stage. In the meta-training stage, the algorithm first randomly selects $N$ classes, and samples a small base support set $\mathbf{S}_b$ and a base query set $\mathbf{Q}_b$ from data samples within these classes. The objective is to train a classification model $M$ that minimizes the $N$-way prediction loss $L_{N\text{-way}}$ of the samples in the query set $\mathbf{Q}_b$. Here, the classifier $M$ is conditioned on the provided support set $\mathbf{S}_b$. By making predictions conditioned on the given support set, a meta-learning method can learn how to learn from limited labeled data through training on a collection of tasks (episodes). In the meta-testing stage, all novel class data $\mathbf{X}_n$ are considered as the support set for novel classes $\mathbf{S}_n$, and the classification model $M$ can be adapted to predict novel classes with the new support set $\mathbf{S}_n$.
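Episode construction, as described above, can be sketched as follows. This is a minimal illustration that assumes the dataset is represented only by an array of integer class labels; the function name `sample_episode` and the toy label array are hypothetical. The defaults of 5 classes and 16 query instances per class follow Section 4.1.

```python
import numpy as np

def sample_episode(labels, n_way=5, k_shot=5, n_query=16, rng=None):
    """Sample one episode: pick `n_way` classes at random, then draw disjoint
    support and query indices for each selected class."""
    rng = np.random.default_rng() if rng is None else rng
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                  # k labeled instances per class
        query.extend(idx[k_shot:k_shot + n_query])    # held-out query instances
    return classes, np.array(support), np.array(query)

# Toy dataset: 20 classes with 30 images each, identified by index.
labels = np.repeat(np.arange(20), 30)
classes, support_idx, query_idx = sample_episode(
    labels, n_way=5, k_shot=5, n_query=16, rng=np.random.default_rng(0))
```

The returned indices can be used to slice images or precomputed features; because the support and query indices come from one permutation per class, they never overlap.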

Different meta-learning methods differ in their strategies for making predictions conditioned on the support set (see Figure 2). For both MatchingNet Vinyals et al. (2016) and ProtoNet Snell et al. (2017), the prediction of the examples in a query set is based on comparing the distance between the query feature and the support feature from each class. MatchingNet compares the cosine distance between the query feature and each support feature, and computes the average cosine distance for each class, while ProtoNet compares the Euclidean distance between query features and the class mean of support features. RelationNet Sung et al. (2018) shares a similar idea, but it replaces the distance with a learnable relation module. The MAML method Finn et al. (2017) is an initialization based meta-learning algorithm, where each support set is used to adapt the initial model parameters using a few gradient updates. As different support sets have different gradient updates, the adapted model is conditioned on the support set. Note that when the query set instances are predicted by the adapted model in the meta-training stage, the loss of the query set is used to update the initial model, not the adapted model.
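The two distance-based prediction rules can be contrasted in a short NumPy sketch that operates on precomputed feature vectors. Both functions are simplified assumptions: the ProtoNet-style rule assigns each query to the nearest class mean in squared Euclidean distance, and the MatchingNet-style rule averages cosine similarities per class as described above, leaving out the attention and FCE components of the full method. The function names and toy data are hypothetical.

```python
import numpy as np

def protonet_predict(support, support_labels, query, n_way):
    """Assign each query to the class whose mean support feature is closest
    in squared Euclidean distance."""
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in range(n_way)])
    d2 = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def matchingnet_predict(support, support_labels, query, n_way):
    """Assign each query to the class with the highest average cosine
    similarity to that class's support features (simplified variant)."""
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    sim = q @ s.T  # (n_query, n_support)
    class_sim = np.stack([sim[:, support_labels == c].mean(axis=1)
                          for c in range(n_way)], axis=1)
    return class_sim.argmax(axis=1)

# Toy 2-way 2-shot episode with 2-d features.
support = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 2.0]])
support_labels = np.array([0, 0, 1, 1])
query = np.array([[3.0, 0.2], [0.2, 3.0]])

proto_pred = protonet_predict(support, support_labels, query, n_way=2)
match_pred = matchingnet_predict(support, support_labels, query, n_way=2)
```

On this toy episode both rules agree; they can differ when feature norms vary strongly, since the cosine rule ignores magnitude while the Euclidean rule does not.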

4 Experimental Results

4.1 Experimental setup

Datasets and scenarios.

We address the few-shot classification problem under three scenarios: 1) generic object recognition, 2) fine-grained image classification, and 3) cross-domain adaptation.

For object recognition, we use the mini-ImageNet dataset commonly used in evaluating few-shot classification algorithms. The mini-ImageNet dataset consists of a subset of 100 classes from the ImageNet dataset Deng et al. (2009) and contains 600 images for each class. The dataset was first proposed by Vinyals et al. (2016), but recent works use the follow-up setting provided by Ravi & Larochelle (2017), which is composed of randomly selected 64 base, 16 validation, and 20 novel classes.

For fine-grained classification, we use the CUB-200-2011 dataset Wah et al. (2011) (referred to as CUB hereafter). The CUB dataset contains 200 classes and 11,788 images in total. Following the evaluation protocol of Hilliard et al. (2018), we randomly split the dataset into 100 base, 50 validation, and 50 novel classes.

For the cross-domain scenario (mini-ImageNet → CUB), we use mini-ImageNet as our base classes and the 50 validation and 50 novel classes from CUB. Evaluating the cross-domain scenario allows us to understand the effects of domain shifts on existing few-shot classification approaches.

Implementation details.

In the training stage for the Baseline and the Baseline++ methods, we train for 400 epochs with a batch size of 16. In the meta-training stage for meta-learning methods, we train for 60,000 episodes for 1-shot and 40,000 episodes for 5-shot tasks. We use the validation set to select the training episodes with the best accuracy (footnote 2: for example, the exact episodes for experiments on the mini-ImageNet in the 5-shot setting with a four-layer ConvNet are: ProtoNet: 24,600; MatchingNet: 35,300; RelationNet: 37,100; MAML: 36,700). In each episode, we sample $N$ classes to form $N$-way classification ($N$ is 5 in both meta-training and meta-testing stages unless otherwise mentioned). For each class, we pick $k$ labeled instances as our support set and 16 instances for the query set for a $k$-shot task.

In the fine-tuning or meta-testing stage for all methods, we average the results over 600 experiments. In each experiment, we randomly sample 5 classes from the novel classes, and in each class, we also pick $k$ instances for the support set and 16 for the query set. For Baseline and Baseline++, we use the entire support set to train a new classifier for 100 iterations with a batch size of 4. For meta-learning methods, we obtain the classification model conditioned on the support set as in Section 3.3.
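Results averaged over many test experiments are reported in this paper as a mean together with a 95% confidence interval (see the caption of Table 1). A minimal sketch of that summary is shown below; it assumes a normal approximation with the sample standard deviation, which may differ in detail from the released code, and the helper name and accuracy values are made up for illustration.

```python
import numpy as np

def mean_and_ci95(accuracies):
    """Return the mean accuracy and the half-width of a 95% confidence
    interval under a normal approximation (1.96 standard errors)."""
    acc = np.asarray(accuracies, dtype=float)
    mean = acc.mean()
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return mean, half_width

# Hypothetical per-episode accuracies (in percent) from four episodes.
mean, half_width = mean_and_ci95([60.0, 62.0, 64.0, 66.0])
```

With 600 episodes, the half-width shrinks with the square root of the episode count, which is why the intervals in the tables are below one percentage point in most rows.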

All methods are trained from scratch and use the Adam optimizer with initial learning rate $10^{-3}$. We apply standard data augmentation including random crop, left-right flip, and color jitter in both the training and meta-training stages. Some implementation details have been adjusted individually for each method. For Baseline++, we multiply the cosine similarity by a class-wise learnable scalar to adjust the original value range [-1,1] to be more appropriate for the subsequent softmax layer. For MatchingNet, we use an FCE classification layer without fine-tuning in all experiments and also multiply the cosine similarity by a constant scalar. For RelationNet, we replace the L2 norm with a softmax layer to expedite training. For MAML, we use a first-order approximation in the gradient for memory efficiency. The approximation has been shown in the original paper and in our appendix to have nearly identical performance to the full version.
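The first-order approximation for MAML can be illustrated on a toy problem. The sketch below uses a linear regression model rather than an image classifier, purely to keep the gradients explicit; the function names, learning rates, and task data are assumptions. The support set adapts a copy of the initial parameters, and the query-set gradient evaluated at the adapted parameters is applied directly to the initial parameters, ignoring second-order terms.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of the mean squared error for the linear model y_hat = x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

def fomaml_step(w, tasks, inner_lr=0.05, outer_lr=0.1, inner_steps=1):
    """One first-order MAML meta-update over a batch of tasks.

    Each task is a tuple (x_support, y_support, x_query, y_query).
    """
    meta_grad = np.zeros_like(w)
    for xs, ys, xq, yq in tasks:
        w_task = w.copy()
        for _ in range(inner_steps):
            w_task -= inner_lr * mse_grad(w_task, xs, ys)  # adapt on the support set
        # First-order approximation: use the query gradient at the adapted
        # parameters as if it were the gradient for the initial parameters.
        meta_grad += mse_grad(w_task, xq, yq)
    return w - outer_lr * meta_grad / len(tasks)

# One toy task whose target function is y = 2x.
w0 = np.array([0.0])
task = (np.array([[1.0]]), np.array([2.0]), np.array([[1.0]]), np.array([2.0]))
w1 = fomaml_step(w0, [task])
```

The update changes the initial parameters `w0`, not the adapted copy, which mirrors the note in Section 3.3 that the query loss updates the initial model.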

4.2 Evaluation using the Standard Setting

| Method | 1-shot (Reported) | 1-shot (Ours) | 5-shot (Reported) | 5-shot (Ours) |
|---|---|---|---|---|
| Baseline | - | 42.11 ± 0.71 | - | 62.53 ± 0.69 |
| Baseline* (3) | 41.08 ± 0.70 | 36.35 ± 0.64 | 51.04 ± 0.65 | 54.50 ± 0.66 |
| MatchingNet (3) Vinyals et al. (2016) | 43.56 ± 0.84 | 48.14 ± 0.78 | 55.31 ± 0.73 | 63.48 ± 0.66 |
| ProtoNet | - | 44.42 ± 0.84 | - | 64.24 ± 0.72 |
| ProtoNet# Snell et al. (2017) | 49.42 ± 0.78 | 47.74 ± 0.84 | 68.20 ± 0.66 | 66.68 ± 0.68 |
| MAML Finn et al. (2017) | 48.07 ± 1.75 | 46.47 ± 0.82 | 63.15 ± 0.91 | 62.71 ± 0.71 |
| RelationNet Sung et al. (2018) | 50.44 ± 0.82 | 49.31 ± 0.85 | 65.32 ± 0.70 | 66.60 ± 0.69 |

(3) Reported results are from Ravi & Larochelle (2017).

Table 1: Validating our re-implementation. We validate our few-shot classification implementation on the mini-ImageNet dataset using a Conv-4 backbone. We report the mean of 600 randomly generated test episodes as well as the 95% confidence intervals. Our reproduced results for all few-shot methods do not fall behind the reported results in the literature by more than 2%. We attribute the slight discrepancy to different random seeds and minor implementation differences in each method. “Baseline*” denotes the results without applying data augmentation during training. ProtoNet# indicates performing 30-way classification in 1-shot and 20-way in 5-shot during the meta-training stage.
| Method | CUB 1-shot | CUB 5-shot | mini-ImageNet 1-shot | mini-ImageNet 5-shot |
|---|---|---|---|---|
| Baseline | 47.12 ± 0.74 | 64.16 ± 0.71 | 42.11 ± 0.71 | 62.53 ± 0.69 |
| Baseline++ | 60.53 ± 0.83 | 79.34 ± 0.61 | 48.24 ± 0.75 | 66.43 ± 0.63 |
| MatchingNet Vinyals et al. (2016) | 61.16 ± 0.89 | 72.86 ± 0.70 | 48.14 ± 0.78 | 63.48 ± 0.66 |
| ProtoNet Snell et al. (2017) | 51.31 ± 0.91 | 70.77 ± 0.69 | 44.42 ± 0.84 | 64.24 ± 0.72 |
| MAML Finn et al. (2017) | 55.92 ± 0.95 | 72.09 ± 0.76 | 46.47 ± 0.82 | 62.71 ± 0.71 |
| RelationNet Sung et al. (2018) | 62.45 ± 0.98 | 76.11 ± 0.69 | 49.31 ± 0.85 | 66.60 ± 0.69 |

Table 2: Few-shot classification results for both the mini-ImageNet and CUB datasets. The Baseline++ consistently improves the Baseline model by a large margin and is competitive with the state-of-the-art meta-learning methods. All experiments are from 5-way classification with a Conv-4 backbone and data augmentation.

We now conduct experiments on the most common setting in few-shot classification, 1-shot and 5-shot classification, i.e., 1 or 5 labeled instances are available from each novel class. We use a four-layer convolution backbone (Conv-4) with an input size of 84x84 as in Snell et al. (2017) and perform 5-way classification for only novel classes during the fine-tuning or meta-testing stage.

To validate the correctness of our implementation, we first compare our results to the reported numbers for the mini-ImageNet dataset in Table 1. Note that we include an additional ProtoNet# entry (ProtoNet meta-trained with more classes per episode): we use 5-way classification in both the meta-training and meta-testing stages for all meta-learning methods, as mentioned in Section 4.1, whereas the officially reported ProtoNet results use 30-way classification for 1-shot and 20-way for 5-shot in the meta-training stage while still using 5-way in the meta-testing stage. We report this result for completeness.

From Table 1, we can observe that none of our re-implementations of meta-learning methods falls more than 2% behind the reported performance. These minor differences can be attributed to our modifications of some implementation details to ensure a fair comparison among all methods, such as using the same optimizer for all methods.

Moreover, our implementation of existing work also improves the performance of some of the methods. For example, our results show that the Baseline approach under the 5-shot setting can be improved by a large margin, since previous implementations of the Baseline do not include data augmentation in their training stage, which leads to over-fitting. While our Baseline* (trained without data augmentation) is not as good as reported in 1-shot, our Baseline with augmentation still improves on it, and could be even higher if our reproduced Baseline* matched the reported statistics. In either case, the performance of the Baseline method is severely underestimated. We also improve the results of MatchingNet by adjusting the input score to the softmax layer to a more appropriate range as stated in Section 4.1. On the other hand, ProtoNet (5-way meta-training) is not as good as ProtoNet# (30-way or 20-way meta-training); as mentioned in the original paper, a more challenging setting in the meta-training stage leads to better accuracy. We choose to use a consistent 5-way classification setting in subsequent experiments to have a fair comparison to other methods. This issue can be resolved by using a deeper backbone as shown in Section 4.3.

After validating our re-implementation, we now report the accuracy in Table 2. Besides additionally reporting results on the CUB dataset, we also compare Baseline++ to other methods. Here, we find that Baseline++ improves the Baseline by a large margin and becomes competitive even when compared with other meta-learning methods. The results demonstrate that reducing intra-class variation is an important factor in the current few-shot classification problem setting.

However, note that our current setting only uses a 4-layer backbone, while a deeper backbone can inherently reduce intra-class variation. Thus, we conduct experiments to investigate the effects of backbone depth in the next section.

4.3 Effect of increasing the network depth

[Figure 3 panels: CUB 1-shot, CUB 5-shot, mini-ImageNet 1-shot, mini-ImageNet 5-shot]
Figure 3: Few-shot classification accuracy vs. backbone depth. In the CUB dataset, gaps among different methods diminish as the backbone gets deeper. In mini-ImageNet 5-shot, some meta-learning methods are even beaten by Baseline with a deeper backbone. (Please refer to Figure A3 and Table A5 for larger figure and detailed statistics.)

In this section, we change the depth of the feature backbone to reduce intra-class variation for all methods. See the appendix for statistics on how network depth correlates with intra-class variation. Starting from Conv-4, we gradually increase the feature backbone to Conv-6, ResNet-10, ResNet-18, and ResNet-34, where Conv-6 has two additional convolution blocks without pooling after Conv-4. ResNet-18 and ResNet-34 are the same as described in He et al. (2016) with an input size of 224x224, while ResNet-10 is a simplified version of ResNet-18 where only one residual building block is used in each layer. The statistics of this experiment should also help other works make a fair comparison under different feature backbones.

Results on the CUB dataset show a clearer tendency in Figure 3. As the backbone gets deeper, the gap among different methods drastically reduces. Another observation is how rapidly ProtoNet improves as the backbone gets deeper. While using a consistent 5-way classification as discussed in Section 4.2 degrades the accuracy of ProtoNet with Conv-4, it works well with a deeper backbone. Thus, the two observations above demonstrate that in the CUB dataset, the gap among existing methods would be reduced if their intra-class variations were all reduced by a deeper backbone.

However, the result of mini-ImageNet in Figure 3 is much more complicated. In the 5-shot setting, both Baseline and Baseline++ achieve good performance with a deeper backbone, but some meta-learning methods become worse relative to them. Thus, other than intra-class variation, we can assume that the dataset is also important in few-shot classification. One difference between CUB and mini-ImageNet is the domain difference between their base and novel classes, since classes in mini-ImageNet have a larger divergence than those in CUB in the WordNet hierarchy Miller (1995). To better understand the effect, below we discuss how domain differences between base and novel classes impact few-shot classification results.

4.4 Effect of domain differences between base and novel classes

To further dig into the issue of domain difference, we design scenarios that provide such domain shifts. Besides the fine-grained classification and object recognition scenarios, we propose a new cross-domain scenario: mini-ImageNet → CUB, as mentioned in Section 4.1. We believe that this is a practical scenario, since collecting images from a general class may be relatively easy (e.g., due to increased availability) but collecting images from fine-grained classes might be more difficult.

We conduct the experiments with a ResNet-18 feature backbone. As shown in Table 3, the Baseline outperforms all meta-learning methods under this scenario. While meta-learning methods learn to learn from the support set during the meta-training stage, they are not able to adapt to novel classes that are too different since all of the base support sets are within the same dataset. A similar concept is also mentioned in Vinyals et al. (2016). In contrast, the Baseline simply replaces and trains a new classifier based on the few given novel class data, which allows it to quickly adapt to a novel class and is less affected by domain shift between the source and target domains. The Baseline also performs better than the Baseline++ method, possibly because additionally reducing intra-class variation compromises adaptability. In Figure 4, we can further observe how Baseline accuracy becomes relatively higher as the domain difference gets larger. That is, as the domain difference grows larger, the adaptation based on a few novel class instances becomes more important.

mini-ImageNet CUB Baseline 65.570.70 Baseline++ 62.040.76 MatchingNet 53.070.74 ProtoNet 62.020.70 MAML 51.340.72 RelationNet 57.710.73 Table 3: 5-shot accuracy under the cross-domain scenario with a ResNet-18 backbone. Baseline outperforms all other methods under this scenario. Figure 4: 5-shot accuracy in different scenarios with a ResNet-18 backbone. The Baseline model performs relative well with larger domain differences.
[Figure 5 panels: CUB; mini-ImageNet; mini-ImageNet → CUB]
Figure 5: Meta-learning methods with further adaptation steps. Further adaptation improves MatchingNet and MAML, but yields less improvement for RelationNet, and could instead harm ProtoNet under the scenarios with little domain difference. All statistics are for 5-shot accuracy with a ResNet-18 backbone. Note that different methods use different further adaptation strategies.

4.5 Effect of further adaptation

To further adapt meta-learning methods as in the Baseline method, an intuitive way is to fix the features and train a new softmax classifier. We apply this simple adaptation scheme to MatchingNet and ProtoNet. For MAML, it is not feasible to fix the features, as it is an initialization method. Instead, since MAML updates the model with the support set for only a few iterations, we can adapt further by updating for as many iterations as are required to train a new classification layer, which is 100 updates as mentioned in Section 4.1. For RelationNet, the features are convolutional maps rather than feature vectors, so we are not able to replace the relation module with a softmax classifier. As an alternative, we randomly split the few training examples of each novel class into 3 support and 2 query examples to fine-tune the relation module for 100 epochs.
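The "fix the features and train a new softmax classifier" step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the learning rate, the weight initialization, and the use of full-batch gradient descent are all assumptions. Only the frozen features and the 100 updates come from the text.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_logits(W, b, x):
    """Logits of a linear classifier for one feature vector."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]

def train_softmax_classifier(features, labels, n_way, n_iters=100, lr=0.1, seed=0):
    """Fit a new linear softmax classifier on frozen support features.

    n_iters=100 follows the text; lr, seed and the init scale are assumptions.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(n_way)]
    b = [0.0] * n_way
    n = len(features)
    for _ in range(n_iters):
        grad_W = [[0.0] * dim for _ in range(n_way)]
        grad_b = [0.0] * n_way
        for x, y in zip(features, labels):
            probs = softmax(linear_logits(W, b, x))
            for c in range(n_way):
                # Gradient of cross-entropy w.r.t. logit c is (p_c - 1[c == y]).
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for k in range(dim):
                    grad_W[c][k] += err * x[k] / n
        for c in range(n_way):
            b[c] -= lr * grad_b[c]
            for k in range(dim):
                W[c][k] -= lr * grad_W[c][k]
    return W, b

def predict(W, b, x):
    """Return the index of the class with the largest logit."""
    logits = linear_logits(W, b, x)
    return max(range(len(logits)), key=lambda c: logits[c])

# Toy 2-way support set of already-extracted (frozen) 2-D features.
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
W, b = train_softmax_classifier(support_features, support_labels, n_way=2)
```

In practice the feature vectors would come from the meta-trained backbone, which stays untouched during this step.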

The results of further adaptation are shown in Figure 5; we can observe that the performance of MatchingNet and MAML improves significantly after further adaptation, particularly in the mini-ImageNet → CUB scenario. The results demonstrate that lack of adaptation is the reason they fall behind the Baseline. However, changing the setting in the meta-testing stage can lead to inconsistency with the meta-training stage. The ProtoNet result shows that performance can degrade in scenarios with less domain difference. Thus, as domain differences are likely to exist in many real-world applications, we consider learning how to adapt during the meta-training stage to be an important direction for future meta-learning research in few-shot classification.

5 Conclusions

In this paper, we have investigated the limits of the standard evaluation setting for few-shot classification. By comparing methods on common ground, our results show that the Baseline++ model is competitive with the state of the art under standard conditions, and the Baseline model achieves competitive performance with recent state-of-the-art meta-learning algorithms on both the CUB and mini-ImageNet benchmark datasets when using a deeper feature backbone. Surprisingly, the Baseline compares favorably against all the evaluated meta-learning algorithms under a realistic scenario where there exists domain shift between the base and novel classes. By making our source code publicly available, we believe that the community can benefit from the consistent comparative experiments and move forward to tackle the challenge of potential domain shifts in the context of few-shot learning.

Acknowledgement.

This work was supported in part by NSF under Grant No. 1755785. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU.

References

  • Antoniou et al. (2018) Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. In Proceedings of the International Conference on Learning Representations Workshops (ICLR Workshops), 2018.
  • Bertinetto et al. (2019) Luca Bertinetto, João F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
  • Davies & Bouldin (1979) David L Davies and Donald W Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1979.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • Dong & Xing (2018) Nanqing Dong and Eric P Xing. Domain adaption in one-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
  • Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • Ganin & Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
  • Garcia & Bruna (2018) Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • Gidaris & Komodakis (2018) Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Hariharan & Girshick (2017) Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Hilliard et al. (2018) Nathan Hilliard, Lawrence Phillips, Scott Howland, Artëm Yankov, Courtney D Corley, and Nathan O Hodas. Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376, 2018.
  • Hsu et al. (2018) Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. 2018.
  • Hu et al. (2015) Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Deep transfer metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Koch et al. (2015) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In Proceedings of the International Conference on Machine Learning Workshops (ICML Workshops), 2015.
  • Lake et al. (2011) Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Cogsci, 2011.
  • Mensink et al. (2012) Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2012.
  • Miller (1995) George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 1995.
  • Motiian et al. (2017) Saeid Motiian, Quinn Jones, Seyed Iranmanesh, and Gianfranco Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Munkhdalai & Yu (2017) Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
  • Nichol & Schulman (2018) Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
  • Pan et al. (2010) Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010.
  • Qi et al. (2018) Hang Qi, Matthew Brown, and David G Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Ravi & Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Rusu et al. (2019) Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • Wah et al. (2011) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • Wang et al. (2018) Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Appendix

A1 Relationship between domain adaptation and few-shot classification

As mentioned in Section 2, here we discuss the relationship between domain adaptation and few-shot classification to clarify the different experimental settings. As shown in Table A1, in general, domain adaptation aims at adapting source dataset knowledge to the same classes in the target dataset. On the other hand, the goal of few-shot classification is to learn from base classes to classify novel classes in the same dataset.

Several recent works tackle the problem at the intersection of the two fields of study. For example, cross-task domain adaptation (Hsu et al. (2018)) also discusses novel classes in the target dataset. In contrast, while Motiian et al. (2017) has “few-shot” in the title, their evaluation setting focuses on classifying the same classes in the target dataset.

If base and novel classes are both drawn from the same dataset, minor domain shift exists between the base and novel classes, as we demonstrated in Section 4.4. To highlight the impact of domain shift, we further propose the mini-ImageNet → CUB setting. The domain shift in few-shot classification is also discussed in Dong & Xing (2018).

Setting (examples) | Domain shift | Source to target dataset | Base to novel class
Domain adaptation (Motiian et al. (2017)) | V | V | -
Cross-task domain adaptation (Hsu et al. (2018)) | V | V | V
Few-shot classification (Ours: CUB, mini-ImageNet) | * | - | V
Cross-domain few-shot (Ours: mini-ImageNet → CUB; Dong & Xing (2018)) | V | V | V
Table A1: Relationship between domain adaptation and few-shot classification. The two fields of study overlap in their development. Notation "*" indicates that minor domain shifts exist between base and novel classes.

A2 Terminology difference

Different meta-learning papers use different terminology. We highlight the differences in Table A2 to clarify the inconsistency.

Our terms | MatchingNet (Vinyals et al.) | ProtoNet (Snell et al.) | MAML (Finn et al.) | Meta-learn LSTM (Ravi & Larochelle) | Imaginary (Wang et al.)
meta-training stage | training | training | - | - | -
meta-testing stage | test | test | - | - | -
base class | training set | training set | task | meta-training set | -
novel class | test set | test set | new task | meta-testing set | -
support set | - | - | sample | training dataset | training data
query set | batch | - | test time sample | test dataset | test data
Table A2: Different terminology used in other works. Notation "-" indicates the term is the same as in this paper.

A3 Additional results on Omniglot and Omniglot → EMNIST

For completeness, here we also show the results under two additional scenarios: 4) character recognition and 5) cross-domain character recognition.

For character recognition, we use the Omniglot dataset Lake et al. (2011) commonly used in evaluating few-shot classification algorithms. Omniglot contains 1,623 characters from 50 languages, and we follow the evaluation protocol of Vinyals et al. (2016) to first augment the classes by rotations in 90, 180, 270 degrees, resulting in 6492 classes. We then follow Snell et al. (2017) to split these classes into 4112 base, 688 validation, and 1692 novel classes. Unlike Snell et al. (2017), our validation classes are only used to monitor the performance during meta-training.
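The class counts above can be checked with a short sketch. The helper names are hypothetical and the rotation routine operates on a plain 2-D list as a stand-in for an image. Only the numbers (1,623 characters, three extra rotations, and the 4112/688/1692 split) come from the text.

```python
def rotate90(grid):
    """Rotate a 2-D list 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def augment_classes_by_rotation(class_ids, degrees=(0, 90, 180, 270)):
    """Treat every (character, rotation) pair as a separate class."""
    return [(c, d) for c in class_ids for d in degrees]

n_characters = 1623
augmented_classes = augment_classes_by_rotation(range(n_characters))

# Split sizes reported in the text.
n_base, n_val, n_novel = 4112, 688, 1692
```

The original orientation plus three rotations gives 4 × 1,623 = 6,492 classes, which matches the sum of the base, validation, and novel splits.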

For cross-domain character recognition (Omniglot → EMNIST), we follow the setting of Dong & Xing (2018) to use Omniglot without Latin characters and without rotation augmentation as base classes, so there are 1597 base classes. On the other hand, the EMNIST dataset Cohen et al. (2017) contains 10 digits and the upper- and lower-case letters of the English alphabet, so there are 62 classes in total. We split these classes into 31 validation and 31 novel classes, and invert the white-on-black characters to black-on-white as in Omniglot.
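A small sketch of the bookkeeping and the inversion step described above. The 8-bit pixel range (0 to 255) and the function name are assumptions; the source does not state how images are stored.

```python
def invert_image(pixels, max_value=255):
    """Flip intensities so white-on-black strokes become black-on-white.

    max_value=255 assumes 8-bit grayscale images (an assumption, not from the text).
    """
    return [[max_value - p for p in row] for row in pixels]

# Class counts stated in the text.
n_digits, n_upper, n_lower = 10, 26, 26
n_emnist_classes = n_digits + n_upper + n_lower
n_val_classes, n_novel_classes = 31, 31

# Omniglot characters before and after removing Latin characters.
n_omniglot_all, n_omniglot_base = 1623, 1597
n_removed = n_omniglot_all - n_omniglot_base
```

The difference of 26 removed characters is consistent with dropping a 26-letter Latin alphabet.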

We use a Conv-4 backbone with input size 28x28 for both settings. As Omniglot characters are black-and-white, center-aligned and rotation sensitive, we do not use data augmentation in this experiment. To reduce the risk of over-fitting, we use the validation set to select the epoch or episode with the best accuracy for all methods, including Baseline and Baseline++. [Footnote 4: The exact epoch of Baseline and Baseline++ on Omniglot and Omniglot → EMNIST is 5 epochs.]

As shown in Table A3, in both the Omniglot and Omniglot → EMNIST settings, meta-learning methods outperform Baseline and Baseline++ in 1-shot. However, all methods reach comparable performance in the 5-shot classification setting. We attribute this to the lack of data augmentation for the Baseline and Baseline++ methods, as they tend to over-fit base classes. When sufficient examples in novel classes are available, the negative impact of over-fitting is reduced.

Method | Omniglot 1-shot | Omniglot 5-shot | Omniglot → EMNIST 1-shot | Omniglot → EMNIST 5-shot
Baseline | 94.89 ± 0.45 | 99.12 ± 0.13 | 63.94 ± 0.87 | 86.00 ± 0.59
Baseline++ | 95.41 ± 0.39 | 99.38 ± 0.10 | 64.74 ± 0.82 | 87.31 ± 0.58
MatchingNet | 97.78 ± 0.30 | 99.37 ± 0.11 | 72.71 ± 0.79 | 87.60 ± 0.56
ProtoNet | 98.01 ± 0.30 | 99.15 ± 0.12 | 70.43 ± 0.80 | 87.04 ± 0.55
MAML | 98.57 ± 0.19 | 99.53 ± 0.08 | 72.04 ± 0.83 | 88.24 ± 0.56
RelationNet | 97.22 ± 0.33 | 99.30 ± 0.10 | 75.55 ± 0.87 | 88.94 ± 0.54
Table A3: Few-shot classification results for both Omniglot and Omniglot → EMNIST. All experiments are 5-way classification with a Conv-4 backbone and without data augmentation.

A4 Baseline with 1-NN classifier

Some prior work (Vinyals et al. (2016)) applies a Baseline with a 1-NN classifier in the test stage. We include our results in Table A4. They show that the 1-NN classifier performs better than the softmax classifier in the 1-shot setting, but the softmax classifier performs better in the 5-shot setting. We note that the numbers here are not directly comparable to the results in Vinyals et al. (2016) because we use a different mini-ImageNet setting, as in Ravi & Larochelle (2017).

Method | 1-shot softmax | 1-shot 1-NN | 5-shot softmax | 5-shot 1-NN
Baseline | 42.11 ± 0.71 | 44.18 ± 0.69 | 62.53 ± 0.69 | 56.68 ± 0.67
Baseline++ | 48.24 ± 0.75 | 49.57 ± 0.73 | 66.43 ± 0.63 | 61.93 ± 0.65
Table A4: Baseline with softmax and 1-NN classifier in test stage. We note that we use cosine distance in 1-NN.
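For reference, a 1-NN classifier with cosine distance over extracted features can be sketched as below. The function names are hypothetical, and the sketch assumes non-zero feature vectors; it is an illustration of the technique rather than the authors' code.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def one_nn_predict(query, support_features, support_labels):
    """Return the label of the support feature closest to the query."""
    nearest = min(
        range(len(support_features)),
        key=lambda i: cosine_distance(query, support_features[i]),
    )
    return support_labels[nearest]

# Toy 2-way 1-shot example with 2-D features.
support_features = [[1.0, 0.0], [0.0, 1.0]]
support_labels = [0, 1]
prediction = one_nn_predict([2.0, 0.1], support_features, support_labels)
```

Because cosine distance ignores vector length, the query `[2.0, 0.1]` is matched to the first support example even though its norm differs.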

A5 MAML and MAML with first-order approximation

As discussed in Section 4.1, we use first-order approximation MAML to improve memory efficiency in all of our experiments. To demonstrate that this design choice does not affect accuracy, we compare the validation accuracy trends of the two versions on 5-shot Omniglot in Figure A1. We observe that while the full version of MAML converges faster, both versions reach similar accuracy in the end.

This phenomenon is consistent with the difference between first-order methods (e.g., gradient descent) and second-order methods (e.g., Newton's method) in convex optimization problems. Second-order methods converge faster at the cost of memory, but both converge to similar objective values.
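To make the difference between the two MAML variants concrete, here is a toy scalar sketch. It assumes quadratic task losses of the form 0.5 · a · (θ − c)², a single inner gradient step, and hypothetical function names; none of this is taken from the paper's experiments. The full meta-gradient differentiates through the inner update, which brings in the support-loss curvature `a_s`, while the first-order approximation drops that factor.

```python
def inner_update(theta, a_s, c_s, alpha):
    """One gradient step on the support loss 0.5 * a_s * (theta - c_s)**2."""
    return theta - alpha * a_s * (theta - c_s)

def query_loss_after_adaptation(theta, a_s, c_s, a_q, c_q, alpha):
    """Query loss evaluated at the adapted parameter."""
    adapted = inner_update(theta, a_s, c_s, alpha)
    return 0.5 * a_q * (adapted - c_q) ** 2

def meta_gradients(theta, a_s, c_s, a_q, c_q, alpha):
    """Return (full, first_order) gradients of the query loss w.r.t. theta."""
    adapted = inner_update(theta, a_s, c_s, alpha)
    query_grad = a_q * (adapted - c_q)
    # Chain rule: d(adapted)/d(theta) = 1 - alpha * a_s (a second-order term).
    full = (1.0 - alpha * a_s) * query_grad
    # First-order approximation treats d(adapted)/d(theta) as 1.
    first_order = query_grad
    return full, first_order

full_grad, first_order_grad = meta_gradients(
    theta=0.0, a_s=1.0, c_s=1.0, a_q=1.0, c_q=1.0, alpha=0.1
)
```

In this toy case both gradients point in the same direction and differ only by the factor (1 − α·a_s). With deep networks the omitted term is a Hessian-vector product, which is where the memory saving comes from.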

Figure A1: Validation accuracy trends of MAML and MAML with first order approximation. Both versions converge to the same validation accuracy. The experimental results are on Omniglot with 5-shot with a Conv-4 backbone.

A6 Intra-class variation and backbone depth

As mentioned in Section 4.3, Figure A2 shows that intra-class variation decreases as the network gets deeper. We use the Davies-Bouldin index (Davies & Bouldin, 1979) to measure intra-class variation; it is a metric that evaluates the tightness of a cluster (or class, in our case). Our results show that intra-class variation in both the base and novel class features decreases with deeper backbones.

[Figure A2 panels: Base class feature; Novel class feature]
Figure A2: Intra-class variation decreases as the backbone gets deeper. Here we use the Davies-Bouldin index to represent intra-class variation, which is a metric to evaluate the tightness in a cluster (or class, in our case). The statistics are the Davies-Bouldin index for all base and novel class features (extracted by the feature extractor learned after the training or meta-training stage) for the CUB dataset under different backbones.
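The Davies-Bouldin index can be computed as in the sketch below, which uses Euclidean distance and hypothetical helper names. For each class it takes the worst-case ratio of summed within-class scatter to between-centroid distance, then averages over classes, so lower values indicate tighter, better-separated classes. The distance metric and any preprocessing the authors used are not stated in this section, so treat those choices as assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(points):
    """Coordinate-wise mean of a list of vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def davies_bouldin_index(features_by_class):
    """Davies-Bouldin index for a dict mapping class -> list of feature vectors."""
    classes = list(features_by_class)
    centers = {c: centroid(features_by_class[c]) for c in classes}
    # Scatter: mean distance of a class's features to its own centroid.
    scatter = {
        c: sum(euclidean(p, centers[c]) for p in features_by_class[c])
        / len(features_by_class[c])
        for c in classes
    }
    total = 0.0
    for i in classes:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
            for j in classes
            if j != i
        )
    return total / len(classes)

# Two toy feature sets with identical centroids but different tightness.
tight = {0: [[0.0, 0.0], [0.0, 0.2]], 1: [[10.0, 0.0], [10.0, 0.2]]}
loose = {0: [[0.0, 0.0], [0.0, 4.0]], 1: [[10.0, 0.0], [10.0, 4.0]]}
db_tight = davies_bouldin_index(tight)
db_loose = davies_bouldin_index(loose)
```

The `loose` set has the same class centers as `tight` but larger within-class spread, so its index is higher.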

A7 Detailed statistics in effects of increasing backbone depth

Here we show a high-resolution version of Figure 3 in Figure A3 and show detailed statistics in Table A5 for easier comparison.

[Figure A3 panels: CUB; mini-ImageNet]
Figure A3: Few-shot classification accuracy vs. backbone depth. In the CUB dataset, gaps among different methods diminish as the backbone gets deeper. In mini-ImageNet 5-shot, some meta-learning methods are even beaten by Baseline with a deeper backbone.
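Dataset | Setting | Method | Conv-4 | Conv-6 | ResNet-10 | ResNet-18 | ResNet-34
CUB | 1-shot | Baseline | 47.12 ± 0.74 | 55.77 ± 0.86 | 63.34 ± 0.91 | 65.51 ± 0.87 | 67.96 ± 0.89
CUB | 1-shot | Baseline++ | 60.53 ± 0.83 | 66.00 ± 0.89 | 69.55 ± 0.89 | 67.02 ± 0.90 | 68.00 ± 0.83
CUB | 1-shot | MatchingNet | 61.16 ± 0.89 | 67.16 ± 0.97 | 71.29 ± 0.90 | 72.36 ± 0.90 | 71.44 ± 0.96
CUB | 1-shot | ProtoNet | 51.31 ± 0.91 | 66.07 ± 0.97 | 70.13 ± 0.94 | 71.88 ± 0.91 | 72.03 ± 0.91
CUB | 1-shot | MAML | 55.92 ± 0.95 | 65.91 ± 0.97 | 71.29 ± 0.95 | 69.96 ± 1.01 | 67.28 ± 1.08
CUB | 1-shot | RelationNet | 62.45 ± 0.98 | 63.11 ± 0.94 | 68.65 ± 0.91 | 67.59 ± 1.02 | 66.20 ± 0.99
CUB | 5-shot | Baseline | 64.16 ± 0.71 | 73.07 ± 0.71 | 81.27 ± 0.57 | 82.85 ± 0.55 | 84.27 ± 0.53
CUB | 5-shot | Baseline++ | 79.34 ± 0.61 | 82.02 ± 0.55 | 85.17 ± 0.50 | 83.58 ± 0.54 | 84.50 ± 0.51
CUB | 5-shot | MatchingNet | 72.86 ± 0.70 | 77.08 ± 0.66 | 83.59 ± 0.58 | 83.64 ± 0.60 | 83.78 ± 0.56
CUB | 5-shot | ProtoNet | 70.77 ± 0.69 | 78.14 ± 0.67 | 84.76 ± 0.52 | 87.42 ± 0.48 | 85.98 ± 0.53
CUB | 5-shot | MAML | 72.09 ± 0.76 | 76.31 ± 0.74 | 80.33 ± 0.70 | 82.70 ± 0.65 | 83.47 ± 0.59
CUB | 5-shot | RelationNet | 76.11 ± 0.69 | 77.81 ± 0.66 | 81.12 ± 0.63 | 82.75 ± 0.58 | 82.30 ± 0.58
mini-ImageNet | 1-shot | Baseline | 42.11 ± 0.71 | 45.82 ± 0.74 | 52.37 ± 0.79 | 51.75 ± 0.80 | 49.82 ± 0.73
mini-ImageNet | 1-shot | Baseline++ | 48.24 ± 0.75 | 48.29 ± 0.72 | 53.97 ± 0.79 | 51.87 ± 0.77 | 52.65 ± 0.83
mini-ImageNet | 1-shot | MatchingNet | 48.14 ± 0.78 | 50.47 ± 0.86 | 54.49 ± 0.81 | 52.91 ± 0.88 | 53.20 ± 0.78
mini-ImageNet | 1-shot | ProtoNet | 44.42 ± 0.84 | 50.37 ± 0.83 | 51.98 ± 0.84 | 54.16 ± 0.82 | 53.90 ± 0.83
mini-ImageNet | 1-shot | MAML | 46.47 ± 0.82 | 50.96 ± 0.92 | 54.69 ± 0.89 | 49.61 ± 0.92 | 51.46 ± 0.90
mini-ImageNet | 1-shot | RelationNet | 49.31 ± 0.85 | 51.84 ± 0.88 | 52.19 ± 0.83 | 52.48 ± 0.86 | 51.74 ± 0.83
mini-ImageNet | 5-shot | Baseline | 62.53 ± 0.69 | 66.42 ± 0.67 | 74.69 ± 0.64 | 74.27 ± 0.63 | 73.45 ± 0.65
mini-ImageNet | 5-shot | Baseline++ | 66.43 ± 0.63 | 68.09 ± 0.69 | 75.90 ± 0.61 | 75.68 ± 0.63 | 76.16 ± 0.63
mini-ImageNet | 5-shot | MatchingNet | 63.48 ± 0.66 | 63.19 ± 0.70 | 68.82 ± 0.65 | 68.88 ± 0.69 | 68.32 ± 0.66
mini-ImageNet | 5-shot | ProtoNet | 64.24 ± 0.72 | 67.33 ± 0.67 | 72.64 ± 0.64 | 73.68 ± 0.65 | 74.65 ± 0.64
mini-ImageNet | 5-shot | MAML | 62.71 ± 0.71 | 66.09 ± 0.71 | 66.62 ± 0.83 | 65.72 ± 0.77 | 65.90 ± 0.79
mini-ImageNet | 5-shot | RelationNet | 66.60 ± 0.69 | 64.55 ± 0.70 | 70.20 ± 0.66 | 69.83 ± 0.68 | 69.61 ± 0.67
Table A5: Detailed statistics for Figure 3. We list the exact values here for reference.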

A8 More-way in meta-testing stage

We experiment with a practical setting that handles different testing scenarios. Specifically, we conduct the experiments of 5-way meta-training and N-way meta-testing (where N = 5, 10, 20) to examine the effect of testing scenarios that are different from training.

As shown in Table A6, we compare the Baseline, Baseline++, MatchingNet, ProtoNet, and RelationNet methods. Note that we are unable to apply MAML, as it learns the initialization for the classifier and can thus only be updated to classify the same number of classes. Our results show that for classification with a larger N-way in the meta-testing stage, the proposed Baseline++ compares favorably against other methods with both shallow and deeper backbones.

We attribute the results to two reasons. First, to perform well in a larger N-way classification setting, one needs to further reduce the intra-class variation to avoid misclassification. Thus, Baseline++ has better performance than Baseline in both backbone settings. Second, as meta-learning algorithms were trained to perform 5-way classification in the meta-training stage, the performance of these algorithms may drop significantly when increasing the N-way in the meta-testing stage, because 10-way and 20-way classification tasks are harder than 5-way ones.

One may address this issue by performing a larger N-way classification in the meta-training stage (as suggested in Snell et al. (2017)). However, this may run into memory constraints. For example, to perform 20-way classification with 5 support images and 15 query images in each class, a batch of 400 images (20 x (5 + 15)) must fit into GPU memory. Without special hardware parallelization, the large batch size may prevent us from training models with deeper backbones such as ResNet.
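The batch-size arithmetic above can be sketched with a toy episode sampler. The function names and the synthetic pool of image identifiers are hypothetical. Only the 20-way, 5-support, 15-query configuration comes from the text.

```python
import random

def episode_batch_size(n_way, n_support, n_query):
    """Images per episode: each class contributes its support and query images."""
    return n_way * (n_support + n_query)

def sample_episode(images_by_class, n_way, n_support, n_query, rng):
    """Sample one N-way episode as (support, query) lists of (image, class) pairs."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(images_by_class[c], n_support + n_query)
        support += [(img, c) for img in picked[:n_support]]
        query += [(img, c) for img in picked[n_support:]]
    return support, query

# Toy pool: 30 hypothetical classes with 20 image identifiers each.
pool = {c: [f"class{c}_img{i}" for i in range(20)] for c in range(30)}
support, query = sample_episode(
    pool, n_way=20, n_support=5, n_query=15, rng=random.Random(0)
)
```

A 5-way episode with the same per-class counts holds 100 images, so moving to 20-way meta-training quadruples the number of images that must be processed together.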

Method | Conv-4 5-way | Conv-4 10-way | Conv-4 20-way | ResNet-18 5-way | ResNet-18 10-way | ResNet-18 20-way
Baseline | 62.53 ± 0.69 | 46.44 ± 0.41 | 32.27 ± 0.24 | 74.27 ± 0.63 | 55.00 ± 0.46 | 42.03 ± 0.25
Baseline++ | 66.43 ± 0.63 | 52.26 ± 0.40 | 38.03 ± 0.24 | 75.68 ± 0.63 | 63.40 ± 0.44 | 50.85 ± 0.25
MatchingNet | 63.48 ± 0.66 | 47.61 ± 0.44 | 33.97 ± 0.24 | 68.88 ± 0.69 | 52.27 ± 0.46 | 36.78 ± 0.25
ProtoNet | 64.24 ± 0.68 | 48.77 ± 0.45 | 34.58 ± 0.23 | 73.68 ± 0.65 | 59.22 ± 0.44 | 44.96 ± 0.26
RelationNet | 66.60 ± 0.69 | 47.77 ± 0.43 | 33.72 ± 0.22 | 69.83 ± 0.68 | 53.88 ± 0.48 | 39.17 ± 0.25
Table A6: 5-way meta-training and N-way meta-testing experiment. The experimental results are on mini-ImageNet with 5-shot. Baseline++ compares favorably against other methods with both shallow and deeper backbones.