A New Benchmark for Evaluation of Cross-Domain Few-Shot Learning

by   Yunhui Guo, et al.

Recent progress on few-shot learning has largely relied on annotated data for meta-learning, sampled from the same domain as the novel classes. However, in many applications, collecting data for meta-learning is infeasible or impossible. This leads to the cross-domain few-shot learning problem, where a large domain shift exists between base and novel classes. Although some preliminary investigation of few-shot methods under domain shift exists, a standard benchmark for cross-domain few-shot learning is not yet established. In this paper, we propose the cross-domain few-shot learning (CD-FSL) benchmark, consisting of images from diverse domains with varying similarity to ImageNet, ranging from crop disease images to satellite images and medical images. Extensive experiments on the proposed benchmark are performed to compare an array of state-of-the-art meta-learning and transfer learning approaches, including various forms of single model fine-tuning and ensemble learning. The results demonstrate that current meta-learning methods underperform in relation to simple fine-tuning by 12.8%. Accuracy of all methods tends to correlate with dataset similarity to ImageNet. In addition, the relative performance gain with increasing number of shots is greater with transfer methods than with meta-learning. Finally, we demonstrate that transferring from multiple pretrained models achieves best performance, with accuracy improvements of 14.9% [...] meta-learning and single model fine-tuning approaches, respectively. In summary, the proposed benchmark serves as a challenging platform to guide future research on cross-domain few-shot learning due to its spectrum of diversity and coverage.


1 Introduction

Figure 1: The cross-domain few-shot learning (CD-FSL) benchmark. ImageNet is used for source training, and domains of varying dissimilarity from ImageNet are used for target evaluation. No data is provided for meta-learning, and target classes are disjoint from the source classes.

Training deep neural networks for visual recognition typically requires a large number of labelled examples [krizhevsky2012imagenet]. The generalization ability of deep neural networks relies heavily on the size and variations of the dataset used for training. However, collecting sufficient amounts of data for certain classes may be impossible in practice: for example, in dermatology, there is a multitude of rare diseases, or diseases that become rare for particular types of skin [rotemberg2019role, adamson2018machine, fairnessinskin]. In other domains, such as satellite imagery, there are rare categories such as airplane wreckage. Although individually each situation may not carry heavy cost, as a group across many such conditions and modalities, correct identification is critically important, and remains a significant challenge where access to expertise may be impeded.

In contrast to deep neural networks, humans generalize to recognize new categories from few examples in certain circumstances, such as when categories exhibit predictable variations across examples and have reasonable contrast from background [lake2015human, lake2011one]. However, even humans have trouble recognizing new categories that vary too greatly between examples or differ from prior experience, such as for diagnosis in dermatology, radiology, or other fields [rotemberg2019role]. Because there are many applications where learning must work from few examples, and both machines and humans have difficulty learning in these circumstances, finding new methods to tackle the problem remains a challenging but desirable goal.

The problem of learning how to categorize classes with very few training examples is referred to as “few-shot learning”, and has been the topic of a large body of recent work [li2006one, ravi2016optimization, vinyals2016matching, finn2017model, snell2017prototypical, chen2018a, sung2018learning]. Few-shot learning is typically composed of the following two stages: meta-learning and meta-testing. In the meta-learning stage, there exists an abundance of base classes on which a system can be trained to learn well from few examples within that particular domain. In the meta-testing stage, a set of novel classes consisting of very few examples per class is used to adapt and evaluate the trained model. However, recent work [chen2018a] points out that meta-learning based few-shot learning algorithms underperform compared to traditional “pre-training and fine-tuning” when there exists a large domain shift between base classes and novel classes. This is a major issue, as the domain shift problem occurs commonly in practice: by the nature of the problem, it is hard to collect data from the same domain for many few-shot classification tasks. This scenario is referred to as cross-domain few-shot learning, to distinguish it from the conventional few-shot learning setting.

Although benchmarks for conventional few-shot learning are well established, the lack of standard cross-domain few-shot learning benchmarks hinders the development of new few-shot learning algorithms. Therefore, to fill this gap, we propose the cross-domain few-shot learning (CD-FSL) benchmark (Fig. 1). The proposed benchmark covers a variety of image domains with varying levels of similarity to ImageNet, including agriculture (most similar), satellite (less similar), dermatology (even less similar), and radiological images (least similar). The performance of existing state-of-the-art meta-learning methods is then evaluated on the proposed benchmark, where results reveal that these meta-learning methods perform significantly worse than the standard “pre-training and fine-tuning” approach. Subsequently, variants of single model fine-tuning techniques are also evaluated, where results demonstrate that no individual method dominates performance across the benchmark, and the relative performance gain with increasing number of shots is greater with transfer methods than with meta-learning. In addition, performance across methods also positively correlates with dataset similarity to ImageNet. Further experiments transferring knowledge from multiple pretrained models, all from domains disjoint from the evaluation benchmark, demonstrate the best performance.

In summary, the contributions of this paper are itemized as follows:

  • We establish a new benchmark for cross-domain few-shot learning, consisting of images from a diversity of domains with varying similarity to ImageNet, and lacking data for meta-learning.

  • We extensively evaluate the performance of current meta-learning methods and variants of fine-tuning. The results show the following observations for CD-FSL: 1) meta-learning underperforms compared to fine-tuning, 2) accuracy gain with additional data is increased for fine-tuning versus meta-learning, 3) no individual fine-tuning method dominates performance versus the others across the benchmark, and 4) a general positive correlation between accuracy and dataset similarity to ImageNet exists.

  • We propose Incremental Multi-model Selection, a method which integrates multiple pretrained models for cross-domain few-shot learning, and demonstrates best average performance on the new benchmark.

2 Related Work

Few-shot learning. Few-shot learning [lake2015human, vinyals2016matching, lake2011one] is an increasingly important topic in machine learning. Current few-shot learning algorithms can be roughly classified into two threads. One thread includes meta-learning methods which aim to learn models that can be quickly adapted using a few examples [vinyals2016matching, finn2017model, snell2017prototypical, sung2018learning, lee2019meta]. MatchingNet [vinyals2016matching] learns an embedding that can map an unlabelled example to its label using a small number of labelled examples, while MAML [finn2017model] aims at learning good initialization parameters that can be quickly adapted to a new task. In ProtoNet [snell2017prototypical], the goal is to learn a metric space in which classification can be conducted by calculating distances to prototype representations of each class. RelationNet [sung2018learning] targets learning a deep distance metric to compare a small number of images. More recently, MetaOpt [lee2019meta] learns feature embeddings that can generalize well under a linear classification rule for novel categories.

Another line of few-shot learning algorithms is based on the idea of reusing features learned from the base classes for the novel classes, i.e., transfer learning [pan2009survey]. Transfer learning with deep neural networks is conducted mainly by fine-tuning, which adjusts a pretrained model from a source task to a target task. Yosinski et al. [yosinski2014transferable] conducted extensive experiments to investigate the transfer utility of pretrained deep neural networks. In [kornblith2018better], the authors investigated whether higher performing ImageNet models transfer better to new tasks. Ge et al. [ge2017borrowing] proposed a selective joint fine-tuning method for improving the performance of models with a limited amount of training data. In [guo2019spottune], the authors proposed an adaptive fine-tuning scheme to decide which layers of the pretrained network should be fine-tuned.

Few-shot Learning Benchmarks. Current few-shot learning research assumes base classes and novel classes are from the same domain. The common benchmarks for evaluation are miniImageNet [vinyals2016matching], CUB [wah2011caltech], Omniglot [lake2011one], CIFAR-FS [bertinetto2018meta] and tieredImageNet [ren2018meta]. In [triantafillou2019meta], the authors proposed Meta-Dataset, which is a new benchmark for training and evaluating few-shot learning algorithms. However, the included datasets are limited to natural images similar to previous benchmarks. The evaluation also follows the standard setting, that is, both the base classes and novel classes are from the same domain. Arguably, the current few-shot learning benchmarks do not reflect the reality of few-shot learning applications where meta-learning data in domain is commonly not available.

3 Cross-domain Few-shot Learning

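We formalize the cross-domain few-shot learning problem as follows. We define a domain as a joint distribution $P$ over input space $\mathcal{X}$ and label space $\mathcal{Y}$. The marginal distribution of $\mathcal{X}$ is denoted as $P_{\mathcal{X}}$. We use the pair $(x, y)$ to denote a sample $x$ and the corresponding label $y$ from the joint distribution $P$. For a model $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$ with parameter $\theta$ and a loss function $\ell$, the expected error is defined as,

$$\epsilon(f_\theta) = \mathbb{E}_{(x, y) \sim P}\left[\ell(f_\theta(x), y)\right] \qquad (1)$$
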
In cross-domain few-shot learning, we have a source domain $(\mathcal{X}_s, \mathcal{Y}_s)$ and a target domain $(\mathcal{X}_t, \mathcal{Y}_t)$ with joint distributions $P_s$ and $P_t$ respectively, and specifically $P_{\mathcal{X}_s} \neq P_{\mathcal{X}_t}$. The base classes data are sampled from the source domain and the novel classes data are sampled from the target domain. During the training or meta-training stage, the model $f_\theta$ is trained (or meta-trained) on the base classes data. During the testing (or meta-testing) stage, the model is presented with a support set $S = \{x_i, y_i\}_{i=1}^{K \times N}$ consisting of $N$ examples from each of $K$ novel classes. This configuration is referred to as “$K$-way $N$-shot” few-shot learning, as the support set has $K$ novel classes and each novel class has $N$ training examples. After the model is adapted to the support set, a query set from the novel classes is used to evaluate the model performance.
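The episode construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_episode`, the toy label list, and the choice to relabel the sampled classes as 0..k_way-1 are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def sample_episode(labels, k_way, n_shot, n_query, rng):
    """Sample one few-shot episode from a list of class labels.

    Returns (support, query), each a list of (example_index, episode_label)
    pairs, where episode_label runs from 0 to k_way - 1.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only classes with enough examples for support + query are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= n_shot + n_query]
    classes = rng.sample(eligible, k_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        picked = rng.sample(by_class[c], n_shot + n_query)
        support += [(i, episode_label) for i in picked[:n_shot]]
        query += [(i, episode_label) for i in picked[n_shot:]]
    return support, query


# Hypothetical target-domain labels: 7 novel classes with 40 examples each.
toy_labels = [c for c in range(7) for _ in range(40)]
support, query = sample_episode(
    toy_labels, k_way=5, n_shot=5, n_query=15, rng=random.Random(0)
)
print(len(support), len(query))  # 25 75
```

The support and query indices are drawn without overlap, so the query set measures performance on examples the adapted model has not seen.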

Different from the traditional domain adaptation setting, the label spaces of the source domain and target domain in cross-domain few-shot learning are disjoint. Thus, domain adaptation theory [ben2010theory] cannot be applied in the cross-domain few-shot learning setting.

4 Proposed Benchmark

In this section, we introduce the proposed cross-domain few-shot learning benchmark. The proposed benchmark includes data from the CropDiseases [mohanty2016using], EuroSAT [helber2019eurosat], ISIC2018 [tschandl2018ham10000, codella2019skin], and ChestX [wang2017chestx] datasets, which cover plant disease images, satellite images, dermoscopic images of skin lesions, and X-ray images, respectively. The selected datasets reflect real-world use cases for few-shot learning since collecting enough examples from the above domains is often difficult, expensive, or in some cases not possible. In addition, they demonstrate the following spectrum of readily quantifiable domain shifts from ImageNet [deng2009imagenet]: 1) CropDiseases images are most similar as they include perspective color images of natural elements, but are more specialized than anything available in ImageNet, 2) EuroSAT images are less similar as they have lost perspective distortion, but are still color images of natural scenes, 3) ISIC2018 images are even less similar as they have lost perspective distortion and no longer represent natural scenes, and 4) ChestX images are the most dissimilar as they have lost perspective distortion, all color, and do not represent natural scenes. Example images from ImageNet and the proposed benchmark datasets are shown in Figure 1.

In practice, having a few-shot learning model trained on a source domain such as ImageNet [deng2009imagenet] that can generalize to domains such as these, is highly desirable, as it enables effective learning for rare categories in new domains, which has previously not been studied in detail.

5 Methods for Cross-Domain Few-Shot Learning

In this section, we describe the prevailing few-shot learning algorithms that will be evaluated on our proposed benchmark. We categorize existing few-shot learning algorithms into meta-learning based methods and transfer learning based methods. For transfer learning methods, we also consider the effect of different types of classifiers. Finally, we show that transferring from multiple models pretrained on different datasets from the same source domain can generally boost performance.

5.1 Meta-learning based methods

Meta-learning [finn2017model, ravi2016optimization], or learning to learn, aims at learning task-agnostic knowledge in order to efficiently learn on new tasks. Each task $T_i$ is assumed to be drawn from a fixed distribution, $T_i \sim P(\mathcal{T})$. Specifically, in few-shot learning, each task $T_i$ is a small dataset $D_i := \{x_j, y_j\}_{j=1}^{K \times N}$. $P_s(\mathcal{T})$ and $P_t(\mathcal{T})$ are used to denote the task distribution of the source (base) classes data and target (novel) classes data respectively. During the meta-training stage, the model is trained on tasks $\{T_i\}$ which are sampled independently from $P_s(\mathcal{T})$. During the meta-testing stage, the model is expected to be quickly adapted to a new task $T_j \sim P_t(\mathcal{T})$.

Meta-learning methods differ in their way of learning the parameter of the initial model $f_\theta$ on the base classes data. In MatchingNet [vinyals2016matching], the goal is to learn a model $f_\theta$ that can map an unlabelled example $\hat{x}$ to its label $\hat{y}$ using a small labelled set $D := \{x_i, y_i\}_{i=1}^{K \times N}$ as $\hat{y} = \sum_{i=1}^{K \times N} a_\theta(\hat{x}, x_i)\, y_i$, where $a_\theta$ is an attention kernel which leverages $f_\theta$ to compute the distance between the unlabelled example $\hat{x}$ and the labelled example $x_i$, and $y_i$ is the one-hot representation of the label. In contrast, MAML [finn2017model] aims at learning an initial parameter $\theta$ that can be quickly adapted to a new task. This is achieved by updating the model parameter via a two-stage optimization process. ProtoNet [snell2017prototypical] represents each class $k$ with the mean vector of embedded support examples as $c_k = \frac{1}{N} \sum_{i: y_i = k} f_\theta(x_i)$. Classification is then conducted by calculating the distance of the example to the prototype representation of each class. In RelationNet [sung2018learning], the metric of the nearest neighbor classifier is meta-learned using a Siamese network trained for optimal comparison between query and support samples. More recently, MetaOpt [lee2019meta] employs convex base learners and aims at learning feature embeddings that generalize well under a linear classification rule for novel categories. All the existing meta-learning methods implicitly assume that $P_s(\mathcal{T}) = P_t(\mathcal{T})$, so the task-agnostic knowledge learned in the meta-training stage can be leveraged for fast learning on novel classes. However, in cross-domain few-shot learning $P_s(\mathcal{T}) \neq P_t(\mathcal{T})$, which poses severe challenges for current meta-learning methods.

| Methods | ChestX, 5-shot | ChestX, 20-shot | ChestX, 50-shot | ISIC, 5-shot | ISIC, 20-shot | ISIC, 50-shot |
|---|---|---|---|---|---|---|
| MatchingNet | 22.40% ± 0.85% | 23.61% ± 0.86% | 22.12% ± 0.88% | 36.74% ± 0.53% | 45.72% ± 0.53% | 54.58% ± 0.65% |
| MAML | 23.48% ± 0.96% | 27.53% ± 0.43% | - | 40.13% ± 0.58% | 52.36% ± 0.57% | - |
| ProtoNet | 24.05% ± 1.01% | 28.21% ± 1.15% | 29.32% ± 1.12% | 39.57% ± 0.57% | 49.50% ± 0.55% | 51.99% ± 0.52% |
| RelationNet | 22.96% ± 0.88% | 26.63% ± 0.92% | 28.45% ± 1.20% | 39.41% ± 0.58% | 41.77% ± 0.49% | 49.32% ± 0.51% |
| MetaOpt | 22.53% ± 0.91% | 25.53% ± 1.02% | 29.35% ± 0.99% | 36.28% ± 0.50% | 49.42% ± 0.60% | 54.80% ± 0.54% |

| Methods | EuroSAT, 5-shot | EuroSAT, 20-shot | EuroSAT, 50-shot | CropDiseases, 5-shot | CropDiseases, 20-shot | CropDiseases, 50-shot |
|---|---|---|---|---|---|---|
| MatchingNet | 64.45% ± 0.63% | 77.10% ± 0.57% | 54.44% ± 0.67% | 66.39% ± 0.78% | 76.38% ± 0.67% | 58.53% ± 0.73% |
| MAML | 71.70% ± 0.72% | 81.95% ± 0.55% | - | 78.05% ± 0.68% | 89.75% ± 0.42% | - |
| ProtoNet | 73.29% ± 0.71% | 82.27% ± 0.57% | 80.48% ± 0.57% | 79.72% ± 0.67% | 88.15% ± 0.51% | 90.81% ± 0.43% |
| RelationNet | 61.31% ± 0.72% | 74.43% ± 0.66% | 74.91% ± 0.58% | 68.99% ± 0.75% | 80.45% ± 0.64% | 85.08% ± 0.53% |
| MetaOpt | 64.44% ± 0.73% | 79.19% ± 0.62% | 83.62% ± 0.58% | 68.41% ± 0.73% | 82.89% ± 0.54% | 91.76% ± 0.38% |

Table 1: The results of meta-learning methods on the proposed benchmark. All tasks are 5-way; entries are average accuracy ± confidence interval.

| Methods | ChestX, 5-shot | ChestX, 20-shot | ChestX, 50-shot | ISIC, 5-shot | ISIC, 20-shot | ISIC, 50-shot |
|---|---|---|---|---|---|---|
| Random | 21.80% ± 1.03% | 25.69% ± 0.95% | 26.19% ± 0.94% | 37.91% ± 1.39% | 47.24% ± 1.50% | 50.85% ± 1.37% |
| Fixed | 25.35% ± 0.96% | 30.83% ± 1.05% | 36.04% ± 0.46% | 43.56% ± 0.60% | 52.78% ± 0.58% | 57.34% ± 0.56% |
| Fine-tuning | 25.97% ± 0.41% | 31.32% ± 0.45% | 35.49% ± 0.45% | 48.11% ± 0.64% | 59.31% ± 0.48% | 66.48% ± 0.56% |
| Ft Last-1 | 25.96% ± 0.46% | 31.63% ± 0.49% | 37.03% ± 0.50% | 47.20% ± 0.45% | 59.95% ± 0.45% | 65.04% ± 0.47% |
| Ft Last-2 | 26.79% ± 0.59% | 30.95% ± 0.61% | 36.24% ± 0.62% | 47.64% ± 0.44% | 59.87% ± 0.35% | 66.07% ± 0.45% |
| Ft Last-3 | 25.17% ± 0.56% | 30.92% ± 0.89% | 37.27% ± 0.64% | 48.05% ± 0.55% | 60.20% ± 0.33% | 66.21% ± 0.52% |
| Transductive Ft | 26.09% ± 0.96% | 31.01% ± 0.59% | 36.79% ± 0.53% | 49.68% ± 0.36% | 61.09% ± 0.44% | 67.20% ± 0.59% |

| Methods | EuroSAT, 5-shot | EuroSAT, 20-shot | EuroSAT, 50-shot | CropDiseases, 5-shot | CropDiseases, 20-shot | CropDiseases, 50-shot |
|---|---|---|---|---|---|---|
| Random | 58.00% ± 2.01% | 68.93% ± 1.47% | 71.65% ± 1.47% | 69.68% ± 1.72% | 83.41% ± 1.25% | 86.56% ± 1.42% |
| Fixed | 75.69% ± 0.66% | 84.13% ± 0.52% | 86.62% ± 0.47% | 87.48% ± 0.58% | 94.45% ± 0.36% | 96.62% ± 0.25% |
| Fine-tuning | 79.08% ± 0.61% | 87.64% ± 0.47% | 90.89% ± 0.36% | 89.25% ± 0.51% | 95.51% ± 0.31% | 97.68% ± 0.21% |
| Ft Last-1 | 80.45% ± 0.54% | 87.92% ± 0.44% | 91.41% ± 0.46% | 88.72% ± 0.53% | 95.76% ± 0.65% | 97.87% ± 0.48% |
| Ft Last-2 | 79.57% ± 0.51% | 87.67% ± 0.46% | 90.93% ± 0.45% | 88.07% ± 0.56% | 95.68% ± 0.76% | 97.64% ± 0.59% |
| Ft Last-3 | 78.04% ± 0.77% | 87.52% ± 0.53% | 90.83% ± 0.42% | 89.11% ± 0.47% | 95.31% ± 0.85% | 97.45% ± 0.46% |
| Transductive Ft | 81.76% ± 0.48% | 87.97% ± 0.42% | 92.00% ± 0.56% | 90.64% ± 0.54% | 95.91% ± 0.72% | 97.48% ± 0.56% |

Table 2: The results of different variants of single model fine-tuning on the proposed benchmark. All tasks are 5-way; entries are average accuracy ± confidence interval.

5.2 Transfer learning based methods

An alternative way to tackle the problem of few-shot learning is based on transfer learning, where an initial model $f_\theta$ is trained on the base classes data in a standard supervised learning way and reused on the novel classes. There are several options to realize the idea of transfer learning for few-shot learning. Previous works on transfer learning for few-shot learning [snell2017prototypical, chen2018a] simply freeze the pretrained model and use it as a fixed feature extractor. While it was pointed out that fine-tuning the pretrained model on the novel classes data would lead to overfitting in the limited-data regime [snell2017prototypical, sung2018learning], our results show that the conclusion does not hold in cross-domain few-shot learning.

5.2.1 Single Model Methods

In this paper, we extensively evaluate the following common variants of single model fine-tuning:

  • Fixed feature extractor (Fixed): simply leverage the pretrained model as a fixed feature extractor.

  • Fine-tuning: Fine-tuning adjusts the pretrained parameters on the new task with standard supervised learning.

  • Fine-tuning last-k (Ft last-k): only the last $k$ layers of the pretrained model are optimized for the new task. In the paper, we consider Fine-tuning last-1, Fine-tuning last-2, and Fine-tuning last-3.

  • Transductive fine-tuning (Transductive Ft): in transductive fine-tuning, the statistics of the query images are used via batch normalization.


In addition, we compare these single model transfer learning techniques against a baseline embedding produced by a randomly initialized network (Random), that is, a fixed feature extractor with no pre-training. All the variants of single model fine-tuning use a linear classifier but differ in how they fine-tune the single model feature extractor.
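To make the role of query statistics in transductive fine-tuning concrete, the following sketch contrasts normalizing query features with stored source-domain statistics against normalizing them with statistics computed from the query batch itself. It is a simplified illustration, not the authors' implementation: the helper names, the toy feature values, and the omission of the learnable scale and shift of a real batch normalization layer are all assumptions.

```python
import math


def batch_statistics(batch):
    """Per-feature mean and (biased) variance over a batch of feature vectors."""
    n, dim = len(batch), len(batch[0])
    mean = [sum(x[j] for x in batch) / n for j in range(dim)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / n for j in range(dim)]
    return mean, var


def normalize(batch, mean, var, eps=1e-5):
    """Standardize each feature with the given statistics (no scale or shift)."""
    return [
        [(x[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(len(x))]
        for x in batch
    ]


# Hypothetical running statistics stored during source-domain training.
source_mean, source_var = [0.0, 0.0], [1.0, 1.0]
# Hypothetical query features from a shifted target domain.
query_features = [[4.0, 10.0], [6.0, 14.0], [5.0, 12.0]]

# Inductive: reuse the source statistics, so the outputs stay far from zero mean.
inductive = normalize(query_features, source_mean, source_var)

# Transductive: estimate the statistics from the query batch itself.
query_mean, query_var = batch_statistics(query_features)
transductive = normalize(query_features, query_mean, query_var)
```

With the query statistics, the normalized features are centered regardless of how far the target domain has drifted from the source, which is one intuition for why the transductive variant can help under domain shift.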

| Methods | ChestX, 5-shot | ChestX, 20-shot | ChestX, 50-shot | ISIC, 5-shot | ISIC, 20-shot | ISIC, 50-shot |
|---|---|---|---|---|---|---|
| Linear | 25.97% ± 0.41% | 31.32% ± 0.45% | 35.49% ± 0.45% | 48.11% ± 0.64% | 59.31% ± 0.48% | 66.48% ± 0.56% |
| Mean-centroid | 26.31% ± 0.42% | 30.41% ± 0.46% | 34.68% ± 0.46% | 47.16% ± 0.54% | 56.40% ± 0.53% | 61.57% ± 0.66% |
| Cosine-similarity | 26.95% ± 0.44% | 32.07% ± 0.55% | 34.76% ± 0.55% | 48.01% ± 0.49% | 58.13% ± 0.48% | 62.03% ± 0.52% |

| Methods | EuroSAT, 5-shot | EuroSAT, 20-shot | EuroSAT, 50-shot | CropDiseases, 5-shot | CropDiseases, 20-shot | CropDiseases, 50-shot |
|---|---|---|---|---|---|---|
| Linear | 79.08% ± 0.61% | 87.64% ± 0.47% | 91.34% ± 0.37% | 89.25% ± 0.51% | 95.51% ± 0.31% | 97.68% ± 0.21% |
| Mean-centroid | 82.21% ± 0.49% | 87.62% ± 0.34% | 88.24% ± 0.29% | 87.61% ± 0.47% | 93.87% ± 0.68% | 94.77% ± 0.34% |
| Cosine-similarity | 81.37% ± 1.54% | 86.83% ± 0.43% | 88.83% ± 0.38% | 89.15% ± 0.51% | 93.96% ± 0.46% | 94.27% ± 0.41% |

Table 3: The results of varying the classifier for fine-tuning on the proposed benchmark. All tasks are 5-way; entries are average accuracy ± confidence interval.

Another line of work for few-shot learning uses a broader variety of classifiers for transfer learning. For example, recent works show that a mean-centroid classifier and a cosine-similarity based classifier are more effective than a linear classifier for few-shot learning [mensink2013distance, chen2018a]. Therefore we study these two variations as well.

Mean-centroid classifier.

The mean-centroid classifier is inspired by ProtoNet [snell2017prototypical]. We are given the pretrained model $f_\theta$ and a support set $S = \{x_i, y_i\}_{i=1}^{K \times N}$, where $K$ is the number of novel classes and $N$ is the number of images per class. The class prototypes $c_k$ are computed in the same way as in ProtoNet. Then the likelihood that an unlabelled example $\hat{x}$ belongs to class $k$ is computed as,

$$p(y = k \mid \hat{x}) = \frac{\exp\left(-d(f_\theta(\hat{x}), c_k)\right)}{\sum_{l=1}^{K} \exp\left(-d(f_\theta(\hat{x}), c_l)\right)} \qquad (2)$$

where $d(\cdot)$ is a distance function. In the experiments, we use negative cosine similarity. Other distance functions such as Euclidean distance can also be used. Different from ProtoNet, $f_\theta$ is pretrained on the base classes data in a standard supervised learning way.
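As a rough illustration of the mean-centroid classifier, the sketch below computes class prototypes as the mean of support embeddings and turns negative-cosine distances into class probabilities with a softmax. The function names and the toy two-dimensional embeddings are hypothetical, and the embeddings are assumed to have already been produced by a pretrained feature extractor.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_prototypes(embeddings, labels, num_classes):
    """Prototype of class k = mean of the support embeddings labelled k."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for emb, y in zip(embeddings, labels):
        counts[y] += 1
        for j, value in enumerate(emb):
            sums[y][j] += value
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]


def mean_centroid_probs(query_embedding, prototypes):
    """Softmax over -d, with d = negative cosine similarity to each prototype."""
    # Since d = -cos, the logit -d is simply the cosine similarity.
    logits = [cosine_similarity(query_embedding, c) for c in prototypes]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-way 2-shot support set (hypothetical embeddings).
support_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
prototypes = class_prototypes(support_embeddings, support_labels, num_classes=2)
probs = mean_centroid_probs([1.0, 0.2], prototypes)
predicted_class = probs.index(max(probs))  # class 0 for this query
```

Swapping `cosine_similarity` for a negated Euclidean distance would give the Euclidean variant mentioned in the text.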

Cosine-similarity based classifier.

In the cosine-similarity based classifier, instead of directly computing the class prototypes using the pretrained model, each class $k$ is represented as an $m$-dimensional weight vector $w_k$ which is initialized randomly. For each unlabelled example $\hat{x}$, the cosine similarity to each weight vector is computed as $s_k = \frac{f_\theta(\hat{x})^\top w_k}{\lVert f_\theta(\hat{x}) \rVert \, \lVert w_k \rVert}$. The predictive probability that the example $\hat{x}$ belongs to class $k$ is computed by normalizing the cosine similarities with a softmax function. Intuitively, the weight vector $w_k$ can be thought of as the prototype of class $k$.

| Methods | ChestX, 5-shot | ChestX, 20-shot | ChestX, 50-shot | ISIC, 5-shot | ISIC, 20-shot | ISIC, 50-shot |
|---|---|---|---|---|---|---|
| All embeddings | 26.74% ± 0.42% | 32.77% ± 0.47% | 38.07% ± 0.50% | 46.86% ± 0.60% | 58.57% ± 0.59% | 66.04% ± 0.56% |
| IMS-f | 25.50% ± 0.45% | 31.49% ± 0.47% | 36.40% ± 0.50% | 45.84% ± 0.62% | 61.50% ± 0.58% | 68.64% ± 0.53% |

| Methods | EuroSAT, 5-shot | EuroSAT, 20-shot | EuroSAT, 50-shot | CropDiseases, 5-shot | CropDiseases, 20-shot | CropDiseases, 50-shot |
|---|---|---|---|---|---|---|
| All embeddings | 81.29% ± 0.62% | 89.90% ± 0.41% | 92.76% ± 0.34% | 90.82% ± 0.48% | 96.64% ± 0.25% | 98.14% ± 0.18% |
| IMS-f | 83.56% ± 0.59% | 91.22% ± 0.38% | 93.85% ± 0.30% | 90.66% ± 0.48% | 97.18% ± 0.24% | 98.43% ± 0.16% |

Table 4: The results of using all embeddings, and the proposed Incremental Multi-model Selection (IMS-f) based on fine-tuned pretrained models on the proposed benchmark. All tasks are 5-way; entries are average accuracy ± confidence interval.

5.2.2 Transfer from Multiple Pretrained Models

Traditional transfer learning methods for few-shot learning only consider using one pretrained model for feature extraction. In this section, we propose a novel method that utilizes multiple models pretrained on different source datasets from domains similar to ImageNet, where all source datasets are still disjoint from the target datasets, for cross-domain few-shot learning. The intuition is that by using models pretrained on different datasets, we can obtain more diverse and richer visual features. Unlike previous works on ensemble methods for few-shot learning [dvornik2019diversity] that train diverse models on the same source dataset, the proposed method requires no change to how models are trained and is an off-the-shelf solution to leverage existing pretrained models for cross-domain few-shot learning, without requiring access to the source datasets.

Assume we have a library of $C$ pretrained models $\{M_c\}_{c=1}^{C}$ which are trained on various datasets in a standard way. We denote the layers of all pretrained models as a set $F$. Given a support set $S = \{x_i, y_i\}_{i=1}^{K \times N}$ where $(x_i, y_i) \sim P_t$, our goal is to find a subset $I$ of the layers to generate a feature vector for each example in order to achieve the lowest test error. Mathematically,

$$\arg\min_{I \subseteq F} \; \mathbb{E}_{(x, y) \sim P_t}\left[\ell\left(f_s\left(T\left(\{l(x) : l \in I\}\right)\right), y\right)\right] \qquad (3)$$

where $\ell$ is a loss function, $T(\cdot)$ is a function which combines a set of feature vectors, $l$ is one particular layer in the set $I$, and $f_s$ is a linear classifier. Practically, for feature vectors coming from inner layers which are three-dimensional, we convert them to one-dimensional vectors by using Global Average Pooling. Since Eq. 3 is generally intractable, we instead adopt a two-stage greedy selection method, called Incremental Multi-model Selection, to iteratively find the best subset of layers for a given support set $S$.

In the first stage, for each pretrained model, we train a linear classifier on the feature vector generated by each layer individually and select the layer which achieves the lowest average error using five-fold cross-validation on the support set $S$. Essentially, the goal of the first stage is to find the most effective layer of each pretrained model given the task, in order to reduce the search space and mitigate the risk of overfitting. For convenience, we denote the layers selected in the first selection stage as set $I_1$. In the second stage, we greedily add the layers in $I_1$ into the set $I$ following a similar cross-validation procedure. First, we add to $I$ the layer in $I_1$ which achieves the lowest cross-validation error. Then we iterate over $I_1$, and add each remaining layer into $I$ if the cross-validation error is reduced when the new layer is added. Finally, we concatenate the feature vectors generated by each layer in set $I$ and train the final linear classifier. Please see Algorithm 1 in the Appendix for further details.
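The two-stage greedy procedure can be sketched as follows. This is a minimal sketch and not the paper's Algorithm 1: the cross-validation step is abstracted into a caller-supplied `cv_error` function (assumed to train a linear classifier on the concatenated features of the given layers and return the five-fold cross-validation error on the support set), the layer names and toy error values are invented for illustration, and the iteration order over the stage-one layers is an assumption.

```python
def incremental_multi_model_selection(layers_per_model, cv_error):
    """Two-stage greedy selection of layers across pretrained models.

    layers_per_model: dict mapping a model name to a list of layer identifiers.
    cv_error: callable taking a list of layer identifiers and returning the
        cross-validation error obtained with their concatenated features.
    """
    # Stage 1: keep the single best layer of each pretrained model.
    stage_one = [
        min(layers, key=lambda layer: cv_error([layer]))
        for layers in layers_per_model.values()
    ]

    # Stage 2: start from the best stage-one layer, then add each remaining
    # layer only if it reduces the cross-validation error.
    first = min(stage_one, key=lambda layer: cv_error([layer]))
    selected = [first]
    best_error = cv_error(selected)
    for layer in stage_one:
        if layer == first:
            continue
        candidate_error = cv_error(selected + [layer])
        if candidate_error < best_error:
            selected.append(layer)
            best_error = candidate_error
    return selected, best_error


# Toy stand-in for cross-validation (all numbers are made up).
single_error = {
    "A/l1": 0.40, "A/l2": 0.30,
    "B/l1": 0.35, "B/l2": 0.50,
    "C/l1": 0.45, "C/l2": 0.60,
}
added_effect = {"B/l1": -0.05, "C/l1": 0.02}  # change in error when added


def toy_cv_error(layers):
    best = min(layers, key=single_error.get)
    return single_error[best] + sum(
        added_effect.get(l, 0.0) for l in layers if l != best
    )


toy_layers = {"A": ["A/l1", "A/l2"], "B": ["B/l1", "B/l2"], "C": ["C/l1", "C/l2"]}
selected, error = incremental_multi_model_selection(toy_layers, toy_cv_error)
print(selected)  # ['A/l2', 'B/l1']
```

In the toy run, the layer from model C is rejected because adding it raises the stand-in error; in practice the features of the selected layers would then be concatenated to train the final linear classifier.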

6 Evaluation Setup

For meta-learning methods, we meta-train all methods on the base classes of miniImageNet [vinyals2016matching] and meta-test the trained models on each dataset of the proposed benchmark. For transfer learning methods, we train the pretrained model on the base classes of miniImageNet. For transferring from multiple pretrained models, we use a maximum of five pretrained models, trained on miniImageNet, CIFAR100 [krizhevsky2009learning], DTD [cimpoi2014describing], CUB [WelinderEtal2010], and Caltech256 [griffin2007caltech], respectively. In all experiments we consider 5-way 5-shot, 5-way 20-shot, and 5-way 50-shot tasks. In all cases, the test (query) set has 15 images per class. All experiments are performed with ResNet-10 [he2016deep] for fair comparison. For each evaluation, we use the same 600 randomly sampled few-shot episodes (for consistency), and report the average accuracy and 95% confidence interval.
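The reported "average accuracy and confidence interval" over episodes can be computed along the following lines. This sketch assumes a normal approximation (half-width 1.96 times the standard error), which is a common convention but is not spelled out in the text; the function name and the simulated accuracies are hypothetical.

```python
import math
import random
import statistics


def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of a 95% confidence interval.

    Assumes a normal approximation: half-width = 1.96 * std / sqrt(n).
    """
    mean = statistics.fmean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation
    half_width = 1.96 * std / math.sqrt(len(accuracies))
    return mean, half_width


# Simulated per-episode accuracies for 600 episodes (made-up values).
rng = random.Random(0)
episode_accuracies = [rng.gauss(0.68, 0.06) for _ in range(600)]
mean_acc, ci = mean_and_ci95(episode_accuracies)
print(f"{100 * mean_acc:.2f}% ± {100 * ci:.2f}%")
```

Because the half-width shrinks with the square root of the number of episodes, evaluating on 600 episodes keeps the interval well under one percentage point for per-episode standard deviations of this size.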

During the training (meta-training) stage, models used for transfer learning and meta-learning models are both trained for 400 epochs with Adam optimizer. The learning rate is set to 0.001. During testing (meta-testing), both transfer learning methods and those meta-learning methods that require adaptation on the support set of the test episodes (MAML, RelationNet, etc.) use SGD with momentum. The learning rate is 0.01 and the momentum rate is 0.9. All variants of fine-tuning methods are trained for 100 epochs. In the training or meta-training stage, we apply standard data augmentation including random crop, random flip, and color jitter.

7 Experimental Results

Results are discussed according to method categories as described in Section 5. First, results from state-of-art meta-learning methods are presented in Section 7.1. Next, transfer learning is evaluated and analyzed in Section 7.2, including single model transfer in Section 7.2.1, followed by multi-model transfer in Section 7.2.2. Finally, a succinct best-in-category comparison is presented in Section 7.3.

7.1 Meta-learning based results

Table 1 shows the results of meta-learning on the proposed benchmark, for each dataset, method, and shot level. Across all datasets and shot levels, the average accuracies (and 95% confidence intervals) are 50.21% (0.70) for MatchingNet, 38.75% (0.41) for MAML, 59.78% (0.70) for ProtoNet, 54.48% (0.71) for RelationNet, and 57.35% (0.68) for MetaOpt. The performance of MAML was impacted by its inability to scale to larger shot levels due to memory overflow; its reported average is consistent with counting the missing 50-shot entries as zero.

What is immediately apparent from Table 1 is that performance in general correlates strongly and positively with the dataset’s similarity to ImageNet, confirming that the benchmark’s intentional design allows us to investigate few-shot learning across a spectrum of cross-domain difficulties.

7.2 Transfer learning based results

7.2.1 Single model results

Table 2 shows the results of various single model transfer learning methods on the proposed benchmark. Across all datasets and shot levels, the average accuracies (and 95% confidence intervals) are 53.99% (1.38) for random embedding, 64.24% (0.59) for fixed feature embedding, 67.23% (0.46) for fine-tuning, 67.41% (0.49) for fine-tuning the last 1 layer, 67.26% (0.53) for fine-tuning the last 2 layers, 67.17% (0.58) for fine-tuning the last 3 layers, and 68.14% (0.56) for transductive fine-tuning. From these results, several observations can be made. The first observation is that, although meta-learning methods have been previously shown to achieve higher performance than transfer learning in the standard few-shot learning setting [vinyals2016matching, chen2018a], in the cross-domain few-shot learning setting this situation is reversed: meta-learning methods significantly underperform simple fine-tuning methods. In fact, MatchingNet performs worse than a randomly generated fixed embedding. A possible explanation is that meta-learning methods are fitting the task distribution on the base class data, improving performance in that circumstance, but hindering the ability to generalize to another task distribution. The second observation is that, by leveraging the statistics of the test data, transductive fine-tuning achieves higher results than standard fine-tuning. This suggests that reliable estimates of statistics are difficult to measure with only a few examples. The third observation is that the accuracy of most methods on the benchmark continues to depend on how similar the dataset is to ImageNet: CropDiseases commands the highest performance on average, while EuroSAT follows in 2nd place, ISIC in 3rd, and ChestX in 4th. This further supports the motivation behind the benchmark design in targeting applications with increasing visual domain dissimilarity to ImageNet.

Table 3 shows results from varying the classifier. While the mean-centroid and cosine-similarity classifiers have been shown to be more efficient than a simple linear classifier in the conventional few-shot learning setting, our results show that they have only a marginal advantage over the linear classifier on ChestX and EuroSAT in the 5-shot case (Table 3). As the number of shots increases, the linear classifier begins to dominate the mean-centroid and cosine-similarity classifiers. One plausible reason is that both the mean-centroid and cosine-similarity classifiers conduct classification based on unimodal class prototypes; as the number of examples increases, a unimodal distribution becomes less suitable for representing them, and a multi-modal distribution is required.
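To make the prototype-based classifiers concrete, the following is a minimal sketch of a mean-centroid classifier and a cosine-similarity classifier operating on precomputed feature vectors. It assumes that both use the per-class mean of the support features as the prototype; the function names and the toy two-dimensional features are hypothetical, and the classifiers evaluated in Table 3 may differ in detail (for example, the cosine-similarity prototypes may be learned rather than averaged).

```python
import math


def class_prototypes(support_features, support_labels):
    """Average the support features of each class into a single prototype."""
    sums, counts = {}, {}
    for feat, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def predict_mean_centroid(query, prototypes):
    """Assign the class whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda c: squared_distance(query, prototypes[c]))


def predict_cosine(query, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda c: cosine_similarity(query, prototypes[c]))


# Toy data: class "A" is bimodal, so its mean falls between its two clusters.
features = [[-2.0, 0.0], [2.0, 0.0], [0.5, 0.0], [0.5, 0.2]]
labels = ["A", "A", "B", "B"]
protos = class_prototypes(features, labels)

query = [1.9, 0.0]  # lies next to an "A" support example
print(predict_mean_centroid(query, protos))  # "B": the single prototype misses the mode
```

The toy example shows the failure mode described above: a single prototype per class cannot represent a class whose examples form more than one cluster, whereas a classifier trained on all support examples is not restricted in this way.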

We further analyze how layers are changed during transfer. We use $\theta$ to denote the original pretrained parameters and $\tilde{\theta}$ to denote the parameters after fine-tuning. Figure 2 shows the relative parameter change of the ResNet10 miniImageNet pretrained model as $\frac{|\theta - \tilde{\theta}|}{|\theta|}$, averaged over all parameters per layer and over 100 runs. Several interesting observations can be made from these results. First, across all datasets and all shot levels, the first layer of the pretrained model changes the most. This indicates that if the target domain is different from the source domain, the lower layers of the pretrained model still need to be adjusted. Second, although the datasets are drastically different, we observe that some layers are consistently more transferable than others. One plausible explanation for this phenomenon is the heterogeneous characteristic of layers in overparameterized deep neural networks [zhang2019all].
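The per-layer relative change can be sketched as follows. This is an illustration under stated assumptions: the relative change is taken elementwise as |θ − θ̃| / |θ| and then averaged within each layer, a small constant guards against division by zero, and the layer names and parameter values are hypothetical. Averaging over runs would repeat this computation and average the resulting per-layer values.

```python
def relative_change_per_layer(pretrained, finetuned, eps=1e-12):
    """Average elementwise relative change |theta - theta_tilde| / |theta| per layer.

    Both arguments map a layer name to a flat list of parameter values.
    `eps` is an assumption added here to avoid division by zero.
    """
    changes = {}
    for layer, theta in pretrained.items():
        theta_tilde = finetuned[layer]
        ratios = [abs(a - b) / (abs(a) + eps) for a, b in zip(theta, theta_tilde)]
        changes[layer] = sum(ratios) / len(ratios)
    return changes


# Hypothetical parameters for two layers, before and after fine-tuning.
pretrained = {"conv1": [1.0, -2.0], "layer4": [4.0, 5.0]}
finetuned = {"conv1": [1.5, -1.0], "layer4": [4.0, 5.5]}

for layer, change in relative_change_per_layer(pretrained, finetuned).items():
    print(f"{layer}: {change:.3f}")
```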

Figure 2: Relative change of pretrained network layers for single model transfer.
Figure 3: Histograms showing frequency of source model selection for each dataset in the benchmark.
                # of models
Dataset         2         3         4         5
ChestX          34.35%    36.29%    37.64%    37.89%
ISIC            59.4%     62.49%    65.07%    64.77%
EuroSAT         91.71%    93.49%    92.67%    93.00%
CropDiseases    98.43%    98.09%    98.05%    98.60%
Table 5: Effect of the number of pretrained models on test accuracy.

7.2.2 Transfer from Multiple Pretrained Models

The results of the proposed Incremental Multi-model Selection (IMS) are shown in Table 4. IMS-f fine-tunes each pretrained model before applying the model selection. We include a baseline called all embeddings, which concatenates the feature vectors generated by all the layers from the fine-tuned models. Across all datasets and shot levels, the average accuracies (and 95% confidence intervals) are 68.22% (0.45) for all embeddings and 68.69% (0.44) for IMS-f. The results show that IMS-f generally improves upon all embeddings, which indicates the importance of selecting pretrained models that are relevant to the target dataset. Model complexity also tends to decrease by over 20% on average compared to all embeddings. We can also observe that using multiple pretrained models is more beneficial than using just one model. Compared with standard fine-tuning with a linear classifier, the average improvement of IMS-f across all shot levels is 0.20% on ChestX, 0.69% on ISIC, 3.52% on EuroSAT, and 1.27% on CropDiseases.

In further analysis, we study the effect of the number of pretrained models on the proposed multi-model selection method. We consider libraries consisting of two, three, four, and all five pretrained models. The pretrained models are added to the library in the order ImageNet, CIFAR100, DTD, CUB, Caltech256. For each dataset, the experiment is conducted on 5-way 50-shot with 600 episodes. The results are shown in Table 5. As more pretrained models are added to the library, we observe that the test accuracy on ChestX and ISIC gradually improves, which can be attributed to the diverse features provided by the different pretrained models. However, on EuroSAT and CropDiseases, only a marginal improvement can be observed. One possible reason is that the features from ImageNet already capture the characteristics of these datasets, so additional pretrained models do not provide further information.

Finally, we visualize, for each dataset, which pretrained models are selected by the proposed incremental multi-model selection. The experiments are conducted on 5-way 50-shot with all five pretrained models. For each dataset, we repeat the experiment for 600 episodes and calculate the frequency with which each model is selected. The results are shown in Figure 3. We observe that the distribution of selection frequencies differs significantly across datasets. This demonstrates that target datasets can benefit from features from different pretrained models.
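The selection frequencies behind such a histogram can be tallied as in the following minimal sketch. It assumes that each episode yields the list of source models whose layers were selected, and that a model is counted at most once per episode; the function name and the episode data are hypothetical.

```python
from collections import Counter


def selection_frequency(selected_per_episode, model_names):
    """Fraction of episodes in which each source model was selected."""
    counts = Counter()
    for selected in selected_per_episode:
        counts.update(set(selected))  # count a model at most once per episode
    n = len(selected_per_episode)
    return {name: counts[name] / n for name in model_names}


# Hypothetical selections for four episodes (not results from the paper).
library = ["ImageNet", "CIFAR100", "DTD", "CUB", "Caltech256"]
episodes = [["ImageNet", "DTD"], ["ImageNet"], ["CUB", "ImageNet"], ["DTD"]]

for name, freq in selection_frequency(episodes, library).items():
    print(f"{name}: {freq:.2f}")
```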

Figure 4: Top performing meta-learning, single model, and multi-model transfer learning.

7.3 Best-in-category Comparison

Figure 4 summarizes the comparison across best-in-category algorithms, according to the average accuracy across all datasets and shot levels in the benchmark. The degradation in performance suffered by meta-learning approaches is significantly greater than the gain between single model and multi-model learning strategies, emphasizing the risk of employing meta-learning strategies for few-shot learning when the application domain may sustain any degree of drift. In addition, the relative performance gain with an increasing number of shots is greater with transfer methods than with meta-learning: 5.8% and 5.7% from 20-shot to 50-shot for single model and multi-model transfer learning, respectively, versus 1.8% for meta-learning. The gain from 5-shot to 20-shot was similar for all methods: 14.5%, 13.6%, and 14.6% for meta-learning, single model, and multi-model, respectively.

8 Conclusion

In this paper, we formally introduce the problem of cross-domain few-shot learning (CD-FSL) and establish the new CD-FSL benchmark, which covers several target domains with varying similarity to the ImageNet source domain. We extensively analyze and evaluate existing meta-learning methods and variants of transfer learning. The results show that meta-learning approaches significantly underperform fine-tuning methods. In addition, the relative performance gain with increasing data is greater with transfer methods than with meta-learning. Finally, we propose a multi-model selection method that leverages multiple pretrained models from multiple source datasets with domains similar to ImageNet, and demonstrate that this method yields higher average performance than any single model fine-tuning approach. In conclusion, due to its spectrum of diversity and coverage, the proposed benchmark serves as a challenging platform to guide research on cross-domain few-shot learning. Future work may expand the target domains further.


9 Appendix

9.1 Incremental Multi-model Selection

/* First stage */
I_1 = {}
/* Iterate over each pretrained model */
for c = 1 to C do
      min_loss = -1    /* -1 marks "not yet set" */
      best_l = None
      /* Iterate over each layer of the pretrained model */
      for l = 1 to L do
            if min_loss = -1 or CW(S, {l}) < min_loss then
                  best_l = l
                  min_loss = CW(S, {l})
      I_1 = I_1 ∪ {best_l}
/* Second stage */
I = {}
min_loss = -1
for each l in I_1 do
      if min_loss = -1 or CW(S, I ∪ {l}) < min_loss then
            min_loss = CW(S, I ∪ {l})
            I = I ∪ {l}
Concatenate the feature vectors generated by the layers in I and train a linear classifier.
Algorithm 1 Incremental Multi-model Selection. S is a support set consisting of examples from novel classes. Assume there is a library of C pretrained models {M_c}, c = 1, ..., C. Each model has L layers and l is used to denote one particular layer. Let CW(S, I) be a function which returns the average cross-validation error given a dataset S and a set of layers I which are used to generate the feature vector.
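The two stages of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: `cv_error` is a hypothetical callable standing in for the average cross-validation error on the support set, layers are identified by (model, layer) pairs, the second stage visits candidates in library order, and an infinite initial loss replaces the -1 sentinel. The toy error function at the end is invented purely to exercise the control flow.

```python
def incremental_multi_model_selection(models, cv_error):
    """Two-stage greedy selection of layers from a library of pretrained models.

    models:   dict mapping a model name to a list of its layer identifiers.
    cv_error: hypothetical callable that takes a list of (model, layer) pairs
              and returns the average cross-validation error obtained when
              their features are used on the support set.
    """
    # First stage: keep the single best layer of each pretrained model.
    candidates = []
    for name, layers in models.items():
        best = min(layers, key=lambda layer: cv_error([(name, layer)]))
        candidates.append((name, best))

    # Second stage: add a candidate only if it lowers the error.
    selected, min_loss = [], float("inf")
    for candidate in candidates:
        loss = cv_error(selected + [candidate])
        if loss < min_loss:
            selected, min_loss = selected + [candidate], loss
    return selected, min_loss


# Toy stand-in for the cross-validation error: each layer contributes some
# "kinds" of features, and the error drops as more kinds are covered.
KINDS = {
    ("A", "l1"): {"edges"},
    ("A", "l2"): {"edges", "texture"},
    ("B", "l1"): {"texture"},
    ("B", "l2"): set(),
    ("C", "l1"): {"shape"},
}


def toy_cv_error(layers):
    covered = set().union(*(KINDS[layer] for layer in layers))
    return 1.0 - 0.2 * len(covered)


library = {"A": ["l1", "l2"], "B": ["l1", "l2"], "C": ["l1"]}
selected, loss = incremental_multi_model_selection(library, toy_cv_error)
print(selected, round(loss, 2))
```

In the toy run, the best layer of model B is skipped in the second stage because it adds nothing beyond what model A already provides, which mirrors the motivation for selecting only relevant pretrained models rather than concatenating all embeddings.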