Self-Supervised Prototypical Transfer Learning for Few-Shot Classification

June 19, 2020 · Carlos Medina et al. · EPFL

Most approaches in few-shot learning rely on costly annotated data related to the goal task domain during (pre-)training. Recently, unsupervised meta-learning methods have exchanged the annotation requirement for a reduction in few-shot classification performance. Simultaneously, in settings with realistic domain shift, common transfer learning has been shown to outperform supervised meta-learning. Building on these insights and on advances in self-supervised learning, we propose a transfer learning approach which constructs a metric embedding that clusters unlabeled prototypical samples and their augmentations closely together. This pre-trained embedding is a starting point for few-shot classification by summarizing class clusters and fine-tuning. We demonstrate that our self-supervised prototypical transfer learning approach ProtoTransfer outperforms state-of-the-art unsupervised meta-learning methods on few-shot tasks from the mini-ImageNet dataset. In few-shot experiments with domain shift, our approach even has comparable performance to supervised methods, but requires orders of magnitude fewer labels.


1 Introduction

Few-shot classification (Fei-Fei et al., 2006) is a learning task in which a classifier must adapt to distinguish novel classes not seen during training, given only a few examples (shots) of these classes. Meta-learning (Finn et al., 2017; Ren et al., 2018) is a popular approach to few-shot classification that mimics the test setting during training through so-called episodes of learning with few examples from the training classes. However, several works (Chen et al., 2019b; Guo et al., 2019) show that common (non-episodic) transfer learning outperforms meta-learning methods in the realistic cross-domain setting, where training and novel classes come from different distributions.

Nevertheless, most few-shot classification methods still require much annotated data for pre-training. Recently, several unsupervised meta-learning approaches, constructing episodes via pseudo-labeling (Hsu et al., 2019; Ji et al., 2019) or image augmentations (Khodadadeh et al., 2019; Antoniou and Storkey, 2019; Qin et al., 2020), have addressed this problem. To our knowledge, unsupervised non-episodic techniques for transfer learning to few-shot tasks have not yet been explored.

Our approach ProtoTransfer performs self-supervised pre-training on an unlabeled training domain and can transfer to few-shot target domain tasks. During pre-training, we minimize a pairwise distance loss in order to learn an embedding that clusters noisy transformations of the same image around the original image. Our pre-training loss can be seen as a self-supervised version of the prototypical loss in Snell et al. (2017) in line with contrastive learning, which has driven recent advances in self-supervised representation learning (Ye et al., 2019; Chen et al., 2020; He et al., 2019). In the few-shot target task, in line with pre-training, we summarize class information in class prototypes for nearest neighbor inference similar to ProtoNet (Snell et al., 2017) and we support fine-tuning to improve performance when multiple examples are available per class.

We highlight our main contributions and results:

  1. We show that our approach outperforms state-of-the-art unsupervised meta-learning methods by 4% to 8% on mini-ImageNet few-shot classification tasks and has competitive performance on Omniglot.

  2. Compared to the fully supervised setting, our approach achieves competitive performance on mini-ImageNet and multiple datasets from the cross-domain transfer learning CDFSL benchmark, with the benefit of not requiring labels during training.

  3. In an ablation study and cross-domain experiments, we show that using a larger number of effective training classes than is common in episodic meta-learning, together with parametric fine-tuning, is key to matching the performance of supervised approaches.

2 A Self-Supervised Prototypical Transfer Learning Algorithm

Section 2.1 introduces the few-shot classification setting and relevant terminology. Further, we describe ProtoTransfer’s pre-training stage, ProtoCLR, in Section 2.2 and its fine-tuning stage, ProtoTune, in Section 2.3. Figure 1 illustrates the procedure.

(a) Self-Supervised Prototypical Pre-Training         (b) Prototypical Fine-Tuning & Inference
Figure 1: Self-Supervised Prototypical Transfer Learning. (a): In the embedding, original images serve as class prototypes around which their augmented views should cluster. (b): Prototypes are the means of embedded support examples for each class and initialize a final linear layer for fine-tuning. An embedded query point is classified via a softmax over the fine-tuned linear layer.

2.1 Preliminaries

The goal of few-shot classification is to predict classes for a set of unlabeled points (the query set) given a small set of labeled examples (the support set) from the same classes. Few-shot classification approaches commonly consist of two subsequent learning phases, each using its own set of classes.

The first learning phase utilizes samples from base (training) classes contained within a training set D_train = {(x_i, y_i)}, where x_i is a sample with label y_i in the label set Y_train. An important aspect of our specific unsupervised learning setting is that the first phase has no access to the per-sample label information, the distribution of classes, or the size of the label set Y_train during pre-training. This first phase serves as preparation for the actual few-shot learning in the target domain, i.e. the second learning phase. This second, supervised learning phase contains novel (testing) classes with labels in Y_novel, disjoint from Y_train, where only a few examples for each of the classes in Y_novel are available. Concretely, an N-way K-shot classification task consists of K labeled examples for each of the N novel classes. In the few-shot learning literature, a task is also commonly referred to as an episode.
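The episode structure above can be made concrete with a small sketch. The helper below is our own illustration (not part of the paper's code); names and the toy dataset are hypothetical.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, n_query=15, seed=None):
    """Sample one N-way K-shot episode (support + query) from a
    {class_label: [samples]} dict. Purely illustrative helper."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)          # pick N novel classes
    support, query = [], []
    for label in classes:
        picks = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picks[:k_shot]]   # K labeled shots
        query += [(x, label) for x in picks[k_shot:]]     # to be classified
    return support, query

# toy dataset: 10 classes with 30 samples each
data = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=15, seed=0)
```

A 5-way 1-shot episode thus yields 5 support samples and, here, 75 query samples drawn from the same 5 classes.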

2.2 Self-Supervised Prototypical Pre-Training: ProtoCLR

Similar to the few-shot target tasks, we frame every ProtoCLR pre-training step as an n-way 1-shot classification task optimized by a contrastive loss function, as described below. In this, we draw inspiration from recent progress in unsupervised meta-learning (Khodadadeh et al., 2019) and self-supervised visual contrastive learning of representations (Chen et al., 2020; Ye et al., 2019).

1:  input: batch size n, number of query augmentations Q, embedding function f_θ, set of random transformations T, learning rate α, distance function d
2:  Randomly initialize θ
3:  while not done do
4:     Sample mini-batch {x_i}_{i=1}^{n}
5:     for all i ∈ {1, …, n} do
6:        for all q ∈ {1, …, Q} do
7:           draw a random transformation t ~ T
8:           x̃_{i,q} ← t(x_i)
9:        end for
10:    end for
11:    let ℓ_{i,q} = −log [ exp(−d(f_θ(x̃_{i,q}), f_θ(x_i))) / Σ_{k=1}^{n} exp(−d(f_θ(x̃_{i,q}), f_θ(x_k))) ]
12:    L ← (1 / nQ) Σ_{i=1}^{n} Σ_{q=1}^{Q} ℓ_{i,q}
13:    θ ← θ − α ∇_θ L
14: end while
15: return embedding function f_θ
Algorithm 1 Self-Supervised Prototypical Pre-Training (ProtoCLR)

Algorithm 1 details ProtoCLR, which comprises the following parts:

  • Batch generation (Algorithm 1, lines 4-10): Each mini-batch contains n random samples from the training set. As our self-supervised setting does not assume any knowledge about the base class labels, we treat each sample as its own class. Thus, each sample serves as a 1-shot support sample and class prototype. For each prototype x_i, Q different randomly augmented versions x̃_{i,q} are used as query samples.

  • Contrastive prototypical loss optimization (Algorithm 1, lines 11-13): The pre-training loss encourages clustering of augmented query samples around their prototype in the embedding space through a distance metric d. The softmax cross-entropy loss over the n prototype classes is minimized with respect to the embedding parameters θ with mini-batch stochastic gradient descent (SGD).

Commonly, unsupervised pre-training approaches for few-shot classification (Hsu et al., 2019; Khodadadeh et al., 2019; Antoniou and Storkey, 2019; Qin et al., 2020; Ji et al., 2019) rely on meta-learning. Thus, they are required to create small artificial N-way (K-shot) tasks identical to the downstream few-shot classification tasks. Our approach does not use meta-learning and can use any batch size n. Larger batch sizes have been shown to help self-supervised representation learning (Chen et al., 2020) and supervised pre-training for few-shot classification (Snell et al., 2017). We also find that larger batches yield a significant performance improvement for our approach (see Section 3.3). To generate the query examples, we use image augmentations similar to Chen et al. (2020) and adjust them for every dataset. The exact transformations are listed in Appendix A.3. Following Snell et al. (2017), we use the Euclidean distance, but our method is generic and works with any metric.
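The ProtoCLR objective (Algorithm 1, lines 11-13) can be sketched in a few lines of NumPy. This is our own minimal illustration of a prototypical softmax cross-entropy over squared Euclidean distances, not the authors' implementation; a real training step would backpropagate this loss through an embedding network.

```python
import numpy as np

def protoclr_loss(z_proto, z_query):
    """Self-supervised prototypical loss over embedded samples.

    z_proto: (n, d) embeddings of the n original images (1-shot prototypes).
    z_query: (n, Q, d) embeddings of Q augmented views of each image.
    Each view (i, q) should be classified as prototype i via a softmax
    over negative squared Euclidean distances to all n prototypes.
    """
    n, Q, d = z_query.shape
    flat = z_query.reshape(n * Q, d)
    # (nQ, n) squared Euclidean distances from every view to every prototype
    dists = ((flat[:, None, :] - z_proto[None, :, :]) ** 2).sum(axis=-1)
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.repeat(np.arange(n), Q)                 # view (i, q) -> class i
    return -log_p[np.arange(n * Q), targets].mean()

rng = np.random.default_rng(0)
proto = rng.normal(size=(8, 16))
tight = proto[:, None, :] + 0.01 * rng.normal(size=(8, 3, 16))  # good clustering
loose = rng.normal(size=(8, 3, 16))                             # uninformative
loss_tight = protoclr_loss(proto, tight)
loss_loose = protoclr_loss(proto, loose)
```

As expected, the loss is close to zero when augmented views cluster tightly around their prototypes and large for an uninformative embedding.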

2.3 Supervised Prototypical Fine-Tuning: ProtoTune

After pre-training the metric embedding f_θ, we address the target task of few-shot classification. For this, we extend the prototypical nearest-neighbor classifier ProtoNet (Snell et al., 2017) with prototypical fine-tuning of a final classification layer, which we refer to as ProtoTune. First, the class prototypes c_k are computed as the mean of the embedded class samples in the support set S_k of the few-shot task:

c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_θ(x_i)

ProtoNet uses non-parametric nearest-neighbor classification with respect to the prototypes c_k and can be interpreted as a linear classifier applied to a learned representation f_θ(x). Following the derivation in Snell et al. (2017), we initialize a final linear layer with weights W_k = 2 c_k and biases b_k = −‖c_k‖². Then, this final layer is fine-tuned with a softmax cross-entropy loss on samples from the support set, while keeping the embedding parameters θ fixed. Triantafillou et al. (2020) proposed a similar fine-tuning approach with prototypical initialization, but their approach always fine-tunes all model parameters.
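The prototypical initialization can be sketched as follows. This is an illustrative NumPy reconstruction based on the derivation in Snell et al. (2017): since −‖z − c_k‖² = 2 z·c_k − ‖c_k‖² − ‖z‖² and the ‖z‖² term is constant across classes, a linear layer with W_k = 2 c_k and b_k = −‖c_k‖² ranks classes identically to nearest-prototype classification. Function names and the toy data are our own.

```python
import numpy as np

def prototype_init(support_emb, support_labels, n_way):
    """Initialize a linear layer from class prototypes (ProtoTune-style sketch).

    The scores z @ W.T + b order classes exactly like negative squared
    Euclidean distance to the prototypes c_k.
    """
    protos = np.stack([support_emb[support_labels == k].mean(axis=0)
                       for k in range(n_way)])
    W = 2.0 * protos
    b = -(protos ** 2).sum(axis=1)
    return W, b

def predict(z, W, b):
    # softmax is monotone, so the argmax of the linear scores suffices
    return (z @ W.T + b).argmax(axis=1)

rng = np.random.default_rng(0)
centers = 5.0 * rng.normal(size=(5, 8))                  # well-separated classes
labels = np.repeat(np.arange(5), 4)                      # 5-way 4-shot support
emb = centers[labels] + 0.1 * rng.normal(size=(20, 8))   # embedded support set
W, b = prototype_init(emb, labels, n_way=5)
preds = predict(emb, W, b)
```

In ProtoTransfer, this layer would then be fine-tuned with a softmax cross-entropy loss while keeping f_θ fixed (or, in the cross-domain setting of Section 3.2, while fine-tuning all parameters).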

3 Experiments

We carry out several experiments to benchmark and analyze ProtoTransfer. In Section 3.1, we conduct in-domain classification experiments on the Omniglot (Lake et al., 2011) and mini-ImageNet (Vinyals et al., 2016) benchmarks to compare to state-of-the-art unsupervised few-shot learning approaches and methods with supervised pre-training. In Section 3.2, we test our method on a more challenging cross-domain few-shot learning benchmark (Guo et al., 2019). Section 3.3 contains an ablation study showing how the different components of ProtoTransfer contribute to its performance. In Section 3.4, we study how pre-training with varying class diversity affects performance. In Section 3.5, we give insight into generalization from training classes to novel classes from both unsupervised and supervised perspectives. Experimental details can be found in Appendix A; our code and pre-trained models are available at https://www.github.com/indy-lab/ProtoTransfer.

3.1 In-Domain Few-shot Classification: Omniglot and mini-ImageNet

For our in-domain experiments, where the disjoint training class set and novel class set come from the same distribution, we used the popular few-shot datasets Omniglot (Lake et al., 2011) and mini-ImageNet (Vinyals et al., 2016). For comparability, we use the Conv-4 architecture proposed in Vinyals et al. (2016). Specifics on the datasets, architecture, and optimization can be found in Appendices A.1 and A.2. We apply limited hyperparameter tuning, as suggested in Oliver et al. (2018), and use a batch size of n = 50 and Q = 3 query augmentations for all datasets.

In Table 1, we report few-shot accuracies on the mini-ImageNet and Omniglot benchmarks. We compare to the unsupervised clustering-based methods CACTUs (Hsu et al., 2019) and UFLST (Ji et al., 2019) as well as the augmentation-based methods UMTRA (Khodadadeh et al., 2019), AAL (Antoniou and Storkey, 2019) and ULDA (Qin et al., 2020). More details on how these approaches compare to ours can be found in Section 4. Pre+Linear represents classical supervised transfer learning, where a deep neural network classifier is (pre-)trained on the training classes and then only the last linear layer is fine-tuned on the novel classes. On mini-ImageNet, ProtoTransfer outperforms all other state-of-the-art unsupervised pre-training approaches by at least 4% and up to 8%, and mostly outperforms the supervised meta-learning method MAML (Finn et al., 2017), while requiring orders of magnitude fewer labels. On Omniglot, ProtoTransfer shows competitive performance with most unsupervised meta-learning approaches.

Method  (N,K)        Omniglot                          mini-ImageNet
                     (5,1)  (5,5)  (20,1) (20,5)       (5,1)  (5,5)  (5,20) (5,50)
Training (scratch) 52.50 74.78 24.91 47.62 27.59 38.48 51.53 59.63
CACTUs-MAML 68.84 87.78 48.09 73.36 39.90 53.97 63.84 69.64
CACTUs-ProtoNet 68.12 83.58 47.75 66.27 39.18 53.36 61.54 63.55
UMTRA 83.80 95.43 74.25 92.12 39.93 50.73 61.11 67.15
AAL-ProtoNet 84.66 89.14 68.79 74.28 37.67 40.29 - -
AAL-MAML++ 88.40 97.96 70.21 88.32 34.57 49.18 - -
UFLST 97.03 99.19 91.28 97.37 33.77 45.03 53.35 56.72
ULDA-ProtoNet - - - - 40.63 55.41 63.16 65.20
ULDA-MetaOptNet - - - - 40.71 54.49 63.58 67.65
ProtoTransfer (ours) 88.00 96.48 72.27 89.08 45.67 62.99 72.34 77.22
Supervised training
MAML 94.46 98.83 84.60 96.29 46.81 62.13 71.03 75.54
ProtoNet 97.70 99.28 94.40 98.39 46.44 66.33 76.73 78.91
Pre+Linear 94.30 99.08 86.05 97.11 43.87 63.01 75.46 80.17
Table 1: Accuracy (%) of unsupervised pre-training methods on N-way K-shot (N,K) classification tasks on Omniglot and mini-ImageNet with a Conv-4 architecture. For detailed results, see Tables 7 and 8 in the Appendix. Results style: best and second best.

3.2 Cross-domain Few-Shot Classification: CDFSL benchmark

For our cross-domain experiments, where training and novel classes come from different distributions, we turn to the CDFSL benchmark (Guo et al., 2019). This benchmark specifically tests how well methods trained on mini-ImageNet can transfer to few-shot tasks with only limited similarity to mini-ImageNet. In order of decreasing similarity, the four datasets are plant disease images from CropDiseases (Mohanty et al., 2016), satellite images from EuroSAT (Helber et al., 2019), dermatological images from ISIC2018 (Tschandl et al., 2018; Codella et al., 2019) and grayscale chest X-ray images from ChestX (Wang et al., 2017). Following Guo et al. (2019), we use a ResNet-10 neural network architecture. As there is no validation data available for the target tasks in CDFSL, we keep the same ProtoTransfer hyperparameters (n = 50, Q = 3) as used in the mini-ImageNet experiments. Experimental details are listed in Appendices A.1.2 and A.2.2.

For comparison to unsupervised meta-learning, we include our results on UMTRA-ProtoNet and its fine-tuned version UMTRA-ProtoTune (Khodadadeh et al., 2019). Both use our augmentations instead of those from (Khodadadeh et al., 2019). For further comparison, we include ProtoNet (Snell et al., 2017) for supervised few-shot learning and Pre+Mean-Centroid and Pre+Linear as the best-on-average performing transfer learning approaches from Guo et al. (2019). As the CDFSL benchmark presents a large domain shift with respect to mini-ImageNet, all model parameters are fine-tuned in ProtoTransfer during the few-shot fine-tuning phase with ProtoTune.

We report results on the CDFSL benchmark in Table 2. ProtoTransfer consistently outperforms its meta-learned counterparts by at least 0.7% up to 19% and performs mostly on par with the supervised transfer learning approaches. Comparing the results of UMTRA-ProtoNet and UMTRA-ProtoTune, starting from 5 shots, parametric fine-tuning gives improvements ranging from 1% to 13%. Notably, on the dataset with the largest domain shift (ChestX), ProtoTransfer outperforms all other approaches.

Method               UnSup   ChestX                     ISIC
                             (5,5)  (5,20) (5,50)       (5,5)  (5,20) (5,50)
ProtoNet 24.05 28.21 29.32 39.57 49.50 51.99
Pre+Mean-Centroid 26.31 30.41 34.68 47.16 56.40 61.57
Pre+Linear 25.97 31.32 35.49 48.11 59.31 66.48
UMTRA-ProtoNet 24.94 28.04 29.88 39.21 44.62 46.48
UMTRA-ProtoTune 25.00 30.41 35.63 38.47 51.60 60.12
ProtoTransfer (ours) 26.71 33.82 39.35 45.19 59.07 66.15
                             EuroSAT                    CropDiseases
ProtoNet 73.29 82.27 80.48 79.72 88.15 90.81
Pre+Mean-Centroid 82.21 87.62 88.24 87.61 93.87 94.77
Pre+Linear 79.08 87.64 91.34 89.25 95.51 97.68
UMTRA-ProtoNet 74.91 80.42 82.24 79.81 86.84 88.44
UMTRA-ProtoTune 68.11 81.56 85.05 82.67 92.04 95.46
ProtoTransfer (ours) 75.62 86.80 90.46 86.53 95.06 97.01
Table 2: Accuracy (%) of methods on N-way K-shot (N,K) classification tasks of the CDFSL benchmark (Guo et al., 2019). We list both our results for methods with unsupervised pre-training (UnSup) and the CDFSL results for methods with supervised pre-training. All models are trained on mini-ImageNet with ResNet-10. For detailed results, see Appendix Table 9. Results style: best and second best.

3.3 Ablation Study: Batch Size, Number of Queries, and Fine-Tuning

We conduct an ablation study of ProtoTransfer’s components to see how they contribute to its performance. Starting from ProtoTransfer we successively remove components to arrive at the equivalent UMTRA-ProtoNet which shows similar performance to the original UMTRA approach (Khodadadeh et al., 2019) on mini-ImageNet. As a reference, we provide results of a ProtoNet classifier on top of a fixed randomly initialized network.

Table 3 shows that increasing the batch size from 5 (the size of a 5-way task) for UMTRA-ProtoNet to 50 for ProtoCLR-ProtoNet, keeping everything else equal, is crucial to our approach and yields a 5% to 9% performance improvement. Importantly, UMTRA-ProtoNet uses our augmentations instead of those from Khodadadeh et al. (2019). Thus, this improvement cannot be attributed to using different augmentations than UMTRA. Increasing the number of training queries to Q = 3 gives better gradient information and yields a relatively small but consistent performance improvement. Fine-tuning in the target domain does not always give a net improvement. Generally, when many shots are available, fine-tuning gives a significant boost in performance, as exemplified by ProtoCLR-ProtoTune and UMTRA-MAML in the 50-shot case. Interestingly, our approach reaches competitive performance in the few-shot regime even before fine-tuning.

Training Testing batch size Q FT (5,1) (5,5) (5,20) (5,50)
n.a. ProtoNet n.a. n.a. no 27.05 34.12 39.68 41.40
UMTRA ProtoNet 5 1 no 39.17 53.78 62.41 64.40
UMTRA MAML 5 1 yes 39.93 50.73 61.11 67.15
ProtoCLR ProtoNet 50 1 no 44.53 62.88 70.86 73.93
ProtoCLR ProtoNet 50 3 no 44.89 63.35 72.27 74.31
ProtoCLR ProtoTune 50 3 yes 45.67 62.99 72.34 77.22
Table 3: Accuracy (%) of methods on 5-way K-shot classification tasks on mini-ImageNet with a Conv-4 architecture for different training image batch sizes, numbers of training queries (Q) and optional fine-tuning on target tasks (FT). UMTRA-MAML results are taken from Khodadadeh et al. (2019), where UMTRA uses AutoAugment (Cubuk et al., 2019) augmentations. For detailed results, see Table 10 in the Appendix. Results style: best and second best.

3.4 Number of Training Classes and Samples

While ProtoTransfer already does not require any labels during pre-training, for some applications, e.g. rare medical conditions, even the collection of sufficiently similar data might be difficult. Thus, we test our approach when reducing the total number of available training images under the controlled setting of mini-ImageNet. Moreover, not all training datasets will have a set of classes as diverse as the different animals, vehicles and objects in mini-ImageNet. Therefore, we also test the effect of reducing the number of training classes and thereby the class diversity. To contrast the effects of reducing the number of classes or reducing the number of samples, we either remove whole classes from the mini-ImageNet training set or remove the corresponding amount of samples randomly from all classes. The number of samples is decreased in multiples of 600, as each mini-ImageNet class contains exactly 600 samples. We compare the mini-ImageNet few-shot classification accuracies of ProtoTransfer to the popular supervised transfer learning baseline Pre+Linear in Figure 2.

As expected, when uniformly reducing the number of images from all classes (Figure 2a), the few-shot classification accuracy is reduced as well. The performance of ProtoTransfer and the supervised baseline closely match in this case. When reducing the number of training classes in Figure 2b, ProtoTransfer consistently and significantly outperforms the supervised baseline when the number of mini-ImageNet training classes drops below 16. For example, in the 20-shot case with only two training classes, ProtoTransfer outperforms the supervised baseline by a large margin of 16.9% (64.59% vs 47.68%). Comparing ProtoTransfer in Figures 2a and 2b, there is only a small difference between reducing images randomly from all classes or taking entire classes away. In contrast, the supervised baseline performance suffers substantially from having fewer classes.

(a) Varying number of training images. (b) Varying number of training classes.
Figure 2: 5-way K-shot accuracies with 95% confidence intervals on mini-ImageNet as a function of the number of training images and classes, for ProtoTransfer and the transfer learning baseline Pre+Linear. Note the logarithmic scale. Detailed results are available in Table 11 in the appendix.

To validate these in-domain observations in a cross-domain setting, following Devos and Grossglauser (2019), we compare few-shot classification performance when training on CUB (Welinder et al., 2010; Wah et al., 2011) and testing on mini-ImageNet (Vinyals et al., 2016). CUB consists of 200 classes of birds, while only three of the 64 mini-ImageNet training classes are birds (see A.1.3, A.2.3 for details on CUB). Thus, CUB possesses a lower class diversity than mini-ImageNet. Table 4 confirms our previous observation numerically and shows that ProtoTransfer achieves a transfer accuracy 2% to 4% higher than the supervised approach when the training classes offer only limited diversity.

We conjecture that this difference is due to the fact that our self-supervised approach does not distinguish between samples coming from the same or different (latent) classes during training. Thus, we expect it to learn discriminative features despite a low training class diversity. In contrast, the supervised case forces multiple images with rich features into the same classes. We thus expect the generalization gap between tasks coming from training classes and testing classes to be smaller with self-supervision. We provide evidence to support this conjecture in Section 3.5.

Training      Testing    (5,1)         (5,5)         (5,20)        (5,50)
ProtoCLR      ProtoNet   34.56 ± 0.61  52.76 ± 0.63  62.76 ± 0.59  66.01 ± 0.55
ProtoCLR      ProtoTune  35.37 ± 0.63  52.38 ± 0.66  63.82 ± 0.59  68.95 ± 0.57
Pre(training) Linear     33.10 ± 0.60  47.01 ± 0.65  59.94 ± 0.62  65.75 ± 0.63
Table 4: Accuracy (%) on N-way K-shot (N,K) classification tasks on mini-ImageNet for methods trained on the CUB training set (5885 images) with a Conv-4 architecture. All results indicate 95% confidence intervals over 600 randomly generated test episodes. Results style: best and second best.

3.5 Task Generalization Gap

To compare the generalization of ProtoCLR with its supervised embedding learning counterpart ProtoNet (Snell et al., 2017), we visualize the learned embedding spaces with t-SNE (Maaten and Hinton, 2008) in Figure 3. We compare both methods on samples from 5 random classes from the training and testing sets of mini-ImageNet. In Figures 3a and 3b we observe that, for the same training classes, ProtoNet shows more structure. Comparing all subfigures in Figure 3, ProtoCLR shows more closely related embeddings in Figures 3a and 3c than ProtoNet in Figures 3b and 3d.

These visual observations are supported numerically in Table 5. Self-supervised embedding approaches, such as UMTRA and our ProtoCLR approach, show a much smaller task generalization gap than supervised ProtoNet. ProtoCLR shows virtually no classification performance drop. However, supervised ProtoNet suffers a significant accuracy reduction of 6% to 12%.

(a) ProtoCLR training (b) ProtoNet training (c) ProtoCLR testing (d) ProtoNet testing
Figure 3: t-SNE plots of trained embeddings on 5 classes from the training and testing sets of mini-ImageNet. Trained embeddings considered are self-supervised ProtoCLR and supervised 20-way 5-shot ProtoNet. For details on the depicted classes, please refer to Appendix A.4.
Training  Testing   Data   (5,1)         (5,5)         (5,20)        (5,50)
ProtoNet  ProtoNet  Train  53.74 ± 0.95  79.09 ± 0.69  85.53 ± 0.53  86.62 ± 0.48
ProtoNet  ProtoNet  Val    46.62 ± 0.82  67.34 ± 0.69  76.44 ± 0.57  79.00 ± 0.53
ProtoNet  ProtoNet  Test   46.44 ± 0.78  66.33 ± 0.68  76.73 ± 0.54  78.91 ± 0.57
UMTRA     ProtoNet  Train  41.03 ± 0.79  56.43 ± 0.78  64.48 ± 0.71  66.28 ± 0.66
UMTRA     ProtoNet  Test   38.92 ± 0.69  53.37 ± 0.68  61.69 ± 0.66  65.12 ± 0.59
ProtoCLR  ProtoNet  Train  45.33 ± 0.63  63.47 ± 0.58  71.51 ± 0.51  73.99 ± 0.49
ProtoCLR  ProtoNet  Test   44.89 ± 0.58  63.35 ± 0.54  72.27 ± 0.45  74.31 ± 0.45
Table 5: Accuracy (%) of N-way K-shot (N,K) classification tasks from the training and testing splits of mini-ImageNet. Following Snell et al. (2017), ProtoNet is trained with 30-way 1-shot episodes for 1-shot tasks and 20-way 5-shot episodes otherwise. All results use a Conv-4 architecture and show 95% confidence intervals over 600 randomly generated episodes.

4 Related Work

Unsupervised meta-learning:

Both CACTUs (Hsu et al., 2019) and UFLST (Ji et al., 2019) alternate between clustering, to generate support and query sets, and standard meta-learning. In contrast, our method unifies self-supervised clustering and inference in a single model. Khodadadeh et al. (2019) propose an unsupervised model-agnostic meta-learning approach (UMTRA), where artificial N-way 1-shot tasks are generated by randomly sampling support examples from the training set and generating corresponding queries by augmentation. Antoniou and Storkey (2019) (AAL) generalize this approach to more support shots by randomly grouping augmented images into classes for classification tasks. ULDA (Qin et al., 2020) induces a distribution shift between the support and query sets by applying different types of augmentations to each. In contrast, ProtoTransfer uses a single un-augmented support sample, similar to Khodadadeh et al. (2019), but extends to several query samples for better gradient signals and steps away from artificial few-shot task sampling by using larger batch sizes, which is key to learning stronger embeddings.

Supervised meta-learning aided by self-supervision:

Several works have proposed to use a self-supervised loss either alongside supervised meta-learning episodes (Gidaris et al., 2019; Liu et al., 2019) or to initialize a model prior to supervised meta-learning on the source domain (Chen et al., 2019a; Su et al., 2019). In contrast, we do not require any labels during training.

Fine-tuning for few-shot classification:

Chen et al. (2019b) show that adaptation on the target task is key for good cross-domain few-shot classification performance. Similar to ProtoTune, Triantafillou et al. (2020) also initialize a final layer with prototypes after supervised meta-learning, but always fine-tune all parameters of the model.

Contrastive loss learning:

Contrastive losses have fueled recent progress in learning strong embedding functions (Ye et al., 2019; Chen et al., 2020; He et al., 2019; Tian et al., 2020; Li et al., 2020). Most similar to our approach is Ye et al. (2019), who propose a per-batch contrastive loss that minimizes the distance between an image and an augmented version of it. Unlike ours, their approach does not generalize to multiple augmented query images per prototype and uses two extra fully connected layers during training. Concurrently, Li et al. (2020) also use a prototype-based contrastive loss. They compute the prototypes as centroids after clustering augmented images via k-means. They also separate the learning and clustering procedures, whereas ProtoTransfer combines them in a single procedure.

5 Conclusion

In this work, we proposed ProtoTransfer for few-shot classification. ProtoTransfer performs transfer learning from an unlabeled source domain to a target domain with only a few labeled examples. Our experiments show that on mini-ImageNet it outperforms all prior unsupervised few-shot learning approaches by a large margin. On a more challenging cross-domain few-shot classification benchmark, ProtoTransfer shows similar performance to fully supervised approaches. Our ablation studies show that large batch sizes are crucial to learning good representations for downstream few-shot classification tasks and that parametric fine-tuning on target tasks can significantly boost performance.

This work received support from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 754354.


References

  • A. Antoniou and A. Storkey (2019) Assume, Augment and Learn: Unsupervised Few-Shot Meta-Learning via Random Labels and Data Augmentation. arXiv preprint arXiv:1902.09884. Cited by: item 33footnotemark: 3, item 33footnotemark: 3, §1, §2.2, §3.1, §4.
  • D. Chen, Y. Chen, Y. Li, F. Mao, Y. He, and H. Xue (2019a) Self-Supervised Learning For Few-Shot Image Classification. arXiv preprint arXiv:1911.06045. Cited by: §4.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709. Cited by: §A.3.1, §A.3.2, §1, §2.2, §2.2, §4.
  • W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019b) A Closer Look at Few-shot Classification. In ICLR 2019 : 7th International Conference on Learning Representations, Cited by: §1, §4.
  • N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al. (2019) Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368. Cited by: §A.1.2, §3.2.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation strategies from data. In

    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    ,
    pp. 113–123. Cited by: Table 3.
  • A. Devos and M. Grossglauser (2019) Subspace Networks for Few-Shot Classification. arXiv preprint arXiv:1905.13613. Cited by: §3.4.
  • T. DeVries and G. W. Taylor (2017)

    Improved Regularization of Convolutional Neural Networks With Cutout

    .
    arXiv preprint arXiv:1708.04552. Cited by: §A.3.3.
  • L. Fei-Fei, R. Fergus, and P. Perona (2006) One-Shot Learning of Object Categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4), pp. 594–611. Cited by: §1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In

    Proceedings of the 34th International Conference on Machine Learning-Volume 70

    ,
    pp. 1126–1135. Cited by: §1, §3.1.
  • S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord (2019) Boosting Few-Shot Visual Learning with Self-Supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8059–8068. Cited by: §4.
  • Y. Guo, N. C. Codella, L. Karlinsky, J. R. Smith, T. Rosing, and R. Feris (2019) A New Benchmark for Evaluation of Cross-Domain Few-Shot Learning. arXiv preprint arXiv:1912.07200. Cited by: item *, §A.1.2, §A.2.2, §A.3.1, Table 9, §1, §3.2, §3.2, Table 2, §3.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722. Cited by: §1, §4.
  • P. Helber, B. Bischke, A. Dengel, and D. Borth (2019) EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (7), pp. 2217–2226. Cited by: §A.1.2, §3.2.
  • N. Hilliard, L. Phillips, S. Howland, A. Yankov, C. D. Corley, and N. O. Hodas (2018) Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376. Cited by: §A.1.3.
  • K. Hsu, S. Levine, and C. Finn (2019) Unsupervised Learning via Meta-Learning. In ICLR 2019: 7th International Conference on Learning Representations. Cited by: footnote 1, §1, §2.2, §3.1, §4.
  • Z. Ji, X. Zou, T. Huang, and S. Wu (2019) Unsupervised Few-shot Learning via Self-supervised Training. arXiv preprint arXiv:1912.12178. Cited by: footnote 4, §1, §2.2, §3.1, §4.
  • S. Khodadadeh, L. Boloni, and M. Shah (2019) Unsupervised Meta-Learning for Few-Shot Image Classification. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10132–10142. Cited by: footnotes 2 and *, §1, §2.2, §3.1, §3.2, §3.3, Table 3, §4.
  • D. P. Kingma and J. L. Ba (2015) Adam: A Method for Stochastic Optimization. In ICLR 2015 : International Conference on Learning Representations 2015, Cited by: §A.2.1, §A.2.2, §A.2.4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet Classification With Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105. Cited by: §A.3.3.
  • B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum (2011) One-shot Learning of Simple Visual Concepts. Cognitive Science 33 (33). Cited by: §A.1.1, §3.1, §3.
  • J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. Hoi (2020) Prototypical Contrastive Learning of Unsupervised Representations. arXiv preprint arXiv:2005.04966. Cited by: §4.
  • S. Liu, A. Davison, and E. Johns (2019) Self-Supervised Generalisation with Meta-Auxiliary Learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1677–1687. Cited by: §4.
  • L. van der Maaten and G. Hinton (2008) Visualizing Data Using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §3.5.
  • S. P. Mohanty, D. P. Hughes, and M. Salathé (2016) Using Deep Learning for Image-Based Plant Disease Detection. Frontiers in Plant Science 7, pp. 1419. Cited by: §A.1.2, §3.2.
  • A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow (2018) Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3235–3246. Cited by: §A.2, §3.1.
  • T. Qin, W. Li, Y. Shi, and Y. Gao (2020) Unsupervised Few-shot Learning via Distribution Shift-based Augmentation. arXiv preprint arXiv:2004.05805. Cited by: footnote 5, §1, §2.2, §3.1, §4.
  • S. Ravi and H. Larochelle (2017) Optimization as a Model for Few-Shot Learning. In ICLR 2017 : International Conference on Learning Representations 2017, Cited by: §A.1.1.
  • M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-Learning for Semi-Supervised Few-Shot Classification. In Proceedings of 6th International Conference on Learning Representations ICLR, Cited by: §1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §A.1.1, §A.3.1.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap (2016) One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065. Cited by: §A.1.1.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical Networks for Few-Shot Learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4077–4087. Cited by: §A.2.1, §1, §2.2, §2.3, §3.2, §3.5, Table 5.
  • J. Su, S. Maji, and B. Hariharan (2019) When Does Self-Supervision Improve Few-Shot Learning?. arXiv preprint arXiv:1910.03560. Cited by: §4.
  • Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §4.
  • E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, and H. Larochelle (2020) Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples. In ICLR 2020 : Eighth International Conference on Learning Representations, Cited by: §2.3, §4.
  • P. Tschandl, C. Rosendahl, and H. Kittler (2018) The HAM10000 Dataset, a Large Collection of Multi-Source Dermatoscopic Images of Common Pigmented Skin Lesions. Scientific data 5, pp. 180161. Cited by: §A.1.2, §3.2.
  • O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra (2016) Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3630–3638. Cited by: §A.1.1, §A.1.1, §A.2.1, §3.1, §3.4, §3.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Cited by: §A.1.3, §3.4.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2097–2106. Cited by: §A.1.2, §3.2.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §A.1.3, §3.4.
  • M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised Embedding Learning via Invariant and Spreading Instance Feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6210–6219. Cited by: §1, §2.2, §4.
  • Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2020) Random Erasing Data Augmentation. In AAAI 2020: The Thirty-Fourth AAAI Conference on Artificial Intelligence. Cited by: item 6.

Appendix A Experimental Details

a.1 Datasets

a.1.1 In-domain datasets

For our in-domain experiments we used the popular few-shot datasets Omniglot (Lake et al., 2011) and mini-ImageNet (Vinyals et al., 2016).

Omniglot consists of 1623 handwritten characters from 50 alphabets, with 20 examples per character. As in Vinyals et al. (2016), the grayscale images are resized to 28x28. Following Santoro et al. (2016), we use 1200 characters for training and 423 for testing.

Mini-ImageNet is a subset of the ILSVRC-12 dataset (Russakovsky et al., 2015), which contains 60,000 color images that we resized to 84x84. For comparability, we use the splits introduced by Ravi and Larochelle (2017) over 100 classes with 600 images each. 64 classes are used for pre-training and 20 for testing. We use the 16 validation classes only for limited hyperparameter tuning of the batch size, the number of queries, and the augmentation strengths.

a.1.2 Cross-domain datasets

We evaluate all cross-domain experiments on the CDFSL benchmark (Guo et al., 2019). It comprises four datasets with decreasing similarity to mini-ImageNet. In order of similarity, they are plant disease images from CropDiseases (Mohanty et al., 2016), satellite images from EuroSAT (Helber et al., 2019), dermatological images from ISIC2018 (Tschandl et al., 2018; Codella et al., 2019) and grayscale chest x-ray images from ChestX (Wang et al., 2017).

a.1.3 Caltech-UCSD Birds-200-2011 (CUB) dataset

We use the Caltech-UCSD Birds-200-2011 (CUB) dataset (Welinder et al., 2010; Wah et al., 2011) in our ablation studies. It comprises 11,788 images of 200 different bird species. We follow the splits proposed by Hilliard et al. (2018) with 100 training, 50 validation and 50 test classes. We do not use the validation set classes.

a.2 Architecture and Optimization Parameters

In the following, we describe the experimental details for the individual experiments. We deliberately stay close to the parameters reported in prior work and do not perform an extensive hyperparameter search for our specific setup, as this can easily lead to overestimating performance relative to simpler approaches (Oliver et al., 2018). Table 6 summarizes the hyperparameters we used for ProtoTransfer.

                             in-domain                    cross-domain
Hyperparameter               Omniglot   mini-ImageNet     mini-ImageNet   CUB
Model architecture           Conv-4     Conv-4            ResNet-10       Conv-4
Image input size             28x28      84x84             224x224         84x84
Optimizer                    Adam       Adam              Adam            Adam
Learning rate                0.001      0.001             0.001           0.001
Learning rate decay factor   0.5        0.5               –               0.5
Learning rate decay period   25,000     25,000            –               25,000
Support examples             1          1                 1               1
Augmented queries            3          3                 3               3
Training batch size          50         50                50              50
Augmentations (appendix)     A.3.3      A.3.2             A.3.1           A.3.2
Fine-tuning optimizer        Adam       Adam              Adam            Adam
Fine-tuning learning rate    0.001      0.001             0.001           0.001
Fine-tuning batch size       5          5                 5               5
Fine-tuning epochs           15         15                15              15
Fine-tune last layer         yes        yes               yes             yes
Fine-tune backbone           no         no                yes             no
Table 6: ProtoTransfer hyperparameter summary.

a.2.1 In-Domain Experiments

Our mini-ImageNet and Omniglot experiments use the Conv-4 architecture proposed by Vinyals et al. (2016) for comparability. Each of its four convolutional blocks applies a 3x3 convolution with 64 filters, batch normalization, a ReLU nonlinearity and 2x2 max-pooling. The pre-training mostly mirrors Snell et al. (2017) and uses Adam (Kingma and Ba, 2015) with an initial learning rate of 0.001, which is multiplied by a factor of 0.5 every 25,000 iterations. We use a batch size of 50. We do not use the validation set to select the best training epoch; instead, training stops after 20,000 iterations without improvement in training accuracy.
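For reference, the Conv-4 embedding network and the optimization schedule described above can be sketched in PyTorch as follows; this is a minimal sketch, and the helper names are ours rather than from the official code:

```python
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One Conv-4 block: 3x3 convolution, batch norm, ReLU, 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class Conv4(nn.Module):
    """Conv-4 embedding network (Vinyals et al., 2016): four blocks of 64 filters."""

    def __init__(self, in_channels: int = 3, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            *[conv_block(in_channels if i == 0 else hidden, hidden) for i in range(4)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x).flatten(start_dim=1)


model = Conv4()
# Adam with lr 0.001, halved every 25,000 iterations (scheduler stepped per batch)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25_000, gamma=0.5)

# an 84x84 RGB batch maps to a 64 * 5 * 5 = 1600-dimensional embedding
embedding = model(torch.randn(2, 3, 84, 84))
```

For 28x28 Omniglot inputs the same network would be instantiated with `in_channels=1` and yields a 64-dimensional embedding after the four pooling stages.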

a.2.2 Cross-Domain Experiments

Our experiments on the CDFSL benchmark are based on the code provided by Guo et al. (2019). Following Guo et al. (2019), we use a ResNet-10 architecture that is pre-trained on mini-ImageNet images of size 224x224 for 400 epochs with Adam (Kingma and Ba, 2015) and the default learning rate of 0.001, for best comparability with the results reported in Guo et al. (2019). The batch size for self-supervised pre-training is 50. We do not use a validation set.

a.2.3 Caltech-UCSD Birds-200-2011 (CUB) Experiments

The CUB training is identical in terms of architecture (Conv-4) and optimization to the setup for our in-domain experiments.

a.2.4 Prototypical Fine-Tuning

During the fine-tuning stage we add a fully connected classification layer after the embedding function and initialize it as described in Section 2.3. We split the support examples into batches of 5 images each and perform 15 fine-tuning epochs with Adam (Kingma and Ba, 2015) and an initial learning rate of 0.001. For the target datasets mini-ImageNet and Omniglot only the last fully connected layer is optimized, while for the CDFSL benchmark experiments the embedding network is adapted as well.
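A sketch of this fine-tuning stage is given below. It expresses the classification head as a linear layer whose logits rank classes like negative squared Euclidean distance to the class prototypes, which is one standard way to realize a prototype-based initialization; the exact initialization of Section 2.3 is not reproduced here, and the helper names are ours:

```python
import torch
import torch.nn as nn


def prototype_init_classifier(embeddings, labels, num_classes):
    """Linear layer initialized from class prototypes.

    Uses the identity -||x - c_k||^2 = 2 c_k^T x - ||c_k||^2 - ||x||^2,
    whose last term is class-independent, so argmax over logits equals
    nearest-prototype classification.
    """
    prototypes = torch.stack(
        [embeddings[labels == k].mean(dim=0) for k in range(num_classes)]
    )
    layer = nn.Linear(embeddings.size(1), num_classes)
    with torch.no_grad():
        layer.weight.copy_(2.0 * prototypes)
        layer.bias.copy_(-(prototypes ** 2).sum(dim=1))
    return layer


def finetune_last_layer(layer, support_emb, support_labels, epochs=15, batch_size=5):
    """Fine-tune only the classifier (frozen backbone): Adam, lr 0.001,
    batches of 5 support examples, 15 epochs."""
    opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        perm = torch.randperm(support_emb.size(0))
        for i in range(0, support_emb.size(0), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss_fn(layer(support_emb[idx]), support_labels[idx]).backward()
            opt.step()
    return layer
```

Before any gradient step, the initialized layer predicts exactly the nearest prototype, so fine-tuning starts from the ProtoNet decision rule rather than from random weights.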

a.3 Augmentations

a.3.1 CDFSL transforms

For the CDFSL-benchmark (Guo et al., 2019) experiments we employ the same augmentations as Chen et al. (2020), as these have proven to work well for ImageNet (Russakovsky et al., 2015) images of size 224x224. They are as follows:

  1. Random crop and resize: scale , aspect ratio , bilinear interpolation

  2. Random horizontal flip

  3. Random () color jitter: brightness = contrast = saturation = 0.8, hue=0.2

  4. Random () grayscale

  5. Gaussian blur, random radius

a.3.2 mini-ImageNet & CUB transforms

For the mini-ImageNet and CUB experiments we used lighter versions of the Chen et al. (2020) augmentations, namely no Gaussian blur, lower color jitter strengths and smaller rescaling and cropping ranges. Transforms marked with * differ from the CDFSL transforms in A.3.1. They are as follows:

  1. * Random crop and resize: scale , aspect ratio , bilinear interpolation

  2. Random horizontal flip

  3. * Random vertical flip

  4. * Random () color jitter: brightness = contrast = saturation = 0.4, hue=0.2

  5. Random () grayscale

a.3.3 Omniglot transforms

For Omniglot we use a set of custom augmentations, namely random resizing and cropping, horizontal and vertical flipping, Image-Pixel Dropout (Krizhevsky et al., 2012) and Cutout (DeVries and Taylor, 2017). They are as follows:

  1. Resize to a size of 28x28 pixels

  2. Random crop and resize: scale , aspect ratio , bilinear interpolation

  3. Random horizontal flip

  4. Random vertical flip

  5. Random () image-pixel dropout

  6. Random erasing of a rectangular region in an image (Zhong et al., 2020), setting pixel values to 0: scale , aspect ratio

a.4 Classes for t-SNE Plots

The classes in the t-SNE plots are a random subset of classes from the mini-ImageNet base classes (classes 1-5) and the mini-ImageNet novel classes (classes 6-10). Their corresponding labels are the following:

  1. n02687172 aircraft carrier

  2. n04251144 snorkel

  3. n02823428 beer bottle

  4. n03676483 lipstick

  5. n03400231 frying pan

  6. n03272010 electric guitar

  7. n07613480 trifle

  8. n03775546 mixing bowl

  9. n03127925 crate

  10. n04146614 school bus

Each of the t-SNE plots in Figure 3 shows 500 randomly selected embedded images from within those classes.
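The 2D projections for such plots can be produced with scikit-learn's t-SNE (van der Maaten and Hinton, 2008); the sketch below uses random stand-in embeddings in place of the actual Conv-4 features, and the array names are ours:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for 500 randomly selected embedded images (dimensionality is arbitrary here)
embeddings = rng.normal(size=(500, 64)).astype(np.float32)

# project to 2D for plotting; PCA initialization keeps runs more stable
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
```

The resulting `coords` array has one 2D point per embedded image and can be scattered with per-class colors to reproduce plots like Figure 3.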

a.5 Results With Full Confidence Intervals & References

Method  (N,K)           (5,1)          (5,5)          (20,1)         (20,5)
Omniglot
Training (scratch)      52.50 ± 0.84   74.78 ± 0.69   24.91 ± 0.33   47.62 ± 0.44
CACTUs-MAML¹            68.84 ± 0.80   87.78 ± 0.50   48.09 ± 0.41   73.36 ± 0.34
CACTUs-ProtoNet¹        68.12 ± 0.84   83.58 ± 0.61   47.75 ± 0.43   66.27 ± 0.37
UMTRA²                  83.80 –        95.43 –        74.25 –        92.12 –
AAL-ProtoNet³           84.66 ± 0.70   89.14 ± 0.27   68.79 ± 1.03   74.28 ± 0.46
AAL-MAML++³             88.40 ± 0.75   97.96 ± 0.32   70.21 ± 0.86   88.32 ± 1.22
UFLST⁴                  97.03 –        99.19 –        91.28 –        97.37 –
ProtoTransfer (ours)    88.00 ± 0.64   96.48 ± 0.26   72.27 ± 0.47   89.08 ± 0.23
Supervised training
MAML¹                   94.46 ± 0.35   98.83 ± 0.12   84.60 ± 0.32   96.29 ± 0.13
ProtoNet                97.70 ± 0.29   99.28 ± 0.10   94.40 ± 0.23   98.39 ± 0.08
Pre+Linear              94.30 ± 0.43   99.08 ± 0.10   86.05 ± 0.34   97.11 ± 0.11

  ¹ Hsu et al. (2019)
  ² Khodadadeh et al. (2019)
  ³ Antoniou and Storkey (2019)
  ⁴ Ji et al. (2019)

Table 7: Accuracy (%) of methods on N-way K-shot classification tasks on Omniglot with a Conv-4 architecture. All results are reported with 95% confidence intervals over 600 randomly generated test episodes. Results style: best and second best.
Method  (N,K)           (5,1)          (5,5)          (5,20)         (5,50)
mini-ImageNet
Training (scratch)      27.59 ± 0.59   38.48 ± 0.66   51.53 ± 0.72   59.63 ± 0.74
CACTUs-MAML¹            39.90 ± 0.74   53.97 ± 0.70   63.84 ± 0.70   69.64 ± 0.63
CACTUs-ProtoNet¹        39.18 ± 0.71   53.36 ± 0.70   61.54 ± 0.68   63.55 ± 0.64
UMTRA²                  39.93 –        50.73 –        61.11 –        67.15 –
AAL-ProtoNet³           37.67 ± 0.39   40.29 ± 0.68   –              –
AAL-MAML++³             34.57 ± 0.74   49.18 ± 0.47   –              –
UFLST⁴                  33.77 ± 0.70   45.03 ± 0.73   53.35 ± 0.59   56.72 ± 0.67
ULDA-ProtoNet⁵          40.63 ± 0.61   55.41 ± 0.57   63.16 ± 0.51   65.20 ± 0.50
ULDA-MetaOptNet⁵        40.71 ± 0.62   54.49 ± 0.58   63.58 ± 0.51   67.65 ± 0.48
ProtoTransfer (ours)    45.67 ± 0.79   62.99 ± 0.75   72.34 ± 0.58   77.22 ± 0.52
Supervised training
MAML¹                   46.81 ± 0.77   62.13 ± 0.72   71.03 ± 0.69   75.54 ± 0.62
ProtoNet                46.44 ± 0.78   66.33 ± 0.68   76.73 ± 0.54   78.91 ± 0.57
Pre+Linear              43.87 ± 0.69   63.01 ± 0.71   75.46 ± 0.58   80.17 ± 0.51

  ¹ Hsu et al. (2019)
  ² Khodadadeh et al. (2019)
  ³ Antoniou and Storkey (2019)
  ⁴ Ji et al. (2019)
  ⁵ Qin et al. (2020)

Table 8: Accuracy (%) of methods on N-way K-shot classification tasks on mini-ImageNet with a Conv-4 architecture. All results are reported with 95% confidence intervals over 600 randomly generated test episodes. Results style: best and second best.

                             ChestX                                        ISIC
Method               UnSup   (5,5)          (5,20)         (5,50)          (5,5)          (5,20)         (5,50)
ProtoNet*            no      24.05 ± 1.01   28.21 ± 1.15   29.32 ± 1.12    39.57 ± 0.57   49.50 ± 0.55   51.99 ± 0.52
Pre+Mean-Centroid*   no      26.31 ± 0.42   30.41 ± 0.46   34.68 ± 0.46    47.16 ± 0.54   56.40 ± 0.53   61.57 ± 0.66
Pre+Linear*          no      25.97 ± 0.41   31.32 ± 0.45   35.49 ± 0.45    48.11 ± 0.64   59.31 ± 0.48   66.48 ± 0.56
UMTRA-ProtoNet       yes     24.94 ± 0.43   28.04 ± 0.44   29.88 ± 0.43    39.21 ± 0.53   44.62 ± 0.49   46.48 ± 0.47
UMTRA-ProtoTune      yes     25.00 ± 0.43   30.41 ± 0.44   35.63 ± 0.48    38.47 ± 0.55   51.60 ± 0.54   60.12 ± 0.50
ProtoTransfer (ours) yes     26.71 ± 0.46   33.82 ± 0.48   39.35 ± 0.50    45.19 ± 0.56   59.07 ± 0.55   66.15 ± 0.57

                             EuroSAT                                       CropDiseases
Method               UnSup   (5,5)          (5,20)         (5,50)          (5,5)          (5,20)         (5,50)
ProtoNet*            no      73.29 ± 0.71   82.27 ± 0.57   80.48 ± 0.57    79.72 ± 0.67   88.15 ± 0.51   90.81 ± 0.43
Pre+Mean-Centroid*   no      82.21 ± 0.49   87.62 ± 0.34   88.24 ± 0.29    87.61 ± 0.47   93.87 ± 0.68   94.77 ± 0.34
Pre+Linear*          no      79.08 ± 0.61   87.64 ± 0.47   91.34 ± 0.37    89.25 ± 0.51   95.51 ± 0.31   97.68 ± 0.21
UMTRA-ProtoNet       yes     74.91 ± 0.72   80.42 ± 0.66   82.24 ± 0.61    79.81 ± 0.65   86.84 ± 0.50   88.44 ± 0.46
UMTRA-ProtoTune      yes     68.11 ± 0.70   81.56 ± 0.54   85.05 ± 0.50    82.67 ± 0.60   92.04 ± 0.43   95.46 ± 0.31
ProtoTransfer (ours) yes     75.62 ± 0.67   86.80 ± 0.42   90.46 ± 0.37    86.53 ± 0.56   95.06 ± 0.32   97.01 ± 0.26

  * Results from Guo et al. (2019)

Table 9: Accuracy (%) of methods on N-way K-shot classification tasks of the CDFSL benchmark (Guo et al., 2019). UnSup indicates unsupervised pre-training. All models are trained on mini-ImageNet with ResNet-10. All results are reported with 95% confidence intervals over 600 randomly generated test episodes. Results style: best and second best.

Training   Testing     Batch size   P    FT    (5,1)          (5,5)          (5,20)         (5,50)
UMTRA*     MAML        5            1    yes   39.93 –        50.73 –        61.11 –        67.15 –
UMTRA      ProtoNet    5            1    no    39.17 ± 0.53   53.78 ± 0.53   62.41 ± 0.49   64.40 ± 0.46
ProtoCLR   ProtoNet    50           1    no    44.53 ± 0.60   62.88 ± 0.54   70.86 ± 0.48   73.93 ± 0.44
ProtoCLR   ProtoNet    50           3    no    44.89 ± 0.58   63.35 ± 0.54   72.27 ± 0.45   74.31 ± 0.45
ProtoCLR   ProtoNet    50           5    no    45.00 ± 0.57   63.17 ± 0.55   71.70 ± 0.48   73.98 ± 0.44
ProtoCLR   ProtoNet    50           10   no    44.98 ± 0.58   62.56 ± 0.53   70.78 ± 0.48   73.69 ± 0.44
ProtoCLR   ProtoTune   50           3    yes   45.67 ± 0.76   62.99 ± 0.75   72.34 ± 0.58   77.22 ± 0.52

  * Khodadadeh et al. (2019)

Table 10: Accuracy (%) of methods on N-way K-shot classification tasks on mini-ImageNet with a Conv-4 architecture for different batch sizes, numbers of training queries (P) and optional fine-tuning on target tasks (FT). UMTRA-MAML uses different augmentations. All results are reported with 95% confidence intervals over 600 randomly generated test episodes. Results style: best and second best.

Method             # images   # classes   (5,1)          (5,5)          (5,20)         (5,50)
Random+ProtoTune*  0          0           28.16 ± 0.56   35.32 ± 0.60   42.72 ± 0.63   47.05 ± 0.61
Random+Linear      0          0           26.77 ± 0.52   34.68 ± 0.58   44.62 ± 0.61   51.79 ± 0.61
ProtoTransfer      600        1           32.20 ± 0.57   48.89 ± 0.64   60.80 ± 0.62   66.91 ± 0.56
ProtoTransfer      600        64          37.02 ± 0.65   52.84 ± 0.72   64.76 ± 0.63   69.54 ± 0.58
Pre+Linear         600                    36.58 ± 0.69   52.03 ± 0.72   63.04 ± 0.68   68.34 ± 0.64
ProtoTransfer      1200       2           34.15 ± 0.61   53.59 ± 0.68   64.59 ± 0.63   70.24 ± 0.54
ProtoTransfer      1200       64          38.88 ± 0.70   55.53 ± 0.69   66.91 ± 0.57   71.16 ± 0.56
Pre+Linear         1200       2           27.05 ± 0.46   37.06 ± 0.57   47.68 ± 0.62   54.37 ± 0.59
Pre+Linear         1200                   37.81 ± 0.70   53.96 ± 0.69   65.43 ± 0.68   70.02 ± 0.59
ProtoTransfer      2400       4           37.96 ± 0.64   55.27 ± 0.69   66.61 ± 0.60   70.92 ± 0.55
ProtoTransfer      2400       64          40.90 ± 0.71   59.12 ± 0.71   69.34 ± 0.60   73.32 ± 0.55
Pre+Linear         2400       4           31.26 ± 0.57   45.41 ± 0.65   58.48 ± 0.65   63.63 ± 0.61
Pre+Linear         2400                   38.82 ± 0.69   55.26 ± 0.70   67.96 ± 0.64   73.29 ± 0.58
ProtoTransfer      4800       8           40.74 ± 0.73   59.00 ± 0.71   69.45 ± 0.61   74.08 ± 0.53
ProtoTransfer      4800       64          41.97 ± 0.74   59.09 ± 0.71   69.40 ± 0.61   73.60 ± 0.56
Pre+Linear         4800       8           34.54 ± 0.60   52.04 ± 0.69   65.71 ± 0.59   70.44 ± 0.53
Pre+Linear         4800                   41.38 ± 0.70   58.15 ± 0.73   70.51 ± 0.63   75.05 ± 0.56
ProtoTransfer      9600       16          42.04 ± 0.76   60.35 ± 0.72   70.70 ± 0.58   75.16 ± 0.57
ProtoTransfer      9600       64          42.94 ± 0.78   60.36 ± 0.72   70.66 ± 0.59   74.67 ± 0.55
Pre+Linear         9600       16          38.39 ± 0.65   54.78 ± 0.67   67.75 ± 0.60   73.42 ± 0.52
Pre+Linear         9600                   41.74 ± 0.73   60.24 ± 0.68   73.03 ± 0.61   77.90 ± 0.53
ProtoTransfer      19200      32          43.88 ± 0.76   61.22 ± 0.69   71.26 ± 0.59   75.62 ± 0.52
ProtoTransfer      19200      64          44.02 ± 0.74   60.78 ± 0.72   71.58 ± 0.56   75.77 ± 0.52
Pre+Linear         19200      32          40.10 ± 0.63   59.58 ± 0.65   72.45 ± 0.56   76.53 ± 0.52
Pre+Linear         19200                  41.58 ± 0.71   61.20 ± 0.66   73.57 ± 0.56   79.01 ± 0.51
ProtoTransfer      38400      64          45.67 ± 0.76   62.99 ± 0.75   72.34 ± 0.58   77.22 ± 0.52
Pre+Linear         38400      64          43.87 ± 0.69   63.01 ± 0.71   75.46 ± 0.58   80.17 ± 0.51

  * Trained for 100 epochs instead of the default 15 epochs for ProtoTune, since training a classifier on top of a fixed randomly initialized network is expected to require more fine-tuning than starting from a pre-trained network.

Table 11: Accuracy (%) of methods on N-way K-shot classification tasks on mini-ImageNet with a Conv-4 architecture when reducing the number of pre-training classes or images. All results are reported with 95% confidence intervals over 600 randomly generated test episodes. Results style: best and second best.
Training        Testing         (5,1)          (5,5)          (5,20)         (5,50)
ProtoCLR        ProtoNet        34.56 ± 0.61   52.76 ± 0.63   62.76 ± 0.59   66.01 ± 0.55
ProtoCLR        ProtoTransfer   35.37 ± 0.63   52.38 ± 0.66   63.82 ± 0.59   68.95 ± 0.57
Pre(training)   Linear          33.10 ± 0.60   47.01 ± 0.65   59.94 ± 0.62   65.75 ± 0.63
Table 12: Accuracy (%) on N-way K-shot classification tasks on mini-ImageNet for methods trained on the CUB training set (5885 images) with a Conv-4 architecture. All results are reported with 95% confidence intervals over 600 randomly generated test episodes. Results style: best and second best.