Deep Metric Transfer for Label Propagation with Limited Annotated Data

12/20/2018 ∙ by Bin Liu, et al. ∙ Microsoft, Tsinghua University

We study object recognition under the constraint that each object class is only represented by very few observations. In such cases, naive supervised learning would lead to severe over-fitting in deep neural networks due to limited training data. We tackle this problem by creating much more training data through label propagation from the few labeled examples to a vast collection of unannotated images. Our main insight is that such a label propagation scheme can be highly effective when the similarity metric used for propagation is learned and transferred from other related domains with lots of data. We test our approach on semi-supervised learning, transfer learning and few-shot recognition, where we learn our similarity metric using various supervised/unsupervised pretraining methods, and transfer it to unlabeled data across different data distributions. By taking advantage of unlabeled data in this way, we achieve significant improvements on all three tasks. Notably, our approach outperforms current state-of-the-art techniques by an absolute 20% for semi-supervised learning on CIFAR10, 10% for transfer learning from ImageNet to CIFAR10, and 6% for few-shot recognition on mini-ImageNet, when labeled examples are limited.




1 Introduction

Figure 1: Overview of the approach. Often, object categories are represented by very few images. We transfer a metric learned from another domain and propagate the labels from the few labeled images to a vast collection of unannotated images. We show this can reliably create much more labeled data for the target problem.

We address the problem of object recognition from a very small amount of labeled data. This problem is of particular importance when limited labels can be collected due to either time or financial constraints. Though this is a difficult challenge, we are encouraged by evidence from cognitive science suggesting that infants can quickly learn new concepts from very few examples [19, 1].

Many recognition problems in computer vision are concerned with learning on few labeled data. Semi-supervised learning, transfer learning, and few-shot recognition all aim to achieve fast generalization from few examples, by leveraging unlabeled data or labeled data from other domains.

The fundamental difficulty of this problem is that naive supervised training with very few examples results in severe over-fitting. Because of this, prior work in semi-supervised learning relies on strong regularization such as augmentation [9], temporal consistency [18], and adversarial examples [25] to improve performance. Some related works in few-shot learning do not even refine an online classifier. Instead, they simply apply the similarity metric learned from training categories to new categories without adaptation. Meta learning [7] seeks to optimize an online parametric classifier with few samples, but under the assumption that just a few steps of optimization will lead to effective generalization with less overfitting. These approaches address the inherent problem of limited training data only indirectly.

In this paper, we propose a novel approach to this problem, where we create training data by label propagation to an unlabeled dataset, so that training a supervised model with great learning capacity no longer faces over-fitting. This approach is related to work on “pseudo-labeling” [20, 29], where the model is bootstrapped from limited data and trained on the new data/label pairs it infers. However, that is unlikely to work well when the labeled data is scarce, since the initial model is likely to be poor. Our work shares the spirit of earlier work [6] showing that label propagation can work well with simple GIST descriptors. We bring it to the context of deep learning, and show that it is the metric transfer that enables accurate, diverse, and generalizable label propagation.

Our approach works with three data domains: a source domain to learn a similarity metric, few labeled examples to define the target problem, and an unlabeled dataset in which to propagate labels. Concretely, we first learn a similarity metric on the source domain, which can be either labeled or unlabeled. Supervised learning or unsupervised (self-supervised) learning is used to learn the metric accordingly. Then, given few observations of the target problem, we propagate the labels from these observations to the unlabeled dataset using the metric learned in the source domain. This creates an abundance of labeled data for learning a classifier. Finally, we train a standard supervised model using the propagated labels.

The main contribution of this work is the metric transfer approach for label propagation. By studying different combinations of metric pretraining methods (e.g. unsupervised, supervised) and label propagation algorithms (e.g. nearest neighbors, spectral clustering), we find that our metric transfer approach on unlabeled data is general enough to work effectively for many settings. For semi-supervised learning on CIFAR10, we obtain an absolute 20% improvement over the state-of-the-art when labeled data is limited. Our work also provides an alternative approach for transfer learning and few-shot recognition when unlabeled data is given. Compared to pretraining on the source dataset and then finetuning on the limited labeled examples, we achieve a 10% improvement on transferring representations from ImageNet to CIFAR10. We also demonstrate improved performance for few-shot recognition on the mini-ImageNet benchmark.

2 Related Work

Large-scale Recognition. To solve a computer vision problem, it has become a common practice to build a large-scale dataset [5, 2] and train deep neural networks [17, 32] on it. This philosophy has achieved unprecedented success on many important computer vision problems [5, 22, 30]. However, constructing a large-scale dataset is often time-consuming and expensive, and this has motivated work on unsupervised learning and problems defined on few labeled samples.

Semi-supervised Learning. Semi-supervised learning [38] is a problem that lies in between supervised learning and unsupervised learning. It aims to make more accurate predictions by leveraging a large amount of unlabeled data than by relying on the labeled data alone. In the era of deep learning, one line of work leverages unlabeled data through deep generative models [15, 27]. However, training of generative models is often unstable, making them tricky to use for recognition tasks. Recent efforts on semi-supervised learning focus on regularization by self-ensembling through consistency loss, such as temporal ensembling [18], adversarial ensembling [25], and teacher-student distillation [34]. These models treat labeled data and unlabeled data separately without considering their relationships. The pseudo-labeling approach [20, 29] initializes a model on a small labeled dataset and bootstraps on the new data it predicts. This tends to fail when the labeled set is small. Our work is most closely related to label propagation approaches [6, 37], and we propose metric transfer to significantly improve the propagation performance.

Few-shot Recognition. Given some training data in training categories, few-shot recognition [1] requires the classifier to generalize to new categories from observing very few examples, often 1-shot or 5-shot. A body of work approaches this problem by offline metric learning [35, 33, 39], where a generic similarity metric is learned on the training data and directly transferred to the new categories using simple nearest neighbor classifiers without further adaptation. Recent works on meta-learning [7, 21, 24] take a learning-to-learn approach using online algorithms. In order not to overfit to the few examples, they develop meta-learners to find a common embedding space, which can be further finetuned with fast convergence to the target problem. Recent works [28, 8] using meta-learning consider the combined problem of semi-supervised learning and few-shot recognition, by allowing access to unlabeled data in few-shot recognition. This drives few-shot recognition into more realistic scenarios. We follow this setting as we study few-shot recognition.

Transfer Learning. Since the inception of the ImageNet challenge [30], transfer learning has emerged almost everywhere in visual recognition, such as in object detection [10] and semantic segmentation [23], by simply transferring the network weights learned on ImageNet classification and finetuning on the target task. When the pretraining task and the target task are closely related, this tends to generalize much better than training from scratch on the target task alone. Domain adaptation seeks to address a much more difficult scenario where there is a large gap between the inputs of the source and target domains [13], for example, between real images and synthetic images. What we study in this paper is metric transfer. Different from prior work [41] that employs metric transfer just to reduce the distribution divergence of different domains, we use metric transfer to propagate labels. Through this, we show that metric propagation is an effective method for learning with small data.

3 Approach

To deal with the shortage of labeled data, our approach is to enlarge it by propagating labels from annotated images to unlabeled data using the similarity metric between data pairs. The creation of much more labeled data enables us to train deep neural networks to their full learning capacity.

Our framework works on three data domains: the source domain $S$, the target domain $T$, and additional unlabeled data $U$. The source domain $S$ can be labeled or unlabeled with abundant data, and it is used to learn a generic similarity metric between data pairs. The target domain $T$ only has few labeled data, but it defines the problem we want to optimize. The unlabeled data $U$ is the resource in which to propagate labels, and may potentially contain similar classes to the task defined in $T$. It may or may not have overlapping classes with $S$. Below we introduce our approach in detail.

3.1 Metric Pretraining

The source domain $S$ is used for pretraining a similarity metric between data pairs. Ideally, we desire the metric to capture the inherent structure in the target domain $T$, so that transferring labels from $T$ is reliable and useful. For this to happen, we usually hold some prior knowledge about the source $S$ and the target $T$. For example, the source domain is sampled from the same distribution as the target domain, but is completely unannotated, or the source domain is annotated with a different task but is closely related to the target. Formally, a similarity metric between data $x_i$ and $x_j$ can be defined as

$$s_{ij} = f_\theta(x_i, x_j),$$

where $f_\theta$ is the similarity function to be learned. In this work, we use deep neural networks as a parametric model of this similarity function. The metric can be trained with either supervised or unsupervised methods, depending on whether labels are given in the source domain $S$. We briefly review the training algorithms as follows.

Unsupervised Metric Pretraining
Recently, there has been growing interest in unsupervised learning and self-supervised learning. Different algorithms are based on different data properties (e.g. color [43], context [3], motion [44]) and thus may vary in performance on the target task we may want to transfer. However, it is not our intent to give a comprehensive comparison over various methods and choose the best one. Instead, we show that general unsupervised transfer is beneficial for label propagation and leads to improved performance.

In this work, we utilize two unsupervised learning methods: instance discrimination [40] and colorization [43]. For instance discrimination, we treat each instance as a class, and maximize the probability of each example belonging to its own class.
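As a rough illustration of the instance discrimination idea, the sketch below scores how strongly each example is recognized as its own class, using a softmax over inner products of L2-normalized features. The function name `instance_log_likelihood`, the temperature value of 0.1, and the use of the features themselves as the per-instance class prototypes are assumptions made for illustration; they are not details taken from this paper.

```python
import math


def instance_log_likelihood(features, temperature=0.1):
    """Mean log-probability that each example is assigned to its own instance.

    features: list of L2-normalized feature vectors (one per image).
    temperature: hypothetical softmax temperature, chosen for illustration.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        # Similarity of example i to every instance "class" j.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all instances.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[i] - log_z
    return total / n


# Well separated features are easy to tell apart; collapsed features are not.
separated = instance_log_likelihood([[1.0, 0.0], [0.0, 1.0]])
collapsed = instance_log_likelihood([[1.0, 0.0], [1.0, 0.0]])
```

Maximizing this quantity pushes different images apart in feature space, which is why the inner product of the learned features can afterwards serve as a similarity metric.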


For colorization, the idea is to learn a mapping from grayscale images to colorful ones. Following the original paper [43], instead of predicting raw pixel colors, we quantize the color space into soft bins, and use the cross-entropy loss on the soft bins, accumulated over the spatial locations of the image. We follow previous work [4] for applying ResNet to colorization, where we use a base network to map inputs to features, and a head network of three convolutional layers to convert features to colors. Since colorization does not automatically output a metric, we use the Euclidean distance on the features from the base network to measure similarity.

Supervised Metric Pretraining

In some scenarios, we have access to a labeled dataset, such as PASCAL VOC and ImageNet, having commonalities with the target task. Traditional metric learning with supervision minimizes the intra-class distance and maximizes the inter-class distance of the labeled samples. For this purpose, many types of loss functions such as contrastive loss, triplet loss [12], and neighborhood analysis [11] have been proposed. In this work, we use neighborhood analysis [11] to learn our metric. Concretely, we maximize the likelihood of each example being supported by other examples belonging to the same category.
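The neighborhood analysis objective can be sketched as a leave-one-out likelihood: each example distributes a softmax over all other examples, and we sum the mass that falls on examples of the same category. This is a minimal sketch under stated assumptions: the name `neighborhood_log_likelihood`, the inner-product similarity, and the temperature of 0.1 are hypothetical, and every class is assumed to contain at least two examples.

```python
import math


def neighborhood_log_likelihood(features, labels, temperature=0.1):
    """Mean log-probability that an example is supported by same-class neighbors.

    features: list of L2-normalized feature vectors.
    labels: list of integer class ids, at least two examples per class.
    temperature: hypothetical softmax temperature, chosen for illustration.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        same_class_mass = 0.0
        all_mass = 0.0
        for j in range(n):
            if j == i:
                continue  # leave-one-out: an example cannot support itself
            similarity = sum(a * b for a, b in zip(features[i], features[j]))
            weight = math.exp(similarity / temperature)
            all_mass += weight
            if labels[j] == labels[i]:
                same_class_mass += weight
        total += math.log(same_class_mass / all_mass)
    return total / n


points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = neighborhood_log_likelihood(points, [0, 0, 1, 1])
misaligned = neighborhood_log_likelihood(points, [0, 1, 0, 1])
```

The likelihood is high when features of the same category sit close together, so gradient ascent on it shapes the feature space into a class-aware metric.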


3.2 Label Propagation

Given a target $T$ represented by a small number of labeled examples, and an unlabeled set $U$, we propagate labels from $T$ to $U$ using the similarity function learned from $S$. Suppose $T = \{(x_j, y_j)\}_{j=1}^{n}$ and $U = \{x_i\}_{i=1}^{m}$, where $n$ and $m$ are the number of images in $T$ and $U$, respectively. Label $y_j$ is represented as a vector with the ground-truth class element set to $1$ and the others set to $0$. We consider two propagation algorithms.

Naive Nearest Neighbors
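A straightforward propagation approach is to vote for the class of an unlabeled sample based on its similarity to each of the exemplars in the target set $T$. For an unlabeled example $x_i$, we calculate its logits $z_i^c$ for every class $c$,

$$z_i^c = \frac{1}{N_c} \sum_{j} \mathbb{1}[y_j = c] \, s_{ij}, \tag{5}$$

where $\mathbb{1}[\cdot]$ is the indicator function, $s_{ij}$ denotes the similarity between unlabeled example $x_i$ and labeled example $x_j$ with label $y_j$, and $N_c$ is the number of labeled images available for class $c$.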

The nearest neighbor propagation method is essentially a one-step random walk where the similarity metric acts as the transition matrix and the indicator function acts as the initial distribution. The effectiveness of such one-step propagation depends heavily on the quality of the similarity metric.
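The voting rule described above can be written in a few lines. This is a minimal sketch: the function names `vote_logits` and `pseudo_label` are hypothetical, and the per-class average is our reading of the text (similarities to the exemplars of a class, divided by the number of exemplars of that class).

```python
def vote_logits(similarities, labels, num_classes):
    """Average similarity of one unlabeled example to the exemplars of each class.

    similarities: similarity of the unlabeled example to each labeled exemplar.
    labels: integer class id of each labeled exemplar, in the same order.
    """
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for similarity, label in zip(similarities, labels):
        sums[label] += similarity
        counts[label] += 1
    # Classes without any exemplar receive a logit of zero.
    return [sums[c] / counts[c] if counts[c] else 0.0 for c in range(num_classes)]


def pseudo_label(logits):
    """Index of the class with the largest logit."""
    return max(range(len(logits)), key=lambda c: logits[c])


# Two exemplars of class 0 and one exemplar of class 1.
logits = vote_logits([0.9, 0.7, 0.1], [0, 0, 1], num_classes=2)
label = pseudo_label(logits)
```

Because each unlabeled example only looks at its direct similarity to the exemplars, the result is only as good as those individual similarity values.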

In general, it is hard to learn such a metric well, especially when limited supervision is available, because of the visual diversity of images. Figure 2 (left) shows a typical similarity matrix computed from unsupervised features. Data points in the similarity matrix are sparsely connected, thus limiting the one-step label propagation approach.

Figure 2: Left: raw similarity matrix. Right: similarity matrix by spectral embedding. Through spectral embedding, sparse similarities are propagated to distant areas to reveal global structure. Samples are sorted by their class id for better visualization.

Constrained Spectral Clustering
Constrained spectral clustering [14, 6] may potentially relieve such a problem. Instead of propagating labels by one step as in the naive nearest neighbor approach, constrained spectral clustering propagates labels through multiple steps by taking advantage of structure within the unlabeled dataset. It computes a spectral embedding [31, 36] from the original similarity metric, which is then used as the new metric for label propagation. The spectral embedding is built from the eigenvalues and eigenvectors of the normalized Laplacian, taken in ascending order of eigenvalue. The Laplacian matrix is derived from the original similarity metric together with its degree matrix. Parameter $K$ is the total number of eigen components used.

Due to its globalized nature, spectral clustering is able to pass messages between distant areas, which is in contrast to the local behavior of the naive nearest neighbors approach. The embedded metric is usually densely connected and better aligned with object classes, as illustrated in Figure 2 (right). Using the same voting approach as in Eqn (5), label propagation can be more accurate than using the original raw similarity metric.

Constrained spectral clustering is also efficient. By following the common practice of using $k$-nearest neighbors to build the similarity graph [36], propagating labels to the unlabeled images takes about 10 seconds on a regular GPU.
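To make the effect of a spectral embedding concrete, here is a small self-contained sketch on a toy graph. Several choices are assumptions rather than the paper's exact formulation: we use the symmetric normalized Laplacian, embed each node with its coordinates in the lowest `num_components` eigenvectors, and take the inner product of row-normalized embeddings as the new similarity (the paper's embedding may weight components differently). The helper `jacobi_eigh` is a plain Jacobi eigen-solver that is only suitable for tiny matrices; every node is assumed to have positive degree.

```python
import math


def jacobi_eigh(matrix, sweeps=100, tol=1e-20):
    """Eigen-decompose a small symmetric matrix with cyclic Jacobi rotations.

    Returns (values, vectors) sorted by ascending eigenvalue, where
    vectors[k] is the unit eigenvector paired with values[k].
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                if diff == 0:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of A
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    order = sorted(range(n), key=lambda i: a[i][i])
    return [a[i][i] for i in order], [[v[k][i] for k in range(n)] for i in order]


def spectral_similarity(w, num_components):
    """Similarity between nodes after embedding with the lowest eigenvectors."""
    n = len(w)
    degree = [sum(row) for row in w]
    laplacian = [
        [float(i == j) - w[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    _, vectors = jacobi_eigh(laplacian)
    embedding = [[vectors[k][i] for k in range(num_components)] for i in range(n)]
    unit = []
    for row in embedding:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        unit.append([x / norm for x in row])
    return [[sum(x * y for x, y in zip(unit[i], unit[j])) for j in range(n)] for i in range(n)]


# Toy graph: two chains 0-1-2 and 3-4-5 joined by one weak edge between 2 and 3.
w = [[0.0] * 6 for _ in range(6)]
for i, j, weight in [(0, 1, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0), (2, 3, 0.01)]:
    w[i][j] = w[j][i] = weight
embedded = spectral_similarity(w, num_components=2)
```

In the toy graph, nodes 0 and 2 have no direct similarity, yet the embedded similarity between them is close to one because they are linked through node 1, while nodes on different chains stay far apart. This is the multi-step behavior that one-step voting cannot provide.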


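| Metric pretraining | Propagation method | 50 | 100 | 250 | 500 | 1000 | 2000 | 4000 | 8000 |
|---|---|---|---|---|---|---|---|---|---|
| Supervised | Nearest neighbor | 22.03 | 25.74 | 48.35 | 68.03 | 77.57 | 77.28 | 87.77 | 90.88 |
| Supervised | Spectral | 23.49 | 28.88 | 54.46 | 70.02 | 80.94 | 87.77 | 93.94 | 96.23 |
| Colorization | Nearest neighbor | 57.32 | 67.61 | 75.48 | 79.34 | 80.70 | 82.14 | 83.66 | 84.79 |
| Colorization | Spectral | 60.85 | 67.34 | 76.31 | 80.04 | 81.78 | 81.89 | 82.93 | 82.03 |
| Instance discrimination | Nearest neighbor | 54.82 | 62.99 | 77.08 | 84.90 | 88.68 | 91.34 | 92.72 | 93.67 |
| Instance discrimination | Spectral | 72.59 | 79.21 | 86.64 | 90.01 | 91.04 | 91.57 | 91.77 | 91.94 |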


Table 1: Ablation study of the mean average precision (mAP) of pseudo labels on CIFAR10.


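| Metric pretraining | Propagation method | 50 | 100 | 250 | 500 | 1000 | 2000 | 4000 | 8000 |
|---|---|---|---|---|---|---|---|---|---|
| No | No | 20.95 | 25.35 | 41.63 | 54.06 | 65.08 | 73.22 | 81.44 | 86.23 |
| Supervised | Nearest neighbor | 21.79 | 25.37 | 42.70 | 54.14 | 68.08 | 75.17 | 83.30 | 87.68 |
| Supervised | Spectral | 22.78 | 27.95 | 47.28 | 60.73 | 72.60 | 78.20 | 85.10 | 88.26 |
| Colorization | No | 49.57 | 55.41 | 64.65 | 68.81 | 73.40 | 77.93 | 82.17 | 86.25 |
| Colorization | Nearest neighbor | 49.96 | 52.69 | 65.63 | 65.88 | 70.88 | 76.36 | 80.16 | 84.64 |
| Colorization | Spectral | 53.47 | 55.08 | 68.40 | 71.15 | 72.38 | 76.50 | 80.31 | 84.03 |
| Instance discrimination | No | 35.27 | 37.87 | 62.46 | 71.04 | 75.96 | 80.12 | 83.90 | 87.82 |
| Instance discrimination | Nearest neighbor | 46.68 | 54.45 | 66.93 | 74.16 | 79.17 | 82.24 | 84.56 | 87.92 |
| Instance discrimination | Spectral | 56.34 | 63.53 | 71.26 | 74.77 | 79.38 | 82.34 | 84.52 | 87.48 |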


Table 2: Ablation study of semi-supervised performance on CIFAR10.

3.3 Confidence Weighted Supervised Training

Figure 3: The accumulated accuracy of the pseudo labels on the validation data sorted by the confidence measure.

Given the logits $z_i$ of an unlabeled example $x_i$, the pseudo label $\hat{y}_i$ is estimated as

$$\hat{y}_i = \arg\max_{c} z_i^c.$$
With the estimated pseudo labels on the unlabeled data, we have considerably more data for training a classifier. However, the pseudo labels may not be accurate, and directly using these labels may lead to degraded performance. For example, not all the data in the unlabeled set are related to the target problem. Here, we devise a simple weighting mechanism to compensate for inaccurate labels.

Given the logits $z_i$ produced by the label propagation algorithm for example $x_i$, we first normalize them into a probability distribution,

$$p_i^c = \frac{\exp(z_i^c / \tau)}{\sum_{c'} \exp(z_i^{c'} / \tau)},$$

where $c$ indexes the dimension of categories, and the temperature $\tau$ controls the sharpness of the distribution. We then define the confidence measure $w_i$ of the pseudo label as the difference between the maximum response and the second largest response,

$$w_i = p_i^{(1)} - p_i^{(2)},$$

where $p_i^{(1)}$ and $p_i^{(2)}$ denote the largest and second largest entries of $p_i$. A high value of $w_i$ indicates a confident estimate of the pseudo label, and a low value of $w_i$ indicates an ambiguous estimate. In Figure 3, we measure the accumulated accuracy of pseudo labels on validation data sorted by this confidence. It can be seen that our confidence measure gives a good indication of the quality of pseudo labels.
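The confidence measure is simple to compute from the propagated logits. The sketch below follows the description above (temperature softmax, then the gap between the two largest probabilities); the function name `label_confidence` and the default temperature of 1.0 are hypothetical, and at least two classes are assumed.

```python
import math


def label_confidence(logits, temperature=1.0):
    """Return (pseudo label, confidence) for one example.

    The confidence is the gap between the largest and second largest
    probabilities of the temperature-scaled softmax over the logits.
    """
    top = max(logits)
    exps = [math.exp((z - top) / temperature) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label = max(range(len(probs)), key=lambda c: probs[c])
    ranked = sorted(probs, reverse=True)
    return label, ranked[0] - ranked[1]


clear_label, clear_conf = label_confidence([5.0, 0.0, 0.0])
_, ambiguous_conf = label_confidence([1.0, 1.0, 0.0])
_, sharp_conf = label_confidence([1.0, 0.0], temperature=0.1)
_, smooth_conf = label_confidence([1.0, 0.0], temperature=1.0)
```

A tie between the two best classes yields a confidence of zero, and lowering the temperature sharpens the distribution and therefore raises the confidence of every non-tied example.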

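Our final training criterion is the cross-entropy on the pseudo labels, with each example weighted by its confidence,

$$\mathcal{L} = -\sum_{i} w_i \log q(\hat{y}_i \mid x_i),$$

where $\hat{y}_i$ is the pseudo label for example $x_i$, $w_i$ is its confidence, and $q(\cdot \mid x_i)$ is the softmax probability output of the classification network.

In practice, some pseudo labels have very low confidence and thus contribute negligibly to the overall learning criterion, so we may safely discard those examples to speed up learning.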
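A confidence-weighted criterion of this kind can be sketched as a weighted sum of per-example cross-entropy terms, skipping examples whose weight falls at or below a cutoff. The name `weighted_cross_entropy`, the `min_weight` parameter, and the choice to sum rather than average the terms are assumptions for illustration only.

```python
import math


def weighted_cross_entropy(probabilities, pseudo_labels, weights, min_weight=0.0):
    """Confidence-weighted cross-entropy over pseudo-labeled examples.

    probabilities: per-example class probabilities predicted by the classifier.
    pseudo_labels: per-example pseudo label (integer class id).
    weights: per-example confidence of the pseudo label.
    min_weight: hypothetical cutoff; examples at or below it are discarded.
    """
    loss = 0.0
    for probs, label, weight in zip(probabilities, pseudo_labels, weights):
        if weight <= min_weight:
            continue  # negligible contribution, skip to save computation
        loss -= weight * math.log(probs[label])
    return loss


predictions = [[0.5, 0.5], [0.9, 0.1]]
loss = weighted_cross_entropy(predictions, [0, 1], [1.0, 0.0])
```

In this example the second pseudo label disagrees with the classifier but carries zero confidence, so it does not influence the loss at all.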

4 Experiments

Through experiments, we show that, with unlabeled data, metric propagation is able to effectively label lots of data when little labeled data is given. We verify our approach on semi-supervised learning, where an unsupervised metric is transferred; on transfer learning, where supervised metrics generalize across different data distributions; and on few-shot recognition, where the metric can generalize across open-set object categories. While studying few-shot recognition, we leverage an extra unlabeled dataset for label propagation, which is also known as semi-supervised few-shot recognition [28].

Our approach has two major hyper-parameters: the number of eigenvectors for spectral clustering and the temperature controlling the confidence distribution. Different parameter settings may slightly change the performance. We keep a single setting of both across the experiments. A detailed analysis is provided in the supplementary materials.

4.1 Semi-Supervised Learning


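| Methods | Network architecture | 50 | 100 | 250 | 500 | 1000 | 2000 | 4000 | 8000 |
|---|---|---|---|---|---|---|---|---|---|
| Mean Teacher | WideResNet-28-2 | 29.66 | 36.62 | 45.49 | 57.19 | 65.07 | 79.26 | 84.38 | 87.55 |
| Mean Teacher | WideResNet-28-10 | 27.35 | 38.83 | 49.44 | 59.45 | 70.03 | 82.62 | 86.71 | 89.38 |
| Ours | WideResNet-28-2 | 56.34 | 63.53 | 71.26 | 74.77 | 79.38 | 82.34 | 84.52 | 87.48 |
| Ours | WideResNet-28-10 | 73.13 | 75.87 | 80.30 | 81.76 | 84.97 | 86.82 | 88.70 | 91.01 |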


Table 3: Scalability to large network architectures.
Figure 4: Comparisons to the state-of-the-art on CIFAR10.

We follow a recent evaluation paper [26], which gives a comprehensive benchmark for state-of-the-art semi-supervised learning approaches. All of our experiments are conducted on the CIFAR10 [16] dataset. We use the same Wide-ResNet [42] architecture with 28 layers and a width factor of 2. We report performance as we vary the number of labeled examples from 50 to 8000, drawn from the examples in the original CIFAR10 dataset.

For training our model, we pretrain the metric using the unlabeled split of CIFAR10, and propagate labels to the same unlabeled set. This means the source domain and the unlabeled data coincide ($S = U$) in our framework. We use SGD for optimization with an initial learning rate of 0.01 and a cosine decay schedule. We fix the total number of optimization iterations, as opposed to fixing the number of optimization epochs, because it gives more consistent comparisons when the number of labeled examples varies.
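For readers unfamiliar with cosine decay, the sketch below shows one common form of the schedule over a fixed iteration budget. The initial learning rate of 0.01 comes from the text; the half-cosine decay to zero, the function name `cosine_learning_rate`, and the budget of 1000 iterations are assumptions for illustration.

```python
import math


def cosine_learning_rate(step, total_steps, initial_lr=0.01):
    """Learning rate that follows half a cosine from initial_lr down to zero."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical iteration budget, not the value used in the paper
schedule = [cosine_learning_rate(t, total_steps) for t in (0, 500, 1000)]
```

Tying the schedule to a fixed number of iterations, rather than epochs, keeps the amount of optimization constant regardless of how many labeled examples are available.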

Study of different pretrained metrics.

Our label propagation algorithm needs a pretrained similarity metric to guide it. The pretrained metric can be learned by supervised methods using limited labeled data, or by unsupervised methods using large-scale unlabeled data. Here, we consider three metric pretraining methods:

  1. supervised learning on limited labeled data.

  2. self-supervised learning by image colorization [43].

  3. unsupervised learning by instance discrimination [40].

We train the models using the optimal parameters for each pretraining method. Then we use cosine similarity in the feature space for propagating labels to the unlabeled data.

In Table 1, we evaluate the quality of pseudo labels as the mean average precision (mAP) sorted by the confidence as in Figure 3. Table 2 lists the final semi-supervised recognition accuracy. We can see that both unsupervised methods generalize much better than the supervised bootstrapping method most of the time, until the labeled set is relatively large with 4000 labels. This confirms our claim that unsupervised transfer is the key for label propagation. For the unsupervised methods, non-parametric metric learning performs better than colorization, probably because it explicitly learns a similarity metric. We also include the result of the naive baseline which trains from scratch using limited labeled data without label propagation.

Study of different label propagation schemes.

Given the pretrained metrics, there are various ways to transfer the metrics. We consider three possible solutions:

  1. no propagation, only transfer network weights.

  2. nearest neighbor metric transfer.

  3. spectral metric transfer.

The first baseline is a common practice, which basically transfers the network weights and then finetunes on the labeled data. The second is much weaker than the third because it only considers one-hop distances, without taking into account the similarities between unlabeled pairs.

The results are summarized in Table 1 and Table 2. Compared to the state-of-the-art performance in Table 4, even a simple finetuning approach outperforms the state of the art when the labeled data is small. For example, by finetuning from instance discrimination, we achieve 62.46 with 250 labeled examples (Table 2), significantly outperforming the prior methods in Table 4, which reach at most 47.07 in that setting. This suggests that unsupervised pretraining generally improves semi-supervised learning.

When unlabeled data is used for label propagation, metric transfer can be much stronger than just weight transfer, improving the performance to 71.26 with 250 labeled examples. It is also evident that the spectral clustering method performs better than weighted nearest neighbors because of its globalization behavior.


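| Num Labeled | 250 | 4000 |
|---|---|---|
| Ours | 71.26 | 84.52 |
| Pi Model [18] | 47.07 | 84.17 |
| Pi Model [18] + Ours | 74.90 | 85.32 |
| Mean Teacher [34] | 45.49 | 84.38 |
| Mean Teacher [34] + Ours | 74.54 | 85.45 |
| VAT [25] | 44.83 | 86.79 |
| VAT [25] + Ours | 78.34 | 86.93 |
| VAT+EM [25] | 46.29 | 86.96 |
| VAT+EM [25] + Ours | 78.63 | 87.20 |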


Table 4: Combination of our approach with state-of-the-art methods. Ours is complementary to all prior works.


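| Metric pretraining | Transfer method | 50 | 100 | 250 | 500 | 1000 | 2000 | 4000 | 8000 |
|---|---|---|---|---|---|---|---|---|---|
| Unsupervised | Network finetuning | 28.92 | 34.56 | 57.14 | 67.54 | 76.20 | 80.92 | 85.01 | 88.74 |
| Unsupervised | Spectral | 44.30 | 46.51 | 61.29 | 68.31 | 72.61 | 77.86 | 84.00 | 88.19 |
| Supervised | Network finetuning | 54.95 | 61.88 | 73.01 | 78.43 | 84.52 | 88.79 | 91.44 | 93.05 |
| Supervised | Spectral | 77.71 | 85.34 | 86.07 | 86.91 | 88.27 | 89.93 | 91.22 | 93.49 |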


Table 5: Transfer learning from ImageNet to CIFAR10.

Scalability to large network architectures.

In contrast to prior methods which face over-fitting issues, our approach can easily scale to larger network architectures. Here, we keep all the learning parameters unchanged, and experiment with a wider version of Wide-ResNet-28 with a width factor of 10. We consider a state-of-the-art method, mean-teacher [34], for comparison. In Table 3, mean-teacher shows only a limited improvement from the larger network. Our method enjoys consistently significant gains from a larger network in all the testing scenarios, and with Wide-ResNet-28-10 it reaches unprecedented accuracy when only very few labels are available.

Comparison to state-of-the-art methods.

We compare our approach to state-of-the-art methods in Figure 4. Ours is particularly strong when the labeled set is small, but this advantage diminishes as the labeled set grows. However, as most prior approaches focus on self-ensembling, ours is orthogonal to them. We examine the complementarity of our method by combining it with each of the prior approaches. To do so, we take our most confident pseudo labels and use them as ground truth for the other algorithms. For fair comparisons, we run public code with our generated pseudo labels. In Table 4, combining our approach leads to improved performance for all of the methods.

4.2 Transfer Learning

So far, our experiments have been conducted on CIFAR10, splitting the entire dataset into labeled and unlabeled splits. We also examine whether the proposed metric transfer can work across different data distributions. For this, we study supervised and unsupervised pretraining for transfer learning.


| Method | Finetune | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| NN baseline [35] | No | 41.1 ± 0.7 | 51.0 ± 0.7 |
| MAML [7] | Yes | 48.7 ± 0.7 | 63.2 ± 0.9 |
| Meta-SGD [21] | No | 50.5 ± 1.9 | 64.0 ± 0.9 |
| Matching net [35] | Yes | 46.6 ± 0.8 | 60.0 ± 0.7 |
| Prototypical [33] | No | 49.4 ± 0.8 | 68.2 ± 0.7 |
| Soft k-means | Yes | 50.4 ± 0.3 | 64.4 ± 0.2 |
| SNCA [39] | No | 50.3 ± 0.7 | 64.1 ± 0.8 |
| Our supervised | Yes | 56.1 ± 0.6 | 70.7 ± 0.5 |
| Our unsupervised | Yes | 50.8 ± 0.6 | 66.0 ± 0.5 |


Table 6: Few-shot recognition on Mini-ImageNet dataset.

Transferring from labeled ImageNet. We resize ImageNet images to a fixed resolution and pretrain the metric on them by supervised learning. We keep the network architecture WideResNet-28-2 for meaningful comparison with the semi-supervised settings in Sec 4.1. We then transfer the metric to CIFAR10, both by network finetuning and by metric propagation. In Table 5, we can see that simple network finetuning can reach the best results obtained in the semi-supervised settings of the previous subsection. By using label propagation with spectral clustering, we observe a large further improvement when only a few labeled images are available. This illustrates the generality of our metric transfer approach, where supervised transfer can also take advantage of unlabeled data to improve generalization.

Transferring from unlabeled ImageNet. Instead of supervised training which encodes prior knowledge about object categories, we treat ImageNet images as unlabeled and repeat the previous experiment. Different from the earlier unsupervised experiments, this setting involves substantially more unlabeled data, which could potentially lead to a better unsupervised metric. However, our results suggest otherwise. When propagating to CIFAR10, the unsupervised metric learned from ImageNet is inferior to the metric learned from CIFAR10. This is possibly due to the data distribution gap between CIFAR10 and ImageNet. Nevertheless, our unsupervised transfer from ImageNet still surpasses the state-of-the-art in the semi-supervised setting when labeled samples are limited.

4.3 Few-Shot Recognition

Figure 5: Visualizations of top ranked retrievals from the unlabeled set given one-shot observations.

Few-shot recognition targets a more challenging scenario, the generalization across object categories (a.k.a. open-set recognition). Originally, the problem is defined with numerous labeled examples in a source dataset, and few examples in the target categories. Recent works [28, 8] also explore the scenario where extra unlabeled data is available for this problem. This fits into our framework for studying label propagation via metric transfer.

We follow the protocols in [28] for conducting the experiments, because it introduces distractor categories in the unlabeled set. The experiments are evaluated on the mini-ImageNet dataset, consisting of a total of 100 categories, with 64 for training, 16 for validation and 20 for testing. Images in each category are split into a labeled portion and an unlabeled portion. Training uses only the labeled split in the training categories. During evaluation, a testing episode is constructed by sampling few-shot labeled observations from the labeled split in the testing categories, together with all of the unlabeled images in all the testing categories. A testing episode requires the model to find useful information in the unlabeled set to aid recognition from the few-shot observations. Unlike [28], which includes five distractor categories in the unlabeled set, we consider all categories in the testing set, which better reflects practical scenarios. We report results over multiple testing episodes.

We follow prior work [35] by using a shallow architecture with four convolutional layers and a final fully connected layer. Each convolutional layer has 64 channels, interleaved with ReLU, subsampling and a batch normalization layer. Images are resized to a fixed resolution to train the model. We use the spectral embedding approach for label propagation. During online training, we train for a total of 30 epochs and decrease the learning rate partway through training.

Transfer from supervised models. We use a recent supervised metric learning approach [39] as the baseline. After label propagation and finetuning on the new data, we obtain a significant performance boost of about 6% (Table 6). Prior work [28] improves upon its baselines, but fails to make further improvement because of limited training data. In Figure 5, we visualize the top retrievals from the unlabeled set in the one-shot scenario. These retrievals not only accurately belong to the same class as the ground truth, but their diverse appearance also facilitates learning a strong classifier.

Transfer from unsupervised models. We also investigate pretraining the metric without labels, using instance discrimination [40] for learning the metric. Surprisingly, in Table 6, this obtains better performance than the offline metric learning approach with annotations [39], by 0.5% in 1-shot recognition and 1.9% for 5-shot. This suggests that leveraging unlabeled data in the target problem can be more beneficial than using labeled samples in the source domain for few-shot recognition.

Figure 6: Ablations of the two model parameters: the number of eigen components and the temperature.

5 Discussions


  • The effectiveness of label propagation depends heavily on the learned metric, so advances in metric learning should lead to improved results. Since the prevalent pretraining methods in deep learning use softmax classification, we hope to draw more attention to pretraining networks with metric learning.

  • Currently, we study metric pretraining and label propagation separately. It may be beneficial to formulate them jointly in an end-to-end framework, which would be an interesting direction for future work.

  • Because of the label propagation process, the complexity of our approach depends on the unlabeled dataset, instead of the target problem. Our current algorithm cannot run in an online fashion. We hope to address this in the future.

  • Our algorithm takes advantage of the unlabeled dataset to create more training data. The overall performance is affected by the relevance of image content in the unlabeled set to that of the target problem, as this impacts the ability to effectively propagate labels.

Figure 7: CIFAR10 retrievals for 10 categories.
Figure 8: Mini-ImageNet retrieval visualizations. Top left is the one-shot query, the rest are the top retrievals.

Appendix A1 Ablations of Model Parameters

Our model depends on two parameters: the number of eigen components used for spectral clustering, and the temperature used for controlling the confidence. We used a single setting of both in our main submission. In Figure 6, we show the effects of the two parameters respectively.

The number of eigenvectors works well over a range of values. We can see a trade-off in its value for performance under various numbers of labeled samples. A smaller number of eigenvectors benefits settings with very few labeled samples, while a larger number benefits settings with comparably more labeled samples. The temperature parameter is generally robust over a wide range of values.

Appendix A2 Additional Visualizations

We provide more retrieval visualizations on the CIFAR10 and mini-ImageNet datasets in Figure 7 and Figure 8. For CIFAR10, we show the top retrievals for each class in the unlabeled set given labeled examples. For mini-ImageNet, we show the top retrievals in the 5-class 1-shot scenario.