Unsupervised Few-shot Learning via Self-supervised Training

12/20/2019 ∙ by Zilong Ji, et al. ∙ Beijing Normal University, Peking University

Learning from limited exemplars (few-shot learning) is a fundamental, unsolved problem that has been extensively explored in the machine learning community. However, current few-shot learners are mostly supervised and rely heavily on a large number of labeled examples. Unsupervised learning is a more natural procedure for cognitive mammals and has produced promising results in many machine learning tasks. In the current study, we develop a method to learn an unsupervised few-shot learner via self-supervised training (UFLST), which can effectively generalize to novel but related classes. The proposed model consists of two alternating processes, progressive clustering and episodic training. The former generates pseudo-labeled training examples for constructing episodic tasks; the latter trains the few-shot learner on the generated episodic tasks, which further optimizes the feature representations of the data. The two processes facilitate each other and eventually produce a high-quality few-shot learner. Using the benchmark datasets Omniglot and Mini-ImageNet, we show that our model outperforms other unsupervised few-shot learning methods. Using the benchmark dataset Market1501, we further demonstrate the feasibility of applying our model to a real-world task, person re-identification.




1 Introduction

Few-shot learning, which aims to accomplish a learning task using very few training examples, is receiving increasing attention in the machine learning community. The challenge of few-shot learning lies in the fact that traditional techniques such as fine-tuning would normally incur overfitting (Wang et al., 2018). To overcome this difficulty, a set-to-set meta-learning (episodic learning) paradigm was proposed (Vinyals et al., 2016). In this paradigm, conventional mini-batch training is replaced by episodic training: a batch of episodic tasks, each having the same setting as the testing environment, is presented to the learning model, and in each episodic task the model learns to predict the classes of unlabeled points (the query set) using very few labeled examples (the support set). In this way, the learning model acquires transferable knowledge (optimized feature representations) across tasks, and because the training and testing environments are consistent, the model is able to generalize to novel but related tasks. Although this set-to-set few-shot learning paradigm has made great progress, in its current supervised form it requires a large number of labeled examples for constructing episodic tasks, which is often infeasible or too expensive in practice. So, can we build a few-shot learner in the paradigm of episodic training using only unlabeled data?

It is well known that humans have the remarkable ability to learn a concept when given only a few exposures to its instances. For example, young children can effortlessly learn and generalize the concept of “giraffe” after seeing a few pictures of giraffes. While the specifics of the human learning process are complex (trial-based, perpetual, multi-sourced, and simultaneous for multiple tasks) and not yet understood, previous works agree that its nature is progressive and unsupervised in many cases (Dupoux, 2018). Given a set of unlabeled items, humans are able to organize them into different clusters by comparing one with another. The comparing or associating process proceeds in a coarse-to-fine manner. At the beginning of learning, humans tend to group items based on coarse knowledge such as color, shape or size. Subsequently, humans build up associations between items using more fine-grained knowledge, e.g., stripes in images, functions of items or other domain knowledge. Furthermore, humans can extract representative representations across categories and apply this capability to learn new concepts (Wang et al., 2014).

Figure 1: The scheme of our model UFLST, which consists of two alternate processes: clustering and episodic training. At each round, unlabeled data points are clustered based on extracted features, and pseudo labels are assigned according to cluster identities. After clustering, a set of episodic tasks are constructed by sampling from the pseudo-labeled data, and the few-shot learner is trained, which further optimizes feature representations. The two processes are repeated.

In the present study, inspired by the unsupervised and progressive characteristics of human learning, we propose an unsupervised model for few-shot learning via a self-supervised training procedure (UFLST). Different from previous unsupervised learning methods, our model integrates unsupervised learning and episodic training into a unified framework, which facilitates feature extraction and model training iteratively. Basically, we adopt the episodic training paradigm, taking advantage of its capability of extracting transferable knowledge across tasks, but we use an unsupervised strategy to construct episodic tasks. Specifically, we apply progressive clustering to generate pseudo labels for unlabeled data, and this is done alternately with feature optimization via few-shot learning in an iterative manner (Fig. 1). Initially, unlabeled data points are assigned to several clusters, and we sample a few training examples from each cluster together with their pseudo labels (the identities of the clusters) to construct a set of episodic tasks having the same setting as the testing environment. We then train the few-shot learner using the constructed episodic tasks and obtain improved feature representations for the data. In the next round, we use the improved features to re-cluster the data points, generating new pseudo labels and constructing new episodic tasks, and train the few-shot learner again. These two steps are repeated until a stopping criterion is reached. After training, we expect that the few-shot learner has acquired transferable knowledge (the optimized feature representations) suitable for a novel task with the same setting as in the episodic training. Using benchmark datasets, we demonstrate that our model outperforms other unsupervised few-shot learning methods and approaches the performance of fully supervised models.

2 Related Works

In the paradigm of episodic training, few-shot learning algorithms can be divided into two main categories: “learning to optimize” and “learning to compare”. The former aims to develop a learning algorithm which can adapt to a new task efficiently using only few labeled examples or with few steps of parameter updating (Finn et al., 2017; Ravi and Larochelle, 2016; Mishra et al., 2017; Rusu et al., 2018; Nichol and Schulman, 2018; Andrychowicz et al., 2016), and the latter aims to learn a proper embedding function, so that prediction is based on the distance (metric) of a novel example to the labeled instances (Vinyals et al., 2016; Snell et al., 2017; Ren et al., 2018; Sung et al., 2018; Liu et al., 2018). In the present study, we focus on the “learning to compare” framework, although the other framework can also be integrated into our model.

Only very recently have people tried to develop unsupervised few-shot learning models. Hsu et al. (2018) proposed a method called CACTUs, which uses progressive clustering to leverage feature representations before carrying out episodic training. This differs from our model in that we carry out progressive clustering and episodic training concurrently, and the two processes facilitate each other. Khodadadeh et al. (2018) proposed a method called UMTRA, which utilizes statistical diversity properties and domain-specific augmentations to generate training and validation data. Antoniou and Storkey (2019) proposed a similar model called AAL, which uses data augmentations of the unlabeled support set to generate the query data. Both methods differ from our model in that we use a progressive clustering strategy to generate pseudo labels for constructing episodic tasks.

The idea of self-supervised training is to artificially generate pseudo labels for unlabeled data, which is useful when supervisory signals are not available or too expensive (de Sa, 1994). Progressive (deep) clustering is a promising method for self-supervised training, which aims to optimize feature representations and pseudo labels (cluster assignments) in an iterative manner. This idea was first applied in NLP tasks, which tries to self-train a two-phase parser-reranker system using unlabeled data (McClosky et al., 2006). Xie et al. (2016) proposed a Deep Embedded Clustering network to jointly learn cluster centers and network parameters. Caron et al. (2018) further proposed strategies to solve the degenerated solution problem in progressive clustering. Fan et al. (2018) and Song et al. (2018) applied the progressive clustering idea to the person re-identification task, both of which aim to transfer the extracted feature representations to an unseen domain. None of these studies have integrated progressive clustering and episodic training in few-shot learning as we do in this work.

3 Method

In this section, we describe the model UFLST in detail. Let us first introduce some notation. Denote the model at the $t$-th training round as $M_t$, and the unlabeled dataset as $X = \{x_i\}_{i=1}^{N}$, with $N$ the number of examples. $z_i \in \mathbb{R}^{d}$ is the corresponding feature vector with dimensionality $d$, which is given by $z_i = f_\theta(x_i)$, where $f_\theta$ represents the feature extractor and $\theta$ the trainable parameters of $M_t$. $Z^s$ and $X^s$ represent, respectively, the selected features and unlabeled data after removing outliers from the clustering results, and $Y^s$ the corresponding pseudo labels.

3.1 Progressive Clustering

3.1.1 K-reciprocal Jaccard Distance

To cluster unlabeled data points, we adopt the k-reciprocal Jaccard distance (KRJD) to measure the distance between data points (Qin et al., 2011; Zhong et al., 2017). The distance is computed in the feature space rather than in the raw pixel space. First, we calculate the k-reciprocal nearest neighbours of each feature point $z_i$, which are given by

$$R(z_i, k) = \{\, z_j \mid z_j \in N(z_i, k) \ \wedge\ z_i \in N(z_j, k) \,\}, \tag{1}$$

where $N(z_i, k)$ denotes the $k$ nearest neighbours of $z_i$. $R(z_i, k)$ imposes that $z_i$ and each element of $R(z_i, k)$ are mutual nearest neighbours. Second, we compute the KRJD between two feature points $z_i$ and $z_j$, which is given by

$$d_J(z_i, z_j) = 1 - \frac{|R(z_i, k) \cap R(z_j, k)|}{|R(z_i, k) \cup R(z_j, k)|}. \tag{2}$$

Compared to the Euclidean distance, KRJD takes into account the reciprocal relationship between data points and is hence a stricter criterion for deciding whether two feature points match. We find that KRJD is crucial to our model; it outperforms the Euclidean metric, as demonstrated in Fig. 2.
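The two steps above (mutual-neighbour sets, then a Jaccard distance between those sets) can be illustrated with a small standard-library sketch. This is not the authors' implementation: the function names are hypothetical, neighbours are ranked by plain Euclidean distance, and each point is assumed to belong to its own reciprocal set so that the union is never empty.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(features, k):
    """Index set of the k nearest neighbours of every point (the point itself excluded)."""
    neighbours = []
    for i, zi in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        others.sort(key=lambda j: euclidean(zi, features[j]))
        neighbours.append(set(others[:k]))
    return neighbours


def k_reciprocal_sets(features, k):
    """Keep neighbour j of i only if i is also a neighbour of j.

    Assumption of this sketch: i is added to its own set.
    """
    nn = k_nearest(features, k)
    return [{i} | {j for j in nn[i] if i in nn[j]} for i in range(len(features))]


def krjd_matrix(features, k):
    """Pairwise Jaccard distance between the k-reciprocal neighbour sets."""
    r = k_reciprocal_sets(features, k)
    n = len(features)
    return [[1.0 - len(r[i] & r[j]) / len(r[i] | r[j]) for j in range(n)]
            for i in range(n)]
```

With two well-separated groups of points, members of the same group share their reciprocal sets and get distance 0, while points from different groups have disjoint sets and get distance 1. This illustrates why the measure is stricter than a raw Euclidean ranking.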

Figure 2: Comparing the performances of KRJD and the Euclidean metric. The top 10 neighbours of a chosen query character from Omniglot are shown. Black box: the query character. Green box: positive characters in the neighbourhood of the query character. Red box: negative characters in the neighbourhood of the query character. (A) The ranking result using the Euclidean metric. (B) The ranking result using KRJD. The number under each image represents its true class. KRJD outperforms the Euclidean metric in that it includes more positive examples in the ranking list.

3.1.2 Density-based spatial clustering algorithm

For the clustering strategy, we choose the density-based spatial clustering algorithm DBSCAN (Ester et al., 1996), which performs better than other methods in our model. The reasons are: 1) other clustering methods such as k-means and hierarchical clustering are suited to finding spherical or convex clusters in the embedding space, while DBSCAN works well for arbitrarily shaped clusters; 2) DBSCAN can detect outliers as noise points, which is very useful at the beginning of training when the distribution of data points in the embedding space is highly noisy; 3) DBSCAN does not require the number of clusters to be specified in advance, which is appealing for unsupervised learning. After removing noisy points (outliers) as done in DBSCAN, the pseudo labels (i.e., cluster identities) $Y^s$ of the retained data points can be expressed as

$$Y^s = \mathrm{DBSCAN}(\varepsilon, \mathit{MinPts}, D_J), \tag{3}$$

where $\varepsilon$ denotes the maximum distance for two points to be considered as in the same neighborhood, $\mathit{MinPts}$ the minimum number of points huddled together for a region to be considered dense, and $D_J$ the KRJD matrix. The value of $\varepsilon$ depends on the cluster density of the feature points; in this study it is set to the mean of the top-ranked minimum distances in the KRJD matrix.
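As a rough illustration of this step, the sketch below runs a minimal DBSCAN directly on a precomputed distance matrix, picks the neighbourhood radius as the mean of the smallest pairwise distances, and drops the points flagged as noise. It is a simplified stand-in written for this text, not the authors' code: the function names and the `top` argument (how many smallest distances to average) are assumptions, and a library implementation would normally be used in practice.

```python
def eps_from_smallest(dist, top):
    """Mean of the `top` smallest off-diagonal pairwise distances (hypothetical rule)."""
    n = len(dist)
    pairs = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    chosen = pairs[:top]
    return sum(chosen) / len(chosen)


def dbscan_precomputed(dist, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    labels = [None] * n                      # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(n) if dist[i][j] <= eps]   # includes i itself
        if len(neigh) < min_pts:
            labels[i] = -1                   # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neigh_j = [m for m in range(n) if dist[j][m] <= eps]
            if len(neigh_j) >= min_pts:      # only core points keep expanding the cluster
                queue.extend(neigh_j)
    return labels


def pseudo_label(data, dist, eps, min_pts):
    """Return (example, cluster_id) pairs with outliers removed."""
    labels = dbscan_precomputed(dist, eps, min_pts)
    return [(x, y) for x, y in zip(data, labels) if y != -1]
```

Because the number of clusters is never passed in, it is determined entirely by the radius and the density threshold, which is the property the text highlights as appealing in the unsupervised setting.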

3.2 Episodic training

3.2.1 Constructing episodic tasks

After each round of clustering, we construct a number of episodic tasks, denoted as $\{\mathcal{T}_i\}_{i=1}^{N_T}$ with $N_T$ the number of tasks, from the pseudo-labeled dataset $(X^s, Y^s)$. For each episodic task, we randomly sample $C$ classes and $K$ examples per class. Notably, the setting of each episodic task follows that of the test environment in which the model is evaluated after training.

We implement few-shot learning in two different ways (see below). One uses the prototype loss, which aims to learn a prototype for each class and to classify a novel example based on its distances to the prototypes. In this case, we further split each task $\mathcal{T}_i$ into a support set $S_i$ and a query set $Q_i$, i.e., $\mathcal{T}_i = S_i \cup Q_i$. Following Snell et al. (2017), we choose a larger value of $C$ in training than in testing, but keep $K$ the same. The other way is to use the triplet loss or the hardtriplet loss, which separates examples from different classes by a positive margin $m$. In this case, no split into support and query data is needed. To mine hard negative examples in triplets, we also use a larger $C$ in training than in testing.
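The task construction described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function and argument names are hypothetical, clusters with too few members for the requested setting are simply skipped, and cluster identities are remapped to episode-local labels 0, 1, 2, and so on.

```python
import random
from collections import defaultdict


def sample_episode(pseudo_labeled, n_way, n_support, n_query, rng=random):
    """Build one episodic task from (example, cluster_id) pairs.

    Returns (support, query), each a list of (example, episode_label).
    """
    by_cluster = defaultdict(list)
    for x, y in pseudo_labeled:
        by_cluster[y].append(x)

    per_class = n_support + n_query
    # Assumption: clusters too small for this setting are left out of the draw.
    eligible = [c for c, xs in by_cluster.items() if len(xs) >= per_class]
    if len(eligible) < n_way:
        raise ValueError("not enough sufficiently large clusters for this episode setting")

    support, query = [], []
    for new_label, c in enumerate(rng.sample(eligible, n_way)):
        picked = rng.sample(by_cluster[c], per_class)   # sampling without replacement
        support += [(x, new_label) for x in picked[:n_support]]
        query += [(x, new_label) for x in picked[n_support:]]
    return support, query
```

For the triplet-based variants, where no support/query split is needed, the two returned lists can simply be concatenated into one batch.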

3.2.2 Loss Functions

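Two types of loss functions are used in the present study. Both belong to the “learning to compare” framework and contribute to the simple inductive bias of our model. One is the prototype loss, which is written as

$$p(y = k \mid x) = \frac{\exp\big(-d(f_\theta(x), c_k)\big)}{\sum_{k'} \exp\big(-d(f_\theta(x), c_{k'})\big)}, \tag{4}$$

where $c_k$ is the prototype of class $k$ given by $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i)$, with $S_k$ the support examples of class $k$, $d(\cdot, \cdot)$ a distance function, and $x$ a query point. In implementation, we minimize the negative log value of Eq. 4, i.e., $J = -\log p(y = k \mid x)$ for the true class $k$ of $x$, as the log value better reflects the geometry of the loss function and makes it easier to select a learning rate.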
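A small numerical sketch of the negative log value of Eq. 4 is given below. It operates on already-computed embeddings, uses the squared Euclidean distance as in Snell et al. (2017), and averages over the query set; the distance choice, the averaging and all function names are assumptions of this sketch, not details confirmed by the text.

```python
import math


def sq_euclidean(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypes(support):
    """Class prototypes: the mean embedding of each class in the support set."""
    sums, counts = {}, {}
    for z, y in support:
        if y not in sums:
            sums[y] = [0.0] * len(z)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], z)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def prototype_loss(support, query):
    """Mean negative log-probability of the true class over the query set."""
    protos = prototypes(support)
    total = 0.0
    for z, y in query:
        logits = {k: -sq_euclidean(z, c) for k, c in protos.items()}
        m = max(logits.values())                       # log-sum-exp shift for stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_norm - logits[y]
    return total / len(query)
```

A query embedding that coincides with its own prototype and is far from the others gives a loss near zero, while a query exactly halfway between two prototypes gives $\ln 2$, the loss of a coin flip between two classes.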

The other is the triplet loss (Weinberger and Saul, 2009), which has been widely used in face recognition and image retrieval. The triplet loss consists of several triplets, each of which includes a query feature $z_q$, a positive feature $z_p$ and a negative feature $z_n$, and is written as

$$L_{\mathrm{triplet}} = \sum_{(z_q, z_p, z_n)} \big[\, m + d(z_q, z_p) - d(z_q, z_n) \,\big]_+, \tag{5}$$

where $[\cdot]_+ = \max(\cdot, 0)$ is the hinge function and $m$ controls the margin between two classes. The hinge term corrects triplets so that the distance from the query point to the negative example exceeds the distance to the positive example by at least the margin $m$. However, in the above form, positive pairs in “already correct” triplets are no longer pulled together because of the hard cutoff. We therefore replace the hinge term by a soft-margin formulation, which gives

$$L_{\mathrm{soft}} = \sum_{(z_q, z_p, z_n)} \ln\!\big(1 + \exp\big(d(z_q, z_p) - d(z_q, z_n)\big)\big). \tag{6}$$

Eq. 6 is similar to Eq. 5, but it decays exponentially instead of having a hard cutoff and tends to be numerically more stable (Hermans et al., 2017).
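To make the difference between Eq. 5 and Eq. 6 concrete, the per-triplet terms can be written as two short functions of the query-positive and query-negative distances. This is a sketch with hypothetical function names; the soft-margin term is the softplus function, evaluated here in a form that avoids overflow for large arguments.

```python
import math


def hinge_triplet(d_pos, d_neg, margin):
    """Hinge form: zero as soon as the negative is farther than the positive by the margin."""
    return max(0.0, margin + d_pos - d_neg)


def soft_margin_triplet(d_pos, d_neg):
    """Soft-margin form ln(1 + exp(d_pos - d_neg)), computed stably."""
    x = d_pos - d_neg
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

For a triplet that is already correct, for example `d_pos = 1.0` and `d_neg = 3.0` with margin 0.5, the hinge term is exactly zero, whereas the soft-margin term stays positive and keeps shrinking as the negative moves farther away. That residual value is what continues to pull positive pairs together.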

We find that, in general, our model achieves better performance with the prototype loss than with the triplet loss. However, when hard example mining is included in constructing triplets (referred to as the hardtriplet loss hereafter), the model performance improves significantly and surpasses that obtained with the prototype loss. The pseudo code of our model is summarized in Algorithm 1.

Require: Unlabeled dataset $X$, initialized model parameters $\theta$
Ensure: Trained model parameters $\theta$
repeat
   Clustering:
   Extract the features of $X$ using the feature extractor $f_\theta$.
   Compute the k-reciprocal nearest neighbours of each feature point.
   Calculate the Jaccard distance matrix based on the k-reciprocal nearest neighbours.
   Cluster the data using DBSCAN and generate pseudo labels.
   Remove outliers and obtain the pseudo-labeled dataset.
   Episodic Training:
   repeat
      Construct an episodic task by randomly sampling classes, with a fixed number of examples per class, from the pseudo-labeled dataset.
      Update the model parameters by training the few-shot learner on the episodic task.
   until the episodic training of the current round is finished
until the stopping criterion is reached
Algorithm 1 Unsupervised Few-shot Learning via Self-supervised Training (UFLST)
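The control flow of Algorithm 1 can be sketched as a short driver that alternates the two phases. Everything model-specific is passed in as a callable, so the sketch makes no claim about the network or the optimizer. The callable names, the fixed numbers of rounds and episodes used as stopping criteria, and the convention that label -1 marks an outlier are all assumptions of this illustration.

```python
def uflst_train(data, params, extract, cluster, sample_task, update,
                n_rounds, n_episodes):
    """Alternate clustering and episodic training.

    extract(params, data)       -> list of feature vectors
    cluster(features)           -> list of cluster ids, -1 for outliers
    sample_task(pseudo_labeled) -> one episodic task
    update(params, task)        -> updated parameters
    """
    for _ in range(n_rounds):
        # Clustering phase: re-embed the data with the current model and relabel it.
        features = extract(params, data)
        labels = cluster(features)
        pseudo_labeled = [(x, y) for x, y in zip(data, labels) if y != -1]

        # Episodic training phase: train on tasks built from the fresh pseudo labels.
        for _ in range(n_episodes):
            task = sample_task(pseudo_labeled)
            params = update(params, task)
    return params
```

The point of the structure is that `extract` is called again at the start of every round with the updated parameters, so each clustering pass sees the features improved by the previous round of episodic training.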

4 Experiments

4.1 Datasets

We evaluate our model on three benchmark datasets, which are Omniglot (Lake et al., 2015), Mini-ImageNet (Vinyals et al., 2016) and Market1501 (Zheng et al., 2015).

Omniglot contains 1623 different handwritten characters from 50 different alphabets. There are 20 examples in each class and each of them was drawn by a different human subject via Amazon’s Mechanical Turk. We split data into two parts: 1200 characters for training and 423 for testing, but we did not augment data with rotations (this is unnecessary in our model), and instead of resizing images to , we resized them to .

Mini-ImageNet is derived from the ILSVRC-12 dataset. We follow the data split proposed by Ravi and Larochelle (2016), which uses 100 classes: 64 for training, 16 for validation, and 20 for testing. This set of classes differs from the split of Vinyals et al. (2016). Each class contains 600 color images of size $84 \times 84$.

Market1501 is a person re-identification (Re-ID) dataset containing 32668 images with 1501 identities captured from 6 cameras. The dataset is split into three parts: 12936 images with 751 identities forming the training set, 19732 images with 750 identities forming the gallery set, and another 3368 images forming the query set. All images were resized to . Except normalization, no other pre-processing was applied.

4.2 Implementation Details

When training on the Omniglot and Mini-ImageNet datasets, we chose the same model architecture as in Vinyals et al. (2016), which consists of four stacked layers. Each layer comprises a 64-filter $3 \times 3$ convolution, followed by batch normalization, a ReLU nonlinearity, and $2 \times 2$ max-pooling. When training on the Market1501 dataset, because of the high variance in pose and luminance, we chose a more expressive model (Xiong et al., 2018), which consists of a ResNet50 pretrained on ImageNet as the backbone and a batch normalization layer after the global max-pooling layer to prevent overfitting. Our evaluation protocol on Market1501 differs from that in Zheng et al. (2015), who reported the Cumulative Matching Characteristic (CMC) and the mean average precision (mAP); we instead consider the performance of 1-shot learning, which mimics the typical single-query condition in a person Re-ID task.

When training with the triplet loss, we set the margin between positive and negative examples to be , and the number of training rounds . To avoid overfitting, the model is fine-tuned for only epochs in each round. We used Adam with momentum to update the model parameters, and the learning rate is set to with an exponential decay after epochs. The mini-batch size is , which consists of classes and examples per class in each episodic task. When constructing triplets with hard example mining, we didn’t mine hard negative examples across the whole dataset which is infeasible, rather we did only in the current episodic task. When training with the prototype loss, we used more classes (higher “way”) during training ( in Omniglot and in Market1501), which leads to better performances as empirically observed in Snell et al. (2017). Other hyper-parameters are set to be the same as training with the triplet loss.
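Mining restricted to the current episodic task can be sketched as below. The sketch follows the "batch hard" strategy of Hermans et al. (2017), in which each anchor is paired with its farthest same-label example and its closest other-label example inside the batch. Whether the authors also mine the hardest positive is not stated in the text, so that part, along with the function name, is an assumption.

```python
import math


def batch_hard_triplets(embeddings, labels):
    """Return (anchor, hardest_positive, hardest_negative) index triplets for one episode."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    triplets = []
    n = len(embeddings)
    for a in range(n):
        pos = [j for j in range(n) if j != a and labels[j] == labels[a]]
        neg = [j for j in range(n) if labels[j] != labels[a]]
        if not pos or not neg:
            continue                      # anchor has no valid triplet in this episode
        hardest_pos = max(pos, key=lambda j: dist(embeddings[a], embeddings[j]))
        hardest_neg = min(neg, key=lambda j: dist(embeddings[a], embeddings[j]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets
```

The search costs on the order of the squared episode size rather than the squared dataset size, which is the practical reason for mining only within the current task.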

Figure 3: Behaviors of progressive clustering. (A) Visualizing clustering results over training rounds by T-SNE. characters from the Futurama alphabets in the Omniglot dataset were selected. (B) NMI vs. training round. (C) Classification accuracy vs. training round.

4.3 Performance of Progressive Clustering

We first check the behavior of progressive clustering by visualizing hand-written characters from the Futurama alphabet in the Omniglot dataset using t-SNE (Maaten and Hinton, 2008). Overall, we observe that as learning progresses, the organization of data points improves continuously, indicating that our model gradually “discovers” the underlying structure of the data. As illustrated in Fig. 3A, initially all data points are intertwined with each other and no structure exists. Over training, clusters gradually emerge, in the sense that data points from the same class are grouped together and the margins between different classes are enlarged. We quantitatively measure the clustering quality by computing the Normalized Mutual Information (NMI) between the real labels $Y$ (i.e., the ground truth) and the pseudo labels $Y^s$, which is given by

$$\mathrm{NMI}(Y, Y^s) = \frac{I(Y; Y^s)}{\sqrt{H(Y)\, H(Y^s)}}, \tag{7}$$

where $I(Y; Y^s)$ is the mutual information between $Y$ and $Y^s$, and $H(\cdot)$ the entropy. The value of NMI lies in $[0, 1]$, with $1$ standing for perfect alignment between the two label sets. Note that NMI is invariant to permutations of the label identities. As shown in Fig. 3B, the value of NMI increases with the training round and gradually reaches a high value close to $1$. Remarkably, the value of NMI predicts the classification accuracy of the learning model well (compare Fig. 3B and 3C). These results strongly suggest that the combination of progressive clustering and episodic training in our model is able to discover the underlying structure of the data manifold and extract the representative features of data points necessary for the few-shot classification task.
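For reference, NMI between two label lists can be computed with a few lines of standard-library Python. The sketch normalizes the mutual information by the geometric mean of the two entropies; other normalizations (for example the arithmetic mean) are also in common use, so this particular choice, the zero-entropy convention and the function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two label lists of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def nmi(a, b):
    """Mutual information normalized by the geometric mean of the entropies."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 or hb == 0.0:
        return 0.0          # convention of this sketch when one labeling is constant
    return mutual_information(a, b) / math.sqrt(ha * hb)
```

Swapping the names of the clusters, as in `nmi([0, 0, 1, 1], [1, 1, 0, 0])`, still gives a score of 1 up to rounding, which is the permutation invariance noted above. A labeling that is statistically independent of the ground truth scores 0.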

4.4 Results on Omniglot

Table 1 presents the performance of our model on the Omniglot dataset compared with other methods. We note that, using the triplet loss, our model already outperforms other state-of-the-art unsupervised few-shot learning methods, including CACTUs (Hsu et al., 2018), UMTRA (Khodadadeh et al., 2018), and AAL (Antoniou and Storkey, 2019), by a large margin in most settings. Using the prototype loss, the performance of our model is further improved. The best overall performance of our model is achieved when using the hardtriplet loss. Remarkably, the best performance of our model approaches that of two supervised models, which serve as upper bounds for unsupervised methods.

| Method | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot |
| --- | --- | --- | --- | --- |
| UMTRA (Khodadadeh et al., 2018) | 77.80 | 92.74 | 62.20 | 77.50 |
| CACTUs-MAML (Hsu et al., 2018) | 68.84 | 87.78 | 48.09 | 73.36 |
| CACTUs-ProtNets (Hsu et al., 2018) | 68.12 | 83.58 | 47.75 | 66.27 |
| AAL-MAML++ (Antoniou and Storkey, 2019) | 88.40 | 97.96 | 70.21 | 88.32 |
| AAL-ProtoNets (Antoniou and Storkey, 2019) | 84.66 | 89.14 | 68.79 | 74.28 |
| UFLST-Tripletloss | 88.68 | 96.65 | 73.21 | 90.11 |
| UFLST-Prototypeloss | 96.51 | 99.23 | 90.27 | 97.22 |
| UFLST-HardTripletloss | 97.03 | 99.19 | 91.28 | 97.37 |
| MAML (Finn et al., 2017) (Supervised) | 98.7 | 99.9 | 95.8 | 98.9 |
| ProtoNets (Snell et al., 2017) (Supervised) | 98.8 | 99.7 | 96.0 | 98.9 |
Table 1: Performances of different unsupervised few-shot learning models on Omniglot under different settings.

4.5 Results on Mini-ImageNet

| Method (way, shot) | (5,1) | (5,5) | (5,20) | (5,50) |
| --- | --- | --- | --- | --- |
| Training from scratch | 25.17 | 33.90 | 39.56 | 41.45 |
| BiGAN knn-nearest neighbors | 25.56 | 31.10 | 37.31 | 43.60 |
| BiGAN linear classifier | 27.08 | 33.91 | 44.00 | 50.41 |
| BiGAN MLP with dropout | 22.91 | 29.06 | 40.06 | 48.36 |
| BiGAN cluster matching | 24.63 | 29.49 | 33.89 | 36.13 |
| BiGAN CACTUs MAML | 36.24 | 51.28 | 61.33 | 66.91 |
| BiGAN CACTUs ProtoNets | 36.62 | 50.16 | 59.56 | 63.27 |
| DeepCluster knn-nearest neighbors | 28.90 | 42.25 | 56.44 | 63.90 |
| DeepCluster linear classifier | 29.44 | 39.79 | 56.19 | 65.28 |
| DeepCluster MLP with dropout | 29.03 | 39.67 | 52.71 | 60.95 |
| DeepCluster cluster matching | 22.20 | 23.50 | 24.97 | 26.87 |
| DeepCluster CACTUs MAML | 39.90 | 53.97 | 63.84 | 69.64 |
| DeepCluster CACTUs ProtoNets | 39.18 | 53.36 | 61.54 | 63.55 |
| UMTRA without data augmentation | 26.49 | - | - | - |
| UMTRA + Shift + random flip | 30.16 | - | - | - |
| UMTRA + Shift + random flip + randomly change to grayscale | 32.80 | - | - | - |
| UMTRA + Shift + random flip + random rotation + color distortions | 35.09 | - | - | - |
| UMTRA + AutoAugment | 39.93 | 50.73 | 61.11 | 67.15 |
| AAL-MAML++ + CHV | 33.06 | 40.75 | - | - |
| AAL-MAML++ + CHVR | 33.21 | 40.34 | - | - |
| AAL-MAML++ + CHV + CUT | 33.34 | 39.44 | - | - |
| AAL-MAML++ + CHV + DROP | 30.86 | 40.41 | - | - |
| AAL-MAML++ + CHVW | 33.30 | 46.98 | - | - |
| AAL-MAML++ + CHVWG | 34.57 | 49.18 | - | - |
| AAL-MAML++ + CHVR + CUT | 33.09 | 40.11 | - | - |
| AAL-MAML++ + CHVR + DROP | 31.70 | 39.38 | - | - |
| AAL-MAML++ + CHV + DROP + CUT | 31.55 | 38.76 | - | - |
| AAL-MAML++ + CHVR + DROP + CUT | 31.44 | 39.87 | - | - |
| AAL-ProtoNets + CHV | 37.67 | 40.29 | - | - |
| AAL-ProtoNets + CHV + CUT | 36.38 | 40.89 | - | - |
| AAL-ProtoNets + CHV + CUT + DROP | 33.13 | 36.64 | - | - |
| AAL-ProtoNets + CHVR + CUT + DROP | 31.93 | 36.45 | - | - |
| AAL-ProtoNets + CHVR + CUT | 33.92 | 39.87 | - | - |
| AAL-ProtoNets + CHV + DROP | 32.12 | 36.12 | - | - |
| AAL-ProtoNets + CHVR + DROP | 31.13 | 36.83 | - | - |
| AAL-ProtoNets + CHVR | 34.28 | 39.83 | - | - |
| UFLST without data augmentation | 33.77 | 45.03 | 53.35 | 56.72 |
| MAML (Finn et al., 2017) (Supervised) | 46.81 | 62.13 | 71.03 | 75.54 |
| ProtoNets (Snell et al., 2017) (Supervised) | 46.56 | 62.29 | 70.05 | 72.04 |
Table 2: Performances of different unsupervised few-shot learning models on Mini-ImageNet under different settings. The accuracy with std of our model is :, , , on 5-way 1-shot, 5-way 5-shot, 5-way 20-shot, 5-way 50-shot, respectively.

Overall, training a few-shot learner on the Mini-ImageNet dataset under the unsupervised setting is quite tricky. All three aforementioned approaches adopt domain-specific knowledge and data augmentation tricks in their training. For example, UMTRA relies on the statistical likelihood of picking different classes for the training data of a task when the number of classes is large, and on an augmentation function for the validation data. CACTUs relies on an unsupervised feature learning algorithm to provide a statistical likelihood of difference and sameness in the training and validation data of a task. The choice of the right augmentation function for UMTRA and AAL, the right feature embedding approach for CACTUs, and the other hyper-parameters have a strong impact on the performance.

The model architecture trained on the Mini-ImageNet dataset is exactly the same as on the Omniglot dataset, i.e., the 4-layer convnet described in Sec. 4.2. We only report results obtained without any data augmentation. We achieve 33.77% and 45.03% under the 5-way 1-shot and 5-way 5-shot scenarios, respectively (Table 2). Compared to the model trained from scratch (25.17% under the 5-way 1-shot scenario), our model has a gain of 8.60 percentage points. The best 5-way 1-shot accuracy of the CACTUs model is 39.90%. However, a direct comparison with CACTUs is unfair, because CACTUs uses AlexNet or VGG16 to first learn a very good feature embedding for the downstream clustering process, while our model is composed only of a 4-layer convnet. The best results of both the UMTRA model and the AAL model are obtained with elaborate data augmentations, such as shifting, random flipping, color distortion, image warping and image-pixel dropout (see Khodadadeh et al. (2018); Antoniou and Storkey (2019) for more details), while we do not use any data augmentation here. It is noteworthy that our model outperforms UMTRA trained without any data augmentation by a large margin (33.77% vs. 26.49%).

Compared to the results on Omniglot and Market1501, the results on Mini-ImageNet are not state-of-the-art. The underlying reasons may be threefold. (1) For a fair comparison with other unsupervised few-shot learning models, we use the 4-layer convnet. However, the intra-class variation of Mini-ImageNet is very large, which makes it hard for such a small network to capture the semantic meaning of images. (2) In unsupervised learning, it is hard to choose suitable hyper-parameters, such as the clustering frequency, the DBSCAN-related parameters, and the learning rate. (3) The ground-truth number of classes in Mini-ImageNet is small (64 for training, 16 for validation and 20 for testing). However, for constructing episodic tasks, we prefer to over-segment the dataset, and this over-segmentation tends to assign data belonging to the same class to different clusters, leading to degraded performance. Our model performs very well on Omniglot and Market1501, which may be attributed to the fact that both datasets have a large number of classes and a small number of examples per class. This type of dataset is very suitable for constructing episodic tasks to train a few-shot learner. In future work, we will explore more domain-specific knowledge and data augmentation strategies to improve the accuracy on the Mini-ImageNet dataset and extend our model to more datasets.

4.6 Results on Market1501

We also applied our model to a real-world application, person Re-ID. In practice, labeled data is extremely scarce for person Re-ID, and unsupervised learning becomes crucial. Table 3 presents the performance of our model on the benchmark dataset Market1501. There is no reported unsupervised few-shot learning result on this dataset in the literature. Rahimpour and Qi (2018) report supervised results under the 100-way 1-shot scenario. To evaluate our model, we trained a supervised model adapted from Xiong et al. (2018). We find that the model performance using the hardtriplet loss is much better than that using the prototype loss. The reason is that the large variations in the appearance and environment of detected pedestrians mean that noisy samples may be chosen as prototypes, which deteriorates learning, whereas the hardtriplet loss focuses on correcting highly noisy examples that violate the margin and hence alleviates the problem. Overall, we observe that our model achieves encouraging performance compared to the supervised method, in particular in the low-way classification scenarios, which suggests that our model is feasible in practice for person Re-ID when annotated labels are unavailable.

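| Method | 5-way | 10-way | 15-way | 20-way | 50-way | 100-way |
| --- | --- | --- | --- | --- | --- | --- |
| UFLST-Tripletloss | 72.8 | 63.0 | 56.2 | 53.4 | 42.5 | 35.4 |
| UFLST-Prototypeloss | 88.3 | 81.2 | 75.8 | 73.0 | 62.5 | 54.0 |
| UFLST-HardTripletloss | 91.4 | 86.9 | 81.6 | 80.4 | 70.1 | 62.1 |
| Our supervised model | 96.8 | 94.7 | 92.5 | 91.1 | 83.7 | 77.3 |
| ARM (Rahimpour and Qi, 2018) | - | - | - | - | - | 76.99 |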
Table 3: Performances of our model on Market1501 with different settings. The supervised model is adapted from Xiong et al. (2018). Only 1-shot learning is considered to mimic the typical single query condition in person Re-ID applications.

4.7 Effect of the size of the unlabeled dataset

We also evaluate how our model relies on the number of unlabeled examples. Table 4 presents the results on the Omniglot dataset with varying number of training examples. Overall, the model performance is improved when the number of training examples increases. Notably, by using only a quarter of the unlabeled data, our model already achieves performances comparable to other unsupervised few-shot learning methods (comparing UFLST-300 with those in Table 1). This demonstrates the feasibility of our model when the number of unlabeled examples is not large.

| Model (number of classes) | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot |
| --- | --- | --- | --- | --- |
| UFLST-200 | 82.83 | 92.97 | 65.85 | 83.73 |
| UFLST-300 | 86.03 | 95.05 | 70.52 | 87.60 |
| UFLST-400 | 91.30 | 97.27 | 78.64 | 92.50 |
| UFLST-500 | 95.27 | 98.86 | 87.02 | 96.05 |
| UFLST-1200 | 97.03 | 99.19 | 91.28 | 97.37 |
Table 4: Performances of our model on Omniglot using different numbers of unlabeled training examples. The hardtriplet loss is used.

5 Discussion

In this study, we have proposed an unsupervised model, UFLST, for few-shot learning via self-training. The model consists of two processes, progressive clustering and episodic training, which are executed iteratively. Other unsupervised methods also consider the two processes, but perform them separately, in that unsupervised clustering for feature extraction is completed before episodic learning is applied. This separation has a shortcoming, since there is no guarantee that the features extracted by unsupervised clustering are suitable for the subsequent few-shot learning. In contrast, our model carries out the two processes in an alternating manner, which allows them to facilitate each other, such that feature representation and model generalization are optimized concurrently, eventually producing a high-quality few-shot learner. To our knowledge, our work is the first to integrate progressive clustering and episodic training for unsupervised few-shot learning.

On the Omniglot dataset, our model outperforms other state-of-the-art unsupervised few-shot learning methods by a large margin and approaches the performance of supervised models. On the Mini-ImageNet dataset, our model achieves results comparable to those of previous unsupervised few-shot learning models. On the Market1501 dataset, our model also achieves encouraging performance compared with a supervised method. This effectiveness raises the question of why the model works. Few-shot learning is, in essence, the problem of extracting representations of data that support prediction from very few training examples. To address this challenge, the episodic learning paradigm creates a set of episodic few-shot learning scenarios with the same setting as the testing environment, so that the model learns to extract feature representations that transfer to novel but related tasks. For this purpose, the real labels of the data are helpful but not essential, and pseudo-labeled examples can be constructed to train the model. Crucially, as demonstrated by this study, the construction of pseudo-labeled examples must go hand in hand with episodic training, so that the extracted features actually match the few-shot learning task. Notably, this unsupervised and progressive way of learning is consistent with the nature of human few-shot learning.
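To make concrete how pseudo-labels can replace real labels when building episodic tasks, the following is a minimal sketch of sampling one N-way K-shot episode from cluster assignments. The function name `sample_episode`, its parameters, and the use of -1 for unclustered examples are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict


def sample_episode(pseudo_labels, n_way, k_shot, n_query, rng=random):
    """Sample one N-way K-shot episode from pseudo-labels.

    pseudo_labels: list of cluster ids, one per example (-1 = unclustered).
    Returns (support, query), each a list of (example_index, episode_class).
    """
    by_cluster = defaultdict(list)
    for idx, label in enumerate(pseudo_labels):
        if label != -1:
            by_cluster[label].append(idx)

    # Only clusters large enough to fill both support and query are usable.
    eligible = [c for c, members in by_cluster.items()
                if len(members) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough sufficiently large clusters")

    support, query = [], []
    # Clusters are relabeled 0..n_way-1 within the episode.
    for episode_class, cluster_id in enumerate(rng.sample(eligible, n_way)):
        members = rng.sample(by_cluster[cluster_id], k_shot + n_query)
        support += [(i, episode_class) for i in members[:k_shot]]
        query += [(i, episode_class) for i in members[k_shot:]]
    return support, query
```

A call such as `sample_episode(labels, n_way=5, k_shot=1, n_query=5)` would yield one 5-way 1-shot task. Because the cluster ids are remapped inside each episode, only the grouping induced by the pseudo-labels matters, not their identity.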


  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016) Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989. Cited by: §2.
  • A. Antoniou and A. Storkey (2019) Assume, augment and learn: unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884. Cited by: §2, §4.4, §4.5, Table 1.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §2.
  • V. R. de Sa (1994) Learning classification with unlabeled data. In Advances in neural information processing systems, pp. 112–119. Cited by: §2.
  • E. Dupoux (2018) Cognitive science in the era of artificial intelligence: a roadmap for reverse-engineering the infant language-learner. Cognition 173, pp. 43–59. Cited by: §1.
  • M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231. Cited by: §3.1.2.
  • H. Fan, L. Zheng, C. Yan, and Y. Yang (2018) Unsupervised person re-identification: clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 83. Cited by: §2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §2, Table 1, Table 2.
  • A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §3.2.2.
  • K. Hsu, S. Levine, and C. Finn (2018) Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334. Cited by: §2, §4.4, Table 1.
  • S. Khodadadeh, L. Bölöni, and M. Shah (2018) Unsupervised meta-learning for few-shot image and video classification. arXiv preprint arXiv:1811.11819. Cited by: §2, §4.4, §4.5, Table 1.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §4.1.
  • Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang (2018) Learning to propagate labels: transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002. Cited by: §2.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.3.
  • D. McClosky, E. Charniak, and M. Johnson (2006) Effective self-training for parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics, pp. 152–159. Cited by: §2.
  • N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141. Cited by: §2.
  • A. Nichol and J. Schulman (2018) Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999. Cited by: §2.
  • D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool (2011) Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors. In CVPR 2011, pp. 777–784. Cited by: §3.1.1.
  • A. Rahimpour and H. Qi (2018) Attention-based few-shot person re-identification using meta learning. CoRR abs/1806.09613. Cited by: §4.6, Table 3.
  • S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. Cited by: §2, §4.1.
  • M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676. Cited by: §2.
  • A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960. Cited by: §2.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §2, §3.2.1, §4.2, Table 1, Table 2.
  • L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang (2018) Unsupervised domain adaptive re-identification: theory and practice. arXiv preprint arXiv:1807.11334. Cited by: §2.
  • F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §2.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §1, §2, §4.1, §4.1, §4.2.
  • R. Wang, J. Zhang, S. A. Klein, D. M. Levi, and C. Yu (2014) Vernier perceptual learning transfers to completely untrained retinal locations after double training: a “piggybacking” effect. Journal of Vision 14 (13), pp. 12–12. Cited by: §1.
  • Y. Wang, R. Girshick, M. Hebert, and B. Hariharan (2018) Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7278–7286. Cited by: §1.
  • K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244. Cited by: §3.2.2.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. Cited by: §2.
  • F. Xiong, Y. Xiao, Z. Cao, K. Gong, Z. Fang, and J. T. Zhou (2018) Towards good practices on building effective CNN baseline model for person re-identification. arXiv preprint arXiv:1807.11042. Cited by: §4.2, §4.6, Table 3.
  • L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE international conference on computer vision, pp. 1116–1124. Cited by: §4.1, §4.2.
  • Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327. Cited by: §3.1.1.