SLADE: A Self-Training Framework For Distance Metric Learning

11/20/2020 · Jiali Duan et al. · Amazon, University of Southern California

Most existing distance metric learning approaches use fully labeled data to learn the sample similarities in an embedding space. We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data. We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data. We then train a student model on both labels and pseudo labels to generate final feature embeddings. We use self-supervised representation learning to initialize the teacher model. To better deal with noisy pseudo labels generated by the teacher network, we design a new feature basis learning component for the student network, which learns basis functions of feature representations for unlabeled data. The learned basis vectors better measure the pairwise similarity and are used to select high-confidence samples for training the student network. We evaluate our method on standard retrieval benchmarks: CUB-200, Cars-196 and In-shop. Experimental results demonstrate that our approach significantly improves the performance over the state-of-the-art methods.


1 Introduction

Figure 1: A self-training framework for retrieval. In the training phase, we train the teacher and student networks using both labeled and unlabeled data. In the testing phase, we use the learned student network to extract embeddings of query images for retrieval.

Existing distance metric learning methods mainly learn sample similarities and image embeddings using labeled data [20, 15, 2, 27], which often require a large amount of data to perform well. A recent study [19] shows that most methods perform similarly when hyper-parameters are properly tuned despite employing various forms of losses. The performance gains likely come from the choice of network architecture. In this work, we explore another direction that uses unlabeled data to improve retrieval performance.

Recent methods in self-supervised learning [12, 5, 4] and self-training [30, 6] have shown promising results using unlabeled data. Self-supervised learning leverages unlabeled data to learn general features in a task-agnostic manner. These features can be transferred to downstream tasks by fine-tuning. Recent models show that the features produced by self-supervised learning achieve comparable performance to those produced by supervised learning for downstream tasks such as detection or classification [4]. Self-training methods [30, 6] improve the performance of fully-supervised approaches by utilizing a teacher/student paradigm. However, existing methods for self-supervised learning or self-training mainly focus on classification rather than retrieval.

We present a SeLf-trAining framework for Distance mEtric learning (SLADE) by leveraging unlabeled data. Figure 1 illustrates our method. We first train a teacher model on the labeled dataset and use it to generate pseudo labels for the unlabeled data. We then train a student model on both labels and pseudo labels to generate a final feature embedding.

We utilize self-supervised representation learning to initialize the teacher network. Most deep metric learning approaches use models pre-trained on ImageNet ([15], [27], etc.). Their extracted representations might over-fit to the pre-training objective, such as classification, and not generalize well to different downstream tasks, including distance metric learning. In contrast, self-supervised representation learning [4, 5, 6, 12] learns task-neutral features and is closer to distance metric learning. For these reasons, we initialize our models using self-supervised learning. Our experimental results (Table 3) provide an empirical justification for this choice.

Once the teacher model is pre-trained and fine-tuned, we use it to generate pseudo labels for unlabeled data. Ideally, we would directly use these pseudo labels to generate positive and negative pairs and train the student network. However, in practice, these pseudo labels are noisy, which affects the performance of the student model (cf. Table 4). Moreover, due to their different sources, it is likely that the labeled and unlabeled data include different sets of categories (see Section 4.1 for details about the labeled and unlabeled datasets). The features extracted from the embedding layer may not adequately represent samples from those unseen classes. To tackle these issues, we propose an additional representation layer after the embedding layer. This new layer is only used for unlabeled data and aims at learning basis functions for the feature representation of unlabeled data. The learning objective is contrastive, i.e., images from the same class are mapped close together while images from different classes are mapped farther apart. We use the learned basis vectors to compute the feature representation of each image and measure pairwise similarity for unlabeled data. This enables us to select high-confidence samples for training the student network. Once the student network is trained, we use it to extract embeddings of query images for retrieval.

We evaluate our model on several standard retrieval benchmarks: CUB-200, Cars-196 and In-shop. As shown in the experimental section, our approach outperforms several state-of-the-art methods on CUB-200 and Cars-196, and is competitive on In-shop. We also provide various ablation studies in the experimental section.

The main technical contributions of our work are:

  • A self-training framework for distance metric learning, which utilizes unlabeled data to improve retrieval performance.

  • A feature basis learning approach for the student network, which better deals with noisy pseudo labels generated by the teacher network on unlabeled data.

2 Related work

Distance metric learning is an active research area with numerous publications; here we review those most relevant to our work. While the common objective is to pull similar samples closer together and push dissimilar samples apart, approaches differ in their losses and sample mining methods. One can train a model using a cross-entropy loss [35], hinge loss [20], triplet loss [29], proxy-NCA loss [18, 15, 22], etc. [18] used the proxy-NCA loss to minimize the distance between a sample and its assigned anchor(s); this set of anchors is learnable. [15] further improved the proxy-based loss by combining it with a pair-based loss. We also use a set of learnable vectors, but we do not optimize directly on their distances to samples. Rather, we use them as a basis (anchor set) to represent output features operated on by a distribution loss. Our intuition is that while individual pseudo labels can be noisy, representing features using these anchors makes them more robust to noise [7, 33].

As mentioned previously, we use self-supervised training to seed our teacher model before fine-tuning it. There has been significant progress recently in self-supervised learning of general visual representations [5, 6, 9, 4]. [5] learns representations by maximizing the agreement between two transformed versions of the same input, where the transformations include data augmentation operations such as cropping, distortion and blurring. [6] further narrowed the gap between self-supervised and supervised learning by using larger models, a deeper projection head and a weight stabilization mechanism. In [4], a clustering step is applied to the output representation; the algorithm maximizes the consistency of cluster assignments between different transformations of the same input image. Most of these works aim at learning a generic visual representation that can later be used in various downstream tasks.

Self-training involves knowledge distillation from larger, more complex models or from ensembles of models (teachers) to less powerful, smaller students (e.g., [13, 34, 31]), often with the end purpose of reducing model size. Recently, [30] and [37] used iterative self-training to improve the classification accuracy of both teacher and student models. At a high level, our self-training is similar to these approaches, but it is designed for distance metric learning and semi-supervised learning settings.

Unlabeled data has been used to improve performance in various computer vision tasks such as classification and semantic segmentation (e.g., [21]). It has also been used in self-supervised and unsupervised representation learning, such as in [30] or [36], but there is still a performance gap compared to the fully supervised setting. For distance metric learning, most algorithms that are competitive on popular benchmarks (CUB-200, Cars-196 and In-shop) use fully labeled data ([15], [22], [27], etc.). Here, we additionally use external unlabeled data to push performance on these datasets further.

Figure 2: An overview of our self-training framework. Given labeled and unlabeled data, our framework has three main steps. (1) We first initialize the teacher network using self-supervised learning, and fine-tune it by a ranking loss on labeled data; (2) We use the learned teacher network to extract features, cluster and generate pseudo labels on unlabeled data; (3) We optimize the student network and basis vectors on labeled and unlabeled data. The purpose of feature basis learning is to select high-confidence samples (e.g., positive and negative pairs) for the ranking loss, so the student network can learn better and reduce over-fitting to noisy samples.

3 Method

Figure 2 illustrates the overview of our self-training framework, which has three main components. First, we initialize the teacher network with self-supervised learning: we take the pre-trained SwAV model [4] and fine-tune it on our data without using label information. After this pre-training, we fine-tune the teacher network with a ranking loss (e.g., a contrastive loss) on the labeled data. The details of self-supervised pre-training and fine-tuning of the teacher network are presented in Section 3.1.

Second, we use the fine-tuned teacher network to extract features and cluster the unlabeled data using k-means clustering. We use the cluster ids as pseudo labels. In practice, these pseudo labels are noisy, and directly optimizing the student network with them does not improve over the teacher network. Therefore, we introduce a feature basis learning approach to select high-confidence samples for training the student network. The details of pseudo label generation are presented in Section 3.2.

Third, we optimize the student network and basis vectors using labeled and unlabeled data. The basis vectors are defined as a set of weights that map the feature embedding of each image into a feature representation. We train the basis vectors such that images from the same class are mapped close together and images from different classes are mapped farther apart. The basis vectors are used to select high-confidence samples for the ranking loss. The student network and basis vectors are optimized in an end-to-end manner. The details of student network optimization and feature basis learning are in Section 3.3.

3.1 Self-Supervised Pre-Training and Fine-Tuning for Teacher Network

Existing deep metric learning methods often use an ImageNet pre-trained model [8] for initialization, which may over-fit to the pre-training objective and not generalize well to downstream tasks. Instead, we use self-supervised learning to initialize the teacher model: we take the pre-trained SwAV model [4] and fine-tune it on our data. As shown in the experimental section, this choice improves retrieval performance compared to pre-trained ImageNet models (see Table 3). We conjecture that this is because deep metric learning and self-supervised learning are related; both learn embeddings that preserve distances between similar and dissimilar data.

Specifically, we are given a set of labeled images $\mathcal{D}_L = \{(x_i, y_i)\}$ and a set of unlabeled images $\mathcal{D}_U = \{x_j\}$. We denote the parameters of the teacher network as $\theta_t$ and the parameters of the student network as $\theta_s$. In the pre-training stage, we fine-tune the SwAV model on the union of the labeled and unlabeled images, without using the label information, to initialize the teacher model. Once the teacher network is pre-trained, we fine-tune it using a ranking loss (for example, a contrastive loss [10]) on the labeled data:

$$\mathcal{L}_{rank}(\theta_t) = \sum_{(i,j)\in \mathcal{P}} \big[\, d_{ij} - \alpha \,\big]_+ + \sum_{(i,j)\in \mathcal{N}} \big[\, \beta - d_{ij} \,\big]_+ \qquad (1)$$

where $\mathcal{P}$ is the set of positive pairs, $\mathcal{N}$ is the set of negative pairs, $d_{ij}$ is the distance between the embeddings of images $x_i$ and $x_j$, and $\alpha$ and $\beta$ are the margins.
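For concreteness, below is a minimal PyTorch sketch of a two-margin contrastive ranking loss of the form in Equation 1; the margin values, the use of Euclidean distance, and the function name are illustrative assumptions rather than the authors' implementation.

```python
import torch

def contrastive_ranking_loss(emb_a, emb_b, is_positive, pos_margin=0.0, neg_margin=0.5):
    """Two-margin contrastive loss over a batch of pairs (a sketch of Eq. 1).

    emb_a, emb_b: (N, D) embeddings of the two images in each pair.
    is_positive:  (N,) boolean tensor, True for positive pairs.
    """
    d = torch.norm(emb_a - emb_b, dim=1)              # pairwise distances d_ij
    pos_term = torch.clamp(d - pos_margin, min=0.0)   # pull positives within pos_margin
    neg_term = torch.clamp(neg_margin - d, min=0.0)   # push negatives beyond neg_margin
    return torch.where(is_positive, pos_term, neg_term).mean()
```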

3.2 Pseudo Label Generation

We use the teacher model to extract features and cluster the unlabeled images using k-means; each unlabeled image is assigned to its nearest cluster center, and the assigned cluster ids are used as pseudo labels. One could train a student network with a ranking loss on positive and negative pairs sampled from these pseudo labels. In practice, however, the pseudo labels are noisy, and the unlabeled data may contain unseen categories; the features extracted from the teacher model may not work well on those unseen categories, so the pseudo labels are incorrectly estimated. This motivates us to design a feature basis learning approach that better models the pairwise similarity on unlabeled data and can be used to select high-confidence sample pairs for training the student network.
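A minimal sketch of this pseudo-label generation step, assuming the teacher returns image embeddings and using scikit-learn's KMeans; the helper name, the loader interface, and the cluster count are illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, num_clusters=400, device="cuda"):
    """Extract teacher features for unlabeled images and use k-means cluster ids as pseudo labels."""
    teacher.eval()
    feats = []
    for images in unlabeled_loader:                          # loader assumed to yield image batches only
        f = teacher(images.to(device))                       # (B, D) embeddings
        feats.append(torch.nn.functional.normalize(f, dim=1).cpu().numpy())
    feats = np.concatenate(feats, axis=0)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(feats)
    return kmeans.labels_                                    # one pseudo label per unlabeled image
```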

3.3 Optimization of Student Network and Basis Vectors

We first explain the idea of feature basis learning and the use of basis vectors for sample mining, and then describe the joint training of the student network and basis vectors.

3.3.1 Feature Basis Learning

Basis vectors are a set of learnable weights that map the feature embedding of an image to a feature representation. We denote the set of basis vectors as $\{b_1, \ldots, b_K\}$, where each basis vector $b_k$ is a $d$-dimensional vector and $d$ is the embedding dimension. For simplicity, we represent the basis vectors as a matrix $B \in \mathbb{R}^{K \times d}$. Given an image $x$, we use the student network to obtain its feature embedding $f(x; \theta_s) \in \mathbb{R}^d$. The feature representation is computed by $r = B\, f(x; \theta_s)$, where $r \in \mathbb{R}^K$.

We train the basis vectors to optimize the feature representation such that images from the same class are mapped close while images from different classes are mapped farther apart. Specifically, we train the basis vectors using two losses, a cross-entropy loss and a similarity distribution loss. The loss function for feature basis learning is defined as:

$$\mathcal{L}_{basis} = \mathcal{L}_{ce}(\mathcal{D}_L) + \mathcal{L}_{sd}(\mathcal{D}_U) \qquad (2)$$

where the first term is the cross-entropy loss on the labeled data and the second term is the similarity distribution loss on the unlabeled data.

The cross-entropy loss is applied on labeled data. The ground truth class labels can be used as a strong supervision signal to regularize the basis vectors to separate different classes. The cross entropy loss on labeled data is:

$$\mathcal{L}_{ce} = -\frac{1}{|\mathcal{D}_L|} \sum_{(x_i, y_i) \in \mathcal{D}_L} \log\, \sigma\big(B\, f(x_i; \theta_s)\big)_{y_i} \qquad (3)$$

where $\sigma$ is the softmax function, $B$ is the matrix of basis vectors, and $\theta_s$ are the parameters of the student network.
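Below is a small PyTorch sketch of the basis mapping and the cross-entropy term of Equation 3. It assumes the softmax is taken over the basis responses $B f(x)$ with the ground-truth class as the target (so the number of basis vectors here plays the role of the number of labeled classes); the dimensions and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBasis(nn.Module):
    """Learnable basis vectors B that map an embedding f(x) to a representation B f(x)."""
    def __init__(self, embed_dim=512, num_basis=200):
        super().__init__()
        self.B = nn.Parameter(torch.randn(num_basis, embed_dim) * 0.01)

    def forward(self, embeddings):          # embeddings: (N, embed_dim)
        return embeddings @ self.B.t()      # (N, num_basis) feature representation

def basis_cross_entropy(basis, embeddings, labels):
    """Cross-entropy on labeled data (a sketch of Eq. 3): softmax over the basis responses."""
    logits = basis(embeddings)
    return F.cross_entropy(logits, labels)
```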

For unlabeled data, one could also train with a cross-entropy loss on the pseudo labels, as for the labeled data. However, we found that this leads to poor performance, since the model tends to over-fit to the noisy pseudo labels. Instead, we optimize the pairwise similarity on unlabeled data. Pairwise similarity is a simpler (relaxed) objective than the multi-class cross-entropy loss: it only requires samples from the same class to be close and samples from different classes to be far apart. The motivation is that even though the pseudo labels are noisy, a certain portion of the pairs they induce are correctly estimated, which is enough to train a class-agnostic pairwise similarity model.

We use the pseudo labels to sample a set of pseudo-positive pairs and pseudo-negative pairs, where pseudo-positive pairs are sampled from the same pseudo class and pseudo-negative pairs are sampled from different pseudo classes. We compute the similarity of each image pair as the cosine similarity of the two normalized feature representations:

$$s_{ij} = \frac{\langle B f_i,\ B f_j \rangle}{\lVert B f_i \rVert\, \lVert B f_j \rVert} \qquad (4)$$

where $f_i = f(x_i; \theta_s)$.

We model the similarities of the pseudo-positive and pseudo-negative pairs as two Gaussian distributions $\mathcal{N}(\mu^{+}, (\sigma^{+})^2)$ and $\mathcal{N}(\mu^{-}, (\sigma^{-})^2)$. The idea is to separate the two Gaussian distributions by maximizing the difference between their means and penalizing the variance of each distribution. The similarity distribution loss is defined as:

$$\mathcal{L}_{sd} = \big[\, m - (\mu^{+} - \mu^{-}) \,\big]_+ + (\sigma^{+})^2 + (\sigma^{-})^2 \qquad (5)$$

where $(\mu^{+}, \sigma^{+})$ and $(\mu^{-}, \sigma^{-})$ are the means and standard deviations of the two Gaussian distributions respectively, and $m$ is the margin. We update the parameters in a batch-wise manner:

$$\mu^{+} \leftarrow \rho\, \mu^{+} + (1 - \rho)\, \hat{\mu}^{+}, \qquad (\sigma^{+})^2 \leftarrow \rho\, (\sigma^{+})^2 + (1 - \rho)\, (\hat{\sigma}^{+})^2 \qquad (6)$$

where $\hat{\mu}^{+}$ and $(\hat{\sigma}^{+})^2$ are the mean and variance computed in a batch, and $\rho$ is the updating rate. The parameters of the negative distribution $(\mu^{-}, \sigma^{-})$ are updated in a similar way.
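A sketch of the similarity distribution loss with the batch-wise updates of Equations 5 and 6. The margin, the initial values, and the exact split between running statistics (kept for bookkeeping and thresholds) and batch statistics (used in the differentiable loss) are assumptions; the update rate default follows the value reported in Section 4.3.

```python
import torch

class SimilarityDistributionLoss(torch.nn.Module):
    def __init__(self, margin=0.5, update_rate=0.99):
        super().__init__()
        self.margin, self.rho = margin, update_rate
        # running Gaussian parameters (mean, variance) for positive / negative similarities
        self.register_buffer("mu_pos", torch.tensor(1.0))
        self.register_buffer("var_pos", torch.tensor(0.0))
        self.register_buffer("mu_neg", torch.tensor(0.0))
        self.register_buffer("var_neg", torch.tensor(0.0))

    def forward(self, sim_pos, sim_neg):
        """sim_pos / sim_neg: cosine similarities (Eq. 4) of pseudo-positive / negative pairs in a batch."""
        # Eq. 6: batch-wise moving-average update of the running distribution parameters
        self.mu_pos = self.rho * self.mu_pos + (1 - self.rho) * sim_pos.mean().detach()
        self.var_pos = self.rho * self.var_pos + (1 - self.rho) * sim_pos.var().detach()
        self.mu_neg = self.rho * self.mu_neg + (1 - self.rho) * sim_neg.mean().detach()
        self.var_neg = self.rho * self.var_neg + (1 - self.rho) * sim_neg.var().detach()
        # Eq. 5: margin on the mean gap plus variance penalties (batch statistics keep it differentiable)
        mean_gap = sim_pos.mean() - sim_neg.mean()
        return torch.clamp(self.margin - mean_gap, min=0.0) + sim_pos.var() + sim_neg.var()
```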

3.3.2 Sample Mining

We use the basis vectors to select high-confidence sample pairs from the unlabeled images for training the student network. Given a set of samples in a batch, we compute the pairwise similarities of all samples using Equation 4 and select positive and negative pairs by:

$$\mathcal{P}_u = \{(i, j)\ |\ s_{ij} \geq \tau_p\}, \qquad \mathcal{N}_u = \{(i, j)\ |\ s_{ij} \leq \tau_n\} \qquad (7)$$

We set the confidence thresholds $\tau_p$ and $\tau_n$ using $\mu^{+}$ and $\mu^{-}$. The selected positive and negative pairs are used in the ranking loss (see Equation 9).
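A sketch of the mining step in Equation 7: pairwise similarities in the basis space are thresholded to keep only high-confidence pairs. Requiring agreement with the pseudo labels in addition to the similarity thresholds, and the function signature, are assumptions.

```python
import torch

def mine_pairs(reps, pseudo_labels, tau_pos, tau_neg):
    """Select high-confidence pairs from a batch of basis-space representations (a sketch of Eq. 7).

    reps: (N, K) feature representations B f(x); pseudo_labels: (N,) cluster-id tensor.
    Returns index pairs (i, j) for mined positives and negatives.
    """
    reps = torch.nn.functional.normalize(reps, dim=1)
    sim = reps @ reps.t()                                          # pairwise cosine similarities (Eq. 4)
    same = pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)
    upper = torch.triu(torch.ones_like(sim, dtype=torch.bool), diagonal=1)  # each pair counted once
    pos_idx = ((sim >= tau_pos) & same & upper).nonzero(as_tuple=False)
    neg_idx = ((sim <= tau_neg) & ~same & upper).nonzero(as_tuple=False)
    return pos_idx, neg_idx
```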

3.3.3 Joint Training

We train the student network and the basis vectors by minimizing a total loss $\mathcal{L}$:

$$\min_{\theta_s,\, B}\ \mathcal{L}(\theta_s, B) \qquad (8)$$

$$\mathcal{L}(\theta_s, B) = \mathcal{L}_{rank}(\mathcal{D}_L, \mathcal{P}_u, \mathcal{N}_u; \theta_s) + \lambda_1\, \mathcal{L}_{ce}(\mathcal{D}_L; \theta_s, B) + \lambda_2\, \mathcal{L}_{sd}(\mathcal{D}_U; \theta_s, B) \qquad (9)$$

where $\mathcal{L}_{rank}$ is a ranking loss (see Equation 1) and $\lambda_1$, $\lambda_2$ weight the basis-learning terms. We train the ranking loss on both labeled and unlabeled images; for the unlabeled data, we use sample mining to obtain the positive and negative pairs. Our framework is generic and applicable to different pair-based ranking losses; we report results for the contrastive loss and the multi-similarity loss in Table 1.

We first train the basis vectors for a few iterations to get a good initialization, then train the student network and basis vectors end-to-end. After training the student network, we use the student as a new teacher and go back to the pseudo label generation step. We iterate this a few times. During testing, we discard the teacher model and only use the student model to extract the embedding of a query image for retrieval.
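A high-level sketch of the joint objective (Equation 9) and the iterative teacher/student loop described above; the callables passed in (make_student, generate_pseudo_labels, train_student) are placeholders for the corresponding steps, and the number of rounds is illustrative.

```python
def total_loss(rank_loss, ce_loss, sd_loss, lambda_ce=1.0, lambda_sd=0.25):
    """Eq. 9: ranking loss on labeled + mined unlabeled pairs, plus the two basis-learning terms.
    The default weights follow the values reported in Section 4.3."""
    return rank_loss + lambda_ce * ce_loss + lambda_sd * sd_loss


def iterative_self_training(teacher, make_student, generate_pseudo_labels, train_student,
                            labeled_data, unlabeled_data, num_rounds=2):
    """Outer loop: pseudo-label the unlabeled data, train a student (Eq. 8-9), promote it to teacher."""
    for _ in range(num_rounds):
        pseudo = generate_pseudo_labels(teacher, unlabeled_data)                 # Section 3.2
        student = make_student()
        train_student(student, teacher, labeled_data, unlabeled_data, pseudo)    # minimizes Eq. 9
        teacher = student                                                        # student becomes new teacher
    return teacher   # the final student, used at test time to embed query images
```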

Methods                          Frwk  Init      Arch / Dim  | CUB-200-2011           | Cars-196
                                                             | MAP@R   RP      P@1    | MAP@R   RP      P@1
Contrastive [10]                 [19]  ImageNet  BN / 512    | 26.53   37.24   68.13  | 24.89   35.11   81.78
Triplet [29]                     [19]  ImageNet  BN / 512    | 23.69   34.55   64.24  | 23.02   33.71   79.13
ProxyNCA [18]                    [19]  ImageNet  BN / 512    | 24.21   35.14   65.69  | 25.38   35.62   83.56
N. Softmax [35]                  [19]  ImageNet  BN / 512    | 25.25   35.99   65.65  | 26.00   36.20   83.16
CosFace [25, 26]                 [19]  ImageNet  BN / 512    | 26.70   37.49   67.32  | 27.57   37.32   85.52
FastAP [3]                       [19]  ImageNet  BN / 512    | 23.53   34.20   63.17  | 23.14   33.61   78.45
MS+Miner [27]                    [19]  ImageNet  BN / 512    | 26.52   37.37   67.73  | 27.01   37.08   83.67
Proxy-Anchor [15]                [15]  ImageNet  R50 / 512   | -       -       69.9   | -       -       87.7
Proxy-Anchor [15]                [19]  ImageNet  R50 / 512   | 25.56   36.38   66.04  | 30.70   40.52   86.84
ProxyNCA++ [22]                  [22]  ImageNet  R50 / 2048  | -       -       72.2   | -       -       90.1
Mutual-Info [1]                  [1]   ImageNet  R50 / 2048  | -       -       69.2   | -       -       89.3
Contrastive [10] (teacher)       [19]  ImageNet  R50 / 512   | 25.02   35.83   65.28  | 25.97   36.40   81.22
Contrastive [10] (teacher)       [19]  SwAV      R50 / 512   | 29.29   39.81   71.15  | 31.73   41.15   88.07
SLADE (Ours, student)            [19]  ImageNet  R50 / 512   | 29.38   40.16   68.92  | 31.38   40.96   85.8
SLADE (Ours, student)            [19]  SwAV      R50 / 512   | 33.59   44.01   73.19  | 36.24   44.82   91.06
MS [27] (teacher)                [19]  ImageNet  R50 / 512   | 26.38   37.51   66.31  | 28.33   38.29   85.16
MS [27] (teacher)                [19]  SwAV      R50 / 512   | 29.22   40.15   70.81  | 33.42   42.66   89.33
SLADE (Ours, student)            [19]  ImageNet  R50 / 512   | 30.90   41.85   69.58  | 32.05   41.50   87.38
SLADE (Ours, student)            [19]  SwAV      R50 / 512   | 33.90   44.36   74.09  | 37.98   46.92   91.53
Table 1: MAP@R, RP and P@1 (%) on the CUB-200-2011 and Cars-196 datasets. The pre-trained ImageNet model is denoted as ImageNet and the SwAV model fine-tuned on our data is denoted as SwAV. Each teacher network is trained with one of the losses and initializations, and is then used to train the corresponding student network with the same loss. Note that the results may not be directly comparable, as some methods (e.g., [15, 22, 1]) report results based on their own frameworks with different settings, e.g., embedding dimensions, batch sizes, data augmentation, optimizer, etc. More detailed explanations are in Section 4.4.

4 Experiments

We first introduce the experimental setup, including datasets, evaluation criteria and implementation details. Then we report the results of our method on three common retrieval benchmarks: CUB-200, Cars-196 and In-shop [24, 16, 17]. (We did not carry out experiments on the SOP dataset [20], since we could not find a publicly available unlabeled dataset that is similar to it in content.) Finally, we conduct ablation studies to analyze different design choices of the components in our framework.

4.1 Datasets

CUB-200/NABirds: We use CUB-200-2011 [24] as the labeled data and NABirds [23] as the unlabeled data. CUB-200-2011 contains 200 fine-grained bird species with 11,788 images in total. NABirds is the largest publicly available bird dataset, covering 400 species of North American birds organized into 743 categories; it has 48,000 images, with approximately 100 images per species. Measuring the overlap between the two datasets, 655 of the 743 NABirds categories are unseen in CUB-200, which illustrates the challenge of handling out-of-domain images in the unlabeled data.

Cars-196/CompCars: We use Cars-196 [16] as the labeled data, which contains 16,185 images of 196 classes of cars. Classes are annotated at the level of make, model, and year (e.g., 2012 Tesla Model S). We use CompCars [32] as the unlabeled data. It is collected at the model level, so we filter out unbalanced categories to avoid bias towards minority classes, resulting in 16,537 images in 145 classes.

In-shop/Fashion200k: The In-shop Clothes Retrieval Benchmark [17] includes 7,982 clothing items with 52,712 images. Different from CUB-200 and Cars-196, In-shop is an instance-level retrieval task: each article is considered an individual category (each article has multiple views, such as front, back and side views), resulting in an average of 6.6 images per class. We use Fashion-200k [11] as the unlabeled data, since it has a similar data organization (e.g., catalog images) to the In-shop dataset.

4.2 Evaluation Criteria

For CUB-200 and Cars-196, we follow the settings in [18, 19, 15] that use half of the classes for training and the other half for testing, and use the evaluation protocol in [19] to fairly compare different algorithms. It evaluates the retrieval performance using MAP@R, RP and P@1. For each query, P@1 (also known as Recall@1 in previous metric learning papers) reflects whether the first retrieved result is correct. However, P@1 is not stable: if only the first of all retrieved results is correct, P@1 is still 100%. RP measures the percentage of retrieved results that belong to the same class as the query, but it does not take the ranking of correct retrievals into account. MAP@R combines the idea of mean average precision with RP and is a more accurate measure. For the In-shop experiment, we use Recall@K as the evaluation metric [17].
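For reference, a small sketch of how P@1, RP and MAP@R can be computed for a single query under the protocol of [19]; retrieving by cosine similarity and excluding the query from the index are assumptions about the evaluation setup.

```python
import numpy as np

def query_metrics(query_emb, query_label, index_embs, index_labels):
    """P@1, RP and MAP@R for a single query, following the definitions in [19].

    R = number of index samples sharing the query's label; we retrieve the R nearest
    neighbors by cosine similarity (the query itself is assumed not to be in the index).
    """
    sims = index_embs @ query_emb / (
        np.linalg.norm(index_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    order = np.argsort(-sims)
    R = int((index_labels == query_label).sum())
    hits = (index_labels[order[:R]] == query_label).astype(np.float64)
    p_at_1 = hits[0]                                   # is the top result correct?
    rp = hits.mean()                                   # fraction of the R results that are correct
    precision_at_i = np.cumsum(hits) / (np.arange(R) + 1)
    map_at_r = (precision_at_i * hits).mean()          # only ranks with a correct result contribute
    return p_at_1, rp, map_at_r
```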

We compare our model with fully supervised baselines that are trained with 100% of the labeled data and are fine-tuned end-to-end. Our setting is different from the ones in self-supervised learning frameworks [5, 4], which evaluate models in a label-fraction setting (e.g., using 1% or 10% of the labels from the same dataset for supervised fine-tuning or linear evaluation). Evaluating our model in our setting is important because, in practice, we often use all available labels to fine-tune the entire model to obtain the best performance. This setting also poses several challenges. First, our fully supervised baselines are stronger, as they are trained with 100% of the labels. Second, since our labeled and unlabeled data come from different image distributions, the model trained on labeled data may not work well on the unlabeled data, so there are noisy pseudo labels we need to deal with.

4.3 Implementation Details

We implement our model using the framework of [19]. We use 4-fold cross validation and a batch size of 32 for both labeled and unlabeled data. ResNet-50 is used as our backbone network. We use the pre-trained SwAV [4] model and fine-tune it on our data. In each fold, a student model is trained and outputs a 128-dim embedding; the embeddings from the four folds are concatenated into a 512-dim embedding for evaluation. We set the updating rate $\rho$ to 0.99, and $\lambda_1$ and $\lambda_2$ in Equation 9 to 1 and 0.25 respectively, so that the magnitude of each loss is on a similar scale.

For iterative training, we train a teacher and a student model in each fold, then use the trained student model as the new teacher model for the next fold. This produces a better teacher model more quickly compared to getting a new teacher model after the student network finishes all training folds.

4.4 Results

Method               Recall@1   Recall@10   Recall@20   Recall@40
N. Softmax [35]      88.6       97.5        98.4        -
MS [27]              89.7       97.9        98.5        99.1
ProxyNCA++ [22]      90.4       98.1        98.8        99.2
Cont. w/M [28]       91.3       97.8        98.4        99.0
Proxy-Anchor [15]    91.5       98.1        98.8        99.1
SLADE (Ours)         91.3       98.6        99.0        99.4
Table 2: Recall@K (%) on the In-shop dataset.

The retrieval results for CUB-200 and Cars-196 are summarized in Table 1. We compare our method with the state-of-the-art methods reported in [19] and some recent methods [15, 22, 1]. Note that the numbers may not be directly comparable, as some methods use their own settings. For example, Proxy-Anchor [15] uses a larger batch size of 120 for CUB-200 and Cars-196, and it uses a combination of global average pooling and global max pooling. Mutual-Info [1] uses a batch size of 128 and a larger embedding size of 2048. ProxyNCA++ [22] uses a different global pooling, layer normalization, and data sampling scheme.

We evaluate retrieval performance using the original images for CUB-200 and Cars-196 rather than cropped images as in [14] (CGD). For Proxy-Anchor, in addition to the results reported with its own framework [15], we also report results obtained under the framework of [19] (the two Proxy-Anchor rows in Table 1).

We use the evaluation protocol of [19] for a fair comparison. We use ResNet-50 instead of BN-Inception as our backbone network because it is commonly used in self-supervised frameworks. We experiment with different ranking losses in our framework, e.g., the contrastive loss [10] and the multi-similarity loss [27]. Each teacher network is used to train a corresponding student network, and the same loss is used for both; for example, one teacher/student pair is trained with the contrastive loss and another with the multi-similarity loss. We also report results of our method with different pre-trained models: the pre-trained ImageNet model (denoted as ImageNet) and the fine-tuned SwAV model (denoted as SwAV).

We compare our method with supervised baselines that are trained with 100% of the labeled data. Even in this setting, we still observe a significant improvement from our method compared to state-of-the-art approaches that use ResNet-50 or BN-Inception. Our best models reach 33.90 MAP@R (74.09 P@1) on CUB-200 and 37.98 MAP@R (91.53 P@1) on Cars-196. The results validate the effectiveness of self-supervised pre-training for retrieval as well as of feature basis learning for improving the sample quality on unlabeled data.

We also show that our method generalizes to different losses, e.g., the contrastive loss [10] and the multi-similarity (MS) loss [27]; both lead to improvements with our method. The performance (P@1) of the MS loss is improved from 66.31 to 74.09 on CUB-200 and from 85.16 to 91.53 on Cars-196. We also report the performance of our method using the pre-trained ImageNet model, which is on par with state-of-the-art approaches (e.g., Proxy-Anchor) even when the teacher is a weaker baseline model (e.g., with the MS loss).

Table 2 summarizes the results for In-shop. Different from CUB-200 and Cars-196, In-shop is an instance-level retrieval task where each individual article is considered a category. Fashion200k is used as the unlabeled data. We train the teacher and student models using the multi-similarity loss [27], similar to the MS teacher/student setting in Table 1, and report Recall@K following [17]. We achieve competitive results against several state-of-the-art methods. We note that the images in the In-shop dataset are un-cropped while the images in the Fashion200k dataset are cropped, so there is a notable distribution difference between the two datasets. We use the un-cropped version of In-shop to compare fairly with the baseline methods.

Figure 3: Retrieval results on CUB-200 and Cars-196. We show some challenging cases where our self-training method improves Proxy-Anchor [15]. Our results are generated based on the student model in Table 1. The red bounding boxes are incorrect predictions.

4.5 Ablation study

4.5.1 Initialization of Teacher Network

We first investigate the impact of using different pre-trained weights to initialize the teacher network (see Table 3). The results in the table are the final performance of our framework using the different pre-trained weights; the teacher network is trained with a contrastive loss. We compare three different pre-trained weights: (1) a model trained on ImageNet with supervised learning; (2) a model trained on ImageNet with SwAV [4]; and (3) a SwAV model fine-tuned on our data without using label information. From the results, we can see that the weights from self-supervised learning significantly improve over the pre-trained ImageNet model (+3.41 and +4.16 MAP@R on CUB-200 and Cars-196 respectively). This validates the effectiveness of self-supervised pre-training for retrieval. We also find that the fine-tuned SwAV model further improves over the pre-trained SwAV model (+0.80 and +0.70 MAP@R).

Pre-trained weight       MAP@R (CUB-200)   MAP@R (Cars-196)
ImageNet [8]             29.38             31.38
Pre-trained SwAV [4]     32.79             35.54
Fine-tuned SwAV          33.59             36.24
Table 3: Comparison of different weight initialization schemes of the teacher network, where the teacher is trained with a contrastive loss. The results are the final performance of our framework.

4.5.2 Components in Student Network

Table 4 analyzes the importance of each component in our self-training framework. The results are based on the teacher network trained with a contrastive loss. Training the student with positive and negative pairs sampled directly from the pseudo labels improves over the teacher only slightly, by 1.52 and 0.26 MAP@R on CUB-200 and Cars-196 respectively; the improvements are limited because the pseudo labels on unlabeled data are noisy. The performance is further improved by adding feature basis learning and sample mining, which shows that feature basis learning better regularizes the embedding space and that mining selects higher-confidence sample pairs. We boost the final performance to 33.59 and 36.24 MAP@R on CUB-200 and Cars-196 respectively.

Components                 MAP@R (CUB-200)   MAP@R (Cars-196)
Teacher (contrastive)      29.29             31.73
Student (pseudo label)     30.81             31.99
  + Basis                  32.45             35.78
  + Basis + Mining         33.59             36.24
Table 4: Ablation study of different components in our framework on CUB-200 and Cars-196. The teacher network is trained with a contrastive loss.

4.5.3 Pairwise Similarity Loss

We also investigate different design choices for the loss function (Equation 5) used in feature basis learning. One alternative is to assign a binary label to each constructed pair according to the pseudo labels (e.g., 1 for a pseudo-positive pair and 0 for a pseudo-negative pair) and compute a batch-wise cross-entropy loss between the pairwise similarities and these binary labels; we denote this option as Local-CE. Another alternative is to first update the global Gaussian means using Equation 6 and then apply a cross-entropy loss with binary labels on the global means; we denote this option as Global-CE. As Table 5 shows, the similarity distribution (SD) loss performs better than the cross-entropy variants, both locally and globally. One reason could be that the basis vectors are not forced to fit the pseudo labels, which can be noisy; another is that the SD loss optimizes both the means and the variances to reduce the overlap between the two distributions.

Regularization   MAP@R   RP      P@1
Local-CE         32.69   43.20   72.64
Global-CE        32.23   42.68   72.45
SD (Ours)        33.59   44.01   73.19
Table 5: Accuracy of our model in MAP@R, RP and P@1 versus different loss designs on CUB-200.

4.5.4 Number of Clusters

Table 6 analyzes the influence of the number of clusters K used on the unlabeled data. The results are based on the teacher network trained with a contrastive loss. The best performance is obtained with K = 400, which is not surprising, as the student network is trained on the NABirds dataset, which has 400 species. We can also see that our method is not sensitive to the number of clusters.

K     MAP@R   RP      P@1
100   31.83   42.25   72.19
200   32.61   43.02   72.75
300   32.81   43.18   72.21
400   33.59   44.01   73.19
500   33.26   43.69   73.26
Table 6: Influence of the number of clusters (K) on NABirds, which is used as the unlabeled data for CUB-200.

5 Conclusion

We presented a self-training framework for deep metric learning that improves retrieval performance by using unlabeled data. Self-supervised learning is used to initialize the teacher model. To deal with noisy pseudo labels, we introduced a new feature basis learning approach that learns basis functions to better model pairwise similarity. The learned basis vectors are used to select high-confidence sample pairs, which reduces the noise introduced by the teacher network and allows the student network to learn more effectively. Our results on standard retrieval benchmarks demonstrate that our method outperforms several state-of-the-art methods and significantly boosts the performance of fully-supervised approaches.

References

  • [1] I. B. Ayed (2020) A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In ECCV, Cited by: Table 1, §4.4.
  • [2] A. Brown, W. Xie, V. Kalogeiton, and A. Zisserman (2020) Smooth-AP: smoothing the path towards large-scale image retrieval. In ECCV, Cited by: §1.
  • [3] F. Cakir, K. He, X. Xia, B. Kulis, and S. Sclaroff (2019) Deep metric learning to rank. In CVPR, Cited by: Table 1.
  • [4] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, Cited by: §1, §1, §2, §3.1, §3, §4.2, §4.3, §4.5.1, Table 3.
  • [5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, Cited by: §1, §1, §2, §4.2.
  • [6] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020) Big self-supervised models are strong semi-supervised learners. In NeurIPS, Cited by: §1, §1, §2.
  • [7] Y. Chen, C. Chou, and Y. F. Wang (2020) Learning to learn in a semi-supervised fashion. arXiv preprint arXiv:2008.11203. Cited by: §2.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §3.1, Table 3.
  • [9] J.-B. Grill et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §2.
  • [10] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR, Cited by: §3.1, Table 1, §4.4, §4.4.
  • [11] X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y. Li, Y. Zhao, and L. S. Davis (2017) Automatic spatially-aware fashion concept discovery. In ICCV, Cited by: §4.1.
  • [12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §1, §1.
  • [13] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • [14] H. Jun, B. Ko, Y. Kim, I. Kim, and J. Kim (2020) Combination of multiple global descriptors for image retrieval. arXiv preprint arXiv:1903.10663. Cited by: §4.4.
  • [15] S. Kim, D. Kim, M. Cho, and S. Kwak (2020) Proxy anchor loss for deep metric learning. In CVPR, Cited by: §1, §1, §2, §2, Table 1, Figure 3, §4.2, §4.4, §4.4, Table 2.
  • [16] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In ICCV workshop, Cited by: §4.1, §4.
  • [17] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In CVPR, Cited by: §4.1, §4.2, §4.4, §4.
  • [18] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In ICCV, Cited by: §2, Table 1, §4.2.
  • [19] K. Musgrave, S. Belongie, and S. Lim (2020) A metric learning reality check. ECCV. Cited by: §1, Table 1, §4.2, §4.3, §4.4, §4.4, §4.4.
  • [20] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In CVPR, Cited by: §1, §2, footnote 1.
  • [21] Y. Ouali, C. Hudelot, and M. Tami (2020) Semi-supervised semantic segmentation with cross-consistency training. In CVPR, Cited by: §2.
  • [22] E. W. Teh, T. DeVries, and G. W. Taylor (2020) ProxyNCA++: revisiting and revitalizing proxy neighborhood component analysis. In ECCV, Cited by: §2, §2, Table 1, §4.4, Table 2.
  • [23] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In CVPR, Cited by: §4.1.
  • [24] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.1, §4.
  • [25] F. Wang, J. Cheng, W. Liu, and H. Liu (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters. Cited by: Table 1.
  • [26] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) CosFace: large margin cosine loss for deep face recognition. In CVPR, Cited by: Table 1.
  • [27] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In CVPR, Cited by: §1, §1, §2, Table 1, §4.4, §4.4, §4.4, Table 2.
  • [28] X. Wang, H. Zhang, W. Huang, and M. R. Scott (2020) Cross-batch memory for embedding learning. In CVPR, Cited by: Table 2.
  • [29] K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification.. JMLR. Cited by: §2, Table 1.
  • [30] Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves imagenet classification. In CVPR, Cited by: §1, §2, §2.
  • [31] G. Xu, Z. Liu, X. Li, and C. C. Loy (2020) Knowledge distillation meets self-supervision. In ECCV, Cited by: §2.
  • [32] L. Yang, P. Luo, C. Change Loy, and X. Tang (2015) A large-scale car dataset for fine-grained categorization and verification. In CVPR, Cited by: §4.1.
  • [33] H. Yu, W. Zheng, A. Wu, X. Guo, S. Gong, and J. Lai (2019) Unsupervised person re-identification by soft multilabel learning. In CVPR, Cited by: §2.
  • [34] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng (2020) Revisiting knowledge distillation via label smoothing regularization. In CVPR, Cited by: §2.
  • [35] A. Zhai and H. Wu (2019) Classification is a strong baseline for deep metric learning. In BMVC, Cited by: §2, Table 1, Table 2.
  • [36] X. Zhan, J. Xie, Z. Liu, Y. Ong, and C. C. Loy (2020) Online deep clustering for unsupervised representation learning. In CVPR, Cited by: §2.
  • [37] B. Zoph, G. Ghiasi, T. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le (2020) Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882. Cited by: §2.