Multi Proxy Anchor Loss and Effectiveness of Deep Metric Learning Performance Metrics

10/08/2021
by   Shozo Saeki, et al.

Deep metric learning (DML) learns a mapping into an embedding space in which similar data are near and dissimilar data are far. In this paper, we propose a new proxy-based loss and a new DML performance metric. This study makes two contributions: (1) we propose multi-proxies anchor (MPA) loss and show the effectiveness of the multi-proxies approach for proxy-based losses; (2) we establish the normalized discounted cumulative gain (nDCG@k) metric, which offers good stability and flexibility, as an effective DML performance metric. Finally, we demonstrate the effectiveness of MPA loss, which achieves new state-of-the-art performance on two fine-grained image datasets.



1 Introduction

The objective of deep metric learning (DML) is to learn a mapping into an embedding space in which similar data are near and dissimilar data are far. Here, similarity is measured with a certain metric, for example, the L1 or L2 distance. DML has been applied to few-shot learning tasks FewShot , face recognition tasks Triplet , and image retrieval tasks Cars ; CUB ; InShop . DML results on these tasks have achieved state-of-the-art accuracy MS ; XBM ; SoftTriple .

The factors that significantly influence the accuracy of DML approaches are the network backbone, the loss function, and batch sampling XBM ; MS ; SoftTriple ; Triplet ; ProxyNCA . Loss functions are roughly classified into two types: pair-based losses and proxy-based losses. Pair-based losses tend to be greatly influenced by batch sampling Triplet ; Npair ; Contrastive ; LiftedStructure ; MS . Conversely, proxy-based losses are less affected by batch sampling ProxyNCA ; SoftTriple ; ProxyAnchor . In addition, a cross-batch memory module (XBM) XBM and deep variational metric learning (DVML) DVML have been proposed to improve DML performance.

Among proxy-based losses, ProxyNCA loss and ProxyAnchor loss have only one proxy for each class ProxyNCA ; ProxyAnchor . Classes in practical datasets can have several local centers caused by intra-class variance, and one proxy cannot represent such structures. In contrast, SoftTriple loss has multiple centers for each class to capture manifold structures SoftTriple . ProxyNCA loss and SoftTriple loss have properties similar to the softmax function ProxyNCA ; SoftTriple , so we assume SoftTriple loss suffers from the same gradient issue as ProxyNCA loss ProxyAnchor . In this paper, we propose multi-proxies anchor (MPA) loss, which extends SoftTriple loss and ProxyAnchor loss. In addition, we suggest an evaluation metric based on normalized discounted cumulative gain (nDCG) for better evaluation.

An overview of the rest of the paper is as follows. In Section 2, we review related work in this area. Section 3 describes MPA loss and the nDCG metric for DML tasks. In Section 4, we present implementation details and a comparison with state-of-the-art losses. Finally, Section 5 concludes this work and discusses future work.

2 Related Works

In this section, we introduce loss functions for DML, which significantly affect DML performance. DML loss functions are classified into two types: pair-based losses and proxy-based losses. In the following, we describe each in turn.

2.1 Pair-based Losses

Pair-based losses compute similarities between data in a feature space, and the losses are then computed from these similarities Contrastive ; Triplet ; LiftedStructure ; Npair ; HTL ; MS . Triplet loss Triplet is computed from an anchor, positive data, and negative data, where positive data belong to the same class as the anchor and negative data do not. It encourages the similarity of the positive pair (anchor and positive data) to be larger than the similarity of the negative pair (anchor and negative data), as sketched below. On the other hand, lifted structured loss LiftedStructure , N-pair loss Npair , multi-similarity loss MS , and others are computed over all combinations in a mini-batch. These pair-based losses can be reformulated under the general pair weighting (GPW) framework MS , which uses a unified weighting formulation.
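
As a concrete reference, a minimal PyTorch sketch of the triplet loss described above follows; the margin value is an illustrative assumption rather than a setting used in this paper.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Encourage d(anchor, positive) + margin < d(anchor, negative);
    # the hinge becomes zero once the negative is pushed far enough away.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```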

However, DML learns by dividing training data into mini-batches, like a general neural network training framework, and dataset sizes have recently grown enormously, so computing the loss over all combinations of training data is infeasible. For this reason, learning DML with pair-based losses is greatly affected by mini-batch sampling. A good sampling strategy is crucial for good performance and fast convergence with pair-based losses Triplet , yet good sampling is very difficult, and learning results fluctuate easily. The cross-batch memory (XBM) XBM module preserves the embeddings of previous batches so that the network learns from the target batch together with the previous batches, under the assumption of a "slow drift" phenomenon. This module makes many more training pairs available at a small memory cost. However, it needs a suitable memory size; otherwise, DML accuracy drops XBM .
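
To make the XBM idea concrete, the following is a minimal sketch of a FIFO embedding memory under the "slow drift" assumption; the interface is our own illustration, not the implementation of XBM .

```python
import torch

class CrossBatchMemory:
    """Minimal sketch of an XBM-style FIFO embedding memory."""
    def __init__(self, size, dim):
        self.feats = torch.zeros(size, dim)
        self.labels = torch.zeros(size, dtype=torch.long)
        self.ptr, self.count, self.size = 0, 0, size

    def enqueue(self, feats, labels):
        # Store detached embeddings: the "slow drift" assumption says
        # embeddings from recent batches are still approximately valid.
        for f, l in zip(feats.detach(), labels):
            self.feats[self.ptr], self.labels[self.ptr] = f, l
            self.ptr = (self.ptr + 1) % self.size
            self.count = min(self.count + 1, self.size)

    def get(self):
        # Return all stored embeddings to form extra training pairs.
        return self.feats[:self.count], self.labels[:self.count]
```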

2.2 Proxy-based Losses

Proxy-based losses introduce proxies in addition to the training data and compute the loss against these proxies. This concept alleviates the fluctuation of learning results due to the sampling strategy SoftTriple . The proxy-based losses are ProxyNCA loss ProxyNCA , ProxyAnchor loss ProxyAnchor , and SoftTriple loss SoftTriple . Compared with pair-based losses, SoftTriple loss and ProxyAnchor loss achieve state-of-the-art accuracy SoftTriple ; ProxyAnchor even though they use a random sampling strategy.

The main differences between these losses are the number of proxies and the structure of the loss functions. ProxyNCA loss and ProxyAnchor loss are single-proxy losses, and SoftTriple loss is a multiple-proxies loss ProxyNCA ; SoftTriple ; ProxyAnchor . Furthermore, ProxyNCA loss has a gradient issue that affects backpropagation learning; ProxyAnchor loss improved on this issue ProxyAnchor .

3 Proposed Loss Function and Evaluation metric

In this section, we propose multi-proxies anchor loss and a new DML evaluation metric for accurate comparison. We first review and examine the nature of SoftTriple loss. We then propose multi-proxies anchor loss, which extends SoftTriple loss and ProxyAnchor loss. Finally, we propose a DML evaluation metric based on nDCG for more accurate comparison.

3.1 The Nature of SoftTriple Loss

Firstly, we introduce SoftTriple loss and the notation used in this paper.

Let $x_i$ denote the feature vector of the $i$-th data point, and let $y_i$ denote its corresponding label. Proxy-based DML losses compute the similarity between an instance $x_i$ and a class $c$. In SoftTriple loss, each class has multiple centers $w_c^k$, where $c$ is the class index and $k$ is the center index. Note that $w_c^k$ is L2-normalized just like the feature vector $x_i$; hence the L2 norm of $w_c^k$ equals one. The similarity between data $x_i$ and class $c$ in SoftTriple loss SoftTriple is defined as

$$S_{i,c} = \sum_k \frac{\exp\big(\frac{1}{\gamma}\, x_i^\top w_c^k\big)}{\sum_{k'} \exp\big(\frac{1}{\gamma}\, x_i^\top w_c^{k'}\big)}\; x_i^\top w_c^k \qquad (1)$$

where $\gamma$ is a hyperparameter. Proxy-based DML losses alleviate batch sampling effects by computing the similarity between instances and proxy class centers. In pair-based DML losses, on the other hand, the similarity is in many cases computed by the dot product or the Euclidean distance between instances Contrastive ; Triplet ; MS ; XBM . Compared with the similarity of proxy-based losses, the similarity of pair-based losses is heavily dependent on the combination of instances.
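
A minimal PyTorch sketch of the similarity (1) is given below; the variable layout and the value of $\gamma$ are illustrative assumptions.

```python
import torch

def soft_similarity(x, w, gamma=0.1):
    """Sketch of the multi-center similarity in Eq. (1).

    x: (N, D) L2-normalized embeddings.
    w: (C, K, D) L2-normalized centers (C classes, K centers per class).
    Returns the (N, C) similarity matrix S[i, c].
    """
    # Inner products between every embedding and every center: (N, C, K).
    sim = torch.einsum("nd,ckd->nck", x, w)
    # Softmax over the K centers of each class, scaled by 1/gamma.
    weights = torch.softmax(sim / gamma, dim=-1)
    # Softmax-weighted sum over centers gives the class similarity.
    return (weights * sim).sum(dim=-1)
```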

SoftTriple loss is the combination of a similarity loss and a regularization loss on the proxy centers SoftTriple . The similarity loss for data $x_i$ is defined as

$$\ell_{\mathrm{sim}}(x_i) = -\log \frac{\exp\big(\lambda (S_{i,y_i} - \delta)\big)}{\exp\big(\lambda (S_{i,y_i} - \delta)\big) + \sum_{c \neq y_i} \exp\big(\lambda S_{i,c}\big)} \qquad (2)$$

where $\delta$ denotes the margin and $\lambda$ is a scaling hyperparameter. The center regularization loss minimizes the L2 distance between the centers of each class and is defined as

$$R(w) = \frac{\sum_{c=1}^{C} \sum_{k=1}^{K} \sum_{k'=k+1}^{K} \sqrt{2 - 2\, w_c^{k\top} w_c^{k'}}}{C K (K-1)} \qquad (3)$$

where $C$ is the number of classes and $K$ is the number of centers per class. Finally, SoftTriple loss is computed as SoftTriple

$$L_{\mathrm{SoftTriple}} = \frac{1}{N} \sum_{i=1}^{N} \ell_{\mathrm{sim}}(x_i) + \tau R(w) \qquad (4)$$

where $\tau$ weights the regularization term.
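
Putting (1)-(4) together, SoftTriple loss can be sketched as follows, reusing the soft_similarity function above; all hyperparameter values are placeholders.

```python
import torch
import torch.nn.functional as F

def soft_triple_loss(x, labels, w, lam=20.0, gamma=0.1, delta=0.01, tau=0.2):
    """Sketch of Eqs. (2)-(4); hyperparameter values are placeholders."""
    s = soft_similarity(x, w, gamma)                    # Eq. (1), (N, C)
    # Eq. (2): subtract the margin delta from the true-class similarity,
    # then take the scaled softmax cross-entropy over classes.
    margin = delta * F.one_hot(labels, s.size(1)).float()
    sim_loss = F.cross_entropy(lam * (s - margin), labels)
    # Eq. (3): mean L2 distance over all pairs of same-class centers.
    C, K, _ = w.shape
    dists = torch.cdist(w, w)                           # (C, K, K)
    iu = torch.triu_indices(K, K, offset=1)
    reg = dists[:, iu[0], iu[1]].sum() / (C * K * (K - 1))
    return sim_loss + tau * reg                         # Eq. (4)
```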

SoftTriple loss might have the same learning issue as ProxyNCA loss ProxyAnchor . We therefore examine the gradient of SoftTriple loss to check its characteristics. The gradient of the similarity loss (2) with respect to the similarity $S_{i,c}$ is calculated as

$$\frac{\partial \ell_{\mathrm{sim}}(x_i)}{\partial S_{i,c}} = \begin{cases} \lambda\,(p_{i,y_i} - 1) & c = y_i \\ \lambda\, p_{i,c} & c \neq y_i \end{cases} \qquad (5)$$

where $p_{i,c}$ is the softmax probability that (2) assigns to class $c$. The gradient approaches zero once the positive term $\exp\big(\lambda(S_{i,y_i} - \delta)\big)$ is large enough relative to the negative terms. However, the positive similarity is not necessarily maximized, nor the negative similarities minimized, when the gradient equals zero.

3.2 Multi Proxies Anchor Loss

We propose multi-proxies anchor (MPA) loss, which has multiple proxies and a loss structure like multi-similarity loss MS . Hence, MPA loss is a proxy-based loss and can be considered an extension of SoftTriple loss and ProxyAnchor loss. Multiple proxies are valid representations for real-world datasets in which a class has several local centers, and the multi-similarity loss structure also mitigates the gradient issues of softmax-based losses SoftTriple ; ProxyAnchor . Figure 1 shows the differences in loss structure on the embedding space. SoftTriple loss, ProxyAnchor loss, and MPA loss are proxy-based losses, and their proxies reduce the effect of the batch sampling strategy compared with MultiSimilarity loss. ProxyAnchor loss and MPA loss have structures similar to MultiSimilarity loss and differ in the number of proxies; since MPA loss has several proxies per class, it is an extension of ProxyAnchor loss.

Figure 1: The differences between MultiSimilarity loss, SoftTriple loss, ProxyAnchor loss, and multi-proxies anchor loss. Symbol shape denotes class, and the lines denote similarities. Red symbols represent positive embeddings, blue symbols represent negative embeddings, and black symbols represent proxies. DML loss learns the embedding space so that similar data are near (red lines are short) and dissimilar data are far (blue lines are long). (a) MultiSimilarity loss MS is a pair-based DML loss, and its accuracy is influenced by the batch sampling strategy. (b) SoftTriple loss SoftTriple computes the similarity between data and multiple proxies, and the loss is computed from these similarities. (c) ProxyAnchor loss ProxyAnchor introduces a proxy into the MultiSimilarity loss structure and has a single proxy for each class. (d) MPA loss extends SoftTriple loss and ProxyAnchor loss.

MPA loss computes the similarity between data and proxies using the inner product, and the similarity between data and classes using the softmax in the same way as (1). Note that this similarity can be computed in several ways, for example, with a max-similarity or mean-similarity strategy. The MPA similarity loss is then computed like multi-similarity loss over the similarities between data and classes. Hence, the MPA similarity loss is defined as

$$\ell_{\mathrm{MPA}} = \frac{1}{|C^+|} \sum_{c \in C^+} \log\Big(1 + \sum_{x \in X_c^+} e^{-\alpha (S_{x,c} - \delta)}\Big) + \frac{1}{|C|} \sum_{c \in C} \log\Big(1 + \sum_{x \in X_c^-} e^{\alpha (S_{x,c} + \delta)}\Big) \qquad (6)$$

where $C$ denotes the set of all classes and $C^+$ denotes the set of positive classes in the batch. Besides, $X_c^+$ indicates the set of positive data in the batch against class $c$; on the other hand, $X_c^-$ indicates the set of negative data in the batch against class $c$. Note that $|X_c^+ \cup X_c^-|$ is equal to the batch size. Finally, MPA loss combines the MPA similarity loss (6) and the center regularization (3), and is defined as

$$L_{\mathrm{MPA}} = \ell_{\mathrm{MPA}} + \tau R(w) \qquad (7)$$
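
A sketch of (6) in PyTorch follows, assuming each batch element is positive for exactly its own class; the values of $\alpha$ and $\delta$ are placeholders.

```python
import torch

def mpa_similarity_loss(s, labels, alpha=32.0, delta=0.1):
    """Sketch of Eq. (6); alpha and delta values are placeholders.

    s: (N, C) similarities from Eq. (1) between batch data and classes.
    labels: (N,) class labels of the batch data.
    """
    N, C = s.shape
    pos_mask = torch.zeros_like(s, dtype=torch.bool)
    pos_mask[torch.arange(N), labels] = True            # x in X_c^+
    zeros = torch.zeros_like(s)
    pos_exp = torch.where(pos_mask, torch.exp(-alpha * (s - delta)), zeros)
    neg_exp = torch.where(~pos_mask, torch.exp(alpha * (s + delta)), zeros)
    # First term: average over the positive classes C^+ in the batch.
    pos_classes = pos_mask.any(dim=0)
    pos_term = torch.log1p(pos_exp.sum(dim=0))[pos_classes].mean()
    # Second term: average over all classes C.
    neg_term = torch.log1p(neg_exp.sum(dim=0)).mean()
    return pos_term + neg_term
```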

To compare the nature of the losses, we compare the gradient of MPA loss (7) with that of SoftTriple loss. The gradient of the MPA similarity loss (6) with respect to the similarity $S_{x,c}$ is computed as

$$\frac{\partial \ell_{\mathrm{MPA}}}{\partial S_{x,c}} = \begin{cases} \dfrac{1}{|C^+|} \dfrac{-\alpha\, h_c^+(x)}{1 + \sum_{x' \in X_c^+} h_c^+(x')} & x \in X_c^+ \\[2ex] \dfrac{1}{|C|} \dfrac{\alpha\, h_c^-(x)}{1 + \sum_{x' \in X_c^-} h_c^-(x')} & x \in X_c^- \end{cases} \qquad (8)$$

where $h_c^+(x) = e^{-\alpha (S_{x,c} - \delta)}$ and $h_c^-(x) = e^{\alpha (S_{x,c} + \delta)}$. The gradient of MPA loss reflects each term's share of the loss within the set of its class. Hence, the gradient of MPA loss is equal to zero only when that term's loss is much smaller than the other losses in the class. Compared with SoftTriple loss, the gradient of MPA loss is much less likely to become zero.

3.3 Effective Deep Metric Learning Performance Metric

Conventionally, Recall@k and normalized mutual information (NMI) are used as DML performance metrics LiftedStructure ; NMI ; MS ; SoftTriple ; ProxyAnchor . However, these metrics cannot fully evaluate DML performance, i.e., image retrieval performance RealityCheck . A newer performance metric, MAP@R, which combines mean average precision with R-precision, is more stable than Recall@k and NMI as a DML performance metric RealityCheck . When evaluating DML performance, the search length R of MAP@R is the number of positive data in the dataset; note that this search length differs for each class. In DML datasets, the number of positive data per class can be very small or very large. When it is small, MAP@R cannot sufficiently evaluate DML performance because of the short search length. Conversely, assuming each class has several local centers, MAP@R also cannot sufficiently evaluate DML performance when the number of positive data is large. Hence, DML performance evaluation needs to validate several search lengths. Note that MAP@k, where k is an arbitrary search length, may fail to evaluate DML when the search length is longer than the number of positive data of the class.
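
For reference, MAP@R RealityCheck for a single query with binary relevance can be sketched as below; it reproduces the MAP@R column of Table 1.

```python
def map_at_r(results, R):
    """MAP@R for one query: average precision over the first R results,
    where R is the number of positive data for the query."""
    hits, precision_sum = 0, 0.0
    for i, r in enumerate(results[:R]):
        if r:
            hits += 1
            precision_sum += hits / (i + 1)
    return precision_sum / R

# With four positives: [1,0,1,0,0,0,1,0,0,1] -> 0.417 (41.7 in Table 1).
```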

We propose the nDCG@k metric as the DML performance metric; nDCG is conventionally used as the evaluation metric for ranking functions in search engines NDCG1 ; NDCG2 . nDCG has better stability than the Recall@k metric and more flexibility than the MAP@R metric. nDCG is computed from the discounted cumulative gain (DCG) of the search results and the DCG of the best possible search results. This paper defines the DCG as

$$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{r_i}{\log_2 (i+1)} \qquad (9)$$

where $r_i$ denotes the rating of the $i$-th search result and $k$ denotes the search length. When evaluating DML performance, $r_i$ is a binary value: $r_i = 1$ indicates that the $i$-th search result is positive-class data for the query, while $r_i = 0$ indicates the opposite. nDCG@k is then defined as

$$\mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} \qquad (10)$$

where $\mathrm{IDCG@k}$ denotes the DCG of the best possible search results.

To compare the DML performance metrics, Table 1 shows examples in which the number of positive data against the query is four. According to Table 1, Recall@10, Precision@10, MAP@R, and MAP@10 are unable to sufficiently distinguish the search results, while nDCG@10 evaluates the performance adequately across all results.

Search results Recall@10 Precision@10 MAP@R MAP@10 nDCG@10
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 100 10 25.0 10.0 0.390
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1] 100 20 25.0 12.0 0.503
[1, 0, 1, 0, 0, 0, 0, 0, 0, 0] 100 20 41.7 16.7 0.586
[1, 0, 1, 0, 0, 0, 1, 0, 0, 1] 100 40 41.7 25.0 0.829
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0] 100 40 100.0 40.0 1.000
Table 1: Comparison of the DML performance metrics: Recall@10, Precision@10, MAP@R (where R is four), MAP@10, and nDCG@10. The search results are ordered from left to right. The last row shows the best search result.
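
The nDCG@10 column of Table 1 can be reproduced with the following sketch, which implements (9) and (10) directly.

```python
import math

def ndcg_at_k(results, num_positives, k=10):
    """nDCG@k of Eqs. (9)-(10) for binary ratings r_i."""
    # DCG of the given ranking (position i+1 is discounted by log2(i+2)).
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(results[:k]))
    # Ideal DCG: all positives ranked first.
    idcg = sum(1 / math.log2(i + 2) for i in range(min(num_positives, k)))
    return dcg / idcg

print(round(ndcg_at_k([1,0,0,0,0,0,0,0,0,0], 4), 3))  # 0.390
print(round(ndcg_at_k([1,0,1,0,0,0,1,0,0,1], 4), 3))  # 0.829
```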

4 Evaluation

In this section, we compare MPA loss with state-of-the-art losses to assess its effectiveness. We evaluate image retrieval performance on three benchmark datasets for image retrieval and fine-grained tasks CUB ; Cars ; LiftedStructure . Retrieval performance is measured by Recall@k and the proposed nDCG@k metric.

4.1 Datasets

We evaluate the proposed method on three widely used benchmark datasets, comprising fine-grained datasets and a large-scale few-shot image dataset: CUB-200-2011 CUB , Cars196 Cars , and the Stanford Online Products LiftedStructure dataset.

CUB-200-2011 CUB contains 11,788 bird images in 200 classes. We split the dataset into two halves with an even number of classes: 5,924 images of the first 100 classes for training and 5,864 images of the remaining 100 classes for testing. Cars196 Cars contains 16,185 car images in 196 classes. As with CUB-200-2011, we split the dataset so that the number of classes is even: the training and test data are 8,054 images in 98 classes and 8,131 images in the remaining 98 classes, respectively. Stanford Online Products (SOP) contains 120,053 images in 22,634 categories. We use 59,551 images in 11,318 categories for training and 60,502 images in the remaining 11,316 categories for testing.

4.2 Implementation Details

We use Inception Inception with batch normalization BN as the backbone network, with parameters pretrained on the ImageNet ILSVRC 2012 dataset ImageNet and fine-tuned on the target dataset finetune . The output vectors have 512 dimensions in our experiments. We apply cubic GeM GeM as a global descriptor to the outputs of the backbone network. For data preprocessing, training images are randomly resized and cropped and then randomly mirrored horizontally, while test images are resized and then cropped at the image center. The optimization method is the AdamW optimizer AdamW for all experiments, with dataset-specific initial learning rates for the backbone and for the proxy centers of SoftTriple loss. The training batch size is 180 for CUB-200-2011 and SOP and 64 for Cars196. The number of epochs is 60 for CUB-200-2011 and Cars196 and 100 for SOP. Batch sampling uses random sampling, the same as SoftTriple loss SoftTriple . Besides, we decay the learning rates every 10 epochs for CUB-200-2011 and Cars196 and every 20 epochs for SOP. The hyperparameters of SoftTriple loss in (7) and the number of centers are set per dataset: one setting for SOP and another for CUB-200-2011 and Cars196. For fairness, we train SoftTriple loss and ProxyAnchor loss with the same architecture and settings and compare them with MPA loss.
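
The preprocessing pipeline can be sketched with torchvision as follows; the exact resize and crop sizes are not stated above, so 256 and 224 are assumed placeholders.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random resize and crop
    transforms.RandomHorizontalFlip(),   # random horizontal mirroring
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize(256),              # resize the shorter side
    transforms.CenterCrop(224),          # crop at the image center
    transforms.ToTensor(),
])
```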

4.3 Comparison of state-of-the-art Losses

In this section, we compare MPA loss with state-of-the-art losses. Tables 2, 3, and 4 show comparisons of Recall@k and nDCG@k on CUB-200-2011, Cars196, and SOP, respectively.

On CUB-200-2011 and Cars196, MPA loss achieved the best performance among the compared DML losses. In particular, MPA loss achieved higher nDCG@2 and nDCG@4 than ProxyAnchor loss on Cars196. On the SOP dataset, however, SoftTriple loss performed best, and the accuracies of MPA loss and ProxyAnchor loss were not much different.

These results can be explained by the number of classes and the mean number of positive data per class in the datasets. The likely cause of SoftTriple loss outperforming the proposed method on SOP is that SOP has many more classes than CUB-200-2011 and Cars196. In this research, the batch sampling strategy is random sampling, and random sampling may struggle to sample effective batches for MPA loss and ProxyAnchor loss when the number of classes is very large. Hence, MPA loss and ProxyAnchor loss might be more susceptible to batch sampling than SoftTriple loss, and they may obtain better results with a balanced sampling strategy. Besides, the mean number of positive data per class is roughly 59, 83, and 5 in CUB-200-2011, Cars196, and SOP, respectively. This difference is considered to affect the results of MPA loss and ProxyAnchor loss: MPA loss is superior to ProxyAnchor loss when the mean number of positive data is large, as in CUB-200-2011 and Cars196, while MPA loss is mostly unchanged relative to ProxyAnchor loss when the mean is small, as in SOP. When the mean is large, each class could have several local centers, as MPA loss assumes; conversely, each class may have only one center when the mean is small.

Methods Network R@1 R@2 R@4 R@8 nDCG@2 nDCG@4 nDCG@8
Clustering Clustering - - -
ProxyNCA ProxyNCA - - -
HDC HDC - - -
Margin Margin - - -
HTL HTL - - -
MS MS - - -
SoftTriple SoftTriple - - -
ProxyAnchor ProxyAnchor - - -
SoftTriple (our)
ProxyAnchor (our)
MPA
Table 2: Comparison of Recall@k and nDCG@k on CUB-200-2011. The Network column denotes the type of backbone network: G, B, and R mean GoogleNet, InceptionBN, and ResNet50, respectively. Superscripts denote the size of the feature vectors.
Methods Network R@1 R@2 R@4 R@8 nDCG@2 nDCG@4 nDCG@8
Clustering Clustering - - -
ProxyNCA ProxyNCA - - -
HDC HDC - - -
Margin Margin - - -
HTL HTL - - -
MS MS - - -
SoftTriple SoftTriple - - -
Proxy-Anchor ProxyAnchor - - -
SoftTriple (our)
ProxyAnchor (our)
MPA
Table 3: Comparison of Recall@k and nDCG@k on Cars196.
Methods Network R@1 R@10 R@100 nDCG@10 nDCG@100
HDC HDC - -
HTL HTL - -
MS MS - -
SoftTriple SoftTriple - -
Proxy-Anchor ProxyAnchor - -
SoftTriple (our)
ProxyAnchor (our)
MPA
Table 4: Comparison of Recall@k and nDCG@k on SOP.

5 Conclusion

We have proposed the multi-proxies anchor (MPA) loss, which extends SoftTriple loss and ProxyAnchor loss. MPA loss responds flexibly to real-world datasets in which classes have several local centers, and it solves the issue of SoftTriple loss, namely its small gradients under gradient descent. Besides, MPA loss demonstrated better accuracy than ProxyAnchor loss. We have also proposed the normalized discounted cumulative gain (nDCG@k) metric as an effective DML performance metric. The nDCG@k metric demonstrated more flexibility and effectiveness for DML performance evaluation while keeping good stability, compared with conventional DML performance metrics such as Recall@k and MAP@R. Similarity is affected by several factors, such as the kind of object, the situation, and the background. Conventional DML approaches focus on the kind of object; our future work will study a new DML approach that considers several such factors.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number JP18K11528.

References

  • (1) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1735–1742, 2006.
  • (2) Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, 2015.
  • (3) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep Metric Learning via Lifted Structured Feature Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4004–4012, 2016.
  • (4) Kihyuk Sohn. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. Advances in Neural Information Processing Systems 29 (NIPS), pp. 1857–1865, 2016.
  • (5) Weifeng Ge, Weilin Huang, Dengke Dong, and Matthew R. Scott. Deep Metric Learning with Hierarchical Triplet Loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 272–288, 2018.
  • (6) Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R. Scott. Multi-Similarity Loss with General Pair Weighting for Deep Metric Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5022–5030, 2019.
  • (7) Xun Wang, Haozhi Zhang, Weilin Huang, and Matthew R. Scott. Cross-Batch Memory for Embedding Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • (8) Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, and Philipp Krähenbühl. Sampling Matters in Deep Embedding Learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2840–2848, 2017.
  • (9) Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, and Saurabh Singh. No Fuss Distance Metric Learning Using Proxies. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 360–368, 2017.
  • (10) Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. SoftTriple Loss: Deep Metric Learning Without Triplet Sampling. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6450–6458, 2019.
  • (11) Haque Ishfaq, Assaf Hoogi, and Daniel Rubin. TVAE: Triplet-based Variational Autoencoder Using Metric Learning. Workshop at the International Conference on Learning Representations (ICLR), 2018.
  • (12) Xudong Lin, Yueqi Duan, Qiyuan Dong, Jiwen Lu, and Jie Zhou. Deep Variational Metric Learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 714–729, 2018.
  • (13) Maciej Zieba and Lei Wang. Training Triplet Networks with GAN. Workshop at the International Conference on Learning Representations (ICLR), 2017.
  • (14) Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR), 2014.
  • (15) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.
  • (16) Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of Machine Learning Research, pp. 448–456, 2015.
  • (17) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pp. 211–252, 2015.
  • (18) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How Transferable Are Features in Deep Neural Networks? Advances in Neural Information Processing Systems 27 (NIPS), pp. 3320–3328, 2014.
  • (19) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  • (20) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
  • (21) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D Object Representations for Fine-Grained Categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
  • (22) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1096–1104, 2016.
  • (23) Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep Metric Learning via Facility Location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2206–2214, 2017.
  • (24) Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-Aware Deeply Cascaded Embedding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 814–823, 2017.
  • (25) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese Neural Networks for One-shot Image Recognition. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
  • (26) Diederik P. Kingma, Tim Salimans, and Max Welling. Variational Dropout and the Local Reparameterization Trick. Advances in Neural Information Processing Systems 28 (NIPS), pp. 2575–2583, 2015.
  • (27) Min Lin, Qiang Chen, and Shuicheng Yan. Network In Network. arXiv preprint arXiv:1312.4400, 2013.
  • (28) Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is Object Localization for Free? Weakly-supervised Learning with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • (29) Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN Image Retrieval with No Human Annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • (30) Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Proxy Anchor Loss for Deep Metric Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • (31) Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A Metric Learning Reality Check. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • (32) Kalervo Järvelin and Jaana Kekäläinen. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, pp. 422–446, 2002.
  • (33) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to Rank Using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 89–96, 2005.
  • (34) Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.