1 Introduction
The objective of deep metric learning (DML) is to learn a mapping into an embedding space in which similar data lie close together and dissimilar data lie far apart. Similarity is measured with a metric such as the L1 or L2 distance. DML has been applied to few-shot learning FewShot , face recognition Triplet , and image retrieval Cars ; CUB ; InShop , and has achieved state-of-the-art accuracy on these tasks MS ; XBM ; SoftTriple .
The factors that most significantly influence the accuracy of DML approaches are the network backbone, the loss function, and the batch sampling strategy XBM ; MS ; SoftTriple ; Triplet ; ProxyNCA . Loss functions are roughly classified into two types: pair-based losses and proxy-based losses. Pair-based losses tend to be strongly influenced by batch sampling Triplet ; Npair ; Contrastive ; LiftedStructure ; MS , whereas proxy-based losses are less affected by it ProxyNCA ; SoftTriple ; ProxyAnchor . In addition, the cross-batch memory module (XBM) XBM and deep variational metric learning (DVML) DVML have been proposed to improve the performance of DML.
In ProxyNCA loss, each class is represented by a single proxy ProxyNCA . However, classes in practical datasets can have several local centers caused by intra-class variance, and one proxy cannot represent such structures. In contrast, SoftTriple loss assigns multiple centers to each class to capture manifold structures SoftTriple . ProxyNCA loss and SoftTriple loss have similar properties to the softmax function ProxyNCA ; SoftTriple . We therefore assume that SoftTriple loss suffers from the same gradient issue as ProxyNCA loss ProxyAnchor .
In this paper, we propose the multi-proxies anchor (MPA) loss, which extends SoftTriple loss and ProxyAnchor loss. In addition, we propose an evaluation metric based on normalized discounted cumulative gain (nDCG) for more accurate evaluation.
The rest of the paper is organized as follows. Section 2 reviews related work in this area. Section 3 describes the MPA loss and the nDCG metric for DML tasks. Section 4 presents implementation details and a comparison with state-of-the-art losses. Finally, Section 5 concludes this work and discusses future work.
2 Related Works
In this section, we introduce loss functions for DML, which significantly affect its performance. DML loss functions are classified into two types: pair-based losses and proxy-based losses. We describe each type in turn.
2.1 Pair-based Losses
Pair-based losses compute similarities between data in the feature space and then compute the loss from these similarities Contrastive ; Triplet ; LiftedStructure ; Npair ; HTL ; MS . Triplet loss Triplet is computed from an anchor, a positive sample, and a negative sample: the positive belongs to the same class as the anchor, and the negative to a different class. The loss encourages the similarity of the anchor-positive pair to be larger than the similarity of the anchor-negative pair. In contrast, lifted structured loss LiftedStructure , N-pair loss Npair , multi-similarity loss MS , and others are computed over all combinations in a mini-batch. These pair-based losses can be reformulated in the general pair weighting (GPW) framework MS , which uses a unified weighting formulation.
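As a concrete illustration of the anchor/positive/negative structure described above, the following is a minimal sketch of the triplet hinge loss with squared Euclidean distance (the margin value is illustrative, not the paper's setting):

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pushes the anchor-negative distance
    beyond the anchor-positive distance by at least `margin`."""
    d_ap = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_an = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_ap - d_an + margin)
```

When the negative is already farther from the anchor than the positive by more than the margin, the loss (and hence the gradient) is zero, which is why the choice of triplets in the batch matters so much.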
However, DML is trained by dividing the training data into mini-batches, as in general neural network training, and dataset sizes have recently grown enormously. Computing the loss over all combinations of training data is therefore infeasible, so learning with pair-based losses is strongly affected by mini-batch sampling. A good sampling strategy is crucial for good performance and fast convergence with pair-based losses Triplet , but good sampling is difficult and the learning results fluctuate easily. The cross-batch memory (XBM) module XBM preserves the embeddings of previous batches so that the network can learn from both the current batch and previous batches, under the assumption of a "slow drift" phenomenon. This module makes many more training pairs available at a small memory cost; however, it requires a suitable memory size, otherwise the accuracy of DML drops XBM .
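The mechanism just described can be sketched as a FIFO queue of past embeddings; this is our simplified reading of an XBM-style module (class and method names are hypothetical), not the paper's implementation:

```python
from collections import deque

class CrossBatchMemory:
    """Sketch of an XBM-style memory: a bounded FIFO queue of past
    (embedding, label) pairs. Under the "slow drift" assumption,
    slightly stale embeddings remain useful as pair candidates."""

    def __init__(self, memory_size):
        # Oldest entries are evicted automatically once full.
        self.queue = deque(maxlen=memory_size)

    def enqueue(self, embeddings, labels):
        self.queue.extend(zip(embeddings, labels))

    def pairs_for(self, embedding, label):
        # Candidate positives/negatives drawn from past batches.
        pos = [e for e, y in self.queue if y == label]
        neg = [e for e, y in self.queue if y != label]
        return pos, neg
```

The memory size is the sensitive knob the text mentions: too small and few extra pairs are gained, too large and the oldest embeddings have drifted too far from the current network.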
2.2 Proxy-based Losses
Proxy-based losses introduce proxies in addition to the training data and compute the loss with respect to these proxies. This design alleviates the fluctuation of learning results caused by the sampling strategy SoftTriple . Representative proxy-based losses are ProxyNCA loss ProxyNCA , ProxyAnchor loss ProxyAnchor , and SoftTriple loss SoftTriple . Compared with pair-based losses, SoftTriple loss and ProxyAnchor loss achieve state-of-the-art accuracy SoftTriple ; ProxyAnchor , even with a random sampling strategy.
The main differences among these losses are the number of proxies and the structure of the loss function. ProxyNCA loss and ProxyAnchor loss use a single proxy per class, whereas SoftTriple loss uses multiple proxies ProxyNCA ; SoftTriple ; ProxyAnchor . Furthermore, ProxyNCA loss has a gradient issue that affects learning by backpropagation; ProxyAnchor loss improves on this issue ProxyAnchor .
3 Proposed Loss Function and Evaluation Metric
In this section, we propose the multi-proxies anchor loss and a new DML evaluation metric for more accurate comparison. We first review and examine the nature of SoftTriple loss. We then propose the multi-proxies anchor loss, which extends SoftTriple loss and ProxyAnchor loss. Finally, we propose a DML evaluation metric based on nDCG.
3.1 The Nature of SoftTriple Loss
First, we introduce SoftTriple loss and the notation used in this paper.
Let $\mathbf{x}_i$ denote the feature vector of the $i$-th data point and $y_i$ denote its corresponding label. Proxy-based DML losses compute the similarity between the instance $\mathbf{x}_i$ and a class $c$. In SoftTriple loss, each class has multiple centers $\mathbf{w}_c^k$, where $c$ is the class index and $k$ is the center index. Note that $\mathbf{w}_c^k$ is L2-normalized just like the feature vector $\mathbf{x}_i$; hence the L2 norm of $\mathbf{w}_c^k$ equals one. The similarity between data $\mathbf{x}_i$ and class $c$ in SoftTriple loss SoftTriple is defined as

$$ S'_{i,c} = \sum_{k} \frac{\exp\left(\frac{1}{\gamma}\, \mathbf{x}_i^\top \mathbf{w}_c^k\right)}{\sum_{k'} \exp\left(\frac{1}{\gamma}\, \mathbf{x}_i^\top \mathbf{w}_c^{k'}\right)}\, \mathbf{x}_i^\top \mathbf{w}_c^k \qquad (1) $$
where $\gamma$ is a hyperparameter. Proxy-based DML losses alleviate batch-sampling effects by computing the similarity between instances and proxy class centers. In pair-based DML losses, by contrast, the similarity is in many cases computed by the dot product or Euclidean distance between instances Contrastive ; Triplet ; MS ; XBM . Compared with the similarity of proxy-based losses, the similarity of pair-based losses therefore depends heavily on the combination of instances in the batch.
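The softmax-weighted similarity over multiple class centers can be computed directly; the following is a minimal sketch of that computation (function name and the value of the entropy scale `gamma` are ours, not the paper's):

```python
import math

def soft_similarity(x, centers, gamma=0.1):
    """Relaxed similarity between an embedding x and a class with
    multiple centers: a softmax over the per-center inner products
    weights those same inner products (cf. SoftTriple's similarity)."""
    dots = [sum(xi * wi for xi, wi in zip(x, w)) for w in centers]
    weights = [math.exp(d / gamma) for d in dots]
    z = sum(weights)
    return sum((wgt / z) * d for wgt, d in zip(weights, dots))
```

With a single center the softmax weight is 1 and the similarity reduces to the plain inner product; with several centers and a small `gamma`, the result approaches the similarity to the nearest center.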
SoftTriple loss is the combination of a similarity loss and a regularization loss on the proxy centers SoftTriple . The similarity loss for data $\mathbf{x}_i$ is defined as

$$ \ell_{\mathrm{sim}}(\mathbf{x}_i) = -\log \frac{\exp\left(\lambda (S'_{i,y_i} - \delta)\right)}{\exp\left(\lambda (S'_{i,y_i} - \delta)\right) + \sum_{j \neq y_i} \exp\left(\lambda S'_{i,j}\right)} \qquad (2) $$

where $\delta$ denotes the margin and $\lambda$ is a scaling hyperparameter. The center regularization loss minimizes the L2 distance between the centers of each class and is defined as

$$ R(\mathbf{w}) = \frac{\sum_{j=1}^{C} \sum_{t=1}^{K} \sum_{k=t+1}^{K} \sqrt{2 - 2\, \mathbf{w}_j^{t\top} \mathbf{w}_j^{k}}}{C\, K (K-1)} \qquad (3) $$

Finally, SoftTriple loss is computed as the weighted sum SoftTriple

$$ \ell_{\mathrm{SoftTriple}} = \frac{1}{N} \sum_{i} \ell_{\mathrm{sim}}(\mathbf{x}_i) + \tau R(\mathbf{w}) \qquad (4) $$

where $\tau$ weights the regularization term.
SoftTriple loss might have the same learning issue as ProxyNCA loss ProxyAnchor . To check this characteristic, we examine the gradient of SoftTriple loss with respect to the similarity. For a negative class $j \neq y_i$, the gradient of the similarity loss is

$$ \frac{\partial \ell_{\mathrm{sim}}(\mathbf{x}_i)}{\partial S'_{i,j}} = \lambda\, \frac{\exp\left(\lambda S'_{i,j}\right)}{\exp\left(\lambda (S'_{i,y_i} - \delta)\right) + \sum_{j' \neq y_i} \exp\left(\lambda S'_{i,j'}\right)} $$

The gradient becomes almost zero once the positive-class term of the softmax is large enough relative to the negative terms. However, a gradient of zero does not necessarily mean that the positive similarity has been maximized and the negative similarities minimized.
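The vanishing behavior can be demonstrated numerically. The sketch below evaluates a softmax-style gradient on one negative similarity; the scale `lam` and margin `delta` are illustrative values of ours, not the paper's settings:

```python
import math

def neg_grad(s_pos, s_negs, lam=20.0, delta=0.01):
    """Gradient of a softmax-style similarity loss w.r.t. the first
    negative similarity: lam times that negative's softmax weight."""
    z = math.exp(lam * (s_pos - delta)) + sum(math.exp(lam * s) for s in s_negs)
    return lam * math.exp(lam * s_negs[0]) / z

# When positive and negative similarities are close, the negatives
# still receive a large gradient; once the positive similarity is
# modestly ahead, the gradient on the negatives collapses toward
# zero even though the negatives are far from their minimum of -1.
g_active = neg_grad(0.5, [0.45, 0.45])
g_vanished = neg_grad(0.9, [0.3, 0.3])
```

This is exactly the failure mode discussed above: training can stall with negatives that are "good enough" for the softmax but not actually minimized.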
3.2 Multi Proxies Anchor Loss
We propose the multi-proxies anchor (MPA) loss, which uses multiple proxies together with a loss structure like that of multi-similarity loss MS . MPA loss is therefore a proxy-based loss and can be regarded as an extension of both SoftTriple loss and ProxyAnchor loss. Multiple proxies are valid representations for real-world datasets in which a class may have several local centers, and the multi-similarity loss structure mitigates the gradient issue of softmax-based losses SoftTriple ; ProxyAnchor . Figure 1 shows the differences in loss structure in the embedding space. SoftTriple loss, ProxyAnchor loss, and MPA loss are proxy-based losses, and their proxies reduce the effect of the batch sampling strategy compared with multi-similarity loss. ProxyAnchor loss and MPA loss have structures similar to multi-similarity loss and differ in the number of proxies: because MPA loss has several proxies per class, it is an extension of ProxyAnchor loss.
MPA loss computes the similarity between data and proxies by the inner product, and the similarity between data and classes by the softmax weighting, in the same way as (1). Note that the class similarity can be computed in several ways, for example by a max-similarity or mean-similarity strategy. The MPA similarity loss is then computed from the data-class similarities in the manner of multi-similarity loss. Hence, the MPA similarity loss is defined as

$$ \ell_{\mathrm{MPA}} = \frac{1}{|\mathcal{P}|} \sum_{c \in \mathcal{P}} \log\Big(1 + \sum_{\mathbf{x}_i \in X_c^{+}} \exp\big({-\alpha}(S'_{i,c} - \delta)\big)\Big) + \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \log\Big(1 + \sum_{\mathbf{x}_i \in X_c^{-}} \exp\big(\alpha (S'_{i,c} + \delta)\big)\Big) $$

where $\mathcal{C}$ denotes the set of all classes, and $\mathcal{P}$ denotes the set of positive classes in the batch. Moreover, $X_c^{+}$ indicates the set of positive data in the batch for class $c$, while $X_c^{-}$ indicates the set of negative data in the batch for class $c$; note that $|X_c^{+}| + |X_c^{-}|$ equals the batch size. Finally, MPA loss combines the MPA similarity loss above with the center regularization (3):

$$ \ell = \ell_{\mathrm{MPA}} + \tau R(\mathbf{w}) $$

where $\alpha$ denotes the scaling factor and $\delta$ denotes the margin. The gradient of MPA loss for each sample is proportional to that sample's share of the loss within its class set. Hence, the gradient approaches zero only when a sample's loss is much smaller than the other losses in its class. Compared with SoftTriple loss, the gradient of MPA loss is therefore far less likely to vanish.
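The structure described in this subsection can be sketched as follows. This is only an illustrative reading under stated assumptions, not the paper's implementation: class similarities are pooled over proxies with a softmax weighting, and the positive/negative terms use ProxyAnchor-style log-sum-exp pooling with a hypothetical scale `alpha` and margin `delta`:

```python
import math

def class_similarity(x, centers, gamma=0.1):
    # Softmax-weighted similarity between embedding x and a class's proxies.
    dots = [sum(a * b for a, b in zip(x, w)) for w in centers]
    z = sum(math.exp(d / gamma) for d in dots)
    return sum((math.exp(d / gamma) / z) * d for d in dots)

def mpa_similarity_loss(batch, labels, proxies, alpha=32.0, delta=0.1):
    """Hypothetical sketch of a multi-proxies anchor-style loss.
    `proxies` maps class label -> list of proxy vectors."""
    loss = 0.0
    positive_classes = set(labels)
    for c, centers in proxies.items():
        sims = [class_similarity(x, centers) for x in batch]
        pos = [s for s, y in zip(sims, labels) if y == c]
        neg = [s for s, y in zip(sims, labels) if y != c]
        if pos:  # pull positives toward the class's proxies
            loss += math.log(1 + sum(math.exp(-alpha * (s - delta)) for s in pos)) / len(positive_classes)
        if neg:  # push negatives away from them
            loss += math.log(1 + sum(math.exp(alpha * (s + delta)) for s in neg)) / len(proxies)
    return loss
```

Because every positive and negative sample in the batch contributes a term inside a log-sum-exp, each sample's gradient is weighted by its share of the pooled loss, which matches the non-vanishing-gradient argument above.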
3.3 Effective Deep Metric Learning Performance Metric
Conventionally, Recall@k and normalized mutual information (NMI) are used as DML performance metrics LiftedStructure ; NMI ; MS ; SoftTriple ; ProxyAnchor . However, these metrics cannot fully evaluate DML performance, which is essentially image retrieval performance RealityCheck . MAP@R, which combines mean average precision and R-precision, has been proposed as a new performance metric and is more stable than Recall@k and NMI RealityCheck . In MAP@R, the search length R is the number of positive data for the query class; note that the search length therefore differs between classes. In DML datasets, the number of positive data per class can be very small or very large. When it is small, MAP@R cannot adequately evaluate DML performance because the search length is too short. Moreover, assuming each class has several local centers, MAP@R is also insufficient when the number of positive data is large. Hence, DML performance evaluation needs to validate several search lengths. Note that MAP@k, for an arbitrary search length k, may fail to evaluate DML when k exceeds the number of positive data in the class.
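For reference, MAP@R on a binary relevance list can be computed as follows (a minimal sketch; the function name is ours):

```python
def map_at_r(results, r):
    """MAP@R: average precision over the first R results, where
    R is the number of positives for the query class.
    `results` is a binary relevance list, 1 = same class as query."""
    hits, ap = 0, 0.0
    for i, rel in enumerate(results[:r], start=1):
        if rel:
            hits += 1
            ap += hits / i  # precision at this hit
    return ap / r
```

On the result list [1, 0, 1, 0, 0, 0, 1, 0, 0, 1] with R = 4, only the hits at ranks 1 and 3 are counted, illustrating how the short search length ignores later hits.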
We propose the nDCG@k metric as a DML performance metric; nDCG is conventionally used to evaluate ranking functions in search engines NDCG1 ; NDCG2 . nDCG has better stability than Recall@k and more flexibility than MAP@R. It is computed from the discounted cumulative gain (DCG) of the search results and the DCG of the best possible search results. This paper defines the DCG as

$$ \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2 (i+1)} $$

where $rel_i$ denotes the rating of the $i$-th search result and $k$ denotes the search length. When evaluating DML performance, $rel_i$ is a binary value: $rel_i = 1$ when the $i$-th search result belongs to the positive class of the query, and $rel_i = 0$ otherwise. nDCG@k is then defined as

$$ \mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k} $$

where $\mathrm{IDCG@}k$ denotes the DCG of the best possible search results.
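The two quantities can be computed directly; a minimal sketch (function names are ours):

```python
import math

def dcg_at_k(results, k):
    """Discounted cumulative gain over the first k results."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(results[:k], start=1))

def ndcg_at_k(results, k, n_pos):
    """nDCG@k: DCG normalized by the ideal DCG, where the ideal
    ranking places all n_pos positives first."""
    ideal = [1] * min(n_pos, k)
    return dcg_at_k(results, k) / dcg_at_k(ideal, k)
```

With four positives, a single hit at rank 1 yields nDCG@10 of about 0.390, and adding a second hit at rank 10 raises it to about 0.503, matching the values reported in Table 1.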
For comparison of the DML performance metrics, Table 1 shows examples in which the number of positive data is four for the query. According to Table 1, Recall@10, Precision@10, MAP@R, and MAP@10 cannot sufficiently distinguish the search results, while nDCG@10 evaluates the performance properly in all cases.

| Search results (top 10) | Recall@10 (%) | Precision@10 (%) | MAP@R (%) | MAP@10 (%) | nDCG@10 |
| [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] | 100 | 10 | 25.0 | 10.0 | 0.390 |
| [1, 0, 0, 0, 0, 0, 0, 0, 0, 1] | 100 | 20 | 25.0 | 12.0 | 0.503 |
| [1, 0, 1, 0, 0, 0, 0, 0, 0, 0] | 100 | 20 | 41.7 | 16.7 | 0.586 |
| [1, 0, 1, 0, 0, 0, 1, 0, 0, 1] | 100 | 40 | 41.7 | 25.0 | 0.829 |
| [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] | 100 | 40 | 100.0 | 40.0 | 1.000 |
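The key contrast in the table can be reproduced in a few lines. Rows 3 and 4 share the same first four results, so MAP@R with R = 4 scores them identically (41.7), while nDCG@10 separates them (0.586 vs 0.829); this is a self-contained sketch with our own function names:

```python
import math

def dcg(rels):
    return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))

def ndcg_at_k(results, k, n_pos):
    return dcg(results[:k]) / dcg([1] * min(n_pos, k))

def map_at_r(results, r):
    hits, ap = 0, 0.0
    for i, rel in enumerate(results[:r], start=1):
        if rel:
            hits += 1
            ap += hits / i
    return ap / r

# Rows 3 and 4 of Table 1 (four positives total for the query).
row3 = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
row4 = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
```

Because the later hits in row 4 fall beyond R = 4, MAP@R cannot reward them; the logarithmic discount of nDCG@10 still does.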
4 Experiments
In this section, we compare MPA loss with state-of-the-art losses in terms of effectiveness. We evaluate image retrieval performance on three benchmark datasets for image retrieval and fine-grained tasks CUB ; Cars ; LiftedStructure . Retrieval performance is measured by Recall@k and by the proposed nDCG@k metric.
4.1 Datasets
We evaluate the proposed method on three widely used benchmark datasets: the fine-grained datasets CUB-200-2011 CUB and Cars196 Cars , and the large-scale Stanford Online Products dataset LiftedStructure .
CUB-200-2011 CUB contains 11,788 bird images in 200 classes. We split the dataset into two halves so that the number of classes is even, using 5,924 images of 100 classes for training and 5,864 images of the other 100 classes for testing. Cars196 Cars contains 16,185 car images of 196 classes. As with CUB-200-2011, we split the dataset so that the number of classes is even: 8,054 images of 98 classes for training and 8,131 images of the remaining 98 classes for testing. Stanford Online Products (SOP) LiftedStructure contains 120,053 images in 22,634 categories. We use 59,551 images in 11,318 categories for training and 60,502 images in the remaining 11,316 categories for testing.
4.2 Implementation Details
We use Inception Inception with batch normalization BN as the backbone network, whose parameters are pretrained on the ImageNet ILSVRC 2012 dataset ImageNet and then fine-tuned on the target dataset finetune . The output vectors have 512 dimensions in our experiments. We apply cubic GeM GeM as a global descriptor to the outputs of the backbone network. For data preprocessing, training images are randomly resized and cropped and then randomly mirrored horizontally, while test images are resized and center-cropped. The optimizer is AdamW AdamW in all experiments. The initial learning rate for the backbone and the center learning rate for SoftTriple loss are set per dataset, with one setting for CUB-200-2011 and Cars196 and another for SOP.
The training batch size is 180 for CUB-200-2011 and SOP, and 64 for Cars196. The number of epochs is 60 for CUB-200-2011 and Cars196, and 100 for SOP. Batch sampling uses random sampling, as in SoftTriple loss SoftTriple . We decay the learning rates every 10 epochs for CUB-200-2011 and Cars196, and every 20 epochs for SOP. The hyperparameters of SoftTriple loss in (7) are set per dataset, and the number of centers is set separately for SOP and for CUB-200-2011 and Cars196. For fairness, we train SoftTriple loss and ProxyAnchor loss with the same architecture and settings and compare them with MPA loss.
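The preprocessing described above can be written as a standard torchvision pipeline. The crop sizes below are our assumption (the common 256 to 224 setting), since the exact values are not stated here:

```python
from torchvision import transforms

# Assumed sizes: standard 256 -> 224 setting; the paper's exact
# crop sizes are not specified in this text.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random resize + crop
    transforms.RandomHorizontalFlip(),   # random horizontal mirror
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize(256),              # resize shorter side
    transforms.CenterCrop(224),          # crop at the image center
    transforms.ToTensor(),
])
```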
4.3 Comparison with State-of-the-Art Losses
On CUB-200-2011 and Cars196, MPA loss achieved the best performance among the compared DML losses; in particular, it achieved higher nDCG@2 and nDCG@4 than ProxyAnchor loss on Cars196. On the SOP dataset, however, SoftTriple loss achieved the best performance, while the accuracies of MPA loss and ProxyAnchor loss were not much different.
These results suggest the influence of the number of classes and of the mean number of positive data per class in the datasets. A likely cause of SoftTriple loss outperforming the proposed method on SOP is that SOP has far more classes than CUB-200-2011 and Cars196. In this work the batch sampling strategy is random sampling, and with many classes random sampling may struggle to form effective batches for MPA loss and ProxyAnchor loss. Hence, MPA loss and ProxyAnchor loss might be more susceptible to batch sampling than SoftTriple loss, and might obtain better results with a balanced sampling strategy. Moreover, the mean number of positive data per class differs among CUB-200-2011, Cars196, and SOP, and this difference appears to affect the results of MPA loss and ProxyAnchor loss. MPA loss is superior to ProxyAnchor loss when the mean is large, as in CUB-200-2011 and Cars196, while the two are nearly identical when the mean is small, as in SOP. When the mean is large, each class can have several local centers, as MPA loss assumes; conversely, when the mean is small, each class may have only a single center.
5 Conclusion
We have proposed the multi-proxies anchor (MPA) loss, which extends SoftTriple loss and ProxyAnchor loss. MPA loss responds flexibly to real-world datasets whose classes have several local centers, and it resolves the small-gradient issue of SoftTriple loss in gradient descent. MPA loss has demonstrated better accuracy than ProxyAnchor loss. We have also proposed the normalized discounted cumulative gain (nDCG@k) metric as an effective DML performance metric; it demonstrated more flexibility and effectiveness in DML evaluation than conventional metrics such as Recall@k and MAP@R while keeping good stability. Similarity is affected by several factors, such as the kind of object, the situation, and the background. Conventional DML approaches focus on the kind of object; in future work, we will study a DML approach that considers several such factors.
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number JP18K11528.
References
- (1) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1735–1742, 2006.
- (2) Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, 2015.
- (3) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep Metric Learning via Lifted Structured Feature Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4004–4012, 2016.
- (4) Kihyuk Sohn. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. Advances in Neural Information Processing Systems 29 (NIPS), pp. 1857–1865, 2016.
- (5) Weifeng Ge, Weilin Huang, Dengke Dong, and Matthew R. Scott. Deep Metric Learning with Hierarchical Triplet Loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 272–288, 2018.
- (6) Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R. Scott. Multi-Similarity Loss with General Pair Weighting for Deep Metric Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5022–5030, 2019.
- (7) Xun Wang, Haozhi Zhang, Weilin Huang, and Matthew R. Scott. Cross-Batch Memory for Embedding Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- (8) Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, and Philipp Krähenbühl. Sampling Matters in Deep Embedding Learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2840–2848, 2017.
- (9) Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, and Saurabh Singh. No Fuss Distance Metric Learning using Proxies. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 360–368, 2017.
- (10) Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. SoftTriple Loss: Deep Metric Learning Without Triplet Sampling. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6450–6458, 2019.
- (11) Haque Ishfaq, Assaf Hoogi, and Daniel Rubin. TVAE: Triplet-based Variational Autoencoder using Metric Learning. Workshop of the International Conference on Learning Representations, 2018.
- (12) Xudong Lin, Yueqi Duan, Qiyuan Dong, Jiwen Lu, and Jie Zhou. Deep Variational Metric Learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 714–729, 2018.
- (13) Maciej Zieba and Lei Wang. Training Triplet Networks with GAN. Workshop of the International Conference on Learning Representations, 2017.
- (14) Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR), 2014.
- (15) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2015.
- (16) Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of Machine Learning Research, pp. 448–456, 2015.
- (17) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pp. 211–252, 2015.
- (18) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How Transferable Are Features in Deep Neural Networks? Advances in Neural Information Processing Systems 27 (NIPS), pp. 3320–3328, 2014.
- (19) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
- (20) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
- (21) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D Object Representations for Fine-Grained Categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
- (22) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1096–1104, 2016.
- (23) Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep Metric Learning via Facility Location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2206–2214, 2017.
- (24) Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-Aware Deeply Cascaded Embedding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 814–823, 2017.
- (25) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese Neural Networks for One-shot Image Recognition. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
- (26) Diederik P. Kingma, Tim Salimans, and Max Welling. Variational Dropout and the Local Reparameterization Trick. Advances in Neural Information Processing Systems 28 (NIPS), pp. 2575–2583, 2015.
- (27) Min Lin, Qiang Chen, and Shuicheng Yan. Network In Network. arXiv preprint arXiv:1312.4400, 2013.
- (28) Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is Object Localization for Free? Weakly-supervised Learning with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- (29) Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN Image Retrieval with No Human Annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- (30) Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Proxy Anchor Loss for Deep Metric Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- (31) Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A Metric Learning Reality Check. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
- (32) Kalervo Järvelin and Jaana Kekäläinen. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, pp. 422–446, 2002.
- (33) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to Rank using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 89–96, 2005.
- (34) Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.