Calibrated neighborhood aware confidence measure for deep metric learning

06/08/2020 ∙ by Maryna Karpusha, et al. ∙ Amazon

Deep metric learning has achieved promising improvements in recent years following the success of deep learning. It has been successfully applied to problems in few-shot learning, image retrieval, and open-set classification. However, measuring the confidence of a deep metric learning model and identifying unreliable predictions is still an open challenge. This paper focuses on defining a calibrated and interpretable confidence metric that closely reflects classification accuracy. While performing similarity comparison directly in the latent space using the learned distance metric, our approach approximates the distribution of data points for each class using a Gaussian kernel smoothing function. The post-processing calibration algorithm with the proposed confidence metric on a held-out validation dataset improves the generalization and robustness of state-of-the-art deep metric learning models while providing an interpretable estimate of the confidence. Extensive tests on four popular benchmark datasets (Caltech-UCSD Birds, Stanford Online Product, Stanford Car-196, and In-shop Clothes Retrieval) show consistent improvements even in the presence of distribution shifts in test data related to additional noise or adversarial examples.


1 Background

1.1 Deep Distance Metric Learning

Suppose supervised training is performed on $N$ given independent and identically distributed (i.i.d.) instances $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^D$ and $y_i \in \mathcal{C}$, drawn from the unknown joint distribution of $X$ and $Y$, where $D$ is the dimension of the input space and $\mathcal{C}$ is the set of the labels in the training set. The goal of DDML training is to learn a mapping function $f: \mathbb{R}^D \to \mathbb{R}^d$ such that distances between sample pairs belonging to the same class in the embedding space are smaller than those between sample pairs belonging to different classes. To quantify the distance we define a function $\mathrm{dist}: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ representing the distance in the embedding space, where $d$ is the dimension of the embedding space. Then the mapping with the deep neural network model $f$ should satisfy the following relation:

(1)  $\mathrm{dist}(f(x_i), f(x_j)) < \mathrm{dist}(f(x_i), f(x_k))$

for all $i, j, k$ such that $y_i = y_j$ and $y_i \neq y_k$. The Euclidean distance or cosine distance is typically used for $\mathrm{dist}(\cdot, \cdot)$ in DDML models.

The performance of this mapping is measured by some loss function $L$. Early approaches were based on the contrastive loss Chopra et al. (2005) and the triplet loss Hoffer and Ailon (2015), where the loss function is defined on pairs or triplets of samples in order to minimize intra-class distances and maximize inter-class distances. Since computing the loss on every possible pair or triplet is intractable, recent approaches propose either better sampling strategies Wu et al. (2017); Schroff et al. (2015); Ge et al. (2018) or loss functions that consider the relationship of all samples within the training batch. Parametric models have also been proposed with the idea of storing some information about the global context Movshovitz-Attias et al. (2017); Zhai and Wu (2019); Wen et al. (2016). Several model ensembles have also been explored, focusing on improving classification and retrieval performance using boosting Opitz et al. (2018) or attending diverse spatial locations Kim et al. (2018).
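For illustration, a minimal PyTorch sketch of the triplet loss on batches of precomputed embeddings (the margin value and function name are ours, not from the paper):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: the anchor-positive distance should be smaller than
    the anchor-negative distance by at least a margin (illustrative value)."""
    d_pos = F.pairwise_distance(anchor, positive)  # same-class distances
    d_neg = F.pairwise_distance(anchor, negative)  # cross-class distances
    return F.relu(d_pos - d_neg + margin).mean()
```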

1.2 Confidence in Deep Learning

In optimization theory, a model is defined as robust if it can perform well under a certain level of uncertainty Ben-Tal et al. (2009). However, it has been shown that neural networks are susceptible to small intentional or unintentional shifts in the data distribution Hendrycks et al. (2019); Goodfellow et al. (2015). In real-world decision-making systems, it is important to indicate whether or not a prediction is reliable. To achieve this, given a query point $x$, we associate a confidence estimate $\hat{c}_k(x)$ with every class prediction $k \in \mathcal{C}$, where $\hat{c}_k(x) \in [0, 1]$ and $\sum_{k \in \mathcal{C}} \hat{c}_k(x) = 1$. The confidence estimate is said to be calibrated if it represents the true probability of correctness Guo et al. (2017).

As deep neural networks have been shown to produce overconfident predictions, multiple post-processing steps have been proposed to obtain calibrated confidence measures.

Calibration on the held-out validation data.

Different parametric and non-parametric approaches, where the logits are used as features to learn a calibration model from held-out validation data, have been proposed Guo et al. (2017); Naeini et al. (2015); Zadrozny and Elkan (2001, 2002). A simple single-parameter variant of Platt scaling Platt (1999), known as temperature scaling, is often an effective method for obtaining calibrated probabilities Guo et al. (2017); Park et al. (2020). The key insight is that in the temperature scaling approach, only a single parameter is fit to the validation data. Therefore, unlike fitting the original neural network, the temperature scaling algorithm comes with generalization guarantees based on traditional statistical learning theory.
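As a concrete illustration, a minimal NumPy/SciPy sketch of temperature scaling (the function names are ours; it assumes validation logits and integer labels are precomputed):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def temperature_nll(T, logits, labels):
    """Negative log-likelihood of softmax(logits / T) on held-out data."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Fit the single temperature parameter on a held-out validation set."""
    res = minimize_scalar(temperature_nll, bounds=(0.05, 20.0),
                          args=(logits, labels), method="bounded")
    return res.x
```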

Bayesian approximation. Bayesian deep neural networks evaluate distributions over the models or their parameters. This approach provides a more accurate and tractable approximation of the uncertainty, but at the cost of expensive computation during training and inference Gal (2016). Some recent alternatives have been proposed to approximate predictive uncertainty. For example, Gal and Ghahramani propose considering dropout as a way of ensembling, which approximates Bayesian inference in deep Gaussian processes Gal and Ghahramani (2016). Unfortunately, empirical results show that one needs to run inference at least 100 times to achieve an accurate approximation, which can be infeasible in practice.

Support set based uncertainty estimation. Papernot et al. introduce a trust score that measures the conformance between the classifier and the $k$-nearest neighbors on the support examples Frosst et al. (2019). It enhances robustness to adversarial attacks and leads to better calibrated uncertainty estimates Hu et al. (2018). Jiang et al. develop this idea further and compute the trust score on deeper layers of the DNN rather than on the input to avoid the high dimensionality of the inputs Jiang et al. (2018).

The proposed NED algorithm, described in Section 2, can be considered calibration on the support set. The algorithm is designed specifically for DDML models.

2 NED Algorithm

Our goal is to find a prediction $\hat{y}$ and a confidence estimate $\hat{c}$ for a query point $x$ from the test data, given a support set $S = \{(x_i, y_i)\}_{i=1}^{N}$ and a learned mapping function $f$. The classes of the train and support sets can be disjoint if the samples share the same learned similarity concept. $\mathcal{C}_s$ is the set of labels in the support and test sets.

Since metric learning enables a similarity comparison directly in the embedding space using the distance metric, the simplest way to obtain a prediction is to compare $f(x)$ to the embeddings of the samples from the support set and pick the class of the nearest example. Intuitively, we might then assume that the magnitude of this distance can represent the confidence in the prediction. However, such a distance is an unbounded positive value, and we need to calibrate it to obtain an interpretable probability estimate of the true correctness likelihood. Furthermore, outliers in the support set can easily lead to misclassification since multiple neighbors are not considered.

To overcome this difficulty, we propose the NED confidence score. To show that it very closely approximates the true correctness likelihood, we first propose the following theorem. The proof can be found in Appendix A.

Theorem 1.

Given an embedding $f(x)$ and a support set $S = \{(x_i, y_i)\}_{i=1}^{N}$ where $y_i \in \mathcal{C}_s$, the probability of $x$ belonging to class $k$, i.e., $P(y = k \mid f(x))$, can be approximated by:

(2)  $\hat{c}_k(x) = \dfrac{\sum_{i: y_i = k} e^{-d_i^2 / \sigma^2}}{\sum_{i=1}^{N} e^{-d_i^2 / \sigma^2}}$

where $d_i = \lVert f(x) - f(x_i) \rVert$ for $i = 1, \dots, N$ and $\sigma > 0$ is a parameter to be tuned.

When $\sigma = 1$, the equation in (2) is a simple softmax function of the negatives of the squares of the Euclidean distances. However, similar to bandwidth selection in kernel density estimation Bashtannyk and Hyndman (2001), we can control the degree of smoothing applied to the samples using $\sigma$. For example, as $\sigma$ becomes larger, the relative influence of near samples becomes smaller. On the other hand, as $\sigma$ becomes smaller, the relative influence of near samples becomes larger. Therefore, optimizing $\sigma$ is equivalent to finding the value that provides the best relative distances in terms of the probability estimate of the true correctness likelihood.

We optimize $\sigma$ with respect to the negative log-likelihood on the support set instead of optimizing on a held-out validation dataset. Because the parameter $\sigma$ does not change the relative magnitudes of the softmax arguments, the spatial order of the nearest neighbors to the query point in the embedding space is preserved.
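A minimal sketch of this tuning step (our reading of the procedure: a leave-one-out negative log-likelihood on the support set, minimized by grid search; all names are ours, and `emb` is assumed to be an `(N, d)` array of support embeddings with integer `labels`):

```python
import numpy as np

def ned_probs(dist_sq, labels, num_classes, sigma):
    """Eq. (2): class probabilities from squared distances to support points."""
    w = np.exp(-dist_sq / sigma**2)           # Gaussian kernel weights
    probs = np.zeros(num_classes)
    np.add.at(probs, labels, w)               # sum weights per class
    return probs / (probs.sum() + 1e-300)

def support_nll(sigma, emb, labels, num_classes):
    """Leave-one-out negative log-likelihood of eq. (2) on the support set."""
    nll = 0.0
    for i in range(len(emb)):
        d2 = ((emb - emb[i]) ** 2).sum(axis=1)
        d2[i] = np.inf                        # exclude the point itself
        p = ned_probs(d2, labels, num_classes, sigma)
        nll -= np.log(p[labels[i]] + 1e-12)
    return nll / len(emb)

# Example usage: grid search over candidate bandwidths on the support set.
# sigma = min(np.logspace(-2, 1, 30),
#             key=lambda s: support_nll(s, emb, labels, num_classes))
```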

Suppose that we choose the $n$ nearest neighbors $x_{(1)}, \dots, x_{(n)}$ with labels $y_{(1)}, \dots, y_{(n)}$ for the confidence score calculation. Because the rest of the data points are far enough from $f(x)$ in the embedding space, $e^{-d_{(i)}^2/\sigma^2} \approx 0$ for $i > n$; thus the probability in (2) becomes

(3)  $\hat{c}_k(x) = \dfrac{\sum_{i \leq n:\, y_{(i)} = k} e^{-d_{(i)}^2 / \sigma^2}}{\sum_{i=1}^{n} e^{-d_{(i)}^2 / \sigma^2}}$

where $d_{(i)} = \lVert f(x) - f(x_{(i)}) \rVert$. The equation in (3) defines the NED confidence score. Below we give the precise description of the NED algorithm.

Inputs: query point $x$, DDML model $f$, support set $S = \{(x_i, y_i)\}_{i=1}^{N}$, optimal $\sigma$, number of neighbors $n$.
Output: prediction $\hat{y}$ with estimated confidence $\hat{c}$.
1  $E \leftarrow \{f(x_i) : (x_i, y_i) \in S\}$  {obtain embedding vectors for samples in the support set};
2  $e \leftarrow f(x)$  {obtain embedding vector for query sample $x$};
3  find the $n$ nearest neighbors $e_{(1)}, \dots, e_{(n)} \in E$ of $e$ and their labels $y_{(1)}, \dots, y_{(n)}$;
4  for $i \leftarrow 1$ to $n$ do
5      $w_i \leftarrow e^{-\lVert e - e_{(i)} \rVert^2 / \sigma^2}$;
6  end for
7  $\hat{c}_k \leftarrow \sum_{i: y_{(i)} = k} w_i / \sum_{i=1}^{n} w_i$ for each class $k$;
8  $\hat{y} \leftarrow k^*$ where $k^* = \arg\max_k \hat{c}_k$;
9  return $\hat{y}$, $\hat{c} = \hat{c}_{k^*}$
Algorithm 1 NED algorithm
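A compact Python rendering of Algorithm 1 might look as follows (a sketch under the notation above; the variable names are ours, and `f` is any embedding function):

```python
import numpy as np

def ned_predict(x, f, support_emb, support_labels, sigma, n):
    """Algorithm 1 (NED): predict a label and a calibrated confidence
    for query x from its n nearest support neighbors, per eq. (3)."""
    e = f(x)                                            # embed the query
    d2 = ((support_emb - e) ** 2).sum(axis=1)           # squared distances
    nn = np.argsort(d2)[:n]                             # n nearest neighbors
    w = np.exp(-d2[nn] / sigma**2)                      # kernel weights
    conf = {k: w[support_labels[nn] == k].sum() / w.sum()
            for k in np.unique(support_labels[nn])}
    y_hat = max(conf, key=conf.get)                     # argmax over classes
    return y_hat, conf[y_hat]
```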

As we show in Section 3, one of the advantages of using such a post-processing algorithm is that it can improve the generalization and robustness of already trained DDML models. Moreover, it can be combined with known defenses like adversarial training Goodfellow et al. (2015) or adversarial logit pairing Wallace et al. (2018) to further improve adversarial robustness. Mao et al. propose adversarial training specifically adapted to the deep metric learning setting Mao et al. (2019). By carefully sampling examples for metric learning, the learned representation increases robustness to adversarial examples and helps detect previously unseen adversarial samples.

2.1 Comparison with $k$-nearest neighbor (kNN) and weighted kNN algorithms

Non-parametric classifiers like $k$-nearest neighbors (kNN) and weighted $k$-nearest neighbors (WkNN) can also be used to improve the classification performance of already trained DDML models. However, as we show in Section 3, they provide non-calibrated confidence estimates. We use kNN and WkNN as baselines to compare with the proposed NED algorithm.

The kNN classifier is a simple non-parametric classifier that predicts the label of an input based on a majority vote over the labels of its neighbors in the embedding space. Intuitively, the confidence score for every class can be selected as the fraction of the $n$ nearest neighbor labels belonging to class $k$:

(4)  $\hat{c}_k(x) = \dfrac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_{(i)} = k]$

The robustness of kNN has already been shown from both theoretical perspectives and empirical analyses Wang et al. (2018); Papernot et al. (2016a); Gal and Ghahramani (2016); Schott et al. (2019). However, the main disadvantage of kNN is that its reliability depends critically on the value of $n$. Therefore, many weighted $k$-nearest neighbor (WkNN) approaches have been proposed Gou et al. (2012); Dudani (1976); Hechenbichler and Schliep (2004), where the closer neighbors are weighted more heavily than the farther ones. We have experimented with various weighted approaches and achieved the most reliable and calibrated estimation of the confidence score using the approaches described in Dudani (1976) and Gou et al. (2012).

In this case, given a distance-weighting function that assigns a weight $w_i$ to the $i$-th neighbor, the confidence score can be selected as the weighted fraction of the $n$ nearest neighbors belonging to class $k$:

(5)  $\hat{c}_k(x) = \dfrac{\sum_{i: y_{(i)} = k} w_i}{\sum_{i=1}^{n} w_i}$

For these methods, the weights $w_i$ are linear functions of the distance between $f(x)$ and $f(x_{(i)})$, i.e., $\lVert f(x) - f(x_{(i)}) \rVert$, in the embedding space.

To compare the performance of the NED algorithm with the kNN and WkNN algorithms, we use equations (4) and (5) in place of the confidence computation in Algorithm 1.
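For completeness, minimal sketches of the baseline scores (4) and (5); the linear weighting here follows the Dudani (1976) scheme as we understand it, so it should be treated as an illustration rather than the exact implementation:

```python
import numpy as np

def knn_confidence(nn_labels, k):
    """Eq. (4): fraction of the n nearest neighbors labeled with class k."""
    return np.mean(nn_labels == k)

def wknn_confidence(nn_dists, nn_labels, k):
    """Eq. (5) with Dudani-style linear weights: the nearest neighbor gets
    weight 1, the farthest gets weight 0, linear in between."""
    d_min, d_max = nn_dists.min(), nn_dists.max()
    if d_max == d_min:
        w = np.ones_like(nn_dists)            # all neighbors equidistant
    else:
        w = (d_max - nn_dists) / (d_max - d_min)
    return w[nn_labels == k].sum() / (w.sum() + 1e-12)
```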

3 Evaluation

We consider the following scenarios to evaluate the performance of the proposed NED algorithm.

First, we train a state-of-the-art DDML model with the normalized, temperature-weighted version of the cross-entropy loss, following the protocol described in Zhai and Wu (2019). Minimizing the cross-entropy can be seen as an approximate bound-optimization algorithm for minimizing many pairwise losses popular in DDML Wen et al. (2016); Wang et al. (2019b); Wu et al. (2018); Hoffer and Ailon (2015); Ge et al. (2018); Chopra et al. (2005). Therefore, we use this approach as a baseline that gives state-of-the-art results with simpler hyperparameters and sampling strategy; a sketch of the loss is given below. Empirical experiments with other popular DDML approaches can be found in Appendix B. Once the DDML model is trained, we evaluate the performance of the model complemented with Algorithm 1.
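A sketch of this baseline loss as we read Zhai and Wu (2019) (the temperature value shown is illustrative, and the helper name is ours):

```python
import torch
import torch.nn.functional as F

def norm_softmax_loss(embeddings, labels, class_weights, temperature=0.05):
    """Normalized, temperature-weighted cross-entropy: cosine logits between
    L2-normalized embeddings and L2-normalized class weight vectors."""
    z = F.normalize(embeddings, dim=1)        # unit-norm embeddings
    w = F.normalize(class_weights, dim=1)     # unit-norm class proxies
    logits = z @ w.t() / temperature          # scaled cosine similarities
    return F.cross_entropy(logits, labels)
```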

Second, we empirically evaluate the robustness of our approach in the presence of distribution shift in the test data. In particular, we repeat our experiments when the test data contain small common distortions (image transformations related to JPEG compression, different illumination conditions, and camera quality) or when the test images are modified using adversarial white-box attacks.

3.1 Metrics

For all experiments, we use the following metrics:

Accuracy. We use an accuracy metric to evaluate reliability for challenging extreme multi-class classification problems. Classification accuracy is equivalent to the Recall@1 metric in image retrieval Song et al. (2016).

Reliability Diagrams. A reliability diagram, such as the one shown in Figure 1, is a visual representation of confidence metric calibration DeGroot and Fienberg (1983). It shows the relationship between expected accuracy and confidence estimates. To estimate the expected accuracy from finite samples, we split the test data into $M$ equally spaced confidence bins. For each bin, the mean predicted confidence score is plotted against the true fraction of positive cases. Both metrics should lie near the diagonal line if the model is well calibrated.

Expected Calibration Error. While the reliability diagram is a useful visual representation of confidence calibration, it does not show the proportion of samples in each bin. Therefore, we use the Expected Calibration Error (ECE) to evaluate calibration Guo et al. (2017), which is defined as:

(6)  $\mathrm{ECE} = \sum_{m=1}^{M} \dfrac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$

where $N$ is the number of samples in the test dataset, $M$ is the number of bins, $B_m$ is the index set for the $m$-th bin, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and average confidence for the $m$-th bin respectively, defined as:

(7)  $\mathrm{acc}(B_m) = \dfrac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}[\hat{y}_i = y_i], \qquad \mathrm{conf}(B_m) = \dfrac{1}{|B_m|} \sum_{i \in B_m} \hat{c}_i$
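A small sketch of the ECE computation in (6)-(7) with equal-width bins (the helper name is ours):

```python
import numpy as np

def expected_calibration_error(conf, correct, num_bins=10):
    """Eqs. (6)-(7): bin predictions by confidence and average the
    |accuracy - confidence| gap, weighted by the bin sizes."""
    bins = np.minimum((conf * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for m in range(num_bins):
        in_bin = bins == m
        if in_bin.any():
            acc = correct[in_bin].mean()          # acc(B_m)
            avg_conf = conf[in_bin].mean()        # conf(B_m)
            ece += in_bin.mean() * abs(acc - avg_conf)
    return ece
```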

3.2 Datasets

We conduct our experiments using DDML models on four benchmark datasets: Caltech-UCSD Birds (CUB-200) Wah et al. (2011), Stanford Online Product (SOP) Song et al. (2016), Stanford Car-196 (CARS196) Krause et al. (2013) and In-shop Clothes Retrieval Liu et al. (2016).

We follow the common evaluation protocol for these datasets Zhai and Wu (2019). In particular, the object categories between train and test sets are disjoint. This split makes the problem more challenging since deep networks can overfit to the categories in the train set and generalization to unseen object categories could be poor.

4 Results

Table 1 shows the performance of the DDML model complemented with the NED algorithm, and the performance comparison with three baselines: kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012). We provide the accuracy reported in Zhai and Wu (2019) and the accuracy of our experiment for the case where the label of the first nearest neighbor in the embedding space is used for prediction, i.e., 1NN. The difference in accuracy is caused by using different initialization parameters during training. The results presented in the table demonstrate that the proposed approach using the NED algorithm outperforms all the other approaches in both accuracy and ECE. For all experiments, the accuracy of the model with the NED algorithm is higher by at least 0.3% and at most 7.3% compared to 1NN. Similarly, using kNN and the different versions of the WkNN algorithm improves classification accuracy, which demonstrates that outliers to the training distribution can be identified more accurately at test time when more than the first neighbor is considered.

                        CUB-200           CARS196           SOP               InShop
                        Accuracy   ECE    Accuracy   ECE    Accuracy   ECE    Accuracy   ECE
1NN (reported)          67.6       -      89.1       -      80.8       -      90.6       -
1NN                     67.2       -      89.1       -      81.2       -      90.9       -
kNN                     73.8       9.2    90.1       10.0   81.2       24.4   90.9       32.5
WkNN Dudani (1976)      74.3       20.5   91.3       12.2   81.2       5.2    91.0       17.8
WkNN Gou et al. (2012)  74.3       20.7   91.3       12.3   81.2       5.2    91.1       17.9
NED (proposed)          74.9       2.4    91.5       1.5    81.2       2.2    91.3       0.3
Table 1: The accuracy and ECE of the proposed NED algorithm and the four baselines (1NN, kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012)). The best results in each column are shown in boldface. The proposed approach consistently outperforms the baseline methods for all experiments.

While kNN and WkNN provide more reliable predictions, their confidence scores do not represent an interpretable confidence of the predictions. Their ECE is significantly higher than that of the proposed NED approach (by at least 3% and at most 31.2%).

The SOP dataset contains very scarce data (from 2 to 12 images per class). While the proposed approach does not improve classification accuracy compared to the baselines, it provides better calibrated confidences. The ECE of the proposed approach is 2.2%, whereas that of WkNN is 5.2%. This difference demonstrates that NED can be used for uncertainty estimation to detect cases where the model most probably misclassifies, even for such a scarce dataset as SOP.

Figure 1 shows reliability diagrams for the CARS196 dataset. We see that kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012) tend to be overconfident in their predictions. On the other hand, the NED algorithm produces a much better confidence estimate at the cost of tuning a single parameter $\sigma$. Moreover, all the bins are well calibrated by the NED algorithm.

Figure 1: Reliability diagrams for the DDML model complemented with different approaches to estimating the confidence of classification on the CARS196 dataset: (a) kNN, (b) WkNN Dudani (1976), (c) WkNN Gou et al. (2012), (d) NED. Our novel NED algorithm provides a more accurate estimate of the true correctness likelihood with the smallest expected calibration error (ECE).

We analyze the impact of the value of $n$ on the effectiveness of the proposed algorithm. Figure 2 shows the accuracy of the DDML model with the different approaches as a function of the number of neighbors, $n$, used to determine the predicted label. Since the kNN method weights data points far from $f(x)$ with the same importance as those close to $f(x)$, its performance degrades with a larger $n$. Both variants of WkNN are better than kNN because they consider the distances between $f(x)$ and the neighbors $f(x_{(i)})$. However, the weights they impose on samples are linear functions of the distances, whereas Theorem 1 shows that the correct weights should be exponential functions of the negative of the squared distance, i.e., $e^{-d_{(i)}^2/\sigma^2}$, and hence should rapidly decrease as the distance increases. This is why both kNN and WkNN show poor performance for large $n$.

On the other hand, the performance of the NED algorithm monotonically improves as $n$ grows. This is due to Theorem 1: (2) is an accurate value for the correctness likelihood, and (3) is a better approximation of (2) with a larger $n$. Therefore, the NED algorithm should yield better results with larger $n$.

Figure 2 also shows a critical advantage of the NED algorithm over the other methods. The accuracy of the NED algorithm does not vary much after a certain point. Therefore, NED is robust to the choice of $n$, which reduces the effort of choosing $n$ considerably, whereas the other methods are sensitive to the choice of $n$, requiring much effort to find its optimal value.

This can also be explained by Theorem 1. In (3), the weights $e^{-d_{(i)}^2/\sigma^2}$ become negligible for those points far from $f(x)$ in the embedding space; hence the additional accuracy gain obtained by increasing $n$ rapidly diminishes after a certain point.

Figure 2: Accuracy of the DDML model as a function of $n$ on (a) CUB-200, (b) CARS196, (c) SOP, and (d) InShop.

In Appendix A we also provide a detailed interpretation of the tuning of $\sigma$. Tuning $\sigma$ corresponds to tuning the smoothing factor in the Gaussian kernel function. Hence, we can interpret the tuning of $\sigma$ as finding a better estimate of the true probability distribution function of the data.

5 Distorted and Adversarial Images

To evaluate the robustness of the proposed algorithm, we modified the images in the test set with fifteen different types of corruption at five severity levels, following the protocol described in Hendrycks and Dietterich (2019). Note that the support set was not altered, and the value of $\sigma$ is the same as that used for the former experiments without distortions.

Table 2 shows the average accuracy and average ECE for the DDML model complemented with the proposed NED algorithm and the four baselines (1NN, kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012)). The results show that, on average, the model with the NED algorithm is more robust than the baselines across the considered types of image distortions. In this case, the difference between the accuracy of the model with the NED algorithm and that of 1NN is even higher than on clean images, i.e., the average accuracy of the model with the NED algorithm is higher than that of 1NN by at least 1.6% and at most 12%.

                        CUB-200           CARS196           SOP               InShop
                        Accuracy   ECE    Accuracy   ECE    Accuracy   ECE    Accuracy   ECE
1NN                     52.8       -      72.9       -      74.1       -      80.2       -
kNN                     60.5       12.3   76.0       14.1   75.1       27.3   81.8       34.4
WkNN Dudani (1976)      62.1       25.6   77.0       16.8   75.2       10.5   82.5       20.6
WkNN Gou et al. (2012)  62.2       25.6   77.0       16.9   75.2       10.5   82.5       20.6
NED (proposed)          64.8       3.0    77.2       2.3    75.7       2.9    85.6       0.9
Table 2: The average accuracy and average ECE for the DDML model with the proposed NED algorithm and the four baselines (1NN, kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012)) when the images in the test data are distorted with common corruption types Hendrycks and Dietterich (2019). The proposed approach consistently outperforms the baseline methods for all experiments.

To further examine the robustness of the models with the NED algorithm, we evaluate its performance on adversarial examples. We craft adversarial samples using three popular white-box bounded untargeted attacks: the Fast Gradient Sign Method (FGSM) Goodfellow et al. (2015), the Basic Iterative Method (BIM) Kurakin et al. (2017), and Projected Gradient Descent (PGD) Madry et al. (2018). We generate adversarial examples using PGD with a 0.01 step size for 40 steps and BIM with a 0.02 step size for 10 steps during training.
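As an illustration of the attack setup, here is a generic single-step FGSM sketch in PyTorch; the loss used to craft attacks against a DDML model is not spelled out in this section, so `loss_fn` is a placeholder assumption:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """FGSM: one signed-gradient ascent step on the loss, bounded by eps
    in the L-infinity norm, then clipped to the valid pixel range."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```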

Table 3 shows the performance of the DDML model with the NED algorithm and the four baselines (1NN, kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012)) on the CARS196 dataset. The proposed approach consistently outperforms the baseline methods for all experiments.

                        FGSM                        BIM                         PGD
                        A.     E.     A.     E.     A.     E.     A.     E.     A.     E.     A.     E.
1NN                     86.3   -      77.7   -      85.7   -      69.5   -      86.4   -      74.4   -
kNN                     87.5   10.4   79.6   9.8    87.3   10.8   71.8   9.6    87.8   10.6   76.1   10.9
WkNN Dudani (1976)      88.6   12.7   80.9   12.7   88.4   13.2   73.0   13.1   88.9   12.8   77.7   14.5
WkNN Gou et al. (2012)  88.6   12.9   80.9   12.9   88.4   13.3   73.2   13.2   88.9   12.9   77.7   14.5
NED (proposed)          88.7   1.9    81.3   3.6    80.4   2.2    73.3   2.5    89.0   2.0    78.0   2.9
Table 3: The accuracy (A.) and ECE (E.) for the DDML model complemented with the NED algorithm for adversarial images; each attack reports two accuracy/ECE column pairs. The model with the NED algorithm is more robust to adversarial images compared to the baselines.

6 Conclusion

In this paper we described a novel algorithm called NED that can be used with pre-trained DDML models to improve classification results while providing an accurate approximation of the true correctness likelihood. We demonstrated consistent performance improvements over the baseline methods on examples from the normal data distribution, on examples with distribution shifts related to common image corruptions, and on adversarial examples.

References

  • [1] A. Zhai and H.-Y. Wu (2019) Classification is a strong baseline for deep metric learning. In Proceedings of The British Machine Vision Conference, Cited by: §1.1.
  • [2] D.M. Bashtannyk and R.J. Hyndman (2001) Bandwidth selection for kernel conditional density estimation. Computational Statistics and Data Analysis 36, pp. 279–298. Cited by: §2.
  • [3] A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski (2009) Robust optimization. Princeton University Press. Cited by: §1.2.
  • [4] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain (2015) Sparse local embeddings for extreme multi-label classification. In Proceedings of the 29th Conference on Neural Information Processing Systems, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [5] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Cited by: §1.1, §3.
  • [6] M. H. DeGroot and S. E. Fienberg (1983) The comparison and evaluation of forecasters. The statistician. Cited by: §3.1.
  • [7] S. Dodge and L. Karam (2017) A study and comparison of human and deep learning recognition performance under visual distortions. In Proceedings of the 26th international conference on computer communication and networks, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [9] S. A. Dudani (1976) The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics. Cited by: Appendix A, Table 4, §2.1, 0(b), Table 1, §4, §4, Table 2, Table 3, §5, §5.
  • [10] N. Frosst, N. Papernot, and G. Hinton (2019) Analyzing and improving representations with the soft nearest neighbor loss. In Proceedings of the 36th International Conference on Machine Learning, Cited by: §1.2.
  • [11] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In Proceeding of the 33rd International Conference on International Conference on Machine Learning, Cited by: §1.2, §2.1, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [12] Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §1.2.
  • [13] W. Ge, W. Huang, D. Dong, and M. R. Scott (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the 14th European Conference on Computer Vision, Cited by: §1.1, §3.
  • [14] J. Gilmer, R. P. Adams, I. Goodfellow, D. Anderson, and G. E. Dahl (2018) Motivating the rules of the game for adversarial example research. In arXiv preprint arXiv:1807.06732, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [15] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In Proceedings of International Conference on Learning Representations, Cited by: §1.2, §2, §5.
  • [16] J. Gou, L. Du, Y. Zhang, and T. Xiong (2012) A new distance-weighted k-nearest neighbor classifier. Journal of Information and Computational Science. Cited by: Appendix A, Table 4, §2.1, 0(c), Table 1, §4, Table 2, Table 3, §5, §5.
  • [17] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §1.2, §1.2, §3.1, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In Proceedings of the 14th European Conference on Computer Vision, Cited by: Appendix B.
  • [19] K. Hechenbichler and K. Schliep (2004) Weighted k-nearest-neighbor techniques and ordinal classification. Discussion Paper No. 399, Collaborative Research Center 386.. Cited by: §2.1.
  • [20] D. Hendrycks and T. G. Dietterich (2019) Benchmarking neural network robustness to common corruptions and surface variations. In arXiv preprint arXiv:1807.01697, Cited by: Table 2, §5.
  • [21] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song (2019) Natural adversarial examples. In arXiv preprint arXiv:1907.07174, Cited by: §1.2, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [22] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [23] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In Proceedings of the Third International Workshop on Similarity Based Pattern Analysis and Recognition, Cited by: §1.1, §3.
  • [24] H. Hu, Y. Wang, L. Yang, P. Komlev, L. Huang, X. Chen, J. Huang, Y. Wu, M. Merchant, and A. Sacheti (2018) Web-scale responsive visual search at bing. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Cited by: §1.2, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [25] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [26] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Cited by: Appendix B.
  • [27] S. Jha, S. Raj, S. Fernandes, S. K. Jha, S. Jha, B. Jalaian, G. Verma, and A. Swami (2019) Attribution-based confidence metric for deep neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [28] H. Jiang, B. Kim, M. Guan, and M. Gupta (2018) To trust or not to trust a classifier. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Cited by: §1.2.
  • [29] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In Proceedings of the 14th European Conference on Computer Vision, Cited by: §1.1.
  • [30] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013-12) 3D object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pp. 554–561. External Links: Document Cited by: §3.2.
  • [31] A. Kurakin, I. J. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. In arXiv preprint arXiv:1607.02533, Cited by: §5.
  • [32] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.2.
  • [33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In Proceedings of International Conference on Learning Representations, Cited by: §5.
  • [34] C. Mao, Z. Zhong, J. Yang, C. Vondrick, and B. Ray (2019) Metric learning for adversarial robustness. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Cited by: §2.
  • [35] A. Meinke and M. Hein (2020) Towards neural networks that provably know when they don’t know. In Proceedings of the Eighth International Conference on Learning Representations, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [36] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Appendix B, §1.1, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [37] M. P. Naeini, G. Cooper, and M. Hauskrecht (2015) Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §1.2.
  • [38] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.1.
  • [39] N. Papernot, P. D. McDaniel, and I. J. Goodfellow (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. In arXiv preprint arXiv:1605.07277, Cited by: §2.1.
  • [40] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [41] N. Papernot and P. McDaniel (2018) Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. In arXiv preprint arXiv: 1803.04765, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [42] S. Park, O. Bastani, N. Matni, and I. Lee (2020) PAC confidence sets for deep neural networks via calibrated prediction. In Proceedings of the Eighth International Conference on Learning Representations, Cited by: §1.2, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [43] J. Platt (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers. Cited by: §1.2.
  • [44] L. Schott, J. Rauber, M. Bethge, and W. Brendel (2019) Towards the first adversarially robust neural network model on mnist. In Proceedings of the International Conference on Learning Representations, Cited by: §2.1.
  • [45] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Appendix B, §1.1.
  • [46] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Cited by: Appendix B, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [47] H. O. Song, Xiang,Y., S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1, §3.2, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [48] D. Stutz, M. Hein, and B. Schiele (2019) Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [49] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: Appendix B.
  • [50] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §3.2.
  • [51] E. Wallace, S. Feng, and J. Boyd-Graber (2018) Interpreting neural networks with nearest neighbors. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP, Cited by: §2.
  • [52] J. Wang, K.-C. Wang, M. Law, F. Rudzicz, and M. Brudno (2019) Centroid-based deep metric learning for speaker recognition. In Proceedings of the 44th International Conference on Acoustics, Speech, and Signal Processing, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [53] X. Wang, X. Han, W. Huang, D. Dong, and M.R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Cited by: §3.
  • [54] Y. Wang, S. Jha, and K. Chaudhuri (2018) Analysing the robustness of nearest neighbors to adversarial examples. In Proceedings of the International Conference on Machine Learning, Cited by: §2.1.
  • [55] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Cited by: §1.1, §3.
  • [56] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Kraehenbuehl (2017) Sampling matters in deep embedding learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.1, Calibrated neighborhood aware confidence measure for deep metric learning.
  • [57] Z. Wu, A.A. Efros, and S.X. Yu (2018) Improving generalization via scalable neighborhood component analysis. In Proceedings of the European Conference on Computer Vision, Cited by: §3.
  • [58] C. Xing, S. Arik, Z. Zhang, and T. Pfister (2020) Distance-based learning from errors for confidence calibration. In Proceedings of the Eighth International Conference on Learning Representations, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [59] S. Xiong, Y. Zhang, D. Ji, and Y. Lou (2016) Distance metric learning for aspect phrase grouping. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [60] I. E.-H. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. Dhillon (2018) Pd-sparse: a primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33rd International Conference on Machine Learning, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [61] B. Zadrozny and C. Elkan (2001) Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the International Conference on Machine Learning, Cited by: §1.2.
  • [62] B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Cited by: §1.2.
  • [63] A. Zhai and H. Wu (2019) Classification is a strong baseline for deep metric learning. In Proceedings of British Machine Vision Conference, Cited by: §3.2, §3, §4.
  • [64] A. Zhai and H. Wu (2020) Extreme classification via adversarial softmax approximation. In Proceedings of the Eighth International Conference on Learning Representations, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [65] W. Zhang, J. Yan, X. Wang, and H. Zha (2018) Deep extreme multi-label learning. In Proceedings of the 35th International Conference on Machine Learning, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [66] Y. Zhang, P. Pan, Y. Zheng, K. Zhao, Y. Zhang, X. Ren, and R. Jin (2018) Visual search at alibaba. In Proceedings of the 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.
  • [67] S. Zheng, Y. Song, T. Leung, and I. Goodfellow (2016) Improving the robustness of deep neural networks via stability training. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Calibrated neighborhood aware confidence measure for deep metric learning.

Appendix A The proof of Theorem 1

In this appendix, we show that the NED algorithm provides a good approximation of the true correctness likelihood, assuming that the empirical distribution function obtained using the Gaussian kernel smoothing function estimates the true distribution of the data in each class well. Therefore, we not only show by experiments that the NED score outperforms other methods (1NN, kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012)), but also provide theoretical grounds to support the experimental results.

Theorem 1.

The probability of $x$ belonging to class $k$, given the embedding $f(x)$ and the support set $S = \{(x_i, y_i)\}_{i=1}^{N}$, can be approximated by:

(8)  $\hat{c}_k(x) = \dfrac{\sum_{i: y_i = k} e^{-\lVert f(x) - f(x_i) \rVert^2 / \sigma^2}}{\sum_{i=1}^{N} e^{-\lVert f(x) - f(x_i) \rVert^2 / \sigma^2}}$

where $\sigma > 0$ is a parameter to be tuned.

Proof.

Suppose that there are $K$ classes in the multi-class classification problem and the set of the classes is defined by $\mathcal{C} = \{1, \dots, K\}$. Let $X$ and $Y$ be the random variables representing the independent variables and the associated class respectively, with joint probability density function (PDF) $p_{X,Y}$. Let $f: \mathbb{R}^D \to \mathbb{R}^d$ represent the trained deep distance metric learning (DDML) model. Then this defines the joint PDF of $Z = f(X)$ and $Y$, $p_{Z,Y}$. In general, we do not know these PDFs. Here $D$ and $d$ are the dimensions of the original and embedding spaces respectively.

The true confidence score of a data point $x$ for the $k$-th class is the probability of $x$ belonging to class $k$ given $z = f(x)$ in the embedding space, i.e.,

(9)  $c_k(x) = P(Y = k \mid Z = z) = \dfrac{p_{Z \mid Y}(z \mid k) \, P(Y = k)}{p_Z(z)}$

where $p_Z$ and $P(Y = k)$ are the marginal densities of $Z$ and $Y$ respectively and $p_{Z \mid Y}$ is the conditional PDF of $Z$ given $Y$. Since we do not know the joint PDF $p_{Z,Y}$, we cannot calculate any of these three quantities. Here we estimate these quantities using the data we are given.

Let us assume that we are given $N$ data points $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^D$ and $y_i \in \mathcal{C}$, where these are samples from $(X, Y)$, and that $N_k$ is the number of data points belonging to the $k$-th class, i.e., $\sum_{k=1}^{K} N_k = N$. Since

(10)  $P(Y = k \mid Z = z) = \dfrac{p_{Z \mid Y}(z \mid k) \, P(Y = k)}{\sum_{j=1}^{K} p_{Z \mid Y}(z \mid j) \, P(Y = j)}$

we need to estimate only the two quantities $P(Y = k)$ and $p_{Z \mid Y}(z \mid k)$.

For the first quantity, we can estimate it by counting the data points belonging to each class, assuming that the data set is balanced, i.e., the size of the data for each class is proportional to the true portion of each class. Thus we have

(11)  $\hat{P}(Y = k) = \dfrac{N_k}{N}$

where $\hat{P}$ denotes the estimate of the probability mass function (PMF) of $Y$, i.e., $P(Y = k)$.

For the second quantity, we use a probability density estimation method. If we knew the type of probability distribution the data of each class in the embedding space come from, we could use a parametric estimation method to estimate the conditional probability density $p_{Z \mid Y}$. However, we do not know the details of the distribution in general. Therefore, we use a non-parametric distribution estimation method which does not assume a specific type of data distribution. One such method uses the empirical probability density function (PDF), i.e.,

(12)  $\hat{p}_{Z \mid Y}(z \mid k) = \dfrac{1}{N_k} \sum_{i: y_i = k} \delta(z - f(x_i))$

where $\delta$ is a Dirac delta function whose range is in the set of the extended real numbers and $\hat{p}_{Z \mid Y}$ denotes an estimate for $p_{Z \mid Y}$. This estimated PDF has infinite peaks, hence we cannot use it in practice. However, we can apply the convolution operation with some smoothing kernel to (12) to obtain a finite PDF estimate, i.e.,

(13)  $\tilde{p}_{Z \mid Y}(z \mid k) = \left(\hat{p}_{Z \mid Y}(\cdot \mid k) * g\right)(z)$

where $g$ is a kernel smoothing function and $*$ is the convolution operator. One typical choice for the kernel smoothing function is the (multi-dimensional) Gaussian PDF. In this case, the estimated PDF becomes

(14)  $\tilde{p}_{Z \mid Y}(z \mid k) = \dfrac{1}{N_k} \sum_{i: y_i = k} g_{\Sigma}(z - f(x_i))$

where $g_{\Sigma}$ is the PDF of $\mathcal{N}(0, \Sigma)$ for some $\Sigma \in \mathbb{S}_{++}^{d}$, i.e., a zero-mean Gaussian with $\Sigma$ as its covariance matrix. We can interpret the convolution with the Gaussian kernel as applying a $d$-dimensional low-pass filter. Here $\mathbb{S}_{++}^{d}$ denotes the set of all positive definite matrices in $\mathbb{R}^{d \times d}$ and the PDF is defined by

(15)  $g_{\Sigma}(u) = \dfrac{1}{(2\pi)^{d/2} \lvert \Sigma \rvert^{1/2}} \exp\left(-\dfrac{1}{2} u^{\top} \Sigma^{-1} u\right)$

Note that the PDF in (15) is not used for representing a random variable, but as a kernel smoothing function.

Now the right-hand side of (10) can be approximated by

(16)  $P(Y = k \mid Z = z) \approx \dfrac{\frac{N_k}{N} \cdot \frac{1}{N_k} \sum_{i: y_i = k} g_{\Sigma}(z - f(x_i))}{\sum_{j=1}^{K} \frac{N_j}{N} \cdot \frac{1}{N_j} \sum_{i: y_i = j} g_{\Sigma}(z - f(x_i))} = \dfrac{\sum_{i: y_i = k} g_{\Sigma}(z - f(x_i))}{\sum_{i=1}^{N} g_{\Sigma}(z - f(x_i))}$

where (11) and (14) are used. Note that the approximation in (16) does not depend on $N_k$ because it is cancelled out. This is because the number of data points in each class, $N_k$, accounts for the probability of each class.

Now (10) implies that the true confidence score can be approximated by

(17)  $c_k(x) \approx \dfrac{\sum_{i: y_i = k} \exp\left(-\frac{1}{2}(z - f(x_i))^{\top} \Sigma^{-1} (z - f(x_i))\right)}{\sum_{i=1}^{N} \exp\left(-\frac{1}{2}(z - f(x_i))^{\top} \Sigma^{-1} (z - f(x_i))\right)}$

where we use the fact that the normalization constant $(2\pi)^{d/2} \lvert \Sigma \rvert^{1/2}$ cancels in the derivation of (17) from (16).

In our work, we use DDML models that assume during training that all the coordinates of the embedding variable are independent and properly normalized for each class. Therefore, we can replace $\Sigma$ with $\sigma_k^2 I_d$, where $\sigma_k > 0$ and $I_d$ denotes the identity matrix. Then (17) becomes

(18)  $c_k(x) \approx \dfrac{\sum_{i: y_i = k} \exp\left(-\lVert z - f(x_i) \rVert^2 / (2\sigma_k^2)\right)}{\sum_{j=1}^{K} \sum_{i: y_i = j} \exp\left(-\lVert z - f(x_i) \rVert^2 / (2\sigma_j^2)\right)}$

where $\lVert \cdot \rVert$ denotes the $\ell_2$ norm. If we further assume that the variance of the data in the embedding space is equal for all classes, i.e., $\sigma_k = \sigma_0$ for all $k \in \mathcal{C}$, we have

(19)  $c_k(x) \approx \dfrac{\sum_{i: y_i = k} \exp\left(-\lVert z - f(x_i) \rVert^2 / (2\sigma_0^2)\right)}{\sum_{j=1}^{K} \sum_{i: y_i = j} \exp\left(-\lVert z - f(x_i) \rVert^2 / (2\sigma_0^2)\right)}$
(20)  $\phantom{c_k(x) \approx{}} = \dfrac{\sum_{i: y_i = k} \exp\left(-\lVert z - f(x_i) \rVert^2 / (2\sigma_0^2)\right)}{\sum_{i=1}^{N} \exp\left(-\lVert z - f(x_i) \rVert^2 / (2\sigma_0^2)\right)}$

If we replace $2\sigma_0^2$ with $\sigma^2$, we obtain equation (8). ∎
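The derivation above can be sanity-checked numerically. The following self-contained snippet (our own check, not from the paper) verifies that the Gaussian-KDE posterior in (16), with kernel covariance $(\sigma^2/2) I$, coincides with the NED score in (8):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
z_support = rng.normal(size=(60, 4))              # support embeddings
y = rng.integers(0, 3, size=60)                   # labels for 3 classes
z, sigma = rng.normal(size=4), 1.3                # query embedding, bandwidth

# KDE-based posterior of eq. (16): per-class kernel sums, normalized.
kern = multivariate_normal(mean=np.zeros(4), cov=(sigma**2 / 2) * np.eye(4))
num = np.array([kern.pdf(z - z_support[y == k]).sum() for k in range(3)])
posterior = num / num.sum()

# NED score of eq. (8): softmax of negative squared distances over sigma^2.
w = np.exp(-((z_support - z) ** 2).sum(axis=1) / sigma**2)
ned = np.array([w[y == k].sum() for k in range(3)]) / w.sum()

assert np.allclose(posterior, ned)                # identical up to rounding
```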

Appendix B Experiments with different DDML approaches

We evaluate the performance of the proposed NED algorithm on the following three additional DDML models:

Triplet Semihard. We train a model with the triplet loss and semi-hard sample mining Schroff et al. (2015). We use GoogleNet Szegedy et al. (2016) with batch normalization Ioffe and Szegedy (2015), replacing the final dense layer to match the embedding dimension before the triplet loss. Due to batch-size constraints, we omitted the experiments with this model on the SOP dataset.

N Pairs. Following the implementation details described in Sohn (2016), we train a model with the N-pair loss, which is defined as a softmax cross-entropy loss function on the pairwise distances within each batch. The approach with the N-pair loss is known to perform better than the triplet variant. The batch composition strategy samples pairs of images from N unique classes. We add one dense layer and L2 normalization to GoogleNet with batch normalization.

Proxy NCA. We train a model with the proxy NCA loss Movshovitz-Attias et al. (2017). We replace positive and negative samples with points that represent the ideal cluster center of each class (proxies), which are initialized randomly and learned along with the embedding function. The model is not sensitive to the batch selection process and converges faster. We add one dense layer and L2 normalization to a pretrained ResNet50 He et al. (2016). Since both proxies and embeddings are normalized and training can stall when relative distances become very small, we add a temperature parameter equal to 0.33 to the NCA loss; a sketch is given below.
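A sketch of the proxy NCA loss with temperature as we use it here (our formulation; the exact placement of the temperature in the loss is an assumption):

```python
import torch
import torch.nn.functional as F

def proxy_nca_loss(embeddings, labels, proxies, temperature=0.33):
    """Proxy NCA: attract each L2-normalized embedding to its class proxy
    relative to all other proxies, with a temperature on the logits."""
    z = F.normalize(embeddings, dim=1)
    p = F.normalize(proxies, dim=1)
    logits = -torch.cdist(z, p) ** 2 / temperature    # neg. squared distances
    return F.cross_entropy(logits, labels)
```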

The results presented in Table 4 demonstrate the potential of using the NED algorithm with different DDML models.

CUB-200
                        Triplet Semihard     N Pairs              Proxy NCA
                        Accuracy   ECE       Accuracy   ECE       Accuracy   ECE
1NN (reported)          -          -         50.96      -         49.21      -
1NN                     49.20      -         52.76      -         57.85      -
kNN                     55.93      4.40      59.53      10.21     63.47      8.64
WkNN Dudani (1976)      56.51      13.65     59.81      12.50     63.51      18.20
WkNN Gou et al. (2012)  56.51      13.65     59.93      13.01     63.51      18.20
NED (proposed)          56.78      3.62      60.41      2.30      64.73      2.18

CARS196
                        Triplet Semihard     N Pairs              Proxy NCA
                        Accuracy   ECE       Accuracy   ECE       Accuracy   ECE
1NN (reported)          -          -         71.12      -         73.22      -
1NN                     59.62      20.03     67.92      4.16      73.28      -
kNN                     63.28      3.76      71.59      6.16      74.01      12.04
WkNN Dudani (1976)      64.26      8.95      72.60      16.78     75.68      17.04
WkNN Gou et al. (2012)  64.26      8.95      73.01      16.00     75.68      17.04
NED (proposed)          64.32      3.73      73.15      2.57      76.29      2.33

SOP
                        Triplet Semihard     N Pairs              Proxy NCA
                        Accuracy   ECE       Accuracy   ECE       Accuracy   ECE
1NN (reported)          -          -         67.73      -         73.73      -
1NN                     -          -         68.38      -         74.09      -
kNN                     -          -         68.38      20.15     74.09      23.24
WkNN Dudani (1976)      -          -         69.45      8.49      74.28      10.38
WkNN Gou et al. (2012)  -          -         69.45      9.01      74.28      10.39
NED (proposed)          -          -         70.15      2.74      74.28      3.01
Table 4: The accuracy and ECE for DDML models trained with three different loss functions, complemented with the proposed NED algorithm and the four baselines (1NN, kNN, WkNN Dudani (1976), and WkNN Gou et al. (2012)), on the CUB-200, CARS196 and SOP datasets. The best results in each column are in bold. The proposed approach consistently outperforms the baseline methods for all experiments.