1 Introduction
Distance metric learning (DML) focuses on learning similarity or dissimilarity between data. It has been actively researched in classification and clustering [47, 18, 2], as well as in domain-specific applications such as information retrieval [27, 46, 17, 14, 30] and bioinformatics [42]. A commonly studied distance metric is the generalized (squared) Mahalanobis distance, which defines the distance between any two instances $x_i, x_j \in \mathbb{R}^d$ as
$$d_M^2(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j),$$
where $M$ is a positive semidefinite (PSD) matrix. Owing to its PSD property, $M$ can be decomposed into $M = L^\top L$ with $L \in \mathbb{R}^{m \times d}$; thus the Mahalanobis distance is equivalent to the Euclidean distance in the linearly transformed feature space $\{Lx\}$. When $m < d$, instances are transformed from a high-dimensional instance space to a low-dimensional feature space.

To learn a specific distance metric for each task, prior knowledge on instance similarity and dissimilarity should be provided as side information. Metric learning methods differ in the form of side information they use and the supervision encoded in similar and dissimilar pairs. For example, pairwise constraints enforce the distance between instances of the same class to be small (or smaller than a threshold value) and the distance between instances of different classes to be large (or larger than a threshold value) [41, 40, 38, 9, 13]. The thresholds can be either predefined or learned for similar and dissimilar pairs [7, 19]. In a triplet constraint $(x_i, x_j, x_k)$, the distance between the different-class pair $(x_i, x_k)$ should be larger than the distance between the same-class pair $(x_i, x_j)$, typically by a margin [39, 33, 44, 24]. More recently, quadruplet constraints have been proposed, which require the difference in the distance of two pairs of instances to exceed a margin [20], and the tuplet constraint extends the triplet constraint to multiclass classification [32, 22].
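As a concrete illustration of the decomposition $M = L^\top L$ described above, the following NumPy sketch (variable names are ours, not from the paper) shows that the Mahalanobis distance equals the Euclidean distance in the linearly transformed space:

```python
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Squared generalized Mahalanobis distance (xi - xj)^T M (xi - xj)."""
    diff = xi - xj
    return float(diff @ M @ diff)

# With M = L^T L, this equals the squared Euclidean distance between L @ xi
# and L @ xj; choosing L with fewer rows than columns (m < d) maps the data
# into a lower-dimensional feature space.
```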
The gap between thresholds in pairwise constraints and the margin in triplet and quadruplet constraints are both designed to learn a distance metric that ensures good generalization of the subsequent nearest neighbor (NN) classifier. However, such a separating margin imposed at the distance and decision level does not necessarily produce a robust metric; indeed, it may be sensitive to a small perturbation at the instance level. As illustrated in Fig. 1 (upper), a tiny perturbation of an instance in the instance space can be magnified by the learned distance metric, leading to a change in its nearest neighbor in the feature space and, even worse, an incorrect label prediction if 1-NN is used.
In this paper, we propose a simple yet effective method to enhance the robustness of the learned distance metric against instance perturbation. The principal idea is to expand a certified neighborhood, defined as the largest hypersphere within which a training instance can be perturbed without changing the label of its nearest neighbor (or nearest neighbors, if required) in the feature space.
Our contributions are mainly fourfold. Firstly, we derive an analytically elegant solution for the radius of the certified neighborhood (Sec. 2.1). The radius equals the distance between a training instance and its nearest adversarial example [34], termed the support point. Building on a geometric insight, the support point can be easily identified as the closest point to the training instance in the instance space that lies on the decision boundary in the feature space. Secondly, we define a new perturbation loss that penalizes the radius for being small, or equivalently, encourages an expansion of the certified neighborhood (Sec. 2.1); it can be optimized jointly with any existing triplet-based metric learning method (Sec. 2.2). The optimization problem suggests that our method learns a discriminative metric in a weighted manner and simultaneously imposes a data-dependent regularization. Thirdly, because learning a distance metric for high-dimensional data may suffer from overfitting, we extend the perturbation loss so that the metric can be learned from PCA-transformed data in a low-dimensional subspace while retaining the ability to withstand perturbation in the original high-dimensional instance space (Sec. 2.3). Fourthly, we show the benefit of expanding the certified neighborhood for the generalization ability of the learned distance metric using the theoretical technique of algorithmic robustness [43] (Theorem 1, Sec. 2.4). Experiments in noise-free and noisy settings show that the proposed method outperforms existing robust metric learning methods in terms of classification accuracy and validate its robustness to noise (Sec. 3).

Related work
To improve robustness to perturbation that is likely to exist in practice, many robust metric learning methods have been proposed; they fall into three main types. The first type imposes structural assumptions or regularization on the metric so as to avoid overfitting [16, 23, 37, 21, 15, 26, 25]. However, structural information often exists in image datasets but is generally unavailable in the symbolic datasets studied in this paper. Regularization-based methods reduce the risk of overfitting to feature noise; our proposal, which aims to withstand perturbation, does not conflict with these methods and can be combined with them to learn a more effective and robust distance metric (an example is shown in Sec. 3.2). The second type explicitly models the perturbation distribution or identifies clean latent examples [48, 29]; the expected Mahalanobis distance is then used to adjust the value of the separating margin. The third type generates hard instances through adversarial perturbation and trains a metric to fare well on the new, harder problem [6, 11]. Although sharing the aim of improving metric robustness, these methods approach the task at the data level by synthesizing real examples that incur large losses, while our method tackles perturbation at the model level by designing a loss function that reflects the definition of robustness with respect to the decision maker, the NN classifier. By preventing change in the nearest neighbor in a strict manner, our method obtains a certification on the adversarial margin. Finally, we note that a large margin in the instance space has been studied in deep neural networks for enhancing robustness and generalization ability [1, 12, 45, 8]. In contrast, our paper investigates such a margin in the framework of metric learning, defines it specifically with respect to the NN classifier, and provides an exact, analytical solution for the margin.

Notation
Let $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{n}$ denote the set of training instance and label pairs, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ and $y_i$ is the class label; $\mathcal{X}$ is called the instance space in this paper. Our framework is based on triplet constraints, and we adopt the strategy of [39] for generating triplets: each triplet $(x_i, x_j, x_k)$ pairs an instance $x_i$ with one of its nearest same-class neighbors $x_j$ and a nearby different-class instance $x_k$. $x_j$ is termed the target neighbor of $x_i$ and $x_k$ is termed the impostor. $d(\cdot, \cdot)$ and $d_M(\cdot, \cdot)$ denote the squared Euclidean and Mahalanobis distances, respectively; $M \in \mathbb{S}_+^d$, where $\mathbb{S}_+^d$ is the cone of real-valued $d \times d$ PSD matrices. $|\mathcal{A}|$ denotes the cardinality of a set $\mathcal{A}$. $\mathbb{1}[\cdot]$ denotes the indicator function. $[z]_+ = \max(z, 0)$ for $z \in \mathbb{R}$.
2 Proposed approach
In this section, we first derive an explicit formula for the support point and provide the rationale behind the advocated perturbation loss, followed by its optimization problem. We then extend the method for highdimensional data. Lastly, we discuss the benefit of our method to the generalization ability of the learned metric. Main concepts are illustrated in Fig. 2 and listed in Table 5 of Appendix A.
2.1 Support point and perturbation loss
As mentioned in the introduction, a learned distance metric may be sensitive to perturbation in the sense that a small change to an instance could alter its nearest neighbor in the learned feature space, from an instance of the same class to one of a different class, and consequently increase the risk of misclassification by NN. A perturbed point that causes a change in the nearest neighbors, and thus in the prediction, is termed an adversarial example [34]; if the adversarial examples of an instance are all far away from the instance itself, a high degree of robustness is expected. Based on this reasoning, we construct a loss function that penalizes a small distance between a training instance and its closest adversarial example (i.e. the support point), thereby allowing the instance to be predicted correctly even when perturbed to a larger extent.
We start by building a geometric insight into the support point: for any instance $x_i$ associated with the triplet constraint $(x_i, x_j, x_k)$, the support point $\tilde{x}_i$ is the closest point to $x_i$ in the instance space that lies on the decision boundary formed by $x_j$ and $x_k$ in the feature space. Note that closeness is defined in the instance space and is calculated using the Euclidean distance, since we target changes to the original features of an instance; the decision boundary is found in the feature space, since NNs are identified using the Mahalanobis distance. Mathematically, we can formulate the support point as follows:
$$\tilde{x}_i = \operatorname*{arg\,min}_{\tilde{x}} \; (\tilde{x} - x_i)^\top \Sigma\, (\tilde{x} - x_i) \quad \text{s.t.} \quad d_M^2(\tilde{x}, x_j) = d_M^2(\tilde{x}, x_k) \qquad (1)$$
With a pre-given positive definite matrix $\Sigma$, the objective function of Eq. 1 defines an arbitrarily oriented hyperellipsoid, representing heterogeneous and correlated perturbation. Without prior knowledge of the perturbation, we simplify $\Sigma$ to the identity matrix. In this case, the objective function defines a hypersphere, representing perturbation of equal magnitude in all directions; it can also be interpreted as minimizing the Euclidean distance from the training instance $x_i$. For clarity, we always refer to the certified neighborhood as the largest hypersphere in this paper; the hyperellipsoid case is discussed in Appendix B. The constraint defines the decision boundary, which is the perpendicular bisector of the points $Lx_j$ and $Lx_k$. In other words, it is a hyperplane that is perpendicular to the line joining $Lx_j$ and $Lx_k$ and passes through their midpoint; all points on the hyperplane are equidistant from $Lx_j$ and $Lx_k$.

Since Eq. 1 minimizes a convex quadratic function under an equality constraint, we can find an explicit formula for the support point using the method of Lagrange multipliers; see Appendix B for the detailed derivation:
$$\tilde{x}_i = x_i - \frac{d_M^2(x_i, x_j) - d_M^2(x_i, x_k)}{2\,\|M(x_k - x_j)\|_2^2}\; M(x_k - x_j) \qquad (2)$$
With the closed-form solution of $\tilde{x}_i$, we can now calculate the squared Euclidean distance between $x_i$ and $\tilde{x}_i$:
$$m_{ijk} = \|\tilde{x}_i - x_i\|_2^2 = \frac{\left(d_M^2(x_i, x_k) - d_M^2(x_i, x_j)\right)^2}{4\,\|M(x_j - x_k)\|_2^2} \qquad (3)$$
For clarity, we will call $m_{ijk}$ the adversarial margin, in contrast to the distance margin as in LMNN. It determines the radius of the certified neighborhood.
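The support point (Eq. 2) and adversarial margin (Eq. 3) are straightforward to compute once $M$ is given. Below is a NumPy sketch under our reading of the notation ($x_i$: training instance, $x_j$: target neighbor, $x_k$: impostor); it is an illustration, not the authors' implementation:

```python
import numpy as np

def support_point(xi, xj, xk, M):
    """Closest point to xi (in Euclidean distance) lying on the Mahalanobis
    bisector of xj and xk, i.e. the solution of Eq. 1 with Sigma = I (Eq. 2)."""
    dM = lambda a, b: (a - b) @ M @ (a - b)
    g = dM(xi, xj) - dM(xi, xk)   # signed violation of the bisector constraint at xi
    v = M @ (xk - xj)             # half the gradient of the (linear) constraint
    return xi - g / (2.0 * (v @ v)) * v

def adversarial_margin(xi, xj, xk, M, eps=1e-10):
    """Squared Euclidean distance from xi to its support point (Eq. 3)."""
    dM = lambda a, b: (a - b) @ M @ (a - b)
    num = (dM(xi, xk) - dM(xi, xj)) ** 2
    den = 4.0 * np.sum((M @ (xj - xk)) ** 2) + eps  # eps guards a zero denominator
    return num / den
```

Because the bisector constraint is linear in the perturbed point, the support point is simply the Euclidean projection of $x_i$ onto that hyperplane.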
To improve the robustness of the distance metric, we design a perturbation loss that encourages an expansion of the certified neighborhood. Two situations need to be distinguished. Firstly, when the nearest neighbor of $x_i$ is an instance from the same class, we penalize a small adversarial margin using the hinge loss $[\Delta - m_{ijk}]_+$, where $\Delta > 0$ is the desired margin. The reasons are that (a) the adversarial margin is generally smaller for hard instances close to the class boundary than for those located far away, and (b) it is these hard instances that are more vulnerable to perturbation and demand an improvement in robustness. Therefore, we introduce $\Delta$ to direct attention to hard instances and to control the desired margin. Secondly, in the other situation, where the nearest neighbor of $x_i$ belongs to a different class, metric learning should focus on satisfying the distance requirement specified in the triplet constraint; in this case, we simply assign a large penalty of $\Delta$ to keep the loss function non-increasing. Integrating these two situations leads to the proposed perturbation loss:
$$\ell_{\mathrm{per}}^{ijk} = \begin{cases} [\Delta - m_{ijk}]_+ & \text{if } d_M^2(x_i, x_j) \le d_M^2(x_i, x_k), \\ \Delta & \text{otherwise}, \end{cases} \qquad (4)$$
where $\ell_{\mathrm{per}}^{ijk}$ is an abbreviation for $\ell_{\mathrm{per}}(x_i, x_j, x_k)$. To prevent the denominator of Eq. 3 from being zero, which may happen when the instances $x_j$ and $x_k$, belonging to different classes, are close to each other, we add a small constant $\epsilon = 10^{-10}$ to the denominator; that is, $m_{ijk} = \left(d_M^2(x_i, x_k) - d_M^2(x_i, x_j)\right)^2 / \left(4\,\|M(x_j - x_k)\|_2^2 + \epsilon\right)$.
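A minimal sketch of the perturbation loss of Eq. 4, including the small-constant guard on the denominator; the case condition reflects our reading of "the nearest neighbor is from the same class", and `delta` stands for the desired margin:

```python
import numpy as np

def perturbation_loss(xi, xj, xk, M, delta, eps=1e-10):
    """Eq. 4: hinge on the adversarial margin when the target neighbor is
    (weakly) closer than the impostor; a flat penalty of delta otherwise."""
    dM = lambda a, b: (a - b) @ M @ (a - b)
    if dM(xi, xj) <= dM(xi, xk):   # same-class neighbor is the nearer one
        m = (dM(xi, xk) - dM(xi, xj)) ** 2 / (4.0 * np.sum((M @ (xj - xk)) ** 2) + eps)
        return max(delta - m, 0.0)
    return delta                   # impostor closer: assign the maximal penalty
```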
2.2 Metric learning with certified robustness
As support points are derived from triplet constraints, it is natural and straightforward to embed the proposed perturbation loss into a metric learning method that is also based on triplet constraints. LMNN is adopted as an example, owing to its wide use and strong classification performance.
The objective function of the proposed LMNN with certified robustness (LMNN-CR) is as follows:
$$\min_{M \in \mathbb{S}_+^d} \; (1-\mu)\sum_{(i,j)} d_M^2(x_i, x_j) + \mu \sum_{(i,j,k)} \big[1 + d_M^2(x_i, x_j) - d_M^2(x_i, x_k)\big]_+ + \lambda \sum_{(i,j,k)} \ell_{\mathrm{per}}^{ijk} \qquad (5)$$
where $\ell_{\mathrm{per}}^{ijk}$ stands for $\ell_{\mathrm{per}}(x_i, x_j, x_k)$. The weight parameter $\lambda$ controls the importance of the perturbation loss ($\ell_{\mathrm{per}}$) relative to the loss function of LMNN ($\ell_{\mathrm{LMNN}}$); $\mu$ balances the impact between pulling target neighbors together and pushing impostors away.
We adopt the projected gradient descent algorithm to solve the optimization problem (Eq. 5); the gradients of $\ell_{\mathrm{LMNN}}$ and $\ell_{\mathrm{per}}$ can be computed in closed form. The gradient of the objective is a sum of two descent directions. The first direction agrees with LMNN, indicating that our method updates the metric toward better discrimination in a weighted manner. The second direction controls the scale of $M$; the metric descends at a faster pace along directions with larger correlation. This suggests that our method functions as a data-dependent regularizer. Let $M^{(t)}$ denote the Mahalanobis matrix learned at the $t$-th iteration. The metric is updated as $M^{(t+1)} = M^{(t)} - \eta\, \nabla_M \ell(M^{(t)})$, where $\eta$ denotes the learning rate. To guarantee the PSD property, we factorize the updated matrix via eigendecomposition and truncate all negative eigenvalues to zero, i.e. $M \leftarrow \sum_i [\sigma_i]_+\, v_i v_i^\top$, where $(\sigma_i, v_i)$ are the eigenpairs.
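The eigenvalue-truncation step is a standard projection onto the PSD cone; a sketch (`project_psd` is our name, not the paper's):

```python
import numpy as np

def project_psd(M):
    """Project a (nearly) symmetric matrix onto the PSD cone: eigendecompose
    and truncate all negative eigenvalues to zero."""
    M = (M + M.T) / 2.0               # symmetrize against numerical drift
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T

# One projected-gradient step would then read: M = project_psd(M - eta * grad)
```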
2.3 Extension to high-dimensional data
As PCA is often applied to preprocess high-dimensional data prior to metric learning, we propose an extension so that a distance metric learned in the low-dimensional PCA subspace can still achieve certified robustness against perturbation in the original high-dimensional instance space. Defining the perturbation loss in conjunction with PCA is feasible because our derivation builds on the linear transformation induced by the distance metric, and PCA likewise performs a linear transformation to map the data onto a lower-dimensional subspace. Let $P \in \mathbb{R}^{p \times d}$ denote the linear transformation matrix obtained from PCA, where $d$ is the original feature dimension and $p$ is the reduced feature dimension. Following the same principle as before, the support point $\tilde{x}_i$ should be the closest point to $x_i$ in the original high-dimensional instance space that lies on the perpendicular bisector of the points $LPx_j$ and $LPx_k$, i.e. the images of $x_j$ and $x_k$ after first mapping to the low-dimensional subspace by $P$ and then to the feature space by $L$. The mathematical formulation is as follows:
$$\tilde{x}_i = \operatorname*{arg\,min}_{\tilde{x}} \; \|\tilde{x} - x_i\|_2^2 \quad \text{s.t.} \quad d_M^2(P\tilde{x}, Px_j) = d_M^2(P\tilde{x}, Px_k) \qquad (6)$$
As shown in Appendix B.1, $\tilde{x}_i$ again has a closed-form solution, and the equations for the adversarial margin and the perturbation loss extend accordingly.
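Since the bisector constraint of Eq. 6 is still linear in the perturbed point, the same projection argument as in Sec. 2.1 yields the extended margin. The sketch below is our own derivation under that assumption (the paper's closed form is in Appendix B.1, which we do not reproduce); it reduces to Eq. 3 when $P$ is the identity:

```python
import numpy as np

def adversarial_margin_pca(xi, xj, xk, M, P, eps=1e-10):
    """Adversarial margin measured in the original instance space when the
    metric M acts on PCA-projected data z = P x (extension of Eq. 3).
    P has shape (p, d): p reduced dimensions, d original dimensions."""
    dM = lambda a, b: (P @ (a - b)) @ M @ (P @ (a - b))
    num = (dM(xi, xk) - dM(xi, xj)) ** 2
    den = 4.0 * np.sum((P.T @ M @ (P @ (xj - xk))) ** 2) + eps
    return num / den
```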
2.4 Generalization benefit
From the perspective of algorithmic robustness [43], enlarging the adversarial margin can improve the generalization ability of triplet-based metric learning methods. The following generalization bound, i.e. the gap between the generalization error and the empirical error, follows from the pseudo-robustness theorem of [3]. Preliminaries and derivations are given in Appendix D.
Theorem 1. Let $M^*$ be the optimal solution to Eq. 5. Then, for any $\delta > 0$, with probability at least $1 - \delta$, we have:

(7)

where $n_\Delta$ denotes the number of triplets whose adversarial margins are larger than $\Delta$, and $B$ is a constant denoting the upper bound of the loss function (i.e. Eq. 5).
Enlarging the desired adversarial margin $\Delta$ affects two quantities in Eq. 7. On the one hand, the covering term decreases with $\Delta$ at a polynomial rate in the input dimensionality, and hence the upper bound on the generalization gap shrinks. On the other hand, the reduction in $n_\Delta$ increases the upper bound. However, $n_\Delta$ remains relatively stable as $\Delta$ increases, as long as most instances in the dataset do not have a small margin in the original instance space. Therefore, for this type of dataset, we expect enlarging the adversarial margin to improve the generalization ability of the learned distance metric.
3 Experiments
Table 1: Classification accuracy (%) of 3NN on clean UCI test data. Columns AML through LMNN-CR are LMNN-based; SCML and SCML-CR are SCML-based.

| Dataset | AML | LMNN | LDD | CAP | DRIFT | LMNN-CR | SCML | SCML-CR |
|---|---|---|---|---|---|---|---|---|
| Australian | 83.25±2.59 | 83.70±2.43 | 84.18±2.37 | 83.97±2.45 | 84.47±2.02 | 84.47±1.63 | 84.76±2.08 | 84.42±2.18 |
| Breast cancer | 97.10±1.21 | 97.12±1.25 | 96.95±1.51 | 97.00±1.08 | 96.98±1.16 | 97.02±1.30 | 97.00±1.09 | 97.07±1.24 |
| Fourclass | 75.12±2.35 | 75.10±2.31 | 75.15±2.32 | 75.02±2.48 | 75.08±2.34 | 75.12±2.35 | 75.10±2.27 | 75.12±2.35 |
| Haberman | 72.58±4.00 | 72.19±3.89 | 72.42±3.95 | 71.52±3.54 | 72.02±3.94 | 72.64±4.29 | 72.75±3.79 | 72.36±4.38 |
| Iris | 87.00±5.41 | 87.11±5.08 | 87.67±4.70 | 86.67±5.49 | 85.89±4.46 | 87.33±4.73 | 86.89±6.40 | 87.44±5.31 |
| Segment | 95.21±0.72 | 95.31±0.89 | 95.58±0.81 | 95.51±0.70 | 95.75±0.65 | 95.64±0.83 | 92.61±6.65 | 93.95±1.47 |
| Sonar | 84.13±4.86 | 86.67±4.10 | 87.22±3.90 | 87.22±4.38 | 86.19±4.43 | 87.78±3.53 | 82.38±4.15 | 84.13±4.61 |
| Voting | 95.34±1.64 | 95.80±1.78 | 95.80±1.41 | 95.92±1.45 | 95.31±1.32 | 96.15±1.56 | 95.84±1.58 | 96.26±1.28 |
| WDBC | 96.93±1.39 | 96.99±1.30 | 96.96±1.43 | 96.99±1.51 | 96.70±1.16 | 97.13±1.33 | 97.25±1.30 | 97.25±1.52 |
| Wine | 97.13±1.75 | 97.31±1.94 | 96.67±1.76 | 96.85±2.26 | 97.69±1.79 | 97.69±1.89 | 97.69±1.79 | 97.22±2.04 |
| # outperform | – | 9 | 8 | 10 | 9 | – | 7 | – |

For methods with LMNN as the backbone, the best results are shown in bold and the second best are underlined; for methods with SCML as the backbone, the best are shown in bold. "# outperform" counts the number of datasets on which LMNN-CR (resp. SCML-CR) outperforms or performs equally well with the LMNN-based (resp. SCML) methods.
In this section, we evaluate the generalization performance and robustness of the proposed method on 12 benchmark datasets (10 low/medium-dimensional and two high-dimensional), followed by a comparison of computational cost. In Appendix E.3, we present experiments on three synthetic datasets to illustrate the difference in learning behavior between LMNN and the proposed method.
3.1 Experiments on UCI data
3.1.1 Data description and experimental setting
We evaluate the proposed LMNN-CR and SCML-CR on 10 UCI datasets [10]. All datasets are preprocessed with mean-centering and standardization, followed by normalization to unit length. We use 70–30% training–test partitions and report the performance over 20 rounds.
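The preprocessing pipeline above can be sketched as follows; whether the statistics are computed on the training split only is our assumption, as the text does not specify:

```python
import numpy as np

def preprocess(X_train, X_test):
    """Mean-center and standardize each feature, then scale every instance to
    unit Euclidean length. Statistics come from the training split only
    (our assumption)."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-12          # avoid division by zero
    out = []
    for X in (X_train, X_test):
        Z = (X - mu) / sd
        out.append(Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12))
    return out
```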
The proposed methods are compared with two types of methods. First, we consider different regularizers on $M$. Specifically, we replace the regularizer in LMNN with the log-determinant divergence (LDD) [7], which encourages learning a metric close to the identity matrix, and with the capped trace norm (CAP) [15], which encourages a low-rank matrix. Second, we compare with DRIFT [48], which models the perturbation distribution explicitly. We also report the performance of adversarial metric learning (AML) [6]; however, it is not directly comparable to our method, as it learns from pairwise constraints. In all experiments, triplet constraints are generated from 3 target neighbors and 10 nearest impostors, computed under the Euclidean distance.
Hyperparameters of our methods are tuned via random search [4]: we randomly sample 50 sets of values from predefined ranges, where $U(a, b)$ denotes the uniform distribution and $P_q$ denotes the $q$-th percentile of pairwise distances calculated with respect to the Euclidean distance. Information about the datasets, optimization details of the proposed and competing methods, and an evaluation of hyperparameter sensitivity are given in Appendices E.1, E.2 and E.5, respectively.

3.1.2 Evaluation on classification performance
Table 1 reports the classification accuracy of 3NN. LMNN-CR outperforms LMNN on 9 out of 10 datasets. Among the methods with LMNN as the backbone, our method achieves the highest accuracy on 6 datasets and the second highest on the remaining 4. SCML-CR outperforms or performs equally well with SCML on 7 datasets. These experimental results demonstrate the benefit of the perturbation loss for the generalization of the learned distance metric.
3.1.3 Investigation into robustness
We start with an in-depth study on the Australian dataset to investigate the relationship between the perturbation loss, the adversarial margin, and robustness against instance perturbation. First, we compare the adversarial margins obtained from LMNN and LMNN-CR. Instances with near-zero adversarial margins are incapable of withstanding perturbation. From Fig. 2(a), we see that nearly half of these vulnerable instances have a larger adversarial margin after learning with the proposed loss. Next, we evaluate robustness by adding two types of zero-mean Gaussian noise to the test data: spherical Gaussian noise, with a diagonal covariance matrix and equal variances, and general Gaussian noise, with unequal variances. The noise intensity is controlled via the signal-to-noise ratio (SNR), and the test data is augmented to a sample size of 10,000. Fig. 2(b) plots the classification accuracy of LMNN-based methods under different levels of spherical Gaussian noise. When the noise intensity is low, the performance of LMNN and LMNN-CR remains stable. When the noise intensity increases to an SNR of 10 dB or 5 dB, the performance of both methods degrades; owing to the enlarged adversarial margin, the influence on LMNN-CR is slightly smaller than that on LMNN. When the SNR equals 1 dB, the performance gain from LMNN-CR becomes smaller. This result is reasonable, as the desired margin is selected according to the criterion of classification accuracy and hence may be too small to withstand a high level of noise. LMNN-CR surpasses all other LMNN-based methods until the noise intensity becomes very large. Fig. 2(c) plots the accuracy under general Gaussian noise; the degradation of all methods is more pronounced in this case, but the pattern remains similar.

Table 2: Classification accuracy (%) of 3NN on test data corrupted by Gaussian noise (SNR = 5 dB). Columns AML through LMNN-CR are LMNN-based; SCML and SCML-CR are SCML-based.

| Dataset | AML | LMNN | LDD | CAP | DRIFT | LMNN-CR | SCML | SCML-CR |
|---|---|---|---|---|---|---|---|---|
| Australian | 82.26±1.62 | 82.13±1.52 | 82.57±1.55 | 81.82±1.52 | 81.97±1.53 | 82.90±1.53 | 82.59±1.70 | 82.84±1.64 |
| Breast cancer | 96.70±1.01 | 96.24±1.06 | 96.66±1.07 | 96.27±1.03 | 96.61±0.97 | 96.69±1.07 | 96.34±1.03 | 96.63±1.03 |
| Fourclass | 69.00±1.06 | 67.74±1.25 | 68.84±1.14 | 67.84±1.19 | 69.13±1.02 | 69.04±1.11 | 68.22±1.10 | 68.96±1.11 |
| Haberman | 70.21±1.84 | 70.21±1.84 | 70.25±1.82 | 69.39±2.06 | 69.31±2.50 | 70.25±1.90 | 69.98±1.64 | 70.24±1.85 |
| Iris | 79.07±3.25 | 78.75±2.96 | 79.04±3.17 | 77.90±3.31 | 78.57±3.09 | 79.20±3.08 | 78.32±3.60 | 79.18±3.13 |
| Segment | 85.87±0.70 | 79.03±3.37 | 83.49±1.17 | 82.77±2.49 | 83.88±1.33 | 82.13±2.70 | 61.28±9.78 | 62.86±8.76 |
| Sonar | 83.50±3.38 | 83.54±4.30 | 86.18±2.93 | 85.44±2.79 | 84.65±3.30 | 84.99±3.13 | 76.91±4.32 | 79.49±3.80 |
| Voting | 94.10±1.07 | 94.01±1.00 | 94.24±1.13 | 94.37±1.17 | 93.94±1.12 | 94.64±1.21 | 93.99±1.15 | 94.65±1.09 |
| WDBC | 96.47±1.12 | 92.01±1.65 | 96.30±0.94 | 96.14±1.11 | 96.02±0.88 | 96.07±0.89 | 95.75±1.29 | 96.22±1.14 |
| Wine | 95.03±1.14 | 93.27±1.62 | 93.97±1.38 | 93.87±1.49 | 94.55±1.15 | 94.44±1.21 | 93.92±1.55 | 94.52±1.33 |
| # outperform | – | 10 | 6 | 7 | 7 | – | 10 | – |
We now turn to testing robustness on all datasets. Gaussian noise of 5 dB is added to the test data. Table 2 shows that LMNN-CR and SCML-CR improve the robustness of the corresponding baselines on all datasets, which clearly demonstrates the benefit of the perturbation loss in improving robustness. Moreover, LMNN-CR is superior to the robust metric learning methods CAP and DRIFT on 7 datasets. The method LDD is also quite robust to perturbation; however, this should not be surprising, as it encourages learning a metric close to the Euclidean distance, and the Euclidean distance is less sensitive to perturbation than the discriminative Mahalanobis distance. The performance under spherical Gaussian noise is similar to that under general Gaussian noise, as shown in Appendix E.4.
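The SNR-controlled noise used above can be generated as follows. This is a sketch under our assumptions: the exact variance allocation for the unequal-variance (general Gaussian) case is not specified in the text, so the per-feature weights here are illustrative:

```python
import numpy as np

def add_gaussian_noise(X, snr_db, spherical=True, rng=None):
    """Add zero-mean Gaussian noise whose average power is set by the target
    signal-to-noise ratio (in dB)."""
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    if spherical:
        sd = np.full(X.shape[1], np.sqrt(noise_power))   # equal variances
    else:
        w = rng.uniform(0.5, 1.5, size=X.shape[1])       # illustrative spread
        sd = np.sqrt(noise_power * w / w.mean())         # unequal, same mean power
    return X + rng.normal(size=X.shape) * sd
```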
3.2 Experiments on highdimensional data
Table 3: Classification accuracy (%) and average adversarial margin on high-dimensional data. Columns SG-20/SG-5 and G-20/G-5 report robustness against spherical Gaussian and general Gaussian noise with SNR of 20 dB and 5 dB; values in brackets are the average perturbation size, calculated as the mean norm of the added noise.

Isolet:

| Method | Clean | SG-20 (0.081) | SG-5 (0.423) | G-20 (0.059) | G-5 (0.318) | Adv. margin |
|---|---|---|---|---|---|---|
| LMNN | 90.1±4.5 | 90.1±4.1 | 86.0±3.5 | 90.2±4.0 | 87.8±3.9 | 0.110 |
| LMNN-CR | 91.1±3.7 | 91.0±3.8 | 87.9±3.3 | 91.1±3.7 | 89.4±3.8 | 0.125 |
| CAP | 91.1±3.7 | 91.1±3.9 | 89.0±4.0 | 91.1±3.7 | 89.9±3.9 | 0.151 |
| CAP-CR | 91.6±4.0 | 91.5±3.9 | 89.9±3.7 | 91.5±3.9 | 90.7±3.7 | 0.156 |
| SCML | 90.7±4.1 | 90.3±4.2 | 86.5±4.2 | 90.5±4.1 | 88.5±3.7 | 0.068 |
| SCML-CR | 90.8±4.2 | 90.7±4.1 | 86.5±3.7 | 90.8±4.2 | 88.4±4.1 | 0.082 |

MNIST:

| Method | Clean | SG-20 (0.054) | SG-5 (0.294) | G-20 (0.065) | G-5 (0.348) | Adv. margin |
|---|---|---|---|---|---|---|
| LMNN | 90.6 | 90.0 | 88.4 | 90.1 | 88.4 | 0.153 |
| LMNN-CR | 91.2 | 91.4 | 90.8 | 91.5 | 90.4 | 0.223 |
| CAP | 91.7 | 91.8 | 91.4 | 91.8 | 90.7 | 0.222 |
| CAP-CR | 92.0 | 91.9 | 90.9 | 92.0 | 90.7 | 0.226 |
| SCML | 89.0 | 88.8 | 87.4 | 88.9 | 86.5 | 0.122 |
| SCML-CR | 89.2 | 89.2 | 88.5 | 89.4 | 88.1 | 0.143 |
We verify the efficacy of the extended LMNN-CR proposed in Sec. 2.3 on the following datasets:

MNIST2k [5]: the first 2,000 training images and first 2,000 test images of the MNIST database. We apply PCA to reduce the feature dimension from 784 to 141, accounting for 95% of the total variance. All methods are evaluated once on the given training/test partition.

Isolet [10]: this spoken-letter database includes 7,797 instances, grouped into four training sets and one test set. Applying PCA reduces the feature dimension from 617 to 170. All methods are trained four times, once on each training set, and evaluated on the given test set.
In addition, we introduce CAP-CR, which combines the triplet loss of LMNN, the proposed perturbation loss, and the low-rank regularizer of CAP. For a fair comparison, CAP-CR uses the same rank and regularization weight as CAP; the remaining hyperparameters are tuned from 10 randomly sampled sets of values.
Table 3 compares the generalization and robustness performance of LMNN, CAP, SCML and our methods; the accuracies of the other methods are inferior to LMNN-CR and are reported in Appendix E.4. First, on both datasets our method achieves higher clean accuracy than the baseline methods, validating its effectiveness in enhancing the generalization ability of the learned distance metric. Second, when the average adversarial margin is larger than the average perturbation size (SNR = 20 dB), our method maintains its superiority, demonstrating that the adversarial margin is indeed a contributing factor in achieving certified robustness. When the margin is smaller than the perturbation size, our method still improves the accuracy of LMNN on both datasets, of CAP on Isolet, and of SCML on MNIST. Third, CAP-CR obtains higher accuracy than LMNN-CR on both clean and noise-contaminated data, suggesting that the regularizer and the perturbation loss impose different requirements on the metric, and that combining them has the potential to learn a more effective distance metric.
3.3 Computational cost
Table 4: Running time of LMNN-based methods.

| Dataset | LMNN | LDD | CAP | DRIFT | LMNN-CR |
|---|---|---|---|---|---|
| Australian | 13.44 | 0.83 | 3.07 | 1.00 | 2.15 |
| Segment | 27.48 | 10.45 | 11.47 | 5.12 | 19.54 |
| Sonar | 4.93 | 4.08 | 4.65 | 0.92 | 6.75 |
| WDBC | 9.38 | 2.94 | 5.22 | 5.12 | 8.17 |
| Isolet | 339.57 | 207.69 | 176.50 | N/A | 190.55 |
| MNIST | 369.55 | 68.98 | 180.68 | 37.51 | 391.04 |
We now analyze the computational complexity of LMNN-CR. Compared with LMNN, our method requires additional gradient calculations for the perturbation loss (Sec. 2.2). The cost of these calculations grows linearly with the number of triplets and with the number of training instances, so the total complexity of our method is of the same order as that of LMNN.
Table 4 compares the running time of LMNN-based methods on four UCI datasets that are large in sample size or dimensionality, and on the two high-dimensional datasets. The computational cost of our method is comparable to that of LMNN.
4 Conclusion
In this paper, we demonstrate that the robustness and generalization of distance metrics can be enhanced by enforcing a larger margin in the instance space. By taking advantage of the linear transformation induced by the Mahalanobis distance, we obtain an explicit formula for the support points and push them away from the training instances by penalizing the perturbation loss. Extensive experiments verify that our method effectively enlarges the adversarial margin, achieves certified robustness, and maintains excellent classification performance. Future work includes jointly learning the perturbation distribution and the distance metric, and extending the idea to nonlinear metric learning methods.
References
 [1] (2015) Contractive rectifier networks for nonlinear maximum margin classification. In IEEE International Conference on Computer Vision, pp. 2515–2523. Cited by: §1.

[2]
(2015)
Metric learning.
Synthesis Lectures on Artificial Intelligence and Machine Learning
9 (1), pp. 1–151. Cited by: §1.  [3] (2015) Robustness and generalization for metric learning. Neurocomputing 151, pp. 259–267. Cited by: §D.2, §2.4, Definition 1, Theorem 2.
 [4] (2012) Random search for hyperparameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §3.1.1.
 [5] (2010) Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8), pp. 1548–1560. Note: Data: http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html Cited by: item 1.
 [6] (2018) Adversarial metric learning. In International Joint Conference on Artificial Intelligence, pp. 2021–2027. Cited by: §1, §3.1.1.
 [7] (2007) Informationtheoretic metric learning. In International Conference on Machine Learning, pp. 209–216. Cited by: §1, §3.1.1.
 [8] (2020) MMA training: direct input space margin maximization through adversarial training. In International Conference on Learning Representations, Cited by: §1.
 [9] (2020) Learning local metrics and influential regions for classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (6), pp. 1522–1529. Cited by: §1.
 [10] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: item 2, §3.1.1.

[11]
(2018)
Deep adversarial metric learning.
In
IEEE Conference on Computer Vision and Pattern Recognition
, pp. 2780–2789. Cited by: §1.  [12] (2018) Large margin deep networks for classification. In Advances in Neural Information Processing Systems, pp. 842–852. Cited by: §1.
 [13] (2020) Metric learning from imbalanced data with generalization guarantees. Pattern Recognition Letters 133, pp. 298–304. Cited by: §1.
 [14] (2017) Sharable and individual multiview metric learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (9), pp. 2281–2288. Cited by: §1.
 [15] (2016) Robust and effective metric learning using capped trace norm: metric learning via capped trace norm. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1605–1614. Cited by: §1, §3.1.1.
 [16] (2009) Regularized distance metric learning: theory and algorithm. In Advances in Neural Information Processing Systems, pp. 862–870. Cited by: §1.
 [17] (2012) Large scale metric learning from equivalence constraints. In IEEE Conference on Computer vision and Pattern Recognition, pp. 2288–2295. Cited by: §1.
 [18] (2013) Metric learning: a survey. Foundations and Trends® in Machine Learning 5 (4), pp. 287–364. Cited by: §1.
 [19] (2003) Learning with idealized kernels. In International Conference on Machine Learning, pp. 400–407. Cited by: §1.

[20]
(2013)
Quadrupletwise image similarity learning
. In IEEE International Conference on Computer Vision, pp. 249–256. Cited by: §1.  [21] (2014) Fantope regularization in metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1051–1058. Cited by: §1.
 [22] (2020) Revisiting metric learning for fewshot image classification. Neurocomputing. Cited by: §1.
 [23] (2013) Robust structural metric learning. In International Conference on Machine Learning, pp. 615–623. Cited by: §1.
 [24] (2019) Fast lowrank metric learning for largescale and highdimensional data. In Advances in Neural Information Processing Systems, pp. 817–827. Cited by: §1.
 [25] (2019) Learning robust distance metric with side information via ratio minimization of orthogonally constrained norm distances. In International Joint Conference on Artificial Intelligence, Cited by: §1.
 [26] (2018) Matrix variate Gaussian mixture distribution steered robust metric learning. In AAAI Conference on Artificial Intelligence, Cited by: §1.
 [27] (2010) Metric learning to rank. In International Conference on Machine Learning, pp. 775–782. Cited by: §1.
 [28] (2012) The matrix cookbook, nov 2012. URL http://www2. imm. dtu. dk/pubdb/p. php 3274. Cited by: Appendix B.
 [29] (2018) Largescale distance metric learning with uncertainty. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8542–8550. Cited by: §1.
 [30] (2020) Revisiting training strategies and generalization performance in deep metric learning. In International Conference on Machine Learning, Cited by: §1.
 [31] (2014) Sparse compositional metric learning. In AAAI Conference on Artificial Intelligence, Cited by: Appendix C, §2.2.
 [32] (2016) Improved deep metric learning with multiclass npair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §1.
 [33] (2017) Parameter free large margin nearest neighbor for distance metric learning. In AAAI Conference on Artificial Intelligence, Cited by: §1.
 [34] (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: Table 5, §1, §2.1.
 [35] (2015) CMU Machine Learning 10-725 Lecture Slides: Proximal Gradient Descent and Acceleration. Note: URL: https://www.stat.cmu.edu/~ryantibs/convexopt-S15/lectures/08-prox-grad.pdf. Last visited on 18/May/2020 Cited by: §E.2.
 [36] (2019) Metric entropy and its uses. In High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, pp. 121–158. External Links: Document Cited by: §D.2, Definition 2.
 [37] (2014) Robust distance metric learning via simultaneous ℓ1-norm minimization and maximization. In International Conference on Machine Learning, pp. 1836–1844. Cited by: §1.
 [38] (2019) Multi-similarity loss with general pair weighting for deep metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5022–5030. Cited by: §1.
 [39] (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244. Cited by: §E.2, §1, §1.
 [40] (2008) Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognition 41 (12), pp. 3600–3612. Cited by: §1.
 [41] (2003) Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pp. 521–528. Cited by: §1.
 [42] (2006) Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics 7 (1), pp. 1–11. Cited by: §1.
 [43] (2012) Robustness and generalization. Machine learning 86 (3), pp. 391–423. Cited by: §1, §2.4.
 [44] (2018) Bilevel distance metric learning for robust image recognition. In Advances in Neural Information Processing Systems, pp. 4198–4207. Cited by: §1.
 [45] (2019) Adversarial margin maximization networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. External Links: Document, Link Cited by: §1.
 [46] (2018) Retrieving and classifying affective images via deep metric learning. In AAAI Conference on Artificial Intelligence, Cited by: §1.
 [47] (2006) Distance metric learning: a comprehensive survey. Michigan State University 2 (2), pp. 4. Cited by: §1.
 [48] (2017) Learning Mahalanobis distance metric: considering instance disturbance helps. In International Joint Conference on Artificial Intelligence, pp. 3315–3321. Cited by: §1, §3.1.1.
Appendix A Summary of main concepts
adversarial example – a perturbed instance; the perturbation changes the label of the instance's nearest neighbor (NN) in the feature space from the same class to a different class. In other words, the perturbation forces the NN classifier to produce an incorrect prediction [34].

support point – the adversarial example that is closest to the training instance in the original instance space.
certified neighborhood – the largest hypersphere within which a training instance can be perturbed while its NN in the feature space remains an instance of the same class.
adversarial margin – the Euclidean distance between a training instance and its associated support point; it defines the radius of the certified neighborhood.
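The interplay of these concepts can be illustrated numerically. The following is a minimal sketch (with synthetic 2-D data and an illustrative metric, not values from the paper) of the scenario described in Fig. 1 (upper): a Mahalanobis metric that stretches one coordinate lets a tiny instance-space perturbation flip the nearest neighbor from a same-class instance to a different-class one.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)

# Illustrative metric M = L^T L that stretches the second coordinate.
L = np.array([[1.0, 0.0],
              [0.0, 10.0]])
M = L.T @ L

x_i = np.array([0.0, 0.0])   # training instance
x_j = np.array([1.0, 0.0])   # same-class neighbor
x_k = np.array([0.0, 0.2])   # different-class instance

# Before perturbation, the NN of x_i in the feature space is x_j.
assert mahalanobis_sq(x_i, x_j, M) < mahalanobis_sq(x_i, x_k, M)

# A small perturbation (Euclidean norm 0.11) flips the NN to x_k,
# i.e. x_i + delta is an adversarial example under this metric.
delta = np.array([0.0, 0.11])
assert mahalanobis_sq(x_i + delta, x_k, M) < mahalanobis_sq(x_i + delta, x_j, M)
```

The adversarial margin of x_i is the smallest such perturbation norm, and the certified neighborhood is the hypersphere of that radius.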
Appendix B Derivation of support point, adversarial margin, and gradient of perturbation loss
First, we define the hyperellipsoid via its quadratic form. An arbitrarily oriented hyperellipsoid centered at $\mathbf{c}$ is defined by the solutions $\mathbf{x}$ of the equation
$$(\mathbf{x}-\mathbf{c})^\top A(\mathbf{x}-\mathbf{c})=1,$$
where $A$ is a positive definite matrix. By the Cholesky decomposition, $A=R^\top R$. Therefore, finding the support point of $\mathbf{x}_i$ on the hyperellipsoid is equivalent to finding the point $\mathbf{x}$ on it that defines the smallest hypersphere centered at $\mathbf{x}_i$, i.e., that minimizes $\|\mathbf{x}-\mathbf{x}_i\|^2$.
The optimization problem of Eq. 1 is equivalent to the following problem:
$$\min_{\mathbf{x}}\ \|\mathbf{x}-\mathbf{x}_i\|^2 \quad \text{s.t.} \quad (\mathbf{x}-\mathbf{c})^\top A(\mathbf{x}-\mathbf{c})=1.$$
Applying the method of Lagrange multipliers, we transform the above problem into the Lagrangian function by introducing the multiplier $\lambda$,
$$\Lambda(\mathbf{x},\lambda)=\|\mathbf{x}-\mathbf{x}_i\|^2+\lambda\big[(\mathbf{x}-\mathbf{c})^\top A(\mathbf{x}-\mathbf{c})-1\big],$$
and then solve it by setting the first partial derivatives to zero:
$$\frac{\partial \Lambda}{\partial \mathbf{x}}=2(\mathbf{x}-\mathbf{x}_i)+2\lambda A(\mathbf{x}-\mathbf{c})=0 \;\Longrightarrow\; \mathbf{x}^*=(I+\lambda A)^{-1}(\mathbf{x}_i+\lambda A\mathbf{c}),$$
where $I$ denotes the identity matrix. The Hessian matrix equals $2(I+\lambda A)$, which is positive definite for $\lambda\ge 0$, and hence $\mathbf{x}^*$ is the minimum point. Substituting the specific $A$ and $\mathbf{c}$ of Eq. 1 gives Eq. 2.
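As a numerical sanity check of this closed form, the sketch below (with hypothetical values of $A$, $\mathbf{c}$, and $\mathbf{x}_i$; the multiplier $\lambda$ is found by root-finding on the constraint) verifies that the candidate point lies on the hyperellipsoid and is at least as close to $\mathbf{x}_i$ as a dense sampling of the boundary.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical problem data (illustration only, not from the paper).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # positive definite
c = np.array([3.0, -1.0])           # hyperellipsoid center
x_i = np.array([0.0, 0.0])          # training instance, outside the surface

def x_star(lam):
    """Candidate support point (I + lam*A)^{-1} (x_i + lam*A c)."""
    return np.linalg.solve(np.eye(2) + lam * A, x_i + lam * (A @ c))

def constraint(lam):
    """Signed violation of (x - c)^T A (x - c) = 1 at x = x_star(lam)."""
    d = x_star(lam) - c
    return d @ A @ d - 1.0

# Root-find the multiplier that places x* exactly on the hyperellipsoid.
lam = brentq(constraint, 0.0, 1e6)
xs = x_star(lam)
assert abs(constraint(lam)) < 1e-8

# Brute-force check: no densely sampled boundary point is closer to x_i.
theta = np.linspace(0.0, 2.0 * np.pi, 100000)
R = np.linalg.cholesky(A).T                      # Cholesky factor, A = R^T R
boundary = c + np.linalg.solve(R, np.vstack([np.cos(theta), np.sin(theta)])).T
assert np.linalg.norm(xs - x_i) <= np.linalg.norm(boundary - x_i, axis=1).min() + 1e-6
```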
The squared adversarial margin is calculated by first simplifying $\mathbf{x}^*-\mathbf{x}_i$, with $\mathbf{x}^*=(I+\lambda A)^{-1}(\mathbf{x}_i+\lambda A\mathbf{c})$, and then computing $\|\mathbf{x}^*-\mathbf{x}_i\|^2$ as follows:
$$\mathbf{x}^*-\mathbf{x}_i=(I+\lambda A)^{-1}\big[\mathbf{x}_i+\lambda A\mathbf{c}-(I+\lambda A)\mathbf{x}_i\big]=\lambda(I+\lambda A)^{-1}A(\mathbf{c}-\mathbf{x}_i),$$
$$\|\mathbf{x}^*-\mathbf{x}_i\|^2=\lambda^2(\mathbf{c}-\mathbf{x}_i)^\top A(I+\lambda A)^{-2}A(\mathbf{c}-\mathbf{x}_i).$$
Substituting the specific $A$ and $\mathbf{c}$ of Eq. 1 gives Eq. 3.
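The simplification of the squared margin can be checked numerically. The sketch below (hypothetical $A$, $\mathbf{c}$, $\mathbf{x}_i$; the support point is taken as $(I+\lambda A)^{-1}(\mathbf{x}_i+\lambda A\mathbf{c})$ with $\lambda$ solved from the hyperellipsoid constraint) verifies that the direct value of $\|\mathbf{x}^*-\mathbf{x}_i\|^2$ agrees with the simplified quadratic form.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical problem data (illustration only).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
c = np.array([3.0, -1.0])
x_i = np.array([0.0, 0.0])
I = np.eye(2)

# Support point x*(lam) and the multiplier placing it on the surface.
x_star = lambda t: np.linalg.solve(I + t * A, x_i + t * (A @ c))
g = lambda t: (x_star(t) - c) @ A @ (x_star(t) - c) - 1.0
lam = brentq(g, 0.0, 1e6)

direct = np.sum((x_star(lam) - x_i) ** 2)            # ||x* - x_i||^2
inv_ = np.linalg.inv(I + lam * A)
simplified = lam**2 * (c - x_i) @ A @ inv_ @ inv_ @ A @ (c - x_i)
assert np.isclose(direct, simplified)
```

The agreement is in fact an algebraic identity: it holds for any $\lambda$, not only the constrained one.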
Next, we derive the gradient of the perturbation loss with respect to the metric $M$. When the perturbation loss is inactive (including the degenerate hyperspherical case), the gradient equals zero. Otherwise, the gradient of the perturbation loss equals the gradient of the squared adversarial margin, which can be calculated by using the quotient rule for a ratio $u(M)/v(M)$,
$$\nabla_M\frac{u}{v}=\frac{v\,\nabla_M u-u\,\nabla_M v}{v^2},$$
together with the derivative of the trace [28], e.g. $\partial\,\mathrm{tr}(BM)/\partial M=B^\top$, where $\mathrm{tr}(\cdot)$ denotes the trace operator; the remaining terms are computed analogously. Substituting back gives Eq. 2.2.
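Independently of the specific form of Eq. 2.2, the metric-sensitivity of the squared margin can be sanity-checked by finite differences. The sketch below uses the standard Lagrangian sensitivity result $\partial\|\mathbf{x}^*-\mathbf{x}_i\|^2/\partial A=\lambda(\mathbf{x}^*-\mathbf{c})(\mathbf{x}^*-\mathbf{c})^\top$ (a textbook identity used here only as a check, not the paper's gradient formula), with hypothetical values of $A$, $\mathbf{c}$, and $\mathbf{x}_i$.

```python
import numpy as np
from scipy.optimize import brentq

def margin_sq_and_grad(A, c, x_i):
    """Squared margin ||x* - x_i||^2 on {x : (x-c)^T A (x-c) = 1},
    and its gradient w.r.t. A via Lagrangian sensitivity
    lam * (x* - c)(x* - c)^T."""
    I = np.eye(len(x_i))
    x_star = lambda t: np.linalg.solve(I + t * A, x_i + t * (A @ c))
    g = lambda t: (x_star(t) - c) @ A @ (x_star(t) - c) - 1.0
    lam = brentq(g, 0.0, 1e6)
    x = x_star(lam)
    return np.sum((x - x_i) ** 2), lam * np.outer(x - c, x - c)

# Hypothetical problem data (illustration only).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
c = np.array([3.0, -1.0])
x_i = np.zeros(2)

V, G = margin_sq_and_grad(A, c, x_i)

# Central finite differences along a random symmetric direction E.
rng = np.random.default_rng(0)
E = rng.standard_normal((2, 2))
E = (E + E.T) / 2.0
eps = 1e-6
Vp, _ = margin_sq_and_grad(A + eps * E, c, x_i)
Vm, _ = margin_sq_and_grad(A - eps * E, c, x_i)
fd = (Vp - Vm) / (2.0 * eps)
assert abs(fd - np.sum(G * E)) < 1e-3 * max(1.0, abs(fd))
```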
B.1 Extension for high-dimensional data
Support point, adversarial margin and gradient of the perturbation loss with dimensionality reduction are derived by following the same principle as in Appendix B.
The method of Lagrange multipliers is applied to derive a closed-form solution to the support point:
$$\mathbf{x}^*=(I+\lambda G)^{-1}(\mathbf{x}_i+\lambda G\mathbf{c}),$$
where $G$ denotes $L^\top L$.
The squared adversarial margin is calculated from the definition of the hyperellipsoid:
$$\|\mathbf{x}^*-\mathbf{x}_i\|^2=\lambda^2(\mathbf{c}-\mathbf{x}_i)^\top G(I+\lambda G)^{-2}G(\mathbf{c}-\mathbf{x}_i).$$
The perturbation loss is defined similarly to Eq. 4, with the full-rank metric replaced by $G=L^\top L$.
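A sketch of this dimensionality-reduced case (hypothetical shapes and values; $L\in\mathbb{R}^{r\times d}$ with $r<d$, so $M=L^\top L$ is rank-deficient): the same closed form for the support point applies, and perturbation components in the null space of $L$ leave the feature-space position unchanged.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical low-rank transformation: r = 2 features from d = 3 inputs.
L = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
M = L.T @ L                        # rank-2 PSD metric; null space = 3rd axis
c = np.zeros(3)                    # center of the (degenerate) hyperellipsoid
x_i = np.array([3.0, 3.0, 3.0])    # training instance, outside the surface

I = np.eye(3)
x_star = lambda t: np.linalg.solve(I + t * M, x_i + t * (M @ c))
g = lambda t: (x_star(t) - c) @ M @ (x_star(t) - c) - 1.0

lam = brentq(g, 0.0, 1e8)          # multiplier placing x* on the surface
xs = x_star(lam)

assert abs((xs - c) @ M @ (xs - c) - 1.0) < 1e-8  # x* lies on the surface
assert np.isclose(xs[2], x_i[2])   # null-space coordinate is untouched
margin = np.linalg.norm(xs - x_i)  # Euclidean adversarial margin
assert margin > 0
```

Because $M$ is only positive semidefinite here, the level set is an elliptic cylinder rather than a bounded hyperellipsoid, but the nearest point to $\mathbf{x}_i$ still exists and the null-space coordinate of $\mathbf{x}_i$ is preserved.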