Towards Certified Robustness of Metric Learning

06/10/2020 ∙ by Xiaochen Yang, et al. ∙ UCL

Metric learning aims to learn a distance metric such that semantically similar instances are pulled together while dissimilar instances are pushed away. Many existing methods consider maximizing or at least constraining a distance "margin" that separates similar and dissimilar pairs of instances to guarantee their performance on a subsequent k-nearest neighbor classifier. However, such a margin in the feature space does not necessarily lead to robustness certification or even the anticipated generalization advantage, since a small perturbation of a test instance in the instance space could still alter the model prediction. To address this problem, we advocate penalizing small distances between training instances and their nearest adversarial examples, and we show that the resulting new approach to metric learning enjoys a larger certified neighborhood with a theoretical performance guarantee. Moreover, drawing on an intuitive geometric insight, the proposed new loss term permits an analytically elegant closed-form solution and offers great flexibility in leveraging it jointly with existing metric learning methods. Extensive experiments demonstrate the superiority of the proposed method over state-of-the-art methods in terms of both discrimination accuracy and robustness to noise.


1 Introduction

Distance metric learning (DML) focuses on learning similarity or dissimilarity between data, and it has been actively researched in classification and clustering [47, 18, 2], as well as in domain-specific applications such as information retrieval [27, 46], computer vision [17, 14, 30] and bioinformatics [42]. A commonly studied distance metric is the generalized (squared) Mahalanobis distance, which defines the distance between any two instances $x_i, x_j \in \mathbb{R}^p$ as

$$d_M^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j),$$

where $M$ is a positive semidefinite (PSD) matrix. Owing to its PSD property, $M$ can be decomposed into $M = L^T L$ with $L \in \mathbb{R}^{d \times p}$; thus the Mahalanobis distance is equivalent to the Euclidean distance $\|Lx_i - Lx_j\|_2^2$ in the linearly transformed feature space. When $d < p$, instances $x_i$ and $x_j$ are transformed from a high-dimensional instance space to a low-dimensional feature space.
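The equivalence between the Mahalanobis distance and the Euclidean distance after a linear map can be checked numerically. The following is a minimal NumPy sketch; the dimensions, the random matrix `L`, and the helper name `mahalanobis_sq` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
p, d = 5, 3                       # instance and feature dimensions (d < p)
L = rng.standard_normal((d, p))   # hypothetical linear transformation
M = L.T @ L                       # PSD by construction

x_i = rng.standard_normal(p)
x_j = rng.standard_normal(p)

dist_metric = mahalanobis_sq(x_i, x_j, M)
# Same quantity computed as a squared Euclidean distance in the feature space.
dist_feature = float(np.sum((L @ x_i - L @ x_j) ** 2))
```

The two values agree up to floating-point round-off, which is why learning `M` and learning `L` are interchangeable views of the same metric.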

To learn a specific distance metric for each task, prior knowledge on instance similarity and dissimilarity should be provided as side information. Metric learning methods differ by the form of side information they use and the supervision encoded in similar and dissimilar pairs. For example, pairwise constraints enforce the distance between instances of the same class to be small (or smaller than a threshold value) and the distance between instances of different classes to be large (or larger than a threshold value) [41, 40, 38, 9, 13]. The thresholds could be either pre-defined or learned for similar and dissimilar pairs [7, 19]. In triplet constraints $(x_i, x_j, x_l)$, the distance between the different-class pair $(x_i, x_l)$ should be larger than the distance between the same-class pair $(x_i, x_j)$, typically by a margin [39, 33, 44, 24]. More recently, quadruplet constraints have been proposed, which require the difference in the distance of two pairs of instances to exceed a margin [20], and tuplet constraints extend the triplet constraint to multi-class classification [32, 22].

Figure 1: Comparison of traditional metric learning methods and the proposed method. While classical methods separate similar and dissimilar pairs by a margin (indicated by the gap between gray dashed circles), a small perturbation from $x_i$ to $x_i'$ in the instance space may change its nearest neighbor (NN) from $x_j$ to $x_l$ in the learned feature space. Our method aims to expand a certified neighborhood (indicated by blue dotted circle), defined as the largest hypersphere in which $x_i$ could be perturbed without any label change on its NN in the learned feature space. Points on line PB are equidistant from $x_j$ and $x_l$ with respect to the learned distance.

The gap between thresholds in pairwise constraints and the margin in triplet and quadruplet constraints are both designed to learn a distance metric that could ensure good generalization of the subsequent $k$-nearest neighbor ($k$NN) classifier. However, such a separating margin imposed at the distance and decision level does not necessarily produce a robust metric; indeed, it may be sensitive to a small perturbation at the instance level. As illustrated in Fig. 1 (upper), a tiny perturbation from $x_i$ to $x_i'$ in the instance space can be magnified by the learned distance metric, leading to a change in its NN from $x_j$ to $x_l$ in the feature space and, even worse, to an incorrect label prediction if 1-NN is used.

In this paper, we propose a simple yet effective method to enhance the robustness of the learned distance metric against instance perturbation. The principal idea is to expand a certified neighborhood, defined as the largest hypersphere in which a training instance could be perturbed without changing the label of its nearest neighbor (or $k$ nearest neighbors if required) in the feature space.

Our contributions are mainly fourfold. Firstly, we derive an analytically elegant solution to the radius of the certified neighborhood (Sec. 2.1). It is equivalent to the distance between a training instance and its nearest adversarial example [34], termed the support point. Building on a geometric insight, the support point can be easily identified as the closest point to the training instance in the instance space that lies on the decision boundary in the feature space. Secondly, we define a new perturbation loss that penalizes the radius for being small, or equivalently, encourages an expansion of the certified neighborhood (Sec. 2.1), which can be optimized jointly with any existing triplet-based metric learning method (Sec. 2.2). The optimization problem suggests that our method learns a discriminative metric in a weighted manner and simultaneously imposes a data-dependent regularization. Thirdly, because learning a distance metric for high-dimensional data may suffer from overfitting, we extend the perturbation loss so that the metric could be learned based on PCA transformed data in a low-dimensional subspace while retaining the ability to withstand perturbation in the original high-dimensional instance space (Sec. 2.3). Fourthly, we show the benefit of expanding a certified neighborhood to the generalization ability of the learned distance metric by using the theoretical technique of algorithmic robustness [43] (Theorem 3, Sec. 2.4). Experiments in noise-free and noisy settings show that the proposed method outperforms existing robust metric learning methods in terms of classification accuracy and validate its robustness to noise (Sec. 3).

Related work

To improve robustness to perturbation that is likely to exist in practice, many robust metric learning methods have been proposed, which can be categorized into three main types. The first type of methods imposes a structural assumption or regularization over $M$ so as to avoid overfitting [16, 23, 37, 21, 15, 26, 25]. However, structural information often exists in image datasets but is generally unavailable in the symbolic datasets studied in this paper. Regularization-based methods are proposed to reduce the risk of overfitting to feature noise. Our proposal, which aims to withstand perturbation, does not conflict with these methods and can be combined with them to learn a more effective and robust distance metric; an example is shown in Sec. 3.2. The second type of methods explicitly models the perturbation distribution or identifies clean latent examples [48, 29]. The expected Mahalanobis distance is then used to adjust the value of the separating margin. The third type of methods generates hard instances through adversarial perturbation and trains a metric to fare well in the new hard problem [6, 11]. Although sharing the aim of improving metric robustness, these methods approach the task at the data level by synthesizing real examples that incur large losses, while our method tackles perturbation at the model level by designing a loss function that considers the definition of robustness with respect to the decision maker, $k$NN. By preventing change in the nearest neighbor in a strict manner, our method can obtain a certification on the adversarial margin. Finally, we note that a large margin in the instance space has been studied in deep neural networks for enhancing robustness and generalization ability [1, 12, 45, 8]. In contrast, our paper investigates such a margin in the framework of metric learning, defines it specifically with respect to the $k$NN classifier, and provides an exact and analytical solution to the margin.

Notation

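Let $\{(x_i, y_i)\}_{i=1}^n$ denote the set of training instance and label pairs, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$ and $y_i \in \mathcal{Y}$; $\mathcal{X}$ is called the instance space in this paper. Our framework is based on triplet constraints $(x_i, x_j, x_l)$ and we adopt the strategy of [39] for generating triplets: $x_j$ is termed the target neighbor of $x_i$ and $x_l$ is termed the impostor. $d_E^2$ and $d_M^2$ denote the squared Euclidean and Mahalanobis distances, respectively; $M \in \mathbb{S}_+^p$, where $\mathbb{S}_+^p$ is the cone of real-valued PSD matrices. $|\cdot|$ denotes the cardinality of a set. $\mathbb{1}[\cdot]$ denotes the indicator function. $[a]_+ = \max(a, 0)$ for $a \in \mathbb{R}$.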

2 Proposed approach

In this section, we first derive an explicit formula for the support point and provide the rationale behind the advocated perturbation loss, followed by its optimization problem. We then extend the method for high-dimensional data. Lastly, we discuss the benefit of our method to the generalization ability of the learned metric. Main concepts are illustrated in Fig. 2 and listed in Table 5 of Appendix A.

2.1 Support point and perturbation loss

As mentioned in the introduction, a learned distance metric may be sensitive to perturbation in the sense that a small change of the instance could alter its nearest neighbor in the learned feature space from an instance of the same class to one of a different class, consequently increasing the risk of misclassification by $k$NN. A perturbed point that causes a change in the nearest neighbors, and thus in the prediction, is termed an adversarial example [34]; if the adversarial examples of an instance are all far away from the instance itself, a high degree of robustness is expected. Based on this reasoning, we will construct a loss function to penalize a small distance between a training instance $x_i$ and its closest adversarial example (i.e. support point), thereby allowing $x_i$ to retain prediction correctness even when perturbed to a larger extent.

We start by building a geometric insight into the support point: for any instance $x_i$ associated with the triplet constraint $(x_i, x_j, x_l)$, the support point $x_{i,\min}$ is the closest point to $x_i$ in the instance space that lies on the decision boundary formed by $x_j$ and $x_l$ in the feature space. Note that closeness is defined in the instance space and will be calculated using the Euclidean distance, since we target changes on the original features of an instance, and that the decision boundary is found in the feature space, since NNs are identified by using the Mahalanobis distance. Mathematically, we can formulate the support point as follows:

$$x_{i,\min} = \arg\min_{x \in \mathbb{R}^p} \; (x - x_i)^T A_0 (x - x_i) \quad \text{s.t.} \quad \Big(Lx - \frac{Lx_j + Lx_l}{2}\Big)^T (Lx_l - Lx_j) = 0. \qquad (1)$$

With a pre-given positive definite matrix $A_0$, the objective function of Eq. 1 defines an arbitrarily oriented hyperellipsoid, representing any heterogeneous and correlated perturbation. Without prior knowledge on the perturbation, we simplify $A_0$ as the identity matrix. In this case, the objective function defines a hypersphere, representing perturbation of equal magnitude in all directions. It can also be interpreted as minimizing the Euclidean distance from the training instance $x_i$. For clarity, we always refer to the certified neighborhood as the largest hypersphere in this paper; the hyperellipsoid case is discussed in Appendix B. The constraint defines the decision boundary, which is the perpendicular bisector of points $Lx_j$ and $Lx_l$. In other words, it is a hyperplane that is perpendicular to the line joining points $Lx_j$ and $Lx_l$ and passes through their midpoint $\frac{Lx_j + Lx_l}{2}$; all points on the hyperplane are equidistant from $Lx_j$ and $Lx_l$.

Figure 2: Explanation of main concepts: Given a triplet constraint $(x_i, x_j, x_l)$, the decision boundary for $x_i$ is the perpendicular bisector of $Lx_j$ and $Lx_l$. Points on the right-hand side of this bisector are adversarial examples. The support point $x_{i,\min}$ is defined as the nearest adversarial example in the instance space. The Euclidean distance between $x_i$ and $x_{i,\min}$ is called the adversarial margin; it will be enlarged by penalizing the perturbation loss.

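Since Eq. 1 minimizes a convex quadratic function with an equality constraint, we can find an explicit formula for the support point by using the method of Lagrangian multipliers; please see Appendix B for the detailed derivation:

$$x_{i,\min} = x_i + \frac{\big(\frac{x_j + x_l}{2} - x_i\big)^T M (x_l - x_j)}{(x_l - x_j)^T M^2 (x_l - x_j)} \, M (x_l - x_j). \qquad (2)$$

With a closed-form solution of $x_{i,\min}$, we can now calculate the squared Euclidean distance between $x_i$ and $x_{i,\min}$:

$$d_E^2(x_i, x_{i,\min}) = \frac{\big(d_M^2(x_i, x_l) - d_M^2(x_i, x_j)\big)^2}{4 \, (x_l - x_j)^T M^2 (x_l - x_j)}. \qquad (3)$$

For clarity, we will call $d_E(x_i, x_{i,\min})$ the adversarial margin, in contrast to the distance margin as in LMNN. It defines the radius of the certified neighborhood.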
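As a sanity check on the closed form, the support point can be computed as the Euclidean projection of a training instance onto the hyperplane whose normal is $M(x_l - x_j)$ and which passes through the midpoint of the target neighbor $x_j$ and the impostor $x_l$. The sketch below assumes the perturbation matrix is the identity (the hypersphere case); the function names and random test data are illustrative, not taken from the paper.

```python
import numpy as np

def support_point(x_i, x_j, x_l, M):
    """Euclidean projection of x_i onto the bisector of L x_j and L x_l."""
    w = M @ (x_l - x_j)                  # boundary normal in the instance space
    midpoint = 0.5 * (x_j + x_l)
    step = ((midpoint - x_i) @ w) / (w @ w)
    return x_i + step * w

def adversarial_margin_sq(x_i, x_j, x_l, M):
    """Squared Euclidean distance from x_i to its support point."""
    def d_sq(a, b):
        return (a - b) @ M @ (a - b)
    w = M @ (x_l - x_j)                  # w @ w equals (x_l-x_j)^T M^2 (x_l-x_j)
    return (d_sq(x_i, x_l) - d_sq(x_i, x_j)) ** 2 / (4.0 * (w @ w))

rng = np.random.default_rng(1)
L = rng.standard_normal((3, 4))          # hypothetical learned transformation
M = L.T @ L
x_i, x_j, x_l = rng.standard_normal((3, 4))
x_min = support_point(x_i, x_j, x_l, M)
```

In this sketch, `x_min` is equidistant from `x_j` and `x_l` under the Mahalanobis distance, and `np.sum((x_i - x_min) ** 2)` matches `adversarial_margin_sq(x_i, x_j, x_l, M)` up to round-off.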

To improve the robustness of the distance metric, we design a perturbation loss to encourage an expansion of the certified neighborhood. Two situations need to be distinguished here. Firstly, when the nearest neighbor of $x_i$ is an instance from the same class, we will penalize a small adversarial margin by using the hinge loss $[\tau^2 - d_E^2(x_i, x_{i,\min})]_+$. The reasons are that (a) the adversarial margin is generally smaller for hard instances that are close to the class boundary than for those located far away and (b) it is these hard instances that are more vulnerable to perturbation and demand an improvement in their robustness. Therefore, we introduce $\tau$ for directing attention to hard instances and controlling the desired margin. Secondly, in the other situation where the nearest neighbor of $x_i$ belongs to a different class, metric learning should focus on satisfying the distance requirement specified in the triplet constraint. In this case, we simply assign a large penalty of $\tau^2$ to promote a non-increasing loss function. Integrating these two situations leads to the proposed perturbation loss:

$$J_P = \frac{1}{|\mathcal{R}|} \sum_{\mathcal{R}} \Big\{ \big[\tau^2 - \tilde{d}_E^2(x_i, x_{i,\min})\big]_+ \, \mathbb{1}\big[d_M^2(x_i, x_l) > d_M^2(x_i, x_j)\big] + \tau^2 \, \mathbb{1}\big[d_M^2(x_i, x_l) \le d_M^2(x_i, x_j)\big] \Big\}, \qquad (4)$$

where $\mathcal{R}$ denotes the set of triplets and $\sum_{\mathcal{R}}$ is an abbreviation for $\sum_{(x_i, x_j, x_l) \in \mathcal{R}}$. To prevent the denominator of Eq. 3 from being zero, which may happen when different-class instances $x_j$ and $x_l$ are close to each other, we add a small constant $\epsilon$ (=1e-10) to the denominator; that is, $\tilde{d}_E^2(x_i, x_{i,\min}) = \frac{(d_M^2(x_i, x_l) - d_M^2(x_i, x_j))^2}{4 \, ((x_l - x_j)^T M^2 (x_l - x_j) + \epsilon)}$.
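The two cases of the perturbation loss can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the name `tau` for the desired margin, the averaging over the number of triplets, and the index-based triplet format `(i, j, l)` are assumptions made for the example.

```python
import numpy as np

def perturbation_loss(X, triplets, M, tau, eps=1e-10):
    """Average penalty on small adversarial margins over triplets (i, j, l)."""
    total = 0.0
    for i, j, l in triplets:
        d_ij = (X[i] - X[j]) @ M @ (X[i] - X[j])   # distance to target neighbor
        d_il = (X[i] - X[l]) @ M @ (X[i] - X[l])   # distance to impostor
        if d_il > d_ij:
            # Target neighbor is closer: hinge on the squared adversarial margin.
            w = M @ (X[l] - X[j])
            margin_sq = (d_il - d_ij) ** 2 / (4.0 * (w @ w + eps))
            total += max(tau ** 2 - margin_sq, 0.0)
        else:
            # Impostor is at least as close: assign the fixed penalty tau^2.
            total += tau ** 2
    return total / len(triplets)

# Toy example on a line: x_0 = 0, target neighbor at 0.1, impostor at 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0]])
loss_large_tau = perturbation_loss(X, [(0, 1, 2)], np.eye(2), tau=1.0)
loss_small_tau = perturbation_loss(X, [(0, 1, 2)], np.eye(2), tau=0.5)
loss_violated = perturbation_loss(X, [(0, 2, 1)], np.eye(2), tau=1.0)
```

With the identity metric the bisector sits at 0.55 on the first axis, so the squared margin is 0.3025: the loss is about 0.6975 for `tau=1.0`, zero for `tau=0.5`, and exactly `tau ** 2` when the roles of target neighbor and impostor are swapped.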

2.2 Metric learning with certified robustness

As support points are derived from triplet constraints, it would be natural and straightforward to embed the proposed perturbation loss into a metric learning method that is also based on triplet constraints. LMNN is thus adopted as an example for its wide use and effective classification performance.

The objective function of the proposed LMNN with certified robustness (LMNN-CR) is as follows:

(5)

where stands for . The weight parameter controls the importance of perturbation loss () relative to the loss function of LMNN (). balances the impacts between pulling together target neighbors and pushing away impostors.

We adopt the projected gradient descent algorithm to solve the optimization problem (Eq. 5). The gradient of and are given as follows:

where ; and are defined similarly. The gradient of is a sum of two descent directions. The first direction agrees with LMNN, indicating that our method updates the metric toward better discrimination in a weighted manner. The second direction controls the scale of ; the metric will descend at a faster pace in the direction of a larger correlation between and . This suggests our method functions as a data-dependent regularization. Let denote the Mahalanobis matrix learned at the th iteration. The distance matrix will be updated as

where the step size is given by the learning rate. To guarantee the PSD property, we factorize the updated matrix via eigendecomposition and truncate all negative eigenvalues to zero.
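The projection step of projected gradient descent can be sketched as follows. The gradient itself is left as an input because its exact form depends on the objective; `lr` and the function names are hypothetical, and the symmetrization line is a numerical safeguard added for this example.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    M_sym = 0.5 * (M + M.T)                    # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M_sym)
    clipped = np.clip(eigvals, 0.0, None)      # truncate negative eigenvalues
    return (eigvecs * clipped) @ eigvecs.T

def projected_gradient_step(M, grad, lr):
    """One update: a gradient step followed by projection onto the PSD cone."""
    return project_psd(M - lr * grad)

M_next = projected_gradient_step(np.eye(2), np.array([[3.0, 0.0], [0.0, 0.0]]), lr=1.0)
```

Here the raw update `M - lr * grad` has eigenvalues -2 and 1; after projection the negative one is set to zero, so `M_next` is `[[0, 0], [0, 1]]`.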

The proposed perturbation loss is a generic approach to improving robustness to perturbation. In Appendix C, we give another example which incorporates the perturbation loss into the recent triplet-based method SCML [31]; the new method is termed SCML with certified robustness (SCML-CR).

2.3 Extension to high-dimensional data

As PCA is often applied to pre-process high-dimensional data prior to metric learning, we propose an extension so that the distance metric learned in the low-dimensional PCA subspace could still achieve certified robustness against perturbation in the original high-dimensional instance space. Defining perturbation loss in conjunction with PCA is realizable as our derivation builds on the linear transformation induced by the distance metric and PCA also performs a linear transformation to map data onto a lower dimension subspace. Let denote the linear transformation matrix obtained from PCA; is the original feature dimension and is the reduced feature dimension. Following same principle as before, the support point should be the closest point to in the original high-dimensional instance space and lie on the perpendicular bisector of points and , i.e. after first mapping the data to a low-dimensional subspace by and then mapping it to a feature space by . The mathematical formulation is as follows:

(6)

As shown in Appendix B.1, again has a closed-form solution and equations on the adversarial margin and perturbation loss can be extended accordingly.
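The geometric argument carries over because the composition of PCA and the learned metric is still linear. The sketch below is one way to work this out and is not claimed to reproduce the formula in Appendix B.1: it assumes a hypothetical PCA matrix `P` of shape (original dimension, reduced dimension), data that are already mean-centered, an identity perturbation matrix, and a metric `M` learned in the PCA subspace.

```python
import numpy as np

def support_point_pca(x_i, x_j, x_l, P, M):
    """Closest point to x_i (original space) on the bisector defined after PCA."""
    z_i, z_j, z_l = P.T @ x_i, P.T @ x_j, P.T @ x_l   # map to the PCA subspace
    v = M @ (z_l - z_j)            # boundary normal in the PCA subspace
    w = P @ v                      # the same normal pulled back to the original space
    offset = (0.5 * (z_j + z_l) - z_i) @ v
    return x_i + (offset / (w @ w)) * w

rng = np.random.default_rng(2)
P, _ = np.linalg.qr(rng.standard_normal((6, 3)))      # orthonormal columns
A = rng.standard_normal((3, 3))
M = A.T @ A
x_i, x_j, x_l = rng.standard_normal((3, 6))
x_min = support_point_pca(x_i, x_j, x_l, P, M)
```

After mapping `x_min` through `P.T`, the point is equidistant from the projected target neighbor and impostor under `M`, while the displacement `x_min - x_i` lives in the original high-dimensional space.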

2.4 Generalization benefit

From the perspective of algorithmic robustness [43], enlarging the adversarial margin could potentially improve the generalization ability of triplet-based metric learning methods. The following generalization bound, i.e. the gap between the generalization error and the empirical error , follows from the pseudo-robust theorem of [3]. Preliminaries and derivations are given in Appendix D.

Theorem 1.

Let be the optimal solution to Eq. 5. Then for any

, with probability at least

we have:

(7)

where denotes the number of triplets whose adversarial margins are larger than , is a constant denoting the upper bound of the loss function (i.e. Eq. 5), and .

Enlarging the desired adversarial margin will reduce the value of and in Eq. 9. On the one hand, decreases with at a polynomial rate of the input dimensionality and hence the upper bound of generalization gap reduces at a rate of . On the other hand, the reduction in increases the upper bound. However, remains relatively stable when increases as long as most instances in the dataset do not have a small margin in the original instance space. Therefore, for this type of dataset, we expect an improvement in the generalization ability of the learned distance metric from enlarging the adversarial margin.

3 Experiments

| Dataset | AML | LMNN | LDD | CAP | DRIFT | LMNN-CR | SCML | SCML-CR |
|---|---|---|---|---|---|---|---|---|
| Australian | 83.25±2.59 | 83.70±2.43 | 84.18±2.37 | 83.97±2.45 | 84.47±2.02 | 84.47±1.63 | 84.76±2.08 | 84.42±2.18 |
| Breast cancer | 97.10±1.21 | 97.12±1.25 | 96.95±1.51 | 97.00±1.08 | 96.98±1.16 | 97.02±1.30 | 97.00±1.09 | 97.07±1.24 |
| Fourclass | 75.12±2.35 | 75.10±2.31 | 75.15±2.32 | 75.02±2.48 | 75.08±2.34 | 75.12±2.35 | 75.10±2.27 | 75.12±2.35 |
| Haberman | 72.58±4.00 | 72.19±3.89 | 72.42±3.95 | 71.52±3.54 | 72.02±3.94 | 72.64±4.29 | 72.75±3.79 | 72.36±4.38 |
| Iris | 87.00±5.41 | 87.11±5.08 | 87.67±4.70 | 86.67±5.49 | 85.89±4.46 | 87.33±4.73 | 86.89±6.40 | 87.44±5.31 |
| Segment | 95.21±0.72 | 95.31±0.89 | 95.58±0.81 | 95.51±0.70 | 95.75±0.65 | 95.64±0.83 | 92.61±6.65 | 93.95±1.47 |
| Sonar | 84.13±4.86 | 86.67±4.10 | 87.22±3.90 | 87.22±4.38 | 86.19±4.43 | 87.78±3.53 | 82.38±4.15 | 84.13±4.61 |
| Voting | 95.34±1.64 | 95.80±1.78 | 95.80±1.41 | 95.92±1.45 | 95.31±1.32 | 96.15±1.56 | 95.84±1.58 | 96.26±1.28 |
| WDBC | 96.93±1.39 | 96.99±1.30 | 96.96±1.43 | 96.99±1.51 | 96.70±1.16 | 97.13±1.33 | 97.25±1.30 | 97.25±1.52 |
| Wine | 97.13±1.75 | 97.31±1.94 | 96.67±1.76 | 96.85±2.26 | 97.69±1.79 | 97.69±1.89 | 97.69±1.79 | 97.22±2.04 |
| # outperform | - | 9 | 8 | 10 | 9 | - | 7 | - |

Columns LMNN, LDD, CAP, DRIFT and LMNN-CR are LMNN-based; columns SCML and SCML-CR are SCML-based. For methods with LMNN as the backbone, the best ones are shown in bold and the second best ones are underlined; for methods with SCML as the backbone, the best ones are shown in bold. '# outperform' counts the number of datasets where LMNN-CR (SCML-CR, resp.) outperforms or performs equally well with LMNN-based (SCML, resp.) methods.

Table 1: Classification accuracy (mean±standard deviation) of 3NN on clean datasets.

In this section, we evaluate the generalization performance and robustness of the proposed method on 12 benchmark datasets (10 low/medium-dimensional and two high-dimensional), followed by a comparison of computational cost. In Appendix E.3, we present experiments on three synthetic datasets to illustrate the difference in the learning behavior between LMNN and the proposed method.

3.1 Experiments on UCI data

3.1.1 Data description and experimental setting

We evaluate the proposed LMNN-CR and SCML-CR on 10 UCI datasets [10]. All datasets are pre-processed with mean-centering and standardization, followed by normalization to unit length. We use 70-30% training-test partitions and report the performance over 20 rounds.
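The pre-processing pipeline described above can be sketched as below. Estimating the mean and standard deviation on the training split only is an assumption of this example (the text does not say which split the statistics come from), and the function name and `eps` guard are illustrative.

```python
import numpy as np

def preprocess(X_train, X_test, eps=1e-12):
    """Mean-center, standardize, then scale every instance to unit length."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std < eps, 1.0, std)        # leave constant features unscaled

    def transform(X):
        Z = (X - mean) / std
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        return Z / np.maximum(norms, eps)      # avoid division by zero

    return transform(X_train), transform(X_test)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4)) * np.array([1.0, 10.0, 0.1, 5.0])
X_tr, X_te = preprocess(X[:70], X[70:])        # 70-30 split as in the text
```

After this step every row of `X_tr` and `X_te` has unit Euclidean norm, so perturbation sizes and margins are measured on a common scale across datasets.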

The proposed methods are compared with two types of methods. First, we consider different regularizers on . Specifically, we replace the regularizer in LMNN from to the log-determinant divergence (LDD) [7], which encourages learning a metric toward the identity matrix, and to the capped trace norm (CAP) [15], which encourages a low-rank matrix. Second, we compare with the method DRIFT [48], which models the perturbation distribution explicitly. We also report the performance of adversarial metric learning (AML) [6]. However, it is not directly comparable to our method as it learns from pairwise constraints. In all experiments, triplet constraints are generated from 3 target neighbors and 10 nearest impostors, calculated under the Euclidean distance.
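Triplet generation from target neighbors and nearest impostors under the Euclidean distance might look like the sketch below. Pairing every target neighbor with every selected impostor is an assumption of this example, as are the function name and the index-based output format; the defaults mirror the 3 target neighbors and 10 impostors mentioned above.

```python
import numpy as np

def generate_triplets(X, y, n_targets=3, n_impostors=10):
    """Return (i, j, l) index triplets from Euclidean nearest neighbors."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    triplets = []
    for i in range(len(X)):
        order = np.argsort(sq_dists[i])        # neighbors sorted by distance
        targets = [j for j in order if j != i and y[j] == y[i]][:n_targets]
        impostors = [l for l in order if y[l] != y[i]][:n_impostors]
        triplets.extend((i, int(j), int(l)) for j in targets for l in impostors)
    return triplets

# Two well-separated classes on a line.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_triplets = generate_triplets(X_toy, y_toy, n_targets=2, n_impostors=1)
```

For the toy data each of the six instances yields two triplets, for example `(0, 1, 3)`: instance 0, its same-class neighbor 1, and its nearest different-class instance 3. The full pairwise distance matrix is convenient here but would need a neighbor index for large datasets.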

Hyperparameters of our methods are tuned via random search [4]. We randomly sample 50 sets of values from the following ranges: , , .

denotes the uniform distribution.

denotes the th percentile of , where is calculated with respect to the Euclidean distance. Information about the datasets, optimization details of the proposed and other methods, and evaluation of hyperparameter sensitivity are given in Appendices E.1, E.2, E.5, respectively.

3.1.2 Evaluation on classification performance

Table 1 reports the classification accuracy of 3NN. LMNN-CR outperforms LMNN on 9 out of 10 datasets. Among the methods with LMNN as the backbone, our method achieves the highest accuracy on 6 datasets and the second highest accuracy on the remaining 4 datasets. SCML-CR outperforms or performs equally well with SCML on 7 datasets. These experimental results demonstrate the benefit of the perturbation loss to generalization of the learned distance metric.

3.1.3 Investigation into robustness

We start with an in-depth study on the dataset Australian to investigate the relationship between the perturbation loss, adversarial margin and robustness against instance perturbation. First, we compare the adversarial margins obtained from LMNN and LMNN-CR. Instances with near-zero adversarial margins are incapable of defending against perturbation. From Fig. 2(a), we see that nearly half of these vulnerable instances have a larger adversarial margin after learning with the proposed loss. Next, we evaluate robustness by adding two types of zero-mean Gaussian noise to test data, namely spherical Gaussian noise, with a diagonal covariance matrix and equal variances, and Gaussian noise with unequal variances. The noise intensity is controlled via the signal-to-noise ratio (SNR). In addition, test data is augmented to the sample size of 10,000. Fig. 2(b) plots the classification accuracy of LMNN-based methods under different levels of spherical Gaussian noise. When the noise intensity is low, the performance of LMNN and LMNN-CR remains stable. When the noise intensity increases to an SNR of 10 dB or 5 dB, the performance of both methods degrades. Owing to the enlarged adversarial margin, the influence on LMNN-CR is slightly smaller than that on LMNN. When the SNR equals 1 dB, the performance gain from using LMNN-CR becomes smaller. This result is reasonable as the desired margin is selected according to the criterion of classification accuracy and hence may be too small to withstand a high level of noise. LMNN-CR surpasses all other LMNN-based methods until the noise intensity is very large. Fig. 2(c) plots the accuracy under the Gaussian noise. The degradation of all methods is more pronounced in this case, but the pattern remains similar.
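One way to contaminate test data at a prescribed SNR is sketched below. The SNR convention used here (ratio of the mean squared norm of the data to the expected squared norm of the noise, in decibels) and the way unequal variances are drawn are assumptions of this example; the text does not spell out the exact recipe.

```python
import numpy as np

def add_gaussian_noise(X, snr_db, spherical=True, rng=None):
    """Add zero-mean Gaussian noise so that the expected SNR equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    signal_power = np.mean(np.sum(X ** 2, axis=1))      # mean squared norm
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    if spherical:
        variances = np.full(p, noise_power / p)          # equal variances
    else:
        weights = rng.uniform(0.5, 1.5, size=p)          # hypothetical spread
        variances = noise_power * weights / weights.sum()
    noise = rng.standard_normal((n, p)) * np.sqrt(variances)
    return X + noise

def empirical_snr_db(X, X_noisy):
    """SNR in dB measured from clean and noisy copies of the same data."""
    noise = X_noisy - X
    return 10.0 * np.log10(np.sum(X ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(4)
X_clean = rng.standard_normal((20000, 4))
snr_sg = empirical_snr_db(X_clean, add_gaussian_noise(X_clean, 5.0, True, rng))
snr_g = empirical_snr_db(X_clean, add_gaussian_noise(X_clean, 5.0, False, rng))
```

Both measured values land close to the requested 5 dB; the two noise types differ only in how the same total noise power is distributed across feature dimensions.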

(a) Histogram of adversarial margins after metric learning from LMNN and LMNN-CR.
(b) Performance of LMNN-based methods under different levels of spherical Gaussian noise.
(c) Performance of LMNN-based methods under different levels of Gaussian noise.

| Dataset | AML | LMNN | LDD | CAP | DRIFT | LMNN-CR | SCML | SCML-CR |
|---|---|---|---|---|---|---|---|---|
| Australian | 82.26±1.62 | 82.13±1.52 | 82.57±1.55 | 81.82±1.52 | 81.97±1.53 | 82.90±1.53 | 82.59±1.70 | 82.84±1.64 |
| Breast cancer | 96.70±1.01 | 96.24±1.06 | 96.66±1.07 | 96.27±1.03 | 96.61±0.97 | 96.69±1.07 | 96.34±1.03 | 96.63±1.03 |
| Fourclass | 69.00±1.06 | 67.74±1.25 | 68.84±1.14 | 67.84±1.19 | 69.13±1.02 | 69.04±1.11 | 68.22±1.10 | 68.96±1.11 |
| Haberman | 70.21±1.84 | 70.21±1.84 | 70.25±1.82 | 69.39±2.06 | 69.31±2.50 | 70.25±1.90 | 69.98±1.64 | 70.24±1.85 |
| Iris | 79.07±3.25 | 78.75±2.96 | 79.04±3.17 | 77.90±3.31 | 78.57±3.09 | 79.20±3.08 | 78.32±3.60 | 79.18±3.13 |
| Segment | 85.87±0.70 | 79.03±3.37 | 83.49±1.17 | 82.77±2.49 | 83.88±1.33 | 82.13±2.70 | 61.28±9.78 | 62.86±8.76 |
| Sonar | 83.50±3.38 | 83.54±4.30 | 86.18±2.93 | 85.44±2.79 | 84.65±3.30 | 84.99±3.13 | 76.91±4.32 | 79.49±3.80 |
| Voting | 94.10±1.07 | 94.01±1.00 | 94.24±1.13 | 94.37±1.17 | 93.94±1.12 | 94.64±1.21 | 93.99±1.15 | 94.65±1.09 |
| WDBC | 96.47±1.12 | 92.01±1.65 | 96.30±0.94 | 96.14±1.11 | 96.02±0.88 | 96.07±0.89 | 95.75±1.29 | 96.22±1.14 |
| Wine | 95.03±1.14 | 93.27±1.62 | 93.97±1.38 | 93.87±1.49 | 94.55±1.15 | 94.44±1.21 | 93.92±1.55 | 94.52±1.33 |
| # outperform | - | 10 | 6 | 7 | 7 | - | 10 | - |

Table 2: Classification accuracy of 3NN on datasets contaminated with Gaussian noise (SNR = 5 dB). Columns LMNN, LDD, CAP, DRIFT and LMNN-CR are LMNN-based; columns SCML and SCML-CR are SCML-based.

We now turn to testing robustness on all datasets. Gaussian noise with an SNR of 5 dB is added to the test data. Table 2 shows that LMNN-CR and SCML-CR improve the robustness of the corresponding baselines on all datasets, which clearly demonstrates the benefit of the perturbation loss to improving robustness. Moreover, LMNN-CR is superior to the robust metric learning methods CAP and DRIFT on 7 datasets. The method LDD is also quite robust to perturbation. However, this should not be surprising as it encourages learning a metric close to the Euclidean distance, and the Euclidean distance is less sensitive to perturbation than the discriminative Mahalanobis distance. The performance under spherical Gaussian noise is similar to that under Gaussian noise, as shown in Appendix E.4.

3.2 Experiments on high-dimensional data

Isolet:

| Method | Clean | SG-20 (0.081) | SG-5 (0.423) | G-20 (0.059) | G-5 (0.318) | Adv. margin |
|---|---|---|---|---|---|---|
| LMNN | 90.1±4.5 | 90.1±4.1 | 86.0±3.5 | 90.2±4.0 | 87.8±3.9 | 0.110 |
| LMNN-CR | 91.1±3.7 | 91.0±3.8 | 87.9±3.3 | 91.1±3.7 | 89.4±3.8 | 0.125 |
| CAP | 91.1±3.7 | 91.1±3.9 | 89.0±4.0 | 91.1±3.7 | 89.9±3.9 | 0.151 |
| CAP-CR | 91.6±4.0 | 91.5±3.9 | 89.9±3.7 | 91.5±3.9 | 90.7±3.7 | 0.156 |
| SCML | 90.7±4.1 | 90.3±4.2 | 86.5±4.2 | 90.5±4.1 | 88.5±3.7 | 0.068 |
| SCML-CR | 90.8±4.2 | 90.7±4.1 | 86.5±3.7 | 90.8±4.2 | 88.4±4.1 | 0.082 |

MNIST:

| Method | Clean | SG-20 (0.054) | SG-5 (0.294) | G-20 (0.065) | G-5 (0.348) | Adv. margin |
|---|---|---|---|---|---|---|
| LMNN | 90.6 | 90.0 | 88.4 | 90.1 | 88.4 | 0.153 |
| LMNN-CR | 91.2 | 91.4 | 90.8 | 91.5 | 90.4 | 0.223 |
| CAP | 91.7 | 91.8 | 91.4 | 91.8 | 90.7 | 0.222 |
| CAP-CR | 92.0 | 91.9 | 90.9 | 92.0 | 90.7 | 0.226 |
| SCML | 89.0 | 88.8 | 87.4 | 88.9 | 86.5 | 0.122 |
| SCML-CR | 89.2 | 89.2 | 88.5 | 89.4 | 88.1 | 0.143 |

Columns SG-20, SG-5, G-20 and G-5 report methods' robustness against spherical Gaussian (SG) noise and Gaussian (G) noise with SNR of 20 dB and 5 dB. Values in brackets are the average perturbation size, calculated as the mean value of the norm of noises.

Table 3: Generalization and robustness of DML methods on high-dimensional datasets.

We verify the efficacy of the extended LMNN-CR proposed in Sec. 2.3 on the following datasets:

  1. MNIST-2k [5]: The dataset includes the first 2,000 training images and first 2,000 test images of the MNIST database. We apply PCA to reduce the feature dimension from 784 to 141, accounting for 95% of total variance. All methods are evaluated once on the pre-given training/test partition.

  2. Isolet [10]: This spoken letter database includes 7,797 instances, grouped into four training sets and one test set. Applying PCA reduces the feature dimension from 617 to 170. All methods are trained four times, one time on each training set, and evaluated on the pre-given test set.
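The PCA step used for both datasets, keeping enough components to account for a target share of the total variance, can be sketched as follows. The function name, the SVD-based implementation, and the synthetic data are assumptions of this example; only the 95% threshold comes from the text.

```python
import numpy as np

def pca_fit(X, variance_ratio=0.95):
    """Return the data mean and the leading principal directions as columns."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)       # cumulative variance share
    k = int(np.searchsorted(explained, variance_ratio) + 1)
    return mean, vt[:k].T

rng = np.random.default_rng(5)
# Synthetic data whose variance is concentrated in the first two features.
X_demo = rng.standard_normal((200, 5)) * np.array([10.0, 5.0, 0.1, 0.1, 0.1])
mean_demo, P_demo = pca_fit(X_demo, variance_ratio=0.95)
Z_demo = (X_demo - mean_demo) @ P_demo                   # reduced representation
```

On the synthetic data two components are retained, and `P_demo` has orthonormal columns, which is the property the extension in Sec. 2.3 relies on when mapping between the subspace and the original instance space.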

In addition, we introduce CAP-CR, which comprises the triplet loss of LMNN, the proposed perturbation loss, and the low-rank regularizer of CAP. For a fair comparison, CAP-CR uses the same rank and regularization weight as CAP; are tuned from 10 randomly sampled sets of values.

Table 3 compares the generalization and robustness performance of LMNN, CAP, SCML and our method; the accuracy of the other methods is inferior to that of LMNN-CR and is reported in Appendix E.4. First, on both datasets, our method achieves higher clean accuracy than the baseline methods, validating its effectiveness in enhancing the generalization ability of the learned distance metric. Second, when the average adversarial margin is larger than the average perturbation size (SNR=20 dB), our method maintains its superiority, demonstrating that the adversarial margin is indeed a contributing factor in achieving certified robustness. When the margin is smaller than the perturbation size, our method could still improve the accuracy for LMNN on both datasets, for CAP on Isolet, and for SCML on MNIST. Third, CAP-CR obtains higher accuracy on both clean and noise-contaminated data than LMNN-CR, suggesting that regularization and perturbation loss impose different requirements on $M$ and that combining them has the potential for learning a more effective distance metric.

3.3 Computational cost

| Dataset | LMNN | LDD | CAP | DRIFT | LMNN-CR |
|---|---|---|---|---|---|
| Australian | 13.44 | 0.83 | 3.07 | 1.00 | 2.15 |
| Segment | 27.48 | 10.45 | 11.47 | 5.12 | 19.54 |
| Sonar | 4.93 | 4.08 | 4.65 | 0.92 | 6.75 |
| WDBC | 9.38 | 2.94 | 5.22 | 5.12 | 8.17 |
| Isolet | 339.57 | 207.69 | 176.50 | N/A | 190.55 |
| MNIST | 369.55 | 68.98 | 180.68 | 37.51 | 391.04 |

Table 4: Average training time (in seconds) of LMNN-based methods.

We now analyze the computational complexity of LMNN-CR. According to Eq. 2.2, our method requires additional calculations on and . Given triplets, the computational complexity of is ; given training instances, the computational complexity of is . The total complexity of our method is , same as that of LMNN.

Table 4 compares the running time of LMNN-based methods on four UCI datasets that are large in sample size or in dimensionality and two high-dimensional datasets. The computational cost of our method is comparable to LMNN.

4 Conclusion

In this paper, we demonstrate that robustness and generalization of distance metrics can be enhanced by enforcing a larger margin in the instance space. By taking advantage of the linear transformation induced by the Mahalanobis distance, we obtain an explicit formula for the support points and push them away from training instances through penalizing the perturbation loss. Extensive experiments verify that our method effectively enlarges the adversarial margin, achieves certified robustness, and sustains classification excellence. Future work includes jointly learning the perturbation distribution and distance metric and extending the idea to nonlinear metric learning methods.

References

  • [1] S. An, M. Hayat, S. H. Khan, M. Bennamoun, F. Boussaid, and F. Sohel (2015) Contractive rectifier networks for nonlinear maximum margin classification. In IEEE International Conference on Computer Vision, pp. 2515–2523. Cited by: §1.
  • [2] A. Bellet, A. Habrard, and M. Sebban (2015) Metric learning.

    Synthesis Lectures on Artificial Intelligence and Machine Learning

    9 (1), pp. 1–151.
    Cited by: §1.
  • [3] A. Bellet and A. Habrard (2015) Robustness and generalization for metric learning. Neurocomputing 151, pp. 259–267. Cited by: §D.2, §2.4, Definition 1, Theorem 2.
  • [4] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §3.1.1.
  • [5] D. Cai, X. He, J. Han, and T. S. Huang (2010) Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8), pp. 1548–1560. Note: Data: http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html Cited by: item 1.
  • [6] S. Chen, C. Gong, J. Yang, X. Li, Y. Wei, and J. Li (2018) Adversarial metric learning. In International Joint Conference on Artificial Intelligence, pp. 2021–2027. Cited by: §1, §3.1.1.
  • [7] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon (2007) Information-theoretic metric learning. In International Conference on Machine Learning, pp. 209–216. Cited by: §1, §3.1.1.
  • [8] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang (2020) MMA training: direct input space margin maximization through adversarial training. In International Conference on Learning Representations. Cited by: §1.
  • [9] M. Dong, Y. Wang, X. Yang, and J. Xue (2020) Learning local metrics and influential regions for classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (6), pp. 1522–1529. Cited by: §1.
  • [10] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: item 2, §3.1.1.
  • [11] Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou (2018) Deep adversarial metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789. Cited by: §1.
  • [12] G. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio (2018) Large margin deep networks for classification. In Advances in Neural Information Processing Systems, pp. 842–852. Cited by: §1.
  • [13] L. Gautheron, A. Habrard, E. Morvant, and M. Sebban (2020) Metric learning from imbalanced data with generalization guarantees. Pattern Recognition Letters 133, pp. 298–304. Cited by: §1.
  • [14] J. Hu, J. Lu, and Y. Tan (2017) Sharable and individual multi-view metric learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (9), pp. 2281–2288. Cited by: §1.
  • [15] Z. Huo, F. Nie, and H. Huang (2016) Robust and effective metric learning using capped trace norm: metric learning via capped trace norm. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1605–1614. Cited by: §1, §3.1.1.
  • [16] R. Jin, S. Wang, and Y. Zhou (2009) Regularized distance metric learning: theory and algorithm. In Advances in Neural Information Processing Systems, pp. 862–870. Cited by: §1.
  • [17] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof (2012) Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2288–2295. Cited by: §1.
  • [18] B. Kulis et al. (2013) Metric learning: a survey. Foundations and Trends® in Machine Learning 5 (4), pp. 287–364. Cited by: §1.
  • [19] J. T. Kwok and I. W. Tsang (2003) Learning with idealized kernels. In International Conference on Machine Learning, pp. 400–407. Cited by: §1.
  • [20] M. T. Law, N. Thome, and M. Cord (2013) Quadruplet-wise image similarity learning. In IEEE International Conference on Computer Vision, pp. 249–256. Cited by: §1.
  • [21] M. T. Law, N. Thome, and M. Cord (2014) Fantope regularization in metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1051–1058. Cited by: §1.
  • [22] X. Li, L. Yu, C. Fu, M. Fang, and P. Heng (2020) Revisiting metric learning for few-shot image classification. Neurocomputing. Cited by: §1.
  • [23] D. Lim, G. Lanckriet, and B. McFee (2013) Robust structural metric learning. In International Conference on Machine Learning, pp. 615–623. Cited by: §1.
  • [24] H. Liu, Z. Han, Y. Liu, and M. Gu (2019) Fast low-rank metric learning for large-scale and high-dimensional data. In Advances in Neural Information Processing Systems, pp. 817–827. Cited by: §1.
  • [25] K. Liu, L. Brand, H. Wang, and F. Nie (2019) Learning robust distance metric with side information via ratio minimization of orthogonally constrained $\ell_{2,1}$-norm distances. In International Joint Conference on Artificial Intelligence. Cited by: §1.
  • [26] L. Luo and H. Huang (2018) Matrix variate Gaussian mixture distribution steered robust metric learning. In AAAI Conference on Artificial Intelligence. Cited by: §1.
  • [27] B. McFee and G. R. Lanckriet (2010) Metric learning to rank. In International Conference on Machine Learning, pp. 775–782. Cited by: §1.
  • [28] K. B. Petersen and M. S. Pedersen (2012) The matrix cookbook. Version of November 2012. URL: http://www2.imm.dtu.dk/pubdb/p.php?3274. Cited by: Appendix B.
  • [29] Q. Qian, J. Tang, H. Li, S. Zhu, and R. Jin (2018) Large-scale distance metric learning with uncertainty. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8542–8550. Cited by: §1.
  • [30] K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, and J. P. Cohen (2020) Revisiting training strategies and generalization performance in deep metric learning. In International Conference on Machine Learning. Cited by: §1.
  • [31] Y. Shi, A. Bellet, and F. Sha (2014) Sparse compositional metric learning. In AAAI Conference on Artificial Intelligence. Cited by: Appendix C, §2.2.
  • [32] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §1.
  • [33] K. Song, F. Nie, J. Han, and X. Li (2017) Parameter free large margin nearest neighbor for distance metric learning. In AAAI Conference on Artificial Intelligence. Cited by: §1.
  • [34] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: Table 5, §1, §2.1.
  • [35] R. Tibshirani (2015) CMU Machine Learning 10-725 Lecture Slides: Proximal Gradient Descent and Acceleration. Note: URL: https://www.stat.cmu.edu/~ryantibs/convexopt-S15/lectures/08-prox-grad.pdf. Last visited on 18/May/2020 Cited by: §E.2.
  • [36] M. J. Wainwright (2019) Metric entropy and its uses. In High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, pp. 121–158. Cited by: §D.2, Definition 2.
  • [37] H. Wang, F. Nie, and H. Huang (2014) Robust distance metric learning via simultaneous $\ell_1$-norm minimization and maximization. In International Conference on Machine Learning, pp. 1836–1844. Cited by: §1.
  • [38] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5022–5030. Cited by: §1.
  • [39] K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244. Cited by: §E.2, §1, §1.
  • [40] S. Xiang, F. Nie, and C. Zhang (2008) Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognition 41 (12), pp. 3600–3612. Cited by: §1.
  • [41] E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng (2003) Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pp. 521–528. Cited by: §1.
  • [42] H. Xiong and X. Chen (2006) Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics 7 (1), pp. 1–11. Cited by: §1.
  • [43] H. Xu and S. Mannor (2012) Robustness and generalization. Machine learning 86 (3), pp. 391–423. Cited by: §1, §2.4.
  • [44] J. Xu, L. Luo, C. Deng, and H. Huang (2018) Bilevel distance metric learning for robust image recognition. In Advances in Neural Information Processing Systems, pp. 4198–4207. Cited by: §1.
  • [45] Z. Yan, Y. Guo, and C. Zhang (2019) Adversarial margin maximization networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [46] J. Yang, D. She, Y. Lai, and M. Yang (2018) Retrieving and classifying affective images via deep metric learning. In AAAI Conference on Artificial Intelligence. Cited by: §1.
  • [47] L. Yang and R. Jin (2006) Distance metric learning: a comprehensive survey. Michigan State University 2 (2), pp. 4. Cited by: §1.
  • [48] H. Ye, D. Zhan, X. Si, and Y. Jiang (2017) Learning Mahalanobis distance metric: considering instance disturbance helps. In International Joint Conference on Artificial Intelligence, pp. 3315–3321. Cited by: §1, §3.1.1.

Appendix A Summary of main concepts

adversarial example: a perturbed instance whose perturbation changes the label of the instance's nearest neighbor (NN) in the feature space from the same class to a different class. In other words, the perturbation forces the NN classifier to produce an incorrect prediction [34].
support point (): the adversarial example that is closest to the training instance in the original instance space.
certified neighborhood: the largest hypersphere within which a training instance can be perturbed while its NN in the feature space remains an instance of the same class.
adversarial margin (): the Euclidean distance between a training instance and its associated support point. It defines the radius of the certified neighborhood.
Table 5: Terminology list
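The definition of an adversarial example can be made concrete with a small check. The following is a minimal sketch, assuming a 1-NN classifier under a Mahalanobis matrix `M`; the function names `nn_label` and `is_adversarial` and the toy data are hypothetical and introduced only for illustration.

```python
import numpy as np

def nn_label(x, X_train, y_train, M):
    """Label of the training instance nearest to x under the Mahalanobis matrix M."""
    diffs = X_train - x
    # Squared Mahalanobis distance from x to every training instance.
    sq_dists = np.einsum("nd,de,ne->n", diffs, M, diffs)
    return y_train[np.argmin(sq_dists)]

def is_adversarial(x, delta, label, X_train, y_train, M):
    """True if x has a same-class NN but x + delta has a different-class NN."""
    before = nn_label(x, X_train, y_train, M)
    after = nn_label(x + delta, X_train, y_train, M)
    return bool(before == label and after != label)

# Toy data (an assumption): one instance per class on the first axis, Euclidean metric.
X_train = np.array([[0.0, 0.0], [2.0, 0.0]])
y_train = np.array([0, 1])
M = np.eye(2)
x = np.array([0.5, 0.0])  # instance of class 0

small_shift = is_adversarial(x, np.array([0.3, 0.0]), 0, X_train, y_train, M)
large_shift = is_adversarial(x, np.array([0.6, 0.0]), 0, X_train, y_train, M)
```

In this toy setting the decision boundary lies at 1.0 on the first axis, so the adversarial margin of `x` is 0.5: the shift of 0.3 stays inside the certified neighborhood, while the shift of 0.6 leaves it.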

Appendix B Derivation of support point, adversarial margin, and gradient of perturbation loss

First, we define the hyperellipsoid via the quadratic form. An arbitrarily oriented hyperellipsoid, centered at , is defined by the solutions to the equation

where is a positive definite matrix. By the Cholesky decomposition, . Therefore, finding the support point of on the hyperellipsoid is equivalent to finding the point that defines the smallest hypersphere given by .
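The change of variables behind this equivalence can be checked numerically. The sketch below assumes the hyperellipsoid is written as a quadratic form with a positive definite matrix, here called `A` (a hypothetical name), and uses NumPy's convention that the Cholesky factor `L` is lower triangular with `A = L @ L.T`. It shows that the quadratic form equals a squared Euclidean norm in the transformed coordinates.

```python
import numpy as np

def to_sphere_coords(x, center, A):
    """Map x to y = L^T (x - center), where A = L L^T is the Cholesky factorization."""
    L = np.linalg.cholesky(A)  # lower triangular, A == L @ L.T
    return L.T @ (x - center)

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B @ B.T + 3.0 * np.eye(3)  # positive definite by construction
center = rng.normal(size=3)
x = rng.normal(size=3)

# Value of the hyperellipsoid's quadratic form at x.
quad_form = float((x - center) @ A @ (x - center))

# Squared Euclidean norm in the transformed coordinates.
y = to_sphere_coords(x, center, A)
sq_norm = float(y @ y)
```

Because the two quantities agree, minimizing the quadratic form over a constraint set is the same as finding the smallest hypersphere in the transformed space that touches that set.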

The optimization problem of Eq. 1 is equivalent to the following problem:

Applying the method of Lagrange multipliers, we form the following Lagrangian by introducing a Lagrange multiplier and then solve the problem by setting the first partial derivatives to zero:

The Hessian matrix equals , which is positive definite, and hence is the minimum point. Replacing (identity matrix) gives Eq. 2.
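The derivation above can be written out explicitly. The block below is a reconstruction under assumed notation, not the paper's typeset equations: $\mathbf{x}_i$ is the training instance, $\mathbf{x}_j$ a same-class neighbor, $\mathbf{x}_l$ a different-class instance, $\mathbf{M}$ the symmetric Mahalanobis matrix, $\mathbf{A}$ the positive definite matrix of the hyperellipsoid, and $\lambda$ the Lagrange multiplier. The shorthand $\mathbf{w}$ and $b$ is introduced here only for illustration.

```latex
% Reconstruction under assumed notation; all symbols are assumptions, not recovered from the source.
\begin{align}
&\min_{\mathbf{x}}\ (\mathbf{x}-\mathbf{x}_i)^\top \mathbf{A}\,(\mathbf{x}-\mathbf{x}_i)
 \quad \text{s.t.} \quad
 (\mathbf{x}-\mathbf{x}_j)^\top \mathbf{M}\,(\mathbf{x}-\mathbf{x}_j)
 = (\mathbf{x}-\mathbf{x}_l)^\top \mathbf{M}\,(\mathbf{x}-\mathbf{x}_l), \\
% Expanding the constraint cancels the quadratic term in x and leaves a hyperplane.
&\text{constraint} \iff \mathbf{w}^\top \mathbf{x} = b,
 \qquad \mathbf{w} = \mathbf{M}(\mathbf{x}_l-\mathbf{x}_j),
 \qquad b = \tfrac{1}{2}\,(\mathbf{x}_l+\mathbf{x}_j)^\top \mathbf{M}\,(\mathbf{x}_l-\mathbf{x}_j), \\
&\mathcal{L}(\mathbf{x},\lambda)
 = (\mathbf{x}-\mathbf{x}_i)^\top \mathbf{A}\,(\mathbf{x}-\mathbf{x}_i)
 + \lambda\,(\mathbf{w}^\top \mathbf{x} - b), \\
&\frac{\partial \mathcal{L}}{\partial \mathbf{x}}
 = 2\mathbf{A}(\mathbf{x}-\mathbf{x}_i) + \lambda \mathbf{w} = \mathbf{0},
 \qquad
 \frac{\partial \mathcal{L}}{\partial \lambda}
 = \mathbf{w}^\top \mathbf{x} - b = 0, \\
&\mathbf{x}^{\star}
 = \mathbf{x}_i
 + \frac{b - \mathbf{w}^\top \mathbf{x}_i}{\mathbf{w}^\top \mathbf{A}^{-1}\mathbf{w}}\,
   \mathbf{A}^{-1}\mathbf{w},
 \qquad
 \nabla^2_{\mathbf{x}} \mathcal{L} = 2\mathbf{A} \succ 0, \\
% Hyperspherical case A = I, using 2(b - w^T x_i) = d_M^2(x_i, x_l) - d_M^2(x_i, x_j).
&\mathbf{x}^{\star}\big|_{\mathbf{A}=\mathbf{I}}
 = \mathbf{x}_i
 + \frac{d_{\mathbf{M}}^2(\mathbf{x}_i,\mathbf{x}_l) - d_{\mathbf{M}}^2(\mathbf{x}_i,\mathbf{x}_j)}
        {2\,(\mathbf{x}_l-\mathbf{x}_j)^\top \mathbf{M}^2 (\mathbf{x}_l-\mathbf{x}_j)}\,
   \mathbf{M}(\mathbf{x}_l-\mathbf{x}_j).
\end{align}
```

Here $d_{\mathbf{M}}^2(\mathbf{a},\mathbf{b}) = (\mathbf{a}-\mathbf{b})^\top \mathbf{M}(\mathbf{a}-\mathbf{b})$. The last line is the hyperspherical special case obtained by setting $\mathbf{A}=\mathbf{I}$; it relies on $\mathbf{M}$ being symmetric so that $\mathbf{w}^\top\mathbf{w}$ can be written with $\mathbf{M}^2$.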

The squared adversarial margin is calculated by first simplifying and then computing as follows:

Substituting gives Eq. 3.
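As a numerical sanity check for the hyperspherical case, the sketch below computes a candidate support point and squared adversarial margin and verifies their defining properties. The closed forms used in `support_point` and `adversarial_margin_sq` are a reconstruction under assumptions (symmetric `M`, one same-class neighbor `x_j`, one different-class instance `x_l`); all function and variable names are hypothetical.

```python
import numpy as np

def mahalanobis_sq(a, b, M):
    """Squared Mahalanobis distance (a - b)^T M (a - b)."""
    d = a - b
    return float(d @ M @ d)

def support_point(x_i, x_j, x_l, M):
    """Closest point to x_i (Euclidean) that is equidistant to x_j and x_l under M."""
    w = M @ (x_l - x_j)  # normal vector of the decision hyperplane
    gap = mahalanobis_sq(x_i, x_l, M) - mahalanobis_sq(x_i, x_j, M)
    return x_i + gap / (2.0 * float(w @ w)) * w

def adversarial_margin_sq(x_i, x_j, x_l, M):
    """Squared Euclidean distance from x_i to its support point, in closed form."""
    w = M @ (x_l - x_j)
    gap = mahalanobis_sq(x_i, x_l, M) - mahalanobis_sq(x_i, x_j, M)
    return gap ** 2 / (4.0 * float(w @ w))

rng = np.random.default_rng(1)
dim = 4
G = rng.normal(size=(dim, dim))
M = G.T @ G + 0.1 * np.eye(dim)  # symmetric positive definite
x_i, x_j, x_l = rng.normal(size=(3, dim))

x_s = support_point(x_i, x_j, x_l, M)

# The support point should lie on the boundary: equal distance to x_j and x_l.
boundary_gap = mahalanobis_sq(x_s, x_j, M) - mahalanobis_sq(x_s, x_l, M)

# The closed-form margin should match the directly measured distance.
margin_sq_direct = float((x_s - x_i) @ (x_s - x_i))
margin_sq_formula = adversarial_margin_sq(x_i, x_j, x_l, M)

# Any other boundary point (shifted within the hyperplane) should be farther away.
w = M @ (x_l - x_j)
v = rng.normal(size=dim)
tangent = v - (v @ w) / (w @ w) * w
other = x_s + tangent
other_gap = mahalanobis_sq(other, x_j, M) - mahalanobis_sq(other, x_l, M)
other_dist_sq = float((other - x_i) @ (other - x_i))
```

The check with `other` illustrates why the support point is the minimizer: the displacement from `x_i` to the support point is parallel to the hyperplane normal, so any move within the hyperplane only adds distance.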

Next, we derive the gradient of with respect to . When and (i.e. in the hyperspherical case), or , the gradient equals zero. When and , the gradient of equals the gradient of (i.e. in the hyperspherical case), which can be calculated by using the quotient rule and the derivative of trace [28]:

where denotes the trace operator. and are defined similarly. Substituting gives Eq. 2.2.
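The quotient-rule gradient in the hyperspherical case can be checked against finite differences. The sketch below is not the paper's typeset formula. It assumes the squared margin has the form $n^2/(4q)$ with $n = \mathrm{tr}(\mathbf{M}\mathbf{P})$ and $q = \mathrm{tr}(\mathbf{M}\mathbf{U}\mathbf{M})$, where $\mathbf{P}$ and $\mathbf{U}$ are outer-product matrices built from the instances (hypothetical names), and it only covers the smooth case, not the branches where the gradient is zero.

```python
import numpy as np

def margin_sq(M, x_i, x_j, x_l):
    """Squared adversarial margin n^2 / (4 q) in the hyperspherical case."""
    u = x_l - x_j
    n = (x_i - x_l) @ M @ (x_i - x_l) - (x_i - x_j) @ M @ (x_i - x_j)
    q = u @ M @ M @ u
    return float(n ** 2 / (4.0 * q))

def margin_sq_grad(M, x_i, x_j, x_l):
    """Gradient of margin_sq with respect to a symmetric M, via the quotient rule."""
    u = x_l - x_j
    P = np.outer(x_i - x_l, x_i - x_l) - np.outer(x_i - x_j, x_i - x_j)
    U = np.outer(u, u)
    n = np.trace(M @ P)      # numerator term; its derivative w.r.t. M is P
    q = np.trace(M @ U @ M)  # denominator term; its derivative is M U + U M
    return n * P / (2.0 * q) - n ** 2 * (M @ U + U @ M) / (4.0 * q ** 2)

rng = np.random.default_rng(2)
dim = 4
G = rng.normal(size=(dim, dim))
M = G.T @ G + 0.1 * np.eye(dim)  # symmetric positive definite
x_i, x_j, x_l = rng.normal(size=(3, dim))

# Random symmetric direction, so that M stays symmetric under the perturbation.
D = rng.normal(size=(dim, dim))
D = 0.5 * (D + D.T)

eps = 1e-6
finite_diff = (margin_sq(M + eps * D, x_i, x_j, x_l)
               - margin_sq(M - eps * D, x_i, x_j, x_l)) / (2.0 * eps)
analytic = float(np.sum(margin_sq_grad(M, x_i, x_j, x_l) * D))
```

The directional derivative is compared along a symmetric direction because the analytic expression is derived for symmetric `M`; for a general matrix the derivative of the denominator would take a different form.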

B.1 Extension for high-dimensional data

Support point, adversarial margin and gradient of the perturbation loss with dimensionality reduction are derived by following the same principle as in Appendix B.

The method of Lagrange multipliers is applied to derive a closed-form solution to the support point:

where denotes .

The squared adversarial margin is calculated from the definition of the hyperellipsoid:

The perturbation loss is defined similarly to Eq. 4 as follows: