Adversarial Metric Learning

02/09/2018 ∙ by Shuo Chen, et al. ∙ Nanjing University

In the past decades, intensive efforts have been devoted to designing loss functions and metric forms for the metric learning problem. These designs perform well when the test data resemble the training data, but the trained models often fail to produce reliable distances on ambiguous test pairs because of the distribution bias between the training and test sets. To address this problem, this paper proposes Adversarial Metric Learning (AML), which automatically generates adversarial pairs to remedy the distribution bias and facilitate robust metric learning. Specifically, AML consists of two adversarial stages, i.e. confusion and distinguishment. In the confusion stage, ambiguous but critical adversarial data pairs are adaptively generated to mislead the learned metric. In the distinguishment stage, a metric is exhaustively learned to distinguish both the adversarial pairs and the original training pairs. Thanks to the challenges posed by the confusion stage in this competing process, the AML model is able to grasp plentiful difficult knowledge that is not contained in the original training pairs, so its discriminability can be significantly improved. The entire model is formulated as an optimization framework whose global convergence is theoretically proved. Experimental results on toy data and practical datasets clearly demonstrate the superiority of AML over representative state-of-the-art metric learning methods.




1 Introduction

The calculation of similarity or distance between a pair of data points plays a fundamental role in many machine learning and pattern recognition tasks such as retrieval [Yang et al.2010], verification [Noroozi et al.2017], and classification [Yang et al.2016]. Therefore, “Metric Learning” [Bishop2006, Weinberger and Saul2009] was proposed to enable an algorithm to acquire an appropriate distance metric so that the precise similarity between different examples can be faithfully reflected.

Figure 1: The comparison of traditional metric learning and our proposed model. (a) Traditional metric learning directly minimizes the loss to distinguish the training pairs. (b) Our proposed AML contains a distinguishment stage and a confusion stage, where the original training pairs and the adversarial pairs are jointly invoked to obtain an accurate metric. The confusion stage generates the adversarial pairs in the local region of the original training pairs, by maximizing the metric loss while penalizing the distance to the original pairs.

In metric learning, the similarity between two example vectors x and y is usually expressed by a distance function d(x, y). Perhaps the most commonly used distance function is the Mahalanobis distance, which has the form d_M(x, y) = (x − y)^T M (x − y).¹ Here the symmetric positive definite (SPD) matrix M should be learned by an algorithm to fit the similarity reflected by the training data. By decomposing M as L^T L, we see that the Mahalanobis distance intrinsically calculates the Euclidean distance in a linear space rendered by the projection matrix L, namely d_M(x, y) = ||Lx − Ly||². Consequently, a large number of models were proposed to either directly pursue the Mahalanobis matrix M [Davis et al.2007, Zadeh et al.2016, Zhang and Zhang2017] or indirectly learn such a linear projection [Lu et al.2014, Harandi et al.2017].

¹For simplicity, the notation of “square” on d_M has been omitted; this does not influence the final output.
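
This equivalence between the Mahalanobis distance and a Euclidean distance in a projected space can be checked numerically. The sketch below (with arbitrary toy data, not from the paper) constructs an SPD matrix M = L^T L and verifies that d_M(x, y) = ||Lx − Ly||²:

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance d_M(x, y) = (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)

rng = np.random.default_rng(0)
L = rng.standard_normal((3, 3))
M = L.T @ L                       # M = L^T L is SPD by construction
x, y = rng.standard_normal(3), rng.standard_normal(3)

# The Mahalanobis distance equals the Euclidean distance after projecting by L.
d_maha = mahalanobis_sq(x, y, M)
d_proj = float(np.sum((L @ x - L @ y) ** 2))
assert np.isclose(d_maha, d_proj)
```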

Furthermore, considering that the above linear transformation is not flexible enough to characterize complex data relationships, some recent works utilized deep neural networks, e.g. the Convolutional Neural Network (CNN) [Simo-Serra et al.2015, Oh Song et al.2016], to achieve non-linearity. Generally, the kernel- or CNN-based nonlinear distance metrics can be summarized as d(x, y) = ||f(x) − f(y)||², in which the output of the kernel mapping or neural network is denoted by the mapping f(·).
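
The nonlinear case follows the same pattern with a feature map in place of the linear projection. Below is a minimal sketch in which np.tanh stands in for a learned kernel mapping or CNN embedding (an illustrative assumption, not the networks used in the cited works):

```python
import numpy as np

def nonlinear_dist_sq(x, y, f):
    """d(x, y) = ||f(x) - f(y)||^2 for an arbitrary feature map f."""
    return float(np.sum((f(x) - f(y)) ** 2))

# Toy nonlinear map standing in for a learned embedding.
f = np.tanh
x, y = np.array([0.5, -1.0]), np.array([1.5, 0.0])
d = nonlinear_dist_sq(x, y, f)

# With the identity map, the measure reduces to plain squared Euclidean distance.
assert np.isclose(nonlinear_dist_sq(x, y, lambda v: v), float(np.sum((x - y) ** 2)))
```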

However, the existing approaches above simply learn linear or non-linear metrics by designing different loss functions on the original training pairs. During the test phase, due to the distribution bias between the training and test sets, some ambiguous data pairs that are difficult to distinguish by the learned metric may appear, which significantly impairs the algorithm performance. To this end, we propose Adversarial Metric Learning (AML) to learn a robust metric, which follows the idea of adversarial training [Goodfellow et al.2015, Li et al.2017] and is able to generate ambiguous but critical data pairs to enhance robustness. As shown in Fig. 1, compared with traditional metric learning methods that only distinguish the given training pairs, our AML learns the metric to distinguish both the original training pairs and the generated adversarial pairs. Here, the adversarial data pairs are automatically synthesized by the algorithm to confuse the learned metric as much as possible. The adversarial pairs and the learned metric form an adversarial relationship in which each tries to “beat” the other. Specifically, the adversarial pairs tend to introduce ambiguous examples whose (dis)similarities are difficult for the learned metric to decide correctly (i.e. the confusion stage), while the metric makes its best effort to discriminate the confusing adversarial pairs (i.e. the distinguishment stage). In this sense, the adversarial pairs help our model acquire an accurate metric. To avoid iterative competition, we convert the adversarial game into an optimization problem that has a theoretically guaranteed optimal solution. In the experiments, we show that the robust Mahalanobis metric learned by AML is superior to state-of-the-art metric learning models on popular datasets with classification and verification tasks.

The most prominent advantage of our proposed AML is that extra data pairs (i.e. adversarial pairs) are explored to boost the discriminability of the learned metric. Several metric learning models have been proposed based on data augmentation [Ahmed et al.2015, Zagoruyko and Komodakis2015] or pair perturbation [Perrot and Habrard2015, Ye et al.2017]. However, the virtual data generated by these methods are largely based on priors that may significantly differ from the practical test data, so their performance is rather limited. In contrast, the additional adversarial pairs in AML are consciously designed to mislead the learned metric, so they are formed in an intentional and realistic way. Specifically, to narrow the search space of adversarial pairs, AML establishes the adversarial pairs within neighborhoods of the original training pairs, as shown in Fig. 2. Thanks to learning on both real and generated pairs, the discriminability of our method can be substantially improved.

Figure 2: Generation of adversarial similar pairs and adversarial dissimilar pairs. Similar and dissimilar pairs are marked with orange balls and blue blocks, respectively. Hollow balls and blocks denote the original training examples, while filled balls and blocks denote the adversarial examples automatically generated by our model. Note that the two generated points constituting an adversarial similar pair are far from each other, which describes the extreme case for two examples to be a similar pair. Similarly, the two generated points constituting an adversarial dissimilar pair are closely distributed, which depicts the worst case for two examples to be a dissimilar pair.

The main contributions of this paper are summarized as:

  • We propose a novel framework dubbed Adversarial Metric Learning (AML), which generates adversarial data pairs in addition to the originally given training data to enhance the model's discriminability.

  • AML is converted to an optimization framework, of which the convergence is analyzed.

  • AML is empirically validated to outperform state-of-the-art metric learning models on typical applications.

2 Adversarial Metric Learning

We first introduce some necessary notations in Section 2.1, and then explain the optimization model of the proposed AML in Section 2.2. Finally, we provide the iterative solution as well as the convergence proof in Section 2.3 and Section 2.4, respectively.

2.1 Preliminaries

Let X = [X_1, X_2, …, X_n] be the matrix of training example pairs, where each X_i consists of a pair of d-dimensional training examples. Besides, we define a label vector l of which the i-th element l_i represents the relationship of the pairwise examples recorded by the i-th column of X, namely l_i = 1 if the two examples of X_i are similar, and l_i = −1 otherwise. Based on the supervised training data, the Mahalanobis metric M is learned by minimizing a general loss function ℓ(M; X, l), namely

min_{M ∈ Ω}  ℓ(M; X, l),    (1)

in which Ω denotes the feasible set for M, such as the SPD constraint [Arsigny et al.2007], bounded constraint [Yang et al.2016], low-rank constraint [Harandi et al.2017], etc. In our proposed AML, the generated adversarial pairs are denoted by the matrix Z = [Z_1, Z_2, …, Z_n], where Z_i represents the i-th generated example pair. Since each Z_i, like X_i, contains two examples, the distance between Z_i and X_i is defined as the sum of the Mahalanobis distances between the corresponding examples of Z_i and X_i, i.e. Dist(Z_i, X_i) = d_M(z_i^(1), x_i^(1)) + d_M(z_i^(2), x_i^(2)).

2.2 Model Establishment

As mentioned in the Introduction, our AML algorithm alternates between learning a reliable distance metric (i.e. the distinguishment stage) and generating misleading adversarial data pairs (i.e. the confusion stage), of which the latter is the core of AML for boosting the learning performance. The main target of the confusion stage is to produce adversarial pairs that confuse the learned metric M. That is to say, we should explore pairs for which the similarity predicted by M is opposite to the true label. Fig. 2 intuitively plots the generation of such pairs. To achieve this effect, we search for data pairs in the neighborhood of the original pairs that violate the results predicted by the learned metric. Specifically, the loss function is expected to be as large as possible, while the distance to the original pairs is preferred to be small, which leads to the following optimization objective

max_Z  ℓ(M; Z, l) − μ · Σ_{i=1}^{n} Dist(Z_i, X_i),    (2)

in which the regularizer coefficient μ is manually tuned to control the size of the search space. Since Z_i is found in the neighborhood of X_i, the true label of Z_i is reasonably assumed to be l_i, i.e. the label of X_i. It means that Eq. (2) tries to find data pairs Z whose metric results are opposite to their corresponding true labels l. Therefore, such an optimization exploits the adversarial pairs to confuse the metric M.

Nevertheless, Eq. (2) cannot be directly taken as a valid optimization problem, as it is not bounded, which means that Eq. (2) might not have an optimal solution. To avoid this problem while achieving the same effect as Eq. (2), we convert the maximization of the loss w.r.t. the true labels l to the minimization of the loss w.r.t. the opposite labels −l, because the opposite labels yield the opposite similarities when they supervise the minimization of the loss function. The confusion stage is then reformulated as

min_Z  ℓ(M; Z, −l) + μ · Σ_{i=1}^{n} Dist(Z_i, X_i).    (3)

The optimal solution to the above problem always exists, because both the loss function and the distance operator are bounded below.

By solving Eq. (3), we obtain the generated adversarial pairs recorded in the matrix Z, which can be employed to learn a proper metric. Since these confusing adversarial pairs are incorrectly predicted, the metric should exhaustively distinguish them to improve its discriminability. By combining the adversarial pairs in Z with the originally available training pairs in X, the augmented training loss utilized in the distinguishment stage has the form

ℓ(M; X, l) + λ · ℓ(M; Z, l),    (4)

where the regularizer coefficient λ is manually tuned to control the weight of the adversarial data. Furthermore, to improve both the distinguishment (i.e. Eq. (4)) and the confusion (i.e. Eq. (3)) during their adversarial game, the two problems have to be optimized alternately, i.e.

Z^(t) = argmin_Z  ℓ(M^(t); Z, −l) + μ · Σ_{i=1}^{n} Dist(Z_i, X_i),
M^(t+1) = argmin_{M ∈ Ω}  ℓ(M; X, l) + λ · ℓ(M; Z^(t), l).    (5)
The straightforward implementation of Eq. (5), however, confronts two problems in practical use. Firstly, Eq. (5) is performed iteratively, which greatly decreases the efficiency of the model. Secondly, alternating iterations between two different objective functions are not necessarily convergent [Ben-Tal et al.2009].

To achieve a similar effect to the direct alternation in Eq. (5) while avoiding the two disadvantages mentioned above, the iterative expression for Z is integrated into the optimization of M. Therefore, our AML is ultimately expressed as a bi-level optimization problem [Bard2013], namely

min_{M ∈ Ω}  ℓ(M; X, l) + λ · ℓ(M; Z*, l),
s.t.  Z* = argmin_Z  ℓ(M; Z, −l) + μ · Σ_{i=1}^{n} Dist(Z_i, X_i),    (6)

in which Z* denotes the optimal adversarial pair matrix, and the inner objective is required to be strictly quasi-convex². Note that the strict quasi-convexity ensures the uniqueness of Z*, and helps to make the problem well-defined.

²If a function f satisfies f(t·x_1 + (1 − t)·x_2) < max{f(x_1), f(x_2)} for all x_1 ≠ x_2 and t ∈ (0, 1), then f is strictly quasi-convex.

2.3 Optimization

To implement Eq. (6), we instantiate the loss ℓ in both levels to obtain a specific learning model. To make the objective convex, here we employ the geometric-mean loss [Zadeh et al.2016], which has the unconstrained form

ℓ(M; X, l) = Σ_{l_i = 1} d_M(x_i^(1), x_i^(2)) + Σ_{l_i = −1} d_{M^(−1)}(x_i^(1), x_i^(2)),    (7)

where the loss of a dissimilar data pair is expressed as d_{M^(−1)}(x_i^(1), x_i^(2)) to increase the distances between dissimilar examples. Moreover, we substitute the loss ℓ in Eq. (6) with Eq. (7), and impose the SPD constraint on M for simplicity, namely Ω = {M : M ≻ 0}. The detailed optimization algorithm for Eq. (6) is provided as follows.
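
Assuming the instantiated loss is indeed the standard geometric-mean loss of Zadeh et al. (2016), in which similar pairs contribute d_M and dissimilar pairs contribute d_{M^(−1)}, a minimal sketch reads:

```python
import numpy as np

def geometric_mean_loss(pairs, labels, M):
    """Sketch of the geometric-mean loss: similar pairs (label +1)
    contribute d_M, dissimilar pairs (label -1) contribute d_{M^-1},
    which drives M to stretch directions separating dissimilar examples."""
    Minv = np.linalg.inv(M)
    total = 0.0
    for (x, y), l in zip(pairs, labels):
        d = x - y
        total += float(d @ (M if l == 1 else Minv) @ d)
    return total

pairs = [(np.array([0.0, 0.0]), np.array([1.0, 0.0])),
         (np.array([0.0, 0.0]), np.array([0.0, 2.0]))]
labels = [1, -1]
loss_id = geometric_mean_loss(pairs, labels, np.eye(2))
assert np.isclose(loss_id, 1.0 + 4.0)   # with M = I both terms are Euclidean
```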

Solving for Z: We can directly obtain the closed-form solution (i.e. the optimal adversarial pairs Z*) of the inner problem. Specifically, by using the convexity of the inner objective, we set its gradient w.r.t. Z to zero and arrive at


which holds for any index i = 1, …, n. It is clear that the equation system Eq. (8) has the unique solution


Such a closed-form solution means that the minimizer of the inner problem can be directly expressed as


where the minimizer is expressed as a mapping from M to Z* decided by Eq. (9). Hence Eq. (6) is equivalently converted to


which is an unconstrained optimization problem regarding the single variable M.

Solving for M: The key point is to calculate the gradient of the second term, which depends on M both directly and through Z*. We substitute Z* with Eq. (9) and obtain that


where M = UΛU^T denotes the eigen-decomposition of M. Each term to be summed in the above Eq. (12) can be compactly written as a composition with a differentiable function, and the gradient of such a term can be obtained from the properties of eigenvalues and eigenvectors [Bellman1997].

By further leveraging the chain rule of derivatives [Petersen et al.2008], the gradient of Eq. (12) can be expressed as


in which


Finally, the gradient of the objective in Eq. (11) equals


in which the involved matrices are determined by the eigen-decomposition of M. It should be noted that the gradient can be calculated efficiently for the SPD matrix M, since it only depends on the eigen-decomposition.

Now we can employ a gradient-based method for SPD optimization to solve our proposed model. Following the popular SPD algorithms in existing metric learning models [Ye et al.2017, Luo and Huang2018], a projection operator is utilized to maintain the SPD property. Specifically, for a symmetric matrix A, the projection onto the SPD cone truncates the negative eigenvalues of A, i.e.


It can be proved that the metric remains symmetric after each gradient descent step, so the projection operator can be leveraged within gradient descent to find the optimal solution. The pseudo-code for solving Eq. (6) is provided in Algorithm 1, where the step size is fixed to a small constant in our experiments.
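
The eigenvalue-truncation projection can be sketched as follows (the small positive floor eps is an implementation convenience to keep the result strictly positive definite, not a detail from the paper):

```python
import numpy as np

def project_spd(A, eps=1e-8):
    """Project a symmetric matrix onto the SPD cone by truncating
    eigenvalues below eps."""
    A = (A + A.T) / 2                      # guard against round-off asymmetry
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.maximum(w, eps)) @ V.T

A = np.diag([2.0, -1.0])
P = project_spd(A)
assert np.allclose(P, np.diag([2.0, 1e-8]))   # negative eigenvalue truncated
```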

Input: Training data pairs; their labels; regularizer coefficients and step size.

Initialize: the metric (e.g. with the identity matrix).


  1. Compute the gradient by Eq. (15);

  2. Update the metric by a gradient descent step;

  3. Project the updated metric onto the SPD cone;

Until Convergence.

Output: The converged .

Algorithm 1 Solving AML in Eq. (6) via gradient descent.
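
For intuition, the distinguishment-stage update of Algorithm 1 can be sketched as projected gradient descent on a geometric-mean-style loss. The adversarial-pair generation is omitted here, and the step size, iteration count, and eigenvalue floor are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def project_spd(A, eps=1e-8):
    """Truncate eigenvalues below eps to keep the iterate SPD."""
    A = (A + A.T) / 2
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.maximum(w, eps)) @ V.T

def projected_descent(pairs, labels, lr=0.01, iters=200):
    """Projected gradient descent on a geometric-mean-style loss over the
    given pairs only (no adversarial pairs, no extra regularization)."""
    dim = pairs[0][0].shape[0]
    M = np.eye(dim)
    for _ in range(iters):
        G = np.zeros((dim, dim))
        Minv = np.linalg.inv(M)
        for (x, y), l in zip(pairs, labels):
            z = (x - y)[:, None]
            # grad of z^T M z is z z^T; grad of z^T M^{-1} z is -M^{-1} z z^T M^{-1}
            G += z @ z.T if l == 1 else -(Minv @ z @ z.T @ Minv)
        M = project_spd(M - lr * G)        # gradient step, then SPD projection
    return M

pairs = [(np.array([0.0, 0.0]), np.array([1.0, 0.0])),   # similar pair
         (np.array([0.0, 0.0]), np.array([0.0, 1.0]))]   # dissimilar pair
M = projected_descent(pairs, [1, -1])
```

On this toy example, the learned metric shrinks the direction separating the similar pair and stretches the direction separating the dissimilar pair.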

2.4 Convergence Analysis

Since our proposed bi-level optimization problem greatly differs from the traditional metric learning models, here we provide the theoretical analysis for the algorithm convergence.

Firstly, to ensure that the definition of AML in Eq. (6) is valid, we prove that the optimal solution Z* always exists and is unique by showing the strict (quasi-)convexity of the inner objective when the loss is instantiated by the geometric-mean loss, namely

Theorem 1.

Assume that M is SPD. Then the inner objective of Eq. (6) is strictly convex with respect to Z.


Proof. Consider two distinct candidate pair matrices and their convex combination. By invoking the SPD property of both M and M^(−1), we have


where the strict inequality follows from positive definiteness, so the loss term satisfies the definition of a strictly convex function. Similarly, it is easy to check the convexity of the distance regularizer w.r.t. Z, which completes the proof. ∎

Furthermore, as we employ the projection within gradient descent, it is necessary to demonstrate that every iterate has an orthogonal eigen-decomposition. Otherwise, the projection cannot be executed and the SPD property of M is not guaranteed. Therefore, in the following Theorem 2, we prove that the gradient matrix is symmetric, and thus all iterates remain in the feasible set Ω.

Theorem 2.

For any differentiable function involved in the loss, the gradient matrix in Eq. (15) is symmetric.


Proof. Assume that the SPD matrix M = UΛU^T, where Λ and U are the matrices consisting of the distinct eigenvalues and unit eigenvectors of M, respectively. By applying Maclaurin's formula [Russell1996] to the function of each eigenvalue, the term in Eq. (13) can be expanded as a power series in M, so its gradient equals


Since the gradient of each matrix power is symmetric [Petersen et al.2008], the summation of these gradient matrices is also symmetric, which completes the proof. ∎

It has now been proved that the gradient remains symmetric during the iterations, and thus the projection ensures the SPD property of M, i.e., the constraint M ∈ Ω is always satisfied. This means that gradient descent is always performed in the feasible region of the optimization problem. Then, according to the proven convergence of the gradient descent method [Boyd and Vandenberghe2004], Algorithm 1 converges to a stationary point of Eq. (11).

3 Experiments

In this section, empirical investigations are conducted to validate the effectiveness of AML. In detail, we first visualize the mechanism of the proposed AML on a synthetic dataset. Then we compare the performance of the proposed method AML (Algorithm 1) with three classical metric learning methods (ITML [Davis et al.2007], LMNN [Weinberger and Saul2009] and FlatGeo [Meyer et al.2011]) and five state-of-the-art metric learning methods (RVML [Perrot and Habrard2015], GMML [Zadeh et al.2016], ERML [Yang et al.2016], DRML [Harandi et al.2017], and DRIFT [Ye et al.2017]) on seven benchmark classification datasets. Next, all methods are compared on three practical datasets related to face verification and image matching. Finally, the parametric sensitivity of AML is studied.

3.1 Experiments on Synthetic Dataset

We first demonstrate the effectiveness of AML on a synthetic dataset containing training and test examples from two classes. The data points are sampled from a multivariate normal distribution and are visualized by their first two principal components [Abdi and Williams2010]. As shown in Figs. 3(a) and (b), the training set is clean, but the test examples of the two classes overlap in the intersection region, leading to many ambiguous test pairs.

Since GMML [Zadeh et al.2016] shares the same loss function with our AML, and the only difference between them is that AML utilizes the adversarial pairs while GMML does not, here we compare only the results of GMML and AML to highlight the usefulness of our adversarial framework. The training and test results of both methods are projected to Euclidean space using the learned metrics. It can be observed that the traditional metric learning model GMML simply learns the optimal metric for the training data, and thus its corresponding projection matrix directly maps the data points onto the horizontal axis of the training set (Fig. 3(c)). However, such a learned metric is confused by the data points in the test set (Fig. 3(d)), as the two classes are very close to each other there; as a result, the two classes are not well separated. In contrast, the proposed AML not only produces an impressive result on the training set (Fig. 3(e)), but also generates very discriminative results on the test set (Fig. 3(f)): test data belonging to the same class are successfully grouped together, while examples of different classes are kept apart. This good performance of AML owes to the adversarial data pairs shown in Fig. 3(a). Such difficult yet critical training pairs effectively cover the ambiguous situations that may appear in the test set, thereby enhancing the generalizability and discriminability of our AML.

Figure 3: Visual comparison of GMML and the proposed AML on the synthetic dataset. Although the traditional metric learning model GMML obtains a satisfactory training result, it cannot handle test cases with ambiguous pairs well. In contrast, our proposed AML shows good discriminability on both the training and test sets, because the generated adversarial training pairs help to boost its discriminability.

3.2 Experiments on Classification

To evaluate the performance of the compared methods on the classification task, we follow existing works [Xie and Xing2013, Lin et al.2017] and adopt the k-NN classifier based on the learned metrics to investigate the classification error rate. The datasets are from the well-known UCI repository [Asuncion and Newman2007], and include Breast-Cancer, Vehicle, German-Credit, Image-Segment, Isolet, Letters and MNIST. The numbers of classes, examples and features are displayed in Table 1. We compare all methods over multiple random trials; in each trial, a portion of the examples are randomly selected for training, and the rest are used for testing. Following the recommendation in [Zadeh et al.2016], the training pairs are generated by randomly picking pairs among the training examples. The parameters in our method are tuned by grid search, and the parameters of the baseline algorithms are also carefully tuned to achieve their optimal results. The average classification error rates are shown in Table 1, where AML obtains the best results in most cases.
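
A simplified stand-in for this evaluation protocol (1-NN instead of the paper's k-NN, with a toy split; the datasets and splits above are not reproduced) can be sketched as:

```python
import numpy as np

def one_nn_error(M, X_tr, y_tr, X_te, y_te):
    """1-NN error rate under a Mahalanobis metric M. Uses M = C C^T
    (Cholesky) to reduce d_M to Euclidean distance after projecting
    each example x -> C^T x."""
    C = np.linalg.cholesky(M)
    P_tr, P_te = X_tr @ C, X_te @ C
    errors = 0
    for p, y in zip(P_te, y_te):
        nn = np.argmin(np.sum((P_tr - p) ** 2, axis=1))
        errors += int(y_tr[nn] != y)
    return errors / len(y_te)

# Toy data: two well-separated classes, so the error should be zero.
X_tr = np.array([[0.0, 0.0], [10.0, 10.0]])
y_tr = np.array([0, 1])
X_te = np.array([[1.0, 1.0], [9.0, 9.0]])
y_te = np.array([0, 1])
assert one_nn_error(np.eye(2), X_tr, y_tr, X_te, y_te) == 0.0
```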

Table 1: Classification error rates of the k-nearest-neighbor classifier based on the metrics output by different methods. The three numbers below each dataset name are the feature dimensionality, the number of classes, and the number of examples. The best two results on each dataset are highlighted in red and blue, respectively.
Figure 4: ROC curves of different methods on (a) PubFig, (b) LFW and (c) MVS datasets. AUC values are presented in the legends.

3.3 Experiments on Verification

We also use two face datasets and one image matching dataset to evaluate the compared methods on the image verification task. The PubFig face dataset [Nair and Hinton2010] consists of pairs of images belonging to a set of public figures, of which the first portion of pairs is selected for training and the rest are used for testing. Similar experiments are performed on the LFW face dataset [Huang et al.2007], which includes unconstrained face images of many individuals. The image matching dataset MVS [Brown et al.2011] consists of gray-scale images sampled from 3D reconstructions of the Statue of Liberty (LY), Notre Dame (ND) and Half Dome in Yosemite (YO). Following the settings in [Simo-Serra et al.2015], LY and ND are put together to form the training set of image patch pairs, and patch pairs from YO are used for testing. The features for these experiments are extracted by DSIFT [Cheung and Hamarneh2009] for the face datasets (i.e. PubFig and LFW) and by a Siamese-CNN [Zagoruyko and Komodakis2015] for the image patch dataset (i.e. MVS). We plot the Receiver Operating Characteristic (ROC) curve by varying the threshold on the distance metrics, and compute the Area Under Curve (AUC) to evaluate the performance quantitatively. From the ROC curves and AUC values in Fig. 4, it is clear that AML consistently outperforms the other methods.
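
The AUC of such a threshold-based verification rule can be computed directly from the pairwise distances. A small sketch with hypothetical distance values (not the paper's measurements):

```python
import numpy as np

def verification_auc(distances, is_same):
    """AUC for a distance-threshold verification rule: the probability
    that a random genuine pair has a smaller distance than a random
    impostor pair (ties count half)."""
    pos = distances[is_same]       # genuine (same-identity) pairs
    neg = distances[~is_same]      # impostor pairs
    wins = (pos[:, None] < neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

d = np.array([0.1, 0.2, 0.5, 0.9])
same = np.array([True, True, False, False])
assert verification_auc(d, same) == 1.0   # genuine pairs are all closer
```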

3.4 Parametric Sensitivity

In our proposed AML, there are two parameters that might influence the model performance: the coefficient in Eq. (4), which determines the weight between the original training data and the generated adversarial data, and the coefficient in Eq. (3), which controls the size of the neighborhood producing the adversarial data.

Figure 5: Parametric sensitivity on the MNIST dataset. (a) Error rates under different values of the adversarial-weight parameter; (b) error rates under different values of the neighborhood-size parameter.

Intuitively, increasing the adversarial-weight parameter raises the importance of the adversarial data, while decreasing it makes the model put more emphasis on the original training data. As shown in Fig. 5(a), we vary this parameter and record the training and test errors on the MNIST dataset used in Section 3.2. An interesting finding is that the training error grows as the parameter increases, while the test error consistently decreases. This is because enlarging the weight of the adversarial data helps to alleviate over-fitting, so test data with a distribution bias from the training data can be better handled. The training and test errors reach a compromise at a moderate value, which is therefore an ideal choice for this parameter. Similarly, the neighborhood-size parameter is varied over a range and the corresponding training and test errors are recorded in Fig. 5(b). A moderate value again renders the highest test accuracy, and the performance is generally stable around it, which means that this parameter can be easily tuned for practical use.

4 Conclusion

In this paper, we propose a metric learning framework named Adversarial Metric Learning (AML), which contains two competing stages: confusion and distinguishment. The confusion stage adaptively generates adversarial data pairs to enhance the capability of the learned metric to handle ambiguous test pairs. To the best of our knowledge, this is the first work to introduce the adversarial framework to metric learning, and the visualization results demonstrate that the generated adversarial data critically enrich the knowledge available for model training, enabling the learning algorithm to acquire a more reliable and precise metric than state-of-the-art methods. Furthermore, we show that the adversarial process can be compactly unified into a bi-level optimization problem, which is theoretically proved to admit a convergent solver. Since the proposed AML framework is general in nature, applying AML to deep-neural-network-based metric learning models is a promising direction for future work.


  • [Abdi and Williams2010] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
  • [Ahmed et al.2015] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In CVPR, pages 733–740, 2015.
  • [Arsigny et al.2007] Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007.
  • [Asuncion and Newman2007] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
  • [Bard2013] Jonathan F Bard. Practical Bilevel Optimization: Algorithms and Applications, volume 30. Springer Science & Business Media, 2013.
  • [Bellman1997] Richard Bellman. Introduction to Matrix Analysis. SIAM, 1997.
  • [Ben-Tal et al.2009] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.
  • [Bishop2006] Christopher M Bishop. Pattern Recognition and Machine Learning. springer, 2006.
  • [Boyd and Vandenberghe2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge university press, 2004.
  • [Brown et al.2011] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):43–57, 2011.
  • [Cheung and Hamarneh2009] Warren Cheung and Ghassan Hamarneh. N-sift: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing, 18(9):2012–2021, 2009.
  • [Davis et al.2007] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
  • [Goodfellow et al.2015] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, pages 212–227, 2015.
  • [Harandi et al.2017] Mehrtash Harandi, Mathieu Salzmann, and Richard Hartley. Joint dimensionality reduction and metric learning: A geometric take. In ICML, pages 1943–1950, 2017.
  • [Huang et al.2007] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
  • [Li et al.2017] Zheng Li, Yu Zhang, and Ying Wei. End-to-end adversarial memory network for cross-domain sentiment classification. In IJCAI, pages 2133–2140, 2017.
  • [Lin et al.2017] Liang Lin, Guangrun Wang, Wangmeng Zuo, Xiangchu Feng, and Lei Zhang. Cross-domain visual matching via generalized similarity measure and feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1089–1102, 2017.
  • [Lu et al.2014] Jiwen Lu, Xiuzhuang Zhou, Yap-Pen Tan, Yuanyuan Shang, and Jie Zhou. Neighborhood repulsed metric learning for kinship verification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):331–345, 2014.
  • [Luo and Huang2018] Lei Luo and Heng Huang. Matrix variate gaussian mixture distribution steered robust metric learning. In AAAI, pages 933–940, 2018.
  • [Meyer et al.2011] Gilles Meyer, Silvère Bonnabel, and Rodolphe Sepulchre. Regression on fixed-rank positive semidefinite matrices: a riemannian approach. Journal of Machine Learning Research, 12(Feb):593–625, 2011.
  • [Nair and Hinton2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
  • [Noroozi et al.2017] Vahid Noroozi, Lei Zheng, Sara Bahaadini, Sihong Xie, and Philip S Yu. Seven: Deep semi-supervised verification networks. In IJCAI, pages 133–140, 2017.
  • [Oh Song et al.2016] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012, 2016.
  • [Perrot and Habrard2015] Michael Perrot and Amaury Habrard. Regressive virtual metric learning. In NIPS, pages 1810–1818, 2015.
  • [Petersen et al.2008] Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7:15, 2008.
  • [Russell1996] Bertrand Russell. The Principles of Mathematics. WW Norton & Company, 1996.
  • [Simo-Serra et al.2015] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, pages 118–126, 2015.
  • [Weinberger and Saul2009] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
  • [Xie and Xing2013] Pengtao Xie and Eric P Xing. Multi-modal distance metric learning. In IJCAI, pages 1806–1812, 2013.
  • [Yang et al.2010] Liu Yang, Rong Jin, Lily Mummert, Rahul Sukthankar, Adam Goode, Bin Zheng, Steven CH Hoi, and Mahadev Satyanarayanan. A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):30–44, 2010.
  • [Yang et al.2016] Xun Yang, Meng Wang, Luming Zhang, and Dacheng Tao. Empirical risk minimization for metric learning using privileged information. In IJCAI, pages 2266–2272, 2016.
  • [Ye et al.2017] Han-Jia Ye, De-Chuan Zhan, Xue-Min Si, and Yuan Jiang. Learning mahalanobis distance metric: Considering instance disturbance helps. In IJCAI, pages 866–872, 2017.
  • [Zadeh et al.2016] Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In ICML, pages 2464–2471, 2016.
  • [Zagoruyko and Komodakis2015] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In ICCV, pages 4353–4361, 2015.
  • [Zhang and Zhang2017] Jie Zhang and Lijun Zhang. Efficient stochastic optimization for low-rank distance metric learning. In AAAI, pages 933–940, 2017.