1 Introduction
The calculation of similarity or distance between a pair of data points plays a fundamental role in many machine learning and pattern recognition tasks such as retrieval [Yang et al., 2010], verification [Noroozi et al., 2017], and classification [Yang et al., 2016]. Therefore, “Metric Learning” [Bishop, 2006; Weinberger and Saul, 2009] was proposed to enable an algorithm to acquire an appropriate distance metric so that the precise similarity between different examples can be faithfully reflected. In metric learning, the similarity between two example vectors
$\mathbf{x}_i$ and $\mathbf{x}_j$ is usually expressed by a distance function $d(\mathbf{x}_i, \mathbf{x}_j)$. Perhaps the most commonly used distance function is the Mahalanobis distance, which has the form $d_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^\top M (\mathbf{x}_i - \mathbf{x}_j)$ (for simplicity, the notation of “square” on $d_M$ is omitted, which does not influence the final output). Here the symmetric positive definite (SPD) matrix $M$ should be learned by an algorithm to fit the similarity reflected by the training data. By decomposing $M$ as $L^\top L$, we know that the Mahalanobis distance intrinsically calculates the Euclidean distance in a projected linear space rendered by the projection matrix $L$, namely $d_M(\mathbf{x}_i, \mathbf{x}_j) = \|L\mathbf{x}_i - L\mathbf{x}_j\|_2^2$. Consequently, a large number of models were proposed to either directly pursue the Mahalanobis matrix $M$ [Davis et al., 2007; Zadeh et al., 2016; Zhang and Zhang, 2017] or indirectly learn such a linear projection [Lu et al., 2014; Harandi et al., 2017]. Furthermore, considering that the above linear transformation is not flexible enough to characterize complex data relationships, some recent works utilized deep neural networks, e.g. the Convolutional Neural Network (CNN) [Simo-Serra et al., 2015; Oh Song et al., 2016], to achieve nonlinearity. Generally, the kernel- or CNN-based nonlinear distance metrics can be summarized as $d(\mathbf{x}_i, \mathbf{x}_j) = \|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|_2^2$, in which the output of the kernel or neural network is denoted by the mapping $\phi(\cdot)$.

However, the above existing approaches simply learn linear or nonlinear metrics by designing different loss functions on the original training pairs. During the test phase, due to the distribution bias between the training set and the test set, some ambiguous data pairs that are difficult to distinguish by the learned metric may appear, which significantly impairs the algorithm performance. To this end, we propose Adversarial Metric Learning (AML) to learn a robust metric, which follows the idea of adversarial training [Goodfellow et al., 2015; Li et al., 2017] and is able to generate ambiguous but critical data pairs to enhance the algorithm's robustness. As shown in Fig. 1, compared with traditional metric learning methods that only distinguish the given training pairs, our AML learns the metric to distinguish both the original training pairs and the generated adversarial pairs. Here, the adversarial data pairs are automatically synthesized by the algorithm to confuse the learned metric as much as possible. The adversarial pairs and the learned metric form an adversarial relationship in which each tries to “beat” the other. Specifically, the adversarial pairs tend to introduce ambiguous examples whose (dis)similarities are difficult for the learned metric to decide correctly (i.e. the confusion stage), while the metric makes its effort to discriminate the confusing adversarial pairs (i.e. the distinguishment stage). In this sense, the adversarial pairs help our model to acquire an accurate metric. To avoid iterative competition, we convert the adversarial game to an optimization problem which has an optimal solution from the theoretical perspective.
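The equivalence between the Mahalanobis distance under $M = L^\top L$ and the squared Euclidean distance after projecting by $L$ can be checked numerically. The sketch below uses NumPy with randomly generated data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Any square matrix L with full rank yields an SPD matrix M = L^T L.
L = rng.standard_normal((d, d))
M = L.T @ L

x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)

# Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j) ...
diff = x_i - x_j
d_mahal = diff @ M @ diff

# ... equals the squared Euclidean distance between L x_i and L x_j.
d_euclid = np.sum((L @ x_i - L @ x_j) ** 2)

print(np.isclose(d_mahal, d_euclid))  # True
```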
In the experiments, we show that the robust Mahalanobis metric learned by AML is superior to state-of-the-art metric learning models on popular datasets for classification and verification tasks.
The most prominent advantage of our proposed AML is that extra data pairs (i.e. adversarial pairs) are explored to boost the discriminability of the learned metric. In fact, several metric learning models have been proposed based on data augmentation [Ahmed et al., 2015; Zagoruyko and Komodakis, 2015] or pair perturbation [Perrot and Habrard, 2015; Ye et al., 2017]. However, the virtual data generated by these methods are largely based on prior assumptions which may significantly differ from the practical test data, so their performance is rather limited. In contrast, the additional adversarial pairs in AML are consciously designed to mislead the learned metric, so they are formed in an intentional and realistic way. Specifically, to narrow the search space of adversarial pairs, AML establishes the adversarial pairs within the neighborhoods of the original training pairs, as shown in Fig. 2. Thanks to learning on both real and generated pairs, the discriminability of our method can be substantially improved.
The main contributions of this paper are summarized as:

We propose a novel framework dubbed Adversarial Metric Learning (AML), which is able to generate adversarial data pairs in addition to the originally given training data to enhance the model's discriminability.

AML is converted into an optimization framework, of which the convergence is analyzed.

AML is empirically validated to outperform state-of-the-art metric learning models on typical applications.
2 Adversarial Metric Learning
We first introduce some necessary notations in Section 2.1, and then explain the optimization model of the proposed AML in Section 2.2. Finally, we provide the iterative solution as well as the convergence proof in Section 2.3 and Section 2.4, respectively.
2.1 Preliminaries
Let be the matrix of training example pairs, where each column consists of a pair of training examples. Besides, we define a label vector of which the $t$-th element $y_t$ represents the relationship of the pairwise examples recorded by the $t$-th column, namely $y_t = 1$ if the two paired examples are similar, and $y_t = -1$ otherwise. Based on the supervised training data, the Mahalanobis metric is learned by minimizing the general loss function, namely
(1) 
in which denotes the feasible set, such as the SPD constraint [Arsigny et al., 2007], the bounded constraint [Yang et al., 2016], the low-rank constraint [Harandi et al., 2017], etc. In our proposed AML, the generated adversarial pairs are denoted by the matrix , where represents the $t$-th generated example pair. The distance between and is thus defined as the sum of the Mahalanobis distances between all the pairwise examples of and .
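As an illustration of the pair-level distance described above, the following sketch sums squared Mahalanobis distances over the examples of two pairs. The convention of matching corresponding columns of the two pair matrices is an assumption made here for concreteness:

```python
import numpy as np

def mahalanobis_sq(u, v, M):
    """Squared Mahalanobis distance between vectors u and v under M."""
    diff = u - v
    return diff @ M @ diff

def pair_distance(X_t, Z_t, M):
    """Distance between two example pairs, taken here as the sum of the
    Mahalanobis distances between corresponding columns of X_t and Z_t."""
    return sum(mahalanobis_sq(X_t[:, k], Z_t[:, k], M)
               for k in range(X_t.shape[1]))

M = np.eye(2)                                  # Euclidean special case
X_t = np.column_stack(([0.0, 0.0], [1.0, 0.0]))  # pair: (0,0) and (1,0)
Z_t = np.column_stack(([0.0, 1.0], [1.0, 1.0]))  # pair: (0,1) and (1,1)
print(pair_distance(X_t, Z_t, M))  # 2.0
```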
2.2 Model Establishment
As mentioned in the Introduction, our AML algorithm alternates between learning a reliable distance metric (i.e. the distinguishment stage) and generating misleading adversarial data pairs (i.e. the confusion stage), where the latter is the core of AML for boosting the learning performance. The main target of the confusion stage is to produce adversarial pairs that confuse the learned metric. That is to say, we should explore pairs whose similarity predicted by the metric is opposite to their true labels. Fig. 2 intuitively illustrates this generation process. To achieve this effect, we search for data pairs in the neighborhood of the training pairs that violate the results predicted by the learned metric. Specifically, the loss function is expected to be as large as possible, while the distance is preferred to be small, in the following optimization objective
(2) 
in which the regularization coefficient is manually tuned to control the size of the search space. Since each adversarial pair is found in the neighborhood of its original training pair, its true label is reasonably assumed to be the label of that training pair. This means that Eq. (2) tries to find data pairs whose metric results are opposite to their corresponding true labels. Therefore, such an optimization exploits the adversarial pairs to confuse the metric.
Nevertheless, Eq. (2) cannot be directly taken as a valid optimization problem, as it is unbounded, which means that Eq. (2) might not have an optimal solution. To avoid this problem while achieving the same effect as Eq. (2), we convert the maximization of the loss w.r.t. the true labels to the minimization of the loss w.r.t. the opposite labels, because the opposite labels yield the opposite similarities when they are used to supervise the minimization of the loss function. Then the confusion stage is reformulated as
(3) 
The optimal solution to the above problem always exists, because both the loss function and the distance term are bounded below.
By solving Eq. (3), we obtain the generated adversarial pairs recorded in the matrix , which can be employed to learn a proper metric. Since these confusing adversarial pairs are incorrectly predicted, the metric should exhaustively distinguish them to improve its discriminability. By combining the adversarial pairs in and the originally available training pairs in , the augmented training loss utilized in the distinguishment stage has the form
(4) 
where the regularization coefficient is manually tuned to control the weight of the adversarial data. Furthermore, to improve both the distinguishment (i.e. Eq. (4)) and the confusion (i.e. Eq. (3)) during their adversarial game, they have to be optimized alternately, i.e.
(5) 
However, the straightforward implementation of Eq. (5) confronts two problems in practical use. Firstly, Eq. (5) is performed iteratively, which greatly decreases the efficiency of the model. Secondly, alternating iterations between two different objective functions are not necessarily convergent [Ben-Tal et al., 2009].
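The second concern arises even for very simple objectives: on the bilinear saddle problem min_x max_y xy, naive simultaneous gradient steps multiply the squared norm of the iterate by (1 + eta^2) at every step, so the iterates spiral outward instead of converging. A toy illustration of this well-known phenomenon (not the AML objective itself):

```python
def simultaneous_gda(x, y, eta=0.1, steps=1000):
    """Naive simultaneous gradient descent/ascent on f(x, y) = x * y:
    descend in x, ascend in y.  Each step multiplies x^2 + y^2 by
    (1 + eta^2), so the iterates spiral outward instead of converging."""
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x
    return x, y

x, y = simultaneous_gda(1.0, 1.0)
print(x * x + y * y > 1000.0)  # True: the iterates diverged
```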
To achieve a similar effect to the direct alternation in Eq. (5) while avoiding the two disadvantages mentioned above, the iterative expression for the adversarial pairs is integrated into the optimization of the metric. Therefore, our AML is ultimately expressed as a bilevel optimization problem [Bard, 2013], namely
(6) 
in which denotes the optimal adversarial pair matrix, and the inner objective is required to be strictly quasi-convex (if a function $f$ satisfies $f(\lambda x + (1-\lambda)y) < \max\{f(x), f(y)\}$ for all $x \neq y$ and $\lambda \in (0,1)$, then $f$ is strictly quasi-convex). Note that the strict quasi-convexity ensures the uniqueness of the inner solution, and helps to make the problem well-defined.
2.3 Optimization
To implement Eq. (6), we instantiate the loss to obtain a specific learning model. To make the inner problem convex, here we employ the geometric-mean loss [Zadeh et al., 2016], which has the unconstrained form
(7)
where the loss of dissimilar data pairs is expressed via the inverse metric to increase the distances between dissimilar examples. Moreover, we substitute the loss in Eq. (6) with this choice, and impose the SPD constraint for simplicity. Then the detailed optimization algorithm for Eq. (6) is provided as follows.
Solving for the adversarial pairs: We can directly obtain the closed-form solution (i.e. the optimal adversarial pairs) for the inner problem. Specifically, by using the convexity of the objective, we set its gradient to zero and arrive at
(8) 
which holds for any . It is clear that the equation system Eq. (8) has the unique solution
(9) 
where . Such a closed-form solution means that the minimizer of can be directly expressed as
(10) 
where is a mapping from to determined by Eq. (9). Hence Eq. (6) is equivalently converted to
(11) 
which is an unconstrained optimization problem regarding the single variable .
Solving for the metric: The key point is to calculate the gradient of the second term. We substitute the adversarial pairs with Eq. (9) and obtain
(12) 
where , and is the eigendecomposition of . Each term summed in the above Eq. (12) can be compactly written as , where and is a differentiable function. The gradient of such a term can be obtained from the properties of eigenvalues and eigenvectors [Bellman, 1997]. By further leveraging the chain rule of matrix derivatives [Petersen et al., 2008], the gradient of Eq. (12) can be expressed as
(13)
in which
(14) 
Finally, the gradient of the objective in Eq. (11) equals
(15) 
in which the matrices and . It should be noted that the gradient can be calculated efficiently for the SPD matrix , since it only depends on an eigendecomposition.
Now we can simply employ a gradient-based method for SPD optimization to solve our proposed model. Following the popular SPD algorithms in existing metric learning models [Ye et al., 2017; Luo and Huang, 2018], a projection operator is utilized to preserve the SPD property. Specifically, for a symmetric matrix , the projection truncates the negative eigenvalues of , i.e.
(16) 
It can be proved that the metric remains symmetric after each gradient descent step, so the projection operator is leveraged within the gradient descent to find the optimal solution. The pseudo-code for solving Eq. (6) is provided in Algorithm 1, where the step size is fixed in our experiments.
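The projection step described above can be sketched as follows. Truncating eigenvalues at zero gives the projection onto the PSD cone; the small positive floor `eps` is an implementation choice of this sketch (not part of the paper's algorithm) that can be used to keep the iterate strictly SPD:

```python
import numpy as np

def project_psd(S, eps=0.0):
    """Project a symmetric matrix onto the PSD cone: eigendecompose and
    truncate negative eigenvalues (use eps > 0 to enforce strict SPD)."""
    S = (S + S.T) / 2.0                 # guard against round-off asymmetry
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, eps)) @ V.T

# A symmetric matrix with eigenvalues 3 and -1 ...
A = np.array([[1.0, 2.0], [2.0, 1.0]])
P = project_psd(A)
print(P)  # [[1.5 1.5]
          #  [1.5 1.5]]
```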
2.4 Convergence Analysis
Since our proposed bilevel optimization problem differs greatly from traditional metric learning models, here we provide a theoretical analysis of the algorithm's convergence.
Firstly, to ensure that the definition of AML in Eq. (6) is valid, we prove that the optimal solution always exists and is unique by showing the strict (quasi-)convexity of the inner objective when instantiated with the geometric-mean loss, namely
Theorem 1.
Assume that . Then is strictly convex.
Proof.
Assume that , and . By invoking the SPD property of both and , we have
(17) 
where , , and denotes or . Hence satisfies the definition of a strictly convex function. Similarly, it is easy to check the convexity of w.r.t. , which completes the proof. ∎
Furthermore, as we employ the projection in the gradient descent, it is necessary to demonstrate that every iterate has an orthogonal eigendecomposition. Otherwise, the projection cannot be executed and the SPD property is not guaranteed. Therefore, in the following Theorem 2, we prove that the gradient matrix is symmetric, and thus all iteration points remain in the feasible set.
Theorem 2.
For any differentiable functions , the matrix in Eq. (15) is symmetric.
Proof.
Assume that the SPD matrix has the eigendecomposition , where and are the matrices consisting of the distinct eigenvalues and unit eigenvectors of , respectively. For the function in Eq. (13), we simply let . By applying Maclaurin's formula [Russell, 1996] to each eigenvalue, namely , equals
(18) 
Since the gradient of is symmetric for any [Petersen et al., 2008], the sum of the gradient matrices is also symmetric, which completes the proof. ∎
Now it has been proved that remains symmetric during the iterations, and thus the projection ensures the SPD property of , i.e., the constraint is always satisfied. This means that the gradient descent is always performed within the feasible region of the optimization problem. Then, according to the proven convergence of the gradient descent method [Boyd and Vandenberghe, 2004], Algorithm 1 converges to a stationary point of Eq. (11).
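The mechanism behind Theorem 2, applying a scalar function to an SPD matrix through its eigendecomposition and obtaining a symmetric result, can be checked numerically. A minimal sketch with an arbitrarily constructed SPD matrix:

```python
import numpy as np

def matrix_function(M, f):
    """Apply a scalar function f to an SPD matrix via its
    eigendecomposition M = U diag(w) U^T, returning U diag(f(w)) U^T."""
    w, U = np.linalg.eigh(M)
    return (U * f(w)) @ U.T

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
M = B @ B.T + 3.0 * np.eye(3)          # SPD by construction

S = matrix_function(M, np.log)          # matrix logarithm of M
print(np.allclose(S, S.T))                          # True: symmetry preserved
print(np.allclose(matrix_function(S, np.exp), M))   # True: exp(log(M)) == M
```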
3 Experiments
In this section, empirical investigations are conducted to validate the effectiveness of AML. In detail, we first visualize the mechanism of the proposed AML on a synthetic dataset. Then we compare the performance of the proposed AML (Algorithm 1) with three classical metric learning methods (ITML [Davis et al., 2007], LMNN [Weinberger and Saul, 2009] and FlatGeo [Meyer et al., 2011]) and five state-of-the-art metric learning methods (RVML [Perrot and Habrard, 2015], GMML [Zadeh et al., 2016], ERML [Yang et al., 2016], DRML [Harandi et al., 2017], and DRIFT [Ye et al., 2017]) on seven benchmark classification datasets. Next, all methods are compared on three practical datasets related to face verification and image matching. Finally, the parametric sensitivity of AML is studied.
3.1 Experiments on Synthetic Dataset
We first demonstrate the effectiveness of AML on a synthetic dataset that contains training and test examples across two classes. The data points are sampled from a multivariate normal distribution and are visualized by their first two principal components [Abdi and Williams, 2010]. As shown in Figs. 3(a) and (b), the training set is clean, but the test examples belonging to the two classes overlap in the intersection region and lead to many ambiguous test data pairs.

Since GMML [Zadeh et al., 2016] shares the same loss function as our AML, and the only difference between GMML and AML is that AML utilizes the adversarial points while GMML does not, here we only compare the results of GMML and AML to highlight the usefulness of our adversarial framework. The training and test results of both methods are projected to Euclidean space by using the learned metrics. It can be found that the traditional metric learning model GMML simply learns the optimal metric for the training data, and thus its corresponding projection matrix directly maps the data points onto the horizontal axis in the training set (Fig. 3(c)). However, such a learned metric is confused by the data points in the test set (Fig. 3(d)), as the two classes are very close to each other there. As a result, the two classes are not well separated by the learned metric. In contrast, the proposed AML not only produces a very impressive result on the training set (Fig. 3(e)), but also generates very discriminative results on the test set (Fig. 3(f)). The test data belonging to the same class are successfully grouped together, while the examples of different classes are separated apart. This good performance of AML owes to the adversarial data pairs marked in Fig. 3(a). Such difficult yet critical training pairs effectively cover the ambiguous situations that may appear in the test set, thereby enhancing the generalizability and discriminability of our AML.
3.2 Experiments on Classification
To evaluate the performance of the compared methods on the classification task, we follow existing works [Xie and Xing, 2013; Lin et al., 2017] and adopt the kNN classifier based on the learned metrics to investigate the classification error rate. The datasets are from the well-known UCI repository [Asuncion and Newman, 2007], and include BreastCancer, Vehicle, GermanCredit, ImageSegment, Isolet, Letters and MNIST. The numbers of contained classes, examples and features are displayed in Table 1. We compare all methods over random trials. In each trial, a fraction of the examples is randomly selected for training, and the rest are used for testing. Following the recommendation in [Zadeh et al., 2016], the training pairs are generated by randomly picking pairs among the training examples. The parameters in our method are tuned by grid search. The parameters of the baseline algorithms are also carefully tuned to achieve their optimal results. The average classification error rates of the compared methods are shown in Table 1, and we find that AML obtains the best results compared with the other methods in most cases.

Table 1 reports the error rates on BreastCancer, Vehicle, GermanCredit, ImageSegment, Isolet, Letters and MNIST for: ITML (ICML 2007), LMNN (JMLR 2009), FlatGeo (JMLR 2011), RVML (NIPS 2015), GMML (ICML 2016), ERML (IJCAI 2016), DRML (ICML 2017), DRIFT (IJCAI 2017), and AML (ours).
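The classification protocol above, a kNN classifier driven by a learned Mahalanobis metric, can be sketched as follows. The Cholesky factorization and the toy data are illustrative choices of this sketch, not details from the paper:

```python
import numpy as np

def metric_knn_predict(M, X_train, y_train, X_test, k=1):
    """k-NN classification under a learned Mahalanobis metric M:
    factor M = L^T L via Cholesky, project all points by L, and run
    plain Euclidean k-NN in the projected space."""
    L = np.linalg.cholesky(M).T         # M = L^T L (L upper-triangular)
    P_train, P_test = X_train @ L.T, X_test @ L.T
    preds = []
    for p in P_test:
        dists = np.sum((P_train - p) ** 2, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])   # majority vote
    return np.array(preds)

# With the identity metric this reduces to ordinary Euclidean k-NN.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.0, 0.5], [5.0, 5.5]])
print(metric_knn_predict(np.eye(2), X_train, y_train, X_test))  # [0 1]
```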
3.3 Experiments on Verification
We also use two face datasets and one image matching dataset to evaluate the capabilities of all compared methods on the image verification task. The PubFig face dataset [Nair and Hinton, 2010] consists of pairs of images belonging to multiple people, in which the first pairs are selected for training and the rest are used for testing. Similar experiments are performed on the LFW face dataset [Huang et al., 2007], which includes unconstrained face images of many individuals. The image matching dataset MVS [Brown et al., 2011] consists of grayscale images sampled from 3D reconstructions of the Statue of Liberty (LY), Notre Dame (ND) and Half Dome in Yosemite (YO). Following the settings in [Simo-Serra et al., 2015], LY and ND are put together to form the training set of image patch pairs, and the patch pairs in YO are used for testing. The features for the above experiments are extracted by DSIFT [Cheung and Hamarneh, 2009] for the face datasets (i.e. PubFig and LFW) and by a Siamese CNN [Zagoruyko and Komodakis, 2015] for the image patch dataset (i.e. MVS). We plot the Receiver Operating Characteristic (ROC) curve by varying the threshold of each distance metric. Then the values of the Area Under the Curve (AUC) are calculated to evaluate the performance quantitatively. From the ROC curves and AUC values in Fig. 4, it is clear that AML consistently outperforms the other methods.
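The AUC evaluation used here can be reproduced from raw pair distances without any plotting library: the sketch below computes the AUC directly via the Wilcoxon-Mann-Whitney statistic, which equals the area under the ROC curve obtained by sweeping the distance threshold (variable names are illustrative):

```python
import numpy as np

def verification_auc(distances, labels):
    """AUC for verification, where a smaller distance should indicate a
    positive ('same') pair.  Scores are negated distances, and the AUC is
    the probability that a random positive pair scores higher than a
    random negative pair (ties counted as half)."""
    scores = -np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Perfect separation: every positive pair is closer than every negative.
print(verification_auc([0.1, 0.2, 0.9, 1.1], [1, 1, 0, 0]))  # 1.0
```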
3.4 Parametric Sensitivity
In our proposed AML, there are two parameters that might influence the model performance. The parameter in Eq. (4) determines the trade-off between the original training data and the generated adversarial data, and the parameter in Eq. (3) controls the size of the neighborhood producing the adversarial data.
Intuitively, increasing the former raises the importance of the adversarial data, while decreasing it makes the model put more emphasis on the original training data. As shown in Fig. 5(a), we vary this parameter and record the training error and test error on the MNIST dataset used in Section 3.2. An interesting finding is that the training error grows as the parameter increases over the examined range, but the test error consistently decreases at the same time. This is because tuning up the weight of the adversarial data helps to alleviate overfitting, and thus test data with a distribution bias from the training data can be better dealt with. We also find that the training error and test error reach a compromise at an intermediate value, which is therefore an ideal choice for this parameter. Similarly, the neighborhood parameter is varied and the corresponding training error and test error are recorded in Fig. 5(b). It is clear that a moderate value renders the highest test accuracy, and the performance is generally stable around it, which means that this parameter can be easily tuned for practical use.
4 Conclusion
In this paper, we propose a metric learning framework, named Adversarial Metric Learning (AML), which contains two important competing stages: confusion and distinguishment. The confusion stage adaptively generates adversarial data pairs to enhance the capability of the learned metric to deal with ambiguous test data pairs. To the best of our knowledge, this is the first work to introduce the adversarial framework to metric learning, and the visualization results demonstrate that the generated adversarial data critically enrich the knowledge for model training, thus making the learning algorithm acquire a more reliable and precise metric than the state-of-the-art methods. Furthermore, we show that such an adversarial process can be compactly unified into a bilevel optimization problem, which is theoretically proved to have a globally convergent solver. Since the proposed AML framework is general in nature, it is very promising to apply AML to more deep-neural-network-based metric learning models in future work.
References
[Abdi and Williams, 2010] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
[Ahmed et al., 2015] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In CVPR, pages 733–740, 2015.
[Arsigny et al., 2007] Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007.
[Asuncion and Newman, 2007] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
[Bard, 2013] Jonathan F Bard. Practical Bilevel Optimization: Algorithms and Applications, volume 30. Springer Science & Business Media, 2013.
[Bellman, 1997] Richard Bellman. Introduction to Matrix Analysis. SIAM, 1997.
[Ben-Tal et al., 2009] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.
[Bishop, 2006] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[Brown et al., 2011] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):43–57, 2011.
[Cheung and Hamarneh, 2009] Warren Cheung and Ghassan Hamarneh. N-SIFT: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing, 18(9):2012–2021, 2009.
[Davis et al., 2007] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
[Goodfellow et al., 2015] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, pages 212–227, 2015.
[Harandi et al., 2017] Mehrtash Harandi, Mathieu Salzmann, and Richard Hartley. Joint dimensionality reduction and metric learning: A geometric take. In ICML, pages 1943–1950, 2017.

[Huang et al., 2007] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[Li et al., 2017] Zheng Li, Yu Zhang, and Ying Wei. End-to-end adversarial memory network for cross-domain sentiment classification. In IJCAI, pages 2133–2140, 2017.
[Lin et al., 2017] Liang Lin, Guangrun Wang, Wangmeng Zuo, Xiangchu Feng, and Lei Zhang. Cross-domain visual matching via generalized similarity measure and feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1089–1102, 2017.
[Lu et al., 2014] Jiwen Lu, Xiuzhuang Zhou, Yap-Peng Tan, Yuanyuan Shang, and Jie Zhou. Neighborhood repulsed metric learning for kinship verification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):331–345, 2014.
[Luo and Huang, 2018] Lei Luo and Heng Huang. Matrix variate Gaussian mixture distribution steered robust metric learning. In AAAI, pages 933–940, 2018.
[Meyer et al., 2011] Gilles Meyer, Silvère Bonnabel, and Rodolphe Sepulchre. Regression on fixed-rank positive semidefinite matrices: A Riemannian approach. Journal of Machine Learning Research, 12(Feb):593–625, 2011.
[Nair and Hinton, 2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
[Noroozi et al., 2017] Vahid Noroozi, Lei Zheng, Sara Bahaadini, Sihong Xie, and Philip S Yu. SEVEN: Deep semi-supervised verification networks. In IJCAI, pages 133–140, 2017.
[Oh Song et al., 2016] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012, 2016.
[Perrot and Habrard, 2015] Michael Perrot and Amaury Habrard. Regressive virtual metric learning. In NIPS, pages 1810–1818, 2015.
[Petersen et al., 2008] Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7:15, 2008.
[Russell, 1996] Bertrand Russell. The Principles of Mathematics. WW Norton & Company, 1996.
[Simo-Serra et al., 2015] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, pages 118–126, 2015.
[Weinberger and Saul, 2009] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
[Xie and Xing, 2013] Pengtao Xie and Eric P Xing. Multi-modal distance metric learning. In IJCAI, pages 1806–1812, 2013.

[Yang et al., 2010] Liu Yang, Rong Jin, Lily Mummert, Rahul Sukthankar, Adam Goode, Bin Zheng, Steven CH Hoi, and Mahadev Satyanarayanan. A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):30–44, 2010.
[Yang et al., 2016] Xun Yang, Meng Wang, Luming Zhang, and Dacheng Tao. Empirical risk minimization for metric learning using privileged information. In IJCAI, pages 2266–2272, 2016.
[Ye et al., 2017] Han-Jia Ye, De-Chuan Zhan, Xue-Min Si, and Yuan Jiang. Learning Mahalanobis distance metric: Considering instance disturbance helps. In IJCAI, pages 866–872, 2017.
[Zadeh et al., 2016] Pourya Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. In ICML, pages 2464–2471, 2016.
[Zagoruyko and Komodakis, 2015] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In ICCV, pages 4353–4361, 2015.
[Zhang and Zhang, 2017] Jie Zhang and Lijun Zhang. Efficient stochastic optimization for low-rank distance metric learning. In AAAI, pages 933–940, 2017.