I Introduction
In mathematics, a metric is a function that defines distances among data samples. It plays a crucial role in many machine learning and data mining algorithms, e.g., $k$-means clustering and the $k$NN classifier. The Euclidean distance is the most commonly used metric; it essentially assigns equal weights to all feature components. Learning a customized metric from the training samples, one that grants larger weights to more discriminative features, can often significantly improve the performance of the associated classifiers [4, 28].
Based on the form of the learned metric, distance metric learning (DML) solutions can be categorized into linear and nonlinear groups [28]. Linear models focus on estimating a “best” affine transformation to deform the data space, such that the resulting pairwise distances agree well with the supervisory information carried by the training samples. Many early works concentrated on linear methods, as they are easy to use, convenient to optimize and robust to overfitting [4]. However, when applied to data with nonlinear structures, linear models show inherently limited separation capability.
Recent years have seen great efforts to generalize linear DML models to nonlinear cases. Such extensions have been pushed forward mainly along two directions: kernelization and localization. Kernelization [23, 14] embeds the input data into a higher dimensional space, hoping that the transformed samples would be more separable in the new domain. While effective in extending some linear DML models, kernelization solutions are prone to overfitting [4], and their utilization is inherently limited by the sizes of the kernel matrices [8].
Localization focuses on forming a global distance function through the integration of multiple local metrics estimated within local neighborhoods or class memberships. How, where and when the local metrics are computed and blended are the major issues [18] of this procedure. Straightforward piecewise linearization has been utilized on several global linear DML solutions, including NCA [13] and LMNN [26], to develop their nonlinear versions (msNCA [10] and mmLMNN [25]). The nonlinear metrics in GLML [16] and PLML [24] are globally learned, on top of basis neighborhood metrics obtained through the minimization of the nearest neighbor (NN) classification errors. To alleviate overfitting and impose general regularity across the learned metrics, metric/matrix regularization has been employed in several solutions [10, 16, 24].
With the exception of PLML, most localization solutions combine component metrics in a rather primitive manner: usually the distance between a pair of data samples is computed as the geodesic going through multiple classes/neighborhoods, with no smoothing control around the neighborhood boundaries. Theoretical discussions are generally lacking as to how the sharp changes across the boundaries would affect the overall classification.
Solutions that directly integrate nonlinear space transformations with classifiers have also been proposed. Shi et al. [20, 5, 21] combine thin-plate splines (TPS) with NN and SVM classifiers, and Zhang et al. [30, 31] utilize coherent point drifting (CPD) models with Laplacian SVM for semi-supervised learning. Significant improvements over the corresponding linear models have been reported.
Integrating DML with deep learning models, especially convolutional neural networks (CNNs), has been attempted in several recent studies [11, 12, 22, 29, 9, 7]. In [22, 29, 9, 7], CNNs are utilized to replace the “handcrafted” feature engineering step, and different architectures have been explored to fully optimize the overall deep metric learning networks. Although many of the deep DML models produce state-of-the-art results, they require a large amount of training data as a prerequisite to perform effectively.
In this paper, we propose a novel piecewise linearization for local metrics fusion. Unlike most piecewise linearization DML methods, our model carries out metric merging based on the velocities of individual transformations at each data point. Such velocities are generated through a geodesic interpolation, on a Lie group, between the identity transformation and the target transformation. The resulting overall nonlinear motion is guaranteed to be a diffeomorphism (a differentiable bijection whose inverse is also differentiable). We term our solution the nonlinear Metric Learning through Geodesic Polylinear Interpolation (MLGPI) model.
The remainder of the paper is organized as follows. Section 2 presents some preliminaries for metric learning, followed by a description of LMNN and its nonlinear extension mmLMNN. Section 3 introduces our MLGPI solution, as well as its advantages over the existing models. Experimental results are presented in Section 4, and Section 5 concludes the paper.
II Preliminaries
A metric space is a set that has a notion of distance between every pair of points. The goal of metric learning is to learn a “better” metric with the aid of the training samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a feature vector and $y_i$ its class label. Let the metric to be sought be denoted by $d_{\theta}$, controlled by a certain parameter vector $\theta$. For the Mahalanobis metric $d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M (x_i - x_j)}$, $\theta$ is a positive semidefinite (PSD) matrix $M \in \mathbb{R}^{d \times d}$, where $d$ is the number of features. If a metric keeps the same-class pairs close while pushing those in different classes far away, it would likely be a good approximation to the underlying semantic metric.
II-A Global linear model: LMNN
LMNN is one of the most widely used Mahalanobis metric learning algorithms, and has been the subject of many extensions [23, 17, 15, 6, 25]. Our MLGPI model is inspired by, and implemented on top of, mmLMNN, one of the nonlinear extensions of LMNN. Therefore, we briefly review these two models here.
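Unlike many other global metric learning methods, LMNN defines its constraints within local neighborhoods, where the “pull force” among the class-equivalent data and the “push force” on the class-nonequivalent data (the “impostors”) are optimized to reach a balanced trade-off. Let $\mathcal{S}$ be the set of same-class pairs, and $\mathcal{D}$ be that of different-class pairs. Formally, the constraints used in LMNN are defined in the following way:

class-equivalent constraint in a neighborhood: $\mathcal{S} = \{(x_i, x_j) : y_i = y_j \text{ and } x_j \text{ belongs to the } k\text{-neighborhood of } x_i\}$;
class-nonequivalent constraint in a neighborhood: $\mathcal{D} = \{(x_i, x_l) : y_i \neq y_l \text{ and } x_l \text{ belongs to the neighborhood of } x_i\}$;
relative triplets in a neighborhood: $\mathcal{R} = \{(x_i, x_j, x_l) : (x_i, x_j) \in \mathcal{S},\ (x_i, x_l) \in \mathcal{D}\}$.
Then, the Mahalanobis metric is learned through the following convex objective:
$$\min_{M \succeq 0,\ \xi \geq 0} \ (1 - \mu) \sum_{(x_i, x_j) \in \mathcal{S}} d_M^2(x_i, x_j) + \mu \sum_{(x_i, x_j, x_l) \in \mathcal{R}} \xi_{ijl}$$
$$\text{s.t.} \quad d_M^2(x_i, x_l) - d_M^2(x_i, x_j) \geq 1 - \xi_{ijl}, \quad \forall (x_i, x_j, x_l) \in \mathcal{R},$$
where $\mu \in [0, 1]$ controls the “pull/push” trade-off. A tailored numerical solver based on gradient descent and a bookkeeping strategy is utilized, enabling LMNN to perform efficiently in practice.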
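As an illustration of the pull/push trade-off, the following minimal sketch evaluates an LMNN-style objective for a fixed linear map. It is not the tailored solver mentioned above. The function name `lmnn_loss`, the parameterization $M = L^\top L$, the choice of target neighbors in the original Euclidean space, and the default values of `k` and `mu` are all assumptions made for this example.

```python
import numpy as np

def lmnn_loss(L, X, y, k=3, mu=0.5):
    """Evaluate an LMNN-style pull/push objective for a fixed linear map L (M = L^T L)."""
    n = len(X)
    # Target neighbors are fixed using Euclidean distances in the input space (assumption).
    d_euc = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Z = X @ L.T
    d_M = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # squared Mahalanobis distances
    pull, push = 0.0, 0.0
    for i in range(n):
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        targets = same[np.argsort(d_euc[i, same])[:k]]
        others = np.flatnonzero(y != y[i])
        for j in targets:
            pull += d_M[i, j]
            # Hinge slack: differently labeled points should be at least
            # one unit farther from x_i than the target neighbor x_j.
            push += np.sum(np.maximum(0.0, 1.0 + d_M[i, j] - d_M[i, others]))
    return (1.0 - mu) * pull + mu * push

# Hypothetical toy data: two tight, well-separated classes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
y = np.array([0, 0, 1, 1])
loss_identity = lmnn_loss(np.eye(2), X, y, k=1)
loss_shrunk = lmnn_loss(0.01 * np.eye(2), X, y, k=1)
```

With the identity map, no margin is violated and only the small pull term remains. Shrinking the space collapses the two classes onto each other, so the push term dominates.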
II-B Nonlinear extension through piecewise linearization
To achieve a nonlinear extension of linear metric learning models, piecewise linearization is a simple yet popular solution, commonly chosen by per-class methods [25, 10, 24, 18]. Taking mmLMNN and PLML as examples, the idea of piecewise linearization is rather simple: in order to learn separate Mahalanobis metrics in different parts of the data space, the original training data are first partitioned into disjoint clusters based on either spatial means or class labels. Different metrics at the cluster centers are then learned simultaneously.
In mmLMNN, data points at the testing stage are mapped with different component metrics before the NN classification decision is made. While straightforward and generally effective, this scheme tends to artificially displace the decision boundaries, especially when the component metrics have rather different scales. This effect is illustrated in Fig. 1. The first row shows two classes of points, red and blue, as well as the associated NN decision boundary (figuratively speaking, not the exact boundary), which is roughly the middle line separating the two groups of boundary points. When the data space is transformed with unbalanced component transformations on the two halves, e.g., the transformation on the blue side is the identity while the one on the red side doubles the horizontal dimension, the samples around the original decision boundary (shaded area) will be classified into the blue class, as their distances to the red circles have been doubled. This means the new decision boundary (the new middle line) is artificially shifted due to the unbalanced metrics. Test samples falling into the shaded area will be misclassified.
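The boundary shift can be reproduced with a one-dimensional toy example. The sketch below is a simplification under stated assumptions, not the setup of Fig. 1: each class is reduced to a single prototype, and the distance to a prototype is measured under that class's own scale factor. The names `nearest_class`, `prototypes` and `scales`, and all numbers, are hypothetical.

```python
def nearest_class(x, prototypes, scales):
    """Return the label of the nearest prototype when the distance to each
    prototype is measured under that prototype's own class metric (1-D case)."""
    best_label, best_dist = None, float("inf")
    for label, p in prototypes.items():
        d = scales[label] * abs(x - p)  # per-class scaled distance
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

prototypes = {"blue": -1.0, "red": 1.0}
x_test = 0.3  # closer to the red prototype in the original space

balanced = nearest_class(x_test, prototypes, {"blue": 1.0, "red": 1.0})
unbalanced = nearest_class(x_test, prototypes, {"blue": 1.0, "red": 2.0})
print(balanced, unbalanced)
```

With balanced scales the boundary sits at 0. Doubling the red scale moves it to 1/3, so in this toy setting every test point between 0 and 1/3 switches from red to blue.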
In PLML, at each instance $x_i$, the local metric $M_i$ is parameterized through a weighted linear combination [24]:
$$M_i = \sum_{k} W_{ik} M_k, \quad W_{ik} \geq 0, \quad \sum_{k} W_{ik} = 1, \tag{1}$$
where $W_{ik}$ is the weight of the cluster metric $M_k$ for the instance $x_i$. Using Eqn. (1), the squared distance of $x_i$ to $x_j$ is:
$$d^2_{M_i}(x_i, x_j) = \sum_{k} W_{ik} \, d^2_{M_k}(x_i, x_j), \tag{2}$$
where $d^2_{M_k}(x_i, x_j)$ is the squared Mahalanobis distance between $x_i$ and $x_j$ under the cluster metric $M_k$. In other words, the pairwise distances, as well as the combination of component metrics, are computed through weighted averaging of the associated displacements. In principle, this simple strategy could be applied to any Mahalanobis metric learning algorithm, extending a global solution to nonlinear cases. However, this approach has an inherent drawback: although all the component metrics are invertible, the resulting global metric is not necessarily invertible in general. Fig. 2 (left) shows the direct fusion of two estimated linear metrics, where a folding occurs around the boundary area. Foldings in space indicate that two different data points could be mapped to the same point after transformation, which would generate inconsistent classification decisions.
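The equivalence between combining the metrics (Eqn. (1)) and combining the squared distances (Eqn. (2)) can be checked numerically. This is a small sketch; the function name `plml_sq_distance`, the two basis metrics and the weights are hypothetical values chosen for illustration.

```python
import numpy as np

def plml_sq_distance(x_i, x_j, weights_i, basis_metrics):
    """Squared distance from x_i to x_j under the local metric M_i = sum_k W_ik * M_k."""
    diff = x_i - x_j
    M_i = sum(w * M for w, M in zip(weights_i, basis_metrics))
    return float(diff @ M_i @ diff)

# Two hypothetical cluster metrics and the weights of instance x_i.
M1, M2 = np.eye(2), np.diag([4.0, 1.0])
w_i = np.array([0.25, 0.75])
x_i, x_j = np.array([0.0, 0.0]), np.array([1.0, 1.0])

d_from_metric = plml_sq_distance(x_i, x_j, w_i, [M1, M2])
# Same quantity computed as a weighted average of per-cluster squared distances.
diff = x_i - x_j
d_from_distances = sum(w * float(diff @ M @ diff) for w, M in zip(w_i, [M1, M2]))
```

Both routes give the same value because the squared Mahalanobis distance is linear in the metric matrix. The weights belong to $x_i$ only, so the distance from $x_j$ to $x_i$ may differ.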
III MLGPI: nonlinear metric learning through geodesic polylinear interpolation
As explained in the previous section, when local metrics with different scales are merged, the decision boundaries between classes could be artificially displaced, leading to erratic classification results. A smooth transition could provide a remedy. The PLML model takes the smoothness issue into consideration, and imposes a manifold regularization step to ensure that the local metrics vary smoothly across the entire data domain. However, the distance between each data pair is obtained through weighted summation of the individual distances under the involved bases. While the weights are smooth, such distance summation ensures neither a diffeomorphism nor smoothness of the overall distance field.
Our solution for smooth transitions is based on a Lie group geodesic interpolation approach that has been utilized in motion interpolation [1, 19] and image registration [3, 2] research. Given the component linear metrics $L_k$, we rely on the averaging of infinitesimal displacements, or velocities, to generate the combined transformation. Each transformation $L_k$ is modeled as an element of a high-dimensional Lie group, and the motion velocity induced by $L_k$ can be calculated through a constant-speed interpolation from the identity transformation to $L_k$. With the weighted velocity at each data point, the global metric is subsequently obtained by integrating an ordinary differential equation (ODE). The result is guaranteed to be invertible and smooth [2]. Fig. 2 (right) shows the fusion of the same linear metrics as in the last section, this time through the velocity approach. The resulting transformation does not fold and remains invertible, in sharp contrast with the fusion through displacements in Fig. 2 (left).
The overall procedure of our MLGPI can be decomposed into four steps:

Step 1: Derive velocity vectors for the individual linear metrics. Let $K$ be the total number of component metrics. For each component metric $L_k$ at a data point $x$, a family of velocity vector fields $V_k(x, t)$, parameterized by a time parameter $t$ ($0 \leq t \leq 1$), needs to be derived. $V_k(x, t)$ represents the velocity incurred by $L_k$, and therefore it should satisfy a consistency property: when integrated between time $0$ and $1$, the accumulated transformation should start at the identity transformation and end at $L_k$.

Step 2: Fuse the velocity vectors from Step 1 according to a weighting function $w_k(x)$:
$$V(x, t) = \sum_{k=1}^{K} w_k(x) \, V_k(x, t). \tag{3}$$
Step 3: Integrate the velocity ODE. Under this velocity framework, the destination of each point $x_0$ under the combined global transformation is obtained by integrating Eqn. (3) between $t = 0$ and $t = 1$, with the initial condition $x(0) = x_0$.

Step 4: The distance from a data point $x_i$ to $x_j$ under the global transformation is the Euclidean distance between their respective destinations.
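The four steps can be sketched end to end as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the component transformations are linear maps about the origin with real matrix logarithms, each contributes the velocity field $\log(L_k)\,x$, the weights are Gaussian in the distance to hypothetical class centers, and the ODE is solved with SciPy's general-purpose `solve_ivp`. All function names and numbers are made up for this example.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import logm

def component_logs(transforms):
    # Step 1: each L_k contributes the velocity field log(L_k) x.
    # Assumes every L_k has a real matrix logarithm.
    return [np.real(logm(L)) for L in transforms]

def fused_velocity(x, logs, centers, sigma):
    # Step 2: weighted sum of the component velocities.
    # The Gaussian weight form is an assumption of this sketch.
    logits = -np.sum((centers - x) ** 2, axis=1) / sigma ** 2
    w = np.exp(logits - logits.max())  # subtract the max for numerical stability
    w /= w.sum()
    return sum(w_k * (A_k @ x) for w_k, A_k in zip(w, logs))

def mlgpi_map(x0, logs, centers, sigma):
    # Step 3: integrate dx/dt = V(x) from t = 0 to t = 1 with x(0) = x0.
    sol = solve_ivp(lambda t, x: fused_velocity(x, logs, centers, sigma),
                    (0.0, 1.0), np.asarray(x0, dtype=float),
                    rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]

def mlgpi_distance(x_i, x_j, logs, centers, sigma):
    # Step 4: Euclidean distance between the two destinations.
    return float(np.linalg.norm(mlgpi_map(x_i, logs, centers, sigma)
                                - mlgpi_map(x_j, logs, centers, sigma)))

# Two hypothetical components: identity near (-1, 0), horizontal stretch near (1, 0).
transforms = [np.eye(2), np.diag([2.0, 1.0])]
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
logs = component_logs(transforms)
d = mlgpi_distance([-0.5, 0.2], [0.8, -0.1], logs, centers, sigma=1.0)
```

When all components are equal, the fused flow reduces to the common linear map, which gives a simple sanity check. Along the horizontal axis the mapped points keep their ordering, consistent with the absence of folding.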
III-A Velocity vectors and weighting function
The mapping of a point $x_0$ to $L_k x_0$ through a linear transformation $L_k$ can be interpreted in infinitely many different ways. One “reasonable” path is to regard the transformations as elements of a Lie transformation group, and carry out the morphing along the geodesic from the identity matrix $I$ (starting transformation, at $t = 0$) to $L_k$ (destination transformation, at $t = 1$). Within this morphing procedure, each point moves with a certain velocity $V_k(x, t)$. An illustration of this interpolation scheme is given in Fig. 3.
As many metric learning solutions, including LMNN and mmLMNN, output real invertible matrices, it is justified to consider only component transformations in the general linear group $GL(d, \mathbb{R})$ of real invertible $d \times d$ matrices. As a manifold, $GL(d, \mathbb{R})$ is not connected but rather has two connected components: the matrices with positive determinant ($GL^+(d, \mathbb{R})$) and the ones with negative determinant ($GL^-(d, \mathbb{R})$). Since the identity matrix is in $GL^+(d, \mathbb{R})$, we specify the transformation group to be $GL^+(d, \mathbb{R})$ and only consider the matrices with positive determinants.
For a Lie group $G$, consider two elements $A, B \in G$. We wish to find an interpolation between the two elements, according to a time parameter $t \in [0, 1]$. Define a function $f$ that performs the interpolation:
$$f: G \times G \times [0, 1] \to G, \quad f(A, B, 0) = A, \quad f(A, B, 1) = B.$$
The function can be obtained by transforming the interpolation operation into the tangent space at the identity, performing a linear combination there, and then transforming the resulting tangent vector back onto the manifold. First consider the group element $C$ that takes $A$ to $B$:
$$C = B A^{-1}, \quad C A = B.$$
Now compute the corresponding Lie algebra vector and assume the motion has constant speed in the tangent space:
$$c(t) = t \log(C). \tag{4}$$
Then transform $c(t)$ back onto the manifold using the exponential map, yielding an intermediate transformation at time $t$:
$$C(t) = \exp(c(t)). \tag{5}$$
Combining these three steps leads to a solution for $f$:
$$f(A, B, t) = C(t) \, A = \exp\left(t \log(B A^{-1})\right) A.$$
Note that the intermediate transformation is always on the manifold, due to the operation of the exponential map.
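The interpolation can be sketched with SciPy's matrix logarithm and exponential. The function name `geodesic_interp` and the rotation example are illustrative assumptions. The sketch also assumes that the matrix taking the start element to the end element has a real logarithm, which is why only the real part of `logm` is kept.

```python
import numpy as np
from scipy.linalg import expm, logm

def geodesic_interp(A, B, t):
    """Interpolate from A (t = 0) to B (t = 1) as exp(t * log(B A^{-1})) A."""
    C = B @ np.linalg.inv(A)       # group element taking A to B, i.e. C A = B
    c_t = t * np.real(logm(C))     # constant-speed path in the Lie algebra
    return expm(c_t) @ A           # map back onto the group

def rot(theta):
    """2-D rotation matrix by angle theta (radians)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

A, B = np.eye(2), rot(np.pi / 2)
halfway = geodesic_interp(A, B, 0.5)
```

For this example the halfway transformation is itself a rotation by a quarter of $\pi$. Entrywise averaging of the two matrices would instead shrink the data.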
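Step 1 of our MLGPI model is a straightforward application of the above derivations. Our goal is to estimate the velocity $V_k(x, t)$ incurred by the transformation $L_k$ at point $x$ and time $t$. The transformation travels with constant speed on $GL^+(d, \mathbb{R})$ along the geodesic from the identity matrix $I$ to $L_k$, so the starting element is $I$ and the target element is $L_k$ in the above derivations. For a point transformed to $x$ after time $t$ through the intermediate transformation $L_k(t) = \exp(t \log L_k)$, let $x_0$ be its original location. Since $L_k(t)$ is a linear transformation,
$$x = \exp(t \log L_k) \, x_0. \tag{6}$$
Taking the derivative of $x$ w.r.t. $t$, we obtain the velocity:
$$V_k(x, t) = \frac{dx}{dt} = \log(L_k) \exp(t \log L_k) \, x_0 = \log(L_k) \, x. \tag{7}$$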
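The velocity expression can be checked numerically: moving a point along the path $\exp(t \log L)\,x_0$ and differentiating by finite differences should match $\log(L)\,x$. In the sketch below the matrix `L`, the point `x0`, the time `t` and the step `h` are arbitrary hypothetical values.

```python
import numpy as np
from scipy.linalg import expm, logm

# Hypothetical component transformation with positive determinant and a real logarithm.
L = np.array([[2.0, 0.5],
              [0.0, 1.5]])
logL = np.real(logm(L))
x0 = np.array([1.0, -2.0])
t, h = 0.4, 1e-6

x_t = expm(t * logL) @ x0               # position at time t along the geodesic path
velocity = logL @ x_t                   # closed-form velocity log(L) x
finite_diff = (expm((t + h) * logL) @ x0
               - expm((t - h) * logL) @ x0) / (2 * h)
end_point = expm(logL) @ x0             # position at t = 1
```

The closed-form velocity agrees with the central difference, and at $t = 1$ the point reaches `L @ x0`, which is the consistency property required in Step 1.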
Maintaining matrices in $GL^+(d, \mathbb{R})$. A matrix with negative determinant would flip the data space. Most global linear metric learning algorithms allow the estimated linear transformation to be anywhere in $GL(d, \mathbb{R})$, as a flip does not affect the classification that follows. For piecewise linear models, however, flips could pose a serious problem. When neighboring metrics fall in $GL^+(d, \mathbb{R})$ and $GL^-(d, \mathbb{R})$ respectively, which are disconnected components of $GL(d, \mathbb{R})$, merging or averaging such matrices would lead to disastrous results. In our MLGPI model, we require all the component metrics to be in $GL^+(d, \mathbb{R})$. To this end, we modified mmLMNN and adopted a procedure similar to the projected gradient approach utilized in [27]. At each iteration of mmLMNN, we check the determinants of the estimated transformations $L_k$. If any of them falls in $GL^-(d, \mathbb{R})$, we project it back to $GL^+(d, \mathbb{R})$ by changing the sign of one of its Jordan eigenvalues.
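One possible reading of this projection is sketched below. It is an assumption-laden illustration rather than the authors' procedure: it uses a plain eigendecomposition instead of a Jordan form, so it assumes the matrix is diagonalizable and invertible, and it arbitrarily flips the negative real eigenvalue closest to zero. The function name `project_to_gl_plus` is hypothetical.

```python
import numpy as np

def project_to_gl_plus(L, tol=1e-12):
    """Return L unchanged if det(L) > 0; otherwise flip the sign of one
    negative real eigenvalue so that the determinant becomes positive."""
    if np.linalg.det(L) > 0:
        return L
    vals, vecs = np.linalg.eig(L)
    # A real matrix with negative determinant has at least one negative real eigenvalue.
    real_neg = [i for i, v in enumerate(vals) if abs(v.imag) < tol and v.real < 0]
    if not real_neg:
        raise ValueError("no negative real eigenvalue found; is L singular?")
    i = max(real_neg, key=lambda idx: vals[idx].real)  # closest to zero (arbitrary choice)
    vals = vals.astype(complex)
    vals[i] = -vals[i]
    return np.real(vecs @ np.diag(vals) @ np.linalg.inv(vecs))

flipped = np.diag([2.0, -3.0])            # hypothetical transformation with det < 0
projected = project_to_gl_plus(flipped)
swap = np.array([[0.0, 1.0], [1.0, 0.0]]) # reflection across the diagonal, det = -1
projected_swap = project_to_gl_plus(swap)
```

Flipping a single negative real eigenvalue changes the sign of the determinant while leaving the eigenvectors untouched.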
Weight functions. With the velocities estimated from the individual metrics, the combination can be conducted through a certain weighting function. Weight functions model the influence in space of each component metric, and partly control the sharpness of the transitions among the fused linear transformations. A desirable weight function should ensure a smooth transition across class/region boundaries. In this work, we utilize radial distance functions for such control, where the influence of each transformation is gradually reduced as the distance from the class center grows. Let $c_k$ be the computed class center (the group mean in spatial coordinates). The weight functions $w_k(x)$ used in all the experiments of this paper are radial functions of the distance $\|x - c_k\|$, with $\sigma$ serving as an attenuation constant that controls the rate of influence reduction; $\sigma$ can be roughly set to the radius of the data points in the same class. Each weight is then normalized as $w_k(x) / \sum_{j} w_j(x)$.
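The effect of the attenuation constant can be illustrated with one possible radial weight. The Gaussian fall-off used below is an assumption made for this sketch and is not necessarily the form used in the experiments. The function name `normalized_weights`, the centers and the values of `sigma` are hypothetical.

```python
import numpy as np

def normalized_weights(x, centers, sigma):
    """Normalized radial weights of each class center at point x.
    A Gaussian fall-off is assumed; sigma may be a scalar or one value per center."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    logits = -d2 / np.asarray(sigma, dtype=float) ** 2
    logits = logits - logits.max()   # avoid underflow far away from all centers
    w = np.exp(logits)
    return w / w.sum()

centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
x = np.array([-0.5, 0.0])            # a point on the side of the first center

w_mid = normalized_weights(np.array([0.0, 0.0]), centers, 1.0)
w_sharp = normalized_weights(x, centers, 0.5)   # small sigma: abrupt transition
w_soft = normalized_weights(x, centers, 2.0)    # large sigma: gradual transition
```

A smaller `sigma` makes the nearest component dominate almost completely, while a larger one blends the components over a wider band around the boundary.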
Fig. 5 shows an example of metric fusion through our MLGPI model. The first row shows two component linear transformations: rotations by opposite angles of equal magnitude around the centers of the respective regions. Fig. 5(c) shows the combined pointwise velocities, and the overall transformation computed through MLGPI is shown in Fig. 5(d). As is evident, the combined transformation field is smooth and invertible.
Fig. 5 panels: (a) Linear metric 1; (b) Linear metric 2; (c) Combined velocities; (d) Overall transformation.
IV Experiments and Results
In this section, we present evaluation and comparison results of applying our proposed MLGPI nonlinear DML method to both synthetic and real datasets.
IV-A Synthetic data: effects on decision boundaries
As mentioned before, piecewise linearization strategies commonly result in boundary shifts in classification due to the lack of smooth transitions across class boundaries. Our model, MLGPI, ensuring a diffeomorphic transformation, can avoid this problem.
To verify this claim, an experiment with synthetic data is designed, as shown in Fig. 6. Two classes of samples are distributed in the original space in a “stripe” pattern. Compared to the “stripes” in Class 2, the samples in Class 1 are less densely distributed, which might decrease the “pulling force” within Class 1. This imbalance may result in boundary shifts between classes.
Fig. 7 shows the two component transformations generated in MLGPI and how they are fused. The two arrows indicate the general directions of the forces generated by the two affine transformations. With the weighted summation of the instantaneous velocities, the deformation across the entire data domain is smooth, which ensures that the class boundary is updated with spatial consistency.
The resulting separating lines from mmLMNN and our MLGPI model are shown in Fig. 6. It is clear that the NN boundary determined by MLGPI is closer to the ideal one, and has no displacement (erosion) into the Class 2 area as mmLMNN does, which also implies that the classification rate of MLGPI will be higher than that of mmLMNN.
IV-B Real data: application to Alzheimer’s Disease (AD) staging
Alzheimer’s disease (AD) and its early stage, mild cognitive impairment (MCI), are a serious threat to more than five million elderly people in the US. Identifying intermediate biomarkers of AD is of great importance for the diagnosis and prognosis of the disease. We apply our proposed MLGPI model to the AD staging problem to demonstrate the practical usefulness of our algorithm.
321 subjects from the Alzheimer’s Disease Neuroimaging Initiative cohort (56 AD, 104 MCI, and 161 normal controls) were used as the input data. Sixteen features, including both volume and shape information for several important subcortical structures, are extracted and ranked. To better compare the performance, as well as identify the effectiveness of the models, we focus on a much simplified feature set that consists of the first and second most discriminative features (the volume information of Hippocampus and Entorhinal).
To evaluate our proposed MLGPI model on the AD staging problem, a ternary classification experiment (separating AD/MCI/NC simultaneously) was conducted, in comparison with mmLMNN. A leave-10%-out, 10-fold cross-validation paradigm is adopted for each model. The best, worst and average classification rates from the 10 validations were computed and are included in Table I. Our MLGPI performed better than mmLMNN on all three measurements (e.g., a mean classification rate of 0.4548 versus 0.3738). We note that the absolute values of all the performance measures obtained by MLGPI and mmLMNN are relatively low (they could be improved later with refined feature extraction steps, such as more accurate Hippocampus atrophy estimation). Nevertheless, the relative improvements made by MLGPI over mmLMNN can still be regarded as an indirect indication of the better performance achieved by MLGPI.
Table I: Results

Classifier  DML Method  Mean             Max     Min
NN          mmLMNN      0.3738 ± 0.0766  0.5152  0.2812
NN          MLGPI       0.4548 ± 0.0738  0.5758  0.3636
V Conclusions and discussion
In this paper, we proposed a novel nonlinear metric learning model based on piecewise linearization. Unlike existing models, our solution merges linear component metrics based on velocities instead of displacements. This setup ensures that the resulting transformation is diffeomorphic, which enables a smooth transition across the boundaries among classes. Generating inherently smooth component metrics with tensor regularization, as well as refining the velocity combination, is planned as future work.
References
 [1] M. Alexa, “Linear combination of transformations,” in ACM Transactions on Graphics (TOG), vol. 21, no. 3. ACM, 2002, pp. 380–387.
 [2] V. Arsigny, O. Commowick, N. Ayache, and X. Pennec, “A fast and log-Euclidean polyaffine framework for locally linear registration,” Journal of Mathematical Imaging and Vision, vol. 33, no. 2, pp. 222–238, 2009.
 [3] V. Arsigny, X. Pennec, and N. Ayache, “Polyrigid and polyaffine transformations: a novel geometrical tool to deal with nonrigid deformations–application to the registration of histological slices,” Medical image analysis, vol. 9, no. 6, pp. 507–523, 2005.
 [4] A. Bellet, A. Habrard, and M. Sebban, “A survey on metric learning for feature vectors and structured data,” arXiv preprint arXiv:1306.6709, 2013.
 [5] Y. Chen, B. Shi, C. D. Smith, and J. Liu, “Nonlinear feature transformation and deep fusion for alzheimer’s disease staging analysis,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2015, pp. 304–312.
 [6] M. Der and L. K. Saul, “Latent coincidence analysis: A hidden variable model for distance metric learning.” in NIPS, 2012, pp. 3239–3247.

 [7] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg, “Matchnet: Unifying feature and metric learning for patch-based matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3279–3286.
 [8] Y. He, W. Chen, Y. Chen, and Y. Mao, “Kernel density metric learning,” in Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE, 2013, pp. 271–280.
 [9] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92.
 [10] Y. Hong, Q. Li, J. Jiang, and Z. Tu, “Learning a mixture of sparse distance metrics for classification and dimensionality reduction,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 906–913.
 [11] J. Hu, J. Lu, and Y.P. Tan, “Discriminative deep metric learning for face verification in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875–1882.
 [12] ——, “Deep transfer metric learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 325–333.
 [13] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood components analysis,” in NIPS, 2004.
 [14] J. T. Kwok and I. W. Tsang, “Learning with idealized kernels,” in ICML, 2003, pp. 400–407.
 [15] N. Nguyen and Y. Guo, “Metric learning: A support vector approach,” in Machine Learning and Knowledge Discovery in Databases. Springer, 2008, pp. 125–136.
 [16] Y.K. Noh, B.T. Zhang, and D. D. Lee, “Generative local metric learning for nearest neighbor classification,” in Advances in Neural Information Processing Systems, 2010, pp. 1822–1830.
 [17] K. Park, C. Shen, Z. Hao, and J. Kim, “Efficiently learning a distance metric for large margin nearest neighbor classification.” in AAAI, 2011.
 [18] D. Ramanan and S. Baker, “Local distance functions: A taxonomy, new algorithms, and an evaluation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 4, pp. 794–806, 2011.
 [19] J. Rossignac and Á. Vinacua, “Steady affine motions and morphs,” ACM Transactions on Graphics (TOG), vol. 30, no. 5, p. 116, 2011.
 [20] B. Shi, Y. Chen, K. Hobbs, C. D. Smith, and J. Liu, “Nonlinear metric learning for alzheimer’s disease diagnosis with integration of longitudinal neuroimaging features,” BMVC, vol. 1, pp. 138–1, 2015.
 [21] B. Shi, Y. Chen, P. Zhang, C. D. Smith, J. Liu, A. D. N. Initiative et al., “Nonlinear feature transformation and deep fusion for alzheimer’s disease staging analysis,” Pattern recognition, vol. 63, pp. 487–498, 2017.
 [22] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in Neural Information Processing Systems, 2014, pp. 1988–1996.
 [23] L. Torresani and K.c. Lee, “Large margin component analysis,” Advances in neural information processing systems, vol. 19, p. 1385, 2007.
 [24] J. Wang, A. Kalousis, and A. Woznica, “Parametric local metric learning for nearest neighbor classification,” in Advances in Neural Information Processing Systems, 2012, pp. 1601–1609.
 [25] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, no. Feb, pp. 207–244, 2009.
 [26] K. Weinberger, J. Blitzer, and L. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006.
 [27] E. P. Xing et al., “Distance metric learning with application to clustering with side-information,” NIPS, pp. 521–528, 2003.
 [28] L. Yang and R. Jin, “Distance metric learning: A comprehensive survey,” Michigan State University, vol. 2, 2006.
 [29] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 2014, pp. 34–39.
 [30] P. Zhang, B. Shi, C. D. Smith, and J. Liu, “Nonlinear metric learning for semi-supervised learning via coherent point drifting,” in 15th IEEE International Conference on Machine Learning and Applications, ICMLA 2016, Anaheim, CA, USA, December 18-20, 2016, pp. 314–319.
 [31] ——, “Nonlinear feature space transformation to improve the prediction of MCI to AD conversion,” in MICCAI 2017, 2017, pp. 12–20.