I Introduction
Classification is a fundamental area in machine learning. For classification, it is crucial to measure the distance between instances appropriately. One of the established classifiers, the nearest neighbor (NN) classifier, assigns a new instance to the class of the training instance with the shortest distance.
In practice it is often difficult to handcraft a well-suited and adaptive distance metric. To mitigate this issue, metric learning has been proposed to enable a metric to be learned automatically from the available data. Metric learning with a convex objective function was first proposed in the pioneering work of [1]. The large margin intuition was introduced into metric learning research by the seminal “large margin metric learning” (LMML) [2] and “large margin nearest neighbor” (LMNN) [3]. Besides the large margin approach, other inspiring metric learning strategies have been developed, such as nonlinear metrics [4, 5], localized strategies [6, 7, 8] and scalable/efficient algorithms [9, 10]. Metric learning has also been adopted by many other learning tasks, such as semi-supervised learning [11, 12], multi-task/cross-domain learning [13, 14], AUC optimization [15] and distributed approaches [16]. On top of the methodological and applied advancement of metric learning, some theoretical progress has also been made recently, in particular on deriving different types of generalization bounds for metric learning [17, 18, 19, 20]. These developments have theoretically justified the performance of metric learning algorithms. However, they generally lack a geometrical link with the classification margin, and are not as interpretable as one may expect (e.g. they lack the clear geometric relationship that the margin enjoys in support vector machines (SVM)).
Besides the inter-class margin, the intra-class dispersion is also crucial to classification [21, 22, 23]. The intra-class dispersion is especially important for metric learning, because different metrics may lead to similar inter-class margins yet quite different intra-class dispersions. As illustrated in Figure 1, although the margins in those different metric spaces are exactly the same, classification becomes more difficult as the margin ratio decreases. Therefore, the seminal work of [1] and many later works have made efforts to consider the inter-class margin and the intra-class dispersion at the same time.
In this paper, we propose a new concept, the Lipschitz margin ratio, to integrate both inter-class and intra-class properties, and, through maximizing the Lipschitz margin ratio, we propose a new metric learning framework that enhances the generalization ability of a classifier. These two novelties are the main contributions of this work.
To achieve these two aims and present our contributions in a well-structured way, we organize the rest of this paper as follows. Firstly, in Section II we discuss the relationship between distance-based classification / metric learning and Lipschitz functions. We show that a Lipschitz extension, which is a distance-based function, can be regarded as a generalized nearest neighbor model that enjoys great representation ability. Then, in Section III we introduce the Lipschitz margin ratio, and we point out that its associated learning bound indicates the desirability of maximizing the Lipschitz margin ratio for enhancing the generalization ability of Lipschitz extensions. Consequently, in Section IV we propose a new metric learning framework through maximizing the Lipschitz margin ratio. Moreover, we prove that many well-known metric learning algorithms can be shown to be special cases of the proposed framework. Then, for illustrative purposes, we implement the framework for learning the squared Mahalanobis metric. The method is presented in Section IV-C, and its experimental results in Section V demonstrate the superiority of the proposed method. Finally, we draw conclusions and discuss future work in Section VI. For the convenience of readers, some theoretical proofs are deferred to the Appendix.
II Lipschitz Functions and Distance-based Classifiers
II-A Definition of Lipschitz Functions
To start with, we review the definitions of Lipschitz functions, the Lipschitz constant and the Lipschitz set.
Definition 1.
[24] Let $(\mathcal{X}, d)$ be a metric space. A function $f: \mathcal{X} \to \mathbb{R}$ is called Lipschitz continuous if there exists a constant $L \ge 0$ such that, $\forall x_1, x_2 \in \mathcal{X}$,
$$|f(x_1) - f(x_2)| \le L \, d(x_1, x_2).$$
The Lipschitz constant of a Lipschitz function $f$ is
$$L(f) = \sup_{x_1 \neq x_2} \frac{|f(x_1) - f(x_2)|}{d(x_1, x_2)},$$
and a function $f$ is also called an $L$-Lipschitz function if its Lipschitz constant is at most $L$. Meanwhile, all $L$-Lipschitz functions constitute the Lipschitz set
$$\mathrm{Lip}_L = \{ f : L(f) \le L \}.$$
From these definitions, we can observe that the Lipschitz constant is fundamentally connected with the metric $d$, and that the Lipschitz functions specify a family of “smooth” functions whose change in output values can be bounded by the distances in the input space.
II-B Lipschitz Extensions and Distance-based Classifiers
Distance-based classifiers are classifiers that are based on certain kinds of distance metrics. Most distance-based classifiers stem from the nearest neighbor (NN) classifier. To decide the class label of a new instance, the NN classifier compares the distances between the new instance and the training instances.
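As a concrete sketch of this rule, a 1-NN classifier can be written in a few lines; the function name and interface below are ours, and the Euclidean metric is assumed for illustration:

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """1-NN rule: return the label of the training instance closest to x
    (Euclidean metric assumed in this sketch)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distances to all training points
    return y_train[int(np.argmin(dists))]        # label of the nearest one
```

Swapping the Euclidean norm for a learned metric is precisely the degree of freedom that metric learning exploits.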
In binary classification tasks, a Lipschitz function $f$ is commonly used as the classification function, and an instance $x$ is then classified according to the sign of $f(x)$. Using Theorem 1, we present a family of Lipschitz functions called Lipschitz extensions. We also show that Lipschitz extensions constitute a distance-based classifier, and that a special case of Lipschitz extensions returns exactly the same classification result as the NN classifier.
Theorem 1.
[25, 26, 24, 27] (McShane-Whitney Extension Theorem) Given a function $f$ defined on a finite subset $S \subset \mathcal{X}$, there exists a family of functions which coincide with $f$ on $S$, are defined on the whole space $\mathcal{X}$, and have the same Lipschitz constant as $f$. Additionally, it is possible to explicitly construct two such functions, called Lipschitz extensions of $f$:
$$u(x) = \min_{x_i \in S} \big( f(x_i) + L(f)\, d(x, x_i) \big), \qquad l(x) = \max_{x_i \in S} \big( f(x_i) - L(f)\, d(x, x_i) \big),$$
where $L(f)$ denotes the Lipschitz constant of $f$ on $S$.
Theorem 1 can be readily validated by calculating the values of $u$ and $l$ on the finite set $S$. The bound on the Lipschitz constants of $u$ and $l$ can be proved on the basis of the lemmas in the Appendix.
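To make the construction concrete, here is a minimal sketch of the standard upper/lower (McShane-Whitney) extensions, $u(x) = \min_i (f(x_i) + L\, d(x, x_i))$ and $l(x) = \max_i (f(x_i) - L\, d(x, x_i))$; the function names and interface are ours, and any metric $d$ can be plugged in:

```python
def lipschitz_extensions(X, y, L, metric):
    """Construct the upper and lower McShane-Whitney extensions of the
    values y, given on the finite set X, to the whole space (sketch).

    X      : list of training points
    y      : function values on X (e.g. labels +1 / -1)
    L      : Lipschitz constant of the restriction to X
    metric : a distance function d(x1, x2)
    """
    def upper(x):  # u(x) = min_i ( y_i + L * d(x, x_i) )
        return min(yi + L * metric(x, xi) for xi, yi in zip(X, y))

    def lower(x):  # l(x) = max_i ( y_i - L * d(x, x_i) )
        return max(yi - L * metric(x, xi) for xi, yi in zip(X, y))

    return upper, lower
```

On the training points both extensions reproduce the original values, so interpolating the labels in this way yields zero empirical risk.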
Theorem 1 clearly shows that Lipschitz extensions are distance-based functions. Moreover, we can illustrate the relationship between Lipschitz extension functions and the empirical risk as follows.
Assume $S = \{(x_i, y_i)\}$ is the set of training instances of a classification task, with labels $y_i \in \{-1, +1\}$. If there are no two instances $x_i = x_j$ with $y_i \neq y_j$ (i.e. no overlap between training instances from different classes), then setting $f(x_i) = y_i$ would result in zero empirical risk, and $f$ would be a Lipschitz function with a finite Lipschitz constant, where the existence of such a function $f$ on the whole space, i.e. the Lipschitz extensions, is guaranteed by Theorem 1. That is, when performing classification, if we set the Lipschitz constant of the Lipschitz extension to be sufficiently large, zero empirical risk can be obtained. In other words, as distance-based functions, Lipschitz extensions enjoy excellent representation ability for classification tasks.
Moreover, under a particular setting specified in [27], Lipschitz extensions have exactly the same classification results as the NN classifier:
Proposition 1.
[27] The function defined above has the same sign as, i.e. yields the same classification results as, the NN classifier.
III Lipschitz Margin Ratio
In the previous section, we showed that Lipschitz extensions can be viewed as distance-based classifiers whose representation ability is so strong that zero empirical error can be obtained under mild conditions. In this section, we propose the Lipschitz margin ratio to control the model complexity of Lipschitz functions and hence improve their generalization ability. To start with, we propose an intuitive way to understand the Lipschitz margin and the Lipschitz margin ratio. Then, learning bounds based on the Lipschitz margin ratio will be presented.
III-A Lipschitz Margin
We define the training set of class $c$ as $S_c = \{x_i : y_i = c\}$, where $c \in \{+1, -1\}$, and the decision boundary of the classification function $f$ as $B = \{x : f(x) = 0\}$. The margin used in [27] is equivalent to the Lipschitz margin defined below.
Definition 2.
The Lipschitz margin $D_{\mathrm{margin}}$ is the distance between the training sets $S_+$ and $S_-$:
$$D_{\mathrm{margin}} = \min_{x_+ \in S_+,\, x_- \in S_-} d(x_+, x_-). \quad (1)$$
The relationship between the Lipschitz margin and the Lipschitz constant is established as follows.
Proposition 2.
For any Lipschitz function $f$ satisfying $f(x_+) \ge 1$ for all $x_+ \in S_+$ and $f(x_-) \le -1$ for all $x_- \in S_-$,
$$D_{\mathrm{margin}} \ge \frac{2}{L(f)}. \quad (2)$$
Proof.
Let $x_+^*$ and $x_-^*$ denote the nearest instances from different classes, i.e. $(x_+^*, x_-^*) = \arg\min_{x_+ \in S_+,\, x_- \in S_-} d(x_+, x_-)$.
It is straightforward to see that
$$D_{\mathrm{margin}} = d(x_+^*, x_-^*) \ge \frac{|f(x_+^*) - f(x_-^*)|}{L(f)} \ge \frac{2}{L(f)},$$
where the first inequality follows from the definition of the Lipschitz constant, and the second inequality holds because $f(x_+^*) \ge 1$ and $f(x_-^*) \le -1$, so $|f(x_+^*) - f(x_-^*)| \ge 2$. ∎
The proposition shows that the Lipschitz margin is lower bounded by twice the reciprocal of the Lipschitz constant.
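To make the bound in (2) concrete: for $\pm 1$-valued labels, the smallest Lipschitz constant of any function interpolating the labels is attained on the closest cross-class pair. The following numpy sketch (function name ours, Euclidean metric assumed) computes that minimal constant:

```python
import numpy as np

def min_interpolating_lipschitz(X, y):
    """Smallest Lipschitz constant of any function that matches the
    labels y on the points X (Euclidean metric):
    L0 = max_{i != j} |y_i - y_j| / d(x_i, x_j)."""
    L0 = 0.0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            L0 = max(L0, abs(y[i] - y[j]) / np.linalg.norm(X[i] - X[j]))
    return L0

# For labels in {-1, +1}, only cross-class pairs contribute, so
# L0 = 2 / D_margin and the bound of Proposition 2 holds with equality.
```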
The Lipschitz margin is closely related to the margin adopted in SVM (the distance between the separating hyperplane and the training instances). As illustrated in Figure 2, the Lipschitz margin is also suitable for the classification of non-linearly separable classes. The relationship between these two types of margins is described via the following proposition.
Proposition 3.
In the Euclidean space, let $f$ be any continuous function which correctly classifies all the training instances, i.e. $y_i f(x_i) > 0$ for all $i$; then
$$\min_i d(x_i, B) \le \frac{D_{\mathrm{margin}}}{2},$$
where $B = \{x : f(x) = 0\}$ is the decision boundary.
Proof.
In the Euclidean space, $d(x_1, x_2) = \|x_1 - x_2\|_2$ is the Euclidean distance.
Let $x_+^*$ and $x_-^*$ denote the nearest instances from different classes, i.e. $d(x_+^*, x_-^*) = D_{\mathrm{margin}}$, where $x_+^* \in S_+$ and $x_-^* \in S_-$.
We define a connected set $C = \{\lambda x_+^* + (1 - \lambda) x_-^* : \lambda \in [0, 1]\}$, which is the line segment between $x_+^*$ and $x_-^*$. Because $f(x_+^*) > 0$, $f(x_-^*) < 0$, and any continuous function maps connected sets into connected sets, there exists $x_0 \in C$ such that $f(x_0) = 0$. According to the definition of $B$, we see that $x_0 \in B$. Therefore,
$$\min_i d(x_i, B) \le \min\big( d(x_+^*, x_0),\, d(x_-^*, x_0) \big) \le \frac{d(x_+^*, x_0) + d(x_0, x_-^*)}{2} = \frac{d(x_+^*, x_-^*)}{2} = \frac{D_{\mathrm{margin}}}{2},$$
where the first equality follows from the fact that $x_0$ lies on the line segment $C$. ∎
III-B Lipschitz Margin Ratio
The Lipschitz margin discussed above effectively depicts the inter-class relationship. However, as mentioned before, when we learn metrics, different metrics will result in different intra-class dispersions, so it is also important to consider intra-class properties. Hence we propose the Lipschitz margin ratio to incorporate both the inter-class and intra-class properties into metric learning.
We start with defining the diameter of a metric space:
Definition 3.
[24] The diameter of a metric space $(\mathcal{X}, d)$ is defined as
$$\mathrm{diam}(\mathcal{X}) = \sup_{x_1, x_2 \in \mathcal{X}} d(x_1, x_2).$$
The Lipschitz margin ratio is then defined as the ratio between the margin and either $\mathrm{diam}(S_+ \cup S_-)$ (i.e. the diameter) or $\mathrm{diam}(S_+) + \mathrm{diam}(S_-)$ (i.e. the sum of the intra-class dispersions), as follows.
Definition 4.
The Diameter Lipschitz Margin Ratio ($\mathrm{LRatio}_{\mathrm{diam}}$) and the Intra-Class Dispersion Lipschitz Margin Ratio ($\mathrm{LRatio}_{\mathrm{intra}}$) in a metric space are defined as
$$\mathrm{LRatio}_{\mathrm{diam}} = \frac{D_{\mathrm{margin}}}{\mathrm{diam}(S_+ \cup S_-)}, \qquad \mathrm{LRatio}_{\mathrm{intra}} = \frac{D_{\mathrm{margin}}}{\mathrm{diam}(S_+) + \mathrm{diam}(S_-)}.$$
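Both ratios can be computed directly from a sample, replacing the idealized diameters by their empirical counterparts. A small numpy sketch (names ours, Euclidean metric assumed):

```python
import numpy as np

def _pairwise(A, B):
    # Euclidean distance matrix between the row sets A and B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def lipschitz_margin_ratios(S_pos, S_neg):
    """Empirical diameter and intra-class-dispersion Lipschitz margin ratios."""
    margin = _pairwise(S_pos, S_neg).min()      # D_margin, Eq. (1)
    S_all = np.vstack([S_pos, S_neg])
    diam_all = _pairwise(S_all, S_all).max()    # diameter of the union of classes
    diam_pos = _pairwise(S_pos, S_pos).max()    # intra-class diameters
    diam_neg = _pairwise(S_neg, S_neg).max()
    return margin / diam_all, margin / (diam_pos + diam_neg)
```

In a linearly separable one-dimensional example such as $S_- = \{0, 1\}$, $S_+ = \{3, 4\}$, the overall diameter ($4$) decomposes exactly into the two intra-class diameters ($1 + 1$) plus the margin ($2$), matching the decomposition discussed below.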
The relationship between $\mathrm{LRatio}_{\mathrm{diam}}$ and $\mathrm{LRatio}_{\mathrm{intra}}$ can be established via the following proposition.
Proposition 4.
In a metric space $(\mathcal{X}, d)$,
$$\mathrm{diam}(S_+ \cup S_-) \le \mathrm{diam}(S_+) + \mathrm{diam}(S_-) + D_{\mathrm{margin}},$$
and consequently
$$\frac{1}{\mathrm{LRatio}_{\mathrm{diam}}} \le \frac{1}{\mathrm{LRatio}_{\mathrm{intra}}} + 1.$$
Proof.
See Appendix A. ∎
In this inequality, $\mathrm{diam}(S_+)$ and $\mathrm{diam}(S_-)$ indicate the maximum intra-class distances, and $D_{\mathrm{margin}}$ indicates the inter-class margin. Therefore, the inverse margin ratio penalty will push the learner to select a metric which pulls the instances from the same class closer (small $\mathrm{diam}(S_+)$ and $\mathrm{diam}(S_-)$) and enlarges the margin between the instances from different classes (large $D_{\mathrm{margin}}$). In a very simple (linearly separable, one-dimensional) case, as illustrated in Figure 3, $\mathrm{diam}(S_+ \cup S_-)$ can be decomposed directly into the intra-class dispersions ($\mathrm{diam}(S_+)$, $\mathrm{diam}(S_-)$) and the inter-class margin ($D_{\mathrm{margin}}$).
We can then bound the Lipschitz margin ratio using the Lipschitz constant and the diameters of the metric space:
Proposition 5.
For any Lipschitz function $f$ satisfying $f(x_+) \ge 1$ for all $x_+ \in S_+$ and $f(x_-) \le -1$ for all $x_- \in S_-$,
$$\mathrm{LRatio}_{\mathrm{diam}} \ge \frac{2}{L(f)\,\mathrm{diam}(S_+ \cup S_-)}, \qquad \mathrm{LRatio}_{\mathrm{intra}} \ge \frac{2}{L(f)\,\big(\mathrm{diam}(S_+) + \mathrm{diam}(S_-)\big)}.$$
Proof.
The inequalities can be obtained by substituting the result of Proposition 2. ∎
Based on this proposition, although it is not possible to calculate the exact value of the Lipschitz margin ratio in most cases, we can use these lower bounds as surrogates. For example, in the objective function of metric learning by maximizing the Lipschitz margin ratio, we can maximize $\frac{2}{L(f)\,\mathrm{diam}(S_+ \cup S_-)}$ or $\frac{2}{L(f)(\mathrm{diam}(S_+) + \mathrm{diam}(S_-))}$, or equivalently minimize $L(f)\,\mathrm{diam}(S_+ \cup S_-)$ or $L(f)\,(\mathrm{diam}(S_+) + \mathrm{diam}(S_-))$.
Furthermore, in some cases we may be more interested in local properties rather than global ones (see also Section IV-B). In those cases we can define the local Lipschitz margin ratio as follows.
Definition 5.
The local Lipschitz margin ratio on a subset $\mathcal{X}_0 \subseteq \mathcal{X}$ with metric $d$ is defined analogously to Definition 4, with $S_c \cap \mathcal{X}_0$ indicating the local training set of class $c \in \{+1, -1\}$.
III-C Learning Bounds of the Lipschitz Margin Ratio
In the section above, we defined the Lipschitz margin ratio, which is a measure of model complexity. In this section, we establish the effectiveness of the Lipschitz margin ratio by showing the relationship between its lower bound and the generalization ability.
Definition 6.
[28] For a metric space $(\mathcal{X}, d)$, let $\lambda$ be the smallest number such that every ball in $\mathcal{X}$ can be covered by $\lambda$ balls of half the radius. Then $\lambda$ is called the doubling constant of $\mathcal{X}$, and the doubling dimension of $\mathcal{X}$ is $\mathrm{ddim}(\mathcal{X}) = \log_2 \lambda$.
As presented in [28], a low Euclidean dimension implies a low doubling dimension (Euclidean metrics of dimension $k$ have doubling dimension $O(k)$); a low doubling dimension is a more general notion than a low Euclidean dimension and can be utilized to measure the ‘dimension’ of a general metric space.
Definition 7.
We say that a function class $\mathcal{F}$ $\gamma$-shatters the set $\{x_1, \dots, x_m\}$ if there exists a witness $r = (r_1, \dots, r_m)$ such that, for every labeling $b \in \{-1, +1\}^m$, there exists $f \in \mathcal{F}$ such that $b_i (f(x_i) - r_i) \ge \gamma$ for all $i$. The fat-shattering dimension $\mathrm{fat}_\gamma(\mathcal{F})$ is the size of the largest set that is $\gamma$-shattered by $\mathcal{F}$.
Theorem 2.
[28] Let $\mathcal{F}$ be the collection of real-valued functions over $\mathcal{X}$ with Lipschitz constant at most $L$, and let $P$ be some probability distribution on $\mathcal{X} \times \{-1, +1\}$. Suppose that $n$ instances are drawn from $\mathcal{X} \times \{-1, +1\}$ independently according to $P$. Then for any $f \in \mathcal{F}$ that classifies a sample of size $n$ correctly, a generalization bound holds with probability at least $1 - \delta$. Furthermore, if $f$ is correct on all but $k$ examples, we have with probability at least $1 - \delta$:
(3) 
Proposition 6.
In classification problems, when $f$ satisfies $f(x_+) \ge 1$ on $S_+$ and $f(x_-) \le -1$ on $S_-$, the fat-shattering dimension can be bounded by the surrogate of the Lipschitz margin ratio as follows:
(4) 
Proof.
Corollary 1.
Under the condition of Proposition 6, the following bound for the surrogate margin ratios holds. If $f$ is correct on all but $k$ examples, we have with probability at least $1 - \delta$:
(5) 
where the surrogate is either $L(f)\,\mathrm{diam}(S_+ \cup S_-)$ or $L(f)\,(\mathrm{diam}(S_+) + \mathrm{diam}(S_-))$.
The above learning bound illustrates the relationship between the generalization error (i.e. the difference between the expected error and the empirical error) and the surrogate inverse Lipschitz margin ratio. Therefore, reducing the value of the surrogate inverse Lipschitz margin ratio helps reduce the gap between the empirical error and the expected error, which implies an improvement in the generalization ability of the model. In other words, the learning bound indicates that minimizing the inverse Lipschitz margin ratio is an effective way to enhance the generalization ability and control model complexity.
IV Metric Learning via Maximizing the Lipschitz Margin Ratio
From the previous sections, we have seen that Lipschitz functions have the following desirable properties relevant to metric learning:

(Close relationship with metrics) The definitions of the Lipschitz constant, Lipschitz functions and Lipschitz extensions have a natural relationship with metrics.

(Strong representation ability) Lipschitz functions, in particular Lipschitz extensions, can attain small empirical risk, which illustrates their representational capability.

(Good generalization ability) The complexity of Lipschitz functions can be controlled by penalizing the inverse Lipschitz margin ratio.
Therefore, it is reasonable to conduct metric learning with Lipschitz functions and to control the model complexity by maximizing (a lower bound of) the Lipschitz margin ratio.
IV-A Learning Framework
Similarly to other structural risk minimization approaches, in the proposed framework we minimize the empirical risk and maximize (a lower bound of) the Lipschitz margin ratio. To estimate (a lower bound of) the Lipschitz margin ratio, we may either

use the training instances to estimate the Lipschitz constant and the diameters empirically, and thereby obtain empirical surrogates of $\mathrm{LRatio}_{\mathrm{diam}}$ and $\mathrm{LRatio}_{\mathrm{intra}}$; or

adopt upper bounds on the Lipschitz constant and the diameters by applying the properties of the classifier and the metric space, and thereby obtain theoretical surrogates of $\mathrm{LRatio}_{\mathrm{diam}}$ and $\mathrm{LRatio}_{\mathrm{intra}}$.
The optimization problem could be formulated as follows:
(6) 
where $n$ indicates the number of training instances, $\theta$ denotes the parameters of the classification function $f$, $\ell(\cdot)$ is the hinge loss, and $c$ is a trade-off parameter which balances the empirical risk term and the generalization ability term. The Lipschitz constant and the diameters in the LRatio term will be replaced by either the empirically estimated values or the theoretical upper bounds.
Empirical estimates of the Lipschitz constant and the diameters can be added as constraints. The objective function minimizing the inverse diameter Lipschitz margin ratio then becomes
(7) 
where the penalty term tries to maximize the inter-class margin (via minimizing the estimated Lipschitz constant) and minimize the overall diameter (via minimizing the estimated diameter).
The objective function minimizing the inverse intra-class dispersion Lipschitz margin ratio takes an analogous form; alternatively, we can minimize an upper bound of it as
(8) 
where the penalty terms try to maximize the inter-class margin (via minimizing the estimated Lipschitz constant) and minimize the intra-class dispersion (via minimizing the estimated intra-class diameters) at the same time.
IV-B Relationship with Other Metric Learning Methods
Some widely adopted metric learning algorithms can be shown to be special cases of the proposed framework.
As presented in Appendix C, based on our framework, the penalty term of LMML [2] can be interpreted as an upper bound of the inverse margin ratio, and the framework suggests a reasonable strategy for choosing the target neighbors and the impostor neighbors in LMML. Also, as discussed in Appendix D, the penalty term of LMNN [3] can be interpreted as an upper bound of the inverse intra-class dispersion Lipschitz margin ratio.
IV-C Applying the Framework to Learn the Squared Mahalanobis Metric
We now apply the proposed framework to learn the squared Mahalanobis metric $d^2_M(x_1, x_2) = (x_1 - x_2)^\top M (x_1 - x_2)$, where $M \in \mathbb{S}^d_+$ and $\mathbb{S}^d_+$ is the set of positive semi-definite matrices.
where is the set of positive semidefinite matrices. A Lipschitz extension function is selected as the classifier:
(9) 
In binary classification tasks, let $y_i \in \{-1, +1\}$ indicate the label of $x_i$, $i = 1, \dots, n$.
Based on the frameworks of (6) and (7), we first propose an optimization formulation which penalizes the inverse diameter Lipschitz margin ratio:
(10) 
At first glance, the optimization problem seems quite complex. However, based on the smoothness assumption, the balanced class assumption and some equivalent transformations, as illustrated in Appendix E, the following optimization problem can be obtained:
(11) 
Intuitively speaking, the first set of inequality constraints indicates that the distances between samples from different classes should be large, and the third set of inequality constraints indicates that the estimated diameter should be small.
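Since the full program (11) is not reproduced here, the following is only a rough numpy sketch of the same intuition, not the paper's algorithm: a projected-subgradient loop that pushes cross-class squared Mahalanobis distances above a unit margin (hinge), shrinks same-class distances, and projects $M$ back onto the PSD cone after each step. All names, the learning rate and the trade-off weight are assumptions.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T

def learn_mahalanobis(X, y, c=1.0, lr=0.01, iters=200):
    """Toy projected-subgradient Mahalanobis metric learner (illustrative only)."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(iters):
        G = np.zeros((d, d))
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                v = X[i] - X[j]
                if y[i] != y[j] and v @ M @ v < 1.0:
                    G -= np.outer(v, v)        # hinge active: push cross-class pairs apart
                elif y[i] == y[j]:
                    G += c * np.outer(v, v)    # shrink same-class distances
        M = project_psd(M - lr * G)
    return M
```

On a toy dataset where only the first feature separates the classes, the loop drives the weight of the noise feature toward zero while keeping cross-class distances above the margin, which is exactly the "large margin, small dispersion" trade-off described above.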
Based on the framework in (6) and (8), we can also propose an optimization formulation which penalizes the upper bound of the inverse intra-class dispersion Lipschitz margin ratio:
(12) 
The only difference between (10) and (12) lies in the instance pairs selected to estimate the diameter: (10) utilizes all instance pairs to estimate the diameter of the whole training set, while (12) utilizes the instance pairs with the same label to estimate the maximum intra-class dispersion. Similarly to the transformations from (10) to (11), the following optimization problem can be obtained:
(13) 
V Experiments
To evaluate the performance of our proposed methods, we compare them with four widely adopted distance-based algorithms: Nearest Neighbor (NN), Large Margin Nearest Neighbor (LMNN) [3], Maximally Collapsing Metric Learning (MCML) [29] and Neighborhood Components Analysis (NCA) [30]. Under our framework, we have implemented Lip (based on the diameter Lipschitz margin ratio), Lip (based on the intra-class dispersion Lipschitz margin ratio), and their ADMM-based fast variants Lip(P).
Our proposed methods are implemented using the cvx toolbox (http://cvxr.com/) in MATLAB with the SeDuMi solver [31]. The trade-off parameter in our algorithm and the penalty parameter in the ADMM algorithm are held fixed. LMNN, MCML and NCA are from the dimensionality reduction toolbox (https://lvdmaaten.github.io/drtoolbox/).
In the experiments, we focus on the most representative task, binary classification. Eight publicly available datasets from the UCI (https://archive.ics.uci.edu/ml/datasets.html) and LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) websites are adopted to evaluate the performance, namely Statlog/LibSVM Australian Credit Approval (Australian), UCI/LibSVM Original Breast Cancer Wisconsin (Cancer), UCI/LibSVM Pima Indians Diabetes (Diabetes), UCI Echocardiogram (Echo), UCI Fertility (Fertility), LibSVM Fourclass (Fourclass), UCI Haberman’s Survival (Haberman) and UCI Congressional Voting Records (Voting). For each dataset, a portion of the instances is randomly selected for training and the rest for testing. This process is repeated multiple times and the mean accuracy is reported.
datasets  Lip  Lip(P)  Lip  Lip(P)  NN  LMNN  MCML  NCA 

Australian  
Cancer  
Diabetes  
Echo  
Fertility  
Fourclass  
Haberman  
Voting  
# of best  2  2  3  0  0  2  2  0 
As shown in Table I, the proposed Lip algorithms achieve the best mean accuracy on four datasets and tie with MCML on one dataset. They outperform 1-NN and NCA on seven datasets, and LMNN and MCML on five datasets. The only dataset on which our methods perform worse than all other methods is Fertility, where our method potentially suffers from within-class outliers and hence a large intra-class dispersion. Apart from this dataset, LMNN or MCML outperforms our methods by only a small performance gap. Such encouraging results demonstrate the effectiveness of the proposed framework.

VI Conclusions and Future Work
In this paper, we have shown that the representation ability of Lipschitz functions is very strong and that the complexity of Lipschitz functions in a metric space can be controlled by penalizing the Lipschitz margin ratio. Based on these desirable properties, we have proposed a new metric learning framework via maximizing the Lipschitz margin ratio. An application of this framework to learning the squared Mahalanobis metric has been implemented, and the experimental results are encouraging.
The diameter Lipschitz margin ratio or the intra-class dispersion Lipschitz margin ratio in the objective function is equivalent to an adaptive regularization. In other words, since we encourage samples to stay close within the same class, samples located near the class boundary are weighted more heavily than those near the center. Therefore, the performance of our method may deteriorate in the presence of outliers, as observed on the Fertility dataset. We aim to develop more robust methods in future work.
The local properties within a dataset could vary dramatically, and hence it is worthwhile to develop an algorithm based on the local Lipschitz margin ratio. One option is to follow the idea of LMNN, learning a single global metric while considering different local Lipschitz margin ratios; alternatively, we can learn a separate metric in each local area.
References
 [1] E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with sideinformation,” in Advances in Neural Information Processing Systems, 2002, pp. 505–512.
 [2] M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” Advances in Neural Information Processing Systems, p. 41, 2004.
 [3] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” The Journal of Machine Learning Research, vol. 10, pp. 207–244, 2009.
 [4] D. Kedem, S. Tyree, F. Sha, G. R. Lanckriet, and K. Q. Weinberger, “Nonlinear metric learning,” in Advances in Neural Information Processing Systems, 2012, pp. 2573–2581.

 [5] J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875–1882.
 [6] Y. Dong, B. Du, L. Zhang, L. Zhang, and D. Tao, “LAM3L: Locally adaptive maximum margin metric learning for visual data classification,” Neurocomputing, vol. 235, pp. 1–9, 2017.
 [7] W. Wang, B.G. Hu, and Z.F. Wang, “Globality and locality incorporation in distance metric learning,” Neurocomputing, vol. 129, pp. 185–198, 2014.
 [8] Y. Noh, B. Zhang, and D. Lee, “Generative local metric learning for nearest neighbor classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, p. 106, 2018.

 [9] C. Shen, J. Kim, F. Liu, L. Wang, and A. Van Den Hengel, “Efficient dual approach to distance metric learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 2, pp. 394–406, 2014.
 [10] Q. Qian, R. Jin, J. Yi, L. Zhang, and S. Zhu, “Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD),” Machine Learning, vol. 99, no. 3, pp. 353–372, 2015.
 [11] S. Ying, Z. Wen, J. Shi, Y. Peng, J. Peng, and H. Qiao, “Manifold preserving: An intrinsic approach for semisupervised distance metric learning,” IEEE Transactions on Neural Networks and Learning Systems, 2017.
 [12] H. Jia, Y.-M. Cheung, and J. Liu, “A new distance metric for unsupervised learning of categorical data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 5, pp. 1065–1079, 2016.
 [13] Y. Luo, Y. Wen, and D. Tao, “Heterogeneous multitask metric learning across multiple domains,” IEEE Transactions on Neural Networks and Learning Systems, 2017.
 [14] W. Wang, H. Wang, C. Zhang, and Y. Gao, “Cross-domain metric and multiple kernel learning based on information theory,” Neural Computation, pp. 1–36, 2018.
 [15] J. Huo, Y. Gao, Y. Shi, and H. Yin, “Cross-modal metric learning for AUC optimization,” IEEE Transactions on Neural Networks and Learning Systems, 2018.
 [16] J. Li, X. Lin, X. Rui, Y. Rui, and D. Tao, “A distributed approach toward discriminative distance metric learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 9, pp. 2111–2122, 2015.
 [17] R. Jin, S. Wang, and Y. Zhou, “Regularized distance metric learning: Theory and algorithm,” in Advances in Neural Information Processing Systems, 2009, pp. 862–870.
 [18] Z.-C. Guo and Y. Ying, “Guaranteed classification via regularized similarity learning,” Neural Computation, vol. 26, no. 3, pp. 497–522, 2014.
 [19] N. Verma and K. Branson, “Sample complexity of learning Mahalanobis distance metrics,” in Advances in Neural Information Processing Systems, 2015, pp. 2584–2592.
 [20] Q. Cao, Z.C. Guo, and Y. Ying, “Generalization bounds for metric and similarity learning,” Machine Learning, vol. 102, no. 1, pp. 115–132, 2016.
 [21] R. Flamary, M. Cuturi, N. Courty, and A. Rakotomamonjy, “Wasserstein discriminant analysis,” arXiv preprint arXiv:1608.08063, 2016.
 [22] H. Do and A. Kalousis, “Convex formulations of radiusmargin based support vector machines,” in International Conference on Machine Learning, 2013, pp. 169–177.
 [23] T. Jebara and P. K. Shivaswamy, “Relative margin machines,” in Advances in Neural Information Processing Systems, 2009, pp. 1481–1488.
 [24] N. Weaver, Lipschitz Algebras. World Scientific, 1999.
 [25] E. J. McShane, “Extension of range of functions,” Bulletin of the American Mathematical Society, vol. 40, no. 12, pp. 837–842, 1934.
 [26] H. Whitney, “Analytic extensions of differentiable functions defined in closed sets,” Transactions of the American Mathematical Society, vol. 36, no. 1, pp. 63–89, 1934.
 [27] U. von Luxburg and O. Bousquet, “Distance-based classification with Lipschitz functions,” The Journal of Machine Learning Research, vol. 5, pp. 669–695, 2004.
 [28] L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer, “Efficient classification for metric data,” IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5750–5759, 2014.
 [29] A. Globerson and S. Roweis, “Metric learning by collapsing classes,” in Advances in Neural Information Processing Systems, vol. 18, 2005, pp. 451–458.
 [30] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov, “Neighbourhood components analysis,” in Advances in Neural Information Processing Systems, 2005, pp. 513–520.
 [31] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optimization Methods and Software, vol. 11, no. 14, pp. 625–653, 1999.
 [32] N. Parikh, S. P. Boyd et al., “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014.

 [33] G.-B. Ye, Y. Chen, and X. Xie, “Efficient variable selection in support vector machines via the alternating direction method of multipliers,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 832–840.