# Metric Learning via Maximizing the Lipschitz Margin Ratio

In this paper, we propose the Lipschitz margin ratio and a new metric learning framework for classification based on maximizing this ratio. The framework integrates both the inter-class margin and the intra-class dispersion, and thereby enhances the generalization ability of a classifier. To introduce the Lipschitz margin ratio and its associated learning bound, we elaborate on the relationship between metric learning and Lipschitz functions, as well as the representability and learnability of Lipschitz functions. After proposing the new metric learning framework based on the Lipschitz margin ratio, we prove that several well-known metric learning algorithms can be viewed as special cases of the proposed framework. In addition, we illustrate the framework by implementing it for learning the squared Mahalanobis metric, and by demonstrating its encouraging results on eight popular machine learning datasets.


## I Introduction

Classification is a fundamental area of machine learning. For classification, it is crucial to measure the distance between instances appropriately. One of the most established classifiers, the nearest neighbor (NN) classifier, assigns a new instance to the class of the training instance with the shortest distance.

In practice it is often difficult to handcraft a well-suited and adaptive distance metric. To mitigate this issue, metric learning has been proposed to enable learning a metric automatically from the data available. Metric learning with a convex objective function was first proposed in the pioneering work of [1]. The large margin intuition was introduced into the research of metric learning by the seminal “large margin metric learning” (LMML) [2] and “large margin nearest neighbor” (LMNN) [3]. Besides the large margin approach, other inspiring metric learning strategies have been developed, such as nonlinear metrics [4, 5], localized strategies [6, 7, 8] and scalable/efficient algorithms [9, 10]. Metric learning has also been adopted by many other learning tasks, such as semi-supervised learning [11, 12], multi-task/cross-domain learning [13, 14], AUC optimization [15] and distributed approaches [16].

On top of the methodological and applied advancement of metric learning, some theoretical progress has also been made recently, in particular on deriving different types of generalization bounds for metric learning [17, 18, 19, 20]. These developments have theoretically justified the performance of metric learning algorithms. However, they generally lack a geometrical link with the classification margin and are not as interpretable as one may expect (e.g. unlike the clear relationship between the margin and the norm of the weight vector in support vector machines (SVM)).

Besides the inter-class margin, the intra-class dispersion is also crucial to classification [21, 22, 23]. The intra-class dispersion is especially important for metric learning, because different metrics may lead to similar inter-class margins yet quite different intra-class dispersions. As illustrated in Figure 1, although the margins in those different metric spaces are exactly the same, the classification becomes more difficult as the margin ratio decreases. Therefore, the seminal work of [1] and many later works made efforts to consider the inter-class margin and the intra-class dispersion at the same time.

In this paper, we propose a new concept, the Lipschitz margin ratio, to integrate both inter-class and intra-class properties, and, through maximizing the Lipschitz margin ratio, a new metric learning framework that enhances the generalization ability of a classifier. These two novelties constitute our main contributions in this work.

To achieve these two aims and present our contributions in a well-structured way, we organize the rest of this paper as follows. Firstly, in Section II we discuss the relationship between the distance-based classification / metric learning and Lipschitz functions. We show that a Lipschitz extension, which is a distance-based function, can be regarded as a generalized nearest neighbor model, which enjoys great representation ability. Then, in Section III we introduce the Lipschitz margin ratio, and we point out that its associated learning bound indicates the desirability of maximizing the Lipschitz margin ratio, for enhancing the generalization ability of Lipschitz extensions. Consequently in Section IV, we propose a new metric learning framework through maximizing the Lipschitz margin ratio. Moreover, we prove that many well known metric learning algorithms can be shown as special cases of the proposed framework. Then for illustrative purposes, we implement the framework for learning the squared Mahalanobis metric. The method is presented in Section IV-C, and its experimental results in Section V, which demonstrate the superiority of the proposed method. Finally, we draw conclusions and discuss future work in Section VI. For the convenience of readers, some theoretical proofs are deferred to the Appendix.

## II Lipschitz Functions and Distance-based Classifiers

### II-A Definition of Lipschitz Functions

To start with, we will review the definitions of Lipschitz functions, the Lipschitz constant and the Lipschitz set.

###### Definition 1.

[24] Let (X, ρX) be a metric space. A function f: X → R is called Lipschitz continuous if there exists a constant C such that, for all x1, x2 ∈ X,

 |f(x1)−f(x2)| ≤ C ρX(x1,x2).

The Lipschitz constant L(f) of a Lipschitz function f is

 L(f) = inf{C ∈ R | ∀ x1, x2 ∈ X, |f(x1)−f(x2)| ≤ C ρX(x1,x2)} = sup_{x1,x2∈X: x1≠x2} |f(x1)−f(x2)| / ρX(x1,x2),

and a function f is also called an L-Lipschitz function if its Lipschitz constant is at most L. Meanwhile, all L-Lipschitz functions constitute the L-Lipschitz set

 L-Lip(X) = {f: X → R; L(f) ≤ L}.

From these definitions, we can observe that the Lipschitz constant is fundamentally connected with the metric ρX, and that the Lipschitz functions specify a family of “smooth” functions, whose change in output values is bounded by the distances in the input space.
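For intuition, the Lipschitz constant over a finite sample can be estimated directly from its definition as the largest ratio |f(x1)−f(x2)|/ρX(x1,x2). The following sketch is our own illustration (the function and variable names are not from the paper); over a finite sample it yields a lower bound on the true constant:

```python
import numpy as np

def lipschitz_constant(f_vals, X, metric):
    """Estimate L(f) = sup |f(x1)-f(x2)| / rho(x1, x2) over a finite sample.

    f_vals: f evaluated at the sample points; X: sample points, one per row;
    metric: callable rho(x1, x2) returning a non-negative distance.
    """
    L = 0.0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(X[i], X[j])
            if d > 0:
                L = max(L, abs(f_vals[i] - f_vals[j]) / d)
    return L

# Example: f(x) = 2x is 2-Lipschitz under the Euclidean metric.
euclid = lambda a, b: float(np.linalg.norm(a - b))
X = np.array([[0.0], [1.0], [3.0]])
L = lipschitz_constant(2 * X[:, 0], X, euclid)  # -> 2.0
```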

### II-B Lipschitz Extensions and Distance-based Classifiers

Distance-based classifiers are classifiers built upon certain kinds of distance metrics. Most distance-based classifiers stem from the nearest neighbor (NN) classifier. To decide the class label of a new instance, the NN classifier compares the distances between the new instance and the training instances.

In binary classification tasks, a Lipschitz function f is commonly used as the classification function, and an instance x is then classified according to the sign of f(x). Using Theorem 1, we shall present a family of Lipschitz functions, called Lipschitz extensions. We shall also show that Lipschitz extensions yield a distance-based classifier, and that a special case of Lipschitz extensions returns exactly the same classification result as the NN classifier.

###### Theorem 1.

[25, 26, 24, 27] (McShane-Whitney Extension Theorem) Given a function u defined on a finite subset A ⊂ X, there exists a family of functions Uα which coincide with u on A, are defined on the whole space X, and have the same Lipschitz constant L as u. Additionally, it is possible to explicitly construct Uα in the following form, and they are called L-Lipschitz extensions of u:

 Uα(x) = αU1(x) + (1−α)U2(x),

where α ∈ [0, 1],

 U1(x) = inf_{a∈A} {u(a) + Lρ(x,a)},
 U2(x) = sup_{a∈A} {u(a) − Lρ(x,a)}.

Theorem 1 can be readily validated by calculating the values of U1 and U2 on the finite set A. The bound on the Lipschitz constant of U1 and U2 can be proved on the basis of the Lemmas in the Appendix.

Theorem 1 clearly shows that Lipschitz extensions are distance-based functions. Moreover, we can illustrate the relationship between Lipschitz extension functions and the empirical risk as follows.

Assume S = {(xi, ti)}, i = 1, …, N, is the set of training instances of a classification task, with labels ti ∈ {1, −1}. If there are no xi, xj such that xi = xj while their labels ti ≠ tj (i.e. no overlap between training instances from different classes), setting f(xi) = ti would result in zero empirical risk, and f would be a Lipschitz function with Lipschitz constant L0,

 L0 = sup_{i,j} |ti−tj| / ρ(xi,xj),

where the existence of such a function f, i.e. the Lipschitz extensions, is guaranteed by Theorem 1.

That is, when doing classification, if we set the constant L of a Lipschitz extension to be larger than L0, zero empirical risk can be obtained. In other words, as distance-based functions, Lipschitz extensions enjoy excellent representation ability for classification tasks.

Moreover, if we set α = 1/2, Lipschitz extensions have exactly the same classification results as the NN classifier:

###### Proposition 1.

[27] The function U1/2 defined above has the same sign, i.e. yields the same classification results, as the NN classifier.
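To make Theorem 1 and Proposition 1 concrete, the following sketch (our own illustration with hypothetical names, not code from the paper) evaluates the midpoint extension U1/2 of the labels on toy data and checks that its sign agrees with the 1-NN decision:

```python
import numpy as np

def lipschitz_extension(x, X_train, t_train, L, metric):
    """U_{1/2}(x): midpoint of the upper and lower McShane-Whitney
    extensions of the training labels t_train."""
    d = np.array([metric(x, xi) for xi in X_train])
    u1 = np.min(t_train + L * d)   # upper extension U1
    u2 = np.max(t_train - L * d)   # lower extension U2
    return 0.5 * (u1 + u2)

# Toy data: sign(U_{1/2}) reproduces the 1-NN decision (Proposition 1).
euclid = lambda a, b: float(np.linalg.norm(a - b))
X_train = np.array([[0.0, 0.0], [2.0, 0.0]])
t_train = np.array([1.0, -1.0])
x = np.array([0.4, 0.1])                       # nearer to the +1 instance
f = lipschitz_extension(x, X_train, t_train, L=1.0, metric=euclid)
nn_label = t_train[np.argmin([euclid(x, xi) for xi in X_train])]
assert np.sign(f) == nn_label
```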

## III Lipschitz Margin Ratio

In the previous section, we showed that Lipschitz extensions can be viewed as distance-based classifiers, and that their representation ability is so strong that zero empirical error can be obtained under mild conditions. In this section, we propose the Lipschitz margin ratio to control the model complexity of Lipschitz functions and hence improve their generalization ability. To start with, we give an intuitive way to understand the Lipschitz margin and the Lipschitz margin ratio. Then, learning bounds based on the Lipschitz margin ratio are presented.

### III-A Lipschitz Margin

We define the training set of class c as Sc = {xi | ti = c}, where c ∈ {1, −1}, and the decision boundary of the classification function f as Hf = {x ∈ X : f(x) = 0}. The margin used in [27] is equivalent to the Lipschitz margin defined below.

###### Definition 2.

The Lipschitz margin is the distance between the training sets S1 and S−1:

 L-Margin=D(S1,S−1)=minxi∈S−1,xj∈S1ρ(xi,xj). (1)

The relationship between the Lipschitz margin and the Lipschitz constant is established as follows.

###### Proposition 2.

For any L-Lipschitz function f satisfying f(xi) ≥ 1 for xi ∈ S1 and f(xj) ≤ −1 for xj ∈ S−1,

 L-Margin ≥ 2/L(f). (2)
###### Proof.

Let xn ∈ S−1 and xm ∈ S1 denote the nearest instances from different classes, i.e.

 ρ(xn,xm) = D(S1,S−1) = min_{xi∈S−1, xj∈S1} ρ(xi,xj).

It is straightforward to see

 D(S1,S−1) = ρ(xn,xm) ≥ |f(xn)−f(xm)| / L(f) ≥ 2/L(f),

where the first inequality follows from the definition of the Lipschitz constant, and the second inequality holds because f(xm) ≥ 1 and f(xn) ≤ −1, so |f(xn)−f(xm)| ≥ 2. ∎

The proposition shows that the Lipschitz margin can be lower bounded by twice the multiplicative inverse of the Lipschitz constant.

The Lipschitz margin is closely related to the margin adopted in SVM, i.e. the distance between the hyperplane Hf and the training instances S,

 D(S,Hf) = min_{xi∈S, h∈Hf} ρ(xi,h).

As illustrated in Figure 2, the Lipschitz margin is also suitable for the classification of non-linearly separable classes. The relationship between these two types of margins is described via the following proposition.

###### Proposition 3.

In the Euclidean space, let f be any continuous function which correctly classifies all the training instances, i.e. ti f(xi) > 0 for all i; then

 D(S1,S−1) ≥ 2D(S,Hf).
###### Proof.

In the Euclidean space,

 D(S1,S−1) = min_{xi∈S−1, xj∈S1} ρE(xi,xj),
 D(S,Hf) = min_{xi∈S, h∈Hf} ρE(xi,h),

where ρE is the Euclidean distance.

Let xn and xm denote the nearest instances from different classes, i.e.

 ρE(xn,xm) = D(S1,S−1) = min_{xi∈S−1, xj∈S1} ρE(xi,xj),

where xn ∈ S−1 and xm ∈ S1.

We define a connected set B = {βxn + (1−β)xm : β ∈ [0,1]}, which is the line segment between xn and xm. Because f(xn) < 0 < f(xm), and any continuous function maps connected sets into connected sets, there exists z ∈ B such that f(z) = 0. According to the definition of Hf, we can see z ∈ Hf. Therefore,

 D(S,Hf) = min_{xi∈S, h∈Hf} ρE(xi,h) ≤ min_{xi∈S} ρE(xi,z) ≤ (ρE(xn,z) + ρE(xm,z))/2 = ρE(xn,xm)/2 = D(S1,S−1)/2,

where the second equality follows from z lying on the line segment between xn and xm. ∎

### III-B Lipschitz Margin Ratio

The Lipschitz margin discussed above effectively depicts the inter-class relationship. However, as mentioned before, different metrics result in different intra-class dispersions, so it is also important to consider intra-class properties when learning a metric. Hence we propose the Lipschitz margin ratio to incorporate both the inter-class and intra-class properties into metric learning.

###### Definition 3.

[24] The diameter of a metric space (X, ρ) is defined as

 diam(X,ρ) = sup_{xi,xj∈X} ρ(xi,xj).

The Lipschitz margin ratio is then defined as the ratio between the margin and either diam(X,ρ) (i.e. the diameter) or diam(S1,ρ) + diam(S−1,ρ) (i.e. the sum of the intra-class dispersions), as follows.

###### Definition 4.

The Diameter Lipschitz Margin Ratio (L-RatioDiam) and the Intra-Class Dispersion Lipschitz Margin Ratio (L-RatioIntra) in a metric space (X, ρ) are defined as

 L-RatioDiam=D(S1,S−1)diam(X,ρ)=minxi∈S−1,xj∈S1ρ(xi,xj)supxi,xj∈Xρ(xi,xj),
 L-RatioIntra=D(S1,S−1)diam(S1,ρ)+diam(S−1,ρ)=minxi∈S−1,xj∈S1ρ(xi,xj)supxi,xj∈S1ρ(xi,xj)+supxi,xj∈S−1ρ(xi,xj).
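These quantities can be computed directly on a finite training sample. The sketch below is our own illustration (diam(X,ρ) is estimated by the sample diameter); on the one-dimensional toy data it also exhibits the decomposition diam(X,ρ) = diam(S1,ρ) + diam(S−1,ρ) + D(S1,S−1) discussed below:

```python
import numpy as np

def pairwise_min_max(A, B, metric):
    """Minimum and maximum of rho over all pairs drawn from A x B."""
    d = np.array([[metric(a, b) for b in B] for a in A])
    return d.min(), d.max()

def lipschitz_margin_ratios(S1, Sm1, metric):
    """Return (L-Margin, L-Ratio_Diam, L-Ratio_Intra) for training sets
    S1 (label +1) and Sm1 (label -1)."""
    margin, _ = pairwise_min_max(S1, Sm1, metric)   # D(S1, S-1)
    _, diam1 = pairwise_min_max(S1, S1, metric)     # diam(S1, rho)
    _, diam2 = pairwise_min_max(Sm1, Sm1, metric)   # diam(S-1, rho)
    S = np.vstack([S1, Sm1])
    _, diamX = pairwise_min_max(S, S, metric)       # sample diameter
    return margin, margin / diamX, margin / (diam1 + diam2)

euclid = lambda a, b: float(np.linalg.norm(a - b))
S1 = np.array([[0.0], [1.0]])      # class +1
Sm1 = np.array([[3.0], [4.0]])     # class -1
margin, r_diam, r_intra = lipschitz_margin_ratios(S1, Sm1, euclid)
# margin = 2, diamX = 4 = 1 + 1 + 2, so r_diam = 0.5 and r_intra = 1.0
```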

The relationship between diam(X,ρ) and diam(S1,ρ) + diam(S−1,ρ) can be established via the following proposition.

###### Proposition 4.

In a metric space (X, ρ),

 diam(X,ρ) ≤ diam(S1,ρ) + diam(S−1,ρ) + D(S−1,S1),

and consequently 1/L-RatioDiam ≤ 1/L-RatioIntra + 1.

###### Proof.

See Appendix -A. ∎

In this inequality, diam(S1,ρ) and diam(S−1,ρ) indicate the maximum intra-class distances, and D(S−1,S1) indicates the inter-class margin. Therefore, an inverse margin ratio penalty will push the learner to select a metric which pulls the instances from the same class closer (small diam(S1,ρ) and diam(S−1,ρ)) and enlarges the margin between the instances from different classes (large D(S−1,S1)). In a very simple (linearly separable, one-dimensional) case, as illustrated in Figure 3, diam(X,ρ) can be decomposed directly into the intra-class dispersions (diam(S1,ρ), diam(S−1,ρ)) and the inter-class margin (D(S−1,S1)).

Then we can bound the Lipschitz margin ratio using the Lipschitz constant and the diameter of metric space:

###### Proposition 5.

For any L-Lipschitz function f satisfying f(xi) ≥ 1 for xi ∈ S1 and f(xj) ≤ −1 for xj ∈ S−1,

 L-RatioDiam ≥ 2 / (L diam(X,ρ)),
 L-RatioIntra ≥ 2 / (L diam(S1,ρ) + L diam(S−1,ρ)).
###### Proof.

The inequalities can be obtained by substituting the result of Proposition 2. ∎

Based on this proposition, although it is not possible to calculate the exact value of the Lipschitz margin ratio in most cases, we can use 2/(L diam(X,ρ)) or 2/(L diam(S1,ρ) + L diam(S−1,ρ)) as a surrogate. For example, in the objective function of metric learning by maximizing the Lipschitz margin ratio, we can maximize one of these surrogates, or equivalently minimize L diam(X,ρ) or L(diam(S1,ρ) + diam(S−1,ρ)).

Furthermore, in some cases we may be more interested in local properties rather than global ones (see also Section IV-B). In those cases we can define the local Lipschitz margin ratio as follows.

###### Definition 5.

The local Lipschitz margin ratios with subset Sl ⊆ S and metric ρl are defined as

 Local-RatioDiam = D(Sl1, Sl−1) / diam(Sl, ρl),
 Local-RatioIntra = D(Sl1, Sl−1) / (diam(Sl1, ρl) + diam(Sl−1, ρl)),

where Slc indicates the local training set of class c and Sl = Sl1 ∪ Sl−1.

### III-C Learning Bounds of the Lipschitz Margin Ratio

In the section above, we defined the Lipschitz margin ratio, which is a measure of model complexity. In this section, we establish the effectiveness of the Lipschitz margin ratio by showing the relationship between its lower bound and the generalization ability.

###### Definition 6.

[28] For a metric space (X, ρ), let λ be the smallest number such that every ball in X can be covered by λ balls of half the radius. Then λ is called the doubling constant of X, and the doubling dimension of X is ddim(X) = log2 λ.

As presented in [28], a low Euclidean dimension implies a low doubling dimension (Euclidean metrics of dimension k have doubling dimension O(k)); a low doubling dimension is more general than a low Euclidean dimension and can be utilized to measure the ‘dimension’ of a general metric space.

###### Definition 7.

We say that F γ-shatters x1, …, xn ∈ X if there exist witnesses s1, …, sn ∈ R such that, for every ε ∈ {−1, 1}^n, there exists fε ∈ F such that

 εt(fε(xt)−st) ≥ γ, t = 1, …, n.

The fat-shattering dimension is defined as

 fatγ(F) = max{n; ∃ x1,…,xn ∈ X, s.t. F γ-shatters x1,…,xn}.
###### Theorem 2.

[28] Let F be the collection of real-valued functions over X with Lipschitz constant at most L. Define D = fat1/16(F) and let P be some probability distribution on X × {−1, 1}. Suppose that (x1, t1), …, (xn, tn) are drawn from X × {−1, 1} independently according to P. Then for any f ∈ F that classifies a sample of size n correctly, we have with probability at least 1−δ,

 P{(x,t): sign[f(x)] ≠ t} ≤ (2/n)(D log2(34en/D) log2(578n) + log2(4/δ)).

Furthermore, if f is correct on all but k examples, we have with probability at least 1−δ,

 P{(x,t): sign[f(x)] ≠ t} ≤ k/n + √((2/n)(D log2(34en/D) log2(578n) + log2(4/δ))). (3)
###### Proposition 6.

In classification problems, when f(xi) ≥ 1 for xi ∈ S1, f(xj) ≤ −1 for xj ∈ S−1, and the Lipschitz constant is taken as L = 2/D(S−1,S1), then D can be bounded by the surrogate of the Lipschitz margin ratio as follows:

 D ≤ (16L diam(X,ρ))^ddim(X) ≤ (16(L(diam(S1,ρ) + diam(S−1,ρ)) + 2))^ddim(X). (4)
###### Proof.

The first inequality has been proved in [28]. We prove the second inequality here. Because L = 2/D(S−1,S1), we have

 L D(S−1,S1) = 2.

It follows that

 L diam(X,ρ) ≤ L(diam(S1,ρ) + diam(S−1,ρ) + D(S−1,S1)) = L(diam(S1,ρ) + diam(S−1,ρ)) + 2,

where the inequality is based on Proposition 4. Meanwhile, because the map c ↦ (16c)^ddim(X) is non-decreasing, the second inequality holds. ∎

###### Corollary 1.

Under the condition that L = 2/D(S−1,S1), the following bounds for the surrogate margin ratios hold. If f is correct on all but k examples, we have with probability at least 1−δ,

 P{(x,t): sign[f(x)] ≠ t} ≤ k/n + √((2/n)((16C)^ddim(X) log2(34en/(16C)^ddim(X)) log2(578n) + log2(4/δ))), (5)

where C = L diam(X,ρ) or C = L(diam(S1,ρ) + diam(S−1,ρ)) + 2.

###### Proof.

Substitute the inequalities of Proposition 6 into Theorem 2. ∎

The above learning bound illustrates the relationship between the generalization error (i.e. the difference between the expected error P{(x,t): sign[f(x)] ≠ t} and the empirical error k/n) and the surrogate inverse Lipschitz margin ratio L diam(X,ρ) or L(diam(S1,ρ) + diam(S−1,ρ)). Therefore, reducing the value of the surrogate inverse Lipschitz margin ratio helps reduce the gap between the empirical error and the expected error, which implies an improvement in the generalization ability of the model. In other words, the learning bound indicates that minimizing the inverse Lipschitz margin ratio is an effective way to enhance the generalization ability and control the model complexity.

## IV Metric Learning via Maximizing the Lipschitz Margin Ratio

From previous sections, we have seen that Lipschitz functions have the following desirable properties relevant to metric learning:

• (Close relationship with metrics) The definitions of the Lipschitz constant, Lipschitz functions and Lipschitz extensions have a natural relationship with metrics.

• (Strong representation ability) Lipschitz functions, in particular Lipschitz extensions, can attain small empirical risks under mild conditions, which illustrates their strong representational capability.

• (Good generalization ability) The complexity of Lipschitz functions can be controlled by penalizing the inverse Lipschitz margin ratio.

Therefore, it is reasonable for us to conduct metric learning with the Lipschitz functions and control the model complexity by maximizing (the lower bound of) the Lipschitz margin ratio.

### IV-A Learning Framework

Similarly to other structural risk minimization approaches, we minimize the empirical risk and maximize (the lower bound of) the Lipschitz margin ratio in the proposed framework. To estimate (the lower bound of) the Lipschitz margin ratio, we may either

• use the training instances to estimate the Lipschitz constant and the diameters, obtaining the empirical values ^L and ^diam; or

• adopt upper bounds of L and diam by applying the properties of the classifier f and the metric space (X, ρ).

The optimization problem could be formulated as follows:

 min_{ξ,a,ρ} 1/L-Ratio + α ∑_{i=1}^N ξi
 s.t. ti f(xi; a, ρ) ≥ 1 − ξi,
 ξi ≥ 0, i = 1, …, N, (6)

where N indicates the number of training instances; a denotes the parameters of the classification function f; ξi is the hinge-loss slack variable; and α is a trade-off parameter which balances the empirical risk term ∑ξi and the generalization ability term 1/L-Ratio. The quantities L and diam in the L-Ratio term will be replaced by either the empirically estimated values ^L and ^diam or their theoretical upper bounds.

Empirical estimates ^L and ^diam can be added as constraints:

 |f(xi; a, ρ) − f(xj; a, ρ)| / ρ(xi,xj) ≤ ^L,
 ρ(xi,xj) ≤ ^diam(X,ρ), where xi ∈ S, xj ∈ S,
 ρ(xi,xj) ≤ ^diam(S1,ρ), where xi ∈ S1, xj ∈ S1,
 ρ(xi,xj) ≤ ^diam(S−1,ρ), where xi ∈ S−1, xj ∈ S−1.

Then the objective function for minimizing L diam(X,ρ) becomes

 min_{ξ,a,ρ,^L,^diam} ^L ^diam(X,ρ) + α ∑_{i=1}^N ξi, (7)

where the penalty ^L ^diam(X,ρ) tries to maximize the inter-class margin (via minimizing ^L) and minimize the overall diameter (via minimizing ^diam(X,ρ)).

The objective function for minimizing L(diam(S1,ρ) + diam(S−1,ρ)) becomes

 min_{ξ,a,ρ,^L,^diam} ^L(^diam(S1,ρ) + ^diam(S−1,ρ)) + α ∑_{i=1}^N ξi,

or we can minimize an upper bound of it as

 min_{ξ,a,ρ,^L,^diam} 2^L max(^diam(S1,ρ), ^diam(S−1,ρ)) + α ∑_{i=1}^N ξi, (8)

where the penalty terms try to maximize the inter-class margin (via minimizing ^L) and minimize the intra-class dispersion (via minimizing ^diam(S1,ρ) + ^diam(S−1,ρ) or max(^diam(S1,ρ), ^diam(S−1,ρ))) at the same time.

### IV-B Relationship with other Metric Learning Methods

Some widely adopted metric learning algorithms can be viewed as special cases of the proposed framework.

As presented in Appendix -C, based on our framework, the penalty term of LMML [2] can be interpreted as an upper bound of the inverse margin ratio, and the framework suggests a reasonable strategy for choosing the target neighbors and the impostor neighbors in LMML. Also, as discussed in Appendix -D, the penalty term of LMNN [3] can be interpreted as an upper bound of the inverse local margin ratio.

### IV-C Applying the Framework for Learning the Squared Mahalanobis Metric

We now apply the proposed framework to learn the squared Mahalanobis metric,

 ρM(xi,xj)=(xi−xj)TM(xi−xj),M∈M+,

where M+ is the set of positive semi-definite matrices. A Lipschitz extension function U1/2 is selected as the classifier:

 f(x;a,ρ)=U1/2(x)=12mini=1,…,N(ai+LρM(x,xi))+12maxi=1,…,N(ai−LρM(x,xi)). (9)

In binary classification tasks, let ti ∈ {1, −1} indicate the label of xi, i = 1, …, N.
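As a quick sanity check on the metric itself (our own illustration, not part of the algorithm): any M ∈ M+ factors as M = AᵀA, so ρM reduces to a squared Euclidean distance after the linear map x ↦ Ax, which is why ρM is always non-negative:

```python
import numpy as np

def rho_M(xi, xj, M):
    """Squared Mahalanobis metric rho_M(xi, xj) = (xi-xj)^T M (xi-xj)."""
    d = xi - xj
    return float(d @ M @ d)

# M = A^T A is positive semi-definite, and rho_M equals the squared
# Euclidean distance between A xi and A xj.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
M = A.T @ A
xi, xj = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert abs(rho_M(xi, xj, M) - np.sum((A @ xi - A @ xj) ** 2)) < 1e-12
```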

Based on the framework of (6) and (7), we first propose an optimization formulation which penalizes the surrogate ^L ^diam:

 min_{a,ξ,M,^diam,^L} ^L ^diam + α ∑_{i=1}^N ξi
 s.t. |ai − aj| / ρM(xi,xj) ≤ ^L,
 ρM(xi,xj) ≤ ^diam,
 ti ai = 1 − ξi,
 ξi ≥ 0, M ∈ M+,
 xi ∈ S, xj ∈ S. (10)

At first glance, the optimization problem seems quite complex. However, based on the smoothness assumption, the balanced-class assumption and some equivalent transformations, as illustrated in Appendix -E, the following optimization problem can be obtained:

 min_{ξ,M′,d} cd + ∑ ξij
 s.t. ρM′(xi,xj) ≥ 2 − ξij, where xi and xj are instance pairs with different labels,
 ρM′(xm,xn) ≤ d, where xm, xn ∈ S,
 ξij ≥ 0, M′ ∈ M+. (11)

Intuitively speaking, the first set of inequality constraints indicates that the distances between samples from different classes should be large, and the diameter constraints indicate that the estimated diameter d should be small.

Based on the framework in (6) and (8), we can also propose an optimization formulation which penalizes the upper bound of L(diam(S1,ρ) + diam(S−1,ρ)):

 min_{a,ξ,M,^diam,^L} ^L ^diam + α ∑_{i=1}^N ξi
 s.t. |ai − aj| / ρM(xi,xj) ≤ ^L,
 ρM(xm,xn) ≤ ^diam, where xm and xn are instance pairs with the same label,
 ti ai = 1 − ξi,
 ξi ≥ 0, M ∈ M+,
 xi, xj ∈ S. (12)

The only difference between (10) and (12) lies in the instance pairs selected to estimate ^diam: (10) uses all instance pairs to estimate the diameter of the whole training set, while (12) uses the instance pairs with the same label to estimate the maximum intra-class dispersion. Similarly to the transformations from (10) to (11), the following optimization problem can be obtained:

 min_{ξ,M′,d} cd + ∑ ξi
 s.t. ρM′(xi,xj) ≥ 2 − ξi − ξj, where xi and xj are instance pairs with different labels,
 ρM′(xm,xn) ≤ d, where xm and xn are instance pairs with the same label,
 ξi ≥ 0, M′ ∈ M+. (13)

In order to solve (11) and (13) more efficiently, the alternating direction method of multipliers (ADMM) has been adopted (see Algorithm 1); the detailed derivation of the ADMM algorithm is presented in Appendix -F.
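For readers without a cvx/SeDuMi setup, a rough first-order sketch of a (13)-style problem can convey the mechanics. The code below is our own illustration, not the ADMM solver of Algorithm 1: it parameterizes M′ = AᵀA (which keeps M′ positive semi-definite for free), treats the largest same-label squared distance as the estimated diameter d, and descends a finite-difference subgradient of c·d plus the hinge slacks on different-label pairs:

```python
import numpy as np

def objective(A, X, t, c=1.0):
    """c*d + sum of hinge slacks: a simplified surrogate of problem (13)."""
    G = X @ A.T                    # instances mapped by x -> A x
    slack, d = 0.0, 0.0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            r = float(np.sum((G[i] - G[j]) ** 2))  # rho_M'(xi, xj)
            if t[i] == t[j]:
                d = max(d, r)                      # estimated diameter d
            else:
                slack += max(0.0, 2.0 - r)         # hinge slack
    return slack + c * d

def learn_metric(X, t, c=1.0, lr=0.01, steps=150, eps=1e-4):
    """Finite-difference subgradient descent on A, with M' = A^T A."""
    p = X.shape[1]
    A = np.eye(p)
    for _ in range(steps):
        base = objective(A, X, t, c)
        g = np.zeros_like(A)
        for a in range(p):
            for b in range(p):
                A2 = A.copy()
                A2[a, b] += eps
                g[a, b] = (objective(A2, X, t, c) - base) / eps
        A -= lr * g
    return A.T @ A                 # M' is PSD by construction

X = np.array([[0.0], [1.0], [3.0], [4.0]])
t = np.array([1, 1, -1, -1])
M = learn_metric(X, t)
```

On this toy problem the learned metric shrinks distances until the same-label diameter balances the different-label hinge constraints; in the paper the problem is instead solved exactly as a semidefinite program via cvx/SeDuMi or via ADMM.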

## V Experiments

To evaluate the performance of our proposed methods, we compare them with four widely adopted distance-based algorithms: Nearest Neighbor (NN), Large Margin Nearest Neighbor (LMNN) [3], Maximally Collapsing Metric Learning (MCML) [29] and Neighborhood Components Analysis (NCA) [30]. Under our framework, we have implemented Lip-Diam (based on the diameter Lipschitz margin ratio), Lip-Intra (based on the intra-class Lipschitz margin ratio), and their ADMM-based fast versions Lip-Diam(P) and Lip-Intra(P).

Our proposed Lip-Diam and Lip-Intra are implemented using the cvx toolbox in MATLAB with the SeDuMi solver [31]. The trade-off parameter α in our algorithm and the penalty parameter in the ADMM algorithm are fixed throughout. The LMNN, MCML and NCA implementations are from the dimension reduction toolbox.

In the experiments, we focus on the most representative task, binary classification. Eight publicly available datasets from the UCI and LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) websites are adopted to evaluate the performance, namely Statlog/LibSVM Australian Credit Approval (Australian), UCI/LibSVM Original Breast Cancer Wisconsin (Cancer), UCI/LibSVM Pima Indians Diabetes (Diabetes), UCI Echocardiogram (Echo), UCI Fertility (Fertility), LibSVM Fourclass (Fourclass), UCI Haberman’s Survival (Haberman) and UCI Congressional Voting Records (Voting). For each dataset, a portion of the instances is randomly selected as training samples and the rest as test samples. This process is repeated several times and the mean accuracy is reported.

As shown in Table I, the proposed Lip algorithms achieve the best mean accuracy on four datasets and tie with MCML for the best on one dataset. The proposed method outperforms 1-NN and NCA on seven datasets, and LMNN and MCML on five datasets. The only dataset on which the Lip algorithms perform worse than all other methods is Fertility, where our method potentially suffers from within-class outliers and hence a large intra-class dispersion. Apart from this dataset, LMNN or MCML outperforms the Lip algorithms by only a small performance gap. Such encouraging results demonstrate the effectiveness of the proposed framework.

## VI Conclusions and Future Work

In this paper, we have shown that the representation ability of Lipschitz functions is very strong and that the complexity of Lipschitz functions in a metric space can be controlled by penalizing the Lipschitz margin ratio. Based on these desirable properties, we have proposed a new metric learning framework via maximizing the Lipschitz margin ratio. An application of this framework to learning the squared Mahalanobis metric has been implemented, and the experimental results are encouraging.

The diameter Lipschitz margin ratio or the intra-class Lipschitz margin ratio in the optimization objective is equivalent to an adaptive regularization. In other words, since we encourage samples to stay close within the same class, samples located near the class boundary are weighted more heavily than those near the center. Therefore, the performance of our method may deteriorate in the presence of outliers, and this problem has been observed on the Fertility dataset. We aim to develop more robust methods in future work.

The local properties within a dataset can vary dramatically, and hence it is worthwhile to develop an algorithm based on the local Lipschitz margin ratio. One option is to follow the idea of LMNN, learning a single global metric while considering different local Lipschitz margin ratios; alternatively, we can learn a separate metric in each local area.

## References

• [1] E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng, “Distance metric learning with application to clustering with side-information,” in Advances in Neural Information Processing Systems, 2002, pp. 505–512.
• [2] M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” Advances in Neural Information Processing Systems, p. 41, 2004.
• [3] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” The Journal of Machine Learning Research, vol. 10, pp. 207–244, 2009.
• [4] D. Kedem, S. Tyree, F. Sha, G. R. Lanckriet, and K. Q. Weinberger, “Non-linear metric learning,” in Advances in Neural Information Processing Systems, 2012, pp. 2573–2581.
• [5] J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875–1882.
• [6] Y. Dong, B. Du, L. Zhang, L. Zhang, and D. Tao, “LAM3L: Locally adaptive maximum margin metric learning for visual data classification,” Neurocomputing, vol. 235, pp. 1–9, 2017.
• [7] W. Wang, B.-G. Hu, and Z.-F. Wang, “Globality and locality incorporation in distance metric learning,” Neurocomputing, vol. 129, pp. 185–198, 2014.
• [8] Y. Noh, B. Zhang, and D. Lee, “Generative local metric learning for nearest neighbor classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 1, p. 106, 2018.
• [9] C. Shen, J. Kim, F. Liu, L. Wang, and A. Van Den Hengel, “Efficient dual approach to distance metric learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 2, pp. 394–406, 2014.
• [10] Q. Qian, R. Jin, J. Yi, L. Zhang, and S. Zhu, “Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD),” Machine Learning, vol. 99, no. 3, pp. 353–372, 2015.
• [11] S. Ying, Z. Wen, J. Shi, Y. Peng, J. Peng, and H. Qiao, “Manifold preserving: An intrinsic approach for semisupervised distance metric learning,” IEEE transactions on neural networks and learning systems, 2017.
• [12] H. Jia, Y.-m. Cheung, and J. Liu, “A new distance metric for unsupervised learning of categorical data,” IEEE transactions on neural networks and learning systems, vol. 27, no. 5, pp. 1065–1079, 2016.
• [13] Y. Luo, Y. Wen, and D. Tao, “Heterogeneous multitask metric learning across multiple domains,” IEEE transactions on neural networks and learning systems, 2017.
• [14] W. Wang, H. Wang, C. Zhang, and Y. Gao, “Cross-domain metric and multiple kernel learning based on information theory,” Neural computation, no. Early Access, pp. 1–36, 2018.
• [15] J. Huo, Y. Gao, Y. Shi, and H. Yin, “Cross-modal metric learning for AUC optimization,” IEEE Transactions on Neural Networks and Learning Systems, 2018.
• [16] J. Li, X. Lin, X. Rui, Y. Rui, and D. Tao, “A distributed approach toward discriminative distance metric learning,” IEEE transactions on neural networks and learning systems, vol. 26, no. 9, pp. 2111–2122, 2015.
• [17] R. Jin, S. Wang, and Y. Zhou, “Regularized distance metric learning: Theory and algorithm,” in Advances in neural information processing systems, 2009, pp. 862–870.
• [18] Z.-C. Guo and Y. Ying, “Guaranteed classification via regularized similarity learning,” Neural computation, vol. 26, no. 3, pp. 497–522, 2014.
• [19] N. Verma and K. Branson, “Sample complexity of learning Mahalanobis distance metrics,” in Advances in Neural Information Processing Systems, 2015, pp. 2584–2592.
• [20] Q. Cao, Z.-C. Guo, and Y. Ying, “Generalization bounds for metric and similarity learning,” Machine Learning, vol. 102, no. 1, pp. 115–132, 2016.
• [21] R. Flamary, M. Cuturi, N. Courty, and A. Rakotomamonjy, “Wasserstein discriminant analysis,” arXiv preprint arXiv:1608.08063, 2016.
• [22] H. Do and A. Kalousis, “Convex formulations of radius-margin based support vector machines,” in International Conference on Machine Learning, 2013, pp. 169–177.
• [23] T. Jebara and P. K. Shivaswamy, “Relative margin machines,” in Advances in Neural Information Processing Systems, 2009, pp. 1481–1488.
• [24] N. Weaver, Lipschitz Algebras.   World Scientific, 1999.
• [25] E. J. McShane, “Extension of range of functions,” Bulletin of the American Mathematical Society, vol. 40, no. 12, pp. 837–842, 1934.
• [26] H. Whitney, “Analytic extensions of differentiable functions defined in closed sets,” Transactions of the American Mathematical Society, vol. 36, no. 1, pp. 63–89, 1934.
• [27] U. von Luxburg and O. Bousquet, “Distance-based classification with Lipschitz functions,” The Journal of Machine Learning Research, vol. 5, pp. 669–695, 2004.
• [28] L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer, “Efficient classification for metric data,” Information Theory, IEEE Transactions on, vol. 60, no. 9, pp. 5750–5759, 2014.
• [29] A. Globerson and S. Roweis, “Metric learning by collapsing classes,” in Advances in Neural Information Processing Systems, vol. 18, 2005, pp. 451–458.
• [30] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov, “Neighbourhood components analysis,” in Advances in neural information processing systems, 2005, pp. 513–520.
• [31] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optimization Methods and Software, vol. 11, no. 1-4, pp. 625–653, 1999.
• [32] N. Parikh, S. P. Boyd et al., “Proximal algorithms,” Foundations and Trends in optimization, vol. 1, no. 3, pp. 127–239, 2014.
• [33] G.-B. Ye, Y. Chen, and X. Xie, “Efficient variable selection in support vector machines via the alternating direction method of multipliers,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 832–840.