# Learning Local Metrics and Influential Regions for Classification

The performance of distance-based classifiers heavily depends on the underlying distance metric, so it is valuable to learn a suitable metric from the data. To address the problem of multimodality, it is desirable to learn local metrics. In this short paper, we define a new intuitive distance with local metrics and influential regions, and subsequently propose a novel local metric learning method for distance-based classification. Our key intuition is to partition the metric space into influential regions and a background region, and then regulate the effectiveness of each local metric to be within the related influential regions. We learn local metrics and influential regions to reduce the empirical hinge loss, and regularize the parameters on the basis of a resultant learning bound. Encouraging experimental results are obtained from various public and popular data sets.


## 1 Introduction

Classification is a fundamental task in the field of machine learning. While deep learning classifiers have obtained superior performance on numerous applications, they generally require a large amount of labeled data. For small data sets, traditional classification algorithms remain valuable.

The nearest neighbor (NN) classifier is one of the oldest established methods for classification; it compares the distances between a new instance and the training instances. However, with different metrics, the performance of NN can be quite different. Hence it is very beneficial if we can find a well-suited and adaptive distance metric for specific applications. To this end, metric learning is an appealing technique: it enables algorithms to automatically learn a metric from the available data. Metric learning with a convex objective function was first proposed in the seminal work of Xing et al. [1]. After that, many other metric learning methods have been developed and widely adopted, such as Large Margin Nearest Neighbor (LMNN) [2] and Information Theoretic Metric Learning [3]. Theoretical work has also been carried out for metric learning, especially on deriving different generalization bounds [4, 5, 6, 7], and deep networks have been used to represent nonlinear metrics [8, 9]. In addition, metric learning methods have been developed for specific purposes, including multi-output tasks [10], multi-view learning [11], medical image retrieval [12], kinship verification tasks [13, 14, 15], tracking problems [16] and so on.

Most aforementioned methods use a single metric for the whole metric space and thus may not be well-suited for data sets with multimodality. To solve this problem, local metric learning algorithms have been proposed [17, 2, 18, 19, 20, 21, 22, 23, 24].

Most of these localized algorithms can be categorized into two groups. 1) Each data point $x_i$, or cluster of data points, has a local metric $M_i$. This, however, results in an asymmetric distance, as illustrated in [18]: measuring the distance between $x_i$ and $x_j$ with the metric attached to one endpoint would cause $D_{M_i}(x_i, x_j) \neq D_{M_j}(x_j, x_i)$. 2) Each line segment $\overline{x_i x_j}$, or cluster of line segments, has a local metric $M_{ij}$, i.e. $D_{M_{ij}}(x_i, x_j) = \sqrt{(x_i - x_j)^T M_{ij} (x_i - x_j)}$. The definitions of $M_{ij}$ are nonetheless not very intuitive; for example, in [20] $M_{ij}$ is defined as a symmetrized combination of point-wise metrics to guarantee the symmetry, or is weighted by the posterior probability that a point belongs to the $s$-th Gaussian cluster in a Gaussian mixture model (GMM).

In this short paper, we define a new, intuitive symmetric distance and propose a novel local metric learning method. By splitting the metric space into influential regions and a background region, we define the distance between any two points as the sum of the lengths of the line segments in each region, as illustrated in Figure 1. Building multiple influential regions addresses the multimodality issue, and learning a suitable local metric in each influential region improves class separability, as shown in Figure 2.

To establish our new distance and local metric learning method, we first define some key concepts, namely influential regions, local metrics and line segments, which lead to the definition of the new distance. Then we calculate the distance by discussing the geometric relationship between line segments and influential regions. After that, we use the proposed local metric to build a novel classifier and study its learnability. The penalty terms from the derived learning bound, together with the empirical hinge loss, form an optimization problem, which is solved via gradient descent due to the non-convexity. Finally, we evaluate the proposed local metric learning algorithm on 14 publicly available data sets. On eight of these data sets, the proposed algorithm achieves the best performance, considerably better than the state-of-the-art metric learning competitors.

## 2 Definitions of Influential Regions, Local Metrics and Distance

In this section, we will first define the influential regions $A = \{A_s\}_{s=1}^S$ and the background region $B$. With a local metric $M^{(A_s)}$ for each region $A_s$ and $M^{(B)}$ for $B$, the distance between $x_i$ and $x_j$ will be defined as the sum of the lengths of the line segments in each influential region and the background region, as illustrated in Figure 1. Since the metric is defined with respect to line segments, the distance is symmetric, i.e. $D_M(x_i, x_j) = D_M(x_j, x_i)$.

To simplify the calculation required later, we restrict the shape of each influential region to be a ball.

###### Definition 1.

Influential regions are defined to be any set of balls inside the metric space:

$$A = \{A_s,\ s = 1, \dots, S\},$$

where $S$ denotes the number of influential regions; $A_s = \mathrm{Ball}(o_s, r_s)$, in which $\mathrm{Ball}(o_s, r_s)$ denotes a ball with the center at $o_s$ and radius of $r_s$; the location of each influential region is determined by the Euclidean distance; and its points construct a set with the following form

$$\{x \mid (o_s - x)^T (o_s - x) \le r_s^2\}. \tag{1}$$
###### Definition 2.

The background region is defined to be the region excluding the influential regions:

$$B = U - \bigcup_{s=1,\dots,S} A_s,$$

where $U$ denotes the universe set.

Throughout this paper, the distance between two points $x_i$ and $x_j$ is equivalent to the length of the line segment $\overline{x_i x_j}$, i.e. $D(x_i, x_j) = l(\overline{x_i x_j})$. Lengths in the influential regions and the background region will be defined separately with the respective metrics.

###### Definition 3.

Each influential region $A_s$ has its own local metric $M^{(A_s)}$. Since influential regions are restricted to be ball-shaped and a ball is a convex set, the line segment $\overline{x_i x_j}$ lies in the ball for any two points $x_i$ and $x_j$ inside the ball; the length of such a line segment inside an influential region is defined as

$$l(\overline{x_i x_j}; M^{(A_s)}) = D_{M^{(A_s)}}(x_i, x_j) = \sqrt{(x_i - x_j)^T M^{(A_s)} (x_i - x_j)}. \tag{2}$$

To make illustrations more intuitive, the distance adopted in this paper will be based on the Mahalanobis distance (this is different from the usually adopted squared Mahalanobis distance and enjoys convenience when solving the optimization problem).

###### Definition 4.

The background region has a background metric $M^{(B)}$. For any two points $x_i$ and $x_j$, the length of a line segment is defined as

$$l(\overline{x_i x_j}; M^{(B)}) = D_{M^{(B)}}(x_i, x_j) = \sqrt{(x_i - x_j)^T M^{(B)} (x_i - x_j)}.$$

We make two remarks here:

1. While the metrics $M^{(A_s)}$ and $M^{(B)}$ will be learned inside the influential regions and the background region, the Euclidean distance is used to determine the location of the influential regions.

2. For $x_i \in B$ and $x_j \in B$, the distance between $x_i$ and $x_j$ is generally different from $D_{M^{(B)}}(x_i, x_j)$, because some parts of the line segment $\overline{x_i x_j}$ may lie in influential regions, so their lengths should be calculated via the related local metrics.

To calculate the distance between any $x_i$ and $x_j$, we need to consider the relationship between the line segment $\overline{x_i x_j}$ and the influential regions, which can be simplified as one of the following three cases: no-intersection, tangent and with-intersection.

###### Definition 5.

The intersection of a line segment $\overline{x_i x_j}$ and an influential region $A_s$ is denoted as $A_s \cap \overline{x_i x_j}$. In the case of no-intersection, $A_s \cap \overline{x_i x_j} = \emptyset$; in the case of tangent, $A_s \cap \overline{x_i x_j} = \{p\}$, where $p$ is the tangent point; in the case of with-intersection, $A_s \cap \overline{x_i x_j} = \overline{pq}$, where $\overline{pq}$ is the maximum sub-line segment of $\overline{x_i x_j}$ inside $A_s$, $p$ is the point which lies closer to $x_i$ and $q$ is the point which lies closer to $x_j$. On the other hand, the intersection of a line segment and the background region $B$ is defined as

$$B \cap \overline{x_i x_j} = \overline{x_i x_j} - \bigcup_{s=1\ldots S} (A_s \cap \overline{x_i x_j}), \tag{3}$$

where $\bigcup_{s=1\ldots S} (A_s \cap \overline{x_i x_j})$ is the union of intersections between the line segment and all influential regions. It can also be understood as a set of non-overlapping line segments (this is easily proved by recursively combining any overlapping line segments until no overlapping one is found).

Accordingly, the length of the line segment $\overline{x_i x_j}$ can be calculated through the lengths of these intersections.

###### Definition 6.

The length of the intersection of a line segment $\overline{x_i x_j}$ and an influential region $A_s$ is denoted as $l(A_s \cap \overline{x_i x_j}; M^{(A_s)})$. In the case of tangent or no-intersection, $l(A_s \cap \overline{x_i x_j}; M^{(A_s)}) = 0$; in the case of with-intersection, it is defined to be the length of $\overline{pq}$, i.e. $l(\overline{pq}; M^{(A_s)})$. On the other hand, the length of the intersection of a line segment and the background region is defined as

$$l(B \cap \overline{x_i x_j}; M^{(B)}) = l(\overline{x_i x_j}; M^{(B)}) - l\Big(\bigcup_{s=1\ldots S} (A_s \cap \overline{x_i x_j});\ M^{(B)}\Big). \tag{4}$$
###### Definition 7.

The length of the line segment $\overline{x_i x_j}$ is defined as

$$l(\overline{x_i x_j}; M^{(\overline{x_i x_j})}) = \sqrt{(x_i - x_j)^T M^{(\overline{x_i x_j})} (x_i - x_j)} = l(B \cap \overline{x_i x_j}; M^{(B)}) + \sum_s l(A_s \cap \overline{x_i x_j}; M^{(A_s)}), \tag{5}$$

where $M^{(\overline{x_i x_j})}$ is the metric of the line segment $\overline{x_i x_j}$. $l(\overline{x_i x_j}; M^{(\overline{x_i x_j})})$ will be simplified as $D_M(x_i, x_j)$ afterwards.

## 3 Calculation of Distances

### 3.1 Calculation of the length of intersection with influential regions

We first provide an intuitive explanation of calculating the length of the intersection with influential regions, as illustrated in Figure 3. If the line does not intersect with, or is tangent to, the influential ball, the length is zero; this case is identified by finding the intersection points of the line $x_i x_j$ and the ball, $u$ and $v$, via a quadratic equation with one variable. If the line intersects with the ball, we calculate the length by considering the relationship between the intersection of the line and the influential ball, i.e. $\overline{uv}$, and the intersection of the line segment and the influential ball, i.e. $\overline{pq}$. $\overline{pq}$ can be obtained from the points $u$ and $v$ together with the constraint that the start and end points should lie on the line segment $\overline{x_i x_j}$.

###### Definition 8.

The intersection points of the line $x_i x_j$ and the influential region $A_s$ are represented as $u = x_i + \lambda_u (x_j - x_i)$ and $v = x_i + \lambda_v (x_j - x_i)$, where $\lambda_u \le \lambda_v$; $\lambda_u$ and $\lambda_v$ are called the intersection coefficients between the line and $A_s$. The intersection points of the line segment $\overline{x_i x_j}$ and the influential region are represented as $p = x_i + \lambda_p (x_j - x_i)$ and $q = x_i + \lambda_q (x_j - x_i)$, where $\lambda_p$ and $\lambda_q$ are called the intersection coefficients between the line segment and $A_s$. $\gamma = \lambda_q - \lambda_p$ is called the intersection ratio.

###### Proposition 1.

The length of the intersection between a line segment $\overline{x_i x_j}$ and the influential region $A_s$, with the intersection points $p$, $q$ and the intersection coefficients $\lambda_p$, $\lambda_q$, is

$$l(A_s \cap \overline{x_i x_j}; M^{(A_s)}) = \sqrt{(q - p)^T M^{(A_s)} (q - p)} = \gamma \sqrt{(x_i - x_j)^T M^{(A_s)} (x_i - x_j)}, \tag{6}$$

where $\gamma = \lambda_q - \lambda_p$ is the intersection ratio.

As shown in the above proposition, the length of the intersection can be calculated given the local metric $M^{(A_s)}$ and $\gamma$, where the latter can be obtained from $\lambda_p$ and $\lambda_q$.

Now we discuss the computation of $\gamma$, which can be divided into two steps.

1) Calculate the intersection points of the line and the ball, $u$ and $v$, i.e. $\lambda_u$ and $\lambda_v$.

The coefficients $\lambda_u$ and $\lambda_v$ can easily be solved through the following quadratic equation with one variable:

$$\|x_i + \lambda (x_j - x_i) - o_s\|_2^2 = r_s^2, \tag{7}$$

with discriminant $\Delta = b^2 - 4ac$, where $a = (x_j - x_i)^T (x_j - x_i)$, $b = 2 (x_j - x_i)^T (x_i - o_s)$ and $c = (x_i - o_s)^T (x_i - o_s) - r_s^2$; when $\Delta > 0$, the solutions to the above equation are

$$\lambda^s_{u,ij} = \frac{-b - \sqrt{\Delta}}{2a} = \frac{-2 (x_j - x_i)^T (x_i - o_s) - \sqrt{\Delta}}{2 (x_j - x_i)^T (x_j - x_i)}, \qquad \lambda^s_{v,ij} = \frac{-b + \sqrt{\Delta}}{2a} = \frac{-2 (x_j - x_i)^T (x_i - o_s) + \sqrt{\Delta}}{2 (x_j - x_i)^T (x_j - x_i)}.$$

Hence the two intersection points between the ball and the line become

$$u^s_{ij} = x_i + \lambda^s_{u,ij} (x_j - x_i), \qquad v^s_{ij} = x_i + \lambda^s_{v,ij} (x_j - x_i).$$

For simplicity, the superscript $s$ and subscript $ij$ of $\lambda_u$, $\lambda_v$, $u$ and $v$ will be discarded if no confusion is caused.

2) Calculate the intersection points of the line segment and the ball, $p$ and $q$, i.e. $\lambda_p$ and $\lambda_q$.

We check the number of solutions to (7). If (7) has 0 or 1 solution, the line has no intersection with, or is tangent to, the region, and thus $\gamma = 0$. If it has two solutions, the intersection between the line and the ball is a line segment $\overline{uv}$. Based on the values of $\lambda_u$ and $\lambda_v$ (if and only if $\lambda_u$ or $\lambda_v$ lies in the range $[0, 1]$ does the corresponding point lie inside the line segment $\overline{x_i x_j}$), we can obtain the relationship between $\overline{uv}$ and $\overline{x_i x_j}$ and get the values of $\lambda_p$ and $\lambda_q$ from

$$\lambda_p = \min(\max(\lambda_u, 0), 1), \qquad \lambda_q = \min(\max(\lambda_v, 0), 1).$$
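The two-step computation above can be sketched as follows (a minimal NumPy sketch; the function name `intersection_ratio` and the tolerance handling are our illustrative choices, not from the paper):

```python
import numpy as np

def intersection_ratio(xi, xj, center, radius):
    """Intersection ratio gamma of the segment xi -> xj with Ball(center, radius).

    Step 1: solve ||xi + lam*(xj - xi) - center||^2 = radius^2 (Eq. 7) for lam.
    Step 2: clamp the two roots to [0, 1] so the points stay on the segment.
    """
    d = xj - xi
    a = d @ d
    b = 2.0 * d @ (xi - center)
    c = (xi - center) @ (xi - center) - radius ** 2
    delta = b * b - 4.0 * a * c
    if delta <= 0.0:  # no-intersection or tangent: zero-length intersection
        return 0.0
    lam_u = (-b - np.sqrt(delta)) / (2.0 * a)
    lam_v = (-b + np.sqrt(delta)) / (2.0 * a)
    lam_p = min(max(lam_u, 0.0), 1.0)
    lam_q = min(max(lam_v, 0.0), 1.0)
    return lam_q - lam_p
```

For instance, the segment from $(-2, 0)$ to $(2, 0)$ crosses the unit ball centered at the origin between $\lambda_u = 0.25$ and $\lambda_v = 0.75$, giving $\gamma = 0.5$.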

A summary of the notation used in this section is listed in Table I; the details of the distance calculation are illustrated in Figure 3 and Table II.

### 3.2 Calculation of the length of intersection with local metrics

###### Proposition 2.

In the case of non-overlapping influential regions, i.e. $A_s \cap A_{s'} = \emptyset$ for $s \neq s'$,

$$D_M(x_i, x_j) \triangleq l(\overline{x_i x_j}; M^{(\overline{x_i x_j})}) = \gamma_b \sqrt{(x_i - x_j)^T M^{(B)} (x_i - x_j)} + \sum_s \gamma_s \sqrt{(x_i - x_j)^T M^{(A_s)} (x_i - x_j)} = \Big(1 - \sum_s \gamma_s\Big) \sqrt{(x_i - x_j)^T M^{(B)} (x_i - x_j)} + \sum_s \gamma_s \sqrt{(x_i - x_j)^T M^{(A_s)} (x_i - x_j)}, \tag{8}$$

where $\gamma_b$ is defined as the intersection ratio of the background region; in the non-overlapping case, $\gamma_b = 1 - \sum_s \gamma_s$.

Proposition 2 suggests that the distance can be obtained once we have the metrics $M^{(A_s)}$, $M^{(B)}$ and the intersection ratios $\gamma_s$. As all calculations are in closed form, the computation is efficient.

In the case of overlapping influential regions, we have the same formula as (8):

$$D_M(x_i, x_j) \triangleq l(\overline{x_i x_j}; M^{(\overline{x_i x_j})}) = \gamma_b \sqrt{(x_i - x_j)^T M^{(B)} (x_i - x_j)} + \sum_s \gamma_s \sqrt{(x_i - x_j)^T M^{(A_s)} (x_i - x_j)}. \tag{9}$$

The calculation of $\gamma_b$ in (9) is slightly different from that in (8). In the following sections, we use an approximation of $\gamma_b$ for simplicity.
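Putting the pieces together, the distance of (9) can be sketched as below (an illustrative NumPy sketch under the non-overlapping assumption $\gamma_b = 1 - \sum_s \gamma_s$ of Proposition 2; all function names are ours):

```python
import numpy as np

def intersection_ratio(xi, xj, center, radius):
    """Intersection ratio gamma of the segment xi -> xj with Ball(center, radius)."""
    d = xj - xi
    a = d @ d
    b = 2.0 * d @ (xi - center)
    c = (xi - center) @ (xi - center) - radius ** 2
    delta = b * b - 4.0 * a * c
    if delta <= 0.0:  # no-intersection or tangent
        return 0.0
    lam_u = (-b - np.sqrt(delta)) / (2.0 * a)
    lam_v = (-b + np.sqrt(delta)) / (2.0 * a)
    return min(max(lam_v, 0.0), 1.0) - min(max(lam_u, 0.0), 1.0)

def local_metric_distance(xi, xj, regions, M_B):
    """D_M(xi, xj) of Eq. (9): gamma-weighted sum of Mahalanobis lengths.

    `regions` is a list of (center, radius, M_A) triples with PSD metrics M_A;
    M_B is the background metric.
    """
    diff = xi - xj
    mah = lambda M: np.sqrt(diff @ M @ diff)
    gammas = [intersection_ratio(xi, xj, o, r) for (o, r, _) in regions]
    gamma_b = 1.0 - sum(gammas)  # non-overlapping case (Proposition 2)
    return gamma_b * mah(M_B) + sum(
        g * mah(M_A) for g, (_, _, M_A) in zip(gammas, regions))
```

With identity metrics everywhere the distance reduces to the Euclidean one, while inflating the metric of a crossed region lengthens exactly the portion of the segment inside that region.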

## 4 Classifier and Learnability

Lipschitz continuous functions are a family of smooth functions which are learnable [25]. In this paper, we select Lipschitz continuous functions as the classifiers. Based on the resultant learning bounds, we obtain the terms to regularize in order to improve the generalization ability.

### 4.1 Classifier

In the Euclidean space, it is intuitive to see that the following classifier gives the same classification results as 1-NN:

$$f(x) = \min D_{set}(x, X^-) - \min D_{set}(x, X^+),$$

where $f(x) < 0$ indicates that $x$ belongs to the negative class and $f(x) > 0$ indicates that $x$ belongs to the positive class; $D_{set}(x, X^-)$ (resp. $D_{set}(x, X^+)$) is the set that contains the Euclidean distance values between $x$ and every instance of the negative (resp. positive) class.

K-NN considers more nearby instances and hence is more robust than 1-NN. A similar extension of the above equation to consider more nearby instances is as follows:

$$f(x) = \mathrm{sumKmin}\, D_{set}(x, X^-) - \mathrm{sumKmin}\, D_{set}(x, X^+), \tag{10}$$

where $\mathrm{sumKmin}$ denotes the sum of the $K$ minimal elements of the set. This function will be used as the classifier in our algorithm.
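The sumKmin classifier can be sketched as follows (an illustrative sketch using a plug-in Euclidean distance; in the algorithm the learned distance $D_M$ of (9) would be substituted, and all names here are ours):

```python
import numpy as np

def sum_k_min(values, K):
    """Sum of the K minimal elements of a set of distance values."""
    return float(np.sum(np.sort(np.asarray(values))[:K]))

def classify(x, X_neg, X_pos, dist, K=3):
    """Eq. (10): f(x) < 0 predicts the negative class, f(x) > 0 the positive."""
    d_neg = [dist(x, xn) for xn in X_neg]
    d_pos = [dist(x, xp) for xp in X_pos]
    return sum_k_min(d_neg, K) - sum_k_min(d_pos, K)

euclidean = lambda a, b: float(np.linalg.norm(a - b))
```

A point near the negative cluster yields small distances to $X^-$, so the first term is small and $f(x) < 0$, matching the K-NN intuition.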

### 4.2 Learnability of the Classifier with Local Metrics

We will discuss the learnability of functions based on the Lipschitz constant, which characterizes the smoothness of a function: the smaller the Lipschitz constant, the smoother the function.

###### Definition 9.

([26]) The Lipschitz constant of a function $f: (X, \rho_X) \to (Y, \rho_Y)$ is

$$\mathrm{Lip}(f) = \inf\{C \in \mathbb{R} \mid \forall x_i, x_j \in X,\ \rho_Y(f(x_i), f(x_j)) \le C \rho_X(x_i, x_j)\} = \sup_{x_i, x_j \in X:\, x_i \neq x_j} \frac{\rho_Y(f(x_i), f(x_j))}{\rho_X(x_i, x_j)}.$$
###### Proposition 3.

([26]) Let $f$ and $g$ be Lipschitz functions; then
(a) $\mathrm{Lip}(f + g) \le \mathrm{Lip}(f) + \mathrm{Lip}(g)$;
(b) $\mathrm{Lip}(f - g) \le \mathrm{Lip}(f) + \mathrm{Lip}(g)$;
(c) $\mathrm{Lip}(cf) = |c|\,\mathrm{Lip}(f)$, where $c$ is a constant.

###### Proposition 4.

Let $L_k$ be the Lipschitz constant of $f_k$, $k = 1, \dots, K$; then the Lipschitz constant of $\mathrm{sumKmin}\{f_k\}$ is bounded by $K \max_k L_k$.

###### Proof.

$$\mathrm{sumKmin}\{f_k(x_i)\} = \mathrm{sumKmin}\{f_k(x_j + (x_i - x_j))\} \le \mathrm{sumKmin}\{f_k(x_j) + L_k \|x_i - x_j\|\} \le \mathrm{sumKmin}\{f_k(x_j) + (\max_k L_k) \|x_i - x_j\|\} = \mathrm{sumKmin}\{f_k(x_j)\} + K (\max_k L_k) \|x_i - x_j\|.$$

Therefore,

$$\mathrm{sumKmin}\{f_k(x_i)\} - \mathrm{sumKmin}\{f_k(x_j)\} \le K (\max_k L_k) \|x_i - x_j\|.$$

Based on the definition of the Lipschitz constant, the proposition is proved. ∎
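The bound of Proposition 4 can also be checked empirically (an illustrative sketch with 1-Lipschitz test functions $f_k(x) = |x - c_k|$, so that $\max_k L_k = 1$; all names are ours):

```python
import numpy as np

def sum_k_min(values, K):
    """Sum of the K minimal elements."""
    return float(np.sum(np.sort(np.asarray(values))[:K]))

def check_sumkmin_lipschitz(trials=2000, K=3, seed=0):
    """Check |sumKmin{f_k(x)} - sumKmin{f_k(y)}| <= K * max_k L_k * |x - y|
    for f_k(x) = |x - c_k| (each 1-Lipschitz) at random point pairs."""
    rng = np.random.default_rng(seed)
    centers = rng.standard_normal(5)  # five candidate functions f_k
    for _ in range(trials):
        x, y = rng.standard_normal(2)
        fx = sum_k_min(np.abs(x - centers), K)
        fy = sum_k_min(np.abs(y - centers), K)
        if abs(fx - fy) > K * abs(x - y) + 1e-12:
            return False
    return True
```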

###### Lemma 1.

With the distance defined by (9), the Lipschitz constant of the classifier given by (10) is bounded by $2K\{\sum_s \|M^{(A_s)}\|_F + \|M^{(B)}\|_F\}$, where $\|\cdot\|_F$ denotes the matrix Frobenius norm.

###### Proof.

Let $d_M(x, x_k)$ denote the Mahalanobis distance with metric $M$:

$$d_M(x, x_k) = \sqrt{(x - x_k)^T M (x - x_k)}.$$

With the identity matrix $M = I$, $d_I(x, x_k)$ is the Euclidean distance.

The Lipschitz constant of $f_1(x) = d_M(x, x_k)$ is bounded by $\|M\|_F$ as follows:

$$\mathrm{Lip}(f_1) = \sup_{x \neq x_k} \frac{f_1(x) - f_1(x_k)}{d_I(x, x_k)} = \frac{d_M(x, x_k) - d_M(x_k, x_k)}{d_I(x, x_k)} \le \frac{d_M(x, x_k)}{d_I(x, x_k)} \le \frac{d_I(x, x_k) \|M\|_F}{d_I(x, x_k)} = \|M\|_F,$$

where the first inequality follows from the triangle inequality of the distance, and the second inequality is based on the fact that the matrix Frobenius norm is consistent with the vector $\ell_2$ norm (the consistency between a matrix norm $\|\cdot\|_m$ and a vector norm $\|\cdot\|_v$ indicates $\|Mx\|_v \le \|M\|_m \|x\|_v$, where $M$ is a matrix and $x$ is a vector), i.e.

$$|(x - x_k)^T M (x - x_k)| \le \|x - x_k\|_2^2 \|M\|_F.$$
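The consistency inequality above can be verified numerically for random positive semi-definite metrics (an illustrative sketch; the function name is ours):

```python
import numpy as np

def check_frobenius_consistency(trials=1000, seed=0):
    """Check |(x - xk)^T M (x - xk)| <= ||x - xk||_2^2 * ||M||_F for random
    PSD metrics M = A A^T and random difference vectors."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        dim = int(rng.integers(2, 6))
        A = rng.standard_normal((dim, dim))
        M = A @ A.T  # random PSD metric
        diff = rng.standard_normal(dim)
        lhs = abs(diff @ M @ diff)
        rhs = (diff @ diff) * np.linalg.norm(M, "fro")
        if lhs > rhs + 1e-9:
            return False
    return True
```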

According to the definition of the distance in (9), and since each intersection ratio is at most 1, we have

$$D_M(x, x_k) \le \sum_s D_{M^{(A_s)}}(x, x_k) + D_{M^{(B)}}(x, x_k);$$

and it follows from Proposition 3 that

$$\mathrm{Lip}(D_M(x, x_k)) \le \sum_s \|M^{(A_s)}\|_F + \|M^{(B)}\|_F.$$

Based on the Lipschitz constant of $D_M(x, x_k)$ and the composition property given by Proposition 4,

$$\mathrm{Lip}(\mathrm{sumKmin}\{D_M(x, x_k), k = 1, \dots, K\}) \le K\Big\{\sum_s \|M^{(A_s)}\|_F + \|M^{(B)}\|_F\Big\}.$$

Finally, based on Proposition 3, the Lipschitz constant of $f$ in (10) is bounded by $2K\{\sum_s \|M^{(A_s)}\|_F + \|M^{(B)}\|_F\}$. ∎

Combining the results of Lemma 1 and Corollary 6 of [25], we obtain the following corollary.

###### Corollary 1.

Let the metric space $X$ have diameter $\mathrm{diam}(X)$ and doubling dimension $\mathrm{ddim}(X)$, and let $\mathcal{F}$ be the collection of real-valued functions over $X$ with Lipschitz constant at most $L$. Then for any $f \in \mathcal{F}$ applied to a sample of size $n$, if $f$ is correct on all but $k$ examples, we have with probability at least $1 - \delta$

$$P\{(x, t): \mathrm{sign}[f(x)] \neq t\} \le \frac{k}{n} + \sqrt{\frac{2}{n}\Big(c \log_2(34 e n / c) \log_2(578 n) + \log_2(4 / \delta)\Big)}, \tag{11}$$

where $c$ depends on $L$, the diameter of the space and the doubling dimension (the detailed definitions can be found in [25]).

The above learning bound characterizes the generalization ability, i.e. the difference between the expected error $P\{(x, t): \mathrm{sign}[f(x)] \neq t\}$ and the empirical error $k/n$. Based on the bound, reducing the value of the Lipschitz constant helps reduce the gap between the empirical error and the expected error. In other words, the learning bound indicates that regularizing $\sum_s \|M^{(A_s)}\|_F + \|M^{(B)}\|_F$ would help improve the generalization ability of the classifier.

## 5 Optimization Problem

### 5.1 Objective Function

Based on the discussion in previous sections, with hinge loss and the regularization terms of