
# Ground Metric Learning

Transportation distances have been used for more than a decade now in machine learning to compare histograms of features. They have one parameter: the ground metric, which can be any metric between the features themselves. As is the case for all parameterized distances, transportation distances can only prove useful in practice when this parameter is carefully chosen. To date, the only option available to practitioners to set the ground metric parameter was to rely on a priori knowledge of the features, which limited considerably the scope of application of transportation distances. We propose to lift this limitation and consider instead algorithms that can learn the ground metric using only a training set of labeled histograms. We call this approach ground metric learning. We formulate the problem of learning the ground metric as the minimization of the difference of two polyhedral convex functions over a convex set of distance matrices. We follow the presentation of our algorithms with promising experimental results on binary classification tasks using GIST descriptors of images taken in the Caltech-256 set.


## 1 Introduction

We consider in this paper the problem of supervised metric learning on normalized histograms. Normalized histograms arise frequently in natural language processing, computer vision, bioinformatics and more generally areas involving complex datatypes. Objects of interest in such areas are usually simplified and each represented as a bag of smaller features. The occurrence frequencies of each of these features in the considered object can then be represented as a histogram. For instance, the representation of images as histograms of pixel colors, SIFT or GIST features (Lowe 1999, Oliva and Torralba 2001, Douze et al. 2009); texts as bags-of-words or topic allocations (Joachims 2002, Blei et al. 2003, Blei and Lafferty 2009); sequences as $n$-gram counts (Leslie et al. 2002) and graphs as histograms of subgraphs (Kashima et al. 2003) all follow this principle.

Various distances have been proposed in the statistics and machine learning literatures to compare two histograms (Deza and Deza 2009, §14; Rachev 1991). Our focus in this paper is on the family of transportation distances, which is both well motivated theoretically (Villani 2003, §7; Rachev 1991, §5) and works well empirically (Rubner et al. 1997; 2000, Pele and Werman 2009). Transportation distances are particularly popular in computer vision, where, after the influential work of Rubner et al. (1997), they were called Earth Mover’s Distances (EMD).

Transportation distances in machine learning can be thought of as meta-distances that build upon a metric on the features to form a distance on histograms of features. Such a metric, which is known in the computer vision literature as the ground metric, is the unique parameter of transportation distances. (Since the terms metric and distance are interchangeable mathematically speaking, we will always use the term metric for a metric between features and the term distance for the resulting transportation distance between histograms, or more generally any other distance on histograms.) In their seminal paper, Rubner et al. (2000) argue that, “in general, the ground distance can be any distance and will be chosen according to the problem at hand”. To our knowledge, the ground metric has always been set a priori in all applications of EMD in machine learning. To be more precise, EMD has only been applied to datasets where such a metric was available and motivated by prior knowledge. We argue that this is problematic in two senses. First, this restriction limits the application of transportation distances to problems where such knowledge exists. Second, even when such a priori knowledge is available, we argue that there cannot be a “universal” ground metric that will be suitable for all learning problems involving histograms on such features. As with all parameters in machine learning algorithms, the ground metric should be selected adaptively. Our goal in this paper is to propose ground metric learning algorithms to do so.

This paper is organized as follows. After providing some background on transportation distances in Section 2, we propose in Section 3 a criterion, a difference of convex functions, to select a ground metric given a training set of histograms and a similarity measure between these histograms. We then show how to obtain local minima for that criterion using a subgradient descent algorithm in Section 4. We propose different starting points to initialize this descent in Section 5. We provide a review of other relevant distances and metric learning techniques in Section 6, in particular Mahalanobis metric learning techniques (Xing et al. 2003, Weinberger et al. 2006, Weinberger and Saul 2009, Davis et al. 2007) which have inspired much of this work. We provide empirical evidence in Section 7 that the distances proposed in this paper compare favorably to competing techniques. We conclude this paper in Section 8 by providing a few research avenues.

### Notations

We use upper case letters for $d\times d$ matrices. Bold upper case letters are used for larger matrices; lower case letters are used for scalar numbers or vectors of $\mathbb{R}^d$. An upper case letter $M$ and its bold lower case $\mathbf{m}$ stand for the same matrix written in $d\times d$ matrix form or $d^2$ vector form by stacking successively all its column vectors from the left-most on the top to the right-most at the bottom. The strict upper and lower triangular parts of $M$ are likewise expressed as vectors of size $\binom{d}{2}$. The order in which these elements are enumerated must be coherent, in the sense that the strict upper triangular part of $M^\top$ expressed as a vector must be equal to the strict lower triangular part of $M$ expressed as a vector. Finally we use the Frobenius dot-product for both matrix and vector representations, written as $\langle A, B\rangle = \operatorname{tr}(A^\top B) = \mathbf{a}^\top\mathbf{b}$.

## 2 Optimal Transportation Between Histograms

We recall in this section a few basic facts about mass transportation for two histograms. A more general and technical introduction is provided by Villani (2003, Introduction & §7); practical insights and motivation for its application in machine learning can be found in Rubner et al. (2000); a recent review of different extensions and particular cases of EMD can be found in (Pele and Werman 2009, §2).

### 2.1 Transportation Polytopes

For two histograms $r$ and $c$ of sum $1$ and dimension $d$, represented in the following as column vectors $r$ and $c$ of the canonical simplex $\Sigma_{d-1}$ of dimension $d-1$, the polytope $U(r,c)$ of transportation plans that map $r$ to $c$ is the set of $d\times d$ matrices with coefficients in $\mathbb{R}_+$ such that their row and column marginals are equal to $r$ and $c$ respectively, that is, writing $\mathbf{1}_d$ for the column vector of ones,

$$U(r,c) = \{F\in\mathbb{R}_+^{d\times d} \mid F\mathbf{1}_d = r,\ F^\top\mathbf{1}_d = c\}.$$

$U(r,c)$ is a polytope of dimension $d^2-2d+1$ in the general case where $r$ and $c$ have positive coordinates. $U(r,c)$ is also known in the operations research and statistical literatures as the set of transportation plans (Rachev and Rüschendorf 1998) and contingency tables or two-way tables with fixed margins (Diaconis and Efron 1985). Given two histograms $r$ and $c$, we define the following function of a real $d\times d$ matrix $A$:

$$G_{rc}(A) \overset{\text{def}}{=} \min_{X\in U(r,c)} \langle A, X\rangle. \tag{1}$$

Equation (1) describes a linear program whose feasible set is defined by $r$ and $c$ and whose cost is parameterized by $A$. $G_{rc}$ is a positive homogeneous function, that is $G_{rc}(tA) = t\,G_{rc}(A)$ for $t\ge 0$. When $A$ is a matrix $M$ taken in the pointed, convex and polyhedral cone of metric matrices,

$$\mathcal{M} \overset{\text{def}}{=} \{M\in\mathbb{R}^{d\times d} : \forall\, 1\le i,j,k\le d,\ M_{ii}=0,\ M_{ij}=M_{ji},\ M_{ij}\le M_{ik}+M_{kj}\} \subset \mathbb{R}_+^{d\times d},$$

the quantity $G_{rc}(M)$ is known as the Kantorovich-Rubinstein distance (Villani 2003, §7) between $r$ and $c$. To highlight the fact that $G_{rc}(M)$ can also be seen as the evaluation of a function of $r$ and $c$ parameterized by $M$, we will use the notation $d_M(r,c) = G_{rc}(M)$.
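The three conditions that define the cone of metric matrices can be checked directly on a small matrix. The following is a minimal sketch in plain Python; the function name `is_metric_matrix` and the tolerance `tol` are illustrative choices and are not part of the paper.

```python
from itertools import product

def is_metric_matrix(M, tol=1e-12):
    """Check whether a square matrix (list of rows) lies in the metric cone."""
    d = len(M)
    for i in range(d):
        if abs(M[i][i]) > tol:                  # M_ii = 0
            return False
    for i, j in product(range(d), repeat=2):
        if abs(M[i][j] - M[j][i]) > tol:        # M_ij = M_ji
            return False
    for i, j, k in product(range(d), repeat=3):
        if M[i][j] > M[i][k] + M[k][j] + tol:   # M_ij <= M_ik + M_kj
            return False
    return True

# Hypothetical examples: the uniform metric on 3 features, and a matrix
# whose entry 5 exceeds the detour 1 + 1 through the middle feature.
uniform = [[0.0 if i == j else 1.0 for j in range(3)] for i in range(3)]
broken = [[0.0, 1.0, 5.0], [1.0, 0.0, 1.0], [5.0, 1.0, 0.0]]
scaled = [[2.5 * x for x in row] for row in uniform]   # a cone is closed under t >= 0
```

Nonnegativity does not need a separate test: taking $j=i$ in the triangle inequality gives $0 = M_{ii} \le 2M_{ik}$.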

### 2.2 Transportation Distances

The function $d_M$ parameterized by $M$ has the following properties: since $M$ has a null diagonal, $d_M(r,r)$ is always zero; by nonnegativity of $M$, $d_M(r,c)\ge 0$; by symmetry of $M$, $d_M$ is itself a symmetric function in its two arguments. More generally, $d_M$ is a distance between histograms whenever $M$ is itself a metric, namely whenever $M\in\mathcal{M}$ (Villani 2003, Theo. 7.3). The distance $d_M$ bears many names and has many variations: 1-Wasserstein, Monge-Kantorovich, Mallows’ (Mallows 1972, Levina and Bickel 2001), Earth Mover’s (Rubner et al. 2000) in vision applications. Rubner et al. (2000) and more recently Pele and Werman (2009) have also proposed to extend the transportation distance to compare un-normalized histograms. Simply put, these extensions compute a distance between two unnormalized histograms by combining any difference in their total mass with the optimal transportation plan that carries the whole mass of the lighter histogram onto the heavier one. We will not consider such extensions in this work; we believe however that the approaches proposed later in this paper can be extended to handle EMD for unnormalized histograms.

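$d_M(r,c)$ can be computed as the solution of the following Linear Program (LP),

$$d_M(r,c) = \begin{array}[t]{ll} \text{minimize} & \mathbf{m}^\top\mathbf{x}\\ \text{subject to} & \mathbf{A}\mathbf{x} = \begin{bmatrix} r\\ c\end{bmatrix}_{*}\\ & \mathbf{x}\ge 0 \end{array} \tag{P0}$$

where $\mathbf{A}$ is the $(2d-1)\times d^2$ matrix that encodes the row-sum and column-sum constraints for $X$ to be in $U(r,c)$ as

$$\mathbf{A} = \begin{bmatrix} \mathbf{1}_{1\times d}\otimes I_d\\ I_d\otimes\mathbf{1}_{1\times d}\end{bmatrix}_{*},$$

$\otimes$ is Kronecker’s product and the lower subscript $*$ in a matrix (resp. a vector) means that its last line (resp. element) has been removed. This modification is carried out to make sure that all constraints described by $\mathbf{A}$ are independent, or equivalently that $\mathbf{A}$ is not rank deficient. This LP can be solved using the network simplex (Ford and Fulkerson 1962) or through more specialized network flow algorithms (Ahuja et al. 1993, §9).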
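To make the constraint matrix of Problem (P0) concrete, the sketch below builds it row by row for a small $d$ and verifies the constraints on one feasible table. It is plain Python written for illustration only; the helper names `constraint_matrix` and `vec`, and the choice of the product table $rc^\top$ as a feasible point, are assumptions of this example.

```python
def constraint_matrix(d):
    """Rows of [1_{1xd} (x) I_d ; I_d (x) 1_{1xd}] with the last row removed.

    The unknown x is the column-stacked table X, so X[p][q] sits at index q*d + p.
    """
    rows = []
    for p in range(d):   # row-sum constraint: sum_q X[p][q] = r[p]
        rows.append([1.0 if idx % d == p else 0.0 for idx in range(d * d)])
    for q in range(d):   # column-sum constraint: sum_p X[p][q] = c[q]
        rows.append([1.0 if idx // d == q else 0.0 for idx in range(d * d)])
    return rows[:-1]     # the last constraint is implied by the others

def vec(X):
    """Stack the columns of X, left-most column first."""
    d = len(X)
    return [X[p][q] for q in range(d) for p in range(d)]

r = [0.5, 0.3, 0.2]
c = [0.1, 0.6, 0.3]
X = [[rp * cq for cq in c] for rp in r]   # a feasible table with marginals r and c
A = constraint_matrix(len(r))
Ax = [sum(a * x for a, x in zip(row, vec(X))) for row in A]
rhs = (r + c)[:-1]                        # [r; c] with its last element removed
```

Dropping the last row works because both marginals sum to one, so the $2d$ constraints are linearly dependent.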

### 2.3 Properties of $G_{rc}$

Because its feasible set is a bounded polytope and its objective is linear, Problem (P0) has an optimal solution in the finite set of extreme points of $U(r,c)$. $G_{rc}$ is thus the minimum of a finite set of linear functions and is by extension piecewise linear and concave (Boyd and Vandenberghe 2004, §3.2.3). Its gradient is equal to $X^\star$ whenever an optimal solution $X^\star$ to Problem (P0) is unique (Bertsimas and Tsitsiklis 1997, §5.4). More generally and regardless of the uniqueness of $X^\star$, any optimal solution $X^\star$ of Problem (P0) is in the sub-differential $\partial G_{rc}(M)$ of $G_{rc}$ at $M$ (Bertsimas and Tsitsiklis 1997, Lem.11.4). We use this property later in Section 4 when we optimize the criteria considered in the section below.
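The "minimum of finitely many linear functions" view can be illustrated in the smallest non-trivial case, $d=2$, where the transportation polytope is a segment with two extreme points. The sketch below is a toy illustration only (it is not the network simplex used in the paper); the parameterization by the top-left entry `t` and all function names are assumptions of this example.

```python
def extreme_tables(r, c):
    """The two extreme points of U(r, c) for d = 2, indexed by the entry X[0][0] = t."""
    r1, c1 = r[0], c[0]
    lo, hi = max(0.0, r1 + c1 - 1.0), min(r1, c1)   # range of t keeping X nonnegative
    def table(t):
        return [[t, r1 - t], [c1 - t, 1.0 - r1 - c1 + t]]
    return [table(lo), table(hi)]

def dot(A, X):
    """Frobenius dot-product of two 2x2 matrices."""
    return sum(A[p][q] * X[p][q] for p in range(2) for q in range(2))

def G(r, c, A):
    """G_rc(A): minimum of the linear functions <A, X> over the extreme points."""
    return min(dot(A, X) for X in extreme_tables(r, c))

r, c = [0.7, 0.3], [0.4, 0.6]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
mid = [[0.5 * (A[p][q] + B[p][q]) for q in range(2)] for p in range(2)]
tripled = [[3.0 * a for a in row] for row in A]
```

Here `G(r, c, A)` is attained at one extreme point and `G(r, c, B)` at the other, which is why the value at their midpoint lies strictly above the average of the two values, as concavity requires.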

## 3 Criteria for Ground Metric Learning

We define in this section a family of criteria to quantify the relevance of a ground metric for a given task, using a training dataset of histograms with additional information.

### 3.1 Training Set: Histograms and Side Information

Suppose now that we are given a family $(r_i)_{i=1,\dots,n}$ of $n$ histograms in the canonical simplex $\Sigma_{d-1}$ with a corresponding similarity matrix $[\omega_{ij}]\in\mathbb{R}^{n\times n}$ which quantifies how similar $r_i$ and $r_j$ are: $\omega_{ij}$ is large and positive whenever $r_i$ and $r_j$ describe similar objects and small and negative for dissimilar objects. We assume that this similarity is symmetric, $\omega_{ij}=\omega_{ji}$. The similarity of an object with itself will not be considered in the following, so we simply set $\omega_{ii}=0$ for $1\le i\le n$.

Let us give more intuition on how these weights may be set in practice. In the simplest case, these weights may reflect a class taxonomy and be set to a positive constant whenever $r_i$ and $r_j$ come from the same class and to a negative constant for two different classes. This is the setting we consider in our experiments later in this paper. Such weights may also be inferred from a hierarchical taxonomy: each weight $\omega_{ij}$ corresponding to two histograms could for instance reflect how close the respective classes of these histograms lie in the tree of classes.

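Let us introduce more notations before moving on to the next section. Since $\omega_{ij}=\omega_{ji}$ and $\omega_{ii}=0$ we restrict the set of pairs of indices to

$$\mathcal{I} \overset{\text{def}}{=} \{(i,j) \mid i,j\in\{1,\cdots,n\},\ i<j\}.$$

We also introduce two subsets of $\mathcal{I}$:

$$E_+ \overset{\text{def}}{=} \{(i,j)\in\mathcal{I} \mid \omega_{ij}>0\};\qquad E_- \overset{\text{def}}{=} \{(i,j)\in\mathcal{I} \mid \omega_{ij}<0\},$$

the subsets of similar and dissimilar histograms. Finally, we write $G_{ij}$ for the functions $G_{r_i r_j}$.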
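As a small illustration of this notation, the sketch below builds a weight matrix from class labels and lists the pairs with $i<j$ together with the similar and dissimilar subsets. The weight values $+1$ and $-1$, the label strings and the function names are assumptions made for this example, and indices are 0-based as is usual in Python.

```python
def similarity_weights(labels, same=1.0, different=-1.0):
    """Symmetric weights with zero diagonal (illustrative values +1 / -1)."""
    n = len(labels)
    return [[0.0 if i == j else (same if labels[i] == labels[j] else different)
             for j in range(n)] for i in range(n)]

def index_sets(omega):
    """Return the pairs with i < j and their similar / dissimilar subsets."""
    n = len(omega)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    e_plus = [(i, j) for (i, j) in pairs if omega[i][j] > 0]
    e_minus = [(i, j) for (i, j) in pairs if omega[i][j] < 0]
    return pairs, e_plus, e_minus

labels = ["cat", "dog", "cat", "dog"]   # hypothetical class labels
omega = similarity_weights(labels)
pairs, e_plus, e_minus = index_sets(omega)
```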

### 3.2 A Local Criterion to Select the Ground Metric

We propose to formulate the ground metric learning problem as finding a metric $M\in\mathcal{M}$ such that the transportation distance $d_M$ induced by this metric agrees with the weights $\omega_{ij}$. More precisely, this criterion will favor metrics $M$ for which, for a given pair of similar histograms $r_i$ and $r_j$ (namely $\omega_{ij}>0$), the resulting distance $d_M(r_i,r_j)$ is small. Conversely, for a given pair of dissimilar histograms $r_i$ and $r_j$ (namely $\omega_{ij}<0$), the resulting distance $d_M(r_i,r_j)$ should be large. The criterion should balance these two requirements, and in particular favor ground metrics for which these two ideas hold for pairs such that $|\omega_{ij}|$ is large.

From a formal perspective, any criterion to select $M$ should consider the family of pairs

$$\{(\omega_{11},G_{11}(M)),\cdots,(\omega_{n-1\,n},G_{n-1\,n}(M))\}.$$

Since the ordering of the histograms should not influence the criterion, only symmetric functions of the couples of variables above should be considered. We propose in this paper a family of simple criteria: the average value of $\omega_{ij}G_{ij}(M)$,

$$C_\infty(M) \overset{\text{def}}{=} 2\sum_{(i,j)\in\mathcal{I}} \omega_{ij}G_{ij}(M),$$

and a restriction of such an average to neighboring points,

$$C_k(M) \overset{\text{def}}{=} \sum_{i=1}^{n}\left(S^+_{ik}(M)+S^-_{ik}(M)\right),$$

where for each index $i$, the weighted sums of distances of its similar and dissimilar neighbors are considered respectively in

$$S^+_{ik}(M) \overset{\text{def}}{=} \sum_{j\in N^+_{ik}} \omega_{ij}G_{ij}(M),\quad\text{and}\quad S^-_{ik}(M) \overset{\text{def}}{=} \sum_{j\in N^-_{ik}} \omega_{ij}G_{ij}(M). \tag{2}$$

These sums are computed using the sets $N^+_{ik}$ and $N^-_{ik}$, which stand for the indices of any $k$ nearest neighbours of $r_i$ using distance $d_M$, not necessarily unique, and whose indices are taken respectively among the histograms similar to $r_i$ (those $j$ with $\omega_{ij}>0$) and among the histograms dissimilar to $r_i$ (those $j$ with $\omega_{ij}<0$). We adopt the convention that $N^+_{ik}$ contains all histograms similar to $r_i$ whenever $k$ is larger than their number, and follow the same convention for $N^-_{ik}$. This convention makes our notation consistent with the definition of $C_\infty$, since one can indeed check that $C_k = C_\infty$ for $k$ large enough, and notably for $k=n$. Since the techniques we propose below apply to both cases where $k$ is finite or infinite, we only consider in the following an extended index $k\in\mathbb{N}\cup\{\infty\}$ and its corresponding criterion $C_k$.

### 3.3 Metrics of Interest

Because $C_k$ is positively homogeneous, the problem of minimizing $C_k$ over the pointed cone $\mathcal{M}$ is either unbounded or trivially solved for $M=0$. To get around this issue, we restrict our search to the intersection of $\mathcal{M}$ and the unit ball in $\mathbb{R}^{d\times d}$ of a suitable matrix norm, that is

$$\mathcal{M}_1 = \mathcal{M}\cap\mathcal{B}_1, \tag{3}$$

where $\mathcal{B}_1$ is defined by an arbitrary matrix norm, so that the unit ball is convex. Using criteria $C_k$, learning a ground metric from the family of points $r_i$ and weights $\omega_{ij}$ boils down to finding an optimal solution to problem

$$\min_{M\in\mathcal{M}_1} C_k(M). \tag{P1}$$

Problem (P1) has $\binom{d}{2}$ variables, one for each upper-diagonal term in $M$. The feasible set $\mathcal{M}_1$ is closed, convex and bounded, and $C_k$ is piecewise linear. As a consequence, Problem (P1) admits at least one optimal solution.

## 4 $C_k$ as a Difference of Convex Functions

Since each function $G_{ij}$ is concave, $C_k$ can be cast as a Difference of Convex (DC) functions (Horst and Thoai 1999):

$$C_k(M) = S^-_k(M) - \left(-S^+_k(M)\right),\qquad S^\pm_k(M) \overset{\text{def}}{=} \sum_{i=1}^{n} S^\pm_{ik}(M),$$

where both $S^-_k$ and $-S^+_k$ are convex, by virtue of the convexity of each of the terms $S^-_{ik}$ and $-S^+_{ik}$ defined in Equation (2). This follows from the concavity of each function $G_{ij}$ and the fact that such functions are weighted by negative factors, $\omega_{ij}$ for $(i,j)\in E_-$ and $-\omega_{ij}$ for $(i,j)\in E_+$. We propose in this section an algorithm to obtain a local minimizer of Problem (P1) that takes advantage of this decomposition.

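### 4.1 Subdifferentiability of $C_k$

The gradient of $C_k$ computed at a given metric matrix $M$ is

$$\nabla C_k(M) = \nabla S^-_k(M) - \left(-\nabla S^+_k(M)\right),$$

where

$$\nabla S^-_k(M) = \sum_{i=1}^{n}\sum_{j\in N^-_{ik}} \omega_{ij}X^\star_{ij},\qquad -\nabla S^+_k(M) = \sum_{i=1}^{n}\sum_{j\in N^+_{ik}} -\omega_{ij}X^\star_{ij},$$

whenever all solutions $X^\star_{ij}$ to the linear programs considered in $C_k$ are unique and whenever the set of $k$ nearest neighbors of each histogram $r_i$ is unique. More generally, by virtue of the property that any optimal solution $X^\star_{ij}$ is in the sub-differential $\partial G_{ij}(M)$ of $G_{ij}$ at $M$ we have that

$$\sum_{i=1}^{n}\sum_{j\in N^-_{ik}} \omega_{ij}X^\star_{ij} \in \partial S^-_k(M),\qquad \sum_{i=1}^{n}\sum_{j\in N^+_{ik}} -\omega_{ij}X^\star_{ij} \in \partial\left(-S^+_k\right)(M),$$

regardless of the unicity of the nearest neighbors of each histogram $r_i$. The details of the computation of one of these two terms and of one of its subgradients are given in Algorithm (1). The computations for the other term follow the same route; we use the abbreviation $S^\pm_k$ to consider either of these two cases in our algorithm outline.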

### 4.2 Localized Linearization of the Concave Part of $C_k$

We describe in Algorithm (2) a simple approach to minimize $C_k$ locally based on a projected subgradient descent and a local linearization of the concave part of $C_k$. Algorithm (2) runs a subgradient descent on $C_k$ using two nested loops. In the first loop parameterized with variable $p$, $S^+_k(M_p)$ (the concave part of $C_k$) and a point $\nabla_+$ in its subdifferential are computed using the current metric $M_p$. Using this value and the subgradient $\nabla_+$, the concave part of $C_k$ can be locally approximated by its first order Taylor expansion,

$$C_k(M) \approx S^-_k(M) + S^+_k(M_p) + \nabla_+^{\top}(M-M_p).$$

This approximation is convex, larger than $C_k$ and can be minimized in an inner loop using a projected gradient descent. When this convex function has been minimized up to a sufficient precision, we obtain a point

$$M_{p+1} \in \operatorname*{argmin}_{M\in\mathcal{M}_1}\ S^-_k(M) + S^+_k(M_p) + \nabla_+^{\top}(M-M_p).$$

We increment $p$ and repeat the step described above. The algorithm terminates when sufficient progress in the outer loop has been realized, at which point either the matrix computed in the last iteration, or that for which the objective has been minimal so far, is returned as the output of the algorithm.

Algorithm (2) fits the description of simplified DC algorithms (Tao and An 1997, §4.2) to minimize a difference of convex functions in the case where either $S^-_k$ or $-S^+_k$ is a convex polyhedral function. In this paper both functions $S^-_k$ and $-S^+_k$ are convex polyhedral. The overall quality of the local minimum that is reached is directly linked to the quality of the initial point $M_0$. Choosing a good $M_0$ is thus a crucial factor of our approach. We provide a few options to define $M_0$ in Section 5.
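The outer loop of such a DC scheme, and its sensitivity to the starting point, can be seen on a one-dimensional toy problem. The sketch below is not Algorithm (2): the functions `g` and `h` are made-up stand-ins for the two convex parts, there is no projection onto a feasible set, and the inner convex problem is solved in closed form instead of by projected subgradient descent.

```python
def g(x):
    """Convex part of the toy objective (stand-in for the convex term)."""
    return x * x

def h(x):
    """Convex polyhedral function; the objective is g - h, so -h is the concave part."""
    return max(3.0 * x, -x)

def subgradient_h(x):
    """One element of the subdifferential of h at x."""
    return 3.0 if x > 0 else -1.0

def dc_minimize(x0, iterations=20):
    """Repeatedly linearize h at the current point and minimize the convex majorant."""
    x = x0
    for _ in range(iterations):
        s = subgradient_h(x)
        # Majorant: g(y) - h(x) - s * (y - x). For g(y) = y^2 its minimizer is s / 2.
        x = s / 2.0
    return x, g(x) - h(x)

x_right, f_right = dc_minimize(0.2)    # start on the positive side
x_left, f_left = dc_minimize(-0.2)     # start on the negative side
```

The two runs stop at different stationary points with different objective values, which mirrors the remark above that the result of a DC iteration depends on where it starts.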

## 5 Initial Points

Algorithm (2) converges to a local minimum of $C_k$ in $\mathcal{M}_1$. We argue that this local solution can only provide a good approximation of the global minimum if the initial point $M_0$ itself is already a good initial guess, not too far from the global optimum of $C_k$.

### 5.1 The $\ell_1$ Distance as a Transportation Distance

The $\ell_1$ distance between histograms can provide an educated guess to define an initial point $M_0$ to optimize $C_k$. Indeed, the $\ell_1$ distance can be interpreted as the Kantorovich-Rubinstein distance seeded with the uniform ground metric $M_1$ defined as $M_1(i,j)=\mathbf{1}_{i\ne j}$ (rigorously speaking, both distances are equal up to a factor 2, that is $\|r-c\|_1 = 2\,d_{M_1}(r,c)$). Because the $\ell_1$ distance is itself a popular distance to compare histograms, we consider $M_1$ in our experiments to initialize Algorithm (2). This starting point does not, however, exploit the information provided by the histograms $r_i$ and weights $\omega_{ij}$. In order to do so, we approximate $C_k$ by a linear function of $M$ in Section 5.2, and show that a minimizer of this approximation can provide a better way of setting $M_0$, as shown later in the experimental section.
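The factor 2 between the $\ell_1$ distance and the transportation distance with uniform ground metric can be checked by writing down an explicit plan. The sketch below keeps as much mass as possible on the diagonal and spreads the remaining surplus over the bins in deficit; no plan can move less mass, since the diagonal entry of bin $p$ cannot exceed $\min(r_p,c_p)$. The function name and the example histograms are assumptions of this illustration.

```python
def uniform_metric_plan(r, c):
    """A transportation plan from r to c that moves as little mass as possible."""
    d = len(r)
    keep = [min(rp, cp) for rp, cp in zip(r, c)]    # mass that stays in place
    surplus = [rp - k for rp, k in zip(r, keep)]    # excess of r over c, per bin
    deficit = [cp - k for cp, k in zip(c, keep)]    # excess of c over r, per bin
    moved = sum(surplus)                            # equals sum(deficit)
    X = [[0.0] * d for _ in range(d)]
    for p in range(d):
        X[p][p] = keep[p]
        for q in range(d):
            if p != q and moved > 0:
                X[p][q] = surplus[p] * deficit[q] / moved
    return X

r = [0.5, 0.3, 0.2]
c = [0.1, 0.6, 0.3]
X = uniform_metric_plan(r, c)
d = len(r)
cost = sum(X[p][q] for p in range(d) for q in range(d) if p != q)   # cost under the uniform metric
l1 = sum(abs(rp - cp) for rp, cp in zip(r, c))
row_sums = [sum(X[p]) for p in range(d)]
col_sums = [sum(X[p][q] for p in range(d)) for q in range(d)]
```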

### 5.2 Linear Approximations to $C_k$

We propose to form an initial point $M_0$ by replacing the optimization underlying the computation of each distance $G_{ij}(M)$ by a dot product,

$$G_{ij}(M) = \min_{X\in U(r_i,r_j)} \langle M, X\rangle \approx \langle M, \Xi_{ij}\rangle \tag{4}$$

where $\Xi_{ij}$ is a $d\times d$ matrix. We discuss several choices to define matrices $\Xi_{ij}$ later in Section 5.3. We use these approximations to define the criteria

$$\chi_\infty(M) = 2\sum_{(i,j)\in\mathcal{I}} \omega_{ij}\langle M,\Xi_{ij}\rangle = 2\Big\langle M, \sum_{(i,j)\in\mathcal{I}} \omega_{ij}\Xi_{ij}\Big\rangle \tag{5}$$

in the case where $k=\infty$, or

$$\chi_k(M) = \sum_{i=1}^{n}\Big(\sum_{j\in N^-_{ik}(M_1)} \omega_{ij}\langle M,\Xi_{ij}\rangle + \sum_{j\in N^+_{ik}(M_1)} \omega_{ij}\langle M,\Xi_{ij}\rangle\Big) = \Big\langle M, \sum_{i=1}^{n}\Big(\sum_{j\in N^-_{ik}(M_1)} \omega_{ij}\Xi_{ij} + \sum_{j\in N^+_{ik}(M_1)} \omega_{ij}\Xi_{ij}\Big)\Big\rangle \tag{6}$$

when $k<\infty$. Note that when $k$ is finite, and only in that case, the nearest neighbors of each histogram $r_i$ need to be selected first with a metric; we use $M_1$ for this purpose. Although this trick may not be satisfactory, we observe that similar approaches have been used by Weinberger and Saul (2009) to seed their algorithms with near neighbors in the initial phase of their optimization. Note that such a trick is not needed when $k=\infty$ which, in practice, seems to yield better results as explained later in the experimental section. In both cases where $k=\infty$ and $k<\infty$, $\chi_k$ is a linear function of $M$ which, for our purpose, needs to be minimized over $\mathcal{M}_1$:

$$\min_{M\in\mathcal{M}_1} \chi_k(M) = \min_{M\in\mathcal{M}_1} \langle M, \Xi_k\rangle \tag{P2}$$

where $\Xi_k$ is a $d\times d$ matrix equal to the relevant sums in the right hand sides of either Equation (5) or (6) depending on the value of $k$. This problem has a linear objective and a convex feasible set. If the norm defining the unit ball in Equation (3) is the $\ell_1$ norm, that is the sum of the absolute values of all coefficients in a matrix, then Problem (P2) is a linear program with $O(d^3)$ constraints. For large $d$, this formulation might be intractable. We propose to consider instead the $\ell_2$ norm unit ball for $\mathcal{B}_1$, which yields an alternative form for Problem (P2), where the constraint $M\in\mathcal{B}_1$ is replaced by a regularization term

$$\operatorname*{argmin}_{M\in\mathcal{M}}\ \lambda\langle M,\Xi_k\rangle + \|M\|_2^2 = \operatorname*{argmin}_{M\in\mathcal{M}}\ \Big\|M+\frac{\lambda}{2}\Xi_k\Big\|_2^2,\quad \lambda>0 \tag{P3}$$

Brickell et al. (2008, Algorithm 3.1) have recently proposed a triangle fixing algorithm to solve problems of the form

$$\min_{M\in\mathcal{M}} \|M-H\|_2, \tag{P4}$$

where $H$ is a pseudo-distance, that is a symmetric, zero on the diagonal and nonnegative matrix. It is however straightforward to check that none of these three conditions, although intuitive when considering the metric nearness problem as defined in (Brickell et al. 2008, §2), is necessary for Algorithm (3.1) in (Brickell et al. 2008, §3) to converge. This algorithm is not only valid for non-symmetric matrices $H$, as pointed out by the authors themselves, but it is also applicable to matrices with negative entries and non-zero diagonal entries. Problem (P3) can thus be solved by replacing $H$ by $-\frac{\lambda}{2}\Xi_k$ in Problem (P4) regardless of the sign of the entries of $\Xi_k$.
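The link between the regularized linear objective of (P3) and the metric nearness form (P4) is a completion of the square in the Frobenius norm. The short derivation below spells out this step; it only uses the notation already introduced for (P3) and (P4).

```latex
\begin{aligned}
\Big\| M + \tfrac{\lambda}{2}\,\Xi_k \Big\|_2^2
  &= \|M\|_2^2 + 2\,\Big\langle M, \tfrac{\lambda}{2}\,\Xi_k \Big\rangle
     + \tfrac{\lambda^2}{4}\,\|\Xi_k\|_2^2 \\
  &= \underbrace{\lambda\,\langle M, \Xi_k\rangle + \|M\|_2^2}_{\text{objective of (P3)}}
     \;+\; \underbrace{\tfrac{\lambda^2}{4}\,\|\Xi_k\|_2^2}_{\text{independent of } M}.
\end{aligned}
```

The two objectives therefore differ by a constant and share the same minimizers over $\mathcal{M}$, and the right-hand side is exactly $\|M-H\|_2^2$ with $H = -\frac{\lambda}{2}\Xi_k$.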

We conclude this section by mentioning that other approaches can be considered to minimize the dot product $\langle M,\Xi_k\rangle$ using alternative norms and methods. Frangioni et al. (2005) propose for instance to handle linear programs in the intersection between the cone of distances and a set of polyhedral constraints which defines what is known as the metric polytope. These approaches are however more involved computationally and we leave such extensions for future work.

### 5.3 Representative Tables

The techniques presented in Section 5.2 above build upon a linear approximation of each function $G_{ij}(M)$ as $\langle M,\Xi_{ij}\rangle$ by selecting a particular matrix $\Xi_{ij}$ such that $G_{ij}(M)\approx\langle M,\Xi_{ij}\rangle$. We propose in this section to obtain such an approximation by considering an arbitrary and representative transportation table in $U(r_i,r_j)$. More precisely, we propose to use a simple proxy for the optimal transportation distance: the dot-product of $M$ with a matrix that lies at the center of $U(r_i,r_j)$.

#### 5.3.1 Independence Table

Many candidate tables can qualify as valid centers of general polytopes as discussed for instance in (Boyd and Vandenberghe 2004, §8.5). There is, however, a particular table in $U(r,c)$ which is easy to compute and which has been considered as a central point of $U(r,c)$ in previous work: the independence table $rc^\top$ (Good 1963). The table $rc^\top$, which is trivially in $U(r,c)$ because $rc^\top\mathbf{1}_d = r$ and $cr^\top\mathbf{1}_d = c$, is also the maximal entropy table in $U(r,c)$, that is the table which maximizes

$$h(X) \overset{\text{def}}{=} -\sum_{p,q=1}^{d} X_{pq}\log X_{pq}. \tag{7}$$

Using the independence table to approximate $G_{ij}$, that is using the approximation

$$\min_{F\in U(r_i,r_j)} \langle M,F\rangle \approx r_i^\top M r_j,$$

yields the average independence tables,

$$\Xi_k = \begin{cases} \sum_{i,j}\omega_{ij}\,r_i r_j^\top, & \text{if } k=\infty,\\[4pt] \sum_{i=1}^{n}\Big(\sum_{j\in N^-_{ik}(M_1)}\omega_{ij}\,r_i r_j^\top + \sum_{j\in N^+_{ik}(M_1)}\omega_{ij}\,r_i r_j^\top\Big), & \text{if } k<\infty. \end{cases} \tag{8}$$

Note however that this approximation tends to overestimate substantially the distance between two similar histograms. Indeed, it is easy to check that $r^\top M r$ is positive whenever $M$ is a definite distance and $r$ has positive entropy, whereas $d_M(r,r)=0$. In the case where all coordinates of $r$ are equal to $1/d$, $r^\top M r$ is simply $\|M\|_1/d^2$.
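The overestimation for similar histograms is easy to reproduce numerically. The sketch below forms the independence table of a uniform histogram with itself, checks its marginals, and evaluates the proxy $r^\top M r$ for the uniform ground metric; the function names and the choice $d=4$ are assumptions made for this illustration.

```python
def independence_table(r, c):
    """Outer product r c^T, a table whose marginals are r and c."""
    return [[rp * cq for cq in c] for rp in r]

def frobenius(A, B):
    """Frobenius dot-product of two matrices given as lists of rows."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

d = 4
uniform_metric = [[0.0 if p == q else 1.0 for q in range(d)] for p in range(d)]
r = [1.0 / d] * d

table = independence_table(r, r)
row_sums = [sum(row) for row in table]
col_sums = [sum(table[p][q] for p in range(d)) for q in range(d)]
proxy = frobenius(uniform_metric, table)                  # r^T M r
norm1_over_d2 = sum(map(sum, uniform_metric)) / d ** 2    # ||M||_1 / d^2
```

The proxy evaluates to $0.75$ here although the transportation distance between $r$ and itself is zero, since the identity-like plan $\operatorname{diag}(r)$ is feasible and costs nothing.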

#### 5.3.2 Typical Table

Barvinok (2010) argues more recently that most transportation tables are close to the so-called typical table of the transportation polytope and not, as was hinted by Good (1963), to the independence table. We briefly explain the concentration result obtained in (Barvinok 2010, §1.5). Barvinok proves that, under the condition that $r$ and $c$ do not have too small coefficients, for any table $X$ sampled uniformly on $U(r,c)$, the difference between the sum of a subset of coefficients of $X$ and the sum of the same coefficients in the typical table $T$ is small with high probability. Writing $S$ for a set of indices, and $\sigma_S(X)$ for the corresponding sum of coefficients of $X$, we have that for sets $S$ big enough,

$$P\{X\in U(r,c),\ (1-\varepsilon)\,\sigma_S(T)\le\sigma_S(X)\le(1+\varepsilon)\,\sigma_S(T)\} \ge 1-2de^{-\kappa d}$$

where $\kappa$ depends on the smoothness of $r$ and $c$, that is the magnitude of their smallest coefficients. The typical table $T$ of two histograms $r$ and $c$ is defined (Barvinok 2010, §1.2) as the table in the polytope $U(r,c)$ which maximizes the concave function

$$g(X) = \sum_{p,q=1}^{d} (X_{pq}+1)\ln(X_{pq}+1) - X_{pq}\ln(X_{pq}). \tag{9}$$

Computing the typical table directly is not computationally tractable for large values of $d$. Barvinok (2009, p.350) provides however a different characterization of the typical table $T_{ij}$ of $U(r_i,r_j)$ as

$$T_{ij} = \left[\frac{e^{-u_p-v_q}}{1-e^{-u_p-v_q}}\right]_{p,q\le d},$$

where the vectors $u$ and $v$ in $\mathbb{R}^d$ are the unique minimizers of the convex program

$$\min_{u,v>0}\ r_i^\top u + r_j^\top v - \sum_{p,q=1}^{d}\log\left(1-e^{-u_p-v_q}\right). \tag{P5}$$

Both gradient and Hessian of the objective of Problem (P5) have a simple form. The Hessian can be expressed in a block form where the diagonal blocks are themselves diagonal matrices. $u$ and $v$ can thus be easily computed using second-order methods. The resulting matrices $\Xi_k$ are thus

$$\Xi_k = \begin{cases} \sum_{i,j}\omega_{ij}\,T_{ij}, & \text{if } k=\infty,\\[4pt] \sum_{i=1}^{n}\Big(\sum_{j\in N^-_{ik}(M_1)}\omega_{ij}\,T_{ij} + \sum_{j\in N^+_{ik}(M_1)}\omega_{ij}\,T_{ij}\Big), & \text{if } k<\infty. \end{cases} \tag{10}$$

We also note that, as for the independence table, $\langle M,T_{ij}\rangle$ will be significantly larger than $G_{ij}(M)$ when $r_i$ and $r_j$ are similar.

Let us provide some intuition on the idea behind using the average typical table in Problem (P3). The solution to Problem (P3) will be a metric $M$ which will have high coefficients $M_{pq}$ for any pair of features $(p,q)$ such that the $(p,q)$ entry of $\Xi_k$ is negative, namely a pair such that the value $X_{pq}$ of a transportation plan $X$ sampled uniformly in each $U(r_i,r_j)$ is typically high on average for all mismatched histogram pairs. $M_{pq}$ will be on the contrary small for a pair of features $(p,q)$ such that the value of $X_{pq}$ is typically high in transportation plans between similar histograms.

To recapitulate the results of this section, we propose to approximate $C_k$ by a linear function and compute its minimum in the intersection of the unit ball and the cone $\mathcal{M}$ of metric matrices. This linear objective can be efficiently minimized using a set of tools proposed by Brickell et al. (2008) adapted to our problem. In order to do so, the unit ball considered to define the feasible set must be the unit ball of the Frobenius norm of matrices. In order to propose such an approximation, we have used the independence and typical tables as representative points of the polytopes $U(r_i,r_j)$. The successive steps of the computations that yield an initial point $M_0$ are spelled out in Algorithm (3).

## 6 Related Work

### 6.1 Metrics on the Probability Simplex

Deza and Deza (2009, §14) provide an exhaustive list of metrics for probability measures, most of which apply to probability measures on $\mathbb{R}$ and $\mathbb{R}^d$. When narrowed down to distances for probabilities on unordered discrete sets (the dominant case in machine learning applications), Rubner et al. (2000, §2) propose to split the most commonly used distances into two families: bin-to-bin distances and cross-bin distances.

Bin-to-bin distances only compare the $d$ couples of bin-counts $(r_i,c_i)$ independently to form a distance between $r$ and $c$: the Jensen-divergence, $\chi^2$, Hellinger, total variation distances and more generally Csiszár $f$-divergences (Amari and Nagaoka 2001, §3.2) all fall in this category. Notice that each of these distances is known to usually work better for histograms than a straightforward application of the Euclidean distance, as illustrated for instance in (Chapelle et al. 1999, Table 4) or in our experimental section. This can be explained in theory using geometric (Amari and Nagaoka 2001, §3) or statistical arguments (Aitchison and Egozcue 2005).

Bin-to-bin distances are easy to compute and accurate enough to compare histograms when all features are sufficiently distinct. When, on the contrary, some of these features are known to be similar, either because of statistical co-occurrence (e.g. the words Nadal and Federer) or through any other form of prior knowledge (e.g. color or amino-acid similarity), then a simple bin-to-bin comparison may not be accurate enough, as argued by Rubner et al. (2000, §2.2). In particular, bin-to-bin distances are large between histograms with distinct supports, regardless of the fact that these two supports may in fact describe very similar features.

Cross-bin distances handle this issue by considering all $\binom{d}{2}$ possible pairs of cross-bin counts to form a distance. The simplest cross-coordinate distance for general vectors in $\mathbb{R}^d$ is arguably the Mahalanobis family of distances,

$$d_\Omega(x,y) = \sqrt{(x-y)^\top\Omega\,(x-y)},$$

where $\Omega$ is a positive semidefinite $d\times d$ matrix. The Mahalanobis distance between $x$ and $y$ can be interpreted as the Euclidean distance between $Lx$ and $Ly$ where $L$ is a Cholesky factor of $\Omega$ or any square root of $\Omega$. Learning such linear maps $L$ or $\Omega$ directly using labeled information has been the subject of a substantial amount of research in recent years. We briefly review this literature in the following section.
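The equivalence between a Mahalanobis distance and a Euclidean distance after a linear map can be verified on a two-dimensional example. The sketch below assumes the convention $\Omega = L^\top L$; the matrix `L`, the points and the helper names are made up for the illustration.

```python
import math

def matvec(A, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def mahalanobis(x, y, Omega):
    """sqrt((x - y)^T Omega (x - y)) for a positive semidefinite Omega."""
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(d * w for d, w in zip(diff, matvec(Omega, diff))))

L = [[2.0, 0.0], [1.0, 1.0]]    # hypothetical linear map
# Omega = L^T L, so that (x - y)^T Omega (x - y) = ||L (x - y)||^2
Omega = [[sum(L[k][i] * L[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

x, y = [1.0, 2.0], [0.0, -1.0]
d_omega = mahalanobis(x, y, Omega)
d_euclid = math.dist(matvec(L, x), matvec(L, y))
```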

### 6.2 Mahalanobis Metric Learning

Xing et al. (2003), followed by Weinberger et al. (2006) and Davis et al. (2007) have proposed different algorithms to learn the parameters of a Mahalanobis distance, that is either a positive semidefinite matrix $\Omega$ or a linear map $L$. These techniques define first a criterion and a feasible set of candidate matrices to obtain, through optimization algorithms, a relevant matrix $\Omega$ or $L$. The criteria we propose in Section 3 are modeled along these ideas. Weinberger et al. (2006) were the first to consider criteria that only use nearest neighbors, which inspired in this work the proposal of $C_k$ for finite values of $k$ in Section 3.2 as opposed to considering the average over all possible distances as in (Xing et al. 2003) for instance.

We stress that Mahalanobis metric learning and ground metric learning have very little in common conceptually: Mahalanobis metric learning algorithms produce a $d\times d$ positive semidefinite matrix or a linear operator $L$. Ground metric learning produces instead a metric matrix $M$. These sets of techniques operate on very different mathematical objects.

It is also worth mentioning that although Mahalanobis distances have been designed for general vectors in $\mathbb{R}^d$, and as a consequence can be applied to histograms, there is, to our knowledge, no statistical theory which motivates their use on the probability simplex.

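### 6.3 Metric Learning in $\Sigma_{d-1}$

Lebanon (2006) has proposed to learn a bin-to-bin distance in the probability simplex using a parametric family of distances parameterized by a histogram $\lambda\in\Sigma_{d-1}$ defined as

$$d_\lambda(r,c) = \arccos\left(\sum_{i=1}^{d}\sqrt{\frac{r_i\lambda_i}{r^\top\lambda}}\sqrt{\frac{c_i\lambda_i}{c^\top\lambda}}\right).$$

This formula can be simplified by using the perturbation operator proposed by Aitchison (1986, p.46):

$$\forall r,\lambda\in\Sigma_{d-1},\quad r\odot\lambda \overset{\text{def}}{=} \frac{1}{r^\top\lambda}\,(r_1\lambda_1,\cdots,r_d\lambda_d)^\top.$$

Aitchison argues that the perturbation operation can be naturally interpreted as an addition operation in the simplex. Using this notation, the distance $d_\lambda(r,c)$ becomes the simple Fisher metric applied to the perturbed histograms $r\odot\lambda$ and $c\odot\lambda$,

$$d_\lambda(r,c) = \arccos\left\langle\sqrt{r\odot\lambda},\sqrt{c\odot\lambda}\right\rangle.$$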
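The perturbation operator and the resulting distance translate directly into a few lines of code. The sketch below is a plain-Python illustration of the two formulas above; the function names, the clipping of the inner product at 1 to absorb rounding noise, and the example histograms are assumptions of this example, and it says nothing about how the perturbation is learned.

```python
import math

def perturb(r, lam):
    """Aitchison perturbation: coordinate-wise product, renormalized to sum to one."""
    total = sum(ri * li for ri, li in zip(r, lam))
    return [ri * li / total for ri, li in zip(r, lam)]

def fisher_distance(p, q):
    """arccos of the inner product of the element-wise square roots of p and q."""
    inner = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.acos(min(1.0, inner))    # clip rounding noise slightly above 1

def perturbed_fisher_distance(r, c, lam):
    """Fisher distance between the histograms r and c, both perturbed by lam."""
    return fisher_distance(perturb(r, lam), perturb(c, lam))

r = [0.5, 0.3, 0.2]
c = [0.1, 0.6, 0.3]
lam = [0.2, 0.3, 0.5]
uniform = [1.0 / 3.0] * 3
```

Perturbing by the uniform histogram leaves a histogram unchanged, and two histograms with disjoint supports stay at the maximal distance $\pi/2$ whatever the (positive) perturbation.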

Using arguments related to the fact that the distance should vary according to the density of points described in a dataset, Lebanon (2006) proposes to learn this perturbation $\lambda$ in a semi-supervised context, that is making use only of observed histograms but of no other side-information. Because of this key distinction we do not consider this approach in the experimental section.

## 7 Experiments

We provide in this section a few details on the practical implementation of Algorithms (1), (2) and (3). We follow by presenting empirical evidence that ground metric learning improves upon other state-of-the-art metric learning techniques when considered on normalized histograms.

### 7.1 Implementation Notes

Algorithms (1), (2) and (3) were implemented using several optimization toolboxes. Algorithm (1) requires the computation of several transportation problems. We use the CPLEX Matlab API implementation of network flows with warm starts to that effect. The computational gains we obtain by using the API, instead of a function call to the CPLEX Matlab toolbox or to the Mosek solver, are approximately fourfold. These benefits come from the fact that only the constraint vector in Problem (P0) needs to be updated at each iteration of the first loop of Algorithm (1). We use the metricNearness toolbox to carry out both the projections of each inner loop iteration of Algorithm (2) and the last minimization of Algorithm (3). We compute typical tables using the fminunc Matlab function, with the gradient and the Hessian of the objective of Problem (P5) provided as auxiliary functions. Since fminunc is by definition an unconstrained solver, its solution is kept if both and satisfy positivity constraints, which is the case for a large majority of pairs of histograms. Whenever these constraints are violated, we solve the problem again using a slower constrained Newton method.
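To make the transportation step concrete, here is a minimal sketch of one transportation problem solved as a generic linear program. It uses SciPy's `linprog` rather than the CPLEX network-flow solver described above and does not use warm starts; the function name and the example data are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def transport_distance(r, c, M):
    """Optimal transport cost between histograms r and c under ground metric M.

    The d x d transport plan X is flattened row by row, so X[i, j] sits at
    position i * d + j of the decision vector.
    """
    d = len(r)
    # Row-sum constraints: sum_j X[i, j] = r[i]
    A_rows = np.kron(np.eye(d), np.ones((1, d)))
    # Column-sum constraints: sum_i X[i, j] = c[j]
    A_cols = np.kron(np.ones((1, d)), np.eye(d))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([r, c])  # only this vector depends on the pair (r, c)
    res = linprog(M.reshape(-1), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(d, d)

# Hypothetical example: three bins on a line, ground metric |i - j|.
r = np.array([0.5, 0.5, 0.0])
c = np.array([0.0, 0.5, 0.5])
M = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
cost, plan = transport_distance(r, c, M)
print(cost)
```

The constraint matrix depends only on the dimension, and for a fixed ground metric only the right-hand side `b_eq` changes from one pair of histograms to the next, which is the structure the warm-started implementation exploits.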

### 7.2 Image Classification Datasets

We sample randomly classes taken in the Caltech-256 family of images and consider 70 images in each class. Each image is represented as a normalized histogram of GIST features, obtained using the LEAR implementation of GIST features (Douze et al. 2009). These features describe edge directions at mid-resolution computed for each patch of a grid on each image. Each feature histogram is of dimension and subsequently normalized to sum to one.

We split these classes into two sets of classes, and and study the resulting binary classification problems that arise from each pair of classes in

$$\{a_1,\cdots,a_N\}\times\{a_{N+1},\cdots,a_{2N}\}.$$

For each of these binary classification tasks we split the points from both classes into points to form a training set and points to form a test set. This amounts to having training points following the notations introduced in Section 3.1.
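The construction of the binary tasks and of the train/test split can be sketched as follows. The class names, the value of `N` and the training set size are hypothetical placeholders; only the 70 images per class (140 points per task) come from the text.

```python
import itertools
import random

def binary_tasks(class_names, rng):
    """Split 2N classes into two groups of N and pair every class across groups."""
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    n = len(shuffled) // 2
    first, second = shuffled[:n], shuffled[n:2 * n]
    return list(itertools.product(first, second))

def split_indices(n_points, n_train, rng):
    """Randomly split point indices into a training part and a test part."""
    idx = list(range(n_points))
    rng.shuffle(idx)
    return idx[:n_train], idx[n_train:]

rng = random.Random(0)
classes = ["a%d" % i for i in range(1, 7)]   # hypothetical: 2N = 6 classes
tasks = binary_tasks(classes, rng)           # N * N = 9 binary tasks
train_idx, test_idx = split_indices(140, 80, rng)  # 80 is a placeholder size
print(len(tasks), len(train_idx), len(test_idx))
```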

### 7.3 Distances used in this benchmark

#### 7.3.1 Bin-to-bin distances

We consider the $l_1$, $l_2$ and Hellinger distances on GIST feature vectors,

$$l_1(r,c)=\|r-c\|_1,\quad l_2(r,c)=\|r-c\|_2,\quad H(r,c)=\|\sqrt{r}-\sqrt{c}\|_2,$$

where $\sqrt{r}$ is the vector whose coordinates are the square roots of the coordinates of $r$.
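These three bin-to-bin distances translate directly into NumPy. The sketch below is illustrative only; the function names and the example histograms are assumptions.

```python
import numpy as np

def l1(r, c):
    """Sum of absolute coordinate differences."""
    return float(np.abs(r - c).sum())

def l2(r, c):
    """Euclidean distance."""
    return float(np.sqrt(((r - c) ** 2).sum()))

def hellinger(r, c):
    """Euclidean distance between the coordinate-wise square roots."""
    return float(np.sqrt(((np.sqrt(r) - np.sqrt(c)) ** 2).sum()))

# Hypothetical histograms, each summing to one.
r = np.array([0.5, 0.5, 0.0])
c = np.array([0.0, 0.5, 0.5])
print(l1(r, c), l2(r, c), hellinger(r, c))
```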

#### 7.3.2 Mahalanobis distances

We use the public implementations of LMNN (Weinberger and Saul 2009) and ITML (Davis et al. 2007) to learn two different Mahalanobis distances for each task. We run both algorithms with default settings, that is for LMNN and for ITML. We use these algorithms on the Hellinger representations of all histograms originally in the training set. We have considered this representation because the Euclidean distance of two histograms using the Hellinger map corresponds exactly to the Hellinger distance (Amari and Nagaoka 2001, p.57). Since the Mahalanobis distance builds upon the Euclidean distance, we argue that this representation is more adequate to learn Mahalanobis metrics in the probability simplex. The significant gain in performance observed in Figure 2 that is obtained through this simple transformation confirms this intuition.
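The relationship between the Hellinger map and a Mahalanobis distance can be illustrated with a short sketch. This is not the LMNN or ITML code; the names `hellinger_map`, `mahalanobis`, `omega` and `L` are hypothetical, and the matrix used here is arbitrary rather than learned.

```python
import numpy as np

def hellinger_map(r):
    """Map a histogram to its coordinate-wise square root."""
    return np.sqrt(r)

def mahalanobis(x, y, omega):
    """sqrt((x - y)^T omega (x - y)) for a positive semidefinite matrix omega."""
    diff = x - y
    return float(np.sqrt(diff @ omega @ diff))

r = np.array([0.2, 0.3, 0.5])
c = np.array([0.6, 0.1, 0.3])
x, y = hellinger_map(r), hellinger_map(c)

# With the identity matrix, the Mahalanobis distance on the mapped vectors
# is exactly the Hellinger distance between the original histograms.
d_identity = mahalanobis(x, y, np.eye(3))
d_hellinger = float(np.linalg.norm(np.sqrt(r) - np.sqrt(c)))

# A positive semidefinite matrix built from a linear map L gives the
# Euclidean distance between the linearly transformed vectors.
L = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5]])
d_linear = mahalanobis(x, y, L.T @ L)
print(d_identity, d_hellinger, d_linear)
```

Learning a Mahalanobis metric on the mapped vectors therefore starts from the Hellinger distance, rather than the Euclidean distance on raw histograms, as its unlearned baseline.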

#### 7.3.3 Ground Metric Learning

We learn ground metrics using the following settings. In each classification task, and for two images and , , each weight is set to if both histograms come from the same class and to if they come from different classes. The weights are further normalized to ensure that and . As a consequence the total sum of weights is equal to . The neighborhood parameter is set to to be directly comparable to the same parameter used for ITML and LMNN. The subgradient stepsize of Algorithm (2) is set to , guided by preliminary experiments and by the fact that, because of the normalization of the weights introduced above, both the current iteration in Algorithm (2) and the gradient steps or all have comparable norms as matrices. We perform a minimum of gradient steps in each inner loop and set to . Each inner loop is terminated when the progress is too small, that is when the objective does not progress more than every steps, or when reaches . We compute initial points using different representative tables as described in Algorithm (3). Figure 4 illustrates the variation in performance for different choices of . There is no “natural” distance between GIST features that we could consider. We have tried a few, taking for instance distances based on the directions and possible locations in the grid described by the GIST features, but we could not come up with one that was competitive with any of the methods considered above. This situation illustrates our claim in the introduction that GML can select agnostically a metric for features without using any prior knowledge on the features.

### 7.4 Comparison with the SVM baseline

For each distance , we consider its exponentiated kernel and use a Support Vector Machine (SVM) classifier (Cortes and Vapnik 1995) on the test points estimated with the training points. We use the following parameter grid to train the SVMs: the regularization parameter is selected within the range ; the bandwidth parameter is selected as a multiple of the median distance computed in the training set, that is is selected within the range . We use a 4-fold (testing on the left-out fold), 2-repeat cross-validation procedure on the training set to select the parameter pair that has the lowest average error. Transportation distances are not negative definite in the general case, which is why we add a sufficient amount of diagonal regularization (minus the smallest negative eigenvalue) on the resulting Gram matrices to ensure that they are positive definite in the training phase.
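The kernel construction and the diagonal regularization can be sketched with NumPy. The kernel form `exp(-D / sigma)`, the function names and the optional `eps` margin are assumptions made for illustration; the exact kernel formula and parameter grids are not reproduced here.

```python
import numpy as np

def median_bandwidth(D, multiple):
    """Bandwidth set to a multiple of the median pairwise training distance."""
    upper = D[np.triu_indices(len(D), k=1)]  # off-diagonal pairs, counted once
    return multiple * float(np.median(upper))

def exponentiated_kernel(D, sigma):
    """Assumed kernel form: exp(-distance / bandwidth), applied entrywise."""
    return np.exp(-D / sigma)

def regularize_gram(K, eps=0.0):
    """Shift the diagonal by minus the smallest negative eigenvalue, if any."""
    K = (K + K.T) / 2.0                      # symmetrize against rounding
    lam_min = float(np.linalg.eigvalsh(K).min())
    if lam_min < 0:
        K = K + (eps - lam_min) * np.eye(len(K))
    return K

# Hypothetical pairwise distances between three training histograms.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
sigma = median_bandwidth(D, multiple=0.5)
K = regularize_gram(exponentiated_kernel(D, sigma))
print(sigma, np.linalg.eigvalsh(K).min())
```

Shifting by exactly minus the smallest eigenvalue makes the Gram matrix positive semidefinite, with smallest eigenvalue zero; a small positive `eps` makes it strictly positive definite.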

### 7.5 Results

The most important results of this experimental section are summarized in Figure 1, which displays, for all considered distances, their average recall accuracy on the test set and the average classification error using a -nearest neighbor classifier. These quantities are averaged over binary classifications. In this figure, GML used with EMD is shown to provide, on average, the best possible performance: the left hand figure considers test points and shows that, for each point considered on its own, GML-EMD selects on average more same class points as closest neighbors than any other distance. The performance gap between GML-EMD and competing distances increases significantly as the number of retrieved neighbors is itself increased. The right hand figure displays the average error over all tasks of a -nearest neighbor classification algorithm when considered with all distances for varying values of . In this case too, GML combined with EMD fares much better than competing distances. The average error when using an SVM with these distances is represented in the legend of the right-hand side figure. Our results agree with the general observation in metric learning that Support Vector Machines usually perform better than -nearest neighbor classifiers with learned metrics (Weinberger and Saul 2009, Table 1). Note however that the -nearest-neighbor classifier seeded with GML-EMD has an average performance that is directly comparable with that of support vector machines seeded with the or kernels.

It is also worth mentioning as a side remark that the $l_2$ distance does not perform as well as the $l_1$ or Hellinger distances on these datasets, which validates our earlier statement that the Euclidean geometry is usually a poor choice to compare histograms directly. This intuition is further validated in Figure 2, where Mahalanobis learning algorithms are shown to perform significantly better when they use the Hellinger representation of histograms.

Figure 3 shows that the performance of GML can vary significantly depending on the neighborhood parameter . We have also considered as a ground metric the initial point obtained by using typical tables and with Algorithm 3. The corresponding results appear as EMD Typ curves in the figure. These curves are far below those corresponding to GML-EMD Typ. This gap illustrates the fact that the subgradient descent performed by Algorithm 2 does have a real impact on performance, and that choosing a good initial point in itself is not enough.

Finally, Figure 4 reports additional performance curves for different initial points . These experiments tend to show that, despite their computational overhead, Barvinok’s typical tables seem to provide a better initial point than independence tables in terms of average performance. Please note that in Figures 1 and 2, and in Figures 3 and 4.