Learning Local Invariant Mahalanobis Distances

02/04/2015 ∙ by Ethan Fetaya, et al.

For many tasks and data types, there are natural transformations to which the data should be invariant or insensitive. For instance, in visual recognition, natural images should be insensitive to rotation and translation. This requirement and its implications have been important in many machine learning applications, and tolerance for image transformations was primarily achieved by using robust feature vectors. In this paper we propose a novel and computationally efficient way to learn a local Mahalanobis metric per datum, and show how we can learn a local invariant metric to any transformation in order to improve performance.




1 Related work

Metric learning is an active research field with many algorithms, generally divided into linear methods Weinberger and Saul (2009), which learn a Mahalanobis distance; non-linear methods Kedem et al. (2014), which learn a nonlinear transformation and use a distance on the transformed space; and local methods, which learn a metric per datum. The LMNN and MLMNN algorithms Weinberger and Saul (2009) are considered the leading metric learning methods. For a recent comprehensive survey that covers linear, non-linear and local methods see Bellet et al. (2014).

The exemplar-SVM algorithm Malisiewicz et al. (2012) can be seen as a local similarity measure. It is obtained by maximizing margins with a linear model and, like our work, is weakly supervised. Unlike exemplar-SVM, we learn a Mahalanobis matrix and can learn an invariant metric. Another related work is PMLM Wang et al. (2012), which also finds a local Mahalanobis metric for each data point. However, this method uses global constraints, and therefore cannot work with weakly supervised data, i.e. a single positive example. None of the techniques above learns local invariant metrics.

The most common way to achieve invariance, or at least insensitivity, to a transformation in computer vision applications is by using hand-crafted descriptors such as SIFT Lowe (2004) or HOG Dalal and Triggs (2005). Another way, used in convolutional networks Lecun et al. (1998), is to add pooling and subsampling, forcing the net to be insensitive to small translations. It is important to note that transformations such as rotations have a global behaviour, i.e. there is a global consistency between the pixel movements. This global consistency is not totally captured by the pooling and subsampling. As we will see in our experiments, using an invariant metric can be useful even when working with robust features such as HOG.

2 Local Mahalanobis

In this section we will show how a local Mahalanobis distance with maximal margin can be learned in a fast and simple way.

We will assume that we are given a single query image $x_0 \in \mathbb{R}^d$ that belongs to some class, e.g. a face of a person. We will also be given a set of negative data $x_1, \dots, x_n \in \mathbb{R}^d$ that do not belong to that class, e.g. a set of face images of various other people. We will learn a local Mahalanobis metric for $x_0$, $M(x_0) \succeq 0$, where $M \succeq 0$ means $M$ is positive semi-definite. For matrices $A, B$, we will denote by $\|A\|_F$ the Frobenius norm and by $\langle A, B \rangle$ the standard inner product $\langle A, B \rangle = \mathrm{tr}(A^T B)$.

We wish to find a Mahalanobis matrix $M$ given the positive datum $x_0$ and the negative data $x_1, \dots, x_n$. Large margin methods have been very successful in metric learning Weinberger and Saul (2009), and more generally in machine learning. Therefore, our algorithm will look for the PSD matrix that maximizes the distance to the closest negative example

$$\max_{M \succeq 0} \; \min_{1 \le i \le n} \; (x_i - x_0)^T M (x_i - x_0).$$

The optimization cannot be solved as it is not bounded, since multiplying $M$ by a scalar multiplies the minimum distance by the same scalar. This can be solved by normalizing $M$ to have $\|M\|_F = 1$. As normally done with margin methods, we can minimize the norm under a fixed margin constraint instead of maximizing the margin under a fixed norm constraint. The resulting objective is

$$\min_{M} \; \frac{1}{2}\|M\|_F^2 \quad \text{s.t.} \quad (x_i - x_0)^T M (x_i - x_0) \ge 2 \;\; \forall i \in \{1, \dots, n\}, \quad M \succeq 0, \tag{4}$$

where the constant 2 is arbitrary and will be convenient later on. While this is a convex semidefinite programming task, it is very slow for reasonable dimensional data (in the thousands) even for state-of-the-art solvers. This is because PSD solvers apply a projection to the semidefinite cone, performing an expensive singular value or eigen decomposition at each iteration.
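To illustrate where this cost comes from, here is a minimal NumPy sketch of the Euclidean projection onto the semidefinite cone, the step that projection-based solvers repeat at every iteration. The function name `project_psd` is ours for illustration and is not taken from any particular solver. The full eigendecomposition inside it scales cubically with the matrix dimension.

```python
import numpy as np

def project_psd(A):
    """Frobenius-norm projection of a square matrix onto the PSD cone."""
    # Symmetrize first so that eigh (which assumes symmetry) applies.
    S = 0.5 * (A + A.T)
    # Full eigendecomposition: the cubic-cost step that dominates each iteration.
    eigvals, eigvecs = np.linalg.eigh(S)
    # Clip negative eigenvalues to zero and rebuild the matrix.
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T
```

For a matrix in the thousands of dimensions, every call performs a dense eigendecomposition, which is what the approach developed in this section avoids.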

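To solve this optimization in a fast manner we will first relax the PSD constraint and look at the following objective

$$\min_{M} \; \frac{1}{2}\|M\|_F^2 \quad \text{s.t.} \quad (x_i - x_0)^T M (x_i - x_0) \ge 2 \;\; \forall i \in \{1, \dots, n\}. \tag{5}$$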

We will then see how this is equivalent to a kernel SVM problem with a quadratic kernel, and therefore can be solved easily with off-the-shelf SVM solvers such as LIBSVM Chang and Lin (2011). Finally we will show how the solution of objective 5 is in fact the solution of objective 4 resulting in a fast solution to objective 4 without any matrix decomposition.

Theorem 1.

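The solution of objective 5 is given by running kernel SVM with kernel $k(x, y) = \langle x, y \rangle^2$ on inputs $\tilde{x}_0, \tilde{x}_1, \dots, \tilde{x}_n$ where

$$\tilde{x}_i = x_i - x_0, \qquad i = 0, 1, \dots, n.$$

Define $\phi(x) = x x^T$, a function that maps a column vector to a matrix. This function has the following simple properties:

  • $\langle \phi(x), \phi(y) \rangle = \langle x, y \rangle^2$, i.e. the function $\phi$ is the mapping associated with the quadratic kernel.

  • For any matrix $M$, we have $\langle M, \phi(x) \rangle = x^T M x$.

which can be easily verified using $\langle A, B \rangle = \mathrm{tr}(A^T B)$. We can define auxiliary labelling $y_0 = -1$ and $y_i = 1$ for $i \ge 1$. Combining everything, objective 5 can be rewritten as

$$\min_{M} \; \frac{1}{2}\|M\|_F^2 \quad \text{s.t.} \quad y_i \left( \langle M, \phi(\tilde{x}_i) \rangle - 1 \right) \ge 1 \;\; \forall i \in \{0, 1, \dots, n\}, \tag{6}$$

where we can include $i = 0$ as $\langle M, \phi(\tilde{x}_0) \rangle = \langle M, \phi(0) \rangle = 0$ for any matrix $M$.

Objective 6 is exactly an SVM problem with quadratic kernel, with bias fixed to one, given inputs $(\tilde{x}_i, y_i)$, proving the theorem. Notice also that for the identity matrix $I$ we have $\langle I, \phi(\tilde{x}_i) \rangle = \|\tilde{x}_i\|^2 > 0$ for $i \ge 1$ (as long as $x_i \ne x_0$) and $\langle I, \phi(\tilde{x}_0) \rangle = 0$ for $i = 0$, therefore the data is separable by a suitably scaled $I$ and the optimization is feasible. ∎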
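The two properties of the rank-one map $\phi(x) = x x^T$ used in the proof can be checked numerically. The following NumPy sketch does so on random inputs. The helper names `phi` and `frob_inner` are ours, and the matrix inner product is taken to be $\mathrm{tr}(A^T B)$.

```python
import numpy as np

def phi(x):
    """Map a vector x to the rank-one matrix x x^T."""
    return np.outer(x, x)

def frob_inner(A, B):
    """Standard matrix inner product <A, B> = tr(A^T B)."""
    return np.trace(A.T @ B)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(4)
M = rng.standard_normal((4, 4))  # arbitrary matrix, not necessarily symmetric

# Property 1: <phi(x), phi(y)> equals the quadratic kernel <x, y>^2.
kernel_gap = abs(frob_inner(phi(x), phi(y)) - np.dot(x, y) ** 2)

# Property 2: <M, phi(x)> equals the quadratic form x^T M x.
form_gap = abs(frob_inner(M, phi(x)) - x @ M @ x)
```

Both gaps are zero up to floating-point rounding, which is what lets the margin constraints be written as linear constraints on $M$ in the feature space of the quadratic kernel.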

Now that we have shown how objective 5 can be converted into a standard SVM form, for which efficient solvers exist, we will show that its solution is also the solution to objective 4.

Theorem 2.

The solution to objective 5 is the solution to objective 4.


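To prove the theorem it suffices to show that the solution is indeed positive semidefinite. A well known observation arising from the dual formulation of the SVM objective Burges (1998) is that the optimal solution has the form

$$M = \sum_{i=0}^{n} \alpha_i y_i \phi(\tilde{x}_i), \qquad \alpha_i \ge 0.$$

Since for any $x$ we have $\phi(x) = x x^T \succeq 0$, as its only nonzero eigenvalue is $\|x\|^2$, $y_i = 1$ for $i \ge 1$, and $\phi(\tilde{x}_0) = 0$ for $i = 0$, we get

$$M = \sum_{i=1}^{n} \alpha_i \tilde{x}_i \tilde{x}_i^T \succeq 0, \tag{8}$$

where the positive semidefiniteness in eq. 8 is assured due to the set of PSD matrices being a convex cone. ∎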

Combining theorem 1 with theorem 2 we get that in order to solve objective 4 it is enough to run an SVM solver with a quadratic kernel function, thus avoiding any matrix decomposition.
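As a rough illustration of this reduction, the sketch below solves the relaxed problem through its dual. It is not the LIBSVM pipeline used in the paper: to stay self-contained it swaps in a plain coordinate-ascent loop written in NumPy. The assumptions are ours: with the bias fixed, the dual has no equality constraint and reads $\max_{\alpha \ge 0} \; 2\sum_i \alpha_i - \frac{1}{2}\alpha^T K \alpha$ with $K_{ij} = \langle \tilde{x}_i, \tilde{x}_j \rangle^2$; the margin is hard; and no negative coincides with the query. The names `fit_local_mahalanobis` and `assemble_metric` are hypothetical.

```python
import numpy as np

def fit_local_mahalanobis(x0, negatives, n_sweeps=1000, tol=1e-10):
    """Sketch: min 1/2 ||M||_F^2  s.t.  (x_i - x0)^T M (x_i - x0) >= 2.

    Returns dual coefficients alpha >= 0 and the centred negatives X,
    so that M = sum_i alpha[i] * outer(X[i], X[i]).
    """
    X = np.asarray(negatives, dtype=float) - np.asarray(x0, dtype=float)
    K = (X @ X.T) ** 2                      # quadratic kernel <x_i, x_j>^2
    alpha = np.zeros(len(X))
    for _ in range(n_sweeps):
        max_step = 0.0
        for i in range(len(X)):
            # Exact maximisation of the dual over alpha[i], clipped at zero.
            grad_i = 2.0 - K[i] @ alpha
            new_ai = max(0.0, alpha[i] + grad_i / K[i, i])
            max_step = max(max_step, abs(new_ai - alpha[i]))
            alpha[i] = new_ai
        if max_step < tol:                  # stop once a full sweep changes nothing
            break
    return alpha, X

def assemble_metric(alpha, X):
    """Form M explicitly (only sensible for small dimension)."""
    return (X.T * alpha) @ X

# Toy example: query at the origin and three negatives in the plane.
x0 = np.zeros(2)
negatives = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
alpha, X = fit_local_mahalanobis(x0, negatives)
M = assemble_metric(alpha, X)
```

On this toy input the two axis-aligned negatives become support vectors with coefficient 2, the third negative is inactive, and the resulting matrix is twice the identity. Because every coefficient is non-negative, the assembled matrix is positive semidefinite without any projection step.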

Looking at this as an SVM problem has further benefits. The SVM solvers do not compute $M$ directly, but return the set of support vectors $\tilde{x}_i$ and coefficients $\alpha_i$ such that $M = \sum_i \alpha_i \tilde{x}_i \tilde{x}_i^T$. This allows us to work in high dimension $d$, where the memory needed to store the $d \times d$ matrix $M$ can be a problem, and can slow computations further. As the rank of the matrix is bounded by the number of support vectors, one can see that in many applications we get a relatively low rank matrix. This bound on the rank can be improved by using sparse-SVM algorithms Cotter et al. (2013). In practice we got low rank matrices without resorting to sparse solvers.
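A minimal sketch of how this representation can be used: the squared distance to the query is evaluated from the support vectors and coefficients alone, as a weighted sum of squared inner products, without ever forming the full matrix. The function name `local_distance_sq` and the random support vectors and coefficients are illustrative assumptions, not output of an actual solver.

```python
import numpy as np

def local_distance_sq(x, x0, support_vectors, alpha):
    """Squared local Mahalanobis distance without forming M.

    With M = sum_i alpha_i s_i s_i^T we have
    (x - x0)^T M (x - x0) = sum_i alpha_i <s_i, x - x0>^2.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    projections = np.asarray(support_vectors, dtype=float) @ diff
    return float(np.dot(alpha, projections ** 2))

rng = np.random.default_rng(0)
d, m = 50, 3                            # data dimension, number of support vectors
S = rng.standard_normal((m, d))         # hypothetical centred support vectors
alpha = rng.uniform(0.1, 1.0, size=m)   # hypothetical non-negative coefficients
x0, x = rng.standard_normal(d), rng.standard_normal(d)

M = (S.T * alpha) @ S                   # explicit matrix, built for comparison only
explicit = (x - x0) @ M @ (x - x0)
implicit = local_distance_sq(x, x0, S, alpha)
```

The implicit form stores one vector per support vector instead of a full square matrix, and the explicit matrix built here has rank at most the number of support vectors.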

3 Local Invariant Mahalanobis

For some applications, we know a priori that certain transformations should have a small effect on the metric. We will show how to include this knowledge into the local metric we learn, learning locally invariant metrics. In section 4 we will see this has a major effect on performance.

Assume we know a set of functions $T_1, \dots, T_k$ that the desired metric should be insensitive to, i.e. the distance $d(x, T_j(x))$ should be small for all $x$ and $j$. A canonical example is small rotations and translations on natural images. One of the major issues in computer vision arises from the instability of the pixel representation to these transformations. Various descriptors such as SIFT Lowe (2004) and HOG Dalal and Triggs (2005) offer a more robust representation, and have been highly successful in many computer vision applications. We will show in section 4 that even when using a relatively robust representation such as HOG, learning an invariant metric has a significant impact.

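A natural way to mathematically formulate the idea of being insensitive to a transformation is to require the leading term of the approximation to vanish in that direction. In our case this means

$$\frac{\partial^2 d^2(x_0, x)}{\partial v_j^2} \bigg|_{x = x_0} = 0, \qquad j = 1, \dots, k,$$

where $v_j$ is the direction in which the transformation $T_j$ moves $x_0$. If we return to our basic intuition of the local Mahalanobis matrix as the Hessian matrix of the squared distance at $x_0$, we can now state the new local invariant Mahalanobis objective

$$\min_{M} \; \frac{1}{2}\|M\|_F^2 \quad \text{s.t.} \quad (x_i - x_0)^T M (x_i - x_0) \ge 2 \;\; \forall i, \quad v_j^T M v_j = 0 \;\; \forall j, \quad M \succeq 0. \tag{10}$$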

We will show how, by applying a small transformation to the data, we can reduce this to objective 4, which we can solve easily.

Theorem 3.

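Define $V = \mathrm{span}\{v_1, \dots, v_k\}$, the span of the directions $v_j$ constrained in objective 10. Then the minimizer of objective 4 with $\tilde{x}_i = x_i - x_0$ replaced by $\tilde{x}_i^{\perp}$, its projection to $V^{\perp}$, is the minimizer of objective 10.


For PSD matrices, the constraint that $v^T M v = 0$ is equivalent to $M v = 0$. This can be seen if we write the vector $v$ in the basis of eigenvectors of $M$, and notice that components with positive eigenvalues have a positive contribution to the quadratic form. This means that $M v = 0$ for all $v \in V$. Each vector $\tilde{x}_i$ can be split into two orthogonal elements, $\tilde{x}_i = \tilde{x}_i^{\parallel} + \tilde{x}_i^{\perp}$, where $\tilde{x}_i^{\parallel}$ is its projection onto $V$ and $\tilde{x}_i^{\perp}$ is its projection onto $V^{\perp}$. Our equality constraints now imply

$$\tilde{x}_i^T M \tilde{x}_i = (\tilde{x}_i^{\perp})^T M \tilde{x}_i^{\perp},$$

since all the other terms vanish. We can now rewrite objective 10 as

$$\min_{M} \; \frac{1}{2}\|M\|_F^2 \quad \text{s.t.} \quad (\tilde{x}_i^{\perp})^T M \tilde{x}_i^{\perp} \ge 2 \;\; \forall i, \quad M v_j = 0 \;\; \forall j, \quad M \succeq 0.$$

If we forget the equality constraints we get objective 4 with $\tilde{x}_i$ replaced by $\tilde{x}_i^{\perp}$. To finish the proof we need to show that the solution to the optimization without the equality constraints does indeed satisfy them.

As we have already seen in the proof of theorem 2, the optimal solution is of the form $M = \sum_i \alpha_i \tilde{x}_i^{\perp} (\tilde{x}_i^{\perp})^T$. The vector $\tilde{x}_i^{\perp}$ is a member of $V^{\perp}$, so for $v \in V$

$$M v = \sum_i \alpha_i \tilde{x}_i^{\perp} \langle \tilde{x}_i^{\perp}, v \rangle = 0,$$

proving that the solution satisfies the equality constraints. ∎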
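The preprocessing step in the theorem can be sketched in a few lines of NumPy. The sketch assumes the invariance directions are given as the rows of a matrix (for instance differences between transformed copies of the query and the query itself, which is our assumption about how they would be built) and that they are linearly independent. The helper name `project_out` is ours. The final check mirrors the last step of the proof: any matrix of the dual form built from the projected data annihilates every invariance direction.

```python
import numpy as np

def project_out(X, V):
    """Project each row of X onto the orthogonal complement of span(rows of V)."""
    # Orthonormal basis Q of span{v_1, ..., v_k} from a thin QR factorisation.
    Q, _ = np.linalg.qr(np.asarray(V, dtype=float).T)
    return X - (X @ Q) @ Q.T

rng = np.random.default_rng(0)
d, n, k = 6, 10, 2
x0 = rng.standard_normal(d)
negatives = rng.standard_normal((n, d))
V = rng.standard_normal((k, d))          # hypothetical invariance directions v_j

# Preprocessing: centre on the query, then remove the components lying in V.
X_perp = project_out(negatives - x0, V)

# Any M = sum_i alpha_i x_perp_i x_perp_i^T with alpha >= 0 ...
alpha = rng.uniform(0.0, 1.0, size=n)
M = (X_perp.T * alpha) @ X_perp

# ... satisfies M v_j = 0, hence v_j^T M v_j = 0, for every direction v_j.
residual = np.abs(M @ V.T).max()
```

The projected vectors can then be passed to the same solver as in the non-invariant case, which is why the invariant variant costs roughly the same.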

A few comments are worth noting about this formulation. First, the problem may not be linearly separable, although in our experiments with real data we did not encounter any inseparable case. This can be easily solved, if needed, by the standard method of adding slack variables. Second, the algorithm just adds a simple preprocessing step to the previous algorithm and runs in approximately the same time.

4 Experiments

4.1 Running time

We compared running the optimization with an SVM solver Chang and Lin (2011) to solving it as a semidefinite problem and as a quadratic problem (relaxing the semidefinite constraint). The main limitation when running off-the-shelf solvers is memory. Quadratic or semidefinite solvers need a constraint matrix, which in our case is a full matrix of size $n \times d^2$ where $n$ is the number of samples and $d$ is the data dimension. We tested all three approaches on the MNIST dataset of dimension 784 using only 5000 negative examples, as this already resulted in a matrix of size 24.6 GB.
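The quoted size can be reproduced with a short calculation, assuming one row of squared-dimension coefficients per negative example and 8-byte double-precision entries (the entry size is our assumption; it is the one that matches the stated figure).

```python
n_samples = 5000          # negative examples used in the timing experiment
dim = 784                 # MNIST data dimension
bytes_per_entry = 8       # assumption: double-precision floats

entries = n_samples * dim ** 2               # one row of dim^2 coefficients per constraint
size_gb = entries * bytes_per_entry / 1e9    # decimal gigabytes
print(f"{size_gb:.1f} GB")                   # prints "24.6 GB"
```

The quadratic growth in the dimension is what makes the explicit constraint matrix impractical well before the number of samples becomes large.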

Currently first-order methods, such as ADMM Boyd et al. (2010), are the leading approaches to solving problems such as quadratic and semidefinite programming for large matrices. We used YALMIP for modeling and solved using SCS O’Donoghue et al. (2013). The time to run this as a semidefinite program was . The time it took to run this as a quadratic program was . In comparison, when we ran this as an SVM problem it took at most . We excluded the time needed to build the constraint matrix for the quadratic and semidefinite solvers.

This order of magnitude improvement should not be a surprise. It is well known that while SVM can be solved as a quadratic program, generic quadratic solvers perform much slower than solvers designed specifically for SVM.

4.2 MNIST

| Method | Error |
|---|---|
| eSVM | 1.75% |
| eSVM+shifts | 1.59% |
| Local Mahal | 1.69% |
| quadSVM+shifts | 1.50% |
| inv-Mahal (our method) | 1.26% |
| LMNN | 1.69% |
| MLMNN | 1.18% |

Table 1: Classification error for MNIST dataset.

The MNIST dataset is a well known digit recognition dataset, comprising $28 \times 28$ grayscale images on which we perform deskewing preprocessing. For each of the training images we computed a local Mahalanobis distance and a local invariant Mahalanobis distance (using only negative examples). At test time we performed classification using the local metrics. We show some examples of nearest neighbours in Figure 1. We compared this with exemplar-SVM, as the leading technique most similar to ours. We also compared our scores to exemplar-SVM where we add the transformed images as positive training data. To show the importance of the invariance objective, we also compare to SVM with a quadratic kernel to which we add the transformed data as positive training data (unlike the way we use the shifted data). Finally, we compared our results to the state-of-the-art metric learning LMNN method (linear metric), and to MLMNN, a local version of LMNN, which learns multiple metrics (but not one per datum).

As can be seen in Table 1, we perform much better than exemplar-SVM and are comparable with MLMNN. It is important to note that unlike MLMNN, we compare each datum only to negatives, so our method is applicable in scenarios where MLMNN is not.

Figure 1: Nearest neighbour for various matrices. (a) original image (b) distance (c) exemplar-SVM (d) local-Mahalanobis (e) local invariant Mahalanobis

Another key observation is the difference between the invariant-Mahalanobis and the quadratic-SVM with shifts. While very similar functionally, we see that looking at the problem as a local Mahalanobis matrix gives important intuition, i.e. the way to use the shifted images, that leads to better performance.

4.3 Labeled Faces in the Wild (LFW)

LFW is a challenging dataset containing 13,233 face images of 5,749 different individuals with a high level of variability. The LFW dataset is divided into 10 subsets, and the task is to classify 600 pairs of images from one subset as same/not-same using the other 9 subsets as training data. We perform the unsupervised LFW task, where we do not use any labelling inside the training images we get, besides the fact that they are different from both test images.

We used the aligned images Huang et al. (2012) and represented them using HOG features Dalal and Triggs (2005). We compared our results to a cosine similarity baseline, to exemplar-SVM and to exemplar-SVM with shifts. We note that we cannot use LMNN or MLMNN on this data, as we only have negative images with a single positive image.

| Method | Error |
|---|---|
| cosine similarity | 30.57 ± 1.4% |
| eSVM | 26.90 ± 2.2% |
| eSVM+shifts | 27.12 ± 2.3% |
| Local Mahal | 19.85 ± 1.3% |
| inv-Mahal (our method) | 19.48 ± 1.5% |

Table 2: Classification error for LFW dataset.

As we can see from Table 2, the local Mahalanobis greatly outperforms the exemplar-SVM. We also see that even when using robust features such as HOG, learning an invariant metric improves performance, albeit to a lesser degree.

5 Summary

We showed an efficient way to learn a local Mahalanobis metric given a query datum and a set of negative data points. We have also shown how to incorporate prior knowledge about our data, in particular the transformations to which it should be robust, and use it to learn locally invariant metrics. We have shown that our methods are competitive with leading methods while being applicable to other scenarios where methods such as LMNN and MLMNN cannot be used.


References

  • Bar-Hillel et al. (2005) Bar-Hillel, A., Hertz, T., Shental, N., and Weinshall, D. (2005). Learning a Mahalanobis metric from equivalence constraints. JMLR.
  • Bellet et al. (2014) Bellet, A., Habrard, A., and Sebban, M. (2014). A survey on metric learning for feature vectors and structured data. arXiv:1306.6709.
  • Boyd et al. (2010) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
  • Burges (1998) Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
  • Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology.
  • Chechik et al. (2010) Chechik, G., Sharma, V., Shalit, U., and Bengio, S. (2010). Large scale online learning of image similarity through ranking. JMLR.
  • Cotter et al. (2013) Cotter, A., Shalev-Shwartz, S., and Srebro, N. (2013). Learning optimally sparse support vector machines. ICML.
  • Dalal and Triggs (2005) Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. CVPR.
  • Davis et al. (2007) Davis, J., Kulis, B., Jain, P., Sra, S., and Dhillon, I. (2007). Information-theoretic metric learning. ICML.
  • Frome et al. (2006) Frome, A., Singer, Y., and Malik, J. (2006). Image retrieval and classification using local distance functions. NIPS.
  • Guillaumin et al. (2009) Guillaumin, M., Verbeek, J., and Schmid, C. (2009). Is that you? metric learning approaches for face identification. ICCV.
  • Hoi et al. (2008) Hoi, S., Liu, W., and Chang, S.-F. (2008). Semi-supervised distance metric learning for collaborative image retrieval. CVPR.
  • Huang et al. (2012) Huang, G., Mattar, M. A., Lee, H., and Learned-Miller, E. (2012). Learning to align from scratch. NIPS.
  • Kedem et al. (2014) Kedem, D., Tyree, S., Weinberger, K., and Sha, F. (2014). Non-linear metric learning. NIPS.
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE.
  • Lim and Lanckriet (2014) Lim, D. and Lanckriet, G. (2014). Efficient learning of mahalanobis metrics for ranking. ICML.
  • Lowe (2004) Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. IJCV.
  • Malisiewicz et al. (2012) Malisiewicz, T., Gupta, A., and Efros, A. (2012). Ensemble of exemplar-svms for object detection and beyond. ICCV.
  • O’Donoghue et al. (2013) O’Donoghue, B., Chu, E., Parikh, N., and Boyd, S. (2013). Operator splitting for conic optimization via homogeneous self-dual embedding. arXiv:1312.3039.
  • Wang et al. (2012) Wang, J., Woznica, A., and Kalousis, A. (2012). Parametric local metric learning for nearest neighbor classification. NIPS.
  • Weinberger and Saul (2009) Weinberger, K. and Saul, L. (2009). Distance metric learning for large margin nearest neighbor classification. JMLR.
  • Xiang et al. (2008) Xiang, S., Nie, F., and Zhang, C. (2008). Learning a mahalanobis distance metric for data clustering and classification. Pattern Recognition.