Notations

$\langle \cdot, \cdot \rangle$ denotes an inner product.

$S_n$ denotes the set of $n \times n$ symmetric matrices.

$S_n^{++}$ denotes the set of $n \times n$ symmetric positive definite matrices.

$T_P\mathcal{M}$ denotes the tangent space to the manifold $\mathcal{M}$ at the point $P$.

$\|\cdot\|_F$ denotes the matrix Frobenius norm.

$\mathrm{Chol}(P)$ denotes the lower triangular matrix obtained from the Cholesky decomposition of a matrix $P$.

$\exp(\cdot)$ and $\log(\cdot)$ denote the matrix exponential and logarithm respectively.

$\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$ represent partial derivatives.
1 Introduction
Many computer vision applications involve features that obey specific constraints. Such features often lie in non-Euclidean spaces, where the underlying distance metric is not the regular Euclidean norm. For instance, popular features like shapes, rotation matrices, linear subspaces, symmetric positive definite (SPD) matrices, etc. are known to lie on Riemannian manifolds. In such cases, one needs to develop inference techniques that make use of the underlying manifold structure.
Over the past few years, manifolds have been receiving considerable attention from the computer vision community. In this work, we focus our attention on the set of SPD matrices. Examples of SPD matrices in computer vision include diffusion tensors [1], structure tensors [2] and covariance region descriptors [3]. Diffusion tensors arise naturally in medical imaging [1]. In diffusion tensor magnetic resonance imaging (DTMRI), water diffusion in tissues is represented by a diffusion tensor characterizing the anisotropy within the tissue. In optical flow estimation and motion segmentation, structure tensors are often employed to encode important image features, such as texture and motion [2]. Covariance region descriptors are used in texture classification [3], object detection [4], object tracking, action recognition and face recognition [5]. There are several advantages to using covariance matrices as region descriptors. Covariance matrices provide a natural way of fusing multiple features which might be correlated. The diagonal entries of a covariance matrix represent the variances of individual features and the off-diagonal entries represent the cross-correlations. The noise corrupting individual samples is largely filtered out by the averaging inherent in covariance computation. Covariance matrices are low-dimensional compared to joint feature histograms. Covariance matrices retain no information regarding the ordering and the number of points, which implies a certain level of scale and rotation invariance over the regions in different images.
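As a concrete illustration of the covariance region descriptor discussed above, the following minimal NumPy sketch computes the sample covariance of per-pixel feature vectors over a region (the function name and the random test region are illustrative, not from the cited works):

```python
import numpy as np

def covariance_descriptor(features):
    """Covariance region descriptor in the spirit of [3].

    features: (num_pixels, d) array, one d-dimensional feature
    vector per pixel in the region.
    Returns the d x d sample covariance matrix (an SPD matrix
    whenever the features are non-degenerate).
    """
    mu = features.mean(axis=0)
    centered = features - mu
    # Unbiased sample covariance; the averaging suppresses
    # noise corrupting individual pixel samples.
    return centered.T @ centered / (features.shape[0] - 1)

# Toy region: 500 pixels, 9 features per pixel.
region = np.random.default_rng(0).normal(size=(500, 9))
C = covariance_descriptor(region)
```

With 9-dimensional pixel features this yields a 9 x 9 descriptor, far smaller than a joint feature histogram over the same features.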
Various distance measures have been proposed in the literature for the comparison of SPD matrices. Among them, the two most widely-used distance measures are the affine-invariant distance [1] and the log-Frobenius distance [6] (also referred to as the log-Euclidean distance in the literature). The main reason for their popularity is that they are geodesic distances induced by Riemannian metrics.
The log-Euclidean framework [6] proposed by Arsigny et al. defines a class of Riemannian metrics, rather than a single metric, called log-Euclidean Riemannian metrics. According to this framework, any inner product $\langle \cdot, \cdot \rangle$ defined on $S_n$ extended to $S_n^{++}$ by left or right multiplication is a bi-invariant Riemannian metric. Equipped with this bi-invariant metric, the space of SPD matrices is a flat Riemannian space, and the geodesic distance corresponding to this bi-invariant Riemannian metric is equal to the distance induced by $\langle \cdot, \cdot \rangle$ in $S_n$. Surprisingly, this remarkable result has not been used by the computer vision community. Since $S_n$ is a vector space, this result allows us to learn log-Euclidean Riemannian metrics and the corresponding log-Euclidean geodesic distances from the data by using Mahalanobis distance learning techniques like information-theoretic metric learning (ITML) [7] and large margin nearest neighbor distance learning [8] in $S_n$. In this work, we explore this idea of data-driven Riemannian metrics/geodesic distances for the set of SPD matrices. For learning Mahalanobis distances in $S_n$, we use the ITML technique.

Organization: In section 2, we provide a brief overview of various distance measures used in the literature to compare SPD matrices. We briefly explain the ITML technique in section 3 and present our approach for learning log-Euclidean Riemannian metrics/log-Euclidean geodesic distances from the data in section 4. We provide experimental results in section 5 and conclude the paper in section 6.
2 Distances to compare SPD matrices
Various distance measures have been used in the literature to compare SPD matrices. Each distance has been derived from different geometrical, statistical or information-theoretic considerations. Though many of these distances try to capture the non-linearity of SPD matrices, not all of them are geodesic distances induced by Riemannian metrics. Tables 1 and 2 summarize these distances and their properties. Among them, the log-Frobenius distance [6] and the affine-invariant distance [1] are the most popular ones.
3 Mahalanobis distance learning using ITML
Information-theoretic metric learning [7] is a technique for learning Mahalanobis distance functions from the data based on similarity and dissimilarity constraints. Let $\{x_1, x_2, \ldots, x_N\}$ be a set of points in $\mathbb{R}^n$. Given pairs of similar points $\mathcal{S}$ and pairs of dissimilar points $\mathcal{D}$, the aim of ITML is to learn an SPD matrix $A$ such that the Mahalanobis distance parametrized by $A$, $d_A(x_i, x_j) = (x_i - x_j)^\top A (x_i - x_j)$, is below a given threshold $u$ for similar pairs of points and above a given threshold $l$ for dissimilar pairs of points.
Let $D_{ld}$ denote the LogDet divergence between SPD matrices, defined as

$$D_{ld}(A, B) = \mathrm{tr}(A B^{-1}) - \log\det(A B^{-1}) - n. \qquad (1)$$
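The divergence in Eq. (1) is straightforward to compute numerically; a minimal sketch (the helper name is ours, and `slogdet` is used for numerical stability):

```python
import numpy as np

def logdet_divergence(A, B):
    """LogDet divergence D_ld(A, B) = tr(A B^-1) - log det(A B^-1) - n."""
    n = A.shape[0]
    M = A @ np.linalg.inv(B)
    sign, logdet = np.linalg.slogdet(M)  # stable log-determinant
    return np.trace(M) - logdet - n

A = np.diag([2.0, 3.0])
B = np.eye(2)
# D_ld(A, A) = 0, and the divergence is nonnegative for SPD arguments.
d = logdet_divergence(A, B)  # tr(A) - log det(A) - 2 = 3 - log(6)
```

Note the divergence is not symmetric in its arguments, which is why ITML uses it as a regularizer toward a prior matrix rather than as a distance.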
ITML formulates the Mahalanobis matrix learning as the following optimization problem:

$$\begin{aligned} \min_{A \succeq 0,\; \xi} \quad & D_{ld}(A, A_0) + \gamma \, D_{ld}(\mathrm{diag}(\xi), \mathrm{diag}(\xi_0)) \\ \text{subject to} \quad & \mathrm{tr}\big(A (x_i - x_j)(x_i - x_j)^\top\big) \le \xi_{c(i,j)}, \quad (x_i, x_j) \in \mathcal{S}, \\ & \mathrm{tr}\big(A (x_i - x_j)(x_i - x_j)^\top\big) \ge \xi_{c(i,j)}, \quad (x_i, x_j) \in \mathcal{D}, \end{aligned} \qquad (2)$$
where $c(i,j)$ denotes the index of the $(i,j)$-th constraint, $\xi$ is the vector of slack variables $\xi_{c(i,j)}$, $\xi_0$ is a vector whose components equal $u$ for similarity constraints and $l$ for dissimilarity constraints, $A_0$ is an SPD matrix that captures the prior knowledge about $A$, and $\gamma$ is a parameter controlling the tradeoff between satisfying the constraints and minimizing $D_{ld}(A, A_0)$. This optimization problem can be solved efficiently using Bregman iterations. In this work, we use the publicly available ITML code provided by the authors of [7].
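Each Bregman iteration projects the current matrix onto a single constraint via a rank-one update. A simplified sketch of one such projection is shown below (equality constraint, no slack variables; the helper name is ours, and the full ITML algorithm in [7] additionally manages slacks and dual variables):

```python
import numpy as np

def bregman_project(A, z, target):
    """Project A onto {A : z^T A z = target} under the LogDet divergence.

    For the LogDet divergence, this projection has a closed-form
    rank-one update: A' = A + beta * (A z)(A z)^T, with beta chosen
    so the constraint holds exactly. A' stays SPD whenever target > 0.
    """
    p = z @ A @ z                      # current Mahalanobis distance of the pair
    beta = (target - p) / p**2         # z^T A' z = p + beta * p^2 = target
    Az = A @ z
    return A + beta * np.outer(Az, Az)

A = np.eye(3)
z = np.array([1.0, 2.0, 0.0])          # difference x_i - x_j of a similar pair
A_new = bregman_project(A, z, target=1.0)  # shrink the pair's distance to 1.0
```

Cycling such projections over all constraints until convergence is the essence of the Bregman iteration scheme used by ITML.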
ITML parameters: We need to specify values for the following parameters while using ITML: $A_0$, $\gamma$, $u$ and $l$. We choose the constraint thresholds $u$ and $l$ as lower and upper percentiles of the observed distribution of distances between pairs of points within the training dataset. Hence, the remaining parameters for the ITML algorithm are $A_0$ and $\gamma$.
Table 1: Distances used to compare SPD matrices and their metric properties.

Distance | Formula | Symmetric | Triangle inequality | Geodesic
Frobenius | $\|P_1 - P_2\|_F$ | Yes | Yes | No
Cholesky-Frobenius [13] | $\|\mathrm{Chol}(P_1) - \mathrm{Chol}(P_2)\|_F$ | Yes | Yes | No
J-divergence [12] | $\sqrt{\tfrac{1}{2}\,\mathrm{tr}(P_1^{-1} P_2 + P_2^{-1} P_1) - n}$ | Yes | No | No
Jensen-Bregman LogDet divergence [11] | $\sqrt{\log\det\big(\tfrac{P_1 + P_2}{2}\big) - \tfrac{1}{2}\log\det(P_1 P_2)}$ | Yes | No | No
Affine-invariant [1] | $\|\log(P_1^{-1/2} P_2 P_1^{-1/2})\|_F$ | Yes | Yes | Yes
Log-Frobenius [6] | $\|\log(P_1) - \log(P_2)\|_F$ | Yes | Yes | Yes
Table 2: Invariance properties of distances used to compare SPD matrices.

Distance | Distance to the boundary of $S_n^{++}$ | Affine invariance | Scale invariance | Rotation invariance | Inversion invariance
Frobenius | Finite | No | No | Yes | No
Cholesky-Frobenius [13] | Finite | No | No | No | No
J-divergence [12] | Infinite | Yes | Yes | Yes | Yes
Jensen-Bregman LogDet divergence [11] | Infinite | Yes | Yes | Yes | Yes
Affine-invariant [1] | Infinite | Yes | Yes | Yes | Yes
Log-Frobenius [6] | Infinite | No | Yes | Yes | Yes
4 LogEuclidean Riemannian metric learning
The log-Euclidean framework [6] proposed by Arsigny et al. defines a class of Riemannian metrics called log-Euclidean metrics. The geodesic distances associated with log-Euclidean metrics are called log-Euclidean distances. Let $\odot$ be an operation on SPD matrices defined as $P_1 \odot P_2 = \exp(\log(P_1) + \log(P_2))$, under which $S_n^{++}$ is a commutative Lie group. We have the following result based on the log-Euclidean framework introduced in [6]:
Result 4.1: Any inner product $\langle \cdot, \cdot \rangle$ defined on $S_n$ extended to the Lie group $(S_n^{++}, \odot)$ by left or right multiplication is a bi-invariant Riemannian metric. The corresponding geodesic distance between $P_1$ and $P_2$ is given by

$$d_{\mathrm{geo}}(P_1, P_2) = \|\log(P_1) - \log(P_2)\|, \qquad (3)$$

where $\|\cdot\|$ is the norm induced by $\langle \cdot, \cdot \rangle$. Note that $\log$ here is the inverse-exponential map at the identity matrix, which is equal to the usual matrix logarithm in this case.
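For the special case where the inner product is the Frobenius one, the geodesic distance of Eq. (3) reduces to the log-Frobenius distance and can be computed as below (a minimal sketch; the matrix logarithm is taken via eigendecomposition, which is valid for SPD input):

```python
import numpy as np

def spd_log(P):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(P)           # real eigenvalues, orthonormal eigenvectors
    return (V * np.log(w)) @ V.T       # V diag(log w) V^T

def log_frobenius_distance(P1, P2):
    """Eq. (3) when <.,.> is the Frobenius inner product."""
    return np.linalg.norm(spd_log(P1) - spd_log(P2), ord='fro')

# log(e*I) = I and log(I) = 0, so the distance is ||I||_F = sqrt(2).
P1 = np.diag([np.e, np.e])
d = log_frobenius_distance(P1, np.eye(2))
```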
The set of all $n \times n$ symmetric matrices forms a vector space of dimension $m = n(n+1)/2$. Let $\mathrm{vec}(S)$ denote the column vector form of the upper triangular part of a matrix $S \in S_n$. This operation provides an $m$-dimensional vector representation for $S_n$. Let $\langle \cdot, \cdot \rangle$ be an inner product defined on the vector space $S_n$ and $G$ be the corresponding matrix of inner products between the basis vectors corresponding to the $\mathrm{vec}$ representation. Note that $\langle \cdot, \cdot \rangle$ is uniquely characterized by $G$. The distance between two matrices $S_1$ and $S_2$ induced by this inner product is given by

$$d(S_1, S_2) = \sqrt{(\mathrm{vec}(S_1) - \mathrm{vec}(S_2))^\top \, G \, (\mathrm{vec}(S_1) - \mathrm{vec}(S_2))}. \qquad (4)$$
Result 4.2: Let $d_M(S_1, S_2) = \sqrt{(\mathrm{vec}(S_1) - \mathrm{vec}(S_2))^\top M \, (\mathrm{vec}(S_1) - \mathrm{vec}(S_2))}$, where $M \in S_m^{++}$ with $m = n(n+1)/2$. Then, $d_M$ defines a unique inner product, denoted by $\langle \cdot, \cdot \rangle_M$, on $S_n$. This inner product also defines a log-Euclidean Riemannian metric which can be obtained by simply extending $\langle \cdot, \cdot \rangle_M$ to the Lie group $(S_n^{++}, \odot)$ by left or right multiplication. The corresponding log-Euclidean geodesic distance between $P_1$ and $P_2$ is given by

$$d_{\mathrm{geo}}(P_1, P_2) = \sqrt{(\mathrm{vec}(\log P_1) - \mathrm{vec}(\log P_2))^\top M \, (\mathrm{vec}(\log P_1) - \mathrm{vec}(\log P_2))}. \qquad (5)$$
The above result follows directly from result 4.1. Result 4.2 says that any Mahalanobis distance defined in the vector space $S_n$ is a geodesic distance on $S_n^{++}$, and the corresponding Riemannian metric is uniquely defined by the Mahalanobis matrix $M$. Hence, we can learn Riemannian metrics/geodesic distances for $S_n^{++}$ from the data by learning Mahalanobis distance functions in the vector space $S_n$. Table 3 summarizes our approach for learning geodesic distances on $S_n^{++}$. In this work, we use the ITML technique for Mahalanobis distance learning.
Table 3: Learning log-Euclidean geodesic distances on $S_n^{++}$.

Input: Training SPD matrices $P_1, \ldots, P_N$ with pairwise similarity/dissimilarity constraints.
for $i = 1$ to $N$: compute $s_i = \mathrm{vec}(\log(P_i))$. end
Learn a Mahalanobis distance function using $\{s_1, \ldots, s_N\}$ and the given constraints. This gives a Mahalanobis matrix $M \in S_m^{++}$, where $m = n(n+1)/2$.
Output: Geodesic distance between $P_1$ and $P_2$:
$d_{\mathrm{geo}}(P_1, P_2) = \sqrt{(\mathrm{vec}(\log P_1) - \mathrm{vec}(\log P_2))^\top M \, (\mathrm{vec}(\log P_1) - \mathrm{vec}(\log P_2))}$.
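The distance computation in the last step of Table 3 can be sketched as follows (a minimal illustration; the identity matrix stands in for the Mahalanobis matrix $M$ that ITML would learn, and all function names are ours):

```python
import numpy as np

def vec(S):
    """Stack the upper-triangular part of a symmetric matrix into a vector."""
    return S[np.triu_indices(S.shape[0])]

def spd_log(P):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return (V * np.log(w)) @ V.T

def learned_geodesic_distance(P1, P2, M):
    """Log-Euclidean geodesic distance parametrized by a Mahalanobis matrix M."""
    diff = vec(spd_log(P1)) - vec(spd_log(P2))
    return np.sqrt(diff @ M @ diff)

n = 3
m = n * (n + 1) // 2            # dimension of the vec representation
M = np.eye(m)                    # placeholder for the matrix learned by ITML

rng = np.random.default_rng(0)
A = rng.normal(size=(n, n)); P1 = A @ A.T + n * np.eye(n)   # random SPD matrices
B = rng.normal(size=(n, n)); P2 = B @ B.T + n * np.eye(n)
d = learned_geodesic_distance(P1, P2, M)
```

Once $M$ has been learned, evaluating the geodesic distance costs one eigendecomposition per matrix plus a single quadratic form.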
5 Experiments
In this section, we evaluate the performance of the proposed Riemannian metric/geodesic distance learning approach on two applications: (i) face matching using the Labeled Faces in the Wild (LFW) dataset and (ii) semi-supervised clustering using the ETH80 dataset.
5.1 Face matching using LFW face dataset
In this experiment our aim is to predict whether a given pair of face images correspond to the same person or not.
Dataset: The LFW dataset [9] is a collection of face photographs designed for studying the problem of unconstrained face recognition. It consists of 13,233 labeled face images of 5,749 subjects collected from the web, of whom 1,680 appear in two or more images. The dataset is organized into two subsets:

Development subset: The development subset consists of 2200 training image pairs, where 1100 are similar pairs and 1100 are dissimilar pairs, and 1000 test image pairs, where 500 are similar pairs and 500 are dissimilar pairs. An image pair is said to be similar if both the images correspond to the same person and dissimilar if they correspond to different persons.

Evaluation subset: The evaluation subset consists of 3000 similar image pairs and 3000 dissimilar image pairs. It is further divided into 10 subsets each of which consists of 300 similar pairs and 300 dissimilar pairs.
All the image pairs were generated by randomly selecting images from the 13233 images in the dataset. The development subset is meant for model and parameter selection. The evaluation subset should be used only once for final training and testing. To avoid overfitting, the image pairs in the development subset were chosen to be different from the image pairs in the evaluation subset.
5.1.1 Feature extraction
We crop the face region in each image and resize it to a fixed size. Following [3], we convert each pixel in an image into a 9-dimensional feature vector given by

$$f(x, y) = \Big[\, x, \; y, \; R(x, y), \; G(x, y), \; B(x, y), \; \Big|\frac{\partial I}{\partial x}\Big|, \; \Big|\frac{\partial I}{\partial y}\Big|, \; \Big|\frac{\partial^2 I}{\partial x^2}\Big|, \; \Big|\frac{\partial^2 I}{\partial y^2}\Big| \,\Big],$$

where $(x, y)$ are the column and row coordinates respectively, $R$, $G$ and $B$ are the color coordinates and $I$ is the grayscale image. We use the $9 \times 9$ covariance matrix of these feature vectors to represent the image.
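The per-pixel feature extraction above can be sketched in NumPy as follows (a minimal illustration; simple finite-difference gradients stand in here, whereas [3] uses specific derivative filter kernels, and the function name is ours):

```python
import numpy as np

def pixel_features(rgb):
    """9-dim feature per pixel: coordinates, color, absolute image derivatives.

    rgb: (H, W, 3) float array.
    """
    H, W, _ = rgb.shape
    I = rgb.mean(axis=2)                         # simple grayscale conversion
    ys, xs = np.mgrid[0:H, 0:W]                  # row/column coordinate grids
    Ix = np.gradient(I, axis=1); Iy = np.gradient(I, axis=0)
    Ixx = np.gradient(Ix, axis=1); Iyy = np.gradient(Iy, axis=0)
    F = np.stack([xs, ys, rgb[..., 0], rgb[..., 1], rgb[..., 2],
                  np.abs(Ix), np.abs(Iy), np.abs(Ixx), np.abs(Iyy)], axis=-1)
    return F.reshape(-1, 9)                      # one 9-dim vector per pixel

img = np.random.default_rng(0).random((32, 32, 3))   # toy face crop
F = pixel_features(img)
C = np.cov(F, rowvar=False)                      # 9 x 9 covariance descriptor
```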
5.1.2 Experimental protocol
Following the standard experimental protocol for this dataset, we use the development set for selecting the parameters of ITML and then use the evaluation set only once for final training and testing. The following steps summarize our experimental procedure:

Parameter selection: We train the ITML algorithm using the 2200 training pairs of the development subset and then test it on the 1000 test pairs of the development subset. We select the ITML parameters that give the best test accuracy.

Final training and testing: The evaluation set consists of 10 splits and we perform 10-fold cross-validation. In each fold, we use 9 splits (2700 similar pairs and 2700 dissimilar pairs) for training ITML and 1 split (300 similar pairs and 300 dissimilar pairs) for testing. For training ITML, we use the parameters that were selected in the previous step. Since our task is face matching, we need to threshold the learned distance function. In each fold, we find the threshold that gives the best training accuracy and use the same threshold for the test image pairs.
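Selecting the matching threshold on the training pairs amounts to a one-dimensional search; a minimal sketch (function name and toy data are ours):

```python
import numpy as np

def best_threshold(distances, is_similar):
    """Pick the distance threshold maximizing pairwise matching accuracy.

    distances: learned distances for training pairs.
    is_similar: boolean labels (True = same person).
    """
    best_t, best_acc = None, -1.0
    for t in np.unique(distances):
        pred = distances <= t            # below threshold -> "same person"
        acc = np.mean(pred == is_similar)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

d = np.array([0.2, 0.4, 0.5, 1.1, 1.3, 2.0])     # toy training-pair distances
y = np.array([True, True, True, False, False, False])
t, acc = best_threshold(d, y)
```

The threshold found on the 9 training splits is then applied unchanged to the held-out test split.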
5.1.3 Comparative methods
We compare the performance of the proposed logEuclidean metric learning approach with the following approaches:

Directly use any of the following distances for matching: Frobenius, Cholesky-Frobenius, J-divergence, Jensen-Bregman LogDet divergence, Affine-invariant and Log-Frobenius.

Use ITML directly with the covariance matrices by treating them as elements of the Euclidean space of symmetric matrices.

Use ITML with the lower triangular matrix obtained by Cholesky decomposition.
In all these methods, the distance threshold is obtained in each fold independently based on the training data.
5.1.4 Parameters
The ITML parameter values used for final training were selected using the development subset of the dataset.
Table 4: Matching accuracy (%) on the LFW dataset using various distances directly.

Frobenius | Cholesky-Frobenius | Log-Frobenius | J-divergence | Jensen-Bregman LogDet divergence | Affine-invariant
53.77 | 56.62 | 60.43 | 60.92 | 61.62 | 61.15
Table 5: Matching accuracy (%) on the LFW dataset with and without ITML.

Representation | Frobenius | ITML | ITML gain
Covariance matrices | 53.77 | 57.58 | 3.81
Cholesky decompositions | 56.62 | 63.53 | 6.91
Log-Euclidean | 60.43 | 69.37 | 8.94
5.1.5 Results
Tables 4 and 5 summarize the prediction results for various approaches on the LFW data set. We can draw the following conclusions from these results:

The proposed Riemannian metric/geodesic distance learning approach outperforms the other approaches for comparing covariance matrices.

The logEuclidean geodesic distance learned from the data performs much better than the standard logFrobenius distance.

Distance learning with original covariance matrices or Cholesky decompositions performs poorly compared to distance learning in the logarithm domain.
5.2 Semi-supervised clustering using the ETH80 object dataset
In this experiment, we are interested in clustering the images in the ETH80 dataset into different object categories.
5.2.1 Dataset
The ETH80 object dataset [10] consists of images of 8 object categories with each category including 10 different object instances. Each object instance has 41 images captured under different views. So, each object category has 410 images resulting in a total of 3280 images.
5.2.2 Feature extraction
We convert each pixel in an image into a 9-dimensional feature vector given by

$$f(x, y) = \Big[\, x, \; y, \; R(x, y), \; G(x, y), \; B(x, y), \; \Big|\frac{\partial I}{\partial x}\Big|, \; \Big|\frac{\partial I}{\partial y}\Big|, \; \Big|\frac{\partial^2 I}{\partial x^2}\Big|, \; \Big|\frac{\partial^2 I}{\partial y^2}\Big| \,\Big],$$

where $(x, y)$ are the column and row coordinates respectively, $R$, $G$ and $B$ are the color coordinates and $I$ is the grayscale image. We compute the $9 \times 9$ covariance matrix of the feature vectors over the entire image and use it to represent the image.
5.2.3 Experimental protocol and parameters
For every object category, we randomly select 4 images from each instance for training. Hence, we use 40 samples from each object category for training, resulting in a total of 320 training images. From each pair of training images, we generate either a similarity constraint or a dissimilarity constraint based on their category labels. We use all such constraints in learning the Mahalanobis distance function. Once we learn the Mahalanobis distance function, we use it for clustering the entire dataset of 3280 images.
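Generating the pairwise constraints from category labels can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np
from itertools import combinations

def pairwise_constraints(labels):
    """Similarity/dissimilarity constraints from category labels.

    Returns index pairs: (i, j) is a similarity constraint when the
    two training samples share a label, a dissimilarity constraint
    otherwise.
    """
    similar, dissimilar = [], []
    for i, j in combinations(range(len(labels)), 2):
        (similar if labels[i] == labels[j] else dissimilar).append((i, j))
    return similar, dissimilar

labels = [0, 0, 1, 1]                   # toy category labels
S, D = pairwise_constraints(labels)
```

With 320 training images this yields all 320 choose 2 = 51,040 constraints.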
We repeat the above procedure 5 times and report the average clustering accuracy. In each run, we select the value of the ITML tradeoff parameter using two-fold cross-validation on the training data; the remaining ITML parameters are kept fixed across all 5 runs.
We use the K-means algorithm for clustering. To handle the local-optimum issue, we run K-means with 20 different random initializations and select the clustering result corresponding to the minimum K-means cost value.
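The multiple-restart strategy above can be sketched as follows (a minimal Lloyd's-algorithm implementation for illustration; in practice a library routine such as scikit-learn's KMeans with n_init=20 would be used):

```python
import numpy as np

def kmeans_best_of(X, k, n_init=20, seed=0):
    """Run K-means n_init times and keep the lowest-cost solution."""
    rng = np.random.default_rng(seed)
    best_cost, best_labels = np.inf, None
    for _ in range(n_init):
        centers = X[rng.choice(len(X), k, replace=False)]  # random init
        for _ in range(100):
            # Assign each point to its nearest center.
            d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = d.argmin(1)
            # Recompute centers, keeping empty clusters in place.
            new = np.array([X[labels == c].mean(0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        cost = ((X - centers[labels]) ** 2).sum()          # K-means objective
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost

# Two well-separated toy clusters.
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5])
labels, cost = kmeans_best_of(X, 2)
```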
5.2.4 Comparative methods
We compare the performance of the proposed logEuclidean metric learning approach with the following approaches:

Unsupervised: Directly perform K-means clustering using any of the following distances: Frobenius, Cholesky-Frobenius and Log-Frobenius.

Use ITML directly with the covariance matrices by treating them as elements of the Euclidean space of symmetric matrices.

Use ITML with the lower triangular matrix obtained by Cholesky decomposition.
The mean does not have a closed-form solution under the J-divergence, the Jensen-Bregman LogDet divergence or the affine-invariant distance, so an iterative optimization procedure is needed to compute it. This makes the K-means algorithm computationally expensive. Hence, we do not use these distances for comparison in this work.
5.2.5 Results
Table 6 summarizes the clustering results for various approaches on the ETH80 dataset. We can draw the following conclusions from these results:

The proposed Riemannian metric/geodesic distance learning approach performs better than other approaches for clustering SPD matrices.

The logEuclidean geodesic distance learned from the data performs much better than the standard logFrobenius distance.

Distance learning with original covariance matrices or Cholesky decompositions performs poorly compared to distance learning in the logarithm domain.
6 Conclusion
In this work, we have explored the idea of data-driven Riemannian metrics or geodesic distances. Based on the log-Euclidean framework [6], we have shown how geodesic distance functions can be learned for $S_n^{++}$ by simply learning Mahalanobis distance functions in the logarithm domain. We have conducted experiments using face and object datasets. The face matching and semi-supervised object categorization results clearly show that the learned log-Euclidean geodesic distance performs much better than other distances.
Table 6: Clustering accuracy (%) on the ETH80 dataset.

Representation | Frobenius | ITML | ITML gain
Covariance matrices | 35.58 | 70.50 | 34.92
Cholesky decompositions | 51.13 | 70.36 | 19.24
Log-Euclidean | 55.70 | 73.79 | 18.09
References

[1] X. Pennec, P. Fillard, and N. Ayache, "A Riemannian Framework for Tensor Computing", IJCV, 2006.
[2] A. Goh and R. Vidal, "Clustering and Dimensionality Reduction on Riemannian Manifolds", In CVPR, 2008.
[3] O. Tuzel, F. Porikli, and P. Meer, "Region Covariance: A Fast Descriptor for Detection and Classification", In ECCV, 2006.
[4] O. Tuzel, F. Porikli, and P. Meer, "Pedestrian Detection via Classification on Riemannian Manifolds", PAMI, 2008.
[5] M. Harandi, C. Sanderson, R. Hartley, and B. Lovell, "Sparse Coding and Dictionary Learning for Symmetric Positive Definite Matrices: A Kernel Approach", In ECCV, 2012.
[6] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, "Log-Euclidean Metrics for Fast and Simple Calculus on Diffusion Tensors", Magnetic Resonance in Medicine, 2006.
[7] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-Theoretic Metric Learning", In ICML, 2007.
[8] K. Q. Weinberger and L. K. Saul, "Distance Metric Learning for Large Margin Nearest Neighbor Classification", JMLR, 2009.
[9] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments", University of Massachusetts, Amherst, Technical Report 07-49, October 2007.
[10] B. Leibe and B. Schiele, "Analyzing Appearance and Contour Based Methods for Object Categorization", In CVPR, 2003.
[11] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos, "Efficient Similarity Search for Covariance Matrices via the Jensen-Bregman LogDet Divergence", In ICCV, 2011.
[12] Z. Wang and B. C. Vemuri, "An Affine Invariant Tensor Dissimilarity Measure and its Applications to Tensor-valued Image Segmentation", In CVPR, 2004.
[13] I. L. Dryden, A. Koloydenko, and D. Zhou, "Non-Euclidean Statistics for Covariance Matrices, with Applications to Diffusion Tensor Imaging", The Annals of Applied Statistics, 2009.