 # Dimensionality Reduction on SPD Manifolds: The Emergence of Geometry-Aware Methods

Representing images and videos with Symmetric Positive Definite (SPD) matrices, and considering the Riemannian geometry of the resulting space, has been shown to yield high discriminative power in many visual recognition tasks. Unfortunately, computation on the Riemannian manifold of SPD matrices, especially high-dimensional ones, comes at a high cost that limits the applicability of existing techniques. In this paper, we introduce algorithms able to handle high-dimensional SPD matrices by constructing a lower-dimensional SPD manifold. To this end, we propose to model the mapping from the high-dimensional SPD manifold to the low-dimensional one with an orthonormal projection. This lets us formulate dimensionality reduction as the problem of finding a projection that yields a low-dimensional manifold either with maximum discriminative power in the supervised scenario, or with maximum variance of the data in the unsupervised one. We show that learning can be expressed as an optimization problem on a Grassmann manifold and discuss fast solutions for special cases. Our evaluation on several classification tasks shows that our approach leads to a significant accuracy gain over state-of-the-art methods.


## 1 Introduction

Dimensionality Reduction (DR) is imperative in various disciplines of computer science, including machine learning and computer vision. Conventional methods, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), are specifically designed to work with real-valued vectors coming from a flat Euclidean space. In modern computer vision, however, the data and the mathematical models often naturally lie on Riemannian manifolds (e.g., linear subspaces form Grassmannians, and 2D shapes lie on Kendall shape spaces Kendall (1984)). There has therefore been a growing need and interest to go beyond the extensively studied Euclidean spaces and analyze non-linear and curved Riemannian manifolds. In this context, a natural question arises: How can popular DR techniques be extended to curved Riemannian spaces? A principled answer to this question will open the door to exploiting higher-dimensional, more discriminative features, and thus to improved accuracies in a wide range of applications involving classification and clustering.

This paper tackles the problem of dimensionality reduction on the space of Symmetric Positive Definite (SPD) matrices, i.e., the SPD manifold. In computer vision, SPD matrices have been successfully employed for a variety of tasks, such as analyzing medical imaging Pennec et al. (2006), segmenting images Carreira et al. (2014) and recognizing textures Tuzel et al. (2006); Harandi et al. (2015), pedestrians Tuzel et al. (2008); Tosato et al. (2013); Jayasumana et al. (2015), faces Pang et al. (2008); Wang et al. (2012); Sivalingam et al. (2014), and actions Sanin et al. (2013); Guo et al. (2013).

The set of SPD matrices is clearly not a vector space, as it is not closed under addition and scalar multiplication (e.g., multiplying a positive definite matrix by a negative scalar makes it negative definite). As such, analyzing SPD matrices through the geometry of Euclidean spaces, such as using the Frobenius inner product as a means of measuring similarity, is not only unnatural, but also inadequate. This inadequacy has recently been demonstrated in computer vision by a large body of work, e.g., Pennec et al. (2006); Tuzel et al. (2008); Jayasumana et al. (2015). One striking example is the swelling effect that occurs in diffusion tensor imaging (DTI), where a matrix represents the covariance of the local Brownian motion of water molecules Pennec et al. (2006): when considering Euclidean geometry to interpolate between two diffusion tensors, the determinant of the intermediate matrices may become strictly larger than the determinants of both original matrices, which, from a physics point of view, is unacceptable.
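The swelling effect is easy to reproduce numerically. The sketch below (plain NumPy; the two tensors are our own illustrative choices, not from the paper) interpolates two SPD matrices of equal determinant along the Euclidean straight line and shows that the determinant of the midpoint exceeds both endpoints.

```python
import numpy as np

# Two SPD matrices with equal determinant (illustrative choices).
A = np.array([[2.0, 0.0], [0.0, 0.5]])
B = np.array([[0.5, 0.0], [0.0, 2.0]])

# Euclidean (straight-line) interpolation at the midpoint.
mid = 0.5 * (A + B)

print(np.linalg.det(A), np.linalg.det(B))  # 1.0 1.0
print(np.linalg.det(mid))                  # 1.5625: larger than both determinants
```

This is precisely the swelling that Riemannian interpolation under the AIRM avoids.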

A popular and geometric way to analyze SPD matrices is through the Riemannian structure induced by the Affine Invariant Riemannian Metric (AIRM) Pennec et al. (2006); the resulting space is usually referred to as the SPD manifold. The geodesic distance induced by the AIRM is related to the distance induced by the Fisher-Rao metric on the manifold of multivariate Gaussian distributions with fixed means (see, for example, Atkinson and Mitchell (1981)). It enjoys several properties, such as invariance to affine transformations, which are of particular interest in computer vision.

While the Riemannian structure induced by the AIRM has been shown to overcome the limitations of Euclidean geometry to a great extent, the computational cost of the resulting techniques increases drastically with the dimension of the manifold (i.e., the size of the SPD matrices). As a consequence, with the exception of a few works that handle medium-sized features Carreira et al. (2014); Wang et al. (2012), previous studies have opted for low-dimensional SPD matrices (e.g., region covariance descriptors obtained from low-dimensional features). Clearly, and as evidenced by the recent feature-learning trends in computer vision, low-dimensional features are bound to be less powerful and discriminative. In other words, to match or even outperform state-of-the-art recognition systems on complex tasks, manifold-based methods will need to exploit high-dimensional SPD matrices. This paper introduces supervised and unsupervised DR techniques dedicated to SPD manifolds, as illustrated by Fig. 1.

Figure 1: Dimensionality Reduction on SPD Manifolds: Given data on a high-dimensional SPD manifold, where each sample represents an n×n SPD matrix, we learn a mapping to a lower-dimensional SPD manifold. We consider both the supervised scenario, illustrated here, where the resulting m×m SPD matrices are clustered according to class labels, and the unsupervised one, where the resulting matrices have maximum variance.

More specifically, in the supervised scenario, we introduce an approach that constructs a lower-dimensional and more discriminative SPD manifold from a high-dimensional one. To this end, we encode the notion of discriminative power by pulling together the training samples from the same class while pushing apart those from different classes. We study three variants of this approach, where the distance is defined by either the AIRM, the Stein divergence Sra (2012), or the Jeffrey divergence Wang and Vemuri (2004). In particular, the latter two divergences are motivated by the fact that they share invariance properties with the AIRM while being faster to compute.

In the unsupervised scenario, we draw inspiration from the Maximum Variance Unfolding (MVU) algorithm Weinberger and Saul (2006). That is, we introduce a method that maps a high-dimensional SPD manifold to a low-dimensional one on which the training matrices are furthest apart from their mean. As in the supervised case, we study three variants that rely on the AIRM, the Stein divergence and the Jeffrey divergence Cherian et al. (2013), respectively.

We demonstrate the benefits of our approach on several classification and clustering tasks where the data can be represented with high-dimensional SPD matrices. In particular, our method outperforms state-of-the-art techniques on image-based material categorization, face recognition, and action recognition from 3D motion capture sequences. A Matlab implementation of our algorithms is available from the first author's webpage.

¹ This paper is an extended and revised version of our earlier work Harandi et al. (2014). In addition to providing more insight on the proposed methods, we extend our initial work by introducing an unsupervised DR algorithm and by deriving variants of our unsupervised/supervised DR methods based on the Jeffrey divergence.

## 2 Background Theory

This section provides a brief review of the Riemannian geometry of SPD manifolds, as well as of Bregman divergences and their properties.

Notation: Throughout the paper, bold capital letters denote matrices (e.g., $\mathbf{X}$) and bold lower-case letters denote column vectors (e.g., $\mathbf{x}$). $\mathbf{I}_n$ is the $n \times n$ identity matrix. $\mathrm{GL}(n)$ denotes the general linear group, i.e., the group of real invertible $n \times n$ matrices. $\mathrm{Sym}(n)$ is the space of real $n \times n$ symmetric matrices. $\mathcal{S}^n_{++}$ and $\mathcal{G}(n,m)$ are the SPD and Grassmannian manifolds, respectively, and will be formally defined later. $\mathrm{Diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix with the real values $\lambda_i$ as diagonal elements. The principal matrix logarithm is defined as

$$\log(\mathbf{X}) = \sum_{r=1}^{\infty}\frac{(-1)^{r-1}}{r}(\mathbf{X}-\mathbf{I}_n)^r = \mathbf{U}\,\mathrm{Diag}\big(\log(\lambda_i)\big)\mathbf{U}^T, \quad (1)$$

with $\mathbf{X} = \mathbf{U}\,\mathrm{Diag}(\lambda_i)\,\mathbf{U}^T$ an eigenvalue decomposition of $\mathbf{X}$. Similarly, the matrix exponential is defined as

$$\exp(\mathbf{X}) = \sum_{r=0}^{\infty}\frac{1}{r!}\mathbf{X}^r = \mathbf{U}\,\mathrm{Diag}\big(\exp(\lambda_i)\big)\mathbf{U}^T, \quad (2)$$

with $\mathbf{X} = \mathbf{U}\,\mathrm{Diag}(\lambda_i)\,\mathbf{U}^T$.

### 2.1 The Geometry of SPD Manifolds

An $n \times n$ real SPD matrix $\mathbf{X}$ has the property that $\mathbf{v}^T\mathbf{X}\mathbf{v} > 0$ for all non-zero $\mathbf{v} \in \mathbb{R}^n$. The space of $n \times n$ SPD matrices, denoted by $\mathcal{S}^n_{++}$, forms the interior of a convex cone in the $\frac{n(n+1)}{2}$-dimensional Euclidean space. $\mathcal{S}^n_{++}$ is mostly studied when endowed with the Affine Invariant Riemannian Metric (AIRM) Pennec et al. (2006), defined as

$$\langle \mathbf{v},\mathbf{w}\rangle_{\mathbf{P}} \triangleq \big\langle \mathbf{P}^{-1/2}\mathbf{v}\mathbf{P}^{-1/2},\, \mathbf{P}^{-1/2}\mathbf{w}\mathbf{P}^{-1/2}\big\rangle = \mathrm{Tr}\big(\mathbf{P}^{-1}\mathbf{v}\mathbf{P}^{-1}\mathbf{w}\big), \quad (3)$$

for $\mathbf{P}\in\mathcal{S}^n_{++}$ and $\mathbf{v},\mathbf{w}\in T_{\mathbf{P}}\mathcal{S}^n_{++}$, where $T_{\mathbf{P}}\mathcal{S}^n_{++}$ denotes the tangent space of the manifold at $\mathbf{P}$. This metric induces the following geodesic distance between points $\mathbf{X}$ and $\mathbf{Y}$:

$$\delta_R(\mathbf{X},\mathbf{Y}) = \big\|\log\big(\mathbf{X}^{-1/2}\mathbf{Y}\mathbf{X}^{-1/2}\big)\big\|_F. \quad (4)$$
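Eq. 4 translates directly into code. The sketch below (NumPy/SciPy; the function name is ours) computes $\delta_R$:

```python
import numpy as np
from scipy.linalg import logm, fractional_matrix_power

def airm_dist(X, Y):
    """Geodesic distance of Eq. 4: ||log(X^{-1/2} Y X^{-1/2})||_F."""
    Xm = fractional_matrix_power(X, -0.5)   # X^{-1/2} for SPD X
    return np.linalg.norm(logm(Xm @ Y @ Xm), 'fro')
```

For instance, `airm_dist(X, X)` is 0, and `airm_dist(A @ X @ A.T, A @ Y @ A.T)` matches `airm_dist(X, Y)` for any invertible `A`, which is the affine invariance discussed next.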

The AIRM has several useful properties, such as invariance to affine transformations (as the name implies), i.e., $\delta_R(\mathbf{A}\mathbf{X}\mathbf{A}^T, \mathbf{A}\mathbf{Y}\mathbf{A}^T) = \delta_R(\mathbf{X},\mathbf{Y})$ for any $\mathbf{A}\in\mathrm{GL}(n)$. For in-depth discussions of the AIRM, we refer the interested reader to Pennec et al. (2006) and Bhatia (2007).

### 2.2 Bregman Divergences

We now introduce two divergences derived from the Bregman matrix divergence, namely the Jeffrey and Stein divergences. Below, we discuss their properties and establish some connections with the AIRM, which motivated our choice of these divergences in our DR formulations.

Let $\zeta : \mathcal{S}^n_{++} \to \mathbb{R}$ be a strictly convex and differentiable function defined on the symmetric positive cone $\mathcal{S}^n_{++}$. The Bregman matrix divergence $d_\zeta$ is defined as

$$d_\zeta(\mathbf{X},\mathbf{Y}) = \zeta(\mathbf{X}) - \zeta(\mathbf{Y}) - \big\langle \nabla\zeta(\mathbf{Y}),\, \mathbf{X} - \mathbf{Y}\big\rangle, \quad (5)$$

where $\langle \mathbf{X},\mathbf{Y}\rangle = \mathrm{Tr}(\mathbf{X}^T\mathbf{Y})$ is the Frobenius inner product, and $\nabla\zeta(\mathbf{Y})$ represents the gradient of $\zeta$ evaluated at $\mathbf{Y}$.

The Bregman divergence is asymmetric, non-negative, and definite (i.e., $d_\zeta(\mathbf{X},\mathbf{Y}) = 0$ iff $\mathbf{X} = \mathbf{Y}$). While the Bregman divergence enjoys a variety of useful properties Kulis et al. (2009), its asymmetric behavior is often a hindrance. In this paper, we are interested in two types of symmetrized Bregman divergences, namely the Stein and the Jeffrey divergences.

The Stein, or $S$, divergence (also known as the Jensen-Bregman LogDet divergence Cherian et al. (2013)) is obtained from the Bregman divergence of Eq. 5 by using $\zeta(\mathbf{X}) = -\log\det(\mathbf{X})$ as seed function and by Jensen-Shannon symmetrization. This yields

$$\delta_S^2(\mathbf{X},\mathbf{Y}) \triangleq \frac{1}{2}d_\zeta\!\Big(\mathbf{X},\frac{\mathbf{X}+\mathbf{Y}}{2}\Big) + \frac{1}{2}d_\zeta\!\Big(\mathbf{Y},\frac{\mathbf{X}+\mathbf{Y}}{2}\Big) = \log\det\!\Big(\frac{\mathbf{X}+\mathbf{Y}}{2}\Big) - \frac{1}{2}\log\det(\mathbf{X}\mathbf{Y}). \quad (6)$$

The Jeffrey, or $J$, divergence (also known as the symmetric KL divergence) is obtained from the Bregman divergence of Eq. 5 by using $\zeta(\mathbf{X}) = -\log\det(\mathbf{X})$ as seed function and by direct symmetrization. This yields

$$\begin{aligned} \delta_J^2(\mathbf{X},\mathbf{Y}) &\triangleq \frac{1}{2}d_\zeta(\mathbf{X},\mathbf{Y}) + \frac{1}{2}d_\zeta(\mathbf{Y},\mathbf{X}) \\ &= \frac{1}{2}\mathrm{Tr}\big(\mathbf{X}^{-1}\mathbf{Y}\big) - \frac{1}{2}\log\det\big(\mathbf{X}^{-1}\mathbf{Y}\big) + \frac{1}{2}\mathrm{Tr}\big(\mathbf{Y}^{-1}\mathbf{X}\big) - \frac{1}{2}\log\det\big(\mathbf{Y}^{-1}\mathbf{X}\big) - n \\ &= \frac{1}{2}\mathrm{Tr}\big(\mathbf{X}^{-1}\mathbf{Y}\big) + \frac{1}{2}\mathrm{Tr}\big(\mathbf{Y}^{-1}\mathbf{X}\big) - n. \quad (7) \end{aligned}$$
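Both divergences have cheap closed forms that avoid matrix logarithms: Eq. 6 needs only determinants and Eq. 7 only traces of linear solves. A minimal NumPy sketch (function names ours):

```python
import numpy as np

def stein_div2(X, Y):
    """Squared Stein divergence, Eq. 6: logdet((X+Y)/2) - 0.5*logdet(XY)."""
    ld = lambda A: np.linalg.slogdet(A)[1]   # log-determinant, SPD input
    return ld(0.5 * (X + Y)) - 0.5 * (ld(X) + ld(Y))

def jeffrey_div2(X, Y):
    """Squared Jeffrey divergence, Eq. 7: 0.5 Tr(X^-1 Y) + 0.5 Tr(Y^-1 X) - n."""
    n = X.shape[0]
    return 0.5 * np.trace(np.linalg.solve(X, Y)) \
         + 0.5 * np.trace(np.linalg.solve(Y, X)) - n
```

Both return 0 for X = Y, are symmetric in their arguments, and satisfy the affine and inversion invariances discussed next.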

The $S$ and $J$ divergences have a variety of properties akin to those of the AIRM. Below, we present the properties that motivated us to perform DR on $\mathcal{S}^n_{++}$ using such divergences.

#### Invariance properties

An especially attractive property for the computer vision community is the invariance of the $S$ and $J$ divergences to affine transformations. More specifically (and similarly to the AIRM), for $\mathbf{A}\in\mathrm{GL}(n)$, we have

$$\delta_S^2(\mathbf{X},\mathbf{Y}) = \delta_S^2\big(\mathbf{A}\mathbf{X}\mathbf{A}^T, \mathbf{A}\mathbf{Y}\mathbf{A}^T\big), \qquad \delta_J^2(\mathbf{X},\mathbf{Y}) = \delta_J^2\big(\mathbf{A}\mathbf{X}\mathbf{A}^T, \mathbf{A}\mathbf{Y}\mathbf{A}^T\big).$$

Furthermore, both divergences are invariant to inversion, i.e.,

$$\delta_J^2(\mathbf{X},\mathbf{Y}) = \delta_J^2\big(\mathbf{X}^{-1},\mathbf{Y}^{-1}\big), \qquad \delta_S^2(\mathbf{X},\mathbf{Y}) = \delta_S^2\big(\mathbf{X}^{-1},\mathbf{Y}^{-1}\big).$$

Proofs of the above statements can be readily obtained by plugging the affine transforms (e.g., $\mathbf{A}\mathbf{X}\mathbf{A}^T$) or the inverses into the definitions of the $S$ and $J$ divergences.

#### Equality of curve lengths

Given a curve $\gamma$ on $\mathcal{S}^n_{++}$, let $l_R(\gamma)$, $l_S(\gamma)$ and $l_J(\gamma)$ denote the length of $\gamma$ under the AIRM, the Stein divergence and the $J$-divergence, respectively. Then $l_S(\gamma)$ and $l_J(\gamma)$ agree with $l_R(\gamma)$ up to a constant scale factor. The proof for the case of $l_S$ is given in Harandi et al. (2014), and that for $l_J$ is relegated to the supplementary material.

Beyond the previous two properties, the $S$ divergence also enjoys the following Hilbert space embedding property, which does not hold for the AIRM Jayasumana et al. (2015).

#### Hilbert space embedding

The $S$ divergence admits a Hilbert space embedding in the form of a Radial Basis Function (RBF) kernel Sra (2012). More specifically, the kernel

$$k_S(\mathbf{X},\mathbf{Y}) = \exp\big\{-\beta\,\delta_S^2(\mathbf{X},\mathbf{Y})\big\}, \quad (8)$$

is positive definite for

$$\beta \in \Big\{\frac{1}{2}, \frac{2}{2}, \cdots, \frac{n-1}{2}\Big\} \cup \Big(\frac{1}{2}(n-1), \infty\Big). \quad (9)$$

The analogous kernel derived from the $J$ divergence has previously been considered to be positive definite (e.g., equations 5 and 6 in Moreno et al. (2003)). However, we find that this is not the case, as can be shown by a counter-example: one can construct three SPD matrices $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$ for which the resulting $3\times 3$ kernel matrix has a negative eigenvalue for some values of $\beta$.

With the mathematical tools discussed in this section, we can now turn to developing our DR algorithms for SPD matrices. In the following sections, we first introduce our approach to supervised DR on SPD manifolds, and then discuss the unsupervised scenario.

## 3 DR on SPD Manifolds

In this section, we describe our approach to learning an embedding of high-dimensional SPD matrices to a more discriminative, low-dimensional SPD manifold. In doing so, we propose to learn the parameters $\mathbf{W} \in \mathbb{R}^{n\times m}$, $m < n$, of a generic mapping $f_{\mathbf{W}} : \mathcal{S}^n_{++} \to \mathcal{S}^m_{++}$, which we define as

$$f_{\mathbf{W}}(\mathbf{X}) = \mathbf{W}^T\mathbf{X}\mathbf{W}. \quad (10)$$

Clearly, for a full-rank matrix $\mathbf{W}$, if $\mathbf{X} \in \mathcal{S}^n_{++}$ then $f_{\mathbf{W}}(\mathbf{X}) \in \mathcal{S}^m_{++}$. Given a set of training SPD matrices $\{\mathbf{X}_i\}_{i=1}^{p}$, where each matrix $\mathbf{X}_i \in \mathcal{S}^n_{++}$, our goal is to find the transformation $\mathbf{W}$ such that the resulting low-dimensional SPD manifold preserves some interesting structure of the original data. In the remainder of this section, we discuss two different such structures: one coming from the availability of class labels, and one derived from unsupervised data.
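The SPD-preserving property of the map in Eq. 10 follows from $\mathbf{v}^T(\mathbf{W}^T\mathbf{X}\mathbf{W})\mathbf{v} = (\mathbf{W}\mathbf{v})^T\mathbf{X}(\mathbf{W}\mathbf{v}) > 0$ for $\mathbf{v} \neq 0$ when $\mathbf{W}$ has full column rank. A quick numerical check (arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 3

R = rng.standard_normal((n, n))
X = R @ R.T + n * np.eye(n)       # an n x n SPD matrix
W = rng.standard_normal((n, m))   # generically full column rank

Y = W.T @ X @ W                   # f_W(X) of Eq. 10
print(np.linalg.eigvalsh(Y).min() > 0)   # True: Y is m x m SPD
```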

### 3.1 Supervised Dimensionality Reduction

Let us first assume that each point $\mathbf{X}_i$ belongs to one of a finite set of classes and denote its class label by $y_i$. In this scenario, we propose to encode the structure of the data via an affinity function $a(\cdot,\cdot)$. That is, $a(\mathbf{X}_i, \mathbf{X}_j)$ measures some notion of affinity between matrices $\mathbf{X}_i$ and $\mathbf{X}_j$, and may be negative. In particular, we make use of the class labels to build $a(\cdot,\cdot)$² and define it such that it encodes the notions of intra-class and inter-class distances. In short, our goal is to find a mapping that minimizes the intra-class distances while simultaneously maximizing the inter-class distances (i.e., a discriminative mapping).

² Note that the framework developed in this section could also apply to the unsupervised and semi-supervised settings by changing the definition of the affinity function accordingly.

More specifically, we make use of the notions of within-class similarity $g_w$ and between-class similarity $g_b$ to compute the affinity between two SPD matrices. In particular, we define $g_w$ and $g_b$ to be binary functions given by

$$g_w(\mathbf{X}_i,\mathbf{X}_j) = \begin{cases} 1, & \text{if } \mathbf{X}_i \in N_w(\mathbf{X}_j) \text{ or } \mathbf{X}_j \in N_w(\mathbf{X}_i)\\ 0, & \text{otherwise} \end{cases} \quad (11)$$

$$g_b(\mathbf{X}_i,\mathbf{X}_j) = \begin{cases} 1, & \text{if } \mathbf{X}_i \in N_b(\mathbf{X}_j) \text{ or } \mathbf{X}_j \in N_b(\mathbf{X}_i)\\ 0, & \text{otherwise} \end{cases} \quad (12)$$

where $N_w(\mathbf{X}_i)$ is the set of nearest neighbours of $\mathbf{X}_i$ that share the same label as $\mathbf{X}_i$, and $N_b(\mathbf{X}_i)$ contains the nearest neighbours of $\mathbf{X}_i$ having different labels. Note that the nearest neighbours are computed according to the AIRM, the Stein divergence, or the Jeffrey divergence. The affinity function is then defined as

$$a(\mathbf{X}_i,\mathbf{X}_j) = g_w(\mathbf{X}_i,\mathbf{X}_j) - g_b(\mathbf{X}_i,\mathbf{X}_j), \quad (13)$$

which resembles the Maximum Margin Criterion (MMC) of Li et al. (2006).
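Given any pairwise-distance matrix D (computed with the AIRM, the Stein divergence, or the Jeffrey divergence) and the class labels, the affinity of Eq. 13 can be assembled as follows. This is our own sketch; the neighbourhood-size parameter names `nu_w` and `nu_b` are hypothetical.

```python
import numpy as np

def affinity_matrix(D, labels, nu_w=3, nu_b=3):
    """a(X_i, X_j) = g_w(X_i, X_j) - g_b(X_i, X_j) of Eqs. 11-13.

    D      : (p, p) pairwise distances (AIRM, Stein, or Jeffrey).
    labels : length-p array of class labels.
    """
    p = len(labels)
    Gw = np.zeros((p, p), dtype=bool)
    Gb = np.zeros((p, p), dtype=bool)
    for i in range(p):
        order = np.argsort(D[i])
        same = [j for j in order if j != i and labels[j] == labels[i]]
        diff = [j for j in order if labels[j] != labels[i]]
        Gw[i, same[:nu_w]] = True   # N_w: within-class nearest neighbours
        Gb[i, diff[:nu_b]] = True   # N_b: between-class nearest neighbours
    # g_w = 1 if X_i in N_w(X_j) or X_j in N_w(X_i); likewise for g_b.
    Gw, Gb = Gw | Gw.T, Gb | Gb.T
    return Gw.astype(float) - Gb.astype(float)
```

The symmetrizing "or" in Eqs. 11-12 is what makes the resulting affinity matrix symmetric with entries in {-1, 0, 1}.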

Having the affinity function at our disposal, we propose to search for an embedding such that the affinity between pairs of SPD matrices is reflected by a measure of similarity on the low-dimensional SPD manifold. In particular, we make use of the AIRM, the Stein divergence, or the Jeffrey divergence to encode the similarity between SPD matrices. This lets us write a cost function of the form

$$L(\mathbf{W}) = \sum_{\substack{i,j=1\\ j\neq i}}^{p} a(\mathbf{X}_i,\mathbf{X}_j)\,\delta^2\big(\mathbf{W}^T\mathbf{X}_i\mathbf{W},\, \mathbf{W}^T\mathbf{X}_j\mathbf{W}\big), \quad (14)$$

where $\delta^2$ is $\delta_R^2$, $\delta_S^2$, or $\delta_J^2$. To avoid degeneracies and ensure that the resulting embedding forms a valid SPD manifold, i.e., $\mathbf{W}^T\mathbf{X}\mathbf{W} \in \mathcal{S}^m_{++}$, we need $\mathbf{W}$ to have full rank. Here, we enforce this requirement by imposing the unitary constraint $\mathbf{W}^T\mathbf{W} = \mathbf{I}_m$. Note that, thanks to the affine invariance property, this entails no loss of generality. Indeed, any full-rank matrix $\tilde{\mathbf{W}}$ can be expressed as $\tilde{\mathbf{W}} = \mathbf{W}\mathbf{M}$, with $\mathbf{W}$ an orthonormal matrix and $\mathbf{M} \in \mathrm{GL}(m)$. The affine invariance property of the metric therefore guarantees that

$$L(\tilde{\mathbf{W}}) = L(\mathbf{W}\mathbf{M}) = L(\mathbf{W}).$$

As a result, learning can be expressed as the minimization problem

$$\mathbf{W}^* = \underset{\mathbf{W}\in\mathbb{R}^{n\times m}}{\operatorname{argmin}} \sum_{\substack{i,j=1\\ j\neq i}}^{p} a(\mathbf{X}_i,\mathbf{X}_j)\,\delta^2\big(\mathbf{W}^T\mathbf{X}_i\mathbf{W},\, \mathbf{W}^T\mathbf{X}_j\mathbf{W}\big) \quad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_m. \quad (15)$$

As will be discussed in Section 3.3, (15) is an optimization problem on a Grassmann manifold, and can thus be solved by Newton-type methods on the Grassmannian $\mathcal{G}(n,m)$. To this end, we need to compute the Jacobian of $\delta^2$ with respect to $\mathbf{W}$. For the $S$ divergence, this Jacobian matrix can be obtained by noting that (see Eq. 53 in Petersen and Pedersen (2012))

$$D_{\mathbf{W}}\log\det\big(\mathbf{W}^T\mathbf{X}\mathbf{W}\big) = 2\mathbf{X}\mathbf{W}\big(\mathbf{W}^T\mathbf{X}\mathbf{W}\big)^{-1}. \quad (16)$$

This lets us express the Jacobian of the Stein divergence as

$$D_{\mathbf{W}}\,\delta_S^2\big(\mathbf{W}^T\mathbf{X}\mathbf{W}, \mathbf{W}^T\mathbf{Y}\mathbf{W}\big) = -\mathbf{X}\mathbf{W}\big(\mathbf{W}^T\mathbf{X}\mathbf{W}\big)^{-1} - \mathbf{Y}\mathbf{W}\big(\mathbf{W}^T\mathbf{Y}\mathbf{W}\big)^{-1} + (\mathbf{X}+\mathbf{Y})\mathbf{W}\Big(\mathbf{W}^T\frac{\mathbf{X}+\mathbf{Y}}{2}\mathbf{W}\Big)^{-1}. \quad (17)$$

For the $J$ divergence, the Jacobian can be obtained by noting that (see Eq. 126 in Petersen and Pedersen (2012))

$$D_{\mathbf{W}}\,\mathrm{Tr}\big(\mathbf{W}^T\mathbf{X}\mathbf{W}(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1}\big) = 2\mathbf{X}\mathbf{W}(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1} - 2\mathbf{Y}\mathbf{W}(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{X}\mathbf{W})(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1}, \quad (18)$$

which, applied to both trace terms of Eq. 7, yields

$$\begin{aligned} D_{\mathbf{W}}\,\delta_J^2\big(\mathbf{W}^T\mathbf{X}\mathbf{W}, \mathbf{W}^T\mathbf{Y}\mathbf{W}\big) ={}& \mathbf{X}\mathbf{W}(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1} - \mathbf{Y}\mathbf{W}(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{X}\mathbf{W})(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1} \\ &+ \mathbf{Y}\mathbf{W}(\mathbf{W}^T\mathbf{X}\mathbf{W})^{-1} - \mathbf{X}\mathbf{W}(\mathbf{W}^T\mathbf{X}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{Y}\mathbf{W})(\mathbf{W}^T\mathbf{X}\mathbf{W})^{-1}. \quad (19) \end{aligned}$$

For the AIRM, we can exploit the fact that $\delta_R^2(\mathbf{X},\mathbf{Y}) = \big\|\log\big(\mathbf{X}\mathbf{Y}^{-1}\big)\big\|_F^2$, since the eigenvalues of $\mathbf{X}^{-1/2}\mathbf{Y}\mathbf{X}^{-1/2}$ and $\mathbf{Y}\mathbf{X}^{-1}$ coincide. We can then derive the Jacobian by utilizing Eq. 16, which yields

$$D_{\mathbf{W}}\,\delta_R^2\big(\mathbf{W}^T\mathbf{X}\mathbf{W}, \mathbf{W}^T\mathbf{Y}\mathbf{W}\big) = 4\big(\mathbf{X}\mathbf{W}(\mathbf{W}^T\mathbf{X}\mathbf{W})^{-1} - \mathbf{Y}\mathbf{W}(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1}\big)\log\big(\mathbf{W}^T\mathbf{X}\mathbf{W}(\mathbf{W}^T\mathbf{Y}\mathbf{W})^{-1}\big). \quad (20)$$
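Jacobians such as these are easy to sanity-check numerically. The sketch below verifies Eq. 17 by comparing it against central finite differences of the Stein divergence (random setup and function names ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3

def rand_spd(n):
    R = rng.standard_normal((n, n))
    return R @ R.T + n * np.eye(n)

X, Y = rand_spd(n), rand_spd(n)
W = np.linalg.qr(rng.standard_normal((n, m)))[0]

def stein2(A, B):                      # closed form of Eq. 6
    ld = lambda M: np.linalg.slogdet(M)[1]
    return ld(0.5 * (A + B)) - 0.5 * (ld(A) + ld(B))

def f(W):
    return stein2(W.T @ X @ W, W.T @ Y @ W)

def jac(W):                            # analytic Jacobian, Eq. 17
    Sx = np.linalg.inv(W.T @ X @ W)
    Sy = np.linalg.inv(W.T @ Y @ W)
    Sm = np.linalg.inv(W.T @ (0.5 * (X + Y)) @ W)
    return -X @ W @ Sx - Y @ W @ Sy + (X + Y) @ W @ Sm

# Central finite differences, entry by entry.
num = np.zeros_like(W)
eps = 1e-6
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (f(W + E) - f(W - E)) / (2 * eps)

print(np.max(np.abs(num - jac(W))))    # tiny: the analytic form matches
```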

Our supervised DR method for SPD matrices is summarized in Algorithm 1, which relies on the parallel transport of tangent vectors from one point on the manifold to another (see Section 3.3 for details).

### 3.2 Unsupervised Dimensionality Reduction

We now turn to the scenario where we do not have access to the labels of the training samples. In other words, our training data only consists of a set of SPD matrices $\{\mathbf{X}_i\}_{i=1}^{p}$. To tackle this unsupervised DR scenario, we draw inspiration from algorithms such as PCA and MVU Weinberger and Saul (2006). These algorithms search for a low-dimensional latent space where the points have maximum variance, i.e., collectively have maximum distance to their mean.

Here, we follow the same intuition, but with the goal of mapping high-dimensional SPD matrices to lower-dimensional ones. To this end, we express unsupervised DR on SPD manifolds as the optimization problem

$$\mathbf{W}^* = \underset{\mathbf{W}\in\mathbb{R}^{n\times m}}{\operatorname{argmax}} \sum_{i=1}^{p} \delta^2\big(\mathbf{W}^T\mathbf{X}_i\mathbf{W},\, \mathbf{W}^T\mathbf{M}\mathbf{W}\big) \quad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_m, \quad (21)$$

with $\mathbf{M}$ the mean of the $\{\mathbf{X}_i\}$ with respect to the metric $\delta$. As in the supervised case, and as discussed in more detail in Section 3.3, (21) corresponds to an optimization problem on the Grassmann manifold. We therefore again opt for a Newton-type method on the Grassmannian to (approximately) solve it. Note that the gradient of the objective function has essentially the same form as in the supervised case, and can thus be easily obtained from Eq. (17), Eq. (19) and Eq. (20) for the Stein divergence, the Jeffrey divergence and the AIRM, respectively.

As mentioned above, (21) depends on the mean $\mathbf{M}$ of the training samples. Since these samples lie on a manifold, special care must be taken to compute their mean. In particular, we make use of the Fréchet formulation to obtain $\mathbf{M}$. This can be expressed as

$$\mathbf{M}^* \triangleq \underset{\mathbf{M}\in\mathcal{S}^n_{++}}{\operatorname{argmin}}\; \sum_{i=1}^{p}\delta^2(\mathbf{X}_i,\mathbf{M}). \quad (22)$$

For the AIRM, this is equivalent to computing the Riemannian (Karcher) mean by exploiting the exponential and logarithm maps Pennec et al. (2006). For the Stein metric, we make use of the iterative Convex Concave Procedure (CCCP) introduced in Cherian et al. (2013). For the Jeffrey divergence, we show below that, unlike the AIRM and the Stein divergence, the Fréchet mean can be computed analytically.

The Fréchet mean of a set of points $\{\mathbf{X}_i\}_{i=1}^{p}$ on $\mathcal{S}^n_{++}$ based on the Jeffrey metric, i.e., the minimizer of Eq. 22 for $\delta^2 = \delta_J^2$, is given by

$$\mathbf{M}^* = \mathbf{L}^{-1/2}\big(\mathbf{L}^{1/2}\mathbf{\Gamma}\mathbf{L}^{1/2}\big)^{1/2}\mathbf{L}^{-1/2}, \quad (23)$$

with $\mathbf{\Gamma} = \sum_{i=1}^{p}\mathbf{X}_i$ and $\mathbf{L} = \sum_{i=1}^{p}\mathbf{X}_i^{-1}$.
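Eq. 23 is a direct recipe: two matrix square roots and a handful of inverses. A transcription (SciPy for the matrix powers; the function name is ours):

```python
import numpy as np
from scipy.linalg import sqrtm, fractional_matrix_power

def jeffrey_mean(Xs):
    """Closed-form Frechet mean under the Jeffrey divergence (Eq. 23)."""
    Gamma = sum(Xs)                           # Gamma = sum_i X_i
    L = sum(np.linalg.inv(X) for X in Xs)     # L = sum_i X_i^{-1}
    Lp = fractional_matrix_power(L, 0.5)
    Lm = fractional_matrix_power(L, -0.5)
    # np.real guards against negligible imaginary round-off from sqrtm.
    return np.real(Lm @ sqrtm(Lp @ Gamma @ Lp) @ Lm)
```

The returned matrix is SPD and satisfies the Riccati equation of the proof that follows.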

To prove this theorem, let us first recall that, for $\mathbf{A},\mathbf{B}\in\mathcal{S}^n_{++}$, a quadratic equation of the form $\mathbf{X}\mathbf{A}\mathbf{X} = \mathbf{B}$, called a Riccati equation, has only one positive definite solution, given by Bhatia (2007)

$$\mathbf{X} = \mathbf{A}^{-1/2}\big(\mathbf{A}^{1/2}\mathbf{B}\mathbf{A}^{1/2}\big)^{1/2}\mathbf{A}^{-1/2}. \quad (24)$$

According to Eq. 22, and by making use of the divergence, the Fréchet mean must satisfy

$$\frac{\partial \sum_{i=1}^{p}\delta_J^2(\mathbf{X}_i,\mathbf{M})}{\partial \mathbf{M}} = 0. \quad (25)$$

Given that

$$\frac{\partial\,\mathrm{Tr}\big(\mathbf{X}\mathbf{M}^{-1}\big)}{\partial \mathbf{M}} = -\mathbf{M}^{-1}\mathbf{X}\mathbf{M}^{-1},$$

we have

$$\frac{\partial \sum_{i=1}^{p}\delta_J^2(\mathbf{X}_i,\mathbf{M})}{\partial \mathbf{M}} = \frac{1}{2}\sum_{i=1}^{p}\mathbf{X}_i^{-1} - \frac{1}{2}\sum_{i=1}^{p}\mathbf{M}^{-1}\mathbf{X}_i\mathbf{M}^{-1} = 0 \;\Leftrightarrow\; \mathbf{M}\Big(\sum_{i=1}^{p}\mathbf{X}_i^{-1}\Big)\mathbf{M} = \sum_{i=1}^{p}\mathbf{X}_i,$$

which is a Riccati equation with a unique and closed form solution. A slightly different proof is also provided in Wang and Vemuri (2004).

There is a subtle difference between PCA in Euclidean space and the solution developed here. More specifically, unlike in Euclidean PCA, $\mathbf{W}^T\mathbf{M}\mathbf{W}$ does not necessarily represent the mean of the transformed data in $\mathcal{S}^m_{++}$. That is,

$$\mathbf{W}^T\mathbf{M}\mathbf{W} \neq \underset{\mathbf{M}'\in\mathcal{S}^m_{++}}{\operatorname{argmin}}\; \sum_{i=1}^{p}\delta^2\big(\mathbf{W}^T\mathbf{X}_i\mathbf{W}, \mathbf{M}'\big)$$

in general.

### 3.3 Optimization Framework

Both the unsupervised and supervised DR techniques introduced above can be viewed as solving an optimization problem with a unitary constraint, which can generally be written as

$$\min_{\mathbf{W}}\; f(\mathbf{W}) \quad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_m, \quad (26)$$

where $f$ is the desired cost function and $\mathbf{W}\in\mathbb{R}^{n\times m}$. In Euclidean space, problems of the form of (26) are typically cast as eigenvalue problems (e.g., Saul and Roweis (2003); Li et al. (2006); Yan et al. (2007); Jia et al. (2009); Kokiopoulou et al. (2011)). However, the complexity of our cost functions prohibits us from doing so. Instead, we propose to make use of manifold-based optimization techniques.

Recent advances in optimization methods formulate problems with unitary constraints as optimization problems on Stiefel or Grassmann manifolds Edelman et al. (1998); Absil et al. (2008). More specifically, the geometrically correct setting for the minimization problem in (26) is, in general, a Stiefel manifold. However, if the cost function is independent of the choice of basis spanned by $\mathbf{W}$, that is, if $f(\mathbf{W}) = f(\mathbf{W}\mathbf{R})$ for any $\mathbf{R}\in\mathcal{O}(m)$, then the problem is on a Grassmann manifold. Here, $\mathcal{O}(m)$ denotes the group of $m\times m$ orthogonal matrices. In our case, because of the affine invariance of the AIRM, the Stein divergence and the Jeffrey divergence, it can easily be checked that our cost functions are indeed independent of the choice of basis. We can therefore make use of Grassmannian optimization techniques and, in particular, of Newton-type optimization, which we briefly review below.

A Grassmann manifold $\mathcal{G}(n,m)$ is the space of $m$-dimensional linear subspaces of $\mathbb{R}^n$ Absil et al. (2008). Newton-type optimization, such as conjugate gradient (CG), over a Grassmannian is an iterative optimization routine that relies on a notion of search direction. In $\mathbb{R}^n$, such a direction is determined by the gradient vector. Similarly, on an abstract Riemannian manifold $\mathcal{M}$, the gradient of a smooth function identifies the direction of maximum ascent. Furthermore, the gradient of $f$ at a point $\mathbf{W}$, denoted by $\nabla_{\mathbf{W}} f$, is the element of the tangent space $T_{\mathbf{W}}\mathcal{M}$ satisfying $\langle \nabla_{\mathbf{W}} f, \Delta\rangle_{\mathbf{W}} = D_{\mathbf{W}} f(\Delta)$ for all $\Delta\in T_{\mathbf{W}}\mathcal{M}$. Here, $\langle\cdot,\cdot\rangle_{\mathbf{W}}$ is the Riemannian metric at $\mathbf{W}$ and $D_{\mathbf{W}} f(\Delta)$ denotes the directional derivative of $f$ at $\mathbf{W}$ along direction $\Delta$.

On , the gradient is expressed as

$$\nabla_{\mathbf{W}} f(\mathbf{W}) = \big(\mathbf{I}_n - \mathbf{W}\mathbf{W}^T\big)\,D_{\mathbf{W}}(f), \quad (27)$$

where $D_{\mathbf{W}}(f)$ is the matrix of partial derivatives of $f$ with respect to the elements of $\mathbf{W}$, i.e.,

$$\big[D_{\mathbf{W}}(f)\big]_{i,j} = \frac{\partial f(\mathbf{W})}{\partial \mathbf{W}_{i,j}}.$$

For our approach, these derivatives are given by Eqs. 17, 19 and 20 for the Stein divergence, the Jeffrey divergence and the AIRM, respectively.

Figure 2: Newton-type optimization on Riemannian manifolds. Here M denotes an abstract Riemannian manifold and TWM is the tangent space of M at W. Δ represents the gradient of the cost function f; for example, Δ0 is the gradient of f at W0. In each iteration of a Newton-type method, the gradient of the cost function is evaluated and a descent direction is determined (for steepest descent, it is simply along the gradient). The descent direction is mapped back to the manifold via the exponential map (or a retraction) to identify the new solution. This procedure continues until convergence.

The descent direction obtained via Eq. 27 needs to be mapped back to the manifold by the exponential map or by a retraction (see Chapter 4 in Absil et al. (2008) for definitions and detailed explanations). In the case of the Grassmannian, this can be understood as enforcing the unitary constraint while making sure that the cost function decreases. Fig. 2 provides a conceptual diagram of Newton-type optimization on Riemannian manifolds.

Here, in particular, we make use of a CG method on the Grassmannian. CG methods compute the new descent direction by combining the gradients at the current and the previous solutions. To this end, they require transporting the previous gradient to the current point on the manifold. Unlike in flat spaces, on a manifold one cannot transport a tangent vector from one point to another by simple translation. To get a better intuition, take the case where the manifold is a sphere, and consider two tangent spaces, one located at the pole and one at a point on the equator. Obviously, the tangent vectors at the pole do not belong to the tangent space at the equator, so simple vector translation is not sufficient. As illustrated in Fig. 3, transporting a tangent vector Δ from a point W to another point V on the manifold requires subtracting the normal component at V for the resulting vector to remain tangent. Such a transfer of a tangent vector is called parallel transport. On the Grassmann manifold, parallel transport, and the other operations required for a CG method, have efficient numerical forms, which makes them well-suited to performing optimization on the manifold.

CG on a Grassmann manifold can be summarized by the following steps:

• Compute the gradient of the objective function on the manifold at the current solution using

$$\nabla_{\mathbf{W}} f(\mathbf{W}) = \big(\mathbf{I}_n - \mathbf{W}\mathbf{W}^T\big)\,D_{\mathbf{W}}(f). \quad (28)$$
• Determine the search direction by parallel transporting the previous search direction and combining it with the newly computed gradient.

• Perform a line search along the geodesic at the current solution in the computed search direction. On the Grassmann manifold, the geodesic going from point $\mathbf{X}$ in direction $\mathbf{\Delta}$ can be represented by the geodesic equation Absil et al. (2008)

$$\mathbf{X}(t) = \begin{bmatrix}\mathbf{X}\mathbf{V} & \mathbf{U}\end{bmatrix}\begin{bmatrix}\cos(\mathbf{\Sigma}t)\\ \sin(\mathbf{\Sigma}t)\end{bmatrix}\mathbf{V}^T, \quad (29)$$

where $t$ is the parameter indicating the location along the geodesic, and $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ is the compact singular value decomposition of $\mathbf{\Delta}$.

These steps are repeated until convergence to a local minimum, or until a maximum number of iterations is reached.

Figure 3: Parallel transport of a tangent vector Δ from a point W to another point V on the manifold.
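The geodesic equation (29) is simple to implement and to sanity-check: starting from an orthonormal W and a tangent direction Δ with WᵀΔ = 0, every point along the geodesic remains orthonormal. A sketch (function and variable names ours):

```python
import numpy as np

def grassmann_geodesic(W, Delta, t):
    """Point at parameter t along the geodesic from W in direction Delta
    (Eq. 29), where U S V^T is the compact SVD of Delta."""
    U, s, Vt = np.linalg.svd(Delta, full_matrices=False)
    return W @ Vt.T @ np.diag(np.cos(s * t)) @ Vt \
         + U @ np.diag(np.sin(s * t)) @ Vt

rng = np.random.default_rng(3)
n, m = 8, 3
W = np.linalg.qr(rng.standard_normal((n, m)))[0]
G = rng.standard_normal((n, m))
Delta = (np.eye(n) - W @ W.T) @ G          # project onto the tangent space
Wt = grassmann_geodesic(W, Delta, 0.3)
print(np.allclose(Wt.T @ Wt, np.eye(m)))   # True: the iterate stays on the manifold
```

At t = 0 the formula returns W itself, so a line search over t interpolates smoothly between the current solution and points further along the search direction.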

It is worth mentioning that optimization techniques on matrix manifolds (e.g., Stiefel, Grassmann) are at the core of several recent DR schemes Cunningham and Ghahramani (2015); Huang et al. (2015b, a). This is in part due to the availability of the manopt package, which makes optimizing over various Riemannian manifolds simple and straightforward Boumal et al. (2014). As a matter of fact, in our experiments, we used the implementation of the Grassmannian CG method provided by manopt to obtain $\mathbf{W}^*$. Note that manopt also provides other methods, such as trust-region solvers. A full evaluation of these solvers, however, goes beyond the scope of this paper.

## 4 Further Discussions

In this section, we discuss several points regarding our DR framework. In particular, we discuss the case of the log-Euclidean metric Arsigny et al. (2006) and derive a formulation for this metric. Furthermore, we discuss the specific case where the SPD matrices encode Region Covariance Descriptors Tuzel et al. (2006).

### 4.1 DR with the Log-Euclidean Metric

In Section 3, we developed DR methods on $\mathcal{S}^n_{++}$ based on the AIRM and on two Bregman divergences. Another widely used metric to compare SPD matrices is the log-Euclidean metric, defined as

$$\delta_{lE}(\mathbf{X},\mathbf{Y}) = \big\|\log(\mathbf{X}) - \log(\mathbf{Y})\big\|_F, \quad (30)$$

where $\log(\cdot)$ denotes the principal matrix logarithm. This metric is indeed a true Riemannian metric (for a zero-curvature manifold) and can be understood as a metric over the flat space that identifies the Lie algebra of the SPD manifold. Below, we develop a supervised DR method on SPD manifolds similar to the one in Section 3.1, but using the log-Euclidean metric. The adaptation to the unsupervised scenario introduced in Section 3.2 can easily be derived in a similar manner.

With the log-Euclidean metric, (15) can be rewritten as

$$\min_{\mathbf{W}\in\mathbb{R}^{n\times m}} \sum_{i,j=1}^{p} a(\mathbf{X}_i,\mathbf{X}_j)\big\|\log\big(\mathbf{W}^T\mathbf{X}_i\mathbf{W}\big) - \log\big(\mathbf{W}^T\mathbf{X}_j\mathbf{W}\big)\big\|_F^2 \quad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_m. \quad (31)$$

A difficulty in tackling (31) arises from the fact that an analytic form for the gradient of $\log\big(\mathbf{W}^T\mathbf{X}\mathbf{W}\big)$ with respect to $\mathbf{W}$ is not known Yger and Sugiyama (2015). To overcome this limitation, we introduce an approximation of this term, which relies on the following lemma.

The term $\log\big(\mathbf{W}^T\mathbf{X}\mathbf{W}\big)$ can be approximated as $\mathbf{W}^T\log(\mathbf{X})\mathbf{W}$. To see this, note that the Taylor expansion of the matrix logarithm is given by Cheng et al. (2001)

$$\log(\mathbf{I}_n - \mathbf{A}) = -\mathbf{A} - \frac{1}{2}\mathbf{A}^2 - \frac{1}{3}\mathbf{A}^3 - \cdots. \quad (32)$$

Therefore, we can write

$$\begin{aligned} \log\big(\mathbf{W}^T\mathbf{X}\mathbf{W}\big) &= \log\big(\mathbf{I}_m - (\mathbf{I}_m - \mathbf{W}^T\mathbf{X}\mathbf{W})\big)\\ &\approx -\big(\mathbf{I}_m - \mathbf{W}^T\mathbf{X}\mathbf{W}\big) = -\mathbf{W}^T(\mathbf{I}_n - \mathbf{X})\mathbf{W}\\ &\approx \mathbf{W}^T\log(\mathbf{X})\mathbf{W}, \end{aligned}$$

where both the second and third lines make use of the first-order Taylor approximation from Eq. 32, and the second line uses $\mathbf{W}^T\mathbf{W} = \mathbf{I}_m$.
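The quality of this first-order approximation can be probed numerically: for X close to the identity, the error behaves as O(‖X − I‖²). A quick check (random setup ours):

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(4)
n, m = 6, 3
W = np.linalg.qr(rng.standard_normal((n, m)))[0]
S = rng.standard_normal((n, n))
S = 0.5 * (S + S.T)                  # a fixed symmetric perturbation

for eps in (1e-1, 1e-2, 1e-3):
    X = np.eye(n) + eps * S          # SPD for small enough eps
    err = np.linalg.norm(logm(W.T @ X @ W) - W.T @ logm(X) @ W)
    print(eps, err)                  # error shrinks roughly as eps^2
```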

From the lemma above, we can cast (31) into the optimization problem

$$\min_{\mathbf{W}\in\mathbb{R}^{n\times m}} \sum_{i,j=1}^{p} a(\mathbf{X}_i,\mathbf{X}_j)\big\|\mathbf{W}^T\log(\mathbf{X}_i)\mathbf{W} - \mathbf{W}^T\log(\mathbf{X}_j)\mathbf{W}\big\|_F^2 \quad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_m. \quad (33)$$

The objective function of (33) can then be written as

$$\sum_{i,j=1}^{p} a(\mathbf{X}_i,\mathbf{X}_j)\big\|\mathbf{W}^T\big(\log(\mathbf{X}_i) - \log(\mathbf{X}_j)\big)\mathbf{W}\big\|_F^2 = \mathrm{Tr}\big(\mathbf{W}^T\mathbf{F}(\mathbf{W})\mathbf{W}\big),$$

with

$$\mathbf{F}(\mathbf{W}) = \sum_{i,j=1}^{p} a(\mathbf{X}_i,\mathbf{X}_j)\big(\log(\mathbf{X}_i) - \log(\mathbf{X}_j)\big)\mathbf{W}\mathbf{W}^T\big(\log(\mathbf{X}_i) - \log(\mathbf{X}_j)\big), \quad (34)$$

which yields the optimization problem

$$\min_{\mathbf{W}\in\mathbb{R}^{n\times m}} \mathrm{Tr}\big(\mathbf{W}^T\mathbf{F}(\mathbf{W})\mathbf{W}\big) \quad \text{s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_m. \quad (35)$$

We note that $\mathrm{Tr}\big(\mathbf{W}^T\mathbf{F}(\mathbf{W})\mathbf{W}\big)$ is invariant to the action of the orthogonal group, i.e., changing $\mathbf{W}$ to $\mathbf{W}\mathbf{R}$ for $\mathbf{R}\in\mathcal{O}(m)$ does not change the value of the trace. As such, in principle, (35) is a problem on $\mathcal{G}(n,m)$ and can be optimized in a similar manner as discussed before. In particular, to perform Newton-type methods on the Grassmannian, the required gradient is given by

$$D_{\mathbf{W}}\,\mathrm{Tr}\big(\mathbf{W}^T\mathbf{F}(\mathbf{W})\mathbf{W}\big) = 4\sum_{i,j=1}^{p} a(\mathbf{X}_i,\mathbf{X}_j)\big(\log(\mathbf{X}_i) - \log(\mathbf{X}_j)\big)\mathbf{W}\mathbf{W}^T\big(\log(\mathbf{X}_i) - \log(\mathbf{X}_j)\big)\mathbf{W} = 4\,\mathbf{F}(\mathbf{W})\mathbf{W}.$$

While optimization on the Grassmannian can indeed be employed to solve (35), here, we propose a faster alternative which relies on eigen-decomposition. To this end, we follow an iterative two-stage procedure. First, we fix $\mathbf{F}(\mathbf{W})$ (i.e., assume that it does not depend on $\mathbf{W}$), and compute the solution of the resulting approximation of (35), which can be achieved by computing the $m$ eigenvectors of $\mathbf{F}(\mathbf{W})$ associated with its smallest eigenvalues Kokiopoulou et al. (2011). Given the new $\mathbf{W}$, we update $\mathbf{F}(\mathbf{W})$, and iterate. The pseudo-code of this procedure is given in Algorithm 2.
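The two-stage loop can be sketched as follows (our own minimal version, not tuned for speed; it assumes the affinity matrix A has already been built):

```python
import numpy as np
from scipy.linalg import logm

def log_euclidean_dr(Xs, A, m, iters=20, seed=0):
    """Iterative eigen-solver for (35): with F(W) held fixed, the minimizer
    of Tr(W^T F W) s.t. W^T W = I is spanned by the m eigenvectors of F
    with smallest eigenvalues; F(W) is then refreshed and we iterate."""
    logs = [np.real(logm(X)) for X in Xs]
    n, p = Xs[0].shape[0], len(Xs)
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.standard_normal((n, m)))[0]   # random orthonormal init
    for _ in range(iters):
        P = W @ W.T
        F = np.zeros((n, n))
        for i in range(p):
            for j in range(p):
                if i != j and A[i, j] != 0.0:
                    D = logs[i] - logs[j]
                    F += A[i, j] * (D @ P @ D)          # Eq. 34
        _, V = np.linalg.eigh(F)    # eigenvalues in ascending order
        W = V[:, :m]                # m smallest eigenvectors
    return W
```

Each iteration only costs one eigen-decomposition of an n×n symmetric matrix, which is what makes this solver markedly faster than CG on the Grassmannian.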

Figure 4 compares the speed and convergence behavior of our iterative eigen-decomposition-based solution against the CG method on the Grassmannian. This figure was computed using the MOCAP dataset (see Section 6.1.2 for details). First, note that the eigen-decomposition solution converges much faster than CG. While CG yields a slightly lower error, our experiments show that the eigen-decomposition solver is typically 10 times faster than CG on the Grassmannian, which, we believe, justifies its usage.

The recent work of Huang et al. Huang et al. (2015b) introduced the idea of learning a log-Euclidean metric, which is related to our log-Euclidean-based supervised DR approach. This work formulates DR as the problem of finding a positive semi-definite matrix that maximizes the discriminative power of pairs of samples. In particular, the learned matrix was forced to have rank $m$, and thus identifies a low-dimensional latent space. Obtaining this matrix was then formulated as a log-det optimization problem Huang et al. (2015b). Our formulation, here, yields a much simpler optimization problem, and will thus be faster.

Figure 4: Convergence behavior of the proposed eigen-decomposition solver against conjugate gradient optimization on the Grassmann manifold.

Since, in (33), the $\log(\cdot)$ operation maps the SPD matrices to a Euclidean space, one might wonder whether we should not simply apply a traditional vector-based DR approach to the resulting space. Note, however, that our goal is to go from $\mathcal{S}^n_{++}$ to $\mathcal{S}^m_{++}$. Therefore, optimizing a projection between the corresponding Euclidean spaces, of dimensions $\frac{n(n+1)}{2}$ and $\frac{m(m+1)}{2}$, would translate to an optimization problem over a much larger matrix. By contrast, our symmetric formulation results in an optimization problem on $\mathbb{R}^{n\times m}$, which is clearly less expensive.

From a purely geometrical point of view, we believe that the solutions developed using the AIRM, the Stein divergence and the Jeffrey divergence are more attractive. In particular, these solutions model the nonlinear geometry of the SPD manifold, while the log-Euclidean metric flattens it. Furthermore, in contrast with the log-Euclidean metric, the AIRM, the Stein and the Jeffrey divergences are invariant to affine transformations. We acknowledge, however, that the log-Euclidean metric has been shown to be a useful substitute to the AIRM in several applications (e.g., Arsigny et al. (2006); Wang et al. (2012); Carreira et al. (2014)).

Since the Frobenius norm also belongs to the family of Bregman divergences, one could derive supervised/unsupervised DR formulations using it as a similarity measure. For example, in the supervised case, this would translate to solving

$$\text{s.t.}\quad W^\top W = I_m. \tag{36}$$

This can easily be rewritten in the form of (35), but where $F(W)$ now has the form

$$F(W)=\sum_{i,j=1}^{p} a(X_i, X_j)\,(X_i-X_j)\,W W^\top (X_i-X_j).$$

Therefore, our previous eigen-decomposition solution directly applies here.
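Concretely, the alternating scheme fixes $W$ inside $F(W)$, eigen-decomposes the resulting matrix, and iterates. A minimal sketch under these assumptions (toy data; `eig_solver` is our own naming, and the affinity matrix `a` stands in for $a(X_i, X_j)$):

```python
import numpy as np

def eig_solver(Xs, a, m, n_iter=20, seed=0):
    """Alternating eigen-decomposition solver (a sketch, not the paper's code).

    Xs: list of p SPD matrices of size n x n; a: p x p affinity matrix
    standing in for a(X_i, X_j); m: target dimension.
    Returns an n x m matrix W with orthonormal columns.
    """
    n = Xs[0].shape[0]
    p = len(Xs)
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((n, m)))
    for _ in range(n_iter):
        # form F(W) = sum_ij a_ij (X_i - X_j) W W^T (X_i - X_j)
        F = np.zeros((n, n))
        for i in range(p):
            for j in range(p):
                D = Xs[i] - Xs[j]
                F += a[i, j] * (D @ W) @ (D @ W).T
        # re-estimate W from the eigenvectors of F; for a minimization
        # objective keep the m smallest eigenvalues (largest for maximization)
        w, V = np.linalg.eigh(F)
        W = V[:, :m]
    return W
```

Each iteration only requires one eigen-decomposition of an $n \times n$ symmetric matrix, which is what makes this solver markedly cheaper than conjugate gradient on the Grassmannian.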

### 4.2 Region Covariance Descriptors

When it comes to the SPD matrices used in our experiments, we exploited Region Covariance Matrices (RCMs) Tuzel et al. (2006) as image descriptors. Here, we discuss some interesting properties of our algorithm when applied to these specific SPD matrices.

There are several reasons why RCMs are attractive to represent images and videos. First, RCMs provide a natural way to fuse various feature types. Second, they help reduce the impact of noisy samples in a region via their inherent averaging operation. Third, RCMs are independent of the size of the region, and can therefore easily be utilized to compare regions of different sizes. Finally, RCMs can be efficiently computed using integral images Tuzel et al. (2008); Sanin et al. (2013).
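As a concrete illustration of the last point, integral images of first- and second-order statistics allow the RCM of any axis-aligned rectangle to be assembled in constant time per entry. The sketch below is our own (the helper names `integral_rcm_tables` and `rcm_from_tables` are hypothetical, not from the cited works):

```python
import numpy as np

def integral_rcm_tables(feat):
    """Precompute integral images of sums of o and of o o^T so that the
    RCM of any rectangle can be read off without rescanning the pixels.
    feat: H x W x n feature image."""
    H, W, n = feat.shape
    P = np.zeros((H + 1, W + 1, n))        # running sums of o
    Q = np.zeros((H + 1, W + 1, n, n))     # running sums of o o^T
    P[1:, 1:] = feat.cumsum(axis=0).cumsum(axis=1)
    outer = feat[:, :, :, None] * feat[:, :, None, :]
    Q[1:, 1:] = outer.cumsum(axis=0).cumsum(axis=1)
    return P, Q

def rcm_from_tables(P, Q, y0, x0, y1, x1):
    # covariance of the rectangle [y0, y1) x [x0, x1) from the tables
    r = (y1 - y0) * (x1 - x0)
    s = P[y1, x1] - P[y0, x1] - P[y1, x0] + P[y0, x0]
    S = Q[y1, x1] - Q[y0, x1] - Q[y1, x0] + Q[y0, x0]
    return (S - np.outer(s, s) / r) / (r - 1)
```

The four-corner differences recover rectangle sums in O(1), so after the one-off table construction every candidate region yields its RCM at a cost quadratic only in the feature dimension n.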

Let $\mathcal{I}$ be an image, and $\{\mathbf{o}_i\}_{i=1}^{r}$, $\mathbf{o}_i \in \mathbb{R}^n$, be a set of $r$ observations extracted from $\mathcal{I}$, e.g., $\mathbf{o}_i$ concatenates intensity values, gradients along the horizontal and vertical directions, filter responses,… for image pixel $i$. Let $\boldsymbol{\mu}$ be the mean value of the observations. Then, image $\mathcal{I}$ can be represented by the RCM

$$\mathbf{C}_\mathcal{I} = \frac{1}{r-1}\sum_{i=1}^{r}(\mathbf{o}_i-\boldsymbol{\mu})(\mathbf{o}_i-\boldsymbol{\mu})^\top = \mathbf{O} \mathbf{J} \mathbf{J}^\top \mathbf{O}^\top, \tag{37}$$

where $\mathbf{O}=[\mathbf{o}_1,\dots,\mathbf{o}_r]$ and $\mathbf{J}$ is the corresponding (scaled) centering matrix. To have a valid RCM, we need $r \geq n+1$; otherwise $\mathbf{C}_\mathcal{I}$ would have zero eigenvalues, which would make both the AIRM and the Stein divergence indefinite.

After learning the projection $W$, the low-dimensional representation of image $\mathcal{I}$ is given by $W^\top \mathbf{C}_\mathcal{I} W$. This reveals two interesting properties of our learning scheme. 1) The resulting representation can also be thought of as an RCM, with $W^\top \mathbf{O}$ as a set of low-dimensional observations. Hence, in our framework, we can create a valid manifold with only $m+1$ observations instead of at least $n+1$ in the original input space. This is not the case for other algorithms, which require having matrices on the high-dimensional SPD manifold as input. 2) Applying $W$ directly to the set of observations reduces the computation time of creating the final low-dimensional RCM. This is due to the fact that the computational complexity of computing an RCM is quadratic in the dimensionality of the features.
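The first property can be verified numerically: computing the RCM of the projected observations coincides exactly with projecting the full RCM. A minimal sketch (the helper name `rcm` is ours):

```python
import numpy as np

def rcm(O):
    """RCM of a set of observations: O is n x r, one observation per column."""
    mu = O.mean(axis=1, keepdims=True)
    Oc = O - mu                       # center the observations
    return (Oc @ Oc.T) / (O.shape[1] - 1)

# Property 1): rcm(W.T @ O) == W.T @ rcm(O) @ W for any projection W,
# since both centering and the outer-product sum commute with W.
```

This identity is what lets one project the raw observations first (cost linear in $n$ per observation) and then build the small $m \times m$ RCM, rather than building the large $n \times n$ RCM and projecting it.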

## 5 Related Work

In this section, we review the methods that have exploited notions of Riemannian geometry for DR. In contrast with our approach that goes from one high-dimensional SPD manifold to a lower-dimensional manifold, most of the literature has focused on going from a manifold to Euclidean space.

In this context, a popular approach consists of flattening the manifold via its tangent space. The best-known example of such an approach is Principal Geodesic Analysis (PGA) Fletcher et al. (2003, 2004). PGA and its variants, such as Said et al. (2007); Huckemann et al. (2010); Sommer et al. (2010), have been successfully employed for various applications, such as analyzing vertebrae outlines Sommer et al. (2009) and motion capture data Said et al. (2007). PGA can be understood as a generalization of PCA to Riemannian manifolds. To this end, the widely-used formulation proposed in Fletcher et al. (2004) identifies the tangent space whose corresponding subspace maximizes the variability of the data on the manifold. PGA, however, is equivalent to flattening the Riemannian manifold by taking its tangent space at the Karcher, or Fréchet, mean of the data. As such, it does not fully exploit the structure of the manifold. Furthermore, PGA, as PCA, cannot exploit the availability of class labels, and may therefore be sub-optimal for classification.

Another recent popular trend consists of embedding the manifold into an RKHS to perform DR. In particular, Jayasumana et al. (2015) relied on kernel PCA, and Wang et al. (2012) on kernel Partial Least Squares (kPLS) and kernel Linear Discriminant Analysis (LDA), to achieve this goal. Embedding the manifold into an RKHS inherently requires a p.d. kernel. While significant progress has been made in identifying p.d. kernels on Riemannian manifolds Jayasumana et al. (2015); Li et al. (2013); Feragen et al. (2015), our knowledge is still limited in this regard. For example, in the case of SPD manifolds, the kernel employed in Wang et al. (2012) is a linear kernel on the tangent space at the identity. In Jayasumana et al. (2015), the best performing kernel corresponds to the Gaussian kernel defined on the tangent space at the identity. Therefore, in a very strict sense, the true structure of the manifold is not used by either of these works. As a matter of fact, it was recently shown that Gaussian kernels cannot preserve the geodesic distances on non-flat manifolds Feragen et al. (2015).

In contrast to the previous methods, which flatten the manifold via either a tangent space or an RKHS, Goh and Vidal (2008) directly employ notions of Riemannian geometry to perform nonlinear DR. In particular, Goh and Vidal (2008) extend several nonlinear DR techniques, such as Locally Linear Embedding (LLE), Hessian LLE and Laplacian Eigenmaps, to their Riemannian counterparts. Take for example the case of LLE Saul and Roweis (2003). Given a set of vectors, the LLE algorithm determines a weight matrix that minimizes a notion of reconstruction error. Once the weight matrix is determined, the algorithm embeds the data into a lower-dimensional space where their neighbourhood properties, encoded by the weight matrix, are preserved; the embedding takes the form of an eigen-decomposition in the end. As shown in Goh and Vidal (2008), the construction of the weight matrix can be generalized to the case of an arbitrary Riemannian manifold by using the logarithm map. Hence, for a given set of points on the manifold, an embedding to Euclidean space can be obtained once the weight matrix is appropriately constructed. In Goh and Vidal (2008), the authors showed on several clustering problems that the embedded data was more discriminative than the original one. In principle, the Riemannian extension of LLE (and of the other nonlinear DR algorithms discussed in Goh and Vidal (2008)) can also be applied to classification problems. However, these extensions are limited to the transductive setting, since they do not define any parametric mapping to the low-dimensional space.
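For reference, the logarithm map on the SPD manifold that underpins this generalization can be sketched as follows, assuming the AIRM geometry (`spd_log` is our own naming):

```python
import numpy as np

def spd_log(X, Y):
    """Logarithm map Log_X(Y) = X^{1/2} log(X^{-1/2} Y X^{-1/2}) X^{1/2}
    on the SPD manifold under the AIRM (a sketch)."""
    w, V = np.linalg.eigh(X)
    Xh = (V * np.sqrt(w)) @ V.T     # X^{1/2}
    Xih = (V / np.sqrt(w)) @ V.T    # X^{-1/2}
    u, U = np.linalg.eigh(Xih @ Y @ Xih)
    return Xh @ (U * np.log(u)) @ U.T @ Xh
```

The map sends a point Y to a symmetric matrix in the tangent space at X, and `spd_log(X, X)` is the zero matrix; it is these tangent vectors that replace the Euclidean differences when building the LLE weight matrix on the manifold.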

In contrast to the previous methods, whose learned mappings are to Euclidean space, a few recent techniques have studied the case of mapping between two manifolds of different dimensions. To this end, these techniques have also made use of the bilinear form of Eq. 10. In Wang et al. (2011), a mapping between covariance matrices of different dimensions was learnt, but by ignoring the Riemannian geometry of the SPD manifold. More recently, and probably inspired by our preliminary study Harandi et al. (2014), this bilinear form was employed to perform DR on the SPD manifold and on the Grassmannian by exploiting notions of Riemannian geometry Huang et al. (2015b, a); Yger and Sugiyama (2015). We also acknowledge that the work of Xu et al. (2015) is somewhat relevant to the log-Euclidean development done in §4.1. However, in contrast to our proposal, the authors of Xu et al. (2015) did not impose an orthogonality constraint on the projection matrix.

Finally, concepts of Riemannian geometry have also been exploited in the context of DR in Euclidean space. For instance, Lin and Zha Lin and Zha (2008) exploit the idea that the input (Euclidean) data lies on a low-dimensional Riemannian manifold. Recently, Cunningham and Ghahramani Cunningham and Ghahramani (2015) revisited linear DR techniques and analyzed them using the geometry of Stiefel manifolds.

## 6 Empirical Evaluation

We now evaluate our different SPD-based DR methods on several problems. We first consider the supervised scenario and present results on image and video classification tasks. We then turn to evaluating our unsupervised techniques for clustering on SPD manifolds. In all our experiments, the dimensionality of the low-dimensional SPD manifold was determined by cross-validation.

### 6.1 Image/Video Classification

The supervised SPD-DR algorithm introduced in Section 3.1 allows us to obtain a low-dimensional, more discriminative SPD manifold from a high-dimensional one. Many different classifiers can then be used to categorize the data on this new manifold. In our experiments, we make use of two such classifiers. First, we employ a simple nearest neighbour classifier based on the manifold metric (AIRM, log-Euclidean metric, Stein or Jeffrey divergence). This simple classifier clearly evidences the benefits of mapping the original Riemannian structure to a lower-dimensional one. Second, we make use of the Riemannian sparse coding algorithm of Harandi et al. (2015). This algorithm exploits the notion of sparse coding to represent a query SPD matrix using a codebook of SPD matrices. In all our experiments, we formed the codebook purely from the training data, i.e., no dictionary learning was employed. Note that this algorithm relies on a kernel derived from either the Stein divergence or the log-Euclidean metric. We refer to the different algorithms evaluated in our experiments as:

• NN-AIRM: AIRM-based Nearest Neighbour classifier.

• NN-S: Stein divergence-based Nearest Neighbour classifier.

• NN-J: Jeffrey divergence-based Nearest Neighbour classifier.

• NN-lE: log-Euclidean-based Nearest Neighbour classifier.

• NN-AIRM-DR: AIRM-based Nearest Neighbour classifier on the low-dimensional SPD manifold obtained with our approach.

• NN-S-DR: Stein divergence-based Nearest Neighbour classifier on the low-dimensional SPD manifold obtained with our approach.

• NN-J-DR: Jeffrey divergence-based Nearest Neighbour classifier on the low-dimensional SPD manifold obtained with our approach.

• NN-lE-DR: log-Euclidean-based Nearest Neighbour classifier on the low-dimensional SPD manifold obtained with our approach.

• kSC-S: kernel sparse coding Harandi and Salzmann (2015) using the Stein divergence on the high-dimensional SPD manifold.

• kSC-lE: kernel sparse coding using the log-Euclidean metric on the high-dimensional SPD manifold.

• kSC-S-DR: kernel sparse coding using the Stein divergence on the low-dimensional SPD manifold obtained with our approach.

• kSC-lE-DR: kernel sparse coding using the log-Euclidean metric on the low-dimensional SPD manifold obtained with our approach.

In addition to these methods, we also provide the results of the PLS-based Covariance Discriminant Learning (CDL) technique of Wang et al. (2012), as well as of the state-of-the-art baselines of each specific dataset.

In practice, to define the affinity function (see Section 3.1), we set the number of within-class neighbours to the minimum number of points in each class and, to balance the influence of the within-class and between-class terms, chose the remaining weighting parameter by cross-validation.

#### 6.1.1 Material Categorization

For the task of material categorization, we used the UIUC dataset Liao et al. (2013). The UIUC material dataset contains 18 subcategories of materials taken in the wild from four general categories (see Fig. 5): bark, fabric, construction materials, and outer coat of animals. Each subcategory has 12 images taken at various scales. Following standard practice, half of the images from each subcategory were randomly chosen as training data, and the rest were used for testing. We report the average accuracy over 10 different random partitions.

Small RCMs, such as those used for texture recognition in Tuzel et al. (2006), are hopeless here due to the complexity of the task. Recently, SIFT features Lowe (2004) have been shown to be robust and discriminative for material classification Liao et al. (2013). Therefore, we constructed 155×155 RCMs from 128-dimensional SIFT features (computed from grayscale images) and 27-dimensional color descriptors. To this end, we resized all the images and computed dense SIFT descriptors on a regular grid with 4 pixel spacing. The color descriptors were obtained by simply stacking colors from patches centered at the grid points. Each grid point therefore yields one 155-dimensional observation in Eq. 37. For this experiment, one parameter was set to the minimum number of samples in a class, and the other was obtained by 5-fold cross-validation.

Table 3 compares the performance of the studied algorithms. The performance of the state-of-the-art method on this dataset was reported to be 43.5% Liao et al. (2013). The results show that appropriate manifold-based methods (i.e., kSC-S and CDL) with the original RCMs already outperform this state of the art, while nearest neighbour classifiers (e.g., NN-AIRM, NN-S) on the same manifold yield worse performance. However, after applying our learning algorithm, NN not only significantly outperforms the state of the art, but also outperforms both CDL and kSC, except for the log-Euclidean solution. For example, kSC using the Stein divergence is boosted by nearly 14% by dimensionality reduction. The maximum accuracy is obtained by kernel sparse coding on the learned SPD manifold (kSC-S-DR).

Figure 5: Samples from the UIUC material dataset Liao et al. (2013).

#### 6.1.2 Action Recognition from Motion Capture Data

As a second experiment, we tackled the problem of human action recognition from motion capture sequences using the HDM05 database Müller et al. (2007). This database contains the following 14 actions: ‘clap above head’, ‘deposit floor’, ‘elbow to knee’, ‘grab high’, ‘hop both legs’, ‘jog’, ‘kick forward’, ‘lie down floor’, ‘rotate both arms backward’, ‘sit down chair’, ‘sneak’, ‘squat’, ‘stand up lie’ and ‘throw basketball’ (see Fig. 6 for an example). The dataset provides the 3D locations of 31 joints over time, acquired at the speed of 120 frames per second. We describe an action of a 31-joint skeleton observed over $\tau$ frames by its joint covariance descriptor Hussein et al. (2013), which is an SPD matrix of size 93×93. More specifically, let $x_i(t)$, $y_i(t)$ and $z_i(t)$ be the $x$, $y$, and $z$ coordinates of the $i$-th joint at frame $t$. Let $\mathbf{o}(t)$ be the vector of all joint locations at time $t$, i.e., $\mathbf{o}(t) = [x_1(t), y_1(t), z_1(t), \dots, x_{31}(t), y_{31}(t), z_{31}(t)]^\top$, which has 93 elements. The SPD matrix describing an action occurring over $\tau$ frames is then taken as the covariance of the vectors $\{\mathbf{o}(t)\}_{t=1}^{\tau}$.
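The joint covariance descriptor described above can be sketched as follows (our own code, not that of Hussein et al. (2013); the small ridge term is our addition to guarantee positive definiteness when a sequence has fewer than 3J+1 frames):

```python
import numpy as np

def joint_covariance(joints, eps=1e-6):
    """Joint covariance descriptor of a motion capture sequence (sketch).

    joints: array of shape (T, J, 3), the 3D locations of J joints over
    T frames. Returns a 3J x 3J SPD matrix; with J = 31 joints this is
    the 93 x 93 descriptor used in the text.
    """
    T = joints.shape[0]
    O = joints.reshape(T, -1)            # frame t -> vector o(t) of length 3J
    Oc = O - O.mean(axis=0, keepdims=True)
    C = Oc.T @ Oc / (T - 1)              # sample covariance of the o(t)
    return C + eps * np.eye(C.shape[0])  # ridge: keeps C strictly PD if T < 3J+1
```

Without the ridge, a sequence of fewer than 94 frames would yield a rank-deficient matrix, which lies on the boundary of the SPD cone rather than on the manifold itself.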

In our experiments, we used 2 subjects for training (i.e., ‘bd’ and ‘mm’) and the remaining 3 subjects for testing (i.e., ‘bk’, ‘dg’ and ‘tr’). (Note that this differs from the setup in Hussein et al. (2013), where 3 subjects were used for training and 2 for testing. However, with the setup of Hussein et al. (2013), where a high accuracy was reported, all our algorithms gave comparably high accuracies, which made it impossible to compare them.) This resulted in 118 training and 188 test sequences for this experiment. The parameters of our method were set to the minimum number of samples in one class,