Sparsity is a popular concept in signal processing [1, 2, 3] and stipulates that natural signals like images can be efficiently described using only a few non-zero coefficients of a suitable basis (i.e. dictionary) . This paper introduces techniques to perform sparse coding on Symmetric Positive Definite (SPD) matrices. More specifically, unlike traditional sparse coding schemes that work on vectors, in this paper we discuss how SPD matrices can be described by sparse combination of dictionary atoms, where the atoms are also SPD matrices.
Our motivation stems from pervasive role of SPD matrices in machine learning, computer vision and related areas. For example, SPD matrices have been used in medical imaging, texture classification [4, 5], action recognition and gesture categorization , as well as face recognition [7, 5].
Extending sparse coding methods to SPD matrices is not trivial, since such matrices form the interior of the positive semidefinite cone. In other words, simply vectorizing SPD matrices and employing Euclidean geometry (e.g., Euclidean norms) does not lead to accurate representations [8, 9, 10]
. To overcome the drawbacks of Euclidean structure, SPD matrices are usually analyzed using a Riemannian structure, known as SPD or tensor manifold. This is where the difficulties arise. On one hand, taking into account the Riemannian geometry is important as discussed in various recent studies [8, 9, 5, 10]. On the other hand, the non-linearity of the Riemannian structure is a hindrance and demands specialized machineries.
Generally speaking, two approaches to handle the non-linearity of Riemannian manifolds are (i) locally flattening them via tangent spaces [9, 6], and (ii) embedding them in higher dimensional Hilbert spaces [5, 11, 10]. The latter has recently received a surge of attention, since embedding into Reproducing Kernel Hilbert Space (RKHS) through kernel methods  is a well-established and principled approach in machine learning. However, embedding SPD manifolds into RKHS requires non-trivial kernel functions defined on such manifolds, which, according to Mercer’s theorem , must be positive definite.
The contributions in this paper111This paper is a thoroughly extended and revised version of our earlier work . In addition to providing more insights on the proposed methods, we extend our primary ideas by studying and devising coding and dictionary learning methods in the RKHS induced by the Jeffrey kernel. We also devise an efficient algorithm to obtain sparse codes in our RKHS-based formulation. are four-fold:
We propose sparse coding and dictionary learning algorithms for data points (matrices) on SPD manifolds, by embedding the manifolds into RKHS. This is advantageous, as linear geometry applies in RKHS.
For the embedding we propose kernels derived from two Bregman matrix divergences, namely the Stein and Jeffrey divergences. While the kernel property of the Jeffrey divergence was discovered in 2005 , to our best knowledge, this is one of the first attempts to benefit from this kernel for analyzing SPD matrices.
For both kernels, we devise a closed-form solution for updating an SPD dictionary atom by atom.
We apply the proposed methods to several computer vision tasks where images are represented by region covariance matrices. Our proposed algorithms outperform state-of-the-art methods on several classification tasks, including face recognition, texture classification and action recognition.
2 Related Work
In computer vision, SPD matrices are used in various applications, including pedestrian detection , texture classification [4, 5], object recognition , object tracking , action recognition [14, 6] and face recognition [7, 5]. This is mainly because Region Covariance Descriptors (RCM) , which encode second order statistics, are straightforward and relatively robust descriptors for images and videos. Moreover, structure tensors, which are by nature SPD matrices, encode important image features (e.g
., texture and motion in optical flow estimation and motion segmentation). Lastly, diffusion tensors that naturally arise in medical imaging are described bySPD matrices .
Our interest in this paper is to perform sparse coding and dictionary learning on SPD matrices, since modern systems in various applications benefit from the notion of sparse coding. However, while significant steps have been taken to develop the theory of the sparse coding and dictionary learning in Euclidean spaces, only a handful of studies tackle similar problems for SPD matrices [15, 14, 16].
Sra and Cherian  proposed to measure the similarity between SPD matrices using the Frobenius norm and formulated the sparse coding and dictionary learning problems accordingly. While solving the problems using purely Euclidean structure of SPD matrices is computationally attractive, it neglects the Riemannian structure of SPD manifolds.
A somehow similar and straightforward idea is to flatten an SPD manifold using a fixed tangent space. Sparse coding by embedding manifolds into their identity tangent spaces, which identifies the Lie algebra of SPD manifolds, is considered in [17, 14, 18]. Though such embedding considerably simplifies the sparse coding formulation, the pair-wise distances are no longer adequate, which can affect discrimination performance. This is exacerbated for manifolds with negative curvature (e.g. SPD manifolds), since pair-wise distances are not even directly bounded222For manifolds with positive curvature, pair-wise distances on tangent spaces are greater or equal to true geodesic distances on the manifold according to Toponogov’s theorem . Such property does not hold for manifolds with negative curvature..
A more involved approach to learn a Riemannian dictionary is proposed very recently by Ho et al. . The underlying idea is to exploit the tangent bundle of the manifold. To avoid a trivial solution in this approach, an affine constraint has to be added to the general formulation . While this results in independency to the origin, it no longer addresses the original problem. Furthermore, switching back and forth to tangent spaces of SPD manifolds (as required by this formulation) can be computationally very demanding for high dimensional manifolds.
Sivalingam et al. [20, 21] proposed Tensor Sparse Coding (TSC) which utilizes the Burg divergence (an asymmetric type of Bregman divergence) to perform sparse coding and dictionary learning on SPD manifolds. To this end, they show that when the Burg divergence is used as the proximity measure, the problem of sparse coding becomes a MAXDET problem which is convex and hence can be solved by interior point algorithms . As for dictionary learning, two methods were proposed in [20, 21]. In the first method, a gradient descent approach was utilized to update dictionary atoms one by one. Inspired by the K-SVD algorithm , the second method updates dictionary atoms by minimizing a form of residual error over training data, which speeds up the process of dictionary learning. Besides the asymmetric nature of the Burg divergence, we note that the computational complexity of the TSC algorithm is high, especially for high-dimensional SPD manifolds.
This section provides an overview on Riemannian geometry of SPD manifolds, Bregman divergences and their properties. It provides the groundwork for techniques described in following sections. Throughout the paper, bold capital letters denote matrices (e.g., ) and bold lower-case letters denote column vectors (e.g., ). Notation is used to indicate element at position of vector . is the identity matrix. and denote the and norms, respectively, with indicating matrix transpose. designates the Frobenius norm. denotes the general linear group, the group of real invertible matrices. is the space of real symmetric matrices.
3.1 Riemannian Geometry of SPD Manifolds
An , real SPD matrix has the property that for all non-zero . The space of SPD matrices, denoted by , is not a vector space since multiplying an SPD matrix by a negative scalar results in a matrix which does not belong to . Instead, forms the interior of a convex cone in the -dimensional Euclidean space. The space is mostly studied when endowed with a Riemannian metric and thus forms a Riemannian manifold .
On a Riemannian manifold, a natural way to measure nearness is through the notion of geodesics, which are curves analogous to straight lines in . The geodesic distance is thus defined as the length of the shortest curve connecting the two points. The tangent space at a point on the manifold, , is a vector space that consists of the tangent (i.e., velocity) vectors of all possible curves passing through .
Two operators, namely the exponential map and the logarithm map , are defined over Riemannian manifolds to switch between the manifold and tangent space at . The exponential operator maps a tangent vector to a point on the manifold. The property of the exponential map ensures that the length of becomes equal to the geodesic distance between and . The logarithm map is the inverse of the exponential map, and maps a point on the manifold to the tangent space . The exponential and logarithm maps vary as point moves along the manifold.
On the SPD manifold, the Affine Invariant Riemannian Metric (AIRM) , defined as:
for and , induces the following geodesic distance between points and :
with being the principal matrix logarithm.
3.2 Bregman Divergences
In this part we introduce two divergences derived from Bregman matrix divergence, namely the Jeffrey and Stein divergences. We discuss their properties and establish their relations to AIRM. This provides motivation and grounding for our formulation of sparse coding and dictionary learning using the aforementioned divergences.
Let be a strictly convex and differentiable function defined on the symmetric positive cone . The Bregman matrix divergence is defined as
where , and represents the gradient of evaluated at .
Loosely speaking, the Bregman divergence between and can be understood as the distance between the function and its first order Taylor approximation constructed at . The Bregman divergence is asymmetric, non-negative, and definite (i.e., ). While the Bregman divergence enjoys a variety of useful properties , its asymmetric behavior can be a hindrance (e.g., in SVMs, the kernels need to be symmetric, hence asymmetric divergences cannot be used to devise kernels). In this paper we are interested in two types of symmetrized Bregman divergences, namely the Jeffrey and the Stein divergences.
The divergence (also known as Jeffrey or symmetric KL divergence) is obtained from the Bregman divergence of Eqn. (3) by using as the seed function where denotes determinant:
3.3 Properties of and divergences
The and divergences have a variety of properties which are akin to those of AIRM. The pertinent properties which inspired us to seek sparse coding on using such divergences are:
the length of curves under AIRM and S divergence is equal up to a scale .
Hilbert space embedding (SPD kernels)
divergences admit a Hilbert space embedding in the form of a Radial Basis Function (RBF) kernel. More specifically, for the -divergence it has been shown that the kernel
is conditionally positive definite . Formally:
Definition 4 (Conditionally Positive Definite Kernels).
Let be a nonempty set. A symmetric function is a conditionally positive definite kernel on if and only if for any , and with .
For a kernel algorithm that is translation invariant, cpd kernels can be used instead of pd kernels .
This property relaxes the requirement of having pd kernels for certain types of kernel algorithms. For example, in SVMs, a cpd kernel can be seamlessly used instead of a pd kernel. We note that in  the kernel was claimed to be positive definite. However, a formal proof is not available according to our best knowledge. For the Stein divergence, the kernel
is guaranteed to be positive definite for
4 Sparse Coding
Given a query , sparse coding in vector spaces optimizes the objective function
with being a dictionary of size . The function penalizes the solution if it is not sparse. The most common form of in the literature is obtained via -norm regularization:
where is a Riemannian dictionary and is a query point. The operators , and are Riemannian replacements for subtraction, summation and scalar multiplication, respectively. We note that the operators and should be commutative and associative.
There are several difficulties in solving Eqn. (11). For example, metrics on Riemannian manifolds do not generally result in Eqn. (11) being convex . As such, instead of solving Eqn. (11), here we propose to side-step the difficulties by embedding the manifold into a Hilbert space and replacing the idea of “combination” on manifolds with the general concept of linear combination in Hilbert spaces.
For the SPD manifold , our idea is implemented as follows. Let and be a Riemannian dictionary and an embedding function on , respectively. Given a Riemannian point , we seek a sparse vector such that admits the sparse representation over . In other words, we are interested in solving the following problem:
where and , with . Since is a reproducing kernel, is positive definite. This reveals that the optimization problem in Eqn. (13) is convex and similar to its counterpart in Euclidean space, except for the definition of and . Consequently, greedy or relaxation solutions can be adapted to obtain the sparse codes . To solve Eqn. (13) efficiently, we have extended the Feature-Sign Search Algorithm (FSSA)  to its kernel version (kFSSA) in Appendix 7.
We note that kernel sparse coding and dictionary learning in traditional Euclidean spaces are studied recently in [34, 35]. In contrast, our aim is to obtain sparse coding of points on SPD manifolds, using SPD matrices as dictionary atoms. In our proposed solution this requires dedicated SPD kernels. Moreover, as will be discussed in § 5 dedicated algorithms for dictionary learning should be devised.
4.1 Classification Based on Sparse Representation
If the atoms in the sparse dictionary are not labeled (for example if36] for classification. In a supervised classification scenario, i.e., if the atoms in sparse dictionary are labeled, the generated sparse codes of the query sample can be directly used for classification. Let be the class-specific sparse codes, where is the class label of atom and is the discrete Dirac function . An efficient way of utilizing class-specific dictionary is through computing residual errors . In this case, the residual error of query sample for class is defined as:
Expanding Eqn. (14) and noting that is not class-dependent, the following expression can be obtained:
Alternatively, the similarity between query sample to class can be defined as . The function could be a linear function like or even a non-linear one like . Preliminary experiments suggest that Eqn. (15) leads to higher classification accuracies when compared to the aforementioned alternatives.
4.2 Computational Complexity
In terms of computational complexity, we note that the complexity of computing the determinant of an matrix through Cholesky decomposition is . Therefore, computing by storing the determinant of dictionary atoms during learning costs .
For the divergence, we note that the inverse of an SPD matrix can be computed through Cholesky decomposition with flops. Therefore, can be computed in flops if matrix multiplication is done efficiently. As a result, computing the divergence is cheaper than computing divergence for SPD matrices of size less than 35.
The complexity of sparse coding is dictated by in Eqn. (15). Neglecting the complexity of the exponential in kernel functions, the complexity of generating Eqn. (15) is for divergence and for divergence.
Note that while the computational complexity is cubic in , it is linear in , i.e., number of dictionary atoms. To give the reader an idea on the speed of the proposed methods, it is worth mentioning that performing sparse coding on covariance descriptors used in § 6.1.1 took less than 10 and 7 seconds with Jeffrey and Stein divergences, respectively (on an Intel i7 machine using Matlab). Performing a simple nearest neighbor search using AIRM required more than 75 seconds on the same dataset.
5 Dictionary Learning
Given a finite set of observations , learning a dictionary by embedding SPD manifolds into Hilbert space can be formulated as minimizing the following energy function with respect to :
is the loss function defined in Eqn. (12). should be small if is “good” at representing the signals . Among the various solutions to the problem of dictionary learning in Euclidean spaces, iterative methods like K-SVD have received much attention . Borrowing the idea from Euclidean spaces, we propose to minimize the energy in Eqn. (16) iteratively.
To this end, we first initialize the dictionary randomly. It is also possible to use intrinsic k-means clustering using the Karcher mean  to initialize the dictionary. Each iteration of dictionary learning then constitutes of two parts, namely a sparse coding step and a dictionary update step. In the sparse coding step, the dictionary is fixed and sparse codes, are computed as discussed in § 4. In the dictionary update step, are held fixed while
is updated, with each dictionary atom updated independently. This resembles the Expectation Maximization (EM) algorithm in nature. In the following subsections, we discuss how dictionary atoms can be updated for both and divergences.
5.1 Dictionary Updates for Divergence
As mentioned above, to update , we keep and the sparse codes in Eqn. (16) fixed. Generally speaking, one can update using gradient descend algorithms on SPD manifolds. This can be done at iteration by exploiting the tangent space at and moving along the direction of steepest descent and utilizing the exponential map to obtain as a point on .
In this paper, we propose to learn the dictionary in an online manner. Our proposal results in an analytical and closed-form solution for updating dictionary atoms one by one. In contrast to , our formulation does not exploit the tangent bundle and exponential maps, and is hence faster and more scalable. By fixing and , the derivative of Eqn. (16) with respect to can be computed as
For the divergence, we note that
then the root of Eqn. (17), i.e., can be written as:
This equation is identified as a Riccati equation . Its solution is positive definite and given as
provided that both and are positive definite. We note that in deriving the solution, we have assumed that at iteration can be replaced by and hence are treated as scalars.
5.2 Dictionary Updates for Divergence
the solution of with can be written as:
Since Eqn. (24) contains inverses and kernel values, a closed-form solution for computing cannot be sought. As such, we propose an alternative solution by exploiting previous values of in the update step. More specifically, rearranging Eqn. (24) and replacing as well as by their previous values, atom at iteration is updated according to:
5.3 Practical Considerations
The dictionary update in Eqn. (22) results in an SPD matrix provided that matrices and are SPD. In practice, this might not be the case and as such projection to the positive definite cone is required. The same argument holds for Eqn. (25). Given an arbitrary square matrix , the problem of finding the closest SPD matrix to has received considerable attention in the literature (c.f., ). While projecting onto positive definite cone can be achieved by thresholding (i.e
., replacing negative eigenvalues by a small positive number), a more principal approach can be used as follows. If square matrixis positive definite then is also positive definite. As such, the following convex problem can be solved to obtain the closest SPD matrix to the square matrix by using a solver like CVX .
We note that the formulation provided here works for non-symmetric matrix as well. This is again useful in practice as numerical issues might create non-symmetric matrices (e.g., and in Eqn. (22) might not become symmetric due to the limited numerical accuracy in a computational implementation).
Two sets of experiments333The corresponding Matlab/Octave source code is available at http://nicta.com.au/people/mharandi are presented in this section. In the first set, we evaluate the performance of the proposed sparse coding methods (as described in § 4) without dictionary learning. This is to contrast sparse coding to previous state-of-the-art methods on several popular closed-set classification tasks. To this end, each point in the training set is considered as an atom in the dictionary. Since the atoms in the dictionary are labeled in this case, the residual error approach for classification (as described in § 4.1) will be used to determine the label of a query point. In the second set of experiments, the performance of the sparse coding methods is evaluated in conjunction with the proposed dictionary learning algorithms described in § 5. For brevity, we denote Riemannian sparse representation with divergence as RSR-J, and the divergence counterpart as RSR-S.
The first priority of the experiments is to contrast the proposed methods against recent techniques designed to work on SPD manifolds. That is, the tasks and consequently the datasets were chosen to enable fair comparisons against state-of-the-art SPD methods. While exploring other visual tasks such as face verification  is beyond the scope of this paper, it is an interesting path to pursue in future work.
6.1 Sparse Coding
Below, we compare and contrast the performance of RSR-J and RSR-S methods against state-of-the-art techniques in five classification tasks, namely action recognition from 3D skeleton data, face recognition, material classification, person re-identification and texture categorization.
6.1.1 Action Recognition from 3D Skeleton Sequences
We used the motion capture HDM05 dataset  for the task of action recognition from skeleton data. Each action is encoded by the locations of 31 joints over time, with the speed of 120 frames per second. Given an action by joints over frames, we extracted the joint covariance descriptor  which is an SPD matrix of size as follows. Let , and be the , , and coordinates of the -th joint at frame . Let be the vector of all joint locations at time , i.e., , which has elements. The action represented over frames is then described by the covariance of vectors .
We used 3 subjects (140 action instances) for training, and the remaining 2 subjects (109 action instances) for testing. The set of actions used in this experiment is: ‘clap above head’, ‘deposit floor’, ‘elbow to knee’, ‘grab high’, ‘hop both legs’, ‘jog’, ‘kick forward’, ‘lie down floor’, ‘rotate both arms backward’, ‘sit down chair’, ‘sneak’, ‘squat’, ‘stand up lie’ and ‘throw basketball’.
In Table I we compare the performance of RSR-J and RSR-S against logEuc-SC  and Cov3DJ . The TSC algorithm  does not scale well to large SPD matrices and thus is not considered here. Cov3DJ encodes the relationship between joint movement and time by deploying multiple covariance matrices over sub-sequences in a hierarchical fashion. The results show that in this case RSR-J is better than RSR-S. Furthermore, both RSR-J and RSR-S outperform logEuc-SC and Cov3DJ.
6.1.2 Face Recognition
We performed four tests with various pose angles. Training data was composed of images marked ‘ba’, ‘bj’ and ‘bk’ (i.e., frontal faces with expression and illumination variations). Images with ‘bd’, ‘be’, ‘bf’, and ‘bg’ labels (i.e., non-frontal faces) were used as test data.
Each face image is described by a SPD matrix using the following features:
where is the intensity value at position , denotes the magnitude of a complex value and is the response of a 2D Gabor waveletcentered at with orientation and scale . In this work, we followed  and generated 40 Gabor filters in 8 orientations and 5 scales.
The proposed methods are compared against TSC , logEuc-SC , Sparse Representation-based Classification (SRC)  and its Gabor-based extension (GSRC) . For SRC, PCA was used to reduce the dimensionality of data. We evaluated the performance of SRC for various dimensions of PCA space and the maximum performance is reported. For the GSRC algorithm , we followed the recommendations of the authors for the downsampling factor in Gabor filtering. As for the logEuc-SC, we consider a kernel extension of the original algorithm. In other words, instead of directly using representations in a sparse coding framework as done in , we consider a kernel extension on representations using an RBF kernel. The kernel extension of sparse coding is discussed in depth in [34, 35]. This enhances the results in all cases and makes the logEuc-SC and RSR methods more comparable.
Table II shows the performance of all the studied methods for the task of face recognition. Both RSR-J and RSR-S outperform other methods, with RSR-S being marginally better than RSR-J.
6.1.3 Material Categorization
We used the Flickr dataset  for the task of material categorization. The dataset contains ten categories of materials: fabric, foliage, glass, leather, metal, paper, plastic, stone, water and wood. Each category has 100 images, 50 of which are close-up views and the remaining 50 are views at object-scale (see Figure 3 for examples). A binary, human-labeled mask is provided for each image in the dataset, describing the location of the object in the image. We only consider pixels inside this binary mask for material recognition and disregard all background pixels. SIFT  features have recently been shown to be robust and discriminative for material classification . We therefore constructed RCMs of size using 128 dimensional SIFT features (only from gray-scaled images) and 27 dimensional color descriptors. To this end, SIFT descriptors were computed at points on a regular grid with 5 pixel spacing. The color descriptor was obtained by simply stacking colors from patches centered at grid points.
Table III compares the performance of the proposed methods against the state-of-the-art non-parametric geometric detail extraction method (SD) , augmented Latent Dirichlet Allocation (aLDA) , and the texton-based representation introduced in . The results indicate that RSR-S considerably outperforms previous state-of-the-art approaches. We also note that RSR-J outperforms methods proposed in [50, 48] by a large margin and is only slightly worse than the aLDA algorithm .
6.1.4 Person Re-identification
We used the modified ETHZ dataset . The original ETHZ dataset was captured using a moving camera , providing a range of variations in the appearance of people. The dataset is structured into 3 sequences. Sequence 1 contains 83 pedestrians (4,857 images), Sequence 2 contains 35 pedestrians (1,936 images), and Sequence 3 contains 28 pedestrians (1,762 images). See left panel of Fig. 4 for examples.
We downsampled all images to pixels. For each subject we randomly selected 10 images for training and used the rest for testing. Random selection of training and testing data was repeated 20 times to obtain reliable statistics. To describe each image, the covariance descriptor was computed using the following features:
where is the position of a pixel, while , and represent the corresponding color information. The gradient and Laplacian for color are represented by and , respectively.
We compared the proposed RSR methods with several techniques previously used for pedestrian detection: Symmetry-Driven Accumulation of Local Features (SDALF) , Riemannian Locality Preserving Projection (RLPP) , and log-Euclidean sparse coding . The results for TSC  could not be generated in a timely manner due to the heavy computational load of the algorithm.
Results for the first two sequences are shown in Fig. 4, in terms of cumulative matching characteristic (CMC) curves. The CMC curve represents the expectation of finding the correct match in the top matches. The proposed RSR-S method obtains the highest accuracy on both sequences. RSR-J outperforms SDALF, RLPP and log-Euclidean sparse coding on sequence one. For the second sequence, RLPP and SDALF perform better than RSR-J for low ranks while RSR-J outperforms them for rank higher than two.
For Sequence 3 (not shown), very similar performances are obtained by SDALF, RLPP and the proposed methods. For this sequence, RSR-J and RSR-S achieve rank 1 accuracy of and , respectively. The CMC curves are almost saturated at perfect recognition at rank 3 for both RSR-J and RSR-S methods.
6.1.5 Texture Classification
We performed a classification task using the Brodatz texture dataset . Examples are shown in Fig. 5. We followed the test protocol devised in  and generated nine test scenarios with various number of classes. This includes 5-texture (‘5c’, ‘5m’, ‘5v’, ‘5v2’, ‘5v3’), 10-texture (‘10’, ‘10v’) and 16-texture (‘16c’, ‘16v’) mosaics. To create a Riemannian manifold, each image was first downsampled to and then split into 64 regions of size . The feature vector for any pixel is . Each region is described by a covariance descriptor of these features. For each test scenario, five covariance matrices per class were randomly selected as training data and the rest was used for testing. The random selection of training/testing data was repeated 20 times.
Fig. 6 compares the proposed RSR methods against logEuc-SC  and TSC . In general, the proposed methods obtain the highest recognition accuracy on all test scenarios except for the ‘5c’ test, where both methods have slightly worse performance than TSC. We note that in some cases such as ‘5m’ and ‘5v2’, RSR-J performs better than RSR-S. However, RSR-S is overall a slightly superior method for this task.
6.2 Dictionary Learning
Here we analyse the performance of the proposed dictionary learning techniques as described in § 5 on two classification tasks: texture classification and action recognition.
6.2.1 Texture Classification
Here we consider a multi-class classification problem, using 111 texture images of the Brodatz texture dataset . From each image we randomly extracted 50 blocks of size . To train the dictionary, 20 blocks from each image were randomly selected, resulting in a dictionary learning problem with 2200 samples. From the remaining blocks, 20 per image were used as probe data and 10 as gallery samples. The process of random block creation and dictionary generation was repeated twenty times. The average recognition accuracies over probe data are reported here. In the same manner as in § 6.1.5, we used the feature vector to create the covariance, where the first dimension is the grayscale intensity, and the remaining dimensions capture first and second order gradients.
We used the proposed methods to obtain the sparse codes, coupled with a dictionary generated via two separate methods: intrinsic k-means, and the proposed learning algorithm (§ 5). The sparse codes were then classified using a nearest-neighbor classifier.
Figure 7 shows the performance of RSR-J and RSR-S for various dictionary sizes. Red curves show the performance when the intrinsic -means algorithm was utilized for dictionary learning. The blue curves demonstrate the recognition accuracies when the methods proposed in § 5 were used for training, and finally the green curves show the performance of log-Euclidean sparse coding equipped with K-SVD  algorithm for dictionary learning. The figures show that the proposed dictionary learning approach consistently outperforms -means bar one case (RSR-J for dictionary size 8). Using the proposed dictionary learning approach, RSR-J achieves the maximum recognition accuracy of with 104 atoms, while RSR-S obtains the maximum accuracy of with 24 atoms. In contrast, when intrinsic -means is used for dictionary learning, the maximum recognition accuracies for RSR-J and RSR-S are and , respectively. Furthermore, in all cases RSR-S is superior to the log-Euclidean solution, while RSR-J performs better than the log-Euclidean approach only for dictionaries with size larger than 24 atoms.
6.2.2 Action Recognition
The UCF sport action dataset  consists of ten categories of human actions including swinging on the pommel horse, driving, kicking, lifting weights, running, skateboarding, swinging at the high bar, swinging golf clubs, and walking (examples of a diving action are shown in Fig. 8). The number of videos for each action varies from 6 to 22 and there are 150 video sequences in total. Furthermore, the videos presented in this dataset have non-uniform backgrounds and both the camera and the subject are moving in some actions. Frames in all video sequences are cropped according to the region of interest provided with the dataset and then resized to . The standard protocol in this dataset is the leave-one-out (LOO) cross validation [56, 57, 58].
From each video, we extracted several RCMs by splitting video data into 3D volumes. Volumes had the size of in domains with 8 pixels shift in each direction. From each volume, a RCM was extracted using kinematic features described in . From training RCMs, we learned separate dictionaries for and divergences using the methods described in § 5 with 256 atoms each. The dictionaries were then used to determine the video descriptor. To this end, each video was described by simply pooling the sparse codes of its volumes using max operator. Having training and testing descriptors at our disposal, a linear SVM  was used as classifier.
In Table IV, the overall performance of the RSR-J and RSR-S methods is compared against three state-of-the-art Euclidean approaches: HOG3D , Hierarchy of Discriminative space-time Neighbourhood features (HDN) , and augmented features  in conjunction with multiple kernel learning (AFMKL). HOG3D is an extension of histogram of oriented gradient descriptor  to spatio-temporal spaces. HDN learns shapes of space-time feature neighbourhoods that are most discriminative for a given action category. The idea is to form new features composed of the neighbourhoods around the interest points in a video. AFMKL exploits appearance distribution features and spatio-temporal context features in a learning scheme for action recognition. As shown in Table IV, RSR-J outperforms the log-Euclidean approach and is marginally worse than AMFKL. RSR-S achieves the highest overall accuracy.
|logEuc-SC with dic. learning|
The confusion matrices for RSR-J and RSR-S divergences are shown in Tables LABEL:tab:ucf_confusion_matrix_j and LABEL:tab:ucf_confusion_matrix_s, respectively. RSR-J perfectly classifies the actions of diving, golf swinging, kicking, riding horse, high bar swinging, walking and lifting, while RSR-S achieves perfect classification on golf swinging, riding horse, running, high bar swinging and lifting. Nevertheless, the overall performance of RSR-S surpasses that of RSR-J since RSR-J performs poorly in classifying the pommel-horse action.
7 Main Findings and Future Work
With the aim of addressing sparse representation on SPD manifolds, we proposed to seek the solution through embedding the manifolds into RKHS with the aid of two Bregman divergences, namely Stein and Jeffrey divergences. This led to a relaxed and extended version of the Lasso problem  on SPD manifolds.
In Euclidean spaces, the success of many learning algorithms arises from their use of kernel methods . Therefore, one could expect embedding a Riemannian manifold into higher dimensional Reproducing Kernel Hilbert Space (RKHS), where linear geometry applies, facilitates inference. Such an embedding, however, requires a non-trivial kernel function defined on the manifold, which, according to Mercer’s theorem , must be positive definite. The approach introduced here attains its merit from the following facts:
By recasting the sparse coding from into RKHS, a convex problem is obtained which can be solved quite efficiently. The sparse coding problem is in effect linearized, which is far easier than solving the Riemannian version of sparse coding as depicted in Eqn. (11).
Recasting the sparse coding from into RKHS exploits the advantages of higher dimensional Hilbert spaces, such as easier separability of classes.
The and divergences used in this paper are closely related to the Affine Invariant Riemannian Metric (AIRM) , and have several useful properties such as invariance to inversion and affine transforms. However, unlike AIRM, the and divergences admit a Hilbert space embedding (i.e., can be converted to kernel functions).
Experiments on several classification tasks show that the proposed approaches achieve notable improvements in discrimination accuracy, in comparison to state-of-the-art methods such as tensor sparse coding . We conjuncture that this stems from better exploitation of Riemannian geometry, as both divergences enjoy several properties similar to affine invariant Riemannian metric on SPD manifolds.
We have furthermore proposed algorithms for learning a dictionary, closely tied to the Stein and Jeffrey divergences. The experiments show that in many cases better performance is achieved with RSR-S as compared to RSR-J. However, we note that Jeffrey divergence enjoys several unique properties (e.g., closed form solution for averaging and Hilbert space embedding for all values of in Eqn. (6)) which makes it attractive for analyzing SPD matrices. Future venues of exploration include devising other types of inference and machineries based on and divergences, such as structured learning.
Geometric mean of J-Divergence
For two matrices , the geometric mean of divergence and AIRM are the same.
Kernelized Feature Sign Algorithm
The efficiency of the Feature-Sign Search algorithm  for finding sparse codes in vector spaces has been analysed in . The algorithm was shown to outperform (in terms of speed and accuracy) sparse solvers such as the generic QP solver . The gain is even higher for large and high-dimensional datasets (which are common in computer vision tasks). As such, we have elected to recast the algorithm into its RKHS version, in order to find the sparse codes on SPD manifolds. We summarize the new version below, with the pseudo-code shown in Algorithm 1.
Given the objective function defined in Eqn. (12), if the signs (positive, zero, or negative) of the are known at the optimal value, each term of can be replaced by either , or . Considering only non-zero coefficients, this reduces Eqn. (12) to a standard, unconstrained quadratic optimization problem (QP) , which has an analytic solution.
The Feature-Sign Algorithm comprises of four basic steps. The first two steps can be considered as a greedy search. The search is directed towards selecting a new feature that maximizes the rate of decrements in Eqn. (12). This is accomplished by computing the first order derivative of Eqn. (12) with respect to features, i.e.,
Since an estimate of the feature signs is available, in the third step (feature-sign step) it is possible to find the solution of the unconstrained QP of , where
In Eqn. (31), , and are subvectors corresponding to . Similarly, is a submatrix corresponding to with being the subset of selected features. The closed form solution, i.e.,