Exploiting Structure Sparsity for Covariance-based Visual Representation

by   Jianjia Zhang, et al.
University of Wollongong

The past few years have witnessed increasing research interest in covariance-based feature representation. A variety of methods have been proposed to boost its efficacy, with some recent ones resorting to nonlinear kernel techniques. Noting that the essence of this feature representation is to characterise the underlying structure of visual features, this paper argues that an equally, if not more, important approach to boosting its efficacy is to improve the quality of this characterisation. Following this idea, we propose to exploit the structure sparsity of visual features in skeletal human action recognition, and compute the sparse inverse covariance estimate (SICE) as the feature representation. We discuss the advantages of this new representation in dealing with small sample size, high dimensionality, and modelling capability. Furthermore, utilising the monotonicity property of SICE, we efficiently generate a hierarchy of SICE matrices to characterise the structure of visual features at different sparsity levels, and two discriminative learning algorithms are then developed to adaptively integrate them to perform recognition. As demonstrated by extensive experiments, the proposed representation leads to significantly improved recognition performance over the state-of-the-art comparable methods. In particular, as a method fully based on linear techniques, it is comparable to or even better than those employing nonlinear kernel techniques. This result demonstrates the value of exploiting structure sparsity for covariance-based feature representation.




1 Introduction

Covariance-based feature representation (Cov-RP in short) uses the covariance matrix of a predefined visual feature vector to represent an image region, a whole image, a set of images, or a sequence of video frames. It has been applied to various vision tasks including object detection and recognition [1], action recognition [2, 3], image set classification [4], to name a few. Through these applications, the Cov-RP has gradually developed from a local region descriptor to a more generic visual representation [5]. This trend of development is also reflected in the improvements proposed for this representation, from fast computation of covariance region descriptor [1], through developing novel measures and algorithms [6], to the recent use of nonlinear kernel techniques to characterise or even replace covariance matrix [7, 5].

Some visual recognition tasks face the issues of small sample size and high feature dimensionality. A typical example is skeletal human action recognition since usually a high dimensional feature vector is required to describe each video frame while the number of frames for one action instance is limited. This makes the estimate of covariance matrix unstable or even singular, significantly affecting the effectiveness of Cov-RP. A number of remedies have been proposed to address this fundamental issue, including appending scaled identity matrix [4], or using kernel matrix instead [5]. They have effectively improved the performance of Cov-RP in various recognition tasks.

In this paper, we note that Cov-RP essentially aims to characterise the underlying structure of visual features. However, the existing remedies have not yet paid sufficient attention to the appropriateness of sample-based covariance matrix for such a goal, and still treated the available samples as the only source of information. These make Cov-RP hard to fit the complexity of the tasks in recent applications.

Figure 1: Visualisation of the SICE representation (SICE-RP), panel (b), for a “Crouch or hide” action, panel (a), from the MSRC-12 data set. SICE-RP responds strongly inside the two red boxes, which correspond to the interactions among the y-coordinates and among the z-coordinates of the upper body. It responds weakly to the interactions of the x-coordinates. These patterns are consistent with the fact that, when squatting down, the joints mainly move along the y- and z-axes while staying still along the x-axis.

To address the above issues, this paper argues that an equally, if not more, important approach to boosting Cov-RP is to better characterise the underlying structure of visual features. In particular, for the task of skeleton-based human action recognition, prior knowledge on feature structure is readily available [8], and the most common one may be structure sparsity [9]. Inspired by this observation, this work investigates the effectiveness of exploiting structure sparsity for Cov-RP specifically in skeleton-based human action recognition. To conveniently accommodate this prior knowledge, we migrate from the covariance matrix to its inverse, and take advantage of the sparse inverse covariance estimation (SICE) technique [10] to serve our purpose. In doing so, we produce a new representation in which SICE serves as the basic unit. An example of this representation is illustrated in Fig. 1.

Exploiting structure sparsity brings the following advantages to this new representation. Firstly, it avoids a direct use of covariance estimate that could be unreliable in the case of small sample, and is completely free of the singularity issue in such case. Secondly, it could be more effective to characterise high-dimensional visual features that usually present structure sparsity, compared with Cov-RP. In addition, the off-diagonal elements in inverse covariance matrix correspond to the partial correlations between two feature components, which factors out the influence of other components, and thus has an advantage over covariance matrix in modeling the essence of data relationship.

Moreover, we extend this new feature representation by utilising the monotonicity property of SICE [11]. Through it, we efficiently obtain a hierarchy of SICE matrices that reflect feature structure at different levels of sparsity, and use all of them to produce an enriched representation. Accordingly, two discriminative learning algorithms are developed to adaptively integrate the hierarchy to measure the similarity between samples. This extension not only further improves the recognition performance, but also avoids choosing a single best sparsity level for the SICE process, which could be inefficient and suboptimal in practice.

To validate our approach and demonstrate its advantages, extensive experimental study is conducted to compare it with existing covariance-based representations and the state-of-the-art comparable methods on various data sets of skeleton-based human action recognition. As will be shown, the proposed feature representation achieves significant improvement over these methods. Especially, compared with the recent methods employing nonlinear kernel techniques, our new representation, as a fully linear technique, shows comparable or even better recognition performance, well demonstrating its potential power. In addition, those nonlinear kernel methods require prior knowledge to select appropriate kernel functions for the representation, which is not needed in ours.

Our contributions are summarised as follows. (i) To the best of our knowledge, we are the first to improve Cov-RP from the perspective of feature structure modeling and exploit sparsity for this representation. (ii) Our approach produces a new representation based on the SICE matrix, and achieves significant improvement over existing Cov-RP and the other comparable methods. (iii) We extend this new representation to use a hierarchy of SICE matrices and develop discriminative learning algorithms to adaptively integrate them, further improving the recognition performance. (iv) Extensive experimental study is conducted to verify the proposed approach in skeletal human action recognition. In addition, we also demonstrate its remarkable performance in brain image analysis, which shows its potential for generalization. It is worth noting that the proposed method is fundamentally different from that in [12]. In that work, SICE was used as a sparse Gaussian model to represent each class and each sample was still represented by a feature vector. In contrast, we use SICE as a representation for each sample. Accordingly, [12] did not classify SICE matrices while we do.

2 Related Work

Given a predefined visual feature vector, the statistical variation and mutual correlation of feature components presented on a set could be used to represent this set. Cov-RP is based on this idea and implements it effectively. The specific definitions of visual features or the set vary with applications. For example, in the application to object detection at the early days, the features are simply the locations and intensities of each pixel, while the set is an image region (a set of pixels) [4]. In this case, a small-sized covariance matrix is computed to represent the image region. In the recent application to skeletal human action recognition, the features are the coordinates of skeletal joints from a video frame, while the set is a video sequence recording an action instance [2]. Accordingly, a covariance matrix of larger size is obtained to represent this action instance. During the past decade, with its simplicity, robustness with respect to illumination change, and the flexibility of comparing different-sized sets, covariance representations have shown excellent performance in various vision tasks [1, 2, 4].

In the literature, major improvements on Cov-RP generally fall into the following three aspects. The first aspect is on computational efficiency. One important improvement is the use of integral image technique to significantly lower the computation on large image regions [1]. Another recent work uses dimension reduction to reduce the size of covariance matrices while maximally maintaining their original similarities [3]. The second aspect focuses on better evaluating the similarity of covariance matrices. Based on the theories of Riemannian manifold, improvements in this aspect have produced a number of novel measures and algorithms, contributing to the theoretical development of Cov-RP [6]. The third aspect incorporates nonlinear kernel technique to enhance Cov-RP in modelling complex feature relationship. One way is to nonlinearly map visual features to a kernel-induced feature space and compute covariance matrix therein [7]. Another recent way is to compute a kernel matrix whose elements are the kernel values of visual features and use it to replace covariance matrix [5]. As shown by that work, this not only largely avoids the singularity issue, but can also model the nonlinear relationship of feature components.

3 Proposed new feature representation

3.1 Motivation and basic idea

Our approach starts from reviewing the essence of Cov-RP. It is not difficult to see that this representation is essentially a characterisation of the underlying structure of visual features distributed over a set. Specifically, it assumes a Gaussian model and uses the sample-based covariance estimate to characterise this structure. However, none of the existing methods on Cov-RP has paid sufficient attention to the appropriateness of such a covariance estimate. We argue that the following two issues make it less appropriate. Firstly, a small sample combined with high feature dimensionality makes the sample-based covariance estimate unstable or even singular, so it becomes less effective in characterising the underlying data structure. For example, it is well known that the estimates of the largest and smallest eigenvalues tend to be biased in this case, and therefore some kind of regularisation has to be applied. Secondly and importantly, although high-dimensional visual features usually induce complex structure, there is often some prior knowledge available from specific tasks. This valuable prior knowledge should be adequately incorporated, especially when samples are scarce. Therefore, rigidly using the sample-based covariance estimate is not appropriate in this case.

Structure sparsity [9] may be the most common prior knowledge for high-dimensional data. In the terminology of probabilistic graphical models, a distribution can be illustrated as a graph, with each node corresponding to a feature component, and each edge indicating the presence of statistical dependence between the two linked nodes. In this case, structure sparsity means the sparsity of the graph, i.e., only a small number of edges exist. A typical example of such a situation is skeletal human action recognition. According to the kinematic configuration of the human body, only a small number of joints are directly linked. Another more general justification for assuming structure sparsity in high-dimensional data comes from the “Bet on Sparsity” principle [13]. That is, if the graph is truly sparse, we impose a correct prior and will better characterise the underlying structure. If the graph is dense, we will not lose much, because there is no way to recover the underlying structure in the case of small sample. In the literature, sparsity has been widely recognised in computer vision and implemented in various vision tasks [9]. In this work, we exploit structure sparsity to improve Cov-RP for skeletal human action recognition.

To impose structure sparsity we switch from covariance matrix to its inverse. This is because covariance matrix measures correlation of feature components, without discriminating direct and indirect correlation. In contrast, inverse covariance measures the partial (direct) correlation by factoring out the effects of other feature components, and this allows the sparsity prior to be conveniently imposed. Note that although inverse covariance has this nice property, it has not been used in Cov-RP before. This may be possibly due to two reasons. Firstly, when covariance matrix is singular, its inverse cannot be readily obtained. Secondly and more importantly, several similarity measures for covariance matrix, such as log-Euclidean kernel [14] and Stein kernel, are inverse invariant. That is, the same result will be obtained if using the inverse directly computed from the original (invertible) covariance matrix. Integrating structure sparsity helps to obtain a more precise inverse covariance, even if the original covariance matrix is less reliable or singular. And the sparse inverse covariance matrix will produce different result from the original covariance matrix. These arguments motivate us to compute sparse inverse covariance estimate to improve Cov-RP.
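The inverse-invariance argument above can be checked numerically. The following is a minimal sketch, assuming the log-Euclidean distance takes its usual form (the Frobenius norm of the difference of matrix logarithms); the helper names and the random SPD test matrices are illustrative assumptions, not part of the original method.

```python
import numpy as np

def spd_log(A):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def log_euclidean_dist(A, B):
    """Assumed log-Euclidean distance: ||log(A) - log(B)||_F."""
    return np.linalg.norm(spd_log(A) - spd_log(B), ord="fro")

rng = np.random.default_rng(0)

def random_spd(d):
    """Hypothetical well-conditioned SPD matrix used only for this check."""
    X = rng.standard_normal((d, 2 * d))
    return X @ X.T / (2 * d) + 0.1 * np.eye(d)

A, B = random_spd(5), random_spd(5)
d_cov = log_euclidean_dist(A, B)
# log(inv(A)) = -log(A), so the distance between the inverses is unchanged.
d_inv = log_euclidean_dist(np.linalg.inv(A), np.linalg.inv(B))
print(d_cov, d_inv)
```

Because the two distances coincide, a plain inverse of an invertible covariance matrix adds nothing under such a measure; only an estimate that differs from the plain inverse, such as a sparse one, can change the result.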

3.2 Sparse inverse covariance estimate (SICE)

Let us denote a Gaussian model by $\mathcal{N}(\mu, \Sigma)$, where $\Sigma$ denotes the covariance matrix and $S = \Sigma^{-1}$ is its inverse. Each off-diagonal entry $S_{ij}$ of $S$ measures the direct correlation between two feature components. It will be zero if components $i$ and $j$ are conditionally independent given all the remaining ones. The estimate of $S$, denoted by $S^{*}$, can be obtained by maximising a penalised log-likelihood of data, with a symmetric positive definite (SPD) constraint on $S$ [10, 11]. The optimal solution is called the sparse inverse covariance estimate (SICE):

$$S^{*} = \arg\max_{S \succ 0} \; \log\det(S) - \mathrm{tr}(CS) - \lambda \|S\|_{1}, \qquad (1)$$

where $C$ is the sample-based covariance matrix, while $\det(\cdot)$, $\mathrm{tr}(\cdot)$ and $\|\cdot\|_{1}$ denote the determinant, the trace and the sum of the absolute values of the entries of a matrix. $\|S\|_{1}$ imposes sparsity on $S$ to achieve more reliable estimation. The tradeoff between the degree of sparsity and the log-likelihood estimation of $S$ is controlled by the regularisation parameter $\lambda$. Increasing the value of $\lambda$ will reveal the underlying data structure at different sparsity levels, with a larger $\lambda$ inducing a sparser $S^{*}$. Note that $S^{*}$ is guaranteed to be SPD and therefore non-singular. In this way, we obtain a new covariance representation in which the sparse inverse covariance estimate is used instead.
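To make the statement about conditional independence concrete, here is a small worked sketch on a three-variable chain x1 -> x2 -> x3. The chain model, the coupling value `a` and the helper `partial_correlation` are assumptions chosen for illustration; no sparsity penalty is involved, only the exact covariance and its inverse.

```python
import numpy as np

a = 0.8  # hypothetical coupling strength along the chain x1 -> x2 -> x3
B = np.array([[0.0, 0.0, 0.0],
              [a,   0.0, 0.0],
              [0.0, a,   0.0]])
I = np.eye(3)

# Linear model x = B x + e with unit-variance noise e, so x = inv(I - B) e.
mix = np.linalg.inv(I - B)
Sigma = mix @ mix.T            # exact covariance matrix
S = np.linalg.inv(Sigma)       # exact inverse covariance matrix

def partial_correlation(S, i, j):
    """Partial correlation of components i and j given all the others."""
    return -S[i, j] / np.sqrt(S[i, i] * S[j, j])

cov_13 = Sigma[0, 2]                      # x1 and x3 are correlated through x2
pcorr_13 = partial_correlation(S, 0, 2)   # but not directly linked
pcorr_12 = partial_correlation(S, 0, 1)   # x1 and x2 are directly linked
print(cov_13, pcorr_13, pcorr_12)
```

The covariance between x1 and x3 is clearly non-zero, yet the corresponding entry of the inverse covariance vanishes, which is the kind of structure a sparsity prior on the inverse is meant to capture.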

3.3 Enriched SICE with hierarchical sparsity

From the perspective of feature representation, directly using $S^{*}$ may not be ideal due to the existence of the regularisation parameter $\lambda$. It is impractical to tune $\lambda$ for every individual sample to obtain a proper feature representation. Even if we use a fixed $\lambda$ value for all samples, this will add one extra algorithmic parameter to the recognition pipeline. Finding the single best $\lambda$ has to resort to multi-fold cross-validation, which increases the computation. Another issue is that representing the underlying feature structure at a single sparsity level may not be optimal, as different structures could appear at different levels of $\lambda$. The potentially complementary information at other sparsity levels should also be considered.

To resolve these issues and further improve this new representation, we propose to utilise a useful property of SICE, called the “monotonicity property” [11]. Specifically, this property means that by monotonically increasing $\lambda$ in Eq. (1), the resulting $S^{*}$ will gradually change from being denser to being sparser. The entries of the SICE matrix will gradually vanish and this change is irreversible. Therefore, we can use a set of $\lambda$ values arranged as $\lambda_1 < \lambda_2 < \cdots < \lambda_M$, where $M$ is the number of sparsity levels. With this property, we can efficiently and safely obtain a set of $S^{*}$ that are guaranteed to characterise the underlying feature structure from denser to sparser levels. In doing so, we obtain an enriched SICE representation, which consists of a hierarchy of SICE matrices.
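A hierarchy of this kind can be sketched as follows. This is not the GLASSO package used in the paper: it swaps in a compact ADMM solver for the penalised log-likelihood problem of Eq. (1) so that the block runs with NumPy alone. The function names, the penalty parameter `rho`, the iteration count, the lambda grid and the random data are all assumptions for illustration.

```python
import numpy as np

def sice_admm(C, lam, Z=None, U=None, rho=1.0, n_iter=300):
    """Approximate SICE for one lam by ADMM; returns (S, Z, U).

    S is SPD by construction; Z is its sparse, soft-thresholded copy.
    """
    d = C.shape[0]
    Z = np.zeros((d, d)) if Z is None else Z.copy()
    U = np.zeros((d, d)) if U is None else U.copy()
    for _ in range(n_iter):
        # S-update: rho*S - inv(S) = rho*(Z - U) - C, solved per eigenvalue.
        w, Q = np.linalg.eigh(rho * (Z - U) - C)
        s = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
        S = (Q * s) @ Q.T
        # Z-update: entrywise soft-thresholding enforces sparsity.
        A = S + U
        Z = np.sign(A) * np.maximum(np.abs(A) - lam / rho, 0.0)
        # Scaled dual update.
        U = U + S - Z
    return S, Z, U

def sice_hierarchy(C, lams):
    """SICE matrices for increasing lam values, warm-starting each solve."""
    Z, U, out = None, None, []
    for lam in sorted(lams):
        S, Z, U = sice_admm(C, lam, Z, U)
        out.append((lam, S, Z))
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))      # 4 samples, 6 features: fewer samples than dimensions
C = np.cov(X, rowvar=False)          # singular sample covariance
hierarchy = sice_hierarchy(C, [0.01, 0.1, 1.0, 10.0])

off_diag = ~np.eye(6, dtype=bool)
edge_counts = [int(np.count_nonzero(Z[off_diag])) for _, _, Z in hierarchy]
min_eigs = [float(np.linalg.eigvalsh(S).min()) for _, S, _ in hierarchy]
print(edge_counts, min_eigs)
```

Warm-starting each solve from the previous level is one simple way to build the whole path cheaply; every `S` stays SPD even though the sample covariance here is singular.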

4 Integration via discriminative learning

When performing recognition, we need to integrate a hierarchy of SICE matrices to measure the similarity of two samples. The simplest integration may be a convex linear combination. Also, it is known that SICE matrices (being SPD) reside on a Riemannian manifold. To respect this fact and make this linear combination more sound, we perform it in a kernel-induced feature space. The two versions of the combination method, one weighting each sparsity level (Section 4.1) and the other weighting each pair of sparsity levels (Section 4.2), are obtained through two discriminative learning algorithms as follows.

4.1 SICE-RP method

A hierarchy of SICE matrices obtained from a sample is mapped from the $d$-dimensional SPD Riemannian manifold into a kernel-induced space $\mathcal{F}$ by a nonlinear mapping $\Phi(\cdot)$. $\Phi$ is implicitly conducted by using a kernel (say, the log-Euclidean kernel in this paper). This mapping brings at least two advantages. Firstly, the Riemannian manifold geometry will be considered by using distance functions specially designed for SPD matrices. Secondly, the images of the SICE matrices under this mapping can be linearly combined in $\mathcal{F}$. Specifically, recall that $S_i$, $i = 1, \dots, M$, denote a hierarchy of SICE matrices extracted from one sample at different sparsity levels (to keep the notation concise, we omit the superscript “$*$” from each $S_i$). The linear combination can be expressed as

$$\Phi_{\theta}(S_1, \dots, S_M) = \sum_{i=1}^{M} \theta_i \Phi(S_i), \qquad (2)$$

where $\theta_i$ is the combination coefficient. We define a kernel function for the two hierarchies of SICE matrices from samples $X$ and $Y$ as

$$K_{\theta}(X, Y) = \Big\langle \sum_{i=1}^{M} \theta_i \Phi(S_i^{X}), \sum_{j=1}^{M} \theta_j \Phi(S_j^{Y}) \Big\rangle = \sum_{i=1}^{M} \sum_{j=1}^{M} \theta_i \theta_j \, k(S_i^{X}, S_j^{Y}), \qquad (3)$$

where $k(S_i^{X}, S_j^{Y}) = \langle \Phi(S_i^{X}), \Phi(S_j^{Y}) \rangle$. Note that $K_{\theta}$ and $k$ are two different kernels. The former is defined over two hierarchies of SICE matrices, while the latter is defined over two individual SICE matrices.

4.2 SICE-RP method

Different from the method in Section 4.1, which assigns a weight to each individual sparsity level, this method assigns a weight to each pair of sparsity levels as follows.

$$K_{W}(X, Y) = \sum_{i=1}^{M} \sum_{j=1}^{M} W_{ij} \, k(S_i^{X}, S_j^{Y}), \qquad (4)$$

where $W$ is a weight matrix with $W_{ij}$ corresponding to the $(i,j)$th entry of $W$. Still imposing the constraint of convex linear combination, $W$ is optimised as described in Section 4.3.

It is not difficult to see that the method in Section 4.1 is a special case of this one, because setting $W_{ij} = \theta_i \theta_j$ in Eq. (4) recovers Eq. (3). In the former, $W$ is restricted to a rank-one matrix $\theta\theta^{\top}$. Therefore, the latter has more flexibility to weight these cross-sparsity-level kernel evaluations.
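The relationship between the two integration schemes can be sketched in a few lines. The block below assumes a Gaussian form of the log-Euclidean kernel with a hypothetical bandwidth `gamma`, and uses random SPD matrices as stand-ins for real SICE hierarchies; the function names and the weight symbols `theta` and `W` are illustrative choices rather than the authors' notation.

```python
import numpy as np

def spd_log(A):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def base_kernel(A, B, gamma=0.1):
    """Assumed Gaussian log-Euclidean kernel between two SPD matrices."""
    diff = spd_log(A) - spd_log(B)
    return np.exp(-gamma * np.sum(diff ** 2))

def cross_level_gram(hier_x, hier_y):
    """Kernel values between every pair of sparsity levels of two samples."""
    return np.array([[base_kernel(Sx, Sy) for Sy in hier_y] for Sx in hier_x])

def kernel_level_weights(hier_x, hier_y, theta):
    """One weight per sparsity level: sum_ij theta_i theta_j k(S_i, S_j)."""
    return float(theta @ cross_level_gram(hier_x, hier_y) @ theta)

def kernel_pair_weights(hier_x, hier_y, W):
    """One weight per pair of levels: sum_ij W_ij k(S_i, S_j)."""
    return float(np.sum(W * cross_level_gram(hier_x, hier_y)))

rng = np.random.default_rng(0)

def random_spd(d):
    X = rng.standard_normal((d, 2 * d))
    return X @ X.T / (2 * d) + 0.1 * np.eye(d)

# Stand-in hierarchies with three levels each (not real SICE matrices).
hier_x = [random_spd(4) for _ in range(3)]
hier_y = [random_spd(4) for _ in range(3)]

theta = np.array([0.5, 0.3, 0.2])        # convex combination weights
k_vec = kernel_level_weights(hier_x, hier_y, theta)
k_mat = kernel_pair_weights(hier_x, hier_y, np.outer(theta, theta))
print(k_vec, k_mat)
```

With a rank-one weight matrix the two kernels coincide, so the pair-weighted form strictly generalises the level-weighted one; a full `W` can additionally reweight individual cross-level comparisons.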

4.3 Optimisation

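The combination coefficient $\theta$ or $W$ can then be viewed as the tunable parameter of the kernel $K_{\theta}$ or $K_{W}$, and its value can be sought by optimising a generalisation bound on classification performance, e.g., the radius margin bound, which is an upper bound of the leave-one-out error [15]. In the following, the optimisation of $\theta$ is given; $W$ can be obtained similarly.

We first consider a binary classification task and then extend the result to the multi-class case. Given a training set $\{(X_i, y_i)\}_{i=1}^{N}$ of $N$ samples, where, without loss of generality, the samples are labelled by $y_i \in \{+1, -1\}$, the optimal $\theta$ can be obtained by solving

$$\theta^{*} = \arg\min_{\theta} \; R^{2} \|w\|^{2}, \qquad (6)$$

where $R$ is the radius of the smallest sphere enclosing all the training samples, while $w$ denotes the normal of the SVM separating hyperplane, with $1/\|w\|$ being the margin. $R^{2}$ can be obtained by optimising the following problem:

$$R^{2} = \max_{\beta} \; \sum_{i=1}^{N} \beta_i K(X_i, X_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \beta_i \beta_j K(X_i, X_j), \quad \text{subject to: } \sum_{i=1}^{N} \beta_i = 1, \; \beta_i \geq 0, \qquad (7)$$

where $K(X_i, X_j)$ denotes the kernel function $K_{\theta}$ or $K_{W}$ defined over two hierarchies of SICE matrices $X_i$ and $X_j$ in the paper. And $\|w\|^{2}$ can be obtained by solving the following optimisation problem of SVM with $L_2$-norm soft margin:

$$\frac{1}{2}\|w\|^{2} = \max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \tilde{K}(X_i, X_j), \qquad (8)$$

subject to: $\sum_{i=1}^{N} \alpha_i y_i = 0$, $\alpha_i \geq 0$,

where $\tilde{K}(X_i, X_j) = K(X_i, X_j) + \frac{1}{C}\delta_{ij}$; $C$ is the regularisation parameter; $\delta_{ij} = 1$ if $i = j$, and $0$ otherwise.

How to solve Eq. (6) has been well studied in the literature. In brief, it can be optimised by iteratively 1) updating $R^{2}$ and $\|w\|^{2}$; 2) minimising $R^{2}\|w\|^{2}$ with respect to $\theta$ using gradient-based methods, as outlined in Algorithm 1.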

Input:  A training set $\{(X_i, y_i)\}_{i=1}^{N}$, stopping criteria: i) the total number of iterations $T$; ii) a small positive value $\epsilon$.
Output:  $\theta^{*}$.
1:  for $t = 1 : T$ do
2:     Solve $R^{2}$ and $\|w\|^{2}$ according to Eq. (7) and Eq. (8);
3:     Update $\theta$ by a gradient-based method;
4:     if $|J_{t+1} - J_{t}| \leq \epsilon$ ($J$ is defined as $R^{2}\|w\|^{2}$) then
5:        Break;
6:     end if
7:  end for
8:  return  $\theta^{*}$;
Algorithm 1 Proposed SICE-RP method with the radius margin bound.

For multi-class classification tasks, we employ the one-vs-one partitioning strategy and optimise the combination coefficients by using a pairwise combination of the radius margin bounds of binary SVM classifiers. Please refer to the supplementary material for more details.
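The radius term of the radius margin bound is the squared radius of the smallest sphere enclosing the training samples in the kernel-induced space. The sketch below computes it from a precomputed kernel matrix by maximising the standard dual objective over the probability simplex. It swaps in a simple Frank-Wolfe iteration with exact line search in place of a quadratic programming solver; the function name, the iteration budget and the toy linear-kernel data are assumptions for illustration.

```python
import numpy as np

def squared_radius(K, n_iter=5000):
    """Squared radius of the minimum enclosing ball from a kernel matrix K.

    Maximises sum_i b_i K_ii - b^T K b over the simplex with Frank-Wolfe.
    """
    n = K.shape[0]
    diag = np.diag(K).copy()
    beta = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        grad = diag - 2.0 * K @ beta
        i = int(np.argmax(grad))            # best simplex vertex
        direction = -beta.copy()
        direction[i] += 1.0
        slope = float(grad @ direction)     # Frank-Wolfe gap
        curvature = 2.0 * float(direction @ K @ direction)
        if slope <= 1e-12 or curvature <= 0.0:
            break
        step = min(1.0, slope / curvature)  # exact line search, clipped
        beta += step * direction
    r2 = float(beta @ diag - beta @ K @ beta)
    return r2, beta

# Toy check with a linear kernel: three collinear points at 0, 2 and 1.
points = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
K = points @ points.T
r2, beta = squared_radius(K)
print(r2, beta)
```

For this toy set the enclosing sphere is centred midway between the two extreme points, so the value should approach a squared radius of one. In a full implementation this step would alternate with an SVM solve and a gradient update of the combination weights.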

4.4 Differences from MKL and EMK

Multiple kernel learning (MKL) has been commonly used to combine different sources of information. Also, efficient match kernel (EMK) [16] has been well-known for evaluating the similarity of two sets of points. With our notations ($S_i^{X}$ denoting the SICE matrix of sample $X$ at the $i$th of $M$ sparsity levels, and $k$ the kernel over two individual SICE matrices), they can be expressed as follows, respectively.

$$K_{\mathrm{MKL}}(X, Y) = \sum_{i=1}^{M} \mu_i \, k(S_i^{X}, S_i^{Y}), \qquad K_{\mathrm{EMK}}(X, Y) = \frac{1}{M^{2}} \sum_{i=1}^{M} \sum_{j=1}^{M} k(S_i^{X}, S_j^{Y}),$$

where $\mu_i$ is the weight of the $i$th source in MKL.

Comparing them with our methods shows that i) MKL and the proposed integration methods differ in cross-sparsity-level comparison. MKL is often used to integrate heterogeneous sources, and different sources are usually not comparable. As a result, MKL only considers the similarity between samples from the same source, i.e. $S_i^{X}$ and $S_i^{Y}$, $i = 1, \dots, M$. In our case, SICE matrices at different sparsity levels are of the same type and therefore comparable. This allows us to explore the similarity across sources, i.e. $S_i^{X}$ and $S_j^{Y}$ with $i \neq j$, to measure the similarity between two samples. In this sense, our method enjoys more flexibility to capture information than MKL. ii) EMK uses an equal weight to combine the similarity measure over different pairs. In contrast, the weights in our methods are adaptively learned, allowing us to better align with a task.

5 Computational issues

Given a sample-based covariance matrix $C$ estimated from $n$ feature vectors of dimensionality $d$, the optimisation in Eq. (1) is proved to be convex and guaranteed to converge even when $n < d$ [10], and the SICE matrix can be efficiently obtained by the off-the-shelf package GLASSO [10]. As shown in [10], it takes only CPU second to obtain a SICE matrix. It also allows a path of SICE matrices to be built efficiently for different values of $\lambda$. Therefore, the proposed methods can be efficiently computed. In addition, note that the complexity of SICE-RP is independent of the feature number $n$ once $C$ is provided. Also, as with Cov-RP, SICE-RP still allows two sets with different numbers of features to be compared, because the resulting SICE matrix has a fixed size of $d \times d$.

6 Experimental result

We compare our three proposed methods (i.e., SICE-RP and its two integration variants from Section 4) with both the classic Cov-RP and several state-of-the-art methods mainly in skeletal human action recognition. Four benchmark data sets are tested, including HDM05 [7], MSRC-12 [2], MSR-Action3D [17] and MSR-DailyActivity3D [18]. For all data sets, only skeleton data are used while other information (e.g., depth maps or RGB videos) is not utilised. The proposed methods are also evaluated in medical image analysis to demonstrate their potential for generalization.

A kernel SVM classifier is employed throughout all experiments, in which the log-Euclidean kernel [14] is used to measure the similarity of two SPD matrices. For a fair comparison, all algorithmic parameters are tuned by multi-fold cross-validation on the training set only. The sparsity parameter used in SICE-RP is also chosen by cross-validation on the training set. For the proposed integration methods, ten sparsity levels of SICEs are computed for each sample, corresponding to the very dense to very sparse representations. To compare with the state-of-the-art methods, the training and test sets of these data sets are partitioned by following the literature. For HDM05, we used the instances of two subjects for training and the remaining for test, as in [3]. For MSRC-12, MSR-Action3D and MSR-DailyActivity3D, the cross-subject test setting [19] is used, i.e., the odd-indexed subjects for training and the even-indexed ones for test.

Features are generated as follows. For HDM05 and MSRC-12, the 3D coordinates of each joint are used as the frame features, leading to a feature dimensionality of  ( joints) in HDM05 and  ( joints) in MSRC-12. For MSR-Action3D and MSR-DailyActivity3D, velocity is used as the frame features [20], which is calculated by the coordinate difference of 3D skeleton joints between a frame and its two direct neighbor frames. The dimensionality of the frame feature is joints). The frame number in each action instance is in HDM05, in MSRC-12, in MSR-Action3D and in MSR-DailyActivity3D. In Cov-RP, to address the singularity issue, a small regulariser (e.g., ) is appended as in [4].
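The feature pipeline described above can be sketched as follows. The block assumes that the velocity of a frame is the difference between the joint coordinates of its next and previous frames, which is one reading of "its two direct neighbor frames"; the array shapes, the joint count, the regulariser value `eps` and the function names are hypothetical and serve only to make the example run.

```python
import numpy as np

def velocity_features(joints):
    """joints: (T, J, 3) joint coordinates; returns (T - 2, 3 * J) velocities.

    Assumed definition: position at frame t + 1 minus position at frame t - 1.
    """
    vel = joints[2:] - joints[:-2]
    return vel.reshape(vel.shape[0], -1)

def regularised_covariance(features, eps=1e-3):
    """Sample covariance with a small scaled identity appended (eps is hypothetical)."""
    C = np.cov(features, rowvar=False)
    return C + eps * np.eye(C.shape[0])

rng = np.random.default_rng(0)
joints = rng.standard_normal((12, 20, 3))   # 12 frames, 20 joints (illustrative sizes)
F = velocity_features(joints)               # one 60-dimensional feature per frame
C_plain = np.cov(F, rowvar=False)           # rank-deficient: fewer frames than dimensions
C_reg = regularised_covariance(F)
print(F.shape, np.linalg.matrix_rank(C_plain), np.linalg.eigvalsh(C_reg).min())
```

With far fewer frames than feature dimensions the plain covariance is singular, which is why Cov-RP needs the appended regulariser before any log-based kernel can be evaluated.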

6.0.1 Result on HDM05 data set

HDM05 has about instances from over motion classes. Most classes have to realisations of five actors named “bd”, “bk”, “dg”, “mm” and “tr”. We use two subjects “bd” and “mm” for training and the remaining three for test by following [3]. The results are given in Table 1.

In addition to Cov-RP, the results of another six methods in the literature are also quoted in Table 1. These methods can be roughly categorised into linear and nonlinear representations (corresponding to the lower or the upper portion of Table 1, respectively). As for the linear representation methods, RSR [21], RSR-ML [3] and CDL [4] are covariance-based representations, but they further conduct dimensionality reduction on the covariance matrix using sparse coding or projection. We also test InverseCov-RP, which directly uses the inverse of the covariance matrix as representation without exploiting structure sparsity. As for the nonlinear representation methods, Cov--SVM [7] employs an infinite-dimensional covariance matrix in a kernel-induced feature space as representation. Ker-RP-RBF and Ker-RP-POL [5] are the two variants of the recently proposed non-linear kernel representations using RBF and polynomial kernels [5]. Note that our methods belong to the linear representations.

We follow [3] and test the classification accuracy on i) classes only (the left column in Table 1), and ii) all the 100 classes (the right column in Table 1).

For classes, Cov-RP shows quite competitive performance and outperforms the linear representations of RSR, RSR-ML, CDL and the nonlinear representation Cov--SVM. As expected, InverseCov-RP obtains the same performance as Cov-RP since the log-Euclidean kernel is inverse-invariant. The nonlinear kernel representations of Ker-RP-RBF and Ker-RP-POL [5] outperform Cov-RP. In comparison, the three proposed methods demonstrate remarkable performance. SICE-RP achieves a high classification accuracy of %, on a par with Ker-RP-RBF [5] and better than all the other quoted methods. This may indicate the efficacy of exploring the structure sparsity. Moreover, SICE-RP further boosts the classification accuracy from % to %, updating the state-of-the-art performance.

For all the 100 classes, the overall classification accuracy decreases due to the significant increase of the number of action classes. In this case, the proposed SICE-RP still outperforms all the quoted ones in comparison. It achieves a significant improvement of percentage points over Cov-RP, and even outperforms the non-linear kernel representation methods Ker-RP-RBF and Ker-RP-POL [5]. When integrating the hierarchy of SICEs by the proposed methods, the improvement becomes more salient. Specifically, SICE-RP achieves a classification accuracy of %, which is percentage points higher than Cov-RP and percentage points higher than Ker-RP-RBF [5].

classes All classes
Methods in comparison Accuracy Accuracy
Methods using nonlinear representation
Cov--SVM [7] Not reported
Ker-RP-POL [5]
Ker-RP-RBF [5]
Methods using linear representation
RSR [21] Not reported
RSR-ML [3]
CDL [4] Not reported
Cov-RP [1]
SICE-RP (proposed)
SICE-RP (proposed)
SICE-RP (proposed)
Table 1: Comparison on HDM05 data set (Two experiments).

Methods in comparison Accuracy
Methods using nonlinear representation
Cov--SVM [7] 89.8
Ker-RP-POL [5]
Ker-RP-RBF [5]
Methods using linear representation
Hierarchy of Cov3DJs [2]
Cov-RP [1]
SICE-RP (proposed)
SICE-RP (proposed)
SICE-RP (proposed)
Table 2: Comparison on MSRC-12 data set.

6.0.2 Result on MSRC-12 data set

MSRC-12 contains gesture categories from subjects. As shown in Table 2, SICE-RP again outperforms all the existing methods, including Cov-RP and the non-linear kernel representation methods in [5]. By integrating the hierarchy of SICEs via our proposed integration methods, the classification accuracy of SICE-RP can be further improved to % by SICE-RP and to % by SICE-RP. This reinforces the effectiveness of the proposed SICE-RP and the integration methods.

In addition, to provide insight on the proposed representation, we visualise the SICE matrices computed on this data set to show the identified underlying structure of the interactions between skeletal joints for different actions. The results can be found in Fig. 1 and the supplementary material.

6.0.3 Result on MSR-Action3D data set

MSR-Action3D contains categories of actions from ten subjects. Each action is performed two or three times by each subject. The results are given in Table 3. Note that several non-Cov-related methods in the literature are also quoted in the left portion of this table.

As can be seen, although Cov-RP performs poorly in this case, the proposed SICE-RP achieves an accuracy up to %, bringing an improvement of percentage points over Cov-RP, and percentage points over Cov--SVM. It is interesting to see that SICE-RP also outperforms the methods in the left portion of Table 3, which involve complex representations of features, e.g., sparse coding [18], or use additional information like depth maps [22]. When a hierarchy of SICEs is integrated by the two proposed integration methods, the performance can be further boosted to %, reaching the state-of-the-art performance of the nonlinear representation Ker-RP-RBF [5]. Although the two integration methods tie Ker-RP-RBF in [5], they have an advantage: the methods in [5] require prior knowledge to select appropriate kernel functions for the representation, which is not needed in ours.

Methods in comparison | Accuracy | Methods using nonlinear representation | Accuracy
Structured Streaming Skeletons [23] | | Cov--SVM [7] |
DBN+HMM [24] | | Ker-RP-POL [5] |
Actionlet Ensemble [25] | | Ker-RP-RBF [5] |
Pose Set [26] | | Methods using linear representation |
Moving Pose [20] | | Hierarchy of Cov3DJs [2] |
Lie Group [27] | | Cov-RP [1] |
SNV [18] | | InverseCov-RP |
Spatiotemp. Features Fusing [22] | | SICE-RP (proposed) |
DL-GSGC+TPM [17] | | SICE-RP (proposed) |
 | | SICE-RP (proposed) |
Table 3: Comparison on MSR-Action3D data set.

6.0.4 Result on MSR-DailyActivity3D data set

MSR-DailyActivity3D involves human-object interactions such as drink, eat, read book, etc. The results are given in Table 4. On this data set, the best performance is achieved by the nonlinear representation methods Ker-RP-POL and Ker-RP-RBF [5]. Cov-RP performs close to some of the state-of-the-art non-Cov-based representations. Our SICE-RP once again demonstrates reasonably good performance, with an accuracy of %, significantly better than most of the quoted state-of-the-art results. Specifically, it outperforms Cov-RP by a large margin of percentage points. The accuracy can be further improved to % through integrating multiple SICEs using SICE-RP or SICE-RP. This result is close to the highest accuracy of % obtained by Ker-RP-POL. Note that additional information is used in some state-of-the-art methods, such as depth maps [28, 18] and local occupancy patterns [25], while our SICE-RP solely utilises the skeleton data.

Methods in comparison Accuracy
Moving Pose [20]
Local HON4D [28]
Actionlet Ensemble [25]
SNV [18]
Methods using nonlinear representation
Cov--SVM [7]
Ker-RP-POL [5]
Ker-RP-RBF [5]
Methods using linear representation
Cov-RP [1]
SICE-RP (proposed)
SICE-RP (proposed)
SICE-RP (proposed)
Table 4: Comparison on MSR-DailyActivity3D data set.

Methods in comparison Accuracy
AttributedGraph [29] 62.8
NetStructure [30] 69.6
Methods using nonlinear representation
Cov--SVM [7] 67.8
Ker-RP-POL [5]
Ker-RP-RBF [5]
Methods using linear representation
Cov-RP [1]
SICE-RP (proposed)
SICE-RP (proposed)
SICE-RP (proposed)
Table 5: Comparison on ADHD-200 data set.

6.0.5 Comparison on medical image analysis

Cov-RP is also used in brain image analysis. The benchmark data set ADHD-200 is tested for this case to verify the potential generalisation of the proposed methods. It is provided by the Neuro Bureau for differentiating Attention Deficit Hyperactivity Disorder (ADHD) from healthy control subjects. ADHD-200 consists of resting-state functional MRI (fMRI) images of training and test subjects. The fMRI images are preprocessed with the Athena pipeline (http://neurobureau.projects.nitrc.org/ADHD200/Introduction.html), after which each subject is characterised by the averaged time series from each of brain regions. A covariance matrix is estimated for each subject based on time points. Therefore, this task suffers from the issue of small sample size versus high feature dimensionality, similar to the skeletal human action recognition task.
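The small-sample issue described above can be illustrated with a short sketch. When the number of time points is smaller than the number of regions, the sample covariance matrix is rank deficient and therefore cannot be inverted directly, which is one motivation for a regularised estimate such as SICE. The sizes below (20 time points, 50 regions) and the random data are hypothetical stand-ins chosen for illustration, not the ADHD-200 values.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance of X with shape (n_samples, n_features)."""
    Xc = X - X.mean(axis=0, keepdims=True)  # centre each feature
    return Xc.T @ Xc / (X.shape[0] - 1)

rng = np.random.default_rng(0)

# Hypothetical sizes: fewer time points (samples) than brain regions (features).
n_timepoints, n_regions = 20, 50
X = rng.standard_normal((n_timepoints, n_regions))

C = sample_covariance(X)
rank = np.linalg.matrix_rank(C)

# Centring removes one degree of freedom, so the rank is at most
# n_timepoints - 1, which is below n_regions: C is singular.
print("covariance shape:", C.shape)
print("rank:", rank, "of", n_regions)
```

Because the matrix is singular, any representation that relies on its inverse (or on its matrix logarithm) needs some form of regularisation or structural prior, such as the sparsity assumption used by SICE.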

In the literature, the state-of-the-art classification accuracy on this data set is % in [30]. The comparison between the existing methods and ours is presented in Table 5. As seen, Cov-RP (of brain regions) only obtains an accuracy of %, much worse than that in [30]. This is probably due to the unreliable covariance estimation suffering from the small sample problem. The nonlinear representation Ker-RP-RBF [5] also performs poorly, with a classification accuracy below %. In contrast, SICE-RP obtains an accuracy of %, beating both Cov-RP and Ker-RP-RBF and comparable with that in [30]. This may be attributed to the fact that the brain network is very sparse. The integration of SICEs seems very effective on this data set, as SICE-RP and SICE-RP significantly boost the classification accuracy to % and %, respectively. This demonstrates that our methods could potentially be generalised to other applications with small sample size and high dimensionality.

We have verified the effectiveness of exploiting structure sparsity in skeletal human action recognition and medical image analysis. For these tasks, the number of samples is relatively small and the dimensionality is high. Also, the prior knowledge on structure sparsity is clear in these tasks. As a sanity check, we further investigate how the proposed methods perform on tasks with lower feature dimensions and a larger number of feature vectors. This sanity check experiment agrees with the principle of “Bet on sparsity” and suggests that exploiting structure sparsity can maintain competitive performance and be a safe option for more applications. Refer to the supplementary material for details.

6.0.6 Comparison with MKL and EMK

As shown in section 4.4, MKL does not utilise the cross-source information and is a special case of SICE-RP. An experiment is conducted to compare SICE-RP, SICE-RP and MKL to investigate if the cross-source information helps. The four human action recognition data sets are used. As seen in Table 6, MKL is only able to improve the performance of SICE-RP on MSRC-12 and MSR-DailyActivity3D. However, this is still inferior to the best classification performance achieved by our proposed SICE-RP and SICE-RP. This experiment also compares SICE-RP, SICE-RP and EMK. As seen, EMK only improves SICE-RP on MSRC-12 and HDM05 (100 classes), but is worse than SICE-RP and SICE-RP on all action recognition data sets. This demonstrates the advantage of our adaptive integration methods.

HDM05 (14 classes)
HDM05 (100 classes)
Table 6: Comparison between SICE-RP, SICE-RP, MKL and EMK on human action recognition data sets.

7 Conclusion and future work

To address the new issues encountered by covariance representation, we propose to improve the quality of characterising the underlying structure of visual features, and this leads to the use of the SICE matrix as a generic feature representation. This new representation exploits the structure sparsity potentially existing among feature components, and is therefore more robust against sample scarcity and high feature dimensionality. The significant improvement achieved by this new representation is verified in skeletal human action recognition and medical image analysis. Also, the two integration methods developed in this work further improve recognition performance while avoiding the search for a single best sparsity level. Future work will apply the proposed representation to more vision tasks, investigate its efficacy in unsupervised learning scenarios, and explore its interaction with nonlinear representations.

References


  • [1] Tuzel, O., Porikli, F., Meer, P.: Region covariance: A fast descriptor for detection and classification. In: ECCV 2006, Part II. (2006) 589–600
  • [2] Hussein, M.E., Torki, M., Gowayyed, M.A., El-Saban, M.: Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations. In: IJCAI 2013, Beijing, China, August 3-9, 2013. (2013)
  • [3] Harandi, M.T., Salzmann, M., Hartley, R.: From manifold to manifold: geometry-aware dimensionality reduction for spd matrices. In: ECCV. Springer (2014) 17–32
  • [4] Wang, R., Guo, H., Davis, L.S.: Covariance discriminative learning: A natural and efficient approach to image set classification. In: CVPR. (2012) 2496–2503
  • [5] Wang, L., Zhang, J., Zhou, L., Tang, C., Li, W.: Beyond covariance: Feature representation with nonlinear kernel matrices. In: ICCV, IEEE (2015)
  • [6] Quang, M.H., San Biagio, M., Murino, V.: Log-hilbert-schmidt metric between positive definite operators on hilbert spaces. In: NIPS. (2014) 388–396
  • [7] Harandi, M., Salzmann, M., Porikli, F.: Bregman divergences for infinite dimensional covariance matrices. In: CVPR, IEEE (2014) 1003–1010
  • [8] Lehrmann, A.M., Gehler, P.V., Nowozin, S.: A non-parametric bayesian network prior of human pose. In: ICCV, IEEE (2013) 1281–1288
  • [9] Huang, J., Zhang, T., Metaxas, D.: Learning with structured sparsity. The Journal of Machine Learning Research 12 (2011) 3371–3412
  • [10] Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3) (2008) 432–441
  • [11] Huang, S., Li, J., Sun, L., Ye, J., Fleisher, A., Wu, T., Chen, K., Reiman, E.: Learning brain connectivity of alzheimer’s disease by sparse inverse covariance estimation. NeuroImage 50(3) (2010) 935–949
  • [12] Zhou, L., Wang, L., Ogunbona, P.: Discriminative sparse inverse covariance matrix: Application in brain functional network classification. In: CVPR. (2014) 3097–3104
  • [13] Hastie, T., Tibshirani, R., Friedman, J., Franklin, J.: The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27(2) (2005) 83–85
  • [14] Jayasumana, S., Hartley, R., Salzmann, M., Li, H., Harandi, M.: Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In: CVPR. IEEE (2013) 73–80
  • [15] Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Machine learning 46(1-3) (2002) 131–159
  • [16] Bo, L., Sminchisescu, C.: Efficient match kernel between sets of features for visual recognition. In: Advances in neural information processing systems. (2009) 135–143
  • [17] Luo, J., Wang, W., Qi, H.: Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In: ICCV, IEEE (2013) 1809–1816
  • [18] Yang, X., Tian, Y.: Super normal vector for activity recognition using depth sequences. In: CVPR, IEEE (2014) 804–811
  • [19] Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3d points. In: CVPRW, IEEE (2010) 9–14
  • [20] Zanfir, M., Leordeanu, M., Sminchisescu, C.: The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In: ICCV, IEEE (Dec 2013) 2752–2759
  • [21] Harandi, M.T., Sanderson, C., Hartley, R., Lovell, B.C.: Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach. In: ECCV. Springer (2012) 216–229
  • [22] Zhu, Y., Chen, W., Guo, G.: Fusing spatiotemporal features and joints for 3d action recognition. In: CVPRW, IEEE (2013) 486–491
  • [23] Zhao, X., Li, X., Pang, C., Zhu, X., Sheng, Q.Z.: Online human gesture recognition from motion data streams. In: ACMMM, ACM (2013) 23–32
  • [24] Wu, D., Shao, L.: Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In: CVPR, IEEE (2014) 724–731
  • [25] Wang, J., Liu, Z., Wu, Y., Yuan, J.: Learning actionlet ensemble for 3d human action recognition. IEEE TPAMI 36(5) (2014) 914–927
  • [26] Wang, C., Wang, Y., Yuille, A.L.: An approach to pose-based action recognition. In: CVPR, IEEE (2013) 915–922
  • [27] Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3d skeletons as points in a lie group. In: CVPR, IEEE (2014) 588–595
  • [28] Oreifej, O., Liu, Z.: Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences. In: CVPR, IEEE (2013) 716–723
  • [29] Dey, S., Rao, A.R., Shah, M.: Attributed graph distance measure for automatic detection of attention deficit hyperactive disordered subjects. Frontiers in neural circuits 8 (2014)
  • [30] Dey, S., Rao, A.R., Shah, M.: Exploiting the brain’s network structure in identifying adhd subjects. Frontiers in systems neuroscience 6 (2012)