1 Introduction
Covariance-based feature representation (CovRP in short) uses the covariance matrix of a predefined visual feature vector to represent an image region, a whole image, a set of images, or a sequence of video frames. It has been applied to various vision tasks, including object detection and recognition [1], action recognition [2, 3], and image set classification [4], to name a few. Through these applications, CovRP has gradually developed from a local region descriptor into a more generic visual representation [5]. This trend is also reflected in the improvements proposed for the representation, from fast computation of the covariance region descriptor [1], through novel measures and algorithms [6], to the recent use of nonlinear kernel techniques to characterise or even replace the covariance matrix [7, 5].

Some visual recognition tasks face the issues of small sample size and high feature dimensionality. A typical example is skeletal human action recognition, where a high-dimensional feature vector is usually required to describe each video frame while the number of frames in one action instance is limited. This makes the covariance estimate unstable or even singular, significantly affecting the effectiveness of CovRP. A number of remedies have been proposed to address this fundamental issue, including appending a scaled identity matrix [4] or using a kernel matrix instead [5]. They have effectively improved the performance of CovRP in various recognition tasks.

In this paper, we note that CovRP essentially aims to characterise the underlying structure of visual features. However, the existing remedies have not paid sufficient attention to the appropriateness of the sample-based covariance matrix for this goal, and still treat the available samples as the only source of information. This makes it hard for CovRP to match the complexity of the tasks in recent applications.
Figure 1: (a) The "Crouch or hide" action from the MSRC-12 data set. (b) The proposed SICE-RP representation.
To address the above issues, this paper argues that an equally, if not more, important approach to boosting CovRP is to better characterise the underlying structure of visual features. In particular, for the task of skeleton-based human action recognition, prior knowledge on the feature structure is readily available [8], and the most common such prior may be structure sparsity [9]. Inspired by this observation, this work investigates the effectiveness of exploiting structure sparsity for CovRP, specifically in skeleton-based human action recognition. To conveniently accommodate this prior knowledge, we migrate from the covariance matrix to its inverse, and take advantage of the sparse inverse covariance estimation (SICE) technique [10] to serve our purpose. In doing so, we produce a new representation in which SICE serves as the basic unit. An example of this representation is illustrated in Fig. 1.
Exploiting structure sparsity brings the following advantages to this new representation. Firstly, it avoids a direct use of the covariance estimate, which can be unreliable with small samples, and it is completely free of the singularity issue in that case. Secondly, it can characterise high-dimensional visual features, which usually exhibit structure sparsity, more effectively than CovRP. In addition, the off-diagonal elements of the inverse covariance matrix correspond to partial correlations between pairs of feature components; these factor out the influence of the other components, giving the inverse an advantage over the covariance matrix in modelling the essence of data relationships.
Moreover, we extend this new feature representation by utilising the monotonicity property of SICE [11]. Through it, we efficiently obtain a hierarchy of SICE matrices reflecting the feature structure at different levels of sparsity, and use all of them to produce an enriched representation. Accordingly, two discriminative learning algorithms are developed to adaptively integrate the hierarchy when measuring the similarity between samples. This extension not only further improves recognition performance, but also avoids choosing a single best sparsity level for the SICE process, which could be inefficient and suboptimal in practice.
To validate our approach and demonstrate its advantages, an extensive experimental study is conducted to compare it with existing covariance-based representations and the state-of-the-art comparable methods on various data sets for skeleton-based human action recognition. As will be shown, the proposed feature representation achieves significant improvement over these methods. In particular, compared with the recent methods employing nonlinear kernel techniques, our new representation, as a fully linear technique, shows comparable or even better recognition performance, demonstrating its potential. In addition, those nonlinear kernel methods require prior knowledge to select appropriate kernel functions for the representation, which is not needed in ours.
Our contributions are summarised as follows. (i) To the best of our knowledge, we are the first to improve CovRP from the perspective of feature structure modelling and to exploit sparsity for this representation. (ii) Our approach produces a new representation based on the SICE matrix, and achieves significant improvement over existing CovRP and the other comparable methods. (iii) We extend this new representation to use a hierarchy of SICE matrices and develop discriminative learning algorithms to adaptively integrate them, further improving recognition performance. (iv) An extensive experimental study is conducted to verify the proposed approach on skeletal human action recognition. In addition, we also demonstrate its remarkable performance in brain image analysis, which shows its potential for generalisation. It is worth noting that the proposed method is fundamentally different from that in [12]. In that work, SICE was used as a sparse Gaussian model to represent each class, and each sample was still represented by a feature vector. In contrast, we use SICE as a representation for each sample; accordingly, [12] did not classify SICE matrices while we do.
2 Related Work
Given a predefined visual feature vector, the statistical variation and mutual correlation of its components over a set can be used to represent that set. CovRP is based on this idea and implements it effectively. The specific definitions of the visual features and the set vary with applications. For example, in the early application to object detection, the features are simply the location and intensity of each pixel, while the set is an image region (a set of pixels) [4]. In this case, a small covariance matrix is computed to represent the image region. In the recent application to skeletal human action recognition, the features are the coordinates of the skeletal joints in a video frame, while the set is a video sequence recording an action instance [2]. Accordingly, a covariance matrix of larger size is obtained to represent the action instance. Over the past decade, thanks to its simplicity, robustness to illumination change, and flexibility in comparing different-sized sets, the covariance representation has shown excellent performance in various vision tasks [1, 2, 4].
In the literature, major improvements on CovRP generally fall into three aspects. The first aspect is computational efficiency. One important improvement is the use of the integral image technique to significantly lower the computation on large image regions [1]. Another recent work uses dimension reduction to reduce the size of covariance matrices while maximally maintaining their original similarities [3]. The second aspect focuses on better evaluating the similarity of covariance matrices. Based on the theory of Riemannian manifolds, improvements in this aspect have produced a number of novel measures and algorithms, contributing to the theoretical development of CovRP [6]. The third aspect incorporates nonlinear kernel techniques to enhance CovRP in modelling complex feature relationships. One way is to nonlinearly map visual features to a kernel-induced feature space and compute the covariance matrix therein [7]. Another recent way is to compute a kernel matrix whose elements are the kernel values of visual features and use it to replace the covariance matrix [5]. As shown by that work, this not only largely avoids the singularity issue, but can also model the nonlinear relationship of feature components.
3 Proposed new feature representation
3.1 Motivation and basic idea
Our approach starts from reviewing the essence of CovRP. It is not difficult to see that this representation is essentially a characterisation of the underlying structure of the visual features distributed over a set. Specifically, it assumes a Gaussian model and uses the sample-based covariance estimate to characterise this structure. However, none of the existing methods on CovRP has paid sufficient attention to the appropriateness of such a covariance estimate. We argue that the following two issues have made it less appropriate. Firstly, the presence of small samples relative to high feature dimensionality makes the sample-based covariance estimate unstable or even singular, so it becomes less effective in characterising the underlying data structure. For example, it is well known that the estimates of the larger and smaller eigenvalues tend to be biased in this case, and therefore some kind of regularisation has to be appended. Secondly, and importantly, although high-dimensional visual features usually induce complex structure, there is often prior knowledge available from the specific task. This valuable prior knowledge should be adequately incorporated, especially when samples are scarce. Therefore, rigidly using the sample-based covariance estimate is not proper in this case.
Structure sparsity [9] may be the most common prior knowledge for high-dimensional data. In the terminology of probabilistic graphical models, a distribution can be illustrated as a graph, with each node corresponding to a feature component and each edge indicating the presence of statistical dependence between the two linked nodes. In this setting, structure sparsity means sparsity of the graph, i.e., only a small number of edges exist. A typical example of such a situation is skeletal human action recognition: according to the kinematic configuration of the human body, only a small number of joints are directly linked. Another more general justification for assuming structure sparsity in high-dimensional data comes from the "bet on sparsity" principle [13]. That is, if the graph is truly sparse, we impose a correct prior and will better characterise the underlying structure; if the graph is dense, we will not lose much, because there is no way to recover the underlying structure from a small sample anyway. In the literature, sparsity has been well recognised in computer vision and exploited in various vision tasks [9]. In this work, we exploit structure sparsity to improve CovRP for skeletal human action recognition.

To impose structure sparsity, we switch from the covariance matrix to its inverse. This is because the covariance matrix measures the correlation of feature components without discriminating direct from indirect correlation. In contrast, the inverse covariance measures the partial (direct) correlation by factoring out the effects of the other feature components, and this allows the sparsity prior to be conveniently imposed. Note that although the inverse covariance has this nice property, it has not been used in CovRP before. This is possibly due to two reasons. Firstly, when the covariance matrix is singular, its inverse cannot be readily obtained. Secondly, and more importantly, several similarity measures for covariance matrices, such as the log-Euclidean kernel [14] and the Stein kernel, are inverse-invariant: the same result is obtained if the inverse directly computed from the original (invertible) covariance matrix is used instead. Integrating structure sparsity helps obtain a more precise inverse covariance even when the original covariance matrix is unreliable or singular, and the sparse inverse covariance matrix produces a different result from the original covariance matrix. These arguments motivate us to compute the sparse inverse covariance estimate to improve CovRP.
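The advantage of partial over marginal correlation can be seen in a minimal numpy sketch. The chain data and the thresholds below are our own toy assumptions, not from the paper: for a chain x -> y -> z, x and z are strongly correlated, yet the corresponding entry of the inverse covariance is near zero because the influence of y is factored out.

```python
import numpy as np

# Toy chain x -> y -> z: x and z are correlated only *through* y,
# so cov(x, z) is large but the (x, z) entry of the inverse
# covariance (their partial correlation) should be near zero.
rng = np.random.default_rng(0)
n = 20000
x = rng.standard_normal(n)
y = x + 0.5 * rng.standard_normal(n)
z = y + 0.5 * rng.standard_normal(n)

data = np.stack([x, y, z], axis=0)   # 3 x n
cov = np.cov(data)                   # sample covariance
prec = np.linalg.inv(cov)            # inverse covariance

# Normalised partial correlation between components i and j:
# -prec[i, j] / sqrt(prec[i, i] * prec[j, j])
partial_xz = -prec[0, 2] / np.sqrt(prec[0, 0] * prec[2, 2])
corr_xz = cov[0, 2] / np.sqrt(cov[0, 0] * cov[2, 2])

print(f"marginal corr(x, z) = {corr_xz:.2f}")    # large
print(f"partial corr(x, z)  = {partial_xz:.2f}")  # near zero
```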
3.2 Sparse inverse covariance estimate (SICE)
Denote a Gaussian model by $\mathcal{N}(\mu, \Sigma)$, where $\Sigma$ denotes the covariance matrix and $\Theta = \Sigma^{-1}$ is its inverse. Each off-diagonal entry $\Theta_{ij}$ measures the direct correlation between two feature components: it is zero if components $i$ and $j$ are conditionally independent given all the remaining ones. The estimate of $\Theta$, denoted by $\hat{\Theta}$, can be obtained by maximising a penalised log-likelihood of the data, with a symmetric positive definite (SPD) constraint on $\Theta$ [10, 11]. The optimal solution is called the sparse inverse covariance estimate (SICE):
$$\hat{\Theta} = \arg\max_{\Theta \succ 0} \; \log\det(\Theta) - \mathrm{tr}(S\Theta) - \lambda \|\Theta\|_{1}, \qquad (1)$$

where $S$ is the sample-based covariance matrix, and $\det(\cdot)$, $\mathrm{tr}(\cdot)$ and $\|\cdot\|_{1}$ denote the determinant, the trace, and the sum of the absolute values of the entries of a matrix, respectively. The term $\|\Theta\|_{1}$ imposes sparsity on $\hat{\Theta}$ to achieve a more reliable estimate. The trade-off between the degree of sparsity and the log-likelihood of the data is controlled by the regularisation parameter $\lambda$: varying $\lambda$ reveals the underlying data structure at different sparsity levels, with a larger $\lambda$ inducing a sparser $\hat{\Theta}$. Note that $\hat{\Theta}$ is guaranteed to be SPD and therefore nonsingular. In this way, we obtain a new covariance representation in which the sparse inverse covariance estimate is used instead.
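As a concrete sketch of Eq. (1), the snippet below uses scikit-learn's `GraphicalLasso` (assumed here as a stand-in for the GLASSO package cited as [10]; its `alpha` plays the role of the regularisation parameter) to compute a SICE matrix in exactly the regime the paper targets, with fewer frames than feature dimensions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Small-sample regime: more feature dimensions (d) than frames (n),
# so the sample covariance is singular, yet the SICE estimate is SPD.
rng = np.random.default_rng(42)
n, d = 8, 12                       # e.g. few frames, many joint coordinates
X = rng.standard_normal((n, d))    # rows = frames, columns = features

S = np.cov(X, rowvar=False)        # sample covariance: rank-deficient here
model = GraphicalLasso(alpha=0.5)  # alpha plays the role of lambda in Eq. (1)
model.fit(X)
Theta = model.precision_           # the SICE matrix

eigvals = np.linalg.eigvalsh(Theta)
print("rank of S:", np.linalg.matrix_rank(S), "of", d)
print("smallest eigenvalue of SICE:", eigvals.min())
```

The sample covariance has rank at most n - 1 < d, yet the SICE matrix stays positive definite, which is the singularity-free property claimed above.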
3.3 Enriched SICE with hierarchical sparsity
From the perspective of feature representation, directly using $\hat{\Theta}$ may not be ideal because of the regularisation parameter $\lambda$. It is impractical to tune $\lambda$ for every individual sample to obtain a proper feature representation. Even if a fixed value is used for all samples, this adds one extra algorithmic parameter to the recognition pipeline, and finding the single best $\lambda$ requires multi-fold cross-validation, which increases the computation. Another issue is that representing the underlying feature structure at a single sparsity level may not be optimal, as different structures can appear at different levels of $\lambda$. The potentially complementary information at other sparsity levels should also be considered.
To resolve these issues and further improve the new representation, we utilise a nice property of SICE called the "monotonicity property" [11]. Specifically, as $\lambda$ in Eq. (1) increases monotonically, the resulting $\hat{\Theta}$ gradually changes from denser to sparser: entries of the SICE matrix progressively vanish, and this change is irreversible. Therefore, by using a set of values $\lambda_1 < \lambda_2 < \cdots < \lambda_K$, we can efficiently and safely obtain a set of SICE matrices that are guaranteed to characterise the underlying feature structure from denser to sparser levels. In doing so, we obtain an enriched SICE representation, which consists of a hierarchy of SICE matrices.
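A hierarchy of SICE matrices can be sketched as below, again assuming scikit-learn's `GraphicalLasso` as the SICE solver; the data and the four sparsity levels are purely illustrative:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Build a hierarchy of SICE matrices at increasing sparsity levels.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))

alphas = [0.05, 0.2, 0.5, 0.9]     # lambda_1 < lambda_2 < ... in Eq. (1)
hierarchy = []
for a in alphas:
    Theta = GraphicalLasso(alpha=a).fit(X).precision_
    hierarchy.append(Theta)

def n_offdiag_nonzero(Theta, tol=1e-6):
    """Count off-diagonal entries that survive at this sparsity level."""
    off = Theta - np.diag(np.diag(Theta))
    return int(np.sum(np.abs(off) > tol))

counts = [n_offdiag_nonzero(T) for T in hierarchy]
print("off-diagonal non-zeros per level:", counts)  # denser -> sparser
```

Each matrix in `hierarchy` is one level of the enriched representation; the non-zero counts shrink as `alpha` grows, reflecting the monotonicity property.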
4 Integration via discriminative learning
When performing recognition, we need to integrate a hierarchy of SICE matrices to measure the similarity between two samples. The simplest integration is a convex linear combination. Also, it is known that SICE matrices (being SPD) reside on a Riemannian manifold. To respect this fact and make the linear combination more sound, we perform it in a kernel-induced feature space. The two versions of the combination method are obtained through two discriminative learning algorithms, as follows.
4.1 SICE-RP method
A hierarchy of SICE matrices obtained from a sample is mapped from the Riemannian manifold of SPD matrices into a kernel-induced feature space $\mathcal{H}$ by a nonlinear mapping $\phi(\cdot)$, which is implicitly conducted through a kernel $k(\cdot,\cdot)$ (the log-Euclidean kernel in this paper). This mapping brings at least two advantages. Firstly, the geometry of the Riemannian manifold is respected by using distance functions specially designed for SPD matrices. Secondly, the images of the SICE matrices under this mapping can be linearly combined in $\mathcal{H}$. Specifically, let $S_1, \ldots, S_K$ denote a hierarchy of SICE matrices extracted from one sample at $K$ different sparsity levels (to keep the notation concise, the sample index is omitted from each $S_i$). The linear combination can be expressed as

$$\Phi(\mathcal{X}) = \sum_{i=1}^{K} w_i\, \phi(S_i), \qquad (2)$$

where $w_i$ is the combination coefficient. We define a kernel function for the two hierarchies of SICE matrices $\mathcal{X} = \{S_1, \ldots, S_K\}$ and $\mathcal{Y} = \{T_1, \ldots, T_K\}$ from two samples as

$$\tilde{k}(\mathcal{X}, \mathcal{Y}) = \langle \Phi(\mathcal{X}), \Phi(\mathcal{Y}) \rangle = \sum_{i=1}^{K} \sum_{j=1}^{K} w_i w_j\, k(S_i, T_j), \qquad (3)$$

where $k(S_i, T_j) = \langle \phi(S_i), \phi(T_j) \rangle$. Note that $\tilde{k}$ and $k$ are two different kernels: the former is defined over two hierarchies of SICE matrices, while the latter is defined over two individual SICE matrices.
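A minimal sketch of Eqs. (2)-(3) follows. We assume the Gaussian form of the log-Euclidean kernel, k(S, T) = exp(-gamma * ||log S - log T||_F^2), with an arbitrary gamma and toy SPD matrices; the exact kernel variant of [14] may differ:

```python
import numpy as np

def spd_logm(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_kernel(S, T, gamma=0.1):
    """Assumed Gaussian log-Euclidean kernel between two SPD matrices."""
    diff = spd_logm(S) - spd_logm(T)
    return np.exp(-gamma * np.sum(diff ** 2))

def hierarchy_kernel(Ss, Ts, w):
    """Eq. (3): k~(X, Y) = sum_ij w_i w_j k(S_i, T_j)."""
    return sum(w[i] * w[j] * log_euclidean_kernel(Ss[i], Ts[j])
               for i in range(len(Ss)) for j in range(len(Ts)))

# Toy hierarchies of SPD matrices for two samples.
rng = np.random.default_rng(1)
def random_spd(d=4):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

Ss = [random_spd() for _ in range(3)]
Ts = [random_spd() for _ in range(3)]
w = np.full(3, 1.0 / 3.0)          # convex combination weights

print(hierarchy_kernel(Ss, Ts, w))
```

Because the base kernel is symmetric, the hierarchy kernel is symmetric in its two arguments, as a valid kernel must be.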
4.2 SICE-RP method
Different from the method of Section 4.1, which assigns a weight to each individual sparsity level, this method assigns a weight to each pair of sparsity levels:

$$\tilde{k}_W(\mathcal{X}, \mathcal{Y}) = \sum_{i=1}^{K} \sum_{j=1}^{K} W_{ij}\, k(S_i, T_j), \qquad (4)$$

where $W$ is a weight matrix, with $W_{ij}$ corresponding to its $(i,j)$-th entry. Still imposing the constraint of a convex linear combination, $W$ is optimised by solving

$$\min_{W} \; R^2 \|v\|^2 \quad \mathrm{s.t.} \;\; W_{ij} \ge 0, \;\; \sum_{i,j} W_{ij} = 1, \qquad (5)$$

where $R$ and $v$ are the sphere radius and hyperplane normal defined in Section 4.3. It is not difficult to see that the method of Section 4.1 is a special case of this one, because Eq. (3) is recovered from Eq. (4) when $W$ is restricted to the rank-one matrix $w w^{\top}$. The pairwise weighting therefore has more flexibility to weight these cross-sparsity-level kernel evaluations.
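The rank-one relationship between Eq. (3) and Eq. (4) can be checked numerically; the base kernel values below are a toy table, not real SICE kernel evaluations:

```python
import numpy as np

# With a full weight matrix W, the hierarchy kernel weights every
# *pair* of sparsity levels; the vector-weight version is recovered
# when W is the rank-one matrix w w^T.
rng = np.random.default_rng(2)
K = 4                                    # number of sparsity levels
base = rng.random((K, K))                # toy k(S_i, T_j) values

w = rng.random(K)
w /= w.sum()                             # convex weights

k_vector = w @ base @ w                  # Eq. (3): sum_ij w_i w_j k(S_i, T_j)
W = np.outer(w, w)                       # rank-one weight matrix
k_matrix = np.sum(W * base)              # Eq. (4): sum_ij W_ij k(S_i, T_j)

print(k_vector, k_matrix)                # identical
```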
4.3 Optimisation
The combination coefficients $w$ or $W$ can be viewed as the tunable parameters of the kernel $\tilde{k}$ or $\tilde{k}_W$, and their values can be sought by optimising a generalisation bound on classification performance, e.g., the radius margin bound, which is an upper bound on the leave-one-out error [15]. In the following, the optimisation of $w$ is given; $W$ can be obtained similarly.

We first consider a binary classification task and then extend the result to the multi-class case. Given a training set of $m$ samples with hierarchies $\mathcal{X}_1, \ldots, \mathcal{X}_m$ and, without loss of generality, labels $y_i \in \{+1, -1\}$, the optimal $w$ can be obtained by solving

$$\min_{w} \; R^2 \|v\|^2 \quad \mathrm{s.t.} \;\; w_i \ge 0, \;\; \sum_{i=1}^{K} w_i = 1, \qquad (6)$$

where $R$ is the radius of the smallest sphere enclosing all the training samples in the kernel-induced feature space, and $v$ denotes the normal of the SVM separating hyperplane, with $1/\|v\|$ being the margin. $R^2$ can be obtained by optimising the following problem:

$$R^2 = \max_{\beta} \; \sum_{i=1}^{m} \beta_i\, \tilde{k}(\mathcal{X}_i, \mathcal{X}_i) - \sum_{i=1}^{m} \sum_{j=1}^{m} \beta_i \beta_j\, \tilde{k}(\mathcal{X}_i, \mathcal{X}_j) \quad \mathrm{s.t.} \;\; \beta_i \ge 0, \;\; \sum_{i=1}^{m} \beta_i = 1, \qquad (7)$$

where $\tilde{k}$ denotes the kernel function of Eq. (3) or Eq. (4) defined over two hierarchies of SICE matrices $\mathcal{X}_i$ and $\mathcal{X}_j$. $\|v\|^2$ can be obtained by solving the following optimisation problem of an SVM with $L_2$-norm soft margin:

$$\|v\|^2 = \max_{\alpha} \; 2 \sum_{i=1}^{m} \alpha_i - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \left( \tilde{k}(\mathcal{X}_i, \mathcal{X}_j) + \frac{\delta_{ij}}{C} \right) \qquad (8)$$

subject to: $\alpha_i \ge 0, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0,$

where $\alpha = (\alpha_1, \ldots, \alpha_m)^{\top}$; $C$ is the regularisation parameter; and $\delta_{ij} = 1$ if $i = j$, and $0$ otherwise.

How to solve Eq. (6) has been well studied in the literature. In brief, it can be optimised by iteratively 1) updating $R^2$ and $\|v\|^2$, and 2) minimising the bound with respect to $w$ using gradient-based methods, as outlined in Algorithm 1.
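As one piece of the alternation, the radius in Eq. (7) can be computed with a generic solver. This is a hypothetical sketch using scipy's SLSQP rather than the dedicated QP solver a real implementation would use:

```python
import numpy as np
from scipy.optimize import minimize

# The radius R in Eq. (7) comes from the smallest enclosing sphere
# of the training points in kernel space, via the dual
#   R^2 = max_beta  sum_i beta_i K_ii - beta^T K beta,
#   s.t. beta_i >= 0, sum_i beta_i = 1.
def radius_squared(K):
    m = K.shape[0]
    obj = lambda b: -(b @ np.diag(K) - b @ K @ b)
    cons = ({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0, 1)] * m,
                   constraints=cons, method='SLSQP')
    return -res.fun

# Sanity check with a linear kernel: two points at distance 2
# lie on a sphere of radius 1, so R^2 should be ~1.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
K = X @ X.T
print(radius_squared(K))   # ~1.0
```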
For multi-class classification tasks, we employ the one-vs-one partitioning strategy and optimise $w$ using a pairwise combination of the radius margin bounds of the binary SVM classifiers. Please refer to the supplementary material for details.
4.4 Differences from MKL and EMK
Multiple kernel learning (MKL) has been commonly used to combine different sources of information, and the efficient match kernel (EMK) [16] is well known for evaluating the similarity of two sets of points. With our notation, they can be expressed respectively as

$$k_{\mathrm{MKL}}(\mathcal{X}, \mathcal{Y}) = \sum_{i=1}^{K} w_i\, k(S_i, T_i), \qquad k_{\mathrm{EMK}}(\mathcal{X}, \mathcal{Y}) = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} k(S_i, T_j).$$
Comparing them with our methods shows that: i) MKL and the proposed integration methods differ in cross-sparsity-level comparison. MKL is often used to integrate heterogeneous sources, and different sources are usually not comparable. As a result, MKL only considers the similarity between samples from the same source, i.e., between $S_i$ and $T_i$, $i = 1, \ldots, K$. In our case, SICE matrices at different sparsity levels are of the same type and therefore comparable. This allows us to explore similarities across sources, i.e., between $S_i$ and $T_j$ with $i \ne j$, when measuring the similarity between two samples. In this sense, our method enjoys more flexibility to capture information than MKL. ii) EMK uses equal weights to combine the similarity measures over different pairs. In contrast, the weights in our methods are adaptively learned, allowing us to better align with a task.
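Both special cases can be verified numerically; the kernel table below is a toy stand-in for actual k(S_i, T_j) evaluations:

```python
import numpy as np

# MKL and EMK as special cases of the pairwise weighting in Eq. (4):
# MKL uses only same-level pairs (a diagonal W), EMK uses uniform
# weights over all pairs; a learned full W subsumes both.
rng = np.random.default_rng(3)
K = 4
base = rng.random((K, K))                 # toy k(S_i, T_j) table
w = np.full(K, 1.0 / K)

k_mkl = sum(w[i] * base[i, i] for i in range(K))   # same-source pairs only
k_emk = base.mean()                                # equal weight, all pairs

W_diag = np.diag(w)                                # MKL written as Eq. (4)
W_unif = np.full((K, K), 1.0 / K**2)               # EMK written as Eq. (4)
print(np.sum(W_diag * base) - k_mkl)               # 0 up to rounding
print(np.sum(W_unif * base) - k_emk)               # 0 up to rounding
```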
5 Computational issues
Given a sample-based covariance matrix $S \in \mathbb{R}^{d \times d}$ estimated from $n$ feature vectors, the optimisation in Eq. (1) is proved to be convex and guaranteed to converge even when $n < d$ [10], and the SICE matrix can be efficiently obtained with the off-the-shelf package GLASSO [10]. As shown in [10], obtaining a SICE matrix takes only CPU seconds, and the package also allows a path of SICE matrices for different values of $\lambda$ to be built efficiently. Therefore, the proposed methods can be efficiently computed. In addition, note that the complexity of SICE-RP is independent of the number of feature vectors $n$ once $S$ is provided. Also, like CovRP, SICE-RP still allows two sets with different numbers of feature vectors to be compared, because the resulting SICE matrix has a fixed size of $d \times d$.
6 Experimental results
We compare our three proposed methods (i.e., the basic SICE-RP and its two integrated variants) with both the classic CovRP and several state-of-the-art methods, mainly in skeletal human action recognition. Four benchmark data sets are tested: HDM05 [7], MSRC-12 [2], MSR-Action3D [17] and MSR-DailyActivity3D [18]. For all data sets, only skeleton data are used; other information (e.g., depth maps or RGB videos) is not utilised. The proposed methods are also evaluated on medical image analysis to demonstrate their potential for generalisation.
A kernel SVM classifier is employed throughout all experiments, in which the log-Euclidean kernel [14] is used to measure the similarity of two SPD matrices. For a fair comparison, all algorithmic parameters are tuned by multi-fold cross-validation on the training set only. The sparsity parameter used in SICE-RP is also chosen by cross-validation on the training set. For the proposed integration methods, ten sparsity levels of SICEs are computed for each sample, ranging from very dense to very sparse representations. To compare with the state-of-the-art methods, the training and test sets of these data sets are partitioned by following the literature. For HDM05, we use the instances of two subjects for training and the remaining ones for test, as in [3]. For MSRC-12, MSR-Action3D and MSR-DailyActivity3D, the cross-subject test setting [19] is used, i.e., the odd-indexed subjects for training and the even-indexed ones for test.
Features are generated as follows. For HDM05 and MSRC-12, the 3D coordinates of each joint are used as the frame features, so the feature dimensionality is three times the number of joints in each data set. For MSR-Action3D and MSR-DailyActivity3D, velocity is used as the frame feature [20]; it is calculated from the coordinate differences of the 3D skeleton joints between a frame and its two direct neighbour frames. The number of frames per action instance varies across the data sets. In CovRP, to address the singularity issue, a small scaled identity regulariser is appended as in [4].
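The velocity feature can be sketched as follows; we assume a central-difference scheme (frame t+1 minus frame t-1), which is one plausible reading of "difference between a frame and its two direct neighbour frames" and may differ from the exact definition in [20]:

```python
import numpy as np

# Sketch of the velocity frame feature (assumed central difference).
def velocity_features(frames):
    """frames: (T, 3 * n_joints) array of stacked joint coordinates.
    Returns (T - 2, 3 * n_joints) velocity features."""
    frames = np.asarray(frames)
    return frames[2:] - frames[:-2]

# Toy sequence: 5 frames, 2 joints, every coordinate moving at unit speed.
T, n_joints = 5, 2
t = np.arange(T)[:, None]
frames = np.tile(t, (1, 3 * n_joints)).astype(float)   # position = time
vel = velocity_features(frames)
print(vel.shape)   # (3, 6)
print(vel[0])      # constant central-difference velocity of 2
```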
6.0.1 Result on HDM05 data set
HDM05 contains action instances from 100 motion classes. Most classes have multiple realisations performed by five actors named "bd", "bk", "dg", "mm" and "tr". We use the two subjects "bd" and "mm" for training and the remaining three for test, following [3]. The results are given in Table 2.
In addition to CovRP, the results of another six methods from the literature are also quoted in Table 2. These methods can be roughly categorised into linear and nonlinear representations (corresponding to the lower and upper portions of Table 2, respectively). Among the linear representation methods, RSR [21], RSR-ML [3] and CDL [4] are covariance-based representations, but they further conduct dimensionality reduction on the covariance matrix using sparse coding or projection. We also test InverseCovRP, which directly uses the inverse of the covariance matrix as the representation without exploiting structure sparsity. Among the nonlinear representation methods, Cov-SVM [7] employs an infinite-dimensional covariance matrix in a kernel-induced feature space as the representation. KerRP-RBF and KerRP-POL [5] are the two variants of the recently proposed nonlinear kernel representation using RBF and polynomial kernels [5]. Note that our methods belong to the linear representations.
We follow [3] and test the classification accuracy on i) 14 classes only (the left column in Table 2), and ii) all 100 classes (the right column in Table 2).
For the 14 classes, CovRP shows quite competitive performance and outperforms the linear representations RSR, RSR-ML and CDL as well as the nonlinear representation Cov-SVM. As expected, InverseCovRP obtains the same performance as CovRP, since the log-Euclidean kernel is inverse-invariant. The nonlinear kernel representations KerRP-RBF and KerRP-POL [5] outperform CovRP. In comparison, the three proposed methods demonstrate remarkable performance. SICE-RP achieves a high classification accuracy, on a par with KerRP-RBF [5] and better than all the other quoted methods, which may indicate the efficacy of exploiting structure sparsity. Moreover, the integrated variants further boost the classification accuracy, setting a new state of the art.
For all 100 classes, the overall classification accuracy decreases due to the significantly larger number of action classes. In this case, the proposed SICE-RP still outperforms all the quoted methods: it achieves a significant improvement over CovRP, and even surpasses the nonlinear kernel representation methods KerRP-RBF and KerRP-POL [5]. When the hierarchy of SICEs is integrated by the proposed methods, the improvement becomes more pronounced still.
| Methods in comparison | 14 classes, Accuracy | All classes, Accuracy |
| --- | --- | --- |
| Methods using nonlinear representation | | |
| Cov-SVM [7] | | Not reported |
| KerRP-POL [5] | | |
| KerRP-RBF [5] | | |
| Methods using linear representation | | |
| RSR [21] | | Not reported |
| RSR-ML [3] | | |
| CDL [4] | | Not reported |
| CovRP [1] | | |
| InverseCovRP | | |
| SICE-RP (proposed) | | |
| SICE-RP (proposed) | | |
| SICE-RP (proposed) | | |
| Methods in comparison | Accuracy |
| --- | --- |
| Methods using nonlinear representation | |
| Cov-SVM [7] | 89.8 |
| KerRP-POL [5] | |
| KerRP-RBF [5] | |
| Methods using linear representation | |
| Hierarchy of Cov3DJs [2] | |
| CovRP [1] | |
| InverseCovRP | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
6.0.2 Result on MSRC12 data set
MSRC-12 contains gesture categories collected from a group of subjects. As shown in Table 2, SICE-RP again outperforms all the existing methods, including CovRP and the nonlinear kernel representation methods in [5]. By integrating the hierarchy of SICEs via the proposed integration methods, the classification accuracy of SICE-RP can be improved further still. This reinforces the effectiveness of both the proposed SICE-RP and the integration methods.
In addition, to provide insight on the proposed representation, we visualise the SICE matrices computed on this data set to show the identified underlying structure of the interactions between skeletal joints for different actions. The results can be found in Fig. 1 and the supplementary material.
6.0.3 Result on MSRAction3D data set
MSR-Action3D contains categories of actions collected from ten subjects; each action is performed two or three times by each subject. The results are given in Table 3. Note that several non-covariance-based methods from the literature are also quoted in the left portion of this table.

As can be seen, although CovRP performs poorly in this case, the proposed SICE-RP achieves a substantial improvement over both CovRP and Cov-SVM. It is interesting to see that SICE-RP also outperforms the methods in the left portion of Table 3, which involve complex feature representations, e.g., sparse coding [18], or use additional information such as depth maps [22]. When a hierarchy of SICEs is integrated by the proposed methods, the performance can be further boosted, reaching the state-of-the-art performance of the nonlinear representation KerRP-RBF [5]. Although the integrated variants tie with KerRP-RBF, they have an advantage: the methods in [5] require prior knowledge to select appropriate kernel functions for the representation, which is not needed in ours.
| Methods in comparison | Accuracy |
| --- | --- |
| Structured Streaming Skeletons [23] | |
| DBN+HMM [24] | |
| Actionlet Ensemble [25] | |
| Pose Set [26] | |
| Moving Pose [20] | |
| Lie Group [27] | |
| SNV [18] | |
| Spatiotemp. Features Fusing [22] | |
| DL-GSGC+TPM [17] | |

| Methods in comparison | Accuracy |
| --- | --- |
| Methods using nonlinear representation | |
| Cov-SVM [7] | |
| KerRP-POL [5] | |
| KerRP-RBF [5] | |
| Methods using linear representation | |
| Hierarchy of Cov3DJs [2] | |
| CovRP [1] | |
| InverseCovRP | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
6.0.4 Result on MSRDailyActivity3D data set
MSR-DailyActivity3D involves human-object interactions such as drink, eat and read book. The results are given in Table 5. On this data set, the best performance is achieved by the nonlinear representation methods KerRP-POL and KerRP-RBF [5], while CovRP performs close to some of the state-of-the-art non-covariance-based representations. Our SICE-RP once again demonstrates reasonably good performance, significantly better than most of the quoted state-of-the-art results; specifically, it outperforms CovRP by a large margin. The accuracy can be further improved by integrating multiple SICEs with the proposed methods, giving a result close to the highest accuracy obtained by KerRP-POL. Note that additional information is used by some state-of-the-art methods, such as depth maps [28, 18] and local occupancy patterns [25], while our SICE-RP solely utilises the skeleton data.
| Methods in comparison | Accuracy |
| --- | --- |
| Moving Pose [20] | |
| Local HON4D [28] | |
| Actionlet Ensemble [25] | |
| SNV [18] | |
| Methods using nonlinear representation | |
| Cov-SVM [7] | |
| KerRP-POL [5] | |
| KerRP-RBF [5] | |
| Methods using linear representation | |
| CovRP [1] | |
| InverseCovRP | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
| Methods in comparison | Accuracy |
| --- | --- |
| AttributedGraph [29] | 62.8 |
| NetStructure [30] | 69.6 |
| Methods using nonlinear representation | |
| Cov-SVM [7] | 67.8 |
| KerRP-POL [5] | |
| KerRP-RBF [5] | |
| Methods using linear representation | |
| CovRP [1] | |
| InverseCovRP | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
| SICE-RP (proposed) | |
6.0.5 Comparison on medical image analysis
CovRP is also used in brain image analysis. The benchmark data set ADHD-200 is tested in this case to verify the potential generalisation of the proposed methods. It is provided by the Neuro Bureau for differentiating Attention Deficit Hyperactivity Disorder (ADHD) from healthy control subjects. ADHD-200 consists of resting-state functional MRI (fMRI) images of a set of training and test subjects. The fMRI images are preprocessed with the Athena pipeline (http://neurobureau.projects.nitrc.org/ADHD200/Introduction.html), after which each subject is characterised by the averaged time series of a set of brain regions. A covariance matrix is estimated for each subject from a limited number of time points. Therefore, this task suffers from the same small-sample, high-dimensionality issue as the skeletal human action recognition task.

In the literature, the state-of-the-art classification accuracy on this data set is 69.6% in [30]. The comparison between the existing methods and ours is presented in Table 5. As seen, CovRP (computed over the brain regions) obtains a much lower accuracy than that in [30]. This is probably due to the unreliable covariance estimate under the small-sample problem. The nonlinear representation KerRP-RBF [5] also performs poorly. In contrast, SICE-RP beats both CovRP and KerRP-RBF and is comparable with [30]. This may be attributed to the fact that brain networks are known to be sparse. The integration of SICEs proves especially effective on this data set, as the two integration methods significantly boost the classification accuracy further. This demonstrates that our methods can potentially generalise to other applications with small sample size and high dimensionality.

We have thus verified the effectiveness of exploiting structure sparsity in skeletal human action recognition and medical image analysis. In these tasks the number of samples is relatively small, the dimensionality is high, and the prior knowledge on structure sparsity is clear. As a sanity check, we further investigate how the proposed methods perform on tasks with lower feature dimensions and larger numbers of feature vectors. This sanity-check experiment agrees with the "bet on sparsity" principle and suggests that exploiting structure sparsity maintains competitive performance and is a safe option for a wider range of applications. Please refer to the supplementary material for details.
6.0.6 Comparison with MKL and EMK
As shown in Section 4.4, MKL does not utilise the cross-source information and is a special case of SICERP. An experiment is conducted to compare SICERP, SICERP and MKL to investigate whether the cross-source information helps. The four human action recognition data sets are used. As seen in Table 6, MKL is only able to improve the performance of SICERP on MSRC-12 and MSRDailyActivity3D, and even then remains inferior to the best classification performance achieved by our proposed SICERP and SICERP. This experiment also compares SICERP, SICERP and EMK. As seen, EMK only improves SICERP on MSRC-12 and HDM05 (100 classes), and is worse than SICERP and SICERP on all action recognition data sets. This demonstrates the advantage of our adaptive integration methods.
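The MKL baseline compared here reduces, in its simplest form, to a fixed-weight combination of the kernel matrices obtained at different sparsity levels, followed by a kernel classifier. A minimal uniform-weight sketch (synthetic Gram matrices and labels; not the paper's adaptive integration):

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch of the fixed-weight MKL-style baseline: combine kernel
# matrices computed at several sparsity levels with uniform weights and
# classify with a precomputed-kernel SVM.  All data are synthetic
# placeholders; the adaptive integration methods of the paper instead
# learn how the levels are combined.
rng = np.random.default_rng(0)
n_train, n_levels = 30, 3
y = rng.integers(0, 2, size=n_train)     # binary labels

# One Gram matrix per sparsity level (random PSD surrogates here).
kernels = []
for _ in range(n_levels):
    F = rng.standard_normal((n_train, 5))
    kernels.append(F @ F.T)              # F F^T is positive semidefinite

K = sum(kernels) / n_levels              # uniform-weight combination

clf = SVC(kernel="precomputed").fit(K, y)
pred = clf.predict(K)
print(pred.shape)
```

Because the weights are fixed in advance, such a combination cannot exploit cross-source information, which is the gap the adaptive integration methods address.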
Data set             SICERP  SICERP  SICERP  MKL  EMK
MSRC-12
HDM05 (14 classes)
HDM05 (100 classes)
MSRAction3D
MSRDailyActivity3D
Average
7 Conclusion and future work
To address the new issues encountered by covariance representation, we propose to improve the quality of characterising the underlying structure of visual features, and this leads to the use of the SICE matrix as a generic feature representation. This new representation exploits the structure sparsity potentially existing among feature components, and is therefore more robust against sample scarcity and high feature dimensionality. The significant improvement achieved by this new representation is verified in skeletal human action recognition and medical image analysis. In addition, the two integration methods developed in this work further improve recognition performance while avoiding the search for a single best sparsity level. Future work will apply the proposed representation to more vision tasks, investigate its efficacy in unsupervised learning scenarios, and explore its interaction with nonlinear representations.
References
 [1] Tuzel, O., Porikli, F., Meer, P.: Region covariance: A fast descriptor for detection and classification. In: ECCV 2006, Part II. (2006) 589–600
 [2] Hussein, M.E., Torki, M., Gowayyed, M.A., El-Saban, M.: Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations. In: IJCAI 2013, Beijing, China, August 3–9, 2013. (2013)
 [3] Harandi, M.T., Salzmann, M., Hartley, R.: From manifold to manifold: geometry-aware dimensionality reduction for SPD matrices. In: ECCV. Springer (2014) 17–32
 [4] Wang, R., Guo, H., Davis, L.S.: Covariance discriminative learning: A natural and efficient approach to image set classification. In: CVPR. (2012) 2496–2503
 [5] Wang, L., Zhang, J., Zhou, L., Tang, C., Li, W.: Beyond covariance: Feature representation with nonlinear kernel matrices. In: ICCV, IEEE (2015)
 [6] Quang, M.H., San Biagio, M., Murino, V.: Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces. In: NIPS. (2014) 388–396
 [7] Harandi, M., Salzmann, M., Porikli, F.: Bregman divergences for infinite dimensional covariance matrices. In: CVPR, IEEE (2014) 1003–1010

 [8] Lehrmann, A.M., Gehler, P.V., Nowozin, S.: A non-parametric Bayesian network prior of human pose. In: ICCV, IEEE (2013) 1281–1288
 [9] Huang, J., Zhang, T., Metaxas, D.: Learning with structured sparsity. The Journal of Machine Learning Research 12 (2011) 3371–3412
 [10] Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3) (2008) 432–441
 [11] Huang, S., Li, J., Sun, L., Ye, J., Fleisher, A., Wu, T., Chen, K., Reiman, E.: Learning brain connectivity of alzheimer’s disease by sparse inverse covariance estimation. NeuroImage 50(3) (2010) 935–949
 [12] Zhou, L., Wang, L., Ogunbona, P.: Discriminative sparse inverse covariance matrix: Application in brain functional network classification. In: CVPR. (2014) 3097–3104
 [13] Hastie, T., Tibshirani, R., Friedman, J., Franklin, J.: The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27(2) (2005) 83–85
 [14] Jayasumana, S., Hartley, R., Salzmann, M., Li, H., Harandi, M.: Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In: CVPR. IEEE (2013) 73–80

 [15] Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Machine Learning 46(1–3) (2002) 131–159
 [16] Bo, L., Sminchisescu, C.: Efficient match kernel between sets of features for visual recognition. In: NIPS. (2009) 135–143
 [17] Luo, J., Wang, W., Qi, H.: Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In: ICCV, IEEE (2013) 1809–1816
 [18] Yang, X., Tian, Y.: Super normal vector for activity recognition using depth sequences. In: CVPR, IEEE (2014) 804–811
 [19] Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3d points. In: CVPRW, IEEE (2010) 9–14
 [20] Zanfir, M., Leordeanu, M., Sminchisescu, C.: The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In: ICCV, IEEE (2013) 2752–2759
 [21] Harandi, M.T., Sanderson, C., Hartley, R., Lovell, B.C.: Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach. In: ECCV. Springer (2012) 216–229
 [22] Zhu, Y., Chen, W., Guo, G.: Fusing spatiotemporal features and joints for 3d action recognition. In: CVPRW, IEEE (2013) 486–491
 [23] Zhao, X., Li, X., Pang, C., Zhu, X., Sheng, Q.Z.: Online human gesture recognition from motion data streams. In: ACM MM, ACM (2013) 23–32
 [24] Wu, D., Shao, L.: Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In: CVPR, IEEE (2014) 724–731
 [25] Wang, J., Liu, Z., Wu, Y., Yuan, J.: Learning actionlet ensemble for 3d human action recognition. IEEE TPAMI 36(5) (2014) 914–927
 [26] Wang, C., Wang, Y., Yuille, A.L.: An approach to pose-based action recognition. In: CVPR, IEEE (2013) 915–922
 [27] Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3d skeletons as points in a lie group. In: CVPR, IEEE (2014) 588–595
 [28] Oreifej, O., Liu, Z.: HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In: CVPR, IEEE (2013) 716–723
 [29] Dey, S., Rao, A.R., Shah, M.: Attributed graph distance measure for automatic detection of attention deficit hyperactive disordered subjects. Frontiers in neural circuits 8 (2014)
 [30] Dey, S., Rao, A.R., Shah, M.: Exploiting the brain's network structure in identifying ADHD subjects. Frontiers in Systems Neuroscience 6 (2012)