1 Introduction
Video-based classification plays a key role in human motion analysis fields such as action and gesture recognition. Both fields have promising applications in many areas, including security and surveillance, content-based video analysis, human-computer interaction and animation. According to a recent survey on the recognition of human activities [28], the focus has shifted to methods that do not rely on human body models, where the information is extracted directly from the images, making them less dependent on reliable segmentation and tracking algorithms. Such image representation methods can be categorised into global and local approaches [22].
Methods with global image representation encode visual information as a whole. Ali and Shah [1] extract a series of kinematic features based on optical flow; a group of kinematic modes is then found using principal component analysis. Guo et al. [6] encode the same kinematic features using a sparse representation of covariance matrices. Several methods first divide the region of interest into a fixed spatial or temporal grid, extract features inside each cell and then combine them into a global representation. For example, this can be achieved using local binary patterns (LBP) [11], or histograms of oriented gradients (HOG) [27]. Global representations are sensitive to viewpoint, noise and occlusions, which may lead to unreliable classification. Furthermore, global representations depend on reliable localisation of the region of interest [22].

Local representations are designed to deal with the above-mentioned issues by describing the visual information as a collection of patches, usually at the cost of increased computation. Laptev and Lindeberg [14] extract interest points using a 3D Harris corner detector and use the points for modelling the actions. One of the major drawbacks is the low number of interest points that remain stable across an image sequence. A common solution is to work with windowed data, extracting salient regions which can be represented using Gabor filtering [4].
Wang et al. [31] showed that dense sampling approaches tend to perform better than interest point based approaches. Dense sampling is typically done for a set of patches inside the region of interest, with features extracted from each patch to form a descriptor. These descriptor representations differ from grid-based global representations in that the patches can have an arbitrary position and size, and in that the patches are not combined into a single representation but instead form a set of multiple representations. Examples are HOG and HOF (histogram of oriented flow) descriptors [15], SIFT descriptors [17], and their respective spatiotemporal versions, HOG3D [31] and 3D SIFT [26]. Because of the likely large number of descriptors and/or their high dimensionality, comparing sets of descriptors is often not straightforward. This has led to compressed representations, such as formulating sets of descriptors as bags-of-words [21].
In this paper we propose the use of spatiotemporal covariance descriptors for action and gesture recognition tasks. Flat region covariance descriptors were first proposed for the task of object detection and classification in images [29]. Each covariance descriptor represents the features inside an image region as a normalised covariance matrix. They have led to improved results over related descriptors such as HOG, in terms of detection performance as well as robustness to translation and scale [29]. Furthermore, covariance matrices provide a low dimensional representation which enables efficient comparison between sets of covariance descriptors.
The proposed spatiotemporal descriptors, which we name Cov3D, belong to the group of symmetric positive definite matrices, which do not form a vector space. They can instead be formulated as a connected Riemannian manifold, and taking into account the non-linear nature of the space of the descriptors may lead to improved classification results. The most common approach for classification on manifolds is to first map the points into an appropriate Euclidean representation [16] and then use traditional machine learning methods. A recent example of such mapping is the Riemannian locality preserving projection (RLPP) technique [8].

The Cov3D descriptors are extracted from spatiotemporal windows inside sample videos, with the number of possible windows being very large. As such, we use a boosting approach to search the windows and find a subset which is the most useful for classification. We propose to extend RLPP by weighting (WRLPP), in order to take into account the weights of the training samples. This weighted projection leads to a better representation of the neighbourhoods around the most critical training samples during each boosting iteration. The proposed Cov3D descriptors, in conjunction with the classification approach based on WRLPP boosting, lead to a state-of-the-art method for action and gesture recognition.
The paper continues as follows. In Section 2 we describe the spatiotemporal covariance descriptors, and use the concept of integral video to enable fast calculation inside any spatiotemporal window. In Section 3, we first overview the concept of Riemannian manifolds formulated in the context of positive definite symmetric matrices, and then detail the proposed boosting classification approach based on weighted Riemannian locality preserving projection. In Section 4, we compare the performance of the proposed method against several recent state-of-the-art methods on three benchmark datasets. Concluding remarks and possible future directions are given in Section 5.
2 Cov3D Descriptors
In this section we first present the general form of the proposed spatiotemporal covariance descriptors (Cov3D), then an algorithm for their fast calculation, and finally show how they can be specialised for action and gesture recognition. For convenience, we follow the notation in [29].
Let $I$ be the sequence of images and $F$ be the $d$-dimensional feature video extracted from $I$:

$F(x, y, t) = \Phi(I, x, y, t)$ (1)

where the function $\Phi$ can be any mapping such as intensity, colour, gradients, or optical flow. For a given spatiotemporal window $W$, let $\{f_i\}_{i=1}^{S}$ be the $d$-dimensional feature vectors inside $W$. The region is represented with the covariance matrix of the feature vectors:

$C = \frac{1}{S-1} \sum_{i=1}^{S} (f_i - \mu)(f_i - \mu)^T$ (2)

where $\mu$ is the mean of the points. Fig. 1 shows the construction of a covariance descriptor inside a spatiotemporal window. Examples of feature vectors specific for action and gesture recognition are given in Section 2.2.
Representing a spatiotemporal window with a covariance matrix has several advantages: (i) it is a low-dimensional representation which is independent of the size of the window; (ii) the impact of noisy samples is reduced through the averaging inherent in covariance computation; (iii) it is a straightforward method of fusing correlated features.
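As an illustrative sketch (assuming numpy; the function name is our own), the covariance descriptor of Eq. (2) can be computed directly from the feature vectors inside a window:

```python
import numpy as np

def cov3d_descriptor(feats):
    """Covariance descriptor of a set of d-dimensional feature vectors.

    feats: (S, d) array of the S feature vectors inside a
    spatiotemporal window, one vector per pixel.
    Returns the (d, d) covariance matrix of Eq. (2).
    """
    mu = feats.mean(axis=0)                    # mean feature vector
    centred = feats - mu
    return centred.T @ centred / (feats.shape[0] - 1)

# toy example: 1000 random 12-dimensional feature vectors
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 12))
C = cov3d_descriptor(feats)
```

Note that the result matches the textbook sample covariance, so it can be cross-checked against `np.cov`.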
2.1 Fast computation
Integral images are an intermediate image representation used for the fast calculation of region sums [30]. The concept has been extended to image sequences [10], where the integral images are stacked to form an integral video, which can be used to compute spatiotemporal region sums in constant time. For a video $V$, its integral video $\mathsf{IV}$ is defined as:

$\mathsf{IV}(x', y', t') = \sum_{x \leq x'} \sum_{y \leq y'} \sum_{t \leq t'} V(x, y, t)$ (3)
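A minimal sketch of the integral video and a constant-time region sum (assuming numpy; function names are our own). The region sum uses 8-corner inclusion-exclusion on the integral volume:

```python
import numpy as np

def integral_video(V):
    """Integral video of Eq. (3): cumulative sums over x, y and t."""
    return V.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)

def region_sum(IV, lo, hi):
    """Sum of V inside the box lo (exclusive) .. hi (inclusive),
    via 8-corner inclusion-exclusion; O(1) per query.
    lo, hi: (x, y, t) indices, with -1 allowed for 'before start'."""
    def at(x, y, t):
        if x < 0 or y < 0 or t < 0:
            return 0.0
        return IV[x, y, t]
    (x0, y0, t0), (x1, y1, t1) = lo, hi
    return (at(x1, y1, t1) - at(x0, y1, t1) - at(x1, y0, t1) - at(x1, y1, t0)
            + at(x0, y0, t1) + at(x0, y1, t0) + at(x1, y0, t0) - at(x0, y0, t0))

rng = np.random.default_rng(1)
V = rng.normal(size=(8, 6, 5))
IV = integral_video(V)
# sum over the window x in 2..5, y in 1..4, t in 0..3 (inclusive)
s = region_sum(IV, (1, 0, -1), (5, 4, 3))
```

The building cost is one pass over the video; every subsequent region sum is constant time regardless of window size.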
Tuzel et al. [29] used integral image representations for the fast calculation of flat region covariances. Here we extend the idea to the fast calculation of covariance matrices inside a spatiotemporal window, using the integral video representation. The $(i,j)$-th element of the covariance matrix defined in (2) can be expressed as:

$C(i,j) = \frac{1}{S-1} \left[ \sum_{k=1}^{S} f_k(i) f_k(j) - \frac{1}{S} \sum_{k=1}^{S} f_k(i) \sum_{k=1}^{S} f_k(j) \right]$ (4)

where $f_k(i)$ refers to the $i$-th element of the $k$-th vector. To find the covariance in a given spatiotemporal window $W$, we have to compute the sum of each feature dimension, $f(i)$, as well as the sum of the product of any two feature dimensions, $f(i)\,f(j)$. With $d$ representing the number of feature dimensions, the covariance of any spatiotemporal window can then be computed in $O(d^2)$ time, as follows.
We need to compute a total of $d + d^2$ integral videos. Let $P$ be the tensor of the first-order integral videos:

$P(x', y', t', i) = \sum_{x \leq x'} \sum_{y \leq y'} \sum_{t \leq t'} F(x, y, t, i)$ (5)

where $F(x, y, t, i)$ is the $i$-th element of vector $F(x, y, t)$. Furthermore, let $Q$ be the tensor of the second-order integral videos:

$Q(x', y', t', i, j) = \sum_{x \leq x'} \sum_{y \leq y'} \sum_{t \leq t'} F(x, y, t, i)\, F(x, y, t, j)$ (6)

for $i, j = 1, \ldots, d$. The complexity of calculating the tensors is linear in the number of pixels and quadratic in $d$. The $d$-dimensional vector $p_{x,y,t}$ and the $d \times d$ matrix $Q_{x,y,t}$ can be obtained from the above tensors using:

$p_{x,y,t} = \left[ P(x, y, t, 1), \ldots, P(x, y, t, d) \right]^T$ (7)

$Q_{x,y,t}(i,j) = Q(x, y, t, i, j)$ (8)
Let $W(x, y, t;\, x', y', t')$ be the spatiotemporal window bounded by the points $(x, y, t)$ and $(x', y', t')$, as shown in Fig. 2. The covariance of the spatiotemporal window bounded by the origin and $(x', y', t')$ is:

$C_{W(1,1,1;\, x',y',t')} = \frac{1}{S-1} \left[ Q_{x',y',t'} - \frac{1}{S}\, p_{x',y',t'}\, p_{x',y',t'}^T \right]$ (9)

where $S = x' \cdot y' \cdot t'$. Similarly, after a few rearrangements, the covariance of the region $W(x, y, t;\, x', y', t')$ can be computed as:

$C_{W(x,y,t;\, x',y',t')} = \frac{1}{S-1} \left[ Q_W - \frac{1}{S}\, p_W\, p_W^T \right]$ (10)

where

$Q_W = Q_{x',y',t'} - Q_{x,y',t'} - Q_{x',y,t'} - Q_{x',y',t} + Q_{x,y,t'} + Q_{x,y',t} + Q_{x',y,t} - Q_{x,y,t}$ (11)

$p_W = p_{x',y',t'} - p_{x,y',t'} - p_{x',y,t'} - p_{x',y',t} + p_{x,y,t'} + p_{x,y',t} + p_{x',y,t} - p_{x,y,t}$ (12)

and $S = (x' - x)(y' - y)(t' - t)$.
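The fast computation can be sketched end-to-end as follows (assuming numpy; names are our own). The first- and second-order integral tensors are built once; the covariance of any window is then recovered in O(d^2) time and checked against the direct computation of Eq. (2):

```python
import numpy as np

rng = np.random.default_rng(2)
d, X, Y, T = 4, 10, 9, 8
F = rng.normal(size=(X, Y, T, d))            # toy feature video

# first- and second-order integral videos (prefix sums over x, y, t)
P = F.cumsum(0).cumsum(1).cumsum(2)                      # (X, Y, T, d)
Q = np.einsum('xyti,xytj->xytij', F, F)
Q = Q.cumsum(0).cumsum(1).cumsum(2)                      # (X, Y, T, d, d)

def box_val(A, lo, hi):
    """8-corner inclusion-exclusion on an integral tensor."""
    def at(x, y, t):
        if x < 0 or y < 0 or t < 0:
            return np.zeros(A.shape[3:])
        return A[x, y, t]
    (x0, y0, t0), (x1, y1, t1) = lo, hi
    return (at(x1, y1, t1) - at(x0, y1, t1) - at(x1, y0, t1) - at(x1, y1, t0)
            + at(x0, y0, t1) + at(x0, y1, t0) + at(x1, y0, t0) - at(x0, y0, t0))

def window_cov(lo, hi):
    """Covariance of the window (lo, hi]; O(d^2) per window."""
    S = np.prod(np.array(hi) - np.array(lo))             # number of pixels
    pW = box_val(P, lo, hi)                              # summed features
    QW = box_val(Q, lo, hi)                              # summed outer products
    return (QW - np.outer(pW, pW) / S) / (S - 1)

lo, hi = (2, 1, 0), (8, 7, 6)
C = window_cov(lo, hi)
```

The brute-force reference for the same window is the covariance of `F[3:9, 2:8, 1:7]` flattened to (S, d), which the constant-time version reproduces.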
2.2 Features and regions
Commonly used features for action and gesture recognition include intensity gradients and optical flow. Previous studies have shown the benefit of combining both types of features [4, 31]. We define the feature mapping $\Phi$, present in (1), as the following combination of gradient and optical-flow based features, extracted from pixel location $(x, y, t)$:

$f = \left[\, g^T,\; o^T \,\right]^T$ (13)

where

$g = \left[ \frac{\partial I}{\partial x},\; \frac{\partial I}{\partial y},\; \frac{\partial^2 I}{\partial x^2},\; \frac{\partial^2 I}{\partial y^2},\; \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2},\; \arctan\frac{\left| \partial I / \partial y \right|}{\left| \partial I / \partial x \right|} \right]^T$ (14)

$o = \left[ u,\; v,\; \frac{\partial u}{\partial t},\; \frac{\partial v}{\partial t},\; \mathrm{Div},\; \mathrm{Vor} \right]^T$ (15)

The first four gradient based features in (14) represent the first and second order intensity gradients at pixel location $(x, y, t)$. The last two gradient based features correspond to the gradient magnitude and gradient orientation. The optical-flow based features in (15) represent, in order: the horizontal ($u$) and vertical ($v$) components of the flow vector, the first order derivatives of the flow components with respect to $t$, and the spatial divergence ($\mathrm{Div}$) and vorticity ($\mathrm{Vor}$) of the flow field as defined in [1]. Each descriptor is hence a $12 \times 12$ matrix, as $f$ has $d = 12$ dimensions.
For reliable recognition, several regions (and hence several descriptors) are typically used. Fig. 3 shows the spatiotemporal windows of two descriptors which can be used for recognition of facial expressions. With the defined mapping, the input video is mapped to $F$, a 12-dimensional feature video. Since the cardinality of the set of spatiotemporal windows is very large, we only consider windows of a minimum size, and increment their location and size by a minimum interval value. Further specifics on the windows used in the experiments are given in Section 4.
Following [29], each covariance descriptor $C$ is normalised with respect to the covariance descriptor of the region containing the full feature video, $C_F$, to improve the robustness against illumination variations:

$\hat{C} = \mathrm{diag}(C_F)^{-\frac{1}{2}}\; C\; \mathrm{diag}(C_F)^{-\frac{1}{2}}$ (16)

where $\mathrm{diag}(C_F)$ is equal to $C_F$ at the diagonal entries, and the rest is set to zero.
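A sketch of the normalisation (assuming numpy; the function name is our own). Uniformly scaling all features, as a rough model of a global illumination change, leaves the normalised descriptor unchanged:

```python
import numpy as np

def normalise_cov(C, C_full):
    """Normalise a Cov3D descriptor by the diagonal of the
    full-video covariance C_full: diag^{-1/2} C diag^{-1/2}."""
    inv_sqrt = 1.0 / np.sqrt(np.diag(C_full))
    return C * np.outer(inv_sqrt, inv_sqrt)

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 5))                 # toy feature vectors
C_full = np.cov(A, rowvar=False)             # full-video covariance
C = np.cov(A[:20], rowvar=False)             # window covariance
Chat = normalise_cov(C, C_full)

# scaling every feature by a constant cancels out after normalisation
scale = 3.0
Chat2 = normalise_cov(np.cov(scale * A[:20], rowvar=False),
                      np.cov(scale * A, rowvar=False))
```

The normalised full-video descriptor has a unit diagonal, i.e. it behaves like a correlation matrix.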
3 Classification of Actions and Gestures
The Cov3D descriptors are symmetric positive definite matrices of size $12 \times 12$, which can be formulated as a connected Riemannian manifold ($\mathrm{Sym}^+_d$) [7]. In this section we first briefly overview Riemannian manifolds, then describe the proposed weighted Riemannian locality preserving projection (WRLPP) that allows mapping from Riemannian manifolds to Euclidean spaces. We then describe a classification algorithm that uses WRLPP.
3.1 Riemannian manifolds
A manifold can be considered as a continuous surface lying in a higher dimensional Euclidean space. Formally, a manifold is a topological space which is locally similar to a Euclidean space [29]. Intuitively, the tangent space $T_X$ is the plane tangent to the surface of the manifold at point $X$.
A point $Y$ on the manifold can be mapped to a vector $y$ in the tangent space $T_X$ using the logarithm map operator $\log_X(\cdot)$. For $\mathrm{Sym}^+_d$ the logarithm map is defined as:

$\log_X(Y) = X^{\frac{1}{2}}\, \log\!\left( X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}} \right) X^{\frac{1}{2}}$ (17)

where $\log(\cdot)$ is the matrix logarithm operator. Given the eigenvalue decomposition of a symmetric matrix, $\Sigma = U \Lambda U^T$, the matrix logarithm can be computed via:

$\log(\Sigma) = U\, \log(\Lambda)\, U^T$ (18)

where $\log(\Lambda)$ is a diagonal matrix, with each diagonal element equal to the logarithm of the corresponding element in $\Lambda$.

The minimum length curve connecting two points on the manifold is called the geodesic, and the distance between two points is given by the length of this curve. Geodesics are related to the tangents in the tangent space. For $\mathrm{Sym}^+_d$, the distance between two points $X$ and $Y$ on the manifold can be found via:

$d(X, Y) = \left\| \log\!\left( X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}} \right) \right\|_F$ (19)
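Both operators admit a direct implementation via eigendecomposition (a numpy sketch; function names are our own):

```python
import numpy as np

def logm_spd(S):
    """Matrix logarithm of a symmetric positive definite matrix,
    via eigendecomposition: U diag(log w) U^T."""
    w, U = np.linalg.eigh(S)
    return (U * np.log(w)) @ U.T

def geodesic_dist(X, Y):
    """Affine-invariant geodesic distance on the SPD manifold:
    Frobenius norm of log(X^{-1/2} Y X^{-1/2})."""
    w, U = np.linalg.eigh(X)
    X_inv_sqrt = (U / np.sqrt(w)) @ U.T      # X^{-1/2}
    M = X_inv_sqrt @ Y @ X_inv_sqrt
    return np.linalg.norm(logm_spd(M), 'fro')

# toy SPD matrices
rng = np.random.default_rng(4)
A = rng.normal(size=(6, 6))
X = A @ A.T + 6 * np.eye(6)
B = rng.normal(size=(6, 6))
Y = B @ B.T + 6 * np.eye(6)
```

The distance is zero between a point and itself, and symmetric in its arguments, as required of a metric.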
3.2 Weighted RLPP
The usual approach for classification on manifolds is to first map the points into an appropriate Euclidean representation [29] and then use traditional machine learning methods. Points in the manifold can be mapped into a fixed tangent space (such as $T_I$, where $I$ is the identity matrix) [6]. Since distances in the manifold are only locally preserved in the tangent space, better results can be achieved by considering the tangent space at the Karcher mean, the point which minimises the distances among the samples, as shown in [29]. Improved results have been obtained by considering multiple tangent spaces [19, 25]. A more complex approach involves using training data to create a mapping that tries to preserve the relations between points, such as the RLPP approach [8].

RLPP is based on Laplacian eigenmaps [2]. Given $n$ training points $\{X_1, \ldots, X_n\}$ from the underlying Riemannian manifold $\mathcal{M}$, the local geometrical structure of $\mathcal{M}$ can be modelled by building an adjacency graph $G$. The simplest form of $G$ is a binary graph obtained based on the nearest neighbour properties of Riemannian points: two nodes are connected by an edge if one node is among the $k$ nearest neighbours of the other node. From the adjacency graph $G$ we can find the degree matrix $D$ and Laplacian matrix $L$, respectively:
$D_{ii} = \sum\nolimits_j G_{ij}$ (20)

$L = D - G$ (21)

where the degree matrix $D$ is a diagonal matrix of size $n \times n$, with diagonal entries indicating the number of edges of each node in the adjacency graph.
RLPP also uses a heat pseudo-kernel matrix $K$, with the $(i,j)$-th element constructed via:

$K_{ij} = \exp\!\left( -\, d^2(X_i, X_j) \,/\, \sigma^2 \right)$ (22)

where $d(\cdot, \cdot)$ is the geodesic distance defined in (19).
The final mapping $A$ can be found through the following generalised eigenvalue problem [8]:

$K L K^T a = \lambda\, K D K^T a$ (23)
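The following is a rough sketch of the projection computation (assuming numpy; the function name, the kNN graph construction details and the small regularisation ridge are our own choices, not specified by the source):

```python
import numpy as np

def wrlpp_projection(D2, q, k=3, sigma2=1.0, out_dim=2):
    """Sketch of the (weighted) RLPP mapping.

    D2: (n, n) squared geodesic distances between training points.
    q:  (n,) boosting sample weights (all ones recovers plain RLPP).
    Returns A (n, out_dim) and the kernel K; training point i is then
    embedded as A.T @ K[:, i].
    """
    n = D2.shape[0]
    # binary kNN adjacency graph, symmetrised (skip self at position 0)
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]
    G = np.zeros((n, n))
    for i in range(n):
        G[i, idx[i]] = 1.0
    G = np.maximum(G, G.T)
    G = np.diag(q) @ G @ np.diag(q)        # weighted adjacency, Eq. (24)
    D = np.diag(G.sum(axis=1))             # degree matrix
    L = D - G                              # graph Laplacian
    K = np.exp(-D2 / sigma2)               # heat pseudo-kernel
    # generalised eigenproblem of Eq. (23), solved as a standard one;
    # the ridge term is our addition for numerical stability
    Amat = K @ L @ K.T
    Bmat = K @ D @ K.T + 1e-8 * np.eye(n)
    vals, vecs = np.linalg.eig(np.linalg.solve(Bmat, Amat))
    order = np.argsort(vals.real)          # smallest eigenvalues first
    return np.real(vecs[:, order[:out_dim]]), K

# toy data: squared Euclidean distances stand in for geodesic ones
rng = np.random.default_rng(5)
pts = rng.normal(size=(20, 3))
D2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
A, K = wrlpp_projection(D2, np.ones(20))
Y = (A.T @ K).T                            # embedded training points
```

With unit weights this reduces to plain RLPP; during boosting, `q` would carry the per-sample weights of the current iteration.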
The number of possible Cov3D descriptors inside a sample video is very large. As such, we elected to use boosting to search for a subset of the best descriptors for classification. We could use the original RLPP mapping to map the matrices to vectors at each boosting iteration. However, as shown in [29], the sample weights can be used to generate a mapping which is more appropriate for the critical training samples. We therefore propose a modified projection, specifically designed to be used during boosting, which uses the sample weights to generate the final mapping. We refer to this approach as weighted Riemannian locality preserving projection (WRLPP).
In the modified projection, the adjacency graph $G$ is replaced with a weighted adjacency graph $G_w$, defined as:

$G_w = Q\, G\, Q$ (24)

where $Q$ is a diagonal matrix with diagonal values that correspond to the vector of sample weights $q$. Using the weighted adjacency graph, edges involving critical samples (i.e. samples with higher weights) become more important, and their geometrical structure is better preserved. The modified projection approach is detailed in Algorithm 1.
Once the projection matrix $A$ has been obtained, a given point $X$ (a Cov3D matrix) on the manifold can be mapped to Euclidean space via:

$y = A^T k_X$ (25)

where $k_X = \left[ K(X, X_1), \ldots, K(X, X_n) \right]^T$, with $K(\cdot, \cdot)$ defined in (22), and $\{X_i\}_{i=1}^{n}$ representing the training points.
3.3 Classification
As mentioned in the preceding section, we have chosen to use boosting to find a subset of the best descriptors for classification, as the number of possible Cov3D descriptors inside a sample video is large. For simplicity, we used a combination of one-vs-one LogitBoost classifiers [5] to achieve multi-class classification.

We start with a brief description of binary LogitBoost classification, with class labels $y_i \in \{-1, +1\}$. The probability of sample $x_i$ belonging to class $+1$ is represented by:

$p(x_i) = \frac{e^{F(x_i)}}{e^{F(x_i)} + e^{-F(x_i)}}$ (26)

where $F(x) = \frac{1}{2} \sum_{m=1}^{M} f_m(x)$, with $f_m(\cdot)$ representing a weak learner.
The LogitBoost algorithm learns a set of weak learners by minimising the negative binomial log-likelihood of the data. A weighted least squares regression of training points $x_i$ is fitted to response values $z_i$, with weights $w_i$, where

$z_i = \frac{y_i^* - p(x_i)}{p(x_i)\left(1 - p(x_i)\right)}$ (27)

$w_i = p(x_i)\left(1 - p(x_i)\right)$ (28)

with $y_i^* = (y_i + 1)/2$.
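The working responses and weights of one LogitBoost step can be sketched as (assuming numpy; the function name is our own):

```python
import numpy as np

def logitboost_step(F, y):
    """Working responses and weights for one LogitBoost iteration.

    F: (n,) current additive score for each sample.
    y: (n,) labels in {-1, +1}.
    Returns (z, w): regression targets and sample weights for
    fitting the next weak learner.
    """
    p = np.exp(F) / (np.exp(F) + np.exp(-F))   # class probability
    y01 = (y + 1) / 2                          # map {-1, +1} -> {0, 1}
    w = p * (1 - p)                            # sample weights
    z = (y01 - p) / w                          # working responses
    return z, w

y = np.array([1, 1, -1, -1])
F = np.zeros(4)                  # before boosting: p = 0.5 everywhere
z, w = logitboost_step(F, y)
```

At the start of boosting all probabilities are 0.5, so every sample gets weight 0.25 and response ±2.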
As we are using Cov3D descriptors (covariance matrices) as input data, we adapt the weak learners to use the projected descriptors. In other words, $f_m(x)$ is replaced with $f_m(A^T k_X)$, with $X$ representing a covariance matrix.
For every unique pair of classes, we train a one-vs-one LogitBoost classifier as follows. Only the samples belonging to the pair of classes are used for training the binary classifier. One class is selected as the positive class and the other as the negative class. For each boosting iteration, we search for the region whose Cov3D descriptor best separates positive from negative samples. The descriptor is calculated for all the training samples and mapped to vector space with WRLPP, using the sample weights calculated for the current boosting iteration. Once in vector space, we fit a linear regression and use it as the weak LogitBoost classifier.
To prevent overfitting, the number of weak classifiers in each one-vs-one classifier is controlled by a probability margin between the last accepted positive sample and the last rejected negative sample. Both margin samples are determined by the target detection rate and the target false positive rejection rate. The final multi-class classifier is a set of one-vs-one classifiers. Each one-vs-one classifier is associated with the labels of its two classes, a positive class and a threshold: the positive class is the label of the class deemed to be positive, and the threshold is found via boosting. Algorithm 2 summarises the training process.
A sample video is classified as follows. Given a one-vs-one classifier for classes $a$ and $b$, the probability $p_{ab}(x)$ of a sample video $x$ belonging to the positive class is evaluated via (26), using the learned score function $F_{ab}$:

$p_{ab}(x) = \frac{e^{F_{ab}(x)}}{e^{F_{ab}(x)} + e^{-F_{ab}(x)}}$ (29)

After evaluating $x$ with all the one-vs-one classifiers in the set, the sample is labelled as the class which maximises:

$\hat{c} = \arg\max_{c}\; \sum\nolimits_{(a,b)} \delta_{ab}(c)\, s_{ab}(x)$ (30)

where $s_{ab}(x)$ is $p_{ab}(x)$ if the positive class is the predicted class, or $1 - p_{ab}(x)$ otherwise, and $\delta_{ab}(c)$ is $1$ if the $(a,b)$ classifier evaluates to class $c$ (and $0$ otherwise). In other words, $x$ is labelled as the class with the greater probability sum, selecting all the one-vs-one classifiers that evaluate to that class.
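Our reading of this decision rule can be sketched as follows (pure Python; the function name and the dictionary-based interface are our own):

```python
def ovo_label(probs, pos_class):
    """Multi-class decision from one-vs-one probabilities.

    probs: dict {(a, b): p} giving, for each classifier pair, the
    probability that the sample belongs to that pair's positive class.
    pos_class: dict {(a, b): a or b} naming the positive class of each pair.
    Each classifier votes for one class; the label with the greatest
    probability sum over the classifiers that voted for it wins.
    """
    scores = {}
    for (a, b), p in probs.items():
        pos = pos_class[(a, b)]
        neg = b if pos == a else a
        voted = pos if p >= 0.5 else neg       # the classifier's decision
        conf = p if voted == pos else 1 - p    # its confidence in that decision
        scores[voted] = scores.get(voted, 0.0) + conf
    return max(scores, key=scores.get)

# toy example with 3 classes: class 0 wins both of its pairings
probs = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.3}
pos_class = {(0, 1): 0, (0, 2): 0, (1, 2): 1}
label = ovo_label(probs, pos_class)
```

Here classes 0 and 2 receive votes with total confidences 1.7 and 0.7 respectively, so the sample is labelled as class 0.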
4 Experiments
We compared the performance of the proposed algorithm against baseline approaches as well as several stateoftheart methods. We used three benchmark datasets, with an overview of the datasets shown in Table 1.
In the following subsections, we first present an evaluation of several Riemannian to Euclidean space mapping approaches, justifying the use of the weighted RLPP. We then follow with experiments showing the performance on sport actions, facial expressions and hand gestures.
Unless otherwise stated, no pre-processing was performed on the input sequences, and all the recognition results were obtained using 5-fold cross-validation to divide the samples into training and testing sets.
In all cases we used the following parameters: 0.95 detection rate, 0.95 false positive rejection rate, and 0.5 margin. Furthermore, since the search space of spatiotemporal windows is very large, we restricted the minimum size of the windows, as well as the minimum increment on location and size of the windows, to a fixed fraction of the frame size.
Table 1: Overview of the datasets used in the experiments.

Dataset        UCF [24]   CK+ [18]             Cambridge [12]
Type           sports     facial expressions   hand gestures
Classes        10         7                    9
Subjects       —          123                  2
Scenarios      —          —                    5
Video samples  150        593                  900
Resolution     variable
4.1 Comparison of mapping approaches
In Fig. 4, we compare the following six Riemannian to Euclidean space mapping approaches, which can be used during boosting: (i) no mapping (i.e., using a vectorised representation of the upper triangle of the covariance matrix); (ii) projection to a fixed tangent space [6]; (iii) projection to the weighted Karcher mean of the samples [29]; (iv) projection using k-tangent spaces [25]; (v) mapping the points with the original RLPP method [8]; and (vi) mapping the points with the proposed WRLPP approach.
Since the mapping approach affects individual binary classifiers, we show results per classifier with detection error trade-off curves. We chose the one-vs-one classifiers between conflicting class pairs (where samples of one class are misclassified as the other class) on the Cambridge hand gesture dataset (described in Section 4.4). Each point on a curve represents the average over all the chosen classifiers. The curves were obtained by varying the classification threshold in Algorithm 2.
With the exception of the original RLPP method, incrementally better results are obtained by using the mapping approaches in the mentioned order, as they provide increasingly better vector representations of the manifold space. Although RLPP is designed to provide a better representation compared to tangentbased approaches, it appears not to be appropriate for boosting as it does not take into account the sample weights of critical training points. The proposed WRLPP method addresses this problem, resulting in the best overall performance.
4.2 UCF sport dataset
The UCF sport action dataset [24] consists of ten categories of human actions, containing videos with nonuniform backgrounds where both the camera and the subject might be moving. We use the regions of interest provided with the dataset.
We compared the Cov3D approach against the following methods: HOG3D [31], hierarchy of discriminative space-time neighbourhood features (HDN) [13], and augmented features in conjunction with multiple kernel learning (AFMKL) [32]. HOG3D is the extension of the histogram of oriented gradients descriptor [15] to the spatiotemporal case. HDN learns the shapes of space-time feature neighbourhoods that are most discriminative for a given action category; the idea is to form new features composed of the neighbourhoods around the interest points in a video. AFMKL exploits appearance distribution features and spatiotemporal context features in a learning scheme for action recognition. As shown in Table 2, the proposed Cov3D-based approach achieves the highest accuracy.
4.3 CK+ facial expression dataset
The extended CohnKanade (CK+) facial expression database [18] contains 593 sequences from 123 subjects. We used the sequences with validated emotion labels, among 7 possible emotions. The image sequences vary in duration (i.e. 10 to 60 frames) and incorporate the onset (which is also the neutral frame) to peak formation of the facial expressions.
We compared the Cov3D approach against active appearance models (AAM), constrained local models (CLM) [3], and temporal modelling of shapes (TMS) [9]. AAM is the baseline approach included with the dataset: it uses active appearance models to track the faces and extract the features, and then uses support vector machines (SVM) to classify the facial expressions. The CLM approach is an improvement on AAM, designed for better generalisation to unseen objects. The TMS approach uses latent-dynamic conditional random fields to model temporal variations within shapes.
We show the performance per emotion in Table 3, in line with existing literature. The proposed Cov3D approach achieves the highest average recognition accuracy (averaged over the 7 classes), outperforming the next best method (TMS).
4.4 Cambridge hand gesture dataset
The Cambridge hand gesture dataset [12] consists of 900 image sequences of 9 gesture classes. Each class has 100 image sequences, performed by 2 subjects, captured under 5 illuminations and 10 arbitrary motions. The 9 classes are defined by three primitive hand shapes and three primitive motions. Each sequence was recorded with a fixed camera, with the gestures roughly isolated in space and time. We followed the test protocol defined in [12]: sequences with normal illumination were used for training, while tests were performed on the remaining sequences.
The proposed method was compared against tensor canonical correlation analysis (TCCA) [12], product manifolds (PM) [20] and tangent bundles (TB) [19]. TCCA is the extension of canonical correlation analysis to multi-way data arrays or tensors; canonical correlation analysis and principal angles are standard methods for measuring the similarity between subspaces. In the PM method, a tensor is characterised as a point on a product manifold and classification is performed on this space; the product manifold is created by applying a modified high order singular value decomposition on the tensors and interpreting each factorised space as a Grassmann manifold. In the TB method, video data is represented as a third order tensor and factorised using high order singular value decomposition, where each factor is projected onto a tangent space and the intrinsic distance is computed from a tangent bundle for action classification.
We report the recognition rates for the four test sets in Table 4, where the proposed Cov3D-based approach obtains the highest performance.
5 Conclusion
In this paper, we extended the flat covariance descriptors proposed in [29] to spatiotemporal covariance descriptors, termed Cov3D, and showed how they can be computed quickly through the use of integral video representations.
The proposed Cov3D descriptors belong to the group of symmetric positive definite matrices, which can be formulated as a connected Riemannian manifold. Prior to classification, points on a manifold are generally mapped to a Euclidean space, through a technique such as Riemannian locality preserving projection (RLPP) [8].
The Cov3D descriptors are extracted from spatiotemporal windows inside sample videos, with the number of possible windows being very large. We used a boosting approach to find a subset which is the most useful for classification. In order to take into account the weights of the training samples, we further proposed to extend RLPP by incorporating weighting during the projection. The weighted projection (termed WRLPP) leads to a better representation of the neighbourhoods around the most critical training samples during each boosting iteration.
Combining the proposed Cov3D descriptors with the classification approach based on WRLPP boosting leads to a state-of-the-art method for action and gesture recognition. The proposed Cov3D-based method performs better than several recent approaches on three benchmark datasets for action and gesture recognition. The method is robust and does not require additional processing of the videos, such as foreground detection, interest-point detection or tracking. To our knowledge, this is the first approach shown to be equally suitable, in terms of recognition accuracy, for both action and gesture recognition.
Further avenues of research include adapting the method to related tasks, such as anomaly detection in surveillance videos [23], where there is often a shortage of positive examples.

References
 [1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE Trans. Pattern Analysis and Machine Intelligence, 32(2):288–303, 2010.
 [2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
 [3] S. Chew, P. Lucey, S. Lucey, J. Saragih, J. Cohn, and S. Sridharan. Person-independent facial expression detection using constrained local models. In IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, pages 915–920, 2011.
 [4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pages 65–72, 2005.

 [5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.
 [6] K. Guo, P. Ishwar, and J. Konrad. Action recognition using sparse representation on covariance manifolds of optical flow. In IEEE Int. Conf. Advanced Video and Signal Based Surveillance (AVSS), pages 188–195, 2010.
 [7] M. T. Harandi, C. Sanderson, R. Hartley, and B. C. Lovell. Sparse coding and dictionary learning for symmetric positive definite matrices: A kernel approach. In ECCV 2012, Lecture Notes in Computer Science (LNCS), Vol. 7573, pages 216–229, 2012.
 [8] M. T. Harandi, C. Sanderson, A. Wiliem, and B. C. Lovell. Kernel analysis over Riemannian manifolds for visual recognition of actions, pedestrians and textures. In IEEE Workshop on the Applications of Computer Vision, pages 433–439, 2012.
 [9] S. Jain, C. Hu, and J. Aggarwal. Facial expression recognition with temporal modeling of shapes. In IEEE Int. Conf. Computer Vision Workshops, pages 1642–1649, 2011.
 [10] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In IEEE Int. Conf. Computer Vision, volume 1, pages 166–173, 2005.
 [11] V. Kellokumpu, G. Zhao, and M. Pietikainen. Human activity recognition using a dynamic texture based method. In British Machine Vision Conference, 2008.
 [12] T.-K. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 31(8):1415–1428, 2009.

 [13] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In IEEE Conf. Computer Vision and Pattern Recognition, pages 2046–2053, 2010.
 [14] I. Laptev and T. Lindeberg. Space-time interest points. In IEEE Int. Conf. Computer Vision, volume 1, pages 432–439, 2003.
 [15] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2008.
 [16] T. Lin and H. Zha. Riemannian manifold learning. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(5):796–809, 2008.
 [17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Computer Vision, 60(2):91–110, 2004.
 [18] P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops, pages 94–101, 2010.
 [19] Y. Lui. Tangent bundles on special manifolds for action recognition. IEEE Trans. Circuits and Systems for Video Technology, 22(6):930–942, 2012.
 [20] Y. M. Lui, J. Beveridge, and M. Kirby. Action classification on product manifolds. In IEEE Conf. Computer Vision and Pattern Recognition, pages 833–839, 2010.
 [21] J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. Int. J. Computer Vision, 79(3):299–318, 2008.
 [22] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
 [23] V. Reddy, C. Sanderson, and B. C. Lovell. Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size and texture. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 55–61, 2011.
 [24] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2008.
 [25] A. Sanin, C. Sanderson, M. T. Harandi, and B. C. Lovell. K-tangent spaces on Riemannian manifolds for improved pedestrian detection. In IEEE Int. Conf. Image Processing, 2012.
 [26] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In International Conference on Multimedia, pages 357–360, 2007.
 [27] C. Thurau and V. Hlavac. Pose primitive based human action recognition in videos or still images. In IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2008.
 [28] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea. Machine recognition of human activities: a survey. IEEE Trans. Circuits and Systems for Video Technology, 18(11):1473–1488, 2008.
 [29] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(10):1713–1727, 2008.
 [30] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.
 [31] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference, 2009.
 [32] X. Wu, D. Xu, L. Duan, and J. Luo. Action recognition using context and appearance distribution features. In IEEE Conf. Computer Vision and Pattern Recognition, pages 489–496, 2011.