Human action recognition from videos is a challenging problem. Differences in viewing direction and distance, body sizes of the human subjects, clothing, and style of performing the action are some of the main factors making human action recognition difficult. Most of the previous approaches to action recognition have focused on using traditional RGB cameras [1, 2, 3]. Since the release of the Microsoft Kinect depth camera, depth based human action recognition methods [4, 5, 6, 7, 8, 9, 10, 11, 12] have started to emerge. Being a depth sensor, the Kinect camera is not affected by scene illumination or the color of the clothes worn by the human subject, making object segmentation an easier task. The challenges that still remain are loose clothing, occlusions, and variations in the style and speed of actions.
In this context, some algorithms have exploited silhouette and edge pixels as discriminative features. For example, Li et al. sampled boundary pixels from 2D silhouettes as a bag of features. Yang et al. added temporal derivatives of 2D projections to get Depth Motion Maps (DMM). Vieira et al. computed silhouettes in 3D by using space-time occupancy patterns. Instead of these very simple occupancy features, Wang et al. computed a vector of 8 Haar features on a uniform grid in the 4D volume. LDA was used to detect the discriminative feature positions, and an SVM classifier was used for action classification. Xia and Aggarwal proposed an algorithm to extract Space Time Interest Points (STIPs) from depth sequences and modelled local 3D depth cuboids using the Depth Cuboid Similarity Feature (DCSF). However, the accuracy of this algorithm depends on the noise level of the depth images. Tang et al. proposed to use histograms of the normal vectors computed from depth images for object recognition. Given a depth image, they computed the spatial derivatives and transformed them to polar coordinates, where the resulting 2D histograms were used as object descriptors. Oreifej and Liu extended these derivatives to the temporal dimension. They normalized the gradient vectors to unit magnitude and projected them onto a fixed basis before histogramming. In their formulation, the last components of the normalized gradient vectors were the inverse of the gradient magnitude. As a result, information from very strong derivative locations, such as edges and silhouettes, may get suppressed.
In prior work, the HOG3D feature is computed within a small 3D cuboid centered at a space-time point, and Sparse Coding (SC) is utilized to obtain a more discriminative representation. However, depth images have a high level of noise, and two subjects may perform the same action in different styles; to favor sparsity, SC might therefore select quite different dictionary elements for similar actions (details in Section I-B3). To overcome this problem, we use Locality-constrained Linear Coding (LLC) in this paper to make locality more essential than sparsity. Moreover, instead of a small 3D cuboid, we divide the input video sequence into subsequences, blocks, and then cells (Fig. 1). The HOG3D features computed at the cell level are concatenated to form the descriptor at the block level. The LLC, followed by maximum pooling and a logistic regression classifier with L2 regularization, finally gives the probability of each action class for each subsequence. The video sequence is represented by the concatenation of the subsequence descriptors. Finally, we use an SVM for action classification.
We evaluate the proposed algorithm on three standard depth datasets [7, 12, 10] and two standard color datasets [17, 18]. We compare the proposed method with ten state-of-the-art methods [6, 8, 7, 19, 5, 20, 21, 22, 23, 4]. Our experimental results show that our algorithm outperforms these ten methods.
I-A Proposed Algorithm
We consider an action as a function operating on a three-dimensional space, with the two spatial coordinates and time being the independent variables and the depth being the dependent variable. The discrimination of a particular action can be characterized by the variations of the depth values along these three dimensions.
I-B Feature Extraction
In order to capture sufficient discriminative information, such as local motion and appearance characteristics, the depth sequence is divided into equally spaced, overlapping spatio-temporal subsequences, whose number depends on the number of frames in the whole video. Each subsequence is divided into blocks, and each block is further divided into equally spaced, non-overlapping cells. This hierarchy is shown in Fig. 1.
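The hierarchy above can be sketched in numpy as follows; the temporal stride, the tiling of blocks within a subsequence, and all sizes are assumptions for illustration, not the values used in the paper:

```python
import numpy as np

def split_hierarchy(video, sub_len, block_shape, cell_shape):
    """Split a depth video (T, H, W) into overlapping subsequences,
    then blocks, then non-overlapping cells (all sizes assumed)."""
    T, H, W = video.shape
    stride = sub_len // 2                      # 50% temporal overlap (assumed)
    subsequences = [video[t:t + sub_len]
                    for t in range(0, T - sub_len + 1, stride)]
    bt, bh, bw = block_shape
    ct, ch, cw = cell_shape
    cells = []
    for sub in subsequences:
        # Blocks tile the subsequence; cells tile each block.
        for t0 in range(0, sub_len - bt + 1, bt):
            for y0 in range(0, H - bh + 1, bh):
                for x0 in range(0, W - bw + 1, bw):
                    block = sub[t0:t0 + bt, y0:y0 + bh, x0:x0 + bw]
                    for t1 in range(0, bt, ct):
                        for y1 in range(0, bh, ch):
                            for x1 in range(0, bw, cw):
                                cells.append(block[t1:t1 + ct,
                                                   y1:y1 + ch,
                                                   x1:x1 + cw])
    return subsequences, cells

video = np.random.rand(8, 16, 16)
subs, cells = split_hierarchy(video, sub_len=4, block_shape=(4, 8, 8),
                              cell_shape=(2, 4, 4))
```

With these toy sizes, each subsequence yields four blocks and each block eight cells.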
I-B1 Cell Descriptor
In each cell, a 3D histogram of oriented gradients (HOG) feature is computed by evaluating a gradient vector ∇f = (f_x, f_y, f_t)ᵀ at each pixel f(x, y, t) in that cell, where the derivatives are given by the finite differences f_x = f(x+1, y, t) − f(x−1, y, t), f_y = f(x, y+1, t) − f(x, y−1, t), and f_t = f(x, y, t+1) − f(x, y, t−1).
The HOG feature is computed by projecting each gradient vector onto the directions obtained by joining the centers of the faces of a regular polyhedron with its center. A regular dodecahedron, a 12-sided regular polyhedron commonly used to quantize 3D gradients, is composed of 12 regular pentagonal faces, and each face corresponds to a histogram bin. Let P denote the matrix whose rows p_1, …, p_12 are the unit vectors pointing from the center of the dodecahedron to the centers of its faces. For a regular dodecahedron centered at the origin, these normalized vectors are the cyclic permutations of (0, ±1, ±φ)ᵀ scaled to unit length, where φ = (1 + √5)/2 is the golden ratio. The gradient vector ∇f is projected onto these directions to give q = P∇f / ‖∇f‖, where ‖∇f‖ is the length of ∇f.
Since ∇f should vote into only a single bin when it is perfectly aligned with the axis running through the origin and a face center, the projected vector q must be quantized. A threshold value t is computed by projecting any two neighboring face-center vectors onto each other, i.e., t = p_iᵀ p_j for adjacent faces i and j. The quantized vector q̂ is given by q̂_b = q_b − t if q_b ≥ t and q̂_b = 0 otherwise, where b = 1, …, 12. We define the vote vector v to be q̂ scaled by the gradient magnitude, i.e., v = ‖∇f‖ q̂ / ‖q̂‖, and a histogram h for each cell is computed by summing the vote vectors of all the pixels in the cell.
Note that this summation is equivalent to histogramming, because each vector represents votes in the corresponding bins defined by the 12 directions of the regular dodecahedron.
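A minimal numpy sketch of this cell descriptor follows; it uses `np.gradient` (central differences) in place of the exact derivative scheme above, and all variable names are our own:

```python
import numpy as np

PHI = (1 + np.sqrt(5)) / 2  # golden ratio

# Face-center directions of a regular dodecahedron (equivalently, the
# vertices of its dual icosahedron): cyclic permutations of (0, +/-1, +/-PHI).
P = np.array([(0, s1, s2 * PHI) for s1 in (-1, 1) for s2 in (-1, 1)]
             + [(s1, s2 * PHI, 0) for s1 in (-1, 1) for s2 in (-1, 1)]
             + [(s1 * PHI, 0, s2) for s1 in (-1, 1) for s2 in (-1, 1)],
             dtype=float)
P /= np.linalg.norm(P, axis=1, keepdims=True)

# Threshold t: projection of a face-center vector onto a neighbouring one
# (the largest dot product below 1).
T_NEIGHBOR = np.sort(P @ P[0])[-2]

def cell_hog3d(cell):
    """12-bin HOG3D histogram of one cell given as a (T, H, W) array."""
    ft, fy, fx = np.gradient(cell.astype(float))          # central differences
    g = np.stack([fx, fy, ft], axis=-1).reshape(-1, 3)    # one gradient per pixel
    hist = np.zeros(len(P))
    for v in g:
        m = np.linalg.norm(v)
        if m == 0:
            continue                                      # no vote for flat pixels
        q = P @ (v / m)                                   # project unit gradient
        q = np.where(q >= T_NEIGHBOR, q - T_NEIGHBOR, 0.0)  # quantise
        n = np.linalg.norm(q)
        if n > 0:
            hist += m * q / n                             # vote scaled by magnitude
    return hist

hist = cell_hog3d(np.random.rand(4, 4, 4))
```

For the dodecahedron, neighboring face centers satisfy p_iᵀp_j = 1/√5, which is the threshold the sketch recovers numerically.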
I-B2 Block Descriptor
We combine a fixed number of neighboring cells into blocks. To get the block descriptor, we vertically concatenate the histograms of all the cells in that block (Fig. 2(b)); the length of the descriptor is therefore 12 times the total number of cells in each block.
The block descriptor in (7) is first normalized, and a symmetric sigmoid function is then applied element-wise as a trade-off between the gradient magnitude and the gradient orientation. Finally, the result is normalized again to give the final block descriptor.
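The normalize–sigmoid–normalize pipeline can be sketched as below; `tanh` is used here as one common choice of symmetric sigmoid, and the steepness `alpha` is an assumed parameter:

```python
import numpy as np

def block_descriptor(cell_hists, alpha=1.0):
    """Concatenate per-cell histograms into a block descriptor, then
    L2-normalise, apply a symmetric sigmoid (tanh assumed here), and
    L2-normalise again."""
    d = np.concatenate(cell_hists)          # vertical concatenation of cells
    d = d / (np.linalg.norm(d) + 1e-12)     # first normalisation
    d = np.tanh(alpha * d)                  # magnitude/orientation trade-off
    return d / (np.linalg.norm(d) + 1e-12)  # final normalisation

b = block_descriptor([np.random.rand(12) for _ in range(8)])
```

The saturating sigmoid limits the influence of a few very strong gradient bins on the final descriptor.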
I-B3 Subsequence Descriptor Using SC
To compare the descriptors computed using SC and LLC, we show below how we obtain the intermediate variable (see Fig. 2(c)) using both methods to get two different representations for the spatio-temporal subsequences.
For each spatio-temporal subsequence, we combine all the spatio-temporal blocks along the time dimension (see Fig. 1(b) and Fig. 2(a)). We horizontally concatenate all the block descriptors within the subsequence to give
where the number of blocks within the subsequence determines the number of concatenated block descriptors. In the special case where the subsequence and block sizes coincide along the temporal dimension, this number is governed solely by the number of frames in the depth video.
We then create a dictionary by applying the k-means clustering algorithm to the HOG3D descriptors of the training blocks. The SC sparsely encodes each block descriptor as a linear combination of a few atoms of the dictionary by minimizing the reconstruction error plus an L1 penalty on the coefficients, where a regularization parameter determines the sparsity of the representation of each local spatio-temporal feature. The twofold optimization minimizes the reconstruction error while keeping the coefficient set sparse.
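As a rough illustration (not the solver used in the paper), this lasso-type objective can be minimized with plain ISTA; the dictionary, dimensions, and regularization value below are arbitrary assumptions:

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=200):
    """L1-regularised coding of x over dictionary D (columns are atoms)
    via ISTA: gradient step on the reconstruction term, then soft
    thresholding. A minimal sketch."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)              # gradient of 0.5*||x - Da||^2
        z = a - g / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((24, 50))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
x = D[:, 3] * 2.0                          # signal built from a single atom
a = sparse_code(x, D)
```

When the input is built from a single atom, the code concentrates on that atom, illustrating the sparsity the objective enforces.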
I-B4 Subsequence Descriptor Using LLC
An alternative way to obtain the coefficient set is to use the LLC. Instead of Eq. (11), the LLC code uses the following criterion: the reconstruction error is penalized together with the squared norm of the element-wise product of the code and a locality adapter, which gives different weights to the basis vectors depending on how similar they are to the input descriptor. Specifically, each weight grows exponentially with the distance between the input descriptor and the corresponding basis vector, and a parameter adjusts the weight decay speed of the locality adapter.
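Under the usual sum-to-one constraint on the code, the LLC criterion of Eq. (12) admits a closed-form solution, which the following sketch illustrates; the values of `lam` and `sigma`, and all variable names, are assumptions:

```python
import numpy as np

def llc_code(x, B, lam=1e-4, sigma=0.2):
    """Closed-form LLC code of x over dictionary B (columns are atoms):
    minimise ||x - Bc||^2 + lam*||d .* c||^2 subject to sum(c) = 1."""
    d = np.exp(np.linalg.norm(B - x[:, None], axis=0) / sigma)  # locality adapter
    Z = B - x[:, None]                     # atoms shifted to the input
    G = Z.T @ Z + lam * np.diag(d ** 2)    # regularised Gram matrix
    c = np.linalg.solve(G, np.ones(B.shape[1]))
    return c / c.sum()                     # enforce the sum-to-one constraint

rng = np.random.default_rng(1)
B = rng.standard_normal((24, 50))
B /= np.linalg.norm(B, axis=0)             # unit-norm atoms
x = B[:, 3]                                # input equal to one atom
c = llc_code(x, B)
```

Because the locality adapter penalizes far-away atoms heavily, an input lying on a dictionary atom is coded almost entirely by that atom.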
For the encoding of block descriptors, we found that locality is more essential than sparsity, as locality must lead to sparsity but not necessarily vice versa. Hence, the LLC should be more discriminative than SC. We illustrate the superiority of LLC over SC with an example. Fig. 3(a) shows a block within a spatio-temporal subsequence of a subject performing an action, and Fig. 3(b) shows the block at the same spatial location in another video of another subject performing the same action. The SC coefficients for these two blocks (Fig. 3(c)) are quite different (Euclidean distance 1.38), which means that SC may fail to identify the two actions as the same. On the other hand, the LLC coefficients for these two blocks (Fig. 3(d)) are very similar to each other (and both are distinct from the SC coefficients), with a much smaller Euclidean distance of 0.2. As shown in this figure, LLC better represents the spatio-temporal blocks.
After the optimization step (using Eq. (11) or (12)), we get a set of sparse codes in which each vector has only a few nonzero elements. Equivalently, each block descriptor responds to only a small subset of the dictionary atoms.
To capture the global statistics of the subsequence, we use a maximum pooling function (Fig. 2(c)) that returns a vector whose elements are the maxima of the corresponding elements of the codes over all blocks in the subsequence. The vector obtained through maximum pooling is the subsequence descriptor (Fig. 2(c)).
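Maximum pooling reduces to an element-wise maximum over the per-block codes; taking the maximum of absolute responses, as in the sketch below, is one common convention and an assumption here:

```python
import numpy as np

def max_pool(codes):
    """Element-wise maximum of absolute SC/LLC responses over all
    blocks in a subsequence (codes: one column per block)."""
    return np.abs(codes).max(axis=1)

codes = np.array([[0.0, -0.5, 0.2],
                  [0.3,  0.1, 0.0]])
pooled = max_pool(codes)                   # -> [0.5, 0.3]
```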
I-C Sequence Descriptor and Classification
To compute the probability of each action class for a subsequence descriptor, we use a logistic regression classifier with L2 regularization. The logistic regression model was initially proposed for binary classification and was later extended to multi-class classification. Given a data instance, a weight vector, a class label, and a bias term, it assumes a sigmoid probability model for the class label given the data.
Given the training instances and their labels, one estimates the model parameters by minimizing the sum of the L2 regularization term and the logistic losses over the training instances, where a user-defined parameter weights the relative importance of the two terms. The multi-class strategy of Crammer and Singer is used to solve the optimization for this multi-class problem.
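The binary building block can be made concrete as follows; this is a sketch of the standard L2-regularized logistic model (labels in {−1, +1}, data-term weight `C` in the LIBLINEAR style), with all concrete values chosen arbitrarily:

```python
import numpy as np

def prob(w, b, x, y):
    """P(y | x) for y in {-1, +1} under the binary logistic model."""
    return 1.0 / (1.0 + np.exp(-y * (w @ x + b)))

def objective(w, b, X, Y, C=1.0):
    """L2-regularised logistic loss to be minimised; C weights the
    data term against the regulariser."""
    margins = Y * (X @ w + b)
    return 0.5 * w @ w + C * np.sum(np.log1p(np.exp(-margins)))

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([0.2, 0.1])
p_pos = prob(w, b, x, +1)
```

The two class probabilities sum to one by construction, which is what lets the per-location classifiers emit well-formed probability vectors.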
For all the subsequences in the training data corresponding to the same spatial location, we train one logistic regression classifier; the number of classifiers therefore equals the total number of overlapping subsequence locations. The classifier corresponding to each spatial location outputs the probability of each of the action classes.
For an input video sequence, the class probabilities of all the spatio-temporal subsequences are concatenated to create the sequence descriptor. In the training stage, the sequence descriptors of all the training action videos are used to train an SVM classifier.
In the testing stage, the sequence descriptor of the test video sequence is computed and fed to the trained classifier. The class label is defined to be the one that corresponds to the maximum probability.
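The assembly of the sequence descriptor can be sketched as below; the `classifiers` interface (one callable per spatial location returning class probabilities, softmax used as a stand-in) is hypothetical:

```python
import numpy as np

def sequence_descriptor(subseq_descriptors, classifiers):
    """Concatenate per-location class-probability vectors into one
    sequence descriptor (hypothetical classifier interface)."""
    return np.concatenate([clf(d) for clf, d in
                           zip(classifiers, subseq_descriptors)])

rng = np.random.default_rng(0)
descs = [rng.standard_normal(5) for _ in range(3)]   # 3 spatial locations
Ws = [rng.standard_normal((4, 5)) for _ in range(3)] # 4 classes each

def make_clf(W):
    def clf(d):
        s = np.exp(W @ d - np.max(W @ d))  # softmax as a probability stand-in
        return s / s.sum()
    return clf

seq = sequence_descriptor(descs, [make_clf(W) for W in Ws])
```

The resulting descriptor (here 3 locations × 4 classes = 12 values) is what the final SVM consumes.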
II Experimental Results
We evaluate the performance of the proposed algorithm on two different types of videos: (1) depth and (2) color. For depth data, we use three standard datasets: MSRAction3D [10, 8], MSRGesture3D [12, 6], and MSRActionPairs3D. For color data, we use two standard datasets: Weizmann and UCFSports. The performance of our proposed algorithm is compared with ten state-of-the-art algorithms [6, 8, 7, 19, 5, 20, 21, 22, 23, 4]. Except for [20, 19], all accuracies are taken from the original papers; for [20, 19], we obtained the codes from the original authors.
Datasets. The MSRGesture3D dataset contains 12 American Sign Language (ASL) gestures. Each gesture is performed 2 or 3 times by each of the 10 subjects. In total, the dataset contains 333 depth sequences. The MSRAction3D dataset consists of 567 depth sequences of 20 human sports actions. Each action is performed by 10 subjects 2 or 3 times. These actions were chosen in the context of interactions with game consoles and cover a variety of movements related to the torso, legs, arms, and their combinations. The MSRActionPairs3D dataset contains 6 pairs of actions, such that within each pair the motion and the shape cues are similar, but their correlations vary. For example, the Pick up and Put down actions have similar motion and shape; however, the co-occurrence of the object shape and the hand motion is in a different spatio-temporal order. Each action is performed 3 times by each of the 10 subjects. In total, the dataset contains 360 depth sequences.
The Weizmann dataset contains 90 video sequences of 9 subjects, each performing 10 actions. The UCFSports dataset consists of videos from sports broadcasts, with a total of 150 videos from 10 action classes. Videos are captured in realistic scenarios with complex and cluttered background, and actions exhibit significant intra-class variation.
Experimental Settings. In all the experiments, the ROIs of the depth/RGB videos are resized to a fixed resolution to reduce computation. The cell size in pixels and the number of cells per block along the spatial dimensions are kept fixed across all datasets; the remaining subdivision parameters are set to 1, and the size of the dictionary is set to 200. For RGB videos, we treat the dependent variable mentioned in Section I-A as the gray-level value of each pixel.
II-A Depth Videos
In the first experiment, we evaluate the proposed method on the depth datasets (MSRGesture3D, MSRAction3D and MSRActionPairs3D). To compare our method with previous techniques on the MSRGesture3D dataset, we use the leave-one-subject-out cross-validation scheme proposed in prior work. The average accuracies of our algorithm using SC and LLC are 89.6% and 94.1%, respectively, which supports our claim that LLC is more discriminative than SC. Our algorithm outperforms the existing state-of-the-art algorithms (Table I). Note that the Actionlet method cannot be applied to this dataset because of the absence of 3D joint positions.
For the MSRAction3D dataset, following previous works [8, 7], we use five subjects for training and five subjects for testing. The accuracy obtained is 90.9%, which is higher than the 87.1% from SC, 88.9% from HON4D, and 89.3% from DSTIP (Table I). For most of the actions, our method achieves near-perfect recognition accuracy. Classification errors occur when two actions are very similar to each other, such as hand catch and high throw.
For the MSRActionPairs3D dataset, following previous work, we use half of the subjects for training and the rest for testing. The proposed algorithm achieves an average accuracy of 98.3%, which is higher than the accuracies of the SC, HON4D and Actionlet methods (Table I). For nine of the actions, our method achieves 100% recognition accuracy; for the remaining three actions, it achieves 93% recognition accuracy.
Table I: Recognition accuracies (%) on the depth datasets.

| Method | MSRGesture3D | MSRAction3D | MSRActionPairs3D |
|---|---|---|---|
| Our Method (SC) | 89.6 | 87.1 | 94.2 |
| Our Method (LLC) | 94.1 | 90.9 | 98.3 |
II-B RGB Videos
In the second experiment, we evaluate the proposed method on the RGB datasets (Weizmann and UCFSports). For the Weizmann dataset, we follow the standard experimental methodology. Testing is performed by leave-one-out (LOO) cross validation on a per-person basis, i.e., for each fold, training is done on 8 subjects and testing on all video sequences of the remaining subject. Our method achieves 100% recognition accuracy, which is higher than the 84.3% accuracy of the HOG3D based method (Table II). For the UCFSports dataset, most of the reported results used LOO cross validation, with a best reported result of 91.3%. The accuracy of our algorithm with LOO is 93.6%, which is higher. However, there are strong scene correlations among videos in certain classes; many videos are captured in exactly the same location. Lan et al. show that, with LOO, the learning method can exploit this correlation and memorize the background instead of learning the action. To help alleviate this problem, as in the recently published works [22, 21, 20], we split the dataset by taking one third of the videos from each action category to form the test set, and the rest of the videos are used for training. This reduces the chances of videos in the test set sharing the same scene with videos in the training set. The accuracy of our algorithm using this protocol is 93.6%. Our algorithm outperforms the existing state-of-the-art algorithms (Table II) by a large margin. Note that [22, 21, 20] used the raw videos whereas our method uses the ROIs of the videos. For a fair comparison, we use the ROIs of the videos and run the code obtained from the original author. Also, one of the compared methods used a different protocol, splitting the UCFSports dataset by taking one fifth of the videos from each action category to form the test set, with the rest used for training.
Table II: Recognition accuracies (%) on the color datasets.

| Method | Weizmann | UCFSports |
|---|---|---|
| Our Method (SC) | 100 | 87.4 |
| Our Method (LLC) | 100 | 93.6 |
The experimental results show that the proposed method can be applied to both color and depth videos, and the accuracies obtained are higher than those of the state-of-the-art methods. The results also show that LLC is more discriminative than SC for the human action recognition problem.
III Conclusion and Future Work
In this paper, we propose a new action recognition algorithm which uses LLC to capture discriminative information of human body variation in each spatio-temporal subsequence of the input depth video. The proposed algorithm has been tested and compared with ten state-of-the-art algorithms on three benchmark depth datasets and two benchmark color datasets. On average, our algorithm is found to be more accurate than these ten algorithms.
-  D. Weinland, M. Özuysal, and P. Fua, “Making action recognition robust to occlusions and viewpoint changes,” in ECCV, 2010.
-  I. Everts, J. V. Gemert, and T. Gevers, “Evaluation of color stips for human action recognition,” in CVPR, 2013.
-  F. Shi, E. Petriu, and R. Laganiere, “Sampling strategies for real-time action recognition,” in CVPR, 2013.
-  H. Rahmani, A. Mahmood, A. Mian, and D. Huynh, “Real time action recognition using histograms of depth gradients and random decision forests,” in WACV, 2014.
-  L. Xia and J. Aggarwal, “Spatio-temporal depth cuboid similarity feature for activity recongition using depth camera,” in CVPR, 2013.
-  J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3D action recognition with random occupancy patterns,” in ECCV, 2012, pp. 872–885.
-  O. Oreifej and Z. Liu, “HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences,” in CVPR, 2013.
-  J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in CVPR, 2012, pp. 1290 –1297.
-  X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in ACM ICM, 2012.
-  W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” in CVPRW, 2010.
-  C. Keskin, F. Kirac, Y. Kara, and L. Akarun, “Real time hand pose estimation using depth sensors,” in ICCVW, 2011.
-  A. Kurakin, Z. Zhang, and Z. Liu, “A real time system for dynamic hand gesture recognition with a depth sensor,” in EUSIPCO, 2012.
-  A. W. Vieira, E. Nascimento, G. Oliveira, Z. Liu, and M. Campos, “STOP: Space-time occupancy patterns for 3D action recognition from depth map sequences,” in CIARP, 2012.
-  S. Tang, X. Wang, X. Lv, T. Han, J. Keller, Z. He, M. Skubic, and S. Lao, “Histogram of oriented normal vectors for object recognition with a depth sensor,” in ACCV, 2012.
-  Y. Zhu, X. Zhao, Y. Fu, and Y. Liu, “Sparse coding on local spatial-temporal volumes for human action recognition,” in ACCV, 2010.
-  J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality constrained linear coding for image classification,” in CVPR, 2010.
-  M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in ICCV, 2005, pp. 1395–1402.
-  M. Rodriguez, J. Ahmed, and M. Shah, “Action mach a spatio-temporal maximum average correlation height filter for action recognition,” in CVPR, 2008.
-  A. Klaeser, M. Marszalek, and C. Schmid, “A spatio-temporal descriptor based on 3d-gradients,” in BMVC, 2008.
-  Y. Tian, R. Sukthankar, and M. Shah, “Spatiotemporal deformable part models for action detection,” in CVPR, 2013.
-  T. Lan, Y. Wang, and G. Mori, “Discriminative figure-centric models for joint action localization and recognition,” in ICCV, 2011.
-  M. Raptis, I. Kokkinos, and S. Soatto, “Discovering discriminative action parts from mid-level video representations,” in CVPR, 2012.
-  A. Yao, J. Gall, and L. Van Gool, “A hough transform-based voting framework for action recognition,” in CVPR, 2010.
-  R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, 2008.
-  S. S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, “A sequential dual method for large scale multi-class linear svms,” in ACM SIGKDD, 2008.
-  X. Wu, D. Xu, L. Duan, and J. Luo, “Scalable action recognition with a subspace forest,” in CVPR, 2012.