I. Introduction
Human action recognition from videos is a challenging problem. Differences in viewing direction and distance, body sizes of the human subjects, clothing, and style of performing the action are some of the main factors making human action recognition difficult. Most of the previous approaches to action recognition have focused on using traditional RGB cameras [1, 2, 3]. Since the release of the Microsoft Kinect depth camera, depth-based human action recognition methods [4, 5, 6, 7, 8, 9, 10, 11, 12] have started to emerge. Being a depth sensor, the Kinect camera is not affected by scene illumination or the color of the clothes worn by the human subject, which makes object segmentation an easier task. The challenges that still remain are loose clothing, occlusions, and variations in the style and speed of actions.
In this context, some algorithms have exploited silhouette and edge pixels as discriminative features. For example, Li et al. [10] sampled boundary pixels from 2D silhouettes as a bag of features. Yang et al. [9] added temporal derivatives of 2D projections to get Depth Motion Maps (DMM). Vieira et al. [13] computed silhouettes in 3D by using space-time occupancy patterns. Instead of these very simple occupancy features, Wang et al. [6] computed a vector of 8 Haar features on a uniform grid in the 4D volume. LDA was used to detect the discriminative feature positions, and an SVM classifier was used for action classification. Xia and Aggarwal [5] proposed an algorithm to extract Space Time Interest Points (STIPs) from depth sequences and modelled local 3D depth cuboids using the Depth Cuboid Similarity Feature (DCSF). However, the accuracy of this algorithm depends on the noise level of the depth images. Tang et al. [14] proposed to use histograms of the normal vectors computed from depth images for object recognition. Given a depth image, they computed the spatial derivatives and transformed them to polar coordinates, where the resulting 2D histograms were used as object descriptors. Oreifej and Liu [7] extended these derivatives to the temporal dimension. They normalized the gradient vectors to unit magnitude and projected them onto a fixed basis before histogramming. In their formulation, the last components of the normalized gradient vectors were the inverse of the gradient magnitude. As a result, information from locations with very strong derivatives, such as edges and silhouettes, may get suppressed. In [15], the HOG3D feature is computed within a small 3D cuboid centered at a space-time point, and Sparse Coding (SC) is utilized to obtain a more discriminative representation. Depth images have a high level of noise, and two subjects may perform one action in different styles. Thus, to favor sparsity, SC might select quite different dictionary elements for similar actions (details in Section I-B.3). To overcome this problem, we use Locality-constrained Linear Coding (LLC) [16] in this paper to make locality more essential than sparsity. Moreover, instead of a small 3D cuboid, we divide the input video sequence into subsequences, blocks, and then cells (Fig. 1). The HOG3D features computed at the cell level are concatenated to form the descriptor at the block level. For each subsequence, the LLC followed by maximum pooling and a logistic regression classifier with L2 regularization finally gives probability values for each action class.
The video sequence is represented by the concatenation of the subsequence descriptors. Finally, we use an SVM for action classification.
We evaluate the proposed algorithm on three standard depth datasets [7, 12, 10] and two standard color datasets [17, 18], and compare it with ten state-of-the-art methods [6, 8, 7, 19, 5, 20, 21, 22, 23, 4]. Our experimental results show that our algorithm outperforms all ten methods.
I-A. Proposed Algorithm
We consider an action as a function operating on a three-dimensional space, with $x$, $y$, and $t$ being the independent variables and the depth $d$ being the dependent variable, i.e., $d = f(x, y, t)$. The discrimination of a particular action can be characterized by the variations of the depth values along these three dimensions.
I-B. Feature Extraction
In order to capture sufficient discriminative information, such as local motion and appearance characteristics, the depth sequence is divided into equally spaced, overlapping spatio-temporal subsequences. Each subsequence is divided into blocks, and each block is further divided into equally spaced, non-overlapping cells of fixed size. This hierarchy is shown in Fig. 1.
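As a concrete illustration of the temporal part of this hierarchy, the sketch below splits a video volume into equally spaced, overlapping subsequences along the time axis; the `length` and `stride` values are illustrative only, not the paper's actual settings.

```python
import numpy as np

def temporal_subsequences(video, length, stride):
    """Split a video volume of shape (x, y, t) into equally spaced,
    overlapping subsequences along the temporal dimension."""
    t = video.shape[2]
    return [video[:, :, s:s + length] for s in range(0, t - length + 1, stride)]

# Example: a 10-frame video split into 4-frame subsequences with 50% overlap.
vol = np.zeros((8, 8, 10))
subs = temporal_subsequences(vol, length=4, stride=2)
```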
I-B.1 Cell Descriptor
In each cell, a 3D histogram of oriented gradients (HOG) feature [19] is computed by evaluating a gradient vector at each pixel $(x, y, t)$ in that cell:

$\nabla d = (d_x, d_y, d_t)^{\top}$,  (1)

where the derivatives are approximated by finite differences: $d_x = d(x+1, y, t) - d(x-1, y, t)$, $d_y = d(x, y+1, t) - d(x, y-1, t)$, and $d_t = d(x, y, t+1) - d(x, y, t-1)$.
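In code, the gradient of Eq. (1) can be sketched with NumPy's central-difference `np.gradient`; the exact finite-difference scheme is an assumption, as the paper's discretization is not fully specified here.

```python
import numpy as np

def depth_gradients(d):
    """Gradient vectors (dd/dx, dd/dy, dd/dt) of a depth volume d[x, y, t],
    computed with central differences; returns one (3,) vector per voxel."""
    gx, gy, gt = np.gradient(d.astype(np.float64))
    return np.stack([gx.ravel(), gy.ravel(), gt.ravel()], axis=1)

# A volume whose depth grows linearly with t has gradient (0, 0, 1) everywhere.
vol = np.tile(np.arange(5, dtype=np.float64), (4, 4, 1))  # shape (4, 4, 5)
grads = depth_gradients(vol)
```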
The HOG feature is computed by projecting each gradient vector onto directions obtained by joining the center of a regular polyhedron with the centers of its faces. A regular dodecahedron, a polyhedron composed of 12 regular pentagonal faces, is commonly used to quantize 3D gradients; each face corresponds to one histogram bin. Let $P$ be the matrix of the center positions of all faces:

$P = [p_1, p_2, \ldots, p_{12}]^{\top}$.  (2)

For a regular dodecahedron with center at the origin, these normalized vectors are given by:

$p_i \in \frac{1}{\sqrt{1+\varphi^2}} \{(0, \pm 1, \pm\varphi),\ (\pm 1, \pm\varphi, 0),\ (\pm\varphi, 0, \pm 1)\}$,

where $\varphi = (1+\sqrt{5})/2$ is the golden ratio and $\sqrt{1+\varphi^2}$ is the length of each unnormalized vector $p_i$. The gradient vector $\nabla d$ is projected on $P$ to give

$q = P \, \dfrac{\nabla d}{\|\nabla d\|}$.  (3)
Since $q$ should vote into only one single bin when $\nabla d$ is perfectly aligned with the corresponding axis running through the origin and the face center, the projected vector should be quantized. A threshold value $T$ is computed by projecting any two neighboring vectors $p_i$ and $p_j$ onto each other, i.e.,

$T = p_i^{\top} p_j = 1/\sqrt{5}$.  (4)
The quantized vector $\hat{q}$ is given by

$\hat{q}_i = q_i - T$ if $q_i > T$, and $\hat{q}_i = 0$ otherwise,

where $i = 1, \ldots, 12$. We define $q_c$ to be $\hat{q}$ scaled by the gradient magnitude, i.e.,

$q_c = \|\nabla d\| \, \dfrac{\hat{q}}{\|\hat{q}\|}$,  (5)
and a histogram, $h$, for each cell is computed:

$h = \sum_{\text{pixels in the cell}} q_c$.  (6)

Note that this summation is equivalent to histogramming, because each vector $q_c$ represents votes in the corresponding bins defined by the 12 directions of the regular dodecahedron.
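Eqs. (3)-(6) can be sketched as follows. The 12 face-center directions and the neighbor threshold $T = 1/\sqrt{5}$ follow from the dodecahedron geometry; the unvectorized loop is purely illustrative.

```python
import numpy as np

PHI = (1 + np.sqrt(5)) / 2  # golden ratio

# Normalized face-center directions of a regular dodecahedron (12 bins).
P = np.array([[0, s1, s2 * PHI] for s1 in (1, -1) for s2 in (1, -1)] +
              [[s1, s2 * PHI, 0] for s1 in (1, -1) for s2 in (1, -1)] +
              [[s2 * PHI, 0, s1] for s1 in (1, -1) for s2 in (1, -1)],
              dtype=np.float64)
P /= np.linalg.norm(P, axis=1, keepdims=True)

# Eq. (4): threshold = cosine between any two neighboring directions.
G = P @ P.T
np.fill_diagonal(G, -1.0)
T = G.max()  # equals 1/sqrt(5)

def cell_histogram(gradients):
    """12-bin quantized gradient histogram of one cell, Eqs. (3)-(6)."""
    h = np.zeros(12)
    for g in gradients:
        mag = np.linalg.norm(g)
        if mag == 0:
            continue
        q = P @ (g / mag)                  # Eq. (3): project the unit gradient
        qh = np.where(q > T, q - T, 0.0)   # quantize: keep only aligned bins
        n = np.linalg.norm(qh)
        if n > 0:
            h += mag * qh / n              # Eq. (5): rescale by magnitude
    return h                               # Eq. (6): accumulate the votes
```

A gradient aligned exactly with one face center votes into that bin alone, with weight equal to its magnitude.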
I-B.2 Block Descriptor
We combine a fixed number of cells in each neighborhood into blocks. To get the block descriptor, we vertically concatenate the histograms of all cells in that block (Fig. 2(b)):

$b = [h_1^{\top}, h_2^{\top}, \ldots, h_{n_c}^{\top}]^{\top}$,  (7)

where $n_c$ denotes the total number of cells in each block.
The block descriptor in (7) is normalized and a symmetric sigmoid function is applied to it as a trade-off between the gradient magnitude and the gradient orientation:

$\tilde{b}_i = \dfrac{2}{1 + e^{-\beta b_i}} - 1$,  (8)

where $i$ is an index, $i = 1, \ldots, 12 n_c$, and $\beta$ controls the slope of the sigmoid. Finally, $\tilde{b}$ is L2-normalized to give

$\hat{b} = \dfrac{\tilde{b}}{\|\tilde{b}\|_2}$.  (9)
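A sketch of Eqs. (7)-(9), assuming the symmetric sigmoid takes the scaled-logistic form $2/(1+e^{-\beta x}) - 1$ with an illustrative $\beta$ (the exact form and slope are assumptions):

```python
import numpy as np

def block_descriptor(cell_histograms, beta=2.0):
    """Concatenate cell histograms (Eq. 7), squash with a symmetric sigmoid
    (Eq. 8, assumed form), and L2-normalize (Eq. 9)."""
    b = np.concatenate(cell_histograms)          # Eq. (7): stack cell histograms
    b = b / (np.linalg.norm(b) + 1e-12)          # normalize before the sigmoid
    b = 2.0 / (1.0 + np.exp(-beta * b)) - 1.0    # symmetric sigmoid
    return b / (np.linalg.norm(b) + 1e-12)       # Eq. (9): unit L2 norm

desc = block_descriptor([np.ones(12), 2 * np.ones(12)])
```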
I-B.3 Subsequence Descriptor Using SC
To compare the descriptors computed using SC and LLC, we show below how we obtain the intermediate representation shown in Fig. 2(c) using both methods, yielding two different representations for the spatio-temporal subsequences.
For each spatio-temporal subsequence, we combine all the spatio-temporal blocks along the time dimension (see Fig. 1(b) and Fig. 2(a)). We horizontally concatenate all the block descriptors within the subsequence to give

$X = [b_1, b_2, \ldots, b_n]$,  (10)

where $n$ denotes the number of blocks within the subsequence. In the special case where each block spans a single frame, $n$ becomes the number of frames in the depth video.
We then create a dictionary $D$ by applying the $k$-means clustering algorithm to the HOG3D descriptors computed from the training blocks, where $k$ is the number of dictionary atoms. The SC sparsely encodes each block descriptor $b_i$ as a linear combination of a few atoms of the dictionary $D$ by optimizing:

$\min_{c_i} \ \|b_i - D c_i\|^2 + \lambda \|c_i\|_1$,  (11)

where $\lambda$ is a regularization parameter which determines the sparsity of the representation of each local spatio-temporal feature. This twofold optimization minimizes the reconstruction error while encouraging sparsity of the coefficient set $c_i$.
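Eq. (11) can be solved with any L1 solver; the sketch below uses plain ISTA (iterative soft-thresholding), a standard choice but not necessarily the solver used in the paper.

```python
import numpy as np

def sparse_code(x, D, lam=0.2, n_iter=200):
    """Minimize ||x - D c||^2 + lam * ||c||_1 (Eq. 11) by ISTA.
    D is a (d, k) dictionary with one atom per column."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12        # step bound (squared spectral norm)
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = c - D.T @ (D @ c - x) / L            # gradient step on the data term
        c = np.sign(g) * np.maximum(np.abs(g) - lam / (2 * L), 0.0)  # shrink
    return c

# With an orthonormal dictionary, SC reduces to soft-thresholding of x.
c = sparse_code(np.array([1.0, 0.05, 0.0, 0.0]), np.eye(4), lam=0.2)
```

The orthonormal-dictionary case is a useful sanity check: components below the threshold are zeroed, so the code is genuinely sparse.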
I-B.4 Subsequence Descriptor Using LLC
An alternative way to obtain the coefficient set is the LLC. Instead of Eq. (11), the LLC code uses the following criterion:

$\min_{c_i} \ \|b_i - D c_i\|^2 + \lambda \|w_i \odot c_i\|^2$ subject to $\mathbf{1}^{\top} c_i = 1$,  (12)

where $\odot$ denotes the element-wise multiplication, and $w_i$ is the locality adapter that gives different weights to the dictionary atoms depending on how similar they are to the input descriptor $b_i$. Specifically,

$w_i = \exp\left(\mathrm{dist}(b_i, D) / \sigma\right)$,

where $\mathrm{dist}(b_i, D)$ is the vector of Euclidean distances between $b_i$ and the dictionary atoms, and $\sigma$ is used for adjusting the weight decay speed of the locality adapter.
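LLC admits an analytical solution (given in [16]); the sketch below follows that form, with `lam` and `sigma` as illustrative values and the locality adaptor rescaled for numerical stability as in [16].

```python
import numpy as np

def llc_code(x, D, lam=1e-4, sigma=1.0):
    """LLC encoding of x (Eq. 12) via the analytical solution of [16].
    D is a (d, k) dictionary with one atom per column."""
    k = D.shape[1]
    dist = np.linalg.norm(D - x[:, None], axis=0)
    w = np.exp((dist - dist.max()) / sigma)   # locality adaptor, rescaled to (0, 1]
    Z = (D - x[:, None]).T                    # atoms shifted to the data point
    C = Z @ Z.T + lam * np.diag(w)            # regularized local covariance
    c = np.linalg.solve(C, np.ones(k))
    return c / c.sum()                        # enforce the sum-to-one constraint

# A descriptor identical to one atom is coded (almost) entirely by that atom.
c = llc_code(np.array([1.0, 0.0, 0.0]), np.eye(3))
```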
For the encoding of block descriptors, we found that locality is more essential than sparsity, as locality must lead to sparsity but not necessarily vice versa [16]. So, the LLC should be more discriminative than SC. We illustrate the superiority of LLC over SC with an example. Fig. 3(a) shows a block within a spatio-temporal subsequence of a subject performing an action, and Fig. 3(b) shows the block at the same spatial location in another video of a different subject performing the same action. The SC coefficients for these two blocks (Fig. 3(c)) are quite different (Euclidean distance 1.38), which means that using SC may fail to identify the actions as the same. On the other hand, the LLC coefficients for these two blocks (Fig. 3(d)) are very similar to each other (both are distinct from the SC coefficients) and lie a smaller distance apart (Euclidean distance 0.2). As shown in this figure, LLC represents the spatio-temporal blocks more consistently.
After the optimization step (using Eq. (11) or (12)), we get a set of sparse codes $C = [c_1, c_2, \ldots, c_n]$, where each vector $c_i$ has only a few nonzero elements. Equivalently, each block descriptor responds to only a small subset of the dictionary atoms.
To capture the global statistics of the subsequence, we use a maximum pooling function $v = \mathcal{F}(C)$ (Fig. 2(c)), where $\mathcal{F}$ returns a vector whose $j$-th element is defined as:

$v_j = \max\{|c_{1j}|, |c_{2j}|, \ldots, |c_{nj}|\}$.  (13)

Through maximum pooling, the obtained vector $v$ is the subsequence descriptor (Fig. 2(c)).
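Eq. (13) is a one-line reduction over the matrix of codes; a sketch, assuming the codes of a subsequence are stacked row-wise:

```python
import numpy as np

def max_pool(codes):
    """Eq. (13): per-atom maximum absolute response across the n blocks
    of a subsequence. `codes` has shape (n_blocks, k)."""
    return np.max(np.abs(np.asarray(codes, dtype=np.float64)), axis=0)

v = max_pool([[0.5, -0.1, 0.0],
              [-0.2, 0.8, 0.3]])
```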
I-C. Sequence Descriptor and Classification
The probability of each action class given a subsequence descriptor $v$ is $P(y \mid v)$. To calculate this probability, we use a logistic regression classifier with L2 regularization [24]. The logistic regression model was initially proposed for binary classification; however, it was later extended to multiclass classification. Given data $x$, weight vector $w$, class label $y \in \{-1, +1\}$, and a bias term $b$, it assumes the following probability model:

$P(y = \pm 1 \mid x, w) = \dfrac{1}{1 + e^{-y (w^{\top} x + b)}}$.  (14)

If the training instances are $x_i$ with labels $y_i \in \{-1, +1\}$, for $i = 1, \ldots, l$, one estimates $(w, b)$ by minimizing the following objective:

$\min_{w, b} \ \dfrac{1}{2} w^{\top} w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i (w^{\top} x_i + b)}\right)$,  (15)

where $\frac{1}{2} w^{\top} w$ denotes the L2 regularization term, and $C > 0$ is a user-defined parameter that weights the relative importance of the two terms. The multiclass formulation of Crammer and Singer, solved with the sequential dual method of [25], is used for this multiclass problem.
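A minimal NumPy sketch of Eqs. (14)-(15) for the binary case; plain gradient descent stands in for the LIBLINEAR solver [24], and the bias term is folded into $w$ for brevity.

```python
import numpy as np

def lr_train(X, y, C=1.0, step=0.1, n_iter=2000):
    """L2-regularized logistic regression (Eq. 15): minimize
    0.5 * w'w + C * sum_i log(1 + exp(-y_i * w'x_i)), with y_i in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ w)
        grad = w - C * X.T @ (y / (1.0 + np.exp(margins)))  # exact gradient
        w -= step * grad / len(y)
    return w

def lr_prob(w, x):
    """Eq. (14): P(y = +1 | x, w)."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

# Toy 1-D separable data: positive class at x > 0, negative at x < 0.
X = np.array([[2.0], [3.0], [-2.0], [-3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = lr_train(X, y)
```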
For all the subsequences in the training data corresponding to the same spatial location, we train one logistic regression classifier. If there are $m$ spatial locations (i.e., $m$ is the total number of overlapping subsequences), we will have $m$ classifiers. The classifier corresponding to each spatial location outputs the probability of each action class, $P(y = j \mid v)$ for $j = 1, \ldots, c$, where $c$ is the number of class labels.
For an input video sequence, the class probabilities of all the spatio-temporal subsequences are concatenated to create the sequence descriptor. In the training stage, the sequence descriptors of all the training action videos are used to train an SVM classifier.
In the testing stage, the sequence descriptor of the test video sequence is computed and fed to the trained classifier. The class label is defined to be the one that corresponds to the maximum probability.
II. Experimental Results
We evaluate the performance of the proposed algorithm on two different types of videos: (1) depth and (2) color. For depth data, we use three standard datasets: MSRAction3D [10, 8], MSRGesture3D [12, 6], and MSRActionPairs3D [7]. For color data, we use two standard datasets: Weizmann [17] and UCFSports [18]. The performance of our proposed algorithm is compared with ten state-of-the-art algorithms [6, 8, 7, 19, 5, 20, 21, 22, 23, 4]. Except for [20, 19], all accuracies are quoted from the original papers; for [20, 19], we obtained the code from the original authors and ran it ourselves.
Datasets. The MSRGesture3D dataset contains 12 American Sign Language (ASL) gestures. Each gesture is performed 2 or 3 times by each of the 10 subjects; in total, the dataset contains 333 depth sequences. The MSRAction3D dataset consists of 567 depth sequences of 20 human sports actions. Each action is performed by 10 subjects 2 or 3 times. These actions were chosen in the context of interaction with game consoles and cover a variety of movements related to the torso, legs, arms, and their combinations. The MSRActionPairs3D dataset contains 6 pairs of actions in which the motion and shape cues within each pair are similar but their correlations vary. For example, the Pick up and Put down actions have similar motion and shape; however, the co-occurrence of the object shape and the hand motion happens in a different spatio-temporal order. Each action is performed 3 times by each of the 10 subjects; in total, the dataset contains 360 depth sequences.

The Weizmann dataset contains 90 video sequences of 9 subjects, each performing 10 actions. The UCFSports dataset consists of videos from sports broadcasts, with a total of 150 videos from 10 action classes. The videos are captured in realistic scenarios with complex and cluttered backgrounds, and the actions exhibit significant intra-class variation.
Experimental Settings. In all the experiments, the ROIs of the depth/RGB videos are resized to a fixed resolution for the sake of computation, and the cell size and the number of cells per block along the $x$ and $y$ dimensions are kept fixed. The remaining subdivision parameters are set to 1, and the size of the dictionary is set to 200. For RGB videos, we treat the variable $d$ mentioned in Section I-A as the gray-level value of each pixel.
II-A. Depth Videos
In the first experiment, we evaluate the proposed method on the depth datasets (MSRGesture3D, MSRAction3D, and MSRActionPairs3D). To compare our method with previous techniques on the MSRGesture3D dataset, we use the leave-one-subject-out cross-validation scheme proposed by [6]. The average accuracies of our algorithm using SC and LLC are 89.6% and 94.1%, respectively; these results indicate that LLC is more discriminative than SC. Our algorithm outperforms the existing state-of-the-art algorithms (Table I). Note that the Actionlet method [8] cannot be applied to this dataset because 3D joint positions are not available.
For the MSRAction3D dataset, following previous works [8, 7], we use five subjects for training and five subjects for testing. The accuracy obtained is 90.9%, which is higher than the 87.1% from SC, 88.9% from HON4D [7], and 89.3% from DSTIP [5] (Table I). For most of the actions, our method achieves near-perfect recognition accuracy. Classification errors occur when two actions are too similar to each other, such as hand catch and high throw.
For the MSRActionPairs3D dataset, following [7], we use half of the subjects for training and the rest for testing. The proposed algorithm achieves an average accuracy of 98.3%, which is higher than the accuracies of the SC, HON4D, and Actionlet methods (Table I). For nine of the actions, our method achieves 100% recognition accuracy; for the remaining three actions, it achieves 93%.
Table I. Recognition accuracy (%) on the three depth datasets.

Method            | Gesture | Action | A/Pairs
------------------|---------|--------|--------
HOG3D [19]        | 85.2    | 81.4   | 88.2
ROP [6]           | 88.5    | 86.5   | -
Actionlet [8]     | NA      | 88.2   | 82.2
HON4D [7]         | 92.4    | 88.9   | 96.7
DSTIP [5]         | -       | 89.3   | -
RDF [4]           | 92.8    | 88.8   | -
Our Method (SC)   | 89.6    | 87.1   | 94.2
Our Method (LLC)  | 94.1    | 90.9   | 98.3
II-B. RGB Videos
In the second experiment, we evaluate the proposed method on the RGB datasets (Weizmann and UCFSports). For the Weizmann dataset, we follow the experimental methodology of [17]: testing is performed leave-one-out (LOO) on a per-person basis, i.e., for each fold, training is done on 8 subjects and testing on all video sequences of the remaining subject. Our method achieves 100% recognition accuracy, which is higher than the 84.3% of the HOG3D-based method [19] (Table II). For the UCFSports dataset, most of the reported results use LOO cross-validation, and the best such result is 91.3% [26]. The accuracy of our algorithm with LOO is 93.6%, which is higher than [26]. However, there are strong scene correlations among videos in certain classes; many videos are captured in exactly the same location. Lan et al. [21] show that, with LOO, the learning method can exploit this correlation and memorize the background instead of learning the action. To help alleviate this problem, as in the recently published works [22, 21, 20], we split the dataset by taking one third of the videos from each action category to form the test set, with the rest of the videos used for training. This reduces the chance of videos in the test set sharing the same scene with videos in the training set. The accuracy of our algorithm using this protocol is 93.6%. Our algorithm outperforms the existing state-of-the-art algorithms (Table II) by a large margin. Note that [22, 21, 20] used the raw videos, whereas our method uses the ROIs of the videos; for a fair comparison, we used the ROIs of the videos and ran the code of [20] that we obtained from the original author. Also, [23] used a different protocol, splitting the UCFSports dataset by taking one fifth of the videos from each action category to form the test set, with the rest of the videos used for training.
Table II. Recognition accuracy (%) on the two RGB datasets.

Method              | Weizmann | UCFSports
--------------------|----------|----------
HOG3D [19]          | 84.3     | 76.2
HoughVoting [23]    | 97.8     | 86.6
SparseCoding [15]   | -        | 84.33
FigureCentric [21]  | -        | 73.1
ActionParts [22]    | -        | 79.4
SDPM [20]           | 100      | 79.8
Our Method (SC)     | 100      | 87.4
Our Method (LLC)    | 100      | 93.6
The experimental results show that the proposed method can be applied to both color and depth videos and that the accuracies obtained are higher than those of the state-of-the-art methods. The results also indicate that LLC is more discriminative than SC for the human action recognition problem.
III. Conclusion and Future Work
In this paper, we propose a new action recognition algorithm which uses LLC to capture discriminative information about human body variations in each spatio-temporal subsequence of the input depth video. The proposed algorithm has been tested and compared with ten state-of-the-art algorithms on three benchmark depth datasets and two benchmark color datasets. On average, our algorithm is more accurate than these ten algorithms.
Acknowledgment
References
 [1] D. Weinland, M. Özuysal, and P. Fua, “Making action recognition robust to occlusions and viewpoint changes,” in ECCV, 2010.
 [2] I. Everts, J. V. Gemert, and T. Gevers, “Evaluation of color STIPs for human action recognition,” in CVPR, 2013.
 [3] F. Shi, E. Petriu, and R. Laganiere, “Sampling strategies for real-time action recognition,” in CVPR, 2013.
 [4] H. Rahmani, A. Mahmood, A. Mian, and D. Huynh, “Real time action recognition using histograms of depth gradients and random decision forests,” in WACV, 2014.
 [5] L. Xia and J. Aggarwal, “Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera,” in CVPR, 2013.
 [6] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3D action recognition with random occupancy patterns,” in ECCV, 2012, pp. 872–885.
 [7] O. Oreifej and Z. Liu, “HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences,” in CVPR, 2013.
 [8] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in CVPR, 2012, pp. 1290 –1297.
 [9] X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in ACM ICM, 2012.
 [10] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” in CVPRW, 2010.
 [11] C. Keskin, F. Kirac, Y. Kara, and L. Akarun, “Real time hand pose estimation using depth sensors,” in ICCVW, 2011.
 [12] A. Kurakin, Z. Zhang, and Z. Liu, “A real time system for dynamic hand gesture recognition with a depth sensor,” in EUSIPCO, 2012.
 [13] A. W. Vieira, E. Nascimento, G. Oliveira, Z. Liu, and M. Campos, “STOP: Space-time occupancy patterns for 3D action recognition from depth map sequences,” in CIARP, 2012.
 [14] S. Tang, X. Wang, X. Lv, T. Han, J. Keller, Z. He, M. Skubic, and S. Lao, “Histogram of oriented normal vectors for object recognition with a depth sensor,” in ACCV, 2012.
 [15] Y. Zhu, X. Zhao, Y. Fu, and Y. Liu, “Sparse coding on local spatialtemporal volumes for human action recognition,” in ACCV, 2010.
 [16] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality constrained linear coding for image classification,” in CVPR, 2010.
 [17] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as spacetime shapes,” in ICCV, 2005, pp. 1395–1402.
 [18] M. Rodriguez, J. Ahmed, and M. Shah, “Action MACH: A spatio-temporal maximum average correlation height filter for action recognition,” in CVPR, 2008.
 [19] A. Klaeser, M. Marszalek, and C. Schmid, “A spatio-temporal descriptor based on 3D-gradients,” in BMVC, 2008.
 [20] Y. Tian, R. Sukthankar, and M. Shah, “Spatiotemporal deformable part models for action detection,” in CVPR, 2013.
 [21] T. Lan, Y. Wang, and G. Mori, “Discriminative figurecentric models for joint action localization and recognition,” in ICCV, 2011.
 [22] M. Raptis, I. Kokkinos, and S. Soatto, “Discovering discriminative action parts from midlevel video representations,” in CVPR, 2012.
 [23] A. Yao, J. Gall, and L. Van Gool, “A Hough transform-based voting framework for action recognition,” in CVPR, 2010.

 [24] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, 2008.
 [25] S. S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, “A sequential dual method for large scale multi-class linear SVMs,” in ACM SIGKDD, 2008.
 [26] X. Wu, D. Xu, L. Duan, and J. Luo, “Scalable action recognition with a subspace forest,” in CVPR, 2012.