1 Introduction
Recognition of human actions from RGBD (Red, Green, Blue and Depth) data has attracted increasing attention in computer vision in recent years due to the advantages of depth information over conventional RGB video, e.g. being insensitive to illumination changes and allowing reliable estimation of body silhouette and skeleton [14]. Since the first work of this type [10] was reported in 2010, many methods [16, 24, 2, 12] have been proposed based on specific handcrafted feature descriptors extracted from depth and/or skeleton data, and many benchmark datasets have been created for evaluating algorithms. With the recent development of deep learning, a few methods [18, 19, 3] have been developed based on Convolutional Neural Networks (ConvNets) or Recurrent Neural Networks (RNNs). However, in most cases, either deeply learned or handcrafted features have been employed, and few studies have reported on the advantages of using both kinds of features simultaneously.
This paper presents a novel framework that combines deeply learned features from the depth modality through ConvNets and handcrafted features extracted from the skeleton modality. The framework overcomes the weakness of ConvNets being sensitive to global translation, rotation and scaling, and leverages the strength of skeleton-based features, e.g. Histogram of Oriented Displacement (HOD) [7], which are invariant to scale, speed and background clutter. In particular, depth and skeleton data are first augmented for deep learning and to make the recognition insensitive to view variance. Secondly, depth sequences are segmented using handcrafted features based on the joint motion histogram to exploit the local temporal information. All training segments are clustered using an Infinite Gaussian Mixture Model (IGMM) through Bayesian estimation and labelled for training ConvNets on the depth maps. Thus, a depth sequence can be reliably encoded into a sequence of segment labels. Finally, the sequence of labels is fed into a joint Hidden Markov Model and Support Vector Machine (HMM-SVM) classifier to explore the global temporal information for final recognition.
The proposed framework demonstrates a novel way of effectively exploring the spatial-temporal (both local and global) information for action recognition. It has a number of advantages compared to conventional discriminative methods, in which temporal information is often either ignored or weakly encoded into a descriptor, and to generative methods, in which temporal information tends to be overemphasized, especially when the training data is not sufficient. Firstly, the use of skeleton data to divide video sequences into segments makes each segment have consistent and similar movement and, to some extent, be semantically meaningful (though this is not the intention of this paper), since skeletons are relatively high-level information extracted from depth maps and each part of the skeleton has semantics. Secondly, the ConvNets trained over Depth Motion Maps (DMMs) of depth maps provide a reliable sequence of labels by considering both the spatial and the local temporal information encoded into the DMMs. Thirdly, the use of HMMs on the sequences of segment labels explores the global temporal information effectively, and the SVM classifier further exploits the discriminative power of the label sequences for final classification.
The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 presents the proposed framework.
2 Related Work
Human action recognition from RGBD data has been extensively researched and much progress has been made since the seminal work [10]. One of the main advantages of depth data is that they can effectively capture 3D structural information. To date, many effective handcrafted features have been proposed based on depth data, such as Action Graph (AG) [10], Depth Motion Maps (DMMs) [27], Histogram of Oriented 4D Normals (HON4D) [12], Depth Spatio-Temporal Interest Point (DSTIP) [23] and Super Normal Vector (SNV) [26]. Recent work [18] showed that features from depth maps can also be deeply learned using ConvNets.
Skeleton data, which are usually extracted from depth maps [14], provide a high-level representation of human motion. Many handcrafted skeleton-based features have also been developed. They include EigenJoints [25], Moving Pose [28], Histogram of Oriented Displacement (HOD) [7], Frequent Local Parts (FLPs) [20] and Points in Lie Group (PLP) [15]. Recently, the work [3] demonstrated that features from skeleton data can also be learned directly by deep learning methods. However, skeleton data can be quite noisy, especially when occlusion exists and when the subjects are not in a standing position facing the RGBD camera.
Joint use of both depth maps and skeleton data has also been attempted. Wang et al. [16] designed a 3D Local Occupancy Patterns (LOP) feature to describe the local depth appearance at joint locations and capture the information for subject-object interactions. In their subsequent work, Wang et al. [17] proposed an Actionlet Ensemble Model (AEM) which combines both the LOP feature and the Temporal Pyramid Fourier (TPF) feature. Althloothi et al. [1] presented two sets of features extracted from depth maps and skeletons, which are fused at the kernel level using the Multiple Kernel Learning (MKL) technique. Wu and Shao [22] adopted Deep Belief Networks (DBN) and 3D Convolutional Neural Networks (3DCNN) for skeletal and depth data, respectively, to extract high-level spatial-temporal features.
3 Proposed Framework
Fig. 1 shows the block diagram of the proposed framework. It consists of five key components: data augmentation, which enlarges the training set by mimicking virtual cameras through rotating the viewpoints; segmentation, which divides sequences of depth maps into segments by extracting key frames from the skeleton data, so as to exploit the local temporal information; IGMM clustering, which labels all training segments through clustering; ConvNets on DMMs, which are trained to classify segments reliably; and HMM-SVM, which models the global temporal information of actions and classifies a sequence of segment labels into an action.
3.1 Data Augmentation
The main purposes of data augmentation are to address the issue of training ConvNets on a small dataset and to deal with view variations. The method presented in [18] is adopted in this paper, where the data are augmented by rotating the 3D point clouds captured in the depth maps, together with the skeletons, to mimic virtual cameras at different viewpoints. More details can be found in [18].
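As a rough illustration of the virtual-camera idea, the sketch below rotates a set of 3D points (skeleton joints or points recovered from a depth map) about the y-axis and then the x-axis. It is a simplified sketch, not the procedure of [18]: the rotation is taken about the origin, and the function names and the angle grids are hypothetical choices made only for this example.

```python
import math


def rotate_points(points, theta_deg, beta_deg):
    """Rotate 3D points about the y-axis by theta, then about the x-axis by beta."""
    t, b = math.radians(theta_deg), math.radians(beta_deg)
    ct, st, cb, sb = math.cos(t), math.sin(t), math.cos(b), math.sin(b)
    rotated = []
    for x, y, z in points:
        # Rotation about the y-axis (camera moving left or right).
        x1, y1, z1 = ct * x + st * z, y, -st * x + ct * z
        # Rotation about the x-axis (camera moving up or down).
        rotated.append((x1, cb * y1 - sb * z1, sb * y1 + cb * z1))
    return rotated


def augment(points, thetas=(-30, 0, 30), betas=(-15, 0, 15)):
    """Return one rotated copy of the points per (theta, beta) virtual viewpoint.

    The angle grids are illustrative assumptions, not values from the paper.
    """
    return {(t, b): rotate_points(points, t, b) for t in thetas for b in betas}
```

Each entry of the returned dictionary corresponds to one virtual viewpoint; the entry for `(0, 0)` reproduces the original points.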
3.2 Segmentation
In order to exploit the local temporal information, depth and skeleton sequences are divided into segments such that motion across frames within each segment is similar and consistent. To this end, key frames are first extracted using inter-frame and intra-frame joint motion histogram analysis, a method similar to the one described in [13]. Compared to the optical flow used in [13], the joint motion histograms are insensitive to background motion. Specifically, skeleton data are first projected onto three orthogonal Cartesian planes. The motion vector of each joint between two frames is computed in each plane and quantized by its magnitude and orientation. Each combination of magnitude and orientation corresponds to a bin in the motion histogram. Given the number of joints $J$, the probability of the $k$-th bin in the histogram of one projection is given as:
$$p_k = \frac{h(k)}{J} \tag{1}$$
where $h(k)$ denotes the count of the $k$-th bin. The final motion histogram is a concatenation of the histograms in the three projections. Thus, the entropy of motion vectors in this frame can be defined as:
$$\eta = -\sum_{k=1}^{k_{max}} p_k \log(p_k) \tag{2}$$
where $k$ denotes the bin index and $k_{max}$ is the total number of bins in the histogram. Intuitively, a peaked motion histogram contains less motion information and thus produces a low entropy value; a flat and distributed histogram includes more motion information and therefore yields a high entropy value. Then, we follow the work [13], where intra-frame and inter-frame analysis (more details can be found in [13]) are designed to extract key frames such that the extracted key frames contain complex and fast-changing motion and, thus, are salient with respect to their neighboring frames. The details of the algorithm are summarized in Algorithm 1.
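The histogram and entropy computation of Eqs. (1) and (2) can be sketched as follows. This is only an illustration under stated assumptions: the numbers of magnitude and orientation bins, the magnitude cap `max_mag`, the choice of a base-2 logarithm and the function names are not specified in the text and are hypothetical here.

```python
import math


def motion_histogram(prev_joints, curr_joints, n_mag=4, n_ori=8, max_mag=1.0):
    """Concatenated joint motion histogram over the xy, yz and xz projections.

    n_mag, n_ori and max_mag are illustrative assumptions.
    """
    planes = ((0, 1), (1, 2), (0, 2))
    hist = [0] * (3 * n_mag * n_ori)
    for p, (a, b) in enumerate(planes):
        for prev, curr in zip(prev_joints, curr_joints):
            du, dv = curr[a] - prev[a], curr[b] - prev[b]
            mag = math.hypot(du, dv)
            ori = math.atan2(dv, du) % (2 * math.pi)
            # Quantize magnitude and orientation; clamp to the last bin.
            m = min(int(mag / max_mag * n_mag), n_mag - 1)
            o = min(int(ori / (2 * math.pi) * n_ori), n_ori - 1)
            hist[p * n_mag * n_ori + m * n_ori + o] += 1
    return hist


def motion_entropy(hist, n_joints):
    """Entropy of Eq. (2), with bin probabilities h(k) / J as in Eq. (1)."""
    entropy = 0.0
    for count in hist:
        if count > 0:  # empty bins contribute nothing
            p = count / n_joints
            entropy -= p * math.log2(p)  # base 2 is an assumption
    return entropy
```

When all joints move identically, every projection has a single occupied bin and the entropy is zero; when the joints move in different directions, the counts spread over several bins and the entropy grows, which matches the intuition given above.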
Once key frames have been extracted from an action sample, the whole video sequence can be divided into segments, with the key frames, together with the first and last frames of the sequence, serving as the beginning or ending frames of the segments.
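The division into segments can be sketched as below. The splitting function follows the description above directly. The key-frame picker, however, is a deliberately simple and hypothetical stand-in (strict local maxima of the per-frame entropy) and is not the intra-frame and inter-frame analysis of [13]; both function names are assumptions of this example.

```python
def pick_key_frames(entropies):
    """Hypothetical stand-in for [13]: frames whose entropy is a strict local maximum."""
    return [i for i in range(1, len(entropies) - 1)
            if entropies[i - 1] < entropies[i] > entropies[i + 1]]


def split_at_key_frames(n_frames, key_frames):
    """Return (start, end) frame indices of the segments, both ends inclusive.

    The first and last frames of the sequence are always used as boundaries,
    and each key frame ends one segment and begins the next.
    """
    boundaries = sorted({0, n_frames - 1, *key_frames})
    return list(zip(boundaries[:-1], boundaries[1:]))
```

For example, a sequence of 7 frames with key frames 1 and 4 is divided into the segments (0, 1), (1, 4) and (4, 6).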
3.3 IGMM Clustering
The segments of training samples are clustered using HOD [7] features extracted from the skeleton data; these segments are then labelled and used to train ConvNets on DMMs constructed from the depth maps of the segments. Assume that all the action samples in one dataset are segmented into a total of $N$ video segments, and let $X = \{x_1, \dots, x_N\}$ be the HODs of these segments, where the dimension of the HOD is $d$ and $x_i \in \mathbb{R}^d$.
In this paper, it is assumed that the HODs from all segments can be modeled by an IGMM [21]. A Bayesian approach is adopted to find the best model of $K$ Gaussian components, i.e. the one that gives the maximum posterior probability (MAP); each Gaussian accounts for a distinct type or class of segments. Compared with traditional $K$-means, the IGMM dynamically estimates the number of clusters from the data. Mathematically, the model is specified with the following notation:
$$c_i \mid \pi \sim \mathrm{Discrete}(\pi), \qquad x_i \mid c_i = k, \Theta \sim \mathcal{N}(\mu_k, \Sigma_k) \tag{3}$$
where $C = \{c_i\}_{i=1}^{N}$ indicates which class each HOD $x_i$ belongs to; $\Theta = \{\theta_k\}_{k=1}^{K}$, $\theta_k = \{\mu_k, \Sigma_k\}$ are the distribution parameters of class $k$; and $\pi = \{\pi_k\}_{k=1}^{K}$, $\pi_k = P(c_i = k)$ are the mixture weights. Here, the number of clusters $K$ is unknown; otherwise the complete data likelihood could be computed.
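To address this problem, a fully Bayesian approach is adopted instead of the conventional maximum likelihood (ML) approach, where the relationship between the observed data $X$ and a model $\mathcal{M}$ in Bayes's rule is:
$$P(\mathcal{M} \mid X) \propto P(X \mid \mathcal{M})\, P(\mathcal{M}) \tag{4}$$
With respect to the conjugate priors for the model parameters, the same method as the one proposed in [21] is adopted, that is, Dirichlet for the mixture weights $\pi$ and Normal times Inverse-Wishart for the component parameters $\Theta$ [4, 5]:
$$\pi \mid \alpha \sim \mathrm{Dirichlet}\left(\frac{\alpha}{K}, \dots, \frac{\alpha}{K}\right), \qquad \Theta \sim G_0 \tag{5}$$
where $\Theta \sim G_0$ is shorthand for
$$\Sigma_k \sim \text{Inverse-Wishart}_{\upsilon_0}(\Lambda_0^{-1}), \qquad \mu_k \sim \mathcal{N}\left(\mu_0, \frac{\Sigma_k}{\kappa_0}\right) \tag{6}$$
where $\alpha$ controls how uniform the class mixture weights will be; the parameters $\Lambda_0^{-1}$, $\upsilon_0$, $\mu_0$, $\kappa_0$ encode the prior experience about the position and the shape of the mixture densities; the hyperparameters $\Lambda_0^{-1}$, $\upsilon_0$ affect the mixture density covariance; $\mu_0$ specifies the mean of the mixture densities; and $\kappa_0$ is the number of pseudo-observations [5].
With the model defined above, the posterior distribution over the class indicators $C = \{c_i\}_{i=1}^{N}$, the parameters $\Theta$, the weights $\pi$ and $\alpha$ can be represented as:
$$P(C, \Theta, \pi, \alpha \mid X) \propto P(X \mid C, \Theta)\, P(\Theta \mid G_0) \prod_{i=1}^{N} P(c_i \mid \pi)\, P(\pi \mid \alpha)\, P(\alpha) \tag{7}$$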
With some manipulations, Eq. 7 can be solved using Gibbs sampling [11] and the Chinese restaurant process [6]. Details can be found in [21, 5].
Through the IGMM clustering, the number of active clusters is estimated and each segment is labelled with its corresponding cluster through hard assignment. These labelled segments are the training samples for the ConvNets in the framework.
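To give some intuition for why the number of clusters need not be fixed in advance, the sketch below draws cluster labels from the Chinese restaurant process prior alone. It is not the full IGMM sampler: in a Gibbs sweep for the mixture model, each of these prior probabilities would additionally be weighted by how well the segment's HOD fits the corresponding Gaussian component, and that likelihood term is omitted here. The function names and the fixed random seed are assumptions of this example.

```python
import random


def crp_probabilities(cluster_sizes, alpha):
    """Prior probability of joining each existing cluster, or (last entry) a new one."""
    total = sum(cluster_sizes) + alpha
    return [n / total for n in cluster_sizes] + [alpha / total]


def sample_crp_labels(n_items, alpha, seed=0):
    """Sequentially assign n_items to clusters using the CRP prior only."""
    rng = random.Random(seed)
    sizes, labels = [], []
    for _ in range(n_items):
        probs = crp_probabilities(sizes, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(sizes):  # the item opens a new cluster
            sizes.append(0)
        sizes[k] += 1
        labels.append(k)
    return labels, sizes
```

A larger `alpha` makes opening a new cluster more likely, so the number of occupied clusters grows with both `alpha` and the number of items rather than being set by hand.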
3.4 Pseudo-Color Coding of DMMs & ConvNets Training
DMMs and Pseudo-Color Coding. Since skeleton data are often noisy, ConvNets are trained on the DMMs [27] of the training segments labelled by the IGMM clustering, so that segments can be classified reliably. Traditional DMMs [27] are computed by adding all the absolute differences between consecutive frames in the projected planes, which dilutes the temporal information. After segmentation, the segments are likely to include more action pairs, such as hands up and hands down. To distinguish the action pairs, we stack the motion energy within a segment with a special treatment of the first frame, as described in Eq. 8:
$$\mathrm{DMM}_v = map_v^{1} + \sum_{k=1}^{M-1} \left| map_v^{k+1} - map_v^{k} \right| \tag{8}$$
where $v \in \{f, s, t\}$ denotes the three projected views in the corresponding orthogonal Cartesian planes, $map_v^{k}$ is the projection of the $k$-th frame under the projection view $v$, and $M$ is the number of frames in the segment. In this way, temporal information which denotes the human body posture at the beginning of the action is captured together with the accumulated motion information, as shown in Fig. 2(a)(b). In order to better exploit the motion patterns, DMMs are pseudo-colored into color images in the same way as in [18]. In particular, the pseudo-color coding is done through the following hand-designed rainbow transform:
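$$C_i = \left\{ \sin\left[ 2\pi \cdot (-I + \varphi_i) \right] \cdot \frac{1}{2} + \frac{1}{2} \right\} \cdot f(I), \qquad i = 1, 2, 3 \tag{9}$$
where $C_i$ represents the BGR channels, respectively; $I$ is the normalized gray value; $\varphi_i$ denotes the phase of the three channels; $f(I)$ is an amplitude modulation function which further increases the non-linearity; and the added values $\frac{1}{2}$ guarantee non-negativity. We use the same parameter settings as [18].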
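A rainbow transform of this kind can be sketched as follows, assuming a sinusoidal form in which each channel is a phase-shifted sine of the normalized gray value, offset by 1/2 and scaled by the amplitude modulation function. The phases in `PHASES` and the linear function `amplitude` are hypothetical placeholders chosen only to make the example run; they are not the parameter settings of [18].

```python
import math

# Hypothetical channel phases (B, G, R); placeholders, not the values of [18].
PHASES = (0.0, 1.0 / 3.0, 2.0 / 3.0)


def amplitude(gray):
    """Hypothetical amplitude modulation function f(I), increasing in I."""
    return 0.25 + 0.75 * gray


def pseudo_color(gray, phases=PHASES):
    """Map a normalized gray value in [0, 1] to three channel values in [0, 1]."""
    return tuple(
        (math.sin(2 * math.pi * (-gray + phi)) * 0.5 + 0.5) * amplitude(gray)
        for phi in phases
    )


def colorize(dmm):
    """Normalize a 2D motion map to [0, 1] and pseudo-color every pixel."""
    lo = min(min(row) for row in dmm)
    hi = max(max(row) for row in dmm)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[pseudo_color((v - lo) / span) for v in row] for row in dmm]
```

Because the sine term is shifted into [0, 1] before being multiplied by a non-negative amplitude, every channel value stays non-negative, which is the role of the added 1/2 described above.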
From Fig. 2 it can be seen that the two simple pair actions are almost the same in the traditional DMMs, but more discriminative in the modified DMMs. The temporal information in the motion maps is further enhanced through the pseudo-coloring method.
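The effect on action pairs can be reproduced on a toy example. The sketch below assumes that the special treatment of the first frame amounts to adding its projection to the accumulated absolute differences; this is one reading of the description above, and the function names and the tiny 3x1 "projections" are purely illustrative.

```python
def _abs_diff(a, b):
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def _add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def traditional_dmm(maps):
    """Sum of absolute differences between consecutive projected frames."""
    dmm = [[0] * len(row) for row in maps[0]]
    for prev, curr in zip(maps, maps[1:]):
        dmm = _add(dmm, _abs_diff(curr, prev))
    return dmm


def modified_dmm(maps):
    """Assumed variant: first-frame projection plus the accumulated motion energy."""
    return _add(maps[0], traditional_dmm(maps))


# A blob moving upwards ("hands up") and the reversed sequence ("hands down").
hands_up = [[[0], [0], [1]], [[0], [1], [0]], [[1], [0], [0]]]
hands_down = hands_up[::-1]
```

On this toy pair the traditional DMMs coincide, since absolute differences ignore the direction of time, whereas the variant with the first frame added yields two different maps.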
ConvNets Training and Classification. Three ConvNets are trained on the pseudo-color coded DMMs constructed from the video segments in the three Cartesian planes. The layer configuration of the three ConvNets is the same as the one in [9]. The implementation is derived from the publicly available Caffe toolbox [8] and runs on one NVIDIA GeForce GTX TITAN X card; the models pre-trained on ImageNet [9] are used for initialization in training.
For a testing action sample, only the original skeleton sequences without rotation are used to extract key frames for segmentation. Three DMMs are constructed from each segment in the three Cartesian planes as input to the ConvNets, and the averages of the outputs from the three ConvNets are computed to label the testing video segments. The sequence of labels serves as input to the subsequent HMM-SVM classifier.
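The late fusion at test time can be sketched as below, assuming each ConvNet returns one score per segment cluster. The function names and the score values in the usage note are hypothetical; only the averaging and the choice of the highest averaged score follow the description above.

```python
def label_segment(view_scores):
    """Average the score vectors of the three views and return (label, averages)."""
    n_views = len(view_scores)
    avg = [sum(col) / n_views for col in zip(*view_scores)]
    label = max(range(len(avg)), key=avg.__getitem__)
    return label, avg


def encode_sequence(per_segment_scores):
    """Turn the per-segment, per-view scores of one action sample into a label sequence."""
    return [label_segment(views)[0] for views in per_segment_scores]
```

For instance, with view scores `[0.7, 0.2, 0.1]`, `[0.3, 0.5, 0.2]` and `[0.2, 0.6, 0.2]`, the averaged scores favor the second cluster even though the first view alone would have chosen the first one.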
3.5 HMM-SVM for Classification
To effectively exploit the global temporal information, discrete HMMs are trained using the well-known Baum-Welch algorithm on the label sequences obtained from the ConvNets, one HMM per action. Specifically, the number of states is set to $K$, the number of clusters estimated from the IGMM, and each state can emit one of the $K$ symbols. The observation symbol probability distribution is set as:
For each action sample, the likelihoods of its label sequence being generated by each of the HMMs form a feature vector, which is the input to the SVM for refining the classification.
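The construction of this feature vector can be sketched with the scaled forward algorithm. The sketch assumes that each trained HMM is available as a triple of start probabilities, transition matrix and emission matrix (for example, the result of Baum-Welch training, which is not shown), and it uses log-likelihoods rather than raw likelihoods for numerical stability; both choices, like the function names, are assumptions of this example. The resulting vector would then be passed to an SVM.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """log P(obs | HMM) for a discrete HMM, via the scaled forward algorithm."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:  # the sequence is impossible under this model
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def hmm_feature_vector(obs, hmms):
    """One log-likelihood per action HMM; hmms is a list of (start, trans, emit)."""
    return [log_likelihood(obs, *hmm) for hmm in hmms]
```

A label sequence dominated by one symbol scores higher under an HMM whose states favor that symbol, so the position of the largest entry already hints at the action, and the SVM can refine the decision using the whole vector.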
References
 [1] S. Althloothi, M. H. Mahoor, X. Zhang, and R. M. Voyles. Human activity recognition using multi-features and multiple kernel learning. Pattern Recognition, 47(5):1800–1812, 2014.
 [2] V. Bloom, D. Makris, and V. Argyriou. G3D: A gaming action dataset and real time action recognition evaluation framework. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 7–12, 2012.
 [3] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1110–1118, 2015.
 [4] C. Fraley and A. E. Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24(2):155–181, 2007.
 [5] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis, volume 2. Taylor & Francis, 2014.
 [6] Z. Ghahramani and T. L. Griffiths. Infinite latent feature models and the Indian buffet process. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 475–482, 2005.

 [7] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban. Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pages 1351–1357, 2013.
 [8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM International Conference on Multimedia (ACM MM), pages 675–678, 2014.
 [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Annual Conference on Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
 [10] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 9–14, 2010.
 [11] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
 [12] O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 716–723, 2013.
 [13] L. Shao and L. Ji. Motion histogram analysis based key frame extraction for human action/activity representation. In Proc. Canadian Conference on Computer and Robot Vision (CRV), pages 88–92, 2009.
 [14] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1297–1304, 2011.
 [15] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3D skeletons as points in a lie group. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 588–595, 2014.
 [16] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1297, 2012.
 [17] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Learning actionlet ensemble for 3D human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(5):914–927, 2014.
 [18] P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, and P. O. Ogunbona. ConvNets-based action recognition from depth maps through virtual cameras and pseudocoloring. In Proc. ACM International Conference on Multimedia (ACM MM), pages 1119–1122, 2015.
 [19] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. Ogunbona. Action recognition from depth maps using deep convolutional neural networks. Human-Machine Systems, IEEE Transactions on, PP(99):1–12, 2015.
 [20] P. Wang, W. Li, P. Ogunbona, Z. Gao, and H. Zhang. Mining mid-level features for action recognition based on effective skeleton representation. In Proc. International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–8, 2014.
 [21] F. Wood, S. Goldwater, and M. J. Black. A nonparametric Bayesian approach to spike sorting. In Proc. International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), pages 1165–1168, 2006.
 [22] D. Wu and L. Shao. Deep dynamic neural networks for gesture segmentation and recognition. Neural Networks, 19(20):21, 2014.
 [23] L. Xia and J. Aggarwal. Spatiotemporal depth cuboid similarity feature for activity recognition using depth camera. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2834–2841, 2013.
 [24] L. Xia, C.C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3D joints. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 20–27, 2012.
 [25] X. Yang and Y. Tian. EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor. In Proc. International Workshop on Human Activity Understanding from 3D Data (HAU3D) (CVPRW), pages 14–19, 2012.
 [26] X. Yang and Y. Tian. Super normal vector for activity recognition using depth sequences. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 804–811, 2014.
 [27] X. Yang, C. Zhang, and Y. Tian. Recognizing actions using depth motion maps-based histograms of oriented gradients. In Proc. ACM International Conference on Multimedia (ACM MM), pages 1057–1060, 2012.
 [28] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2752–2759, 2013.