Combining ConvNets with Hand-Crafted Features for Action Recognition Based on an HMM-SVM Classifier

This paper proposes a new framework for RGB-D-based action recognition that takes advantage of hand-designed features from skeleton data and deeply learned features from depth maps, and effectively exploits both the local and global temporal information. Specifically, depth and skeleton data are first augmented for deep learning and to make the recognition insensitive to view variance. Secondly, depth sequences are segmented using hand-crafted features based on the motion histogram of skeleton joints to exploit the local temporal information. All training segments are clustered using an Infinite Gaussian Mixture Model (IGMM) through Bayesian estimation and labelled for training Convolutional Neural Networks (ConvNets) on the depth maps. Thus, a depth sequence can be reliably encoded into a sequence of segment labels. Finally, the sequence of labels is fed into a joint Hidden Markov Model and Support Vector Machine (HMM-SVM) classifier to explore the global temporal information for final recognition.



1 Introduction

Recognition of human actions from RGB-D (Red, Green, Blue and Depth) data has attracted increasing attention in computer vision in recent years due to the advantages of depth information over conventional RGB video, e.g. being insensitive to illumination changes and enabling reliable estimation of body silhouette and skeleton [14]. Since the first work of this type [10] was reported in 2010, many methods [16, 24, 2, 12] have been proposed based on specific hand-crafted feature descriptors extracted from depth and/or skeleton data, and many benchmark datasets have been created for evaluating algorithms. With the recent development of deep learning, a few methods [18, 19, 3] have been developed based on Convolutional Neural Networks (ConvNets) or Recurrent Neural Networks (RNNs). However, in most cases, either deeply learned or hand-crafted features have been employed. Few studies have reported on the advantages of using both kinds of features simultaneously.

This paper presents a novel framework that combines deeply learned features from the depth modality through ConvNets with hand-crafted features extracted from the skeleton modality. The framework overcomes the weakness of ConvNets being sensitive to global translation, rotation and scaling, and leverages the strength of skeleton-based features, e.g. Histogram of Oriented Displacement (HOD) [7], which are invariant to scale, speed and background clutter. In particular, depth and skeleton data are first augmented for deep learning and to make the recognition insensitive to view variance. Secondly, depth sequences are segmented using hand-crafted features based on the joint motion histogram to exploit the local temporal information. All training segments are clustered using an Infinite Gaussian Mixture Model (IGMM) through Bayesian estimation and labelled for training ConvNets on the depth maps. Thus, a depth sequence can be reliably encoded into a sequence of segment labels. Finally, the sequence of labels is fed into a joint Hidden Markov Model and Support Vector Machine (HMM-SVM) classifier to explore the global temporal information for final recognition.

The proposed framework demonstrates a novel way of effectively exploring the spatial-temporal (both local and global) information for action recognition. It has a number of advantages compared to conventional discriminative methods, in which temporal information is often either ignored or weakly encoded into a descriptor, and to generative methods, in which temporal information tends to be overemphasized, especially when the training data are not sufficient. Firstly, the use of skeleton data to divide video sequences into segments gives each segment consistent and similar movement and makes it, to some extent, semantically meaningful (though this is not the intention of this paper), since skeletons are relatively high-level information extracted from depth maps and each part of the skeleton has semantics. Secondly, the ConvNets trained over Depth Motion Maps (DMMs) of depth maps provide a reliable sequence of labels by considering both spatial and local temporal information encoded into the DMMs. Thirdly, the use of HMMs on the sequences of segment labels explores the global temporal information effectively, and the SVM classifier further exploits the discriminative power of the label sequences for final classification.

The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 presents the proposed framework.

2 Related Work

Human action recognition from RGB-D data has been extensively researched and much progress has been made since the seminal work [10]. One of the main advantages of depth data is that they can effectively capture 3D structural information. To date, many effective hand-crafted features have been proposed based on depth data, such as Action Graph (AG) [10], Depth Motion Maps (DMMs) [27], Histogram of Oriented 4D Normals (HON4D) [12], Depth Spatio-Temporal Interest Point (DSTIP) [23] and Super Normal Vector (SNV) [26]. Recent work [18] showed that features from depth maps can also be deeply learned using ConvNets.

Skeleton data, which are usually extracted from depth maps [14], provide a high-level representation of human motion. Many hand-crafted skeleton-based features have also been developed in the past, including EigenJoints [25], Moving Pose [28], Histogram of Oriented Displacement (HOD) [7], Frequent Local Parts (FLPs) [20] and Points in Lie Group (PLP) [15]. Recently, the work in [3] demonstrated that features from skeletons can also be directly learned by deep learning methods. However, skeleton data can be quite noisy, especially when occlusion exists and when the subjects are not in a standing position facing the RGB-D camera.

Joint use of both depth maps and skeleton data has also been attempted. Wang et al. [16] designed a 3D Local Occupancy Patterns (LOP) feature to describe the local depth appearance at joint locations to capture the information for subject-object interactions. In their subsequent work, Wang et al. [17] proposed an Actionlet Ensemble Model (AEM) which combines the LOP feature and the Temporal Pyramid Fourier (TPF) feature. Althloothi et al. [1] presented two sets of features extracted from depth maps and skeletons, which are fused at the kernel level using the Multiple Kernel Learning (MKL) technique. Wu and Shao [22] adopted Deep Belief Networks (DBN) and 3D Convolutional Neural Networks (3DCNN) for skeletal and depth data respectively to extract high-level spatial-temporal features.

3 Proposed Framework

Fig. 1 shows the block diagram of the proposed framework. It consists of five key components: data augmentation, to enlarge the training set by mimicking virtual cameras through rotating the viewpoints; segmentation, to divide sequences of depth maps into segments by extracting key frames from skeleton data so as to exploit the local temporal information; IGMM clustering, to label all training segments through clustering; ConvNets on DMMs, to classify segments reliably; and HMM-SVM, to model the global temporal information of actions and classify a sequence of segment labels into an action.

3.1 Data Augmentation

The main purposes of data augmentation are to address the issue of training ConvNets on a small dataset and to deal with view variations. The method presented in [18] is adopted in this paper, where depth data are augmented by rotating the 3D point cloud captured in the depth maps, together with the skeleton, to mimic virtual cameras at different viewpoints. More details can be found in [18].
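As a rough illustration of the virtual-camera idea, the sketch below rotates a set of 3D points (depth points or skeleton joints) about two axes. The function name `rotate_points`, the choice of axes, the rotation order and the sign conventions are assumptions made for this example only; the actual procedure is the one described in [18].

```python
import math


def rotate_points(points, theta_deg, beta_deg):
    """Rotate 3D points by theta about the y-axis, then by beta about the x-axis.

    points: list of (x, y, z) tuples; angles are given in degrees.
    Axis order and sign conventions are illustrative assumptions.
    """
    t = math.radians(theta_deg)
    b = math.radians(beta_deg)
    rotated = []
    for x, y, z in points:
        # Rotation about the y-axis (left/right change of viewpoint).
        x1 = x * math.cos(t) + z * math.sin(t)
        z1 = -x * math.sin(t) + z * math.cos(t)
        # Rotation about the x-axis (up/down change of viewpoint).
        y2 = y * math.cos(b) - z1 * math.sin(b)
        z2 = y * math.sin(b) + z1 * math.cos(b)
        rotated.append((x1, y2, z2))
    return rotated
```

Applying the same pair of angles to the depth points and to the skeleton joints of a sample keeps the two modalities aligned under the simulated viewpoint.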

Figure 1: The proposed action coding framework, where D represents Depth data while S denotes Skeleton data.

3.2 Segmentation

In order to exploit the local temporal information, depth and skeleton sequences are divided into segments such that motion across frames within each segment is similar and consistent. To this end, key frames are first extracted using inter-frame and intra-frame joint motion histogram analysis, a method similar to the one described in [13]. Compared with the optical flow used in [13], the joint motion histograms are insensitive to background motion. Specifically, skeleton data are first projected onto three orthogonal Cartesian planes. The motion vectors calculated between two frames for the corresponding joints in each plane are quantized by magnitude and orientation. Each combination of magnitude and orientation corresponds to a bin in the motion histogram. Given the number of joints $J$, the probability of the $k$-th bin in the histogram of one projection is given as:

$$p(k) = \frac{h(k)}{J} \tag{1}$$

where $h(k)$ denotes the count of the $k$-th bin. The final motion histogram is a concatenation of the histograms in the three projections. Thus, the entropy of motion vectors in this frame can be defined as:

$$\eta = -\sum_{k=1}^{k_{max}} p(k) \log_2 p(k) \tag{2}$$

where $k$ denotes the bin index and $k_{max}$ is the total number of bins in the histogram. Intuitively, a peaked motion histogram contains less motion information and thus produces a low entropy value; a flat, spread-out histogram includes more motion information and therefore yields a high entropy value. Then, following [13], intra-frame and inter-frame analyses (more details can be found in [13]) are used to extract key frames such that the extracted key frames contain complex and fast-changing motion and, thus, are salient with respect to their neighboring frames. The details of the algorithm are summarized in Algorithm 1.
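The histogram and entropy computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the number of magnitude and orientation bins, the `max_mag` clipping value and the order of the three projection planes are all assumptions chosen for the example.

```python
import math


def motion_histogram(prev_joints, curr_joints, n_mag=4, n_ori=8, max_mag=1.0):
    """Concatenated joint motion histogram over the xy, yz and xz projections.

    prev_joints, curr_joints: lists of (x, y, z) joint positions in two frames.
    Returns the bin probabilities h(k) / J of each projection, concatenated.
    Bin counts and max_mag are illustrative choices, not values from the paper.
    """
    num_joints = len(curr_joints)
    planes = [(0, 1), (1, 2), (0, 2)]  # coordinate pairs selecting each projection
    probs = []
    for a, b in planes:
        counts = [0] * (n_mag * n_ori)
        for p, c in zip(prev_joints, curr_joints):
            du, dv = c[a] - p[a], c[b] - p[b]
            mag = math.hypot(du, dv)
            ori = math.atan2(dv, du) % (2 * math.pi)
            # Quantize magnitude and orientation; clamp to the last bin.
            m_bin = min(int(mag / max_mag * n_mag), n_mag - 1)
            o_bin = min(int(ori / (2 * math.pi) * n_ori), n_ori - 1)
            counts[m_bin * n_ori + o_bin] += 1
        probs.extend(h / num_joints for h in counts)
    return probs


def motion_entropy(probs):
    """Entropy -sum p(k) log2 p(k) over all bins, skipping empty bins."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

When all joints move identically, every projection has a single occupied bin and the entropy is zero; joints moving in different directions spread the counts over several bins and raise the entropy.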

Algorithm 1 Key Frames Extraction
1: Select initial frames $F' = \{f'_1, \dots, f'_n\}$ from video $V$ by picking local maxima in the entropy curve calculated from the concatenated motion histogram, where $n$ denotes the number of initial frames extracted from video $V$;
2: Calculate the histogram intersection values between neighboring frames;
3: Weight the entropy values of $F'$ by the corresponding histogram intersection values;
4: Extract key frames $F = \{f_1, \dots, f_m\}$ by finding peaks in the weighted entropy curve, where $m$ denotes the number of key frames extracted from video $V$.

If $m$ key frames are extracted from an action sample, the whole video sequence can be divided into $m+1$ segments, with the key frames, together with the first and last frames of the sequence, being the beginning or ending frames of the segments.
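The peak-picking and segmentation steps can be sketched as below. The helper names `local_maxima` and `split_into_segments` are hypothetical, and the histogram-intersection weighting of [13] is deliberately left out. Letting adjacent segments share their boundary key frame is also an assumption of this example.

```python
def local_maxima(curve):
    """Indices whose value is strictly greater than both neighbours."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] > curve[i + 1]]


def split_into_segments(num_frames, key_frames):
    """Split frames 0..num_frames-1 at the given key-frame indices.

    Returns inclusive (start, end) pairs; the first and last frames of the
    sequence are always used as boundaries, and adjacent segments share the
    key frame that separates them (an illustrative choice).
    """
    bounds = sorted(set([0, num_frames - 1] + list(key_frames)))
    return list(zip(bounds[:-1], bounds[1:]))
```

For a 10-frame sequence with key frames at indices 3 and 6, `split_into_segments(10, [3, 6])` yields three segments, one more than the number of key frames.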

3.3 IGMM Clustering

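The segments of training samples are clustered using HOD [7] features extracted from the skeleton data, and these segments are then labelled and used to train ConvNets on DMMs constructed from the depth maps of the segments. Assume that all the action samples in one dataset are segmented into a total of $N$ video segments, and let $\mathcal{Y} = [\vec{y}_1, \dots, \vec{y}_N]$ be the HODs of these segments, where the dimension of the HOD is $d$ and $\vec{y}_i \in \mathbb{R}^d$.

In this paper, it is assumed that the HODs from all segments can be modeled by an IGMM [21]. A Bayesian approach is adopted to find the best model of $K$ Gaussian components that gives the maximum a posteriori (MAP) probability, where each Gaussian accounts for a distinct type or class of segments. Compared with traditional $k$-means, the IGMM dynamically estimates the number of clusters from the data. Mathematically, the model is specified with the following notation:

$$c_i \mid \vec{\pi} \sim \mathrm{Discrete}(\vec{\pi}), \qquad \vec{y}_i \mid c_i = k; \Theta \sim \mathcal{N}(\cdot \mid \theta_k) \tag{3}$$

where $\mathcal{C} = \{c_i\}_{i=1}^{N}$ indicates which class each HOD belongs to, $\Theta = \{\theta_k\}_{k=1}^{K}$, $\theta_k = \{\vec{\mu}_k, \Sigma_k\}$ are the distribution parameters of class $k$, and $\vec{\pi} = \{\pi_k\}_{k=1}^{K}$, $\pi_k = P(c_i = k)$ are the mixture weights. Here, the number of clusters $K$ is unknown; otherwise the complete data likelihood could be computed.

To address this problem, a fully Bayesian approach is adopted instead of the conventional maximum likelihood (ML) approach, where the relationship between the observed data and a model in Bayes's rule is:

$$P(\Theta \mid \mathcal{Y}) \propto P(\mathcal{Y} \mid \Theta)\, P(\Theta) \tag{4}$$

With respect to the conjugate priors for the model parameters, the same method as the one proposed in [21] is adopted, that is, Dirichlet for $\vec{\pi}$ and Normal times Inverse-Wishart for $\Theta$ [4] [5]:

$$\vec{\pi} \mid \alpha \sim \mathrm{Dirichlet}\left(\cdot \mid \tfrac{\alpha}{K}, \dots, \tfrac{\alpha}{K}\right), \qquad \Theta \sim G_0 \tag{5}$$

where $\Theta \sim G_0$ is shorthand for

$$\Sigma_k \sim \text{Inverse-Wishart}_{\upsilon_0}(\Lambda_0^{-1}), \qquad \vec{\mu}_k \sim \mathcal{N}(\vec{\mu}_0, \Sigma_k / \kappa_0) \tag{6}$$

where $\alpha$ controls how uniform the class mixture weights will be; the parameters $\Lambda_0^{-1}$, $\upsilon_0$, $\vec{\mu}_0$, $\kappa_0$ encode the prior experience about the position of the mixture densities and their shape; the hyper-parameters $\Lambda_0^{-1}$, $\upsilon_0$ affect the mixture density covariance; $\vec{\mu}_0$ specifies the mean of the mixture densities, and $\kappa_0$ is the number of pseudo-observations [5].

With the model defined above, the posterior distribution can be represented as:

$$P(\mathcal{C}, \Theta, \vec{\pi}, \alpha \mid \mathcal{Y}) \propto P(\mathcal{Y} \mid \mathcal{C}, \Theta)\, P(\Theta \mid G_0) \prod_{i=1}^{N} P(c_i \mid \vec{\pi})\, P(\vec{\pi} \mid \alpha)\, P(\alpha) \tag{7}$$

With some manipulations, Eq. 7 can be solved using Gibbs sampling [11] and the Chinese restaurant process [6]. Details can be found in [21, 5].

Through the IGMM clustering, the number of active clusters is estimated and each segment is labelled with its corresponding cluster through hard assignment. These labelled segments serve as the training samples for the ConvNets in the framework.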
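To give a feel for how the number of clusters can be left open, the sketch below draws cluster labels from a Chinese restaurant process prior. It illustrates the prior only: a full Gibbs sampler for the IGMM would additionally weight each choice by the Gaussian likelihood of the segment's HOD under each cluster, which is omitted here. The function name and the `alpha` and `seed` parameters are assumptions of this example.

```python
import random


def crp_assignments(num_items, alpha, seed=0):
    """Draw cluster labels for num_items from a Chinese restaurant process.

    Item i (0-based) joins existing cluster k with probability
    n_k / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha), where n_k is the current size of cluster k.
    """
    rng = random.Random(seed)
    labels, sizes = [], []
    for i in range(num_items):
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, n_k in enumerate(sizes):
            acc += n_k
            if r < acc:
                sizes[k] += 1
                labels.append(k)
                break
        else:
            # r fell in the final alpha-sized slice: open a new cluster.
            sizes.append(1)
            labels.append(len(sizes) - 1)
    return labels
```

Larger values of `alpha` make new clusters more likely, so the number of distinct labels grows with both the data size and the concentration parameter instead of being fixed in advance.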

3.4 Pseudo-Color Coding of DMMs & ConvNets Training

DMMs and Pseudo-Color Coding. Since skeleton data are often noisy, ConvNets are trained on DMMs [27] of segments for reliable classification, using the training segments labelled by the IGMM clustering. Traditional DMMs [27] are computed by adding all the absolute differences between consecutive frames in the projected planes, which dilutes the temporal information. After segmentation, it is likely that there are more action pairs, such as hands up and hands down, among the segments. To distinguish such action pairs, we stack the motion energy within a segment with a special treatment of the first frame, as described in Eq. 8:

$$DMM_v = map_v^1 + \sum_{k=2}^{T} \left| map_v^{k} - map_v^{k-1} \right| \tag{8}$$

where $v$ denotes the three projected views in the corresponding orthogonal Cartesian planes, $map_v^k$ is the projection of the $k$-th frame under the projection view $v$, and $T$ is the number of frames in the segment. In this way, temporal information denoting the human body posture at the beginning of the action is captured together with the accumulated motion information, as shown in Fig. 2(a)(b). In order to better exploit the motion patterns, DMMs are pseudo-colored into color images in the same way as in [18]. In particular, the pseudo-color coding is done through the following hand-designed rainbow transform:

$$C_{i=1,2,3} = \left\{ \sin\left[ 2\pi \cdot (-I + \varphi_i) \right] \cdot \frac{1}{2} + \frac{1}{2} \right\} \cdot f(I) \tag{9}$$

where $C_{i=1,2,3}$ represents the BGR channels, respectively; $I$ is the normalized gray value; $\varphi_i$ denotes the phase of the three channels; $f(I)$ is an amplitude modulation function which further increases the non-linearity; the added values, $\frac{1}{2}$, guarantee non-negativity. We use the same parameter settings as [18].
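A per-pixel version of the rainbow transform can be sketched as follows, assuming it has the form $C_i = \{\sin[2\pi(-I + \varphi_i)] \cdot \frac{1}{2} + \frac{1}{2}\} \cdot f(I)$. The default phases and the linear amplitude function used here are placeholders chosen for illustration; they are not the parameter settings of [18].

```python
import math


def rainbow_transform(gray, phases=(0.0, 1.0 / 3.0, 2.0 / 3.0)):
    """Map a normalized gray value in [0, 1] to a (B, G, R) triple in [0, 1].

    phases: one phase per channel (placeholder values, not those of [18]).
    """
    def amplitude(i):
        # Placeholder amplitude modulation f(I); assumed form for this sketch.
        return 0.25 + 0.75 * i

    return tuple(
        (math.sin(2 * math.pi * (-gray + phase)) * 0.5 + 0.5) * amplitude(gray)
        for phase in phases
    )
```

Because the sine term is shifted into [0, 1] before being scaled by the amplitude, every channel stays non-negative for any gray value in [0, 1].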

Figure 2: The first row of DMMs represents the action of waving a hand from right to left; the second row represents the action of waving a hand from left to right. (a) DMMs computed in the traditional way, (b) DMMs computed in the proposed way and (c) pseudo-color coded versions of (b).

From Fig. 2 it can be seen that the two actions of a simple pair are almost the same in the traditional DMMs, but more discriminative in the modified DMMs. The temporal information in the motion maps is further enhanced through the pseudo-coloring method.
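The difference between the two ways of computing DMMs can be reproduced on a toy example. The sketch below assumes that the modified DMM adds the first projected frame of the segment to the accumulated absolute frame differences; the function names and the nested-list image representation are choices made for this illustration.

```python
def accumulate_motion(maps, base):
    """Add absolute differences of consecutive frames onto a copy of base."""
    dmm = [row[:] for row in base]
    for prev, curr in zip(maps, maps[1:]):
        for r in range(len(dmm)):
            for c in range(len(dmm[r])):
                dmm[r][c] += abs(curr[r][c] - prev[r][c])
    return dmm


def traditional_dmm(maps):
    """Sum of absolute frame differences only (starts from zeros)."""
    zeros = [[0] * len(row) for row in maps[0]]
    return accumulate_motion(maps, zeros)


def segment_dmm(maps):
    """Assumed modified DMM: first projected frame plus accumulated motion."""
    return accumulate_motion(maps, maps[0])
```

For a two-frame sequence and its time-reversed copy, `traditional_dmm` returns identical maps, whereas `segment_dmm` differs because the starting posture is retained.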

ConvNets Training and Classification. Three ConvNets are trained on the pseudo-color coded DMMs constructed from the video segments in the three Cartesian planes. The layer configuration of the three ConvNets is the same as the one in [9]. The implementation is derived from the publicly available Caffe toolbox [8] and runs on one NVIDIA GeForce GTX TITAN X card; the models pre-trained over ImageNet [9] are used for initialization in training.

For a test action sample, only the original skeleton sequences without rotation are used to extract key frames for segmentation. Three DMMs are constructed from each segment in the three Cartesian planes as input to the ConvNets, and the outputs of the three ConvNets are averaged to label the test video segments. The sequence of labels serves as input to the subsequent HMM-SVM classifier.
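The averaging step can be sketched in a few lines. The function name `fuse_and_label` and the representation of each ConvNet output as a plain list of class scores are assumptions for this example.

```python
def fuse_and_label(scores_per_view):
    """Average per-view class-score vectors and return (label, averaged scores).

    scores_per_view: one score list per projection view, all of equal length.
    """
    num_views = len(scores_per_view)
    num_classes = len(scores_per_view[0])
    avg = [sum(view[k] for view in scores_per_view) / num_views
           for k in range(num_classes)]
    # The segment label is the class with the highest averaged score.
    label = max(range(num_classes), key=avg.__getitem__)
    return label, avg
```

Applying this to every segment of a test sample, in temporal order, produces the label sequence consumed by the classifier in the next subsection.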

3.5 HMM-SVM for Classification

To effectively exploit the global temporal information, discrete HMMs are trained using the well-known Baum-Welch algorithm from the label sequences obtained from the ConvNets, one HMM per action. Specifically, the number of states is set to $K$, the number of clusters estimated from the IGMM, and each state can emit one of the $K$ symbols. The observation symbol probability distribution is set as:

For each action sample, the likelihoods of it being generated by each of the HMMs form a feature vector that serves as the input to the SVM for refining the classification.
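The likelihood feature vector can be sketched with the scaled forward algorithm for a discrete HMM. This is an illustration only: HMM training (Baum-Welch) and the SVM are not shown, the function names are hypothetical, and using log-likelihoods rather than raw likelihoods is a choice made here for numerical stability.

```python
import math


def hmm_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a non-empty symbol sequence under a discrete HMM.

    obs: sequence of symbol indices; start[i], trans[i][j] and emit[i][k]
    are the initial, transition and emission probabilities.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                     for j in range(n)]
        # Rescale at every step to avoid underflow on long sequences.
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def likelihood_features(obs, hmms):
    """One log-likelihood per action HMM; hmms is a list of (start, trans, emit)."""
    return [hmm_log_likelihood(obs, *hmm) for hmm in hmms]
```

With one HMM per action, `likelihood_features` returns a vector whose length equals the number of actions, which can then be passed to any standard SVM implementation.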


References

  • [1] S. Althloothi, M. H. Mahoor, X. Zhang, and R. M. Voyles. Human activity recognition using multi-features and multiple kernel learning. Pattern Recognition, 47(5):1800–1812, 2014.
  • [2] V. Bloom, D. Makris, and V. Argyriou. G3D: A gaming action dataset and real time action recognition evaluation framework. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 7–12, 2012.
  • [3] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1110–1118, 2015.
  • [4] C. Fraley and A. E. Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24(2):155–181, 2007.
  • [5] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis, volume 2. Taylor & Francis, 2014.
  • [6] Z. Ghahramani and T. L. Griffiths. Infinite latent feature models and the indian buffet process. In Proc. Advances in neural information processing systems (NIPS), pages 475–482, 2005.
  • [7] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban. Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pages 1351–1357, 2013.
  • [8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM international conference on Multimedia (ACM MM), pages 675–678, 2014.
  • [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Annual Conference on Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
  • [10] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 9–14, 2010.
  • [11] R. M. Neal. Markov chain sampling methods for dirichlet process mixture models. Journal of computational and graphical statistics, 9(2):249–265, 2000.
  • [12] O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 716–723, 2013.
  • [13] L. Shao and L. Ji. Motion histogram analysis based key frame extraction for human action/activity representation. In Proc. Canadian Conference on Computer and Robot Vision (CRV), pages 88–92, 2009.
  • [14] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1297–1304, 2011.
  • [15] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3D skeletons as points in a lie group. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 588–595, 2014.
  • [16] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1297, 2012.
  • [17] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Learning actionlet ensemble for 3D human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(5):914–927, 2014.
  • [18] P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, and P. O. Ogunbona. Convnets-based action recognition from depth maps through virtual cameras and pseudocoloring. In Proc. ACM international conference on Multimedia (ACM MM), pages 1119–1122, 2015.
  • [19] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. Ogunbona. Action recognition from depth maps using deep convolutional neural networks. Human-Machine Systems, IEEE Transactions on, PP(99):1–12, 2015.
  • [20] P. Wang, W. Li, P. Ogunbona, Z. Gao, and H. Zhang. Mining mid-level features for action recognition based on effective skeleton representation. In Proc. International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–8, 2014.
  • [21] F. Wood, S. Goldwater, and M. J. Black. A non-parametric bayesian approach to spike sorting. In Proc. International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), pages 1165–1168, 2006.
  • [22] D. Wu and L. Shao. Deep dynamic neural networks for gesture segmentation and recognition. Neural Networks, 19(20):21, 2014.
  • [23] L. Xia and J. Aggarwal. Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2834–2841, 2013.
  • [24] L. Xia, C.-C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3D joints. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 20–27, 2012.
  • [25] X. Yang and Y. Tian. Eigenjoints-based action recognition using Naïve-Bayes-Nearest-Neighbor. In Proc. International Workshop on Human Activity Understanding from 3D Data (HAU3D) (CVPRW), pages 14–19, 2012.
  • [26] X. Yang and Y. Tian. Super normal vector for activity recognition using depth sequences. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 804–811, 2014.
  • [27] X. Yang, C. Zhang, and Y. Tian. Recognizing actions using depth motion maps-based histograms of oriented gradients. In Proc. ACM international conference on Multimedia (ACM MM), pages 1057–1060, 2012.
  • [28] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2752–2759, 2013.