Gesture and human action recognition from visual information is an active research topic in Computer Vision and Machine Learning. It has many potential applications, including human-computer interaction and robotics. Since the first work on action recognition from depth data captured by commodity depth sensors (e.g., Kinect) in 2010, many methods [2, 3, 4, 5, 6, 7, 8, 9] for action recognition have been proposed based on specific hand-crafted feature descriptors extracted from depth or skeleton data. With the recent development of deep learning, a few methods have also been developed based on Convolutional Neural Networks (ConvNets) [10, 11, 12, 13, 14] and Recurrent Neural Networks (RNNs) [15, 16, 17, 18]. However, most of the work on gesture/action recognition reported to date focuses on the classification of individual gestures or actions, assuming that instances of individual gestures and actions have been isolated or segmented from a video or a stream of depth maps/skeletons before classification. In continuous recognition, the input stream usually contains an unknown number of gestures or actions, in an unknown order and with unknown boundaries, so segmentation and recognition have to be solved at the same time.
There are three common approaches to continuous recognition. The first approach is to tackle the segmentation and classification of the gestures or actions separately and sequentially. The key advantages of this approach are that different features can be used for segmentation and classification and that existing classification methods can be leveraged. The disadvantages are that either segmentation or classification could become the bottleneck of the system and that the two cannot be optimized together. The second approach is to apply classification to a sliding temporal window and aggregate the window-based classifications to achieve the final segmentation and classification. One of the key challenges in this approach is the difficulty of setting the size of the sliding window, because the durations of different gestures or actions can vary significantly. The third approach is to perform the segmentation and classification simultaneously.
This paper adopts the first approach and focuses on robust classification using ConvNets that are insensitive to inaccurate segmentation of gestures. Specifically, individual gestures are first segmented from a stream of depth maps based on quantity of movement (QOM). For each segmented gesture, an Improved Depth Motion Map (IDMM) is constructed from its sequence of depth maps. ConvNets are used to learn the dynamic features from the IDMM for effective classification. Fig. 1 shows the framework of the proposed method.
The rest of this paper is organized as follows. Section II briefly reviews the related works on video segmentation and gesture/action recognition based on depth and deep learning. Details of the proposed method are presented in Section III. Experimental results on the dataset provided by the challenge are reported in Section IV, and Section V concludes the paper.
II Related Work
II-A Video Segmentation
Many methods have been proposed for segmenting individual gestures from a video. A popular and widely used method employs dynamic time warping (DTW) to decide the delimitation frames of individual gestures [20, 21, 22]. Difference images are first obtained by subtracting two consecutive grey-scale images, and each difference image is partitioned into a grid of cells. Each cell is then represented by the average value of the pixels within it. The matrix of cells in a difference image is flattened into a vector, called the motion feature, which is calculated for each frame in the video excluding the final frame. This results in a matrix of motion features for the video. The motion feature matrix is extracted from both the test video and the training video, which consists of multiple gestures. The two matrices are treated as two temporal sequences, with each motion feature serving as the feature vector at an instant of time. The distance between two feature vectors is defined as the negative Euclidean distance, and a matrix containing the DTW distances (measuring similarity between two temporal sequences) between the two sequences is then calculated and analysed by the Viterbi algorithm to segment the gestures.
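The motion-feature extraction described above can be illustrated with a short NumPy sketch. It is not the implementation used in [20, 21, 22]: the function name `motion_features`, the default grid size, the cropping of border pixels and the sign convention of the difference (later frame minus earlier frame) are all assumptions made for illustration.

```python
import numpy as np


def motion_features(frames, grid=(3, 3)):
    """Return one flattened grid-average vector per pair of consecutive frames.

    frames: array of shape (T, H, W) holding grey-scale images.
    grid:   (rows, cols) of the cell grid; the default is an assumed value.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Difference images; the final frame has no successor, so T-1 remain.
    diffs = frames[1:] - frames[:-1]
    rows, cols = grid
    t, h, w = diffs.shape
    # Crop so the image divides evenly into cells (an assumption), then
    # average the pixels inside each cell.
    ch, cw = h // rows, w // cols
    cropped = diffs[:, :ch * rows, :cw * cols]
    cells = cropped.reshape(t, rows, ch, cols, cw).mean(axis=(2, 4))
    # Flatten each grid of cell averages into a motion feature vector.
    return cells.reshape(t, rows * cols)
```

Each row of the returned matrix is the motion feature of one frame, so two videos yield two sequences of vectors that a DTW routine could then align.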
Another category of methods for segmenting gestures from a multi-gesture video is based on appearance. Under the general assumption that the start and end frames of adjacent gestures are similar, correlation coefficients and the K-nearest neighbour algorithm with histograms of oriented gradients (HOG) are used to identify the start and end frames of gestures. Jiang et al. proposed a method based on quantity of movement (QOM) that assumes the same start pose among different gestures. Candidate delimitation frames are chosen based on the global QOM. After a refining stage, which employs a sliding window to keep the frame with the minimum QOM in each windowing session, the remaining frames are taken as the start and end frames. This paper adopts the QOM-based method; its details are presented in Section III-A.
II-B Depth Based Action Recognition
With the availability of Microsoft Kinect sensors, researchers have developed methods for depth map-based action recognition. Li et al. sampled points from a depth map to obtain a bag of 3D points to encode spatial information and employed an expandable graphical model to encode temporal information. Yang et al. stacked differences between projected depth maps as a depth motion map (DMM) and then used HOG to extract relevant features from the DMM. This method transforms the problem of action recognition from spatio-temporal space to spatial space. Oreifej and Liu proposed a feature called Histogram of Oriented 4D Normals (HON4D), in which the surface normal is extended to 4D space and quantized by regular polychorons. Following this method, Yang and Tian clustered hypersurface normals to form the polynormal, which can be used to jointly capture local motion and geometry information. The Super Normal Vector (SNV) is generated by aggregating the low-level polynormals. Lu et al. proposed a fast binary range-sample feature based on a test statistic, carefully designing the sampling scheme to exclude most pixels that fall into the background and to incorporate spatio-temporal cues.
II-C Deep Learning Based Recognition
Existing deep learning approaches can generally be divided into four categories based on how the video is represented and fed to a deep neural network. The first category views a video either as a set of still images or as a short and smooth transition between similar frames, and each color channel of the images is fed to one channel of a ConvNet. Although obviously suboptimal, treating the video as a bag of static frames performs reasonably well. The second category represents a video as a volume and extends ConvNets to a third, temporal dimension [29, 30], replacing 2D filters with 3D equivalents. So far, this approach has produced little benefit, probably due to the lack of annotated training data. The third category treats a video as a sequence of images and feeds the sequence to an RNN [31, 15, 16, 17, 18]. An RNN is typically regarded as a set of memory cells that are sensitive to both short-term and long-term patterns. It parses the video frames sequentially and encodes the frame-level information in its memory. However, using RNNs did not give an improvement over temporal pooling of convolutional features or over hand-crafted features. The last category represents a video as one or multiple compact images and adopts available trained ConvNet architectures for fine-tuning [10, 11, 12, 32]. This category has achieved state-of-the-art action recognition results on many RGB and depth/skeleton datasets. The gesture classification proposed in this paper falls into the last category.
III Proposed Method
The proposed method consists of two major components, as illustrated in Fig. 1: video segmentation, and construction of an Improved Depth Motion Map (IDMM) from a sequence of depth maps as the input to ConvNets. Given a sequence of depth maps consisting of multiple gestures, the start and end frames of each gesture are identified based on quantity of movement (QOM). Then, one IDMM is constructed for each gesture segment by accumulating the absolute depth difference between the current frame and the start frame. The IDMM goes through a pseudo-color coding process to become a pseudo-RGB image that serves as the input to a ConvNet for classification. The main objective of pseudo-color coding is to enhance the motion pattern captured by the IDMM. In the rest of this section, video segmentation, construction of the IDMM, pseudo-color coding of the IDMM, and training of the ConvNets are explained in detail.
|Sets||# of labels||# of gestures||# of RGB videos||# of depth videos||# of subjects||label provided||temporal segment provided|
III-A Video Segmentation
Given a sequence of depth maps that contains multiple gestures, the start and end frames of each gesture are detected based on quantity of movement (QOM) by assuming that all gestures start from a similar pose, referred to as the neutral pose. The QOM between two frames is a binary image obtained by applying a step function $f(\cdot)$ pixel-wise to the difference image of the two depth maps; $f(\cdot)$ steps from 0 to 1 at the ad hoc threshold of 60. The global QOM of a frame at time $t$ is defined as the QOM between frame $t$ and the very first frame of the whole video sequence. A set of frame indices of candidate delimitation frames is initialised by choosing frames whose global QOM is lower than a threshold. The threshold is calculated by adding the mean to twice the standard deviation of the global QOMs extracted from the first and last portions of the average gesture sequence length, which is calculated from the training gestures. A sliding window is then used to refine the candidate set, and in each windowing session only the index of the frame with the minimum global QOM is retained. After the refinement, the remaining frames are expected to be the delimiting frames of gestures.
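The QOM-based segmentation steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the step function is applied to the absolute depth difference, the binary QOM image is reduced to a scalar by counting its non-zero pixels, and a "windowing session" is interpreted as a run of candidate indices spanning fewer than `window` frames. The function names and the `window` and `limit` parameters are hypothetical.

```python
import numpy as np


def global_qom(frames, threshold=60):
    """Scalar global QOM per frame, measured against the very first frame."""
    frames = np.asarray(frames, dtype=np.int64)
    # Binary QOM image: step function at `threshold` on the absolute
    # difference (taking the absolute value is an assumption).
    moved = np.abs(frames - frames[0]) >= threshold
    # Counting moved pixels to get one number per frame is an assumption.
    return moved.reshape(len(frames), -1).sum(axis=1)


def qom_limit(samples):
    """Candidate threshold: mean plus twice the standard deviation."""
    return float(np.mean(samples) + 2 * np.std(samples))


def candidate_frames(qom, limit):
    """Indices of frames whose global QOM is below the threshold."""
    return [i for i, q in enumerate(qom) if q < limit]


def refine(candidates, qom, window):
    """Keep only the minimum-QOM candidate within each windowing session."""
    kept, group = [], []
    for idx in candidates:
        if group and idx - group[0] >= window:
            kept.append(min(group, key=lambda i: qom[i]))
            group = []
        group.append(idx)
    if group:
        kept.append(min(group, key=lambda i: qom[i]))
    return kept
```

Consecutive indices returned by `refine` would then be paired up as the start and end frames of individual gesture segments.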
III-B Construction of IDMM
Unlike the Depth Motion Map (DMM), which is calculated by accumulating the thresholded difference between consecutive frames, the IDMM is constructed with two extensions. First, the motion energy is calculated by accumulating the absolute difference between the current frame and the neutral pose frame. This measures slow motion better than the original DMM. Second, to preserve subtle motion information, the motion energy is calculated without thresholding. The calculation of the IDMM can be expressed as:

$$\mathrm{IDMM} = \sum_{i=1}^{N} \left| \mathrm{depth}^{i} - \mathrm{depth}^{1} \right|$$

where $i$ denotes the index of the frame and $N$ represents the total number of frames in the segmented gesture. For simplicity, the first frame of each segment is considered as the neutral frame.
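The accumulation described above, which sums the absolute difference between every frame and the first (neutral) frame of the segment without thresholding, can be sketched in a few lines of NumPy. The function name `idmm` and the (N, H, W) input layout are assumptions for illustration.

```python
import numpy as np


def idmm(segment):
    """Improved Depth Motion Map of one segmented gesture.

    segment: array of shape (N, H, W) holding the depth maps of the segment;
             the first frame is treated as the neutral pose frame.
    """
    segment = np.asarray(segment, dtype=np.float64)
    # Absolute difference to the first frame, accumulated over all frames
    # without thresholding.
    return np.abs(segment - segment[0]).sum(axis=0)
```

Casting to floating point before subtracting avoids wrap-around when the depth maps are stored as unsigned integers.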
Abidi et al. used color coding to harness the perceptual capabilities of the human visual system and extract more information from gray images; the detailed texture patterns in the image are significantly enhanced. Motivated by this, this paper codes an IDMM into a pseudo-color image to exploit and enhance the texture in the IDMM that corresponds to the motion patterns of actions. In this work, a power rainbow transform is adopted. For a given intensity $I$, the power rainbow transform encodes it into a normalized color $(R, G, B)$ as follows:
where $R$, $G$ and $B$ are the normalized RGB values obtained through the power rainbow transform. To code an IDMM, a linear mapping is applied per image to convert the IDMM values to the intensity range of the transform.
The resulting IDMMs are illustrated in Fig. 2.
III-D Network Training & Classification
One ConvNet is trained on the pseudo-color coded IDMMs. The layer configuration of the ConvNet follows that of Krizhevsky et al. The ConvNet contains eight layers: the first five are convolutional layers and the remaining three are fully-connected layers. The implementation is derived from the publicly available Caffe toolbox and runs on one NVIDIA Tesla K40 GPU card.
The training procedure is similar to that of Krizhevsky et al. The network weights are learned using mini-batch stochastic gradient descent with the momentum set to 0.9 and the weight decay set to 0.0005. All hidden weight layers use the rectification (ReLU) activation function. At each iteration, a mini-batch of 256 samples is constructed by sampling 256 shuffled training color-coded IDMMs. All color-coded IDMMs are resized to a fixed size. Fine-tuning starts from models pre-trained on ILSVRC-2012 with a set initial learning rate, which is then decreased according to a fixed schedule that is kept the same for all training sets. For the ConvNet, training runs for 20K iterations and the learning rate decreases every 5K iterations. For all experiments, the dropout regularisation ratio was set to 0.5 in order to reduce complex co-adaptations of neurons in the nets.
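The optimisation settings above can be made concrete with a small sketch. The momentum (0.9), weight decay (0.0005), 20K iterations and 5K step interval come from the text; the decay factor `gamma=0.1`, the base learning rate passed by the caller, and the exact form of the update (L2 weight decay folded into the gradient, a common formulation of SGD with momentum) are assumptions, not details confirmed by the paper.

```python
def step_learning_rate(iteration, base_lr, step_size=5000, gamma=0.1):
    """Step schedule: multiply the rate by `gamma` every `step_size` iterations.

    `gamma` and `base_lr` are hypothetical values chosen by the caller.
    """
    return base_lr * gamma ** (iteration // step_size)


def sgd_momentum_step(weight, velocity, grad, lr,
                      momentum=0.9, weight_decay=0.0005):
    """One SGD-with-momentum update for a single scalar weight.

    Weight decay is added to the gradient as an L2 penalty (an assumption
    about the exact update rule).
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity
```

With a step size of 5K, a 20K-iteration run passes through four learning-rate levels, the last of which is `base_lr * gamma ** 3`.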
Given a test depth sequence, a pseudo-colored IDMM is constructed for each segmented gesture and the trained ConvNet is used to predict the gesture label of the segment.
In this section, the Large-scale Continuous Gesture Recognition Dataset of the ChaLearn LAP challenge 2016 (ChaLearn LAP ConGD Dataset) and the evaluation protocol are described. The experimental results of the proposed method on this dataset are reported and compared with the baselines recommended by the challenge organisers.
The ChaLearn LAP ConGD Dataset is derived from the ChaLearn Gesture Dataset (CGD) . It has 47933 RGB-D gesture instances in 22535 RGB-D gesture videos. Each RGB-D video may contain one or more gestures. There are 249 gestures performed by 21 different individuals. The detailed information of this dataset is shown in Table I. In this paper, only depth data was used in the proposed method. Some samples of depth maps are shown in Fig. 3.
IV-B Evaluation Protocol
The dataset was divided into training, validation and test sets by the challenge organizers. All three sets include data from different subjects and the gestures of one subject in validation and test sets do not appear in the training set.
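The Jaccard index (the higher the better) is adopted to measure the performance. The Jaccard index measures the average relative overlap between true and predicted sequences of frames for a given gesture. For a sequence $s$, let $G_{s,i}$ and $P_{s,i}$ be binary indicator vectors for which 1-values correspond to frames in which the $i$-th gesture label is being performed. The Jaccard index for the $i$-th class is defined for the sequence $s$ as:

$$J_{s,i} = \frac{|G_{s,i} \cap P_{s,i}|}{|G_{s,i} \cup P_{s,i}|}$$

where $G_{s,i}$ is the ground truth of the $i$-th gesture label in sequence $s$, and $P_{s,i}$ is the prediction for the $i$-th label in sequence $s$.
When $G_{s,i}$ and $P_{s,i}$ are empty, $J_{s,i}$ is defined to be 0. Then for the sequence $s$ with $l_s$ true labels, the Jaccard index $J_s$ is calculated as:

$$J_s = \frac{1}{l_s} \sum_{i=1}^{L} J_{s,i}$$

where $L$ is the number of gesture labels. For all testing sequences $S = \{s_1, \ldots, s_n\}$ with $n$ gestures, the mean Jaccard index $\overline{J_S}$ is used as the evaluation criterion and calculated as:

$$\overline{J_S} = \frac{1}{n} \sum_{j=1}^{n} J_{s_j}$$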
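The evaluation measure described above can be sketched in plain Python. Here each sequence is represented as a dictionary mapping a gesture label to the set of frame indices in which that label is active, which is an assumed representation chosen for illustration; the function names are likewise hypothetical, and this is not the official challenge scoring script.

```python
def jaccard_for_label(truth_frames, pred_frames):
    """Overlap of true and predicted frames for one label; 0 if both empty."""
    truth, pred = set(truth_frames), set(pred_frames)
    union = truth | pred
    if not union:
        return 0.0
    return len(truth & pred) / len(union)


def jaccard_for_sequence(truth, pred):
    """Sum per-label overlaps and divide by the number of true labels.

    truth, pred: dicts mapping label -> iterable of active frame indices.
    Labels predicted but absent from the ground truth contribute zero.
    """
    if not truth:
        return 0.0
    labels = set(truth) | set(pred)
    total = sum(jaccard_for_label(truth.get(label, ()), pred.get(label, ()))
                for label in labels)
    return total / len(truth)


def mean_jaccard(sequences):
    """Mean of the per-sequence Jaccard indices over (truth, pred) pairs."""
    return sum(jaccard_for_sequence(t, p) for t, p in sequences) / len(sequences)
```

Because the per-sequence score is normalised by the number of true labels, a spurious predicted label cannot raise the score, while a missed label lowers it.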
IV-C Experimental Results
The results of the proposed method on the validation and test sets, and their comparison with the results of the baseline methods (MFSK and MFSK+DeepID), are shown in Table II. The code and models can be downloaded from the authors' homepage https://sites.google.com/site/pichaossites/
|Method||Set||Mean Jaccard Index|
The results show that the proposed method significantly outperformed the baseline methods, even though only a single modality, i.e. depth data, was used, while the baseline methods used both RGB and depth videos.
The results of the top three winners are summarized in Table III. Our method is among the top performers, and our recognition rate is very close to the best performance of this challenge (0.265506 vs. 0.269235 and 0.286915), even though only depth data was used in the proposed method. Regarding computational cost, our implementation is based on CUDA 7.5 and Matlab 2015b, and it takes about 0.8 s to process one depth sequence for testing on our workstation, which is equipped with an 8-core CPU, 64 GB of RAM, and a Tesla K40 GPU.
This paper presents an effective yet simple method for continuous gesture recognition using only depth map sequences. Depth sequences are first segmented so that each segment contains only one gesture, and a ConvNet is used for feature extraction and classification. The proposed construction of the IDMM enables the use of available pre-trained models for fine-tuning without learning afresh. Experimental results on the ChaLearn LAP ConGD Dataset verified the effectiveness of the proposed method. How to extract the neutral pose exactly and how to fuse different modalities to improve the accuracy will be our future work.
The authors would like to thank NVIDIA Corporation for the donation of a Tesla K40 GPU card used in this challenge.
-  W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 9–14.
-  J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1290–1297.
-  X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in Proc. ACM international conference on Multimedia (ACM MM), 2012, pp. 1057–1060.
-  O. Oreifej and Z. Liu, “HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 716–723.
-  X. Yang and Y. Tian, “Super normal vector for activity recognition using depth sequences,” in Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 804–811.
-  P. Wang, W. Li, P. Ogunbona, Z. Gao, and H. Zhang, “Mining mid-level features for action recognition based on effective skeleton representation,” in Proc. International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2014, pp. 1–8.
-  C. Lu, J. Jia, and C.-K. Tang, “Range-sample depth feature for action recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 772–779.
-  R. Vemulapalli and R. Chellappa, “Rolling rotations for recognizing human actions from 3d skeletal data,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1–9.
-  J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang, “Rgb-d-based action recognition datasets: A survey,” Pattern Recognition, vol. 60, pp. 86–105, 2016.
-  P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, and P. O. Ogunbona, “Convnets-based action recognition from depth maps through virtual cameras and pseudocoloring,” in Proc. ACM international conference on Multimedia (ACM MM), 2015, pp. 1119–1122.
-  P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. Ogunbona, “Action recognition from depth maps using deep convolutional neural networks,” Human-Machine Systems, IEEE Transactions on, vol. 46, no. 4, pp. 498–509, 2016.
-  P. Wang, Z. Li, Y. Hou, and W. Li, “Action recognition based on joint trajectory maps using convolutional neural networks,” in Proc. ACM international conference on Multimedia (ACM MM), 2016, pp. 1–5.
-  P. Wang, W. Li, S. Liu, Z. Gao, C. Tang, and P. Ogunbona, “Large-scale isolated gesture recognition using convolutional neural networks,” in Proceedings of ICPRW, 2016.
-  Y. Hou, Z. Li, P. Wang, and W. Li, “Skeleton optical spectra based action recognition using convolutional neural networks,” in Circuits and Systems for Video Technology, IEEE Transactions on, 2016, pp. 1–5.
-  Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1110–1118.
-  V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural networks for action recognition,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4041–4049.
-  W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, “Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks,” in The 30th AAAI Conference on Artificial Intelligence (AAAI), 2016.
-  A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+ D: A large scale dataset for 3D human activity analysis,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  F. Jiang, S. Zhang, S. Wu, Y. Gao, and D. Zhao, “Multi-layered gesture recognition with kinect,” Journal of Machine Learning Research, vol. 16, no. 2, pp. 227–254, 2015.
-  J. Wan, Q. Ruan, W. Li, and S. Deng, “One-shot learning gesture recognition from rgb-d data using bag of features,” Journal of Machine Learning Research, vol. 14, pp. 2549–2582, 2013.
-  J. Wan, V. Athitsos, P. Jangyodsuk, H. J. Escalante, Q. Ruan, and I. Guyon, “Csmmi: Class-specific maximization of mutual information for action and gesture recognition,” IEEE Transactions on Image Processing, vol. 23, no. 7, pp. 3152–3165, July 2014.
-  H. J. Escalante, I. Guyon, V. Athitsos, P. Jangyodsuk, and J. Wan, “Principal motion components for one-shot gesture recognition,” Pattern Analysis and Applications, pp. 1–16, 2015.
-  G. D. Forney, “The viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, March 1973.
-  Y. M. Lui, “Human gesture recognition on product manifolds,” Journal of Machine Learning Research, vol. 13, no. 1, pp. 3297–3321, Nov 2012.
-  D. Wu, F. Zhu, and L. Shao, “One shot learning gesture recognition from rgbd images,” in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June 2012, pp. 7–12.
-  W. Li, Z. Zhang, and Z. Liu, “Expandable data-driven graphical modeling of human actions based on salient postures,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 18, no. 11, pp. 1499–1510, 2008.
-  J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4694–4702.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proc. Annual Conference on Neural Information Processing Systems (NIPS), 2014, pp. 568–576.
-  S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 1, pp. 221–231, 2013.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2625–2634.
-  H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamic image networks for action recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  B. R. Abidi, Y. Zheng, A. V. Gribok, and M. A. Abidi, “Improving weapon detection in single energy X-ray images through pseudocoloring,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 36, no. 6, pp. 784–796, 2006.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Annual Conference on Neural Information Processing Systems (NIPS), 2012, pp. 1106–1114.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding.” in Proc. ACM international conference on Multimedia (ACM MM), 2014, pp. 675–678.
-  H. J. Escalante, V. Ponce-López, J. Wan, M. A. Riegler, B. Chen, A. Clapés, S. Escalera, I. Guyon, X. Baró, P. Halvorsen, H. Müller, and M. Larson, “Chalearn joint contest on multimedia challenges beyond visual analysis: An overview,” in Proceedings of ICPRW, 2016.
-  I. Guyon, V. Athitsos, P. Jangyodsuk, and H. J. Escalante, “The chalearn gesture dataset (CGD 2011),” Machine Vision and Applications, vol. 25, no. 8, pp. 1929–1951, 2014.
-  J. Wan, G. Guo, and S. Z. Li, “Explore efficient local features from rgb-d data for one-shot learning gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1626–1639, Aug 2016.
-  X. Chai, Z. Liu, F. Yin, Z. Liu, and X. Chen, “Two streams recurrent neural networks for large-scale continuous gesture recognition,” in Proceedings of ICPRW, 2016.
-  N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, “Using convolutional 3d neural networks for user-independent continuous gesture recognition,” in Proceedings of ICPRW, 2016.