Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

01/20/2015 ∙ by Pichao Wang, et al. ∙ University of Wollongong ∙ Tianjin University

Recently, deep learning approaches have achieved promising results in various fields of computer vision. In this paper, a new framework called Hierarchical Depth Motion Maps (HDMM) + 3 Channel Deep Convolutional Neural Networks (3ConvNets) is proposed for human action recognition using depth map sequences. Firstly, we rotate the original depth data in 3D pointclouds to mimic the rotation of cameras, so that our algorithms can handle view-variant cases. Secondly, in order to effectively extract the body shape and motion information, we generate weighted depth motion maps (DMM) at several temporal scales, referred to as Hierarchical Depth Motion Maps (HDMM). Then, three channels of ConvNets are trained separately on the HDMMs from three projected orthogonal planes. The proposed algorithms are evaluated on the MSRAction3D, MSRAction3DExt, UTKinect-Action and MSRDailyActivity3D datasets. We also combine the last three datasets into a larger one (called Combined Dataset) and test the proposed method on it. The results show that our approach achieves state-of-the-art results on the individual datasets without dramatic performance degradation on the Combined Dataset.




I Introduction

Human action recognition has been an active research topic in computer vision due to its wide range of applications, such as smart surveillance and human-computer interactions. In the past decades, research on action recognition mainly focused on recognising actions from conventional RGB videos.

In earlier video-based action recognition, most researchers aimed to design hand-crafted features and achieved significant progress. However, the evaluation conducted by Wang et al. [1] found that no single hand-engineered feature is universally best across all datasets.

Recently, the release of the Microsoft Kinect has brought new opportunities to this field. The Kinect device provides both depth maps and RGB images in real time at low cost. Depth maps have several advantages over traditional color images. For example, depth maps reflect pure geometry and shape cues, which can often be more discriminative than color and texture in many problems, including object segmentation and detection. Moreover, depth maps are insensitive to changes in lighting conditions. Based on depth data, many works [2, 3, 4, 5] have proposed specific feature descriptors that take advantage of the properties of depth maps. However, all of them are based on hand-crafted features, which are shallow high-dimensional descriptions of local or global spatio-temporal information, and their performance varies from dataset to dataset.

Deep Convolutional Neural Networks (ConvNets) have been demonstrated as an effective class of models for understanding image content, offering state-of-the-art results on image recognition, segmentation, detection and retrieval [6, 7, 8, 9]. With the success of ImageNet classification with ConvNets [10], many works take advantage of trained ImageNet models and achieve very promising performance on several tasks, from attribute classification [11] to image representations [12] to semantic segmentation [13]. The key enabling factors behind these successes are techniques for scaling up the networks to millions of parameters and massive labelled datasets that can support the learning process.

In this work, we propose to apply ConvNets to depth map sequences for action recognition. An architecture of Hierarchical Depth Motion Maps (HDMM) + 3 Channel Convolutional Neural Network (3ConvNets) is proposed. HDMM is a technique that transforms the problem of action recognition into image classification and artificially enlarges the training data. Specifically, to make our algorithms more robust to viewpoint variations, we directly process the 3D pointclouds and rotate the depth data into different views. To make full use of the additional body shape and motion information from depth sequences, each rotated depth frame is first projected onto three orthogonal Cartesian planes, and then, for each projection view, the absolute differences (motion energy) between consecutive and sub-sampled frames are accumulated through an entire depth video sequence. To weight the importance of the motion energy over time, a weighting factor is used so that the motion energy of recent poses counts more than that of past ones. Three HDMMs are constructed after the above steps and three ConvNets are trained on the HDMMs. The final classification scores are combined by late fusion of the three ConvNets.

We evaluate our method on the MSRAction3D, MSRAction3DExt, UTKinect-Action and MSRDailyActivity3D datasets individually and achieve results that are better than or comparable to the state-of-the-art. To further verify the robustness of our method, we combine the last three datasets into a single one and test the proposed method on it. The results show that our approach achieves consistent performance without much degradation on the combined dataset.

The main contributions of this paper can be summarized as follows. First, we propose a new architecture, namely HDMM + 3ConvNets, for depth-based action recognition, which achieves state-of-the-art results on four datasets. Second, our method can handle view-variant cases to some extent, owing to its simple and direct processing of 3D pointclouds. Last, a large dataset is generated by combining the existing ones to evaluate the stability of the proposed method, because the combined dataset contains large variations within actions and in background, viewpoint and number of samples of each action across the three datasets.

The remainder of this paper is organized as follows. Section 2 reviews related work on deep learning for 2D video action recognition and on action recognition using depth sequences. Section 3 describes the proposed architecture. Section 4 presents experimental results and analysis. Section 5 concludes the paper and discusses future work.

II Related Work

With the recent resurgence of neural networks invoked by Hinton and others [14], deep neural architectures have been used as an effective solution for extracting high-level features from data. There have been a number of attempts to apply deep architectures to 2D video recognition. In [15], spatio-temporal features are learned in an unsupervised manner by a Convolutional Restricted Boltzmann Machine (CRBM) and then plugged into a ConvNet for action recognition. In [16], a 3D convolutional network is used to automatically learn spatio-temporal features directly from raw data. Recently, several ConvNet architectures for action recognition were compared in [17] on the Sports-1M dataset, comprising 1.1 M YouTube videos of sports activities. The authors find that a network operating on individual video frames performs similarly to networks whose input is a stack of frames, which indicates that the learned spatio-temporal features do not capture motion effectively. In [18], spatial and temporal streams are proposed for action recognition: two ConvNets are trained on the two streams and combined by late fusion. The spatial stream consists of individual frames while the temporal stream consists of stacked optical flow. However, the best results of all the above deep learning methods can only match the state-of-the-art results achieved by hand-crafted features.

For depth-based action recognition, many works have been reported in the past few years. Li et al. [2] sample points from the silhouette of a depth image to obtain a bag of 3D points, which are clustered to enable recognition. Yang et al. [19] stack differences between projected depth maps as a DMM and then use HOG to extract features from the DMM. This method transforms the problem of action recognition from 3D space to 2D space. In [4], HON4D is proposed, in which the surface normal is extended to 4D space and quantized by regular polychorons. Following this method, Yang and Tian [5] cluster hypersurface normals to form the polynormal, which can be used to jointly capture local motion and geometry information. The Super Normal Vector (SNV) is generated by aggregating the low-level polynormals. However, all of these methods are based on carefully hand-designed features, which are restricted to specific datasets and applications.

Our work is inspired by [19] and [18]: we transform the problem of 3D action recognition into 2D image classification in order to take advantage of trained ImageNet models [10].

III HDMM + 3ConvNets

A depth map can be used to capture 3D structure and shape information. Projecting the differences between depth maps (DMM) onto three orthogonal Cartesian planes further characterizes the motion information of an action [19]. To make our method more robust to viewpoint variations, we directly process the 3D pointclouds and rotate the depth data into different views. In order to explore speed invariance and to weight the importance of motion energy along the time axis, sub-sampled and weighted HDMMs are generated from the rotated projected maps. Three deep ConvNets are trained on the three projected planes of the HDMMs. Late fusion is performed by combining the softmax class posteriors from the three nets. The overall framework is illustrated in Figure 1. Our algorithms can be divided into three modules: Rotation in 3D Pointclouds, Hierarchical DMM, and Network Training & Class Score Fusion.

Fig. 1: HDMM + 3ConvNets architecture for depth-based action recognition.

III-A Rotation

One of the challenges for action recognition is view invariance. To handle this problem, we rotate the depth data in 3D pointclouds, imitating the rotation of cameras around the subject as illustrated in Figure 2 (b), where the rotation is in the world coordinate system (Figure 2 (a)).

Fig. 2: Rotation in 3D Pointclouds.

Figure 2 (b) is the model for rotation of camera around the subject. Supposing camera moves from position to , it can be decomposed into two steps: first moves from to , with rotated angle denoted by and moves from to , with rotated angle denoted by . Then the coordinates of subject in rotated scene can be computed by Equation (1).


where denotes the rotation around y axis (right-handed coordinate system) while denotes the rotation around z axis and they are:

After rotation, the 3D pointclouds are projected to screen coordinates as illustrated in Figure 2 (a). In this way, the original depth data can be rotated to different angles, provided that the rotation does not cause too much information loss.
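As a rough illustration of this rotation step, the following Python sketch rotates a small pointcloud about the y axis and then about the z axis. The names `theta` and `beta`, the order in which the two rotations are composed, and the choice of the origin as rotation centre are assumptions made here for illustration; they are not taken from Equation (1).

```python
import math

def rot_y(theta):
    """Right-handed rotation matrix about the y axis (angle in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(beta):
    """Right-handed rotation matrix about the z axis (angle in radians)."""
    c, s = math.cos(beta), math.sin(beta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate_points(points, theta, beta):
    """Rotate (x, y, z) points about y by theta, then about z by beta.

    Assumption: rotation about the origin; the composition order is
    illustrative and may differ from the paper's Equation (1).
    """
    m = matmul(rot_z(beta), rot_y(theta))
    return [tuple(sum(m[i][k] * p[k] for k in range(3)) for i in range(3))
            for p in points]

cloud = [(0.0, 0.0, 1.0), (0.5, 0.2, 2.0)]
rotated = rotate_points(cloud, math.radians(15), math.radians(-10))
```

Each rotated pointcloud would then be re-projected to a depth image, so that one recorded sequence yields several synthetic viewpoints.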

III-B HDMM

In our work, each rotated 3D depth frame is projected to three orthogonal Cartesian planes, including front, side and top views, denoted by where . Different from [19], where the motion maps are calculated by accumulating the difference with threshold between consecutive frames, we process the depth maps with three additional steps. Firstly, in order to reserve subtle motion information, for example, page turning when reading books, for each projected map, the motion energy is calculated as the absolute difference between rotated consecutive or sub-sampled frames without thresholding. Secondly, to effectively exploit speed invariance and suppress noise, several temporal scales are generated , as illustrated in Figure 3.

Fig. 3: Illustration of hierarchical temporal scales.
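The projection onto front, side and top views can be sketched as follows. This is a minimal example under stated assumptions: depth values are integers in the range 1 to `depth_bins` (0 meaning background), and the side and top views are binary occupancy maps. The function name `project_three_views` and the parameter `depth_bins` are hypothetical, and the exact projection used by the authors may differ.

```python
import numpy as np

def project_three_views(depth, depth_bins):
    """Project an integer depth map (H x W) onto front, side and top planes.

    Assumption: pixel value 0 is background and valid depths lie in
    [1, depth_bins]; side and top views record binary occupancy.
    """
    h, w = depth.shape
    front = depth.astype(np.float32)                    # (y, x) plane
    side = np.zeros((h, depth_bins), dtype=np.float32)  # (y, z) plane
    top = np.zeros((depth_bins, w), dtype=np.float32)   # (z, x) plane
    ys, xs = np.nonzero(depth)
    zs = depth[ys, xs]
    side[ys, zs - 1] = 1.0
    top[zs - 1, xs] = 1.0
    return front, side, top

frame = np.zeros((4, 5), dtype=np.int64)
frame[1, 2] = 3          # one foreground pixel at depth 3
front, side, top = project_three_views(frame, depth_bins=8)
```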

For a depth video sequence with frames, is obtained by stacking the motion energy across an entire depth video sequence as follows:


where denotes the frame index under projection view of the whole depth video sequences and ; represents the sub-sampled frame in corresponding temporal scale ; and . Lastly, to weight the different importance of motion energy, a weighted HDMM is adopted as in Equation (3), making the motion energy more important for actions performed currently than past.


Through this simple process, paired actions whose motion differs only in temporal order, such as sit down and stand up, can be differentiated.
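The accumulation described above can be sketched in Python as follows. The recursive form used here (new map = update weight × absolute frame difference + decay weight × previous map) is an assumption standing in for Equations (2) and (3), and assigning the constants 0.99 and 1 from the experimental settings to `decay` and `update` is likewise a guess. Sub-sampling by keeping every n-th frame is also an illustrative choice.

```python
import numpy as np

def weighted_motion_map(frames, decay=0.99, update=1.0):
    """Accumulate un-thresholded absolute frame differences over time.

    Assumed recursion: acc <- update * |cur - prev| + decay * acc,
    so recent motion contributes more than earlier motion.
    """
    frames = np.asarray(frames, dtype=np.float32)
    acc = np.zeros(frames.shape[1:], dtype=np.float32)
    for prev, cur in zip(frames[:-1], frames[1:]):
        acc = update * np.abs(cur - prev) + decay * acc
    return acc

def hierarchical_maps(frames, scales=(1, 2, 3)):
    """One weighted motion map per temporal scale n (every n-th frame kept)."""
    return {n: weighted_motion_map(frames[::n]) for n in scales}

# Toy projected sequence: motion at pixel (0, 0) first, then at pixel (1, 1).
f0 = np.zeros((2, 2), dtype=np.float32)
f1 = f0.copy(); f1[0, 0] = 1.0
f2 = f1.copy(); f2[1, 1] = 1.0
forward = weighted_motion_map([f0, f1, f2])
backward = weighted_motion_map([f2, f1, f0])
```

In this toy case `forward` and `backward` differ because the earlier motion is decayed once more than the later motion, which is the property that separates an action from its time-reversed counterpart.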

After the above three steps, the rotated HDMMs are encoded into RGB images, with small values encoded to the R channel and large values to the B channel.
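One possible pseudo-colour encoding with this property is sketched below. The text only states that small values go to the R channel and large values to the B channel, so the linear ramps, the use of G for mid-range values, and the function name `encode_rgb` are all assumptions for illustration.

```python
import numpy as np

def encode_rgb(motion_map):
    """Map a 2D motion map to an 8-bit RGB image.

    Assumed colour ramp: small values -> red, mid values -> green,
    large values -> blue. The authors' actual mapping may differ.
    """
    m = np.asarray(motion_map, dtype=np.float32)
    span = float(m.max() - m.min())
    v = (m - m.min()) / span if span > 0 else np.zeros_like(m)
    rgb = np.stack([1.0 - v, 1.0 - np.abs(2.0 * v - 1.0), v], axis=-1)
    return np.round(rgb * 255).astype(np.uint8)

image = encode_rgb([[0.0, 5.0, 10.0]])   # shape (1, 3, 3)
```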

III-C Network Training & Class Score Fusion

After we construct the RGB images from depth motion maps, three ConvNets are trained on the images of the three projected planes. The layer configuration of our three ConvNets is schematically shown in Figure 1, following [10]: each net contains eight layers with weights, the first five convolutional and the remaining three fully-connected. Our implementation is derived from the publicly available Caffe toolbox [20] and runs on one NVIDIA GeForce GTX680M card.


The training procedure is similar to [10]: the network weights are learnt using mini-batch stochastic gradient descent with momentum set to 0.9 and weight decay set to 0.0005; all hidden weight layers use the rectification (ReLU) activation function; at each iteration, a mini-batch of 256 samples is constructed by sampling 256 shuffled training images; all the images are resized to 256 × 256; to artificially enlarge the training data (data augmentation), firstly, 224 × 224 patches are randomly cropped from the center of the selected image with a factor of 2048 data augmentation, and then each patch undergoes random horizontal flipping and RGB jittering; the learning rate is initially set to for directly training the networks from data without initialising the weights with pre-trained models on ILSVRC-2012, while it is set to for fine-tuning with pre-trained models on ILSVRC-2012, and then it is decreased according to a fixed schedule, which is kept the same for all training sets; for each ConvNet we train 100 cycles and decrease the learning rate every 20 cycles. For all experimental settings, we set the dropout regularisation ratio to 0.5 to reduce complex co-adaptations of neurons in nets.
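The following sketch illustrates three pieces of this procedure: the crop-and-flip augmentation, the momentum update with weight decay in the form given in [10], and a step schedule. One reading of the factor 2048 is 32 × 32 crop offsets times 2 reflections, as in [10]. The decay factor `0.1`, the base learning rate `0.01` in the usage line, and all function names are assumptions for illustration; RGB jittering is omitted.

```python
import numpy as np

def random_crop_flip(image, crop=224, rng=None):
    """Random crop plus random horizontal flip (image must exceed crop size)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop))    # 32 possible offsets for 256 -> 224
    left = int(rng.integers(0, w - crop))
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]              # horizontal reflection
    return patch

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """Momentum update with weight decay, in the form used in [10]."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

def step_lr(base_lr, cycle, step=20, factor=0.1):
    """Drop the learning rate every `step` cycles (factor is an assumption)."""
    return base_lr * factor ** (cycle // step)

augmentation_factor = (256 - 224) ** 2 * 2             # 32 * 32 * 2 = 2048
schedule = [step_lr(0.01, c) for c in range(0, 100, 20)]  # hypothetical base rate
```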

Class Score Fusion

At test time, given a depth video sequence (sample), we only use depth motion maps with temporal scaling but without rotation. For each channel of the 3ConvNets, the scores over the temporal scales are averaged to give the score of the test sample in that channel. The final class scores for a test sample are the averages of the outputs of the three ConvNets.
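The two-level averaging can be sketched as below, assuming each ConvNet returns one softmax posterior vector per temporal scale. The dictionary layout and the function names are hypothetical.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally long score vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fuse_scores(scores_by_view):
    """Average over temporal scales per view, then over the three views.

    scores_by_view maps a view name to a list of class-posterior vectors,
    one vector per temporal scale.
    """
    per_view = [mean_vectors(scale_scores)
                for scale_scores in scores_by_view.values()]
    fused = mean_vectors(per_view)
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label

# Two classes, two temporal scales per view (toy numbers).
scores = {
    "front": [[0.6, 0.4], [0.8, 0.2]],
    "side": [[0.3, 0.7], [0.1, 0.9]],
    "top": [[0.4, 0.6], [0.2, 0.8]],
}
fused, label = fuse_scores(scores)
```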

IV Experiments

In this section, we extensively evaluate our proposed framework on three public benchmark datasets: MSRAction3D [2], UTKinect-Action [21] and MSRDailyActivity3D [3]. Moreover, an extension of MSRAction3D, called the MSRAction3DExt Dataset, is used, which contains more subjects performing the same actions. In order to test the stability of the proposed method, a new dataset, referred to as the Combined Dataset, is formed by combining the last three datasets. In all experiments, for rotation, is set to and is set to ; for weighted HDMM, is set to 0.99 and is set to 1. Different temporal scales are set according to the noise level and the mean cycle of the actions performed in different datasets. Experimental results show that our method can outperform or match the state-of-the-art on individual datasets and maintain the accuracy on the Combined Dataset.

IV-A MSRAction3D Dataset

The MSRAction3D Dataset [2] is an action dataset of depth sequences captured by a depth camera. It contains 20 actions performed by 10 subjects facing the camera, with each subject performing each action 2 or 3 times. The 20 actions are: “high arm wave”, “horizontal arm wave”, “hammer”, “hand catch”, “forward punch”, “high throw”, “draw X”, “draw tick”, “draw circle”, “hand clap”, “two hand wave”, “side-boxing”, “bend”, “forward kick”, “side kick”, “jogging”, “tennis swing”, “tennis serve”, “golf swing”, “pick up & throw”.

In order to facilitate a fair comparison, the same experimental setting as in [3] is followed, namely the cross-subject setting: subjects 1, 3, 5, 7, 9 for training and subjects 2, 4, 6, 8, 10 for testing. For this dataset, we set temporal scale , and our method achieves 100% accuracy. Four scenarios are considered: (1) training on the primitive data set (without rotation and temporal scaling), (2) training on the data set after rotation, (3) pre-training on ILSVRC-2012 (pre-trained for short) followed by fine-tuning on the data set after rotation, (4) pre-training followed by fine-tuning on the primitive data set. The results for these settings are listed in Table 1.

Training Setting Accuracy (%)
Primitive 7.12%
Rotation 34.23%
Rotation + Pre-trained + Fine-tuning 100%
Primitive + Pre-trained + Fine-tuning 100%
TABLE I: Comparison on Different Training Settings for MSRAction3D Dataset.

From this table, we can see that pre-training on ILSVRC-2012 (initialising the networks with the weights trained for ImageNet) is important: the volume of training data is too small to train the millions of parameters of the deep networks without good initialisation, which leads to overfitting. If we directly train the networks on the primitive data, the performance is only slightly better than random guessing (7.12% against a chance level of about 5% for 20 classes). We compare the performance of HDMM + 3ConvNets with other results in Table 2.

Method Accuracy (%)
Bag of 3D Points [2] 74.70%
HOJ3D [21] 79.00%
Actionlet Ensemble [3] 82.22%
Depth Motion Maps [19] 88.73%
HON4D [4] 88.89%
Moving Pose [22] 91.70%
SNV [5] 93.09%
Proposed Method 100%
TABLE II: Recognition accuracy comparison of our method and previous approaches on MSRAction3D Dataset.

The proposed method outperforms all previous approaches. This is probably because (1) in MSRAction3D the subject can easily be segmented from the background just by thresholding the depth values, so the generated HDMMs contain little noise; and (2) pre-trained models initialise the image-based deep networks well.

IV-B MSRAction3DExt Dataset

The MSRAction3DExt Dataset is an extension of the MSRAction3D Dataset. It is captured with the same settings, with 13 additional subjects performing the same 20 actions 2 to 4 times. Thus, there are 20 actions, 23 subjects and 1379 video clips. As for MSRAction3D, we test our method on the same four scenarios and the results are listed in Table 3. For this dataset, we again adopt the cross-subject setting for training and testing, that is, odd subjects for training and even subjects for testing.

Training Setting Accuracy (%)
Primitive 10.00%
Rotation 53.05%
Rotation + Pre-trained + Fine-tuning 100%
Primitive + Pre-trained + Fine-tuning 100%
TABLE III: Comparison on Different Training Settings for MSRAction3DExt Dataset.

From Table 1 and Table 3 we can see that as the volume of the dataset increases, directly training the nets on the primitive and rotated data performs better (from 7.12% to 10.00% and from 34.23% to 53.05%, respectively). However, the performance of the trained models is still very poor if a model pre-trained on ImageNet is not used for initialization. Our method again achieves 100% using pre-trained + fine-tuning even though this dataset has more test samples and variations of actions. From the two sets of experiments, we can conclude that pre-trained + fine-tuning is very suitable for small datasets. In the following experiments, we no longer train our networks directly without pre-training, and all of the experiments adopt the pre-trained + fine-tuning setting.

We compare the performance of our method with SNV [5] in Table 4; our method outperforms this state-of-the-art result by a large margin.

Method Accuracy (%)
SNV [5] 90.54%
Proposed Method 100%
TABLE IV: Recognition accuracy comparison of our method and SNV on MSRAction3DExt Dataset.

IV-C UTKinect-Action Dataset

The UTKinect-Action Dataset [21] is captured using a stationary Kinect sensor. It consists of 10 actions: “walk”, “sit down”, “stand up”, “pick up”, “carry”, “throw”, “push”, “pull”, “wave” and “clap hands”. There are 10 different subjects and each subject performs each action twice. This dataset is designed to investigate variations in the view point.

For this dataset, we set temporal scale , to exploit more temporal information in actions. The cross-subject scheme of [23] is followed, which differs from that of [21], where more subjects were used for training in each round. We consider three scenarios for this dataset: (1) pre-trained + fine-tuning on the primitive data set; (2) pre-trained + fine-tuning on the data set after rotation; (3) pre-trained + fine-tuning on the data set after rotation and temporal scaling. The results are listed in Table 5.

Training Setting Accuracy (%)
Primitive + Pre-trained + Fine-tuning 82.83%
Rotation + Pre-trained + Fine-tuning 88.89%
Rotation + Scaling + Pre-trained + Fine-tuning 90.91%
TABLE V: Comparison on Different Training Settings for UTKinect-Action Dataset.

From Table 5 we can see that rotation yields an improvement of about 6 percentage points in accuracy (from 82.83% to 88.89%), which shows that the rotation step in our method improves the accuracy considerably. The confusion matrix for the final test is shown in Figure 4; the most confused actions are hand clap and wave, which share similar shapes of depth motion maps.

Fig. 4: The confusion matrix of proposed method for UTKinect-Action Dataset.

Table 6 compares the performance of our method with previous approaches on the UTKinect-Action Dataset; the proposed method outperforms methods specially designed for view-variant cases.

Method Accuracy (%)
DSTIP+DCSF [23] 85.8%
Random Forests [24] 87.90%
SNV [5] 88.89%
Proposed Method 90.91%
TABLE VI: Recognition accuracy comparison of our method and previous approaches on UTKinect-Action Dataset.

IV-D MSRDailyActivity3D Dataset

The MSRDailyActivity3D Dataset [3] is a daily activity dataset of depth sequences captured by a depth camera. There are 16 activities: “drink”, “eat”, “read book”, “call cellphone”, “write on paper”, “use laptop”, “use vacuum cleaner”, “cheer up”, “sit still”, “toss paper”, “play game”, “lay down on sofa”, “walking”, “play guitar”, “stand up” and “sit down”. There are 10 subjects and each subject performs each activity twice, once in a standing position and once in a sitting position. Compared to the MSRAction3D(Ext) and UTKinect-Action datasets, actors in this dataset present large spatial and scaling changes. Moreover, most activities in this dataset involve human-object interactions.

For this dataset, we set temporal scale , a larger number of scales, to exploit more temporal information and suppress the high level of noise in this dataset. We follow the same experimental setting as [3] and obtain a final accuracy of 81.88%. Three scenarios are considered for this dataset: (1) pre-trained + fine-tuning on the primitive data set; (2) pre-trained + fine-tuning on the data set after temporal scaling; (3) pre-trained + fine-tuning on the data set after temporal scaling and rotation. The results are listed in Table 7.

Training Setting Accuracy (%)
Primitive + Pre-trained + Fine-tuning 46.25%
Scaling + Pre-trained + Fine-tuning 75.62%
Scaling + Rotation + Pre-trained + Fine-tuning 81.88%
TABLE VII: Comparison on Different Training Settings for MSRDailyActivity3D Dataset.

The performance of our method compared to the previous approaches is shown in Table 8 and the confusion matrix is shown in Figure 5.

Method Accuracy (%)
LOP [3] 42.50%
Depth Motion Maps [19] 43.13%
Joint Position [3] 68.00%
Moving Pose [22] 73.8%
Local HON4D [4] 80.00%
Actionlet Ensemble [3] 85.75%
SNV [5] 86.25%
Proposed Method 81.88%
TABLE VIII: Recognition accuracy comparison of our method and previous approaches on MSRDailyActivity3D Dataset.
Fig. 5: The confusion matrix of proposed method for MSRDailyActivity Dataset.

From Table 8 we can see that our proposed method outperforms DMM [19] by a large margin but is only comparable to the state-of-the-art methods. The probable reasons are: (1) the background of this dataset is more complicated than that of MSRAction3D (Ext), yet we only pre-process by thresholding the depth values as for MSRAction3D (Ext), which causes a lot of noise in the HDMMs; (2) many actions are similar, such as call cellphone, drink and eat: they have similar motion shapes and differ only in subtle motion energies, so their HDMMs are very similar and easily confused.

IV-E Combined Dataset

The Combined Dataset consists of the MSRAction3DExt, UTKinect-Action and MSRDailyActivity3D datasets and is used to test the scalability of the proposed method. It is very challenging due to its large variations in backgrounds, within actions, in view points and in the number of samples of each action. The same actions in different datasets are combined into one action and rewritten into the same file format (.bin) and filename format (axxx_sxxx_exxx_depth.bin, where a, s and e represent class ID, subject ID and example ID respectively, and xxx represents the corresponding number). The combined actions and corresponding information are listed in Table 9.

1 s001-s023 A 21 s001-s020 U&D
2 s001-s023 A 22 s001-s020 U&D
3 s001-s023 A 23 s001-s020 U&D
4 s001-s023 A 24 s001-s010 U
5 s001-s023 A 25 s001-s010 U
A&U 26 s001-s010 U
7 s001-s023 A 27 s001-s010 U
8 s001-s023 A 28 s001-s010 D
9 s001-s023 A 29 s001-s010 D
A&U 30 s001-s010 D
A&U 31 s001-s010 D
12 s001-s023 A 32 s001-s010 D
13 s001-s023 A 33 s001-s010 D
14 s001-s023 A 34 s001-s010 D
15 s001-s023 A 35 s001-s010 D
16 s001-s023 A 36 s001-s010 D
17 s001-s023 A 37 s001-s010 D
18 s001-s023 A 38 s001-s010 D
19 s001-s023 A 39 s001-s010 D
20 s001-s023 A 40 s001-s010 D
TABLE IX: Information on Combined Dataset: A stands for MSRAction3DExt Dataset; U stands for UTKinect-Action Dataset; D stands for MSRDailyActivity3D Dataset; the corresponding Action Names are shown in Figure 6 (Y label of the fifth sub-figure).

For this dataset, we still use the cross-subject scheme: odd subject IDs for training and even subject IDs for testing, which guarantees that the training and testing subjects of the original datasets keep the same roles in the Combined Dataset.
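A minimal sketch of this split, based on the filename format axxx_sxxx_exxx_depth.bin described above, is given below. The regular expression, the function names and the example filenames are illustrative assumptions.

```python
import re

# Pattern for names such as a001_s001_e001_depth.bin
NAME_RE = re.compile(r"^a(\d+)_s(\d+)_e(\d+)_depth\.bin$")

def parse_name(filename):
    """Return (class ID, subject ID, example ID) parsed from a filename."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    action, subject, example = (int(group) for group in match.groups())
    return action, subject, example

def cross_subject_split(filenames):
    """Odd subject IDs go to training, even subject IDs to testing."""
    train, test = [], []
    for name in filenames:
        _, subject, _ = parse_name(name)
        (train if subject % 2 == 1 else test).append(name)
    return train, test

names = [
    "a001_s001_e001_depth.bin",
    "a021_s002_e001_depth.bin",
    "a040_s009_e002_depth.bin",
]
train, test = cross_subject_split(names)
```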

To compromise between different datasets, we set temporal scale for this dataset. Our algorithms are tested with two settings: (1) pre-trained + fine-tuning on primitive data set; (2) pre-trained + fine-tuning on data set after temporal scaling and rotation. The corresponding results are listed in Table 10.

Training Setting Accuracy (%)
Primitive + Pre-trained + Fine-tuning 87.20%
Rotation + Scaling + Pre-trained + Fine-tuning 90.92%
TABLE X: Comparison on Two Training Settings for the Overall Results on Combined Dataset.

From Table 10 we can see that for this large dataset, rotation and scaling are less effective than on the smaller datasets. One probable reason is that the training of ConvNets can benefit from the larger primitive data set when fine-tuning millions of parameters.

We compare our method with SNV [5] on this dataset: we first train one model on the Combined Dataset and then test the model on the original datasets and on the Combined Dataset. The results and corresponding confusion matrices are shown in Table 11 and Figure 6.

Dataset SNV Proposed Method
MSRAction3D 89.83% 94.58%
MSRAction3DExt 91.15% 94.05%
UTKinect-Action 93.94% 91.92%
MSRDailyActivity3D 60.63% 78.12%
Combined 86.11% 90.92%
TABLE XI: Recognition accuracy comparison of SNV and our method on Combined Dataset and its original datasets.
Fig. 6: The confusion matrices of the proposed method for the original MSRAction3D, MSRAction3DExt, UTKinect-Action, MSRDailyActivity and Combined datasets (from top left to bottom right).

From Table 11 we can see that the proposed method maintains its accuracy without dramatic variation as the variations and complexity of the datasets increase; at the same time, it outperforms the SNV method by a clear margin on the Combined Dataset (90.92% versus 86.11%). The compromise between different datasets in temporal scales leads to a slight decrease in accuracy on some individual datasets. For example, for MSRDailyActivity3D, decreasing the temporal scale to leads to the drop in accuracy, due to the loss of shape and speed information in this dataset, which has a high level of noise and complicated backgrounds. For MSRAction3D (Ext), however, the accuracy drops because some of the actions, such as hand catch and high throw, become similar after temporal scaling, owing to the simplicity of these datasets.

IV-F Analysis

In this section, we give an overall analysis of the proposed method with respect to data augmentation and parameter selection, based on the experimental results.

In our method, ConvNets are adopted for feature extraction and classification. Generally, ConvNets need a large volume of data to tune millions of parameters and reduce overfitting. Directly training the ConvNets on a small amount of data leads to very poor performance due to overfitting, as can be seen from Table 1 and Table 3. Because the available datasets (even the Combined Dataset) are small, artificial data augmentation is needed. In our method, two strategies are used for this purpose: rotation, for viewpoint invariance, and temporal scaling, for speed invariance. However, without initialising the ConvNets with a model pre-trained on ImageNet, the artificially enlarged data set is still not enough to train the whole nets from a random initialization of the millions of weights, because much of the data is interdependent and less informative. Pre-training followed by fine-tuning provides a promising direction for small datasets. The pre-trained model initialises the nets in a region of low loss, and fine-tuning then lets the nets reach a good solution even on small datasets.

For different datasets, different temporal scales are set to obtain the best results, for the following reasons. For simple action (or gesture) datasets, such as MSRAction3D (Ext), one scale is enough to distinguish the actions (gestures), owing to the short cycle of motion. For activity datasets, such as UTKinect-Action and MSRDailyActivity3D, more scales are needed, because the cycles of motion in these datasets are much longer and the activities usually contain several primitive actions (gestures); a large number of scales can capture the motion information at different temporal scales. For noisy datasets, such as MSRDailyActivity3D, a larger number of temporal scales should be set in order to suppress the effects of the high level of noise.

V Conclusion and Future Work

In this paper, we propose a deep classification model for action recognition using depth map sequences. Our proposed method is evaluated on several datasets and compared to a number of state-of-the-art approaches. Our method achieves state-of-the-art results on individual datasets and maintains its accuracy on a more complicated dataset combined from available public datasets. Through rotation and temporal scaling, the volume of training data can be artificially enlarged, from which the ConvNets benefit and obtain better results than when trained on the primitive data. Pre-training followed by fine-tuning is adopted to train ConvNets on small datasets, which achieves state-of-the-art results in most cases. However, due to the high level of noise and the complicated backgrounds of some datasets, our method is only competitive with previous methods there. Moreover, our method does not consider skeleton data, with which much success has been achieved for action recognition. With the development of deep learning methods, the combination of Deep Belief Networks for skeleton data (generative model) and deep Convolutional Neural Networks for depth data (discriminative model) will open a new door for 3D action recognition. In our future work, we will combine the proposed method with object segmentation and skeleton-based methods to improve the recognition accuracy.


  • [1] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, C. Schmid et al., “Evaluation of local spatio-temporal features for action recognition,” in BMVC, 2009.
  • [2] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” in CVPRW, 2010.
  • [3] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in CVPR, 2012.
  • [4] O. Oreifej and Z. Liu, “Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences,” in CVPR, 2013.
  • [5] X. Yang and Y. Tian, “Super normal vector for activity recognition using depth sequences,” in CVPR, 2014.
  • [6] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” PAMI, vol. 35, no. 8, pp. 1915–1929, 2013.
  • [7] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in ICLR, 2014.
  • [8] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: An astounding baseline for recognition,” in CVPRW, 2014.
  • [9] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in ECCV, 2014.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [11] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “Panda: Pose aligned networks for deep attribute modeling,” in CVPR, 2014.
  • [12] M. Oquab, L. Bottou, I. Laptev, J. Sivic et al., “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014.
  • [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [14] G. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [15] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, “Convolutional learning of spatio-temporal features,” in ECCV, 2010.
  • [16] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” PAMI, vol. 35, no. 1, pp. 221–231, 2013.
  • [17] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
  • [18] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
  • [19] X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in ACM MM, 2012.
  • [20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
  • [21] L. Xia, C.-C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3d joints,” in CVPRW, 2012.
  • [22] M. Zanfir, M. Leordeanu, and C. Sminchisescu, “The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection,” in ICCV, 2013.
  • [23] L. Xia and J. Aggarwal, “Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera,” in CVPR, 2013.
  • [24] Y. Zhu, W. Chen, and G. Guo, “Fusing spatiotemporal features and joints for 3d action recognition,” in CVPRW, 2013.