1 Introduction
Recognizing human actions has remained one of the most important and challenging tasks in computer vision. It facilitates a wide range of applications such as intelligent video surveillance, human-computer interaction, and video understanding [Poppe2010, Weinland, Ronfard, and Boyerc2011].
Traditional studies on action recognition mainly focus on recognizing actions from RGB videos recorded by 2D cameras [Weinland, Ronfard, and Boyerc2011]. However, capturing human actions in the full 3D space in which they actually occur can provide more comprehensive information. Biological observations suggest that humans can recognize actions from just the motion of a few light displays attached to the human body [Johansson1973]. Motion capture systems [CMU2003] extract 3D joint positions using markers and high-precision camera arrays. Although slightly higher in price, such systems provide highly accurate joint positions for skeletons. Recently, the Kinect device has gained much popularity thanks to its excellent accuracy in human body modeling and affordable price. The bundled SDK for Kinect v2 can directly generate accurate skeletons in real time. Due to the prevalence of these devices, skeleton based representations of the human body and its temporal evolution have become an attractive option for action recognition.
In this paper, we focus on the problem of skeleton based action recognition. The key to this problem lies mainly in two aspects. One is to design robust and discriminative features from the skeleton (and the corresponding RGB-D images) for intra-frame content representation [Müller, Röder, and Clausen2005, Wang, Liu, and Junsong2012, Sung et al.2012, Yang and Tian2014, Ji, Ye, and Cheng2014]. The other is to explore temporal dependencies of the inter-frame content for action dynamics modeling, using a hierarchical maximum entropy Markov model [Sung et al.2011][Xia, Chen, and Aggarwal2012] or Conditional Random Fields [Sminchisescu et al.2005]. Inspired by the success of deep recurrent neural networks (RNNs) using the Long Short-Term Memory (LSTM) architecture for speech feature learning and time series modeling [Graves, Mohamed, and Hinton2013, Graves and Schmidhuber2005], we intend to build an effective action recognition model based on a deep LSTM network.
To this end, we propose an end-to-end fully connected deep LSTM network to perform automatic feature learning and motion modeling (Fig. 1). The proposed network is constructed by inheriting many insights from recent successful networks [Graves2012, Krizhevsky, Sutskever, and Hinton2012, Szegedy et al.2015, Du, Wang, and Wang2015] and is designed to robustly model complex relationships among different joints. The LSTM layers and feedforward layers are alternately deployed to construct a deep network that captures the motion information. To ensure the model learns effective features and motion dynamics, we enforce different types of strong regularization in different parts of the model, which effectively mitigates overfitting.
Specifically, two types of regularization are proposed. (i) For the fully connected layers, we introduce regularization to drive the model to learn co-occurrence features of the joints at different layers. (ii) For the LSTM neurons, we derive a new dropout and apply it to the LSTM neurons in the last LSTM layer, which helps the network to learn complex motion dynamics. With these forms of regularization, we validate our deep LSTM networks on three public datasets for human action recognition. The proposed model consistently outperforms other state-of-the-art algorithms for skeleton based human action recognition.
2 Related Work
2.1 Activity Recognition with Neural Networks
In contrast to handcrafted features, there is a growing trend of learning robust feature representations from raw data with deep neural networks, and excellent performance has been reported in image classification [Krizhevsky, Sutskever, and Hinton2012] and speech recognition [Graves, Mohamed, and Hinton2013]. However, there are only a few works which leverage neural networks for skeleton based action recognition. A multilayer perceptron network is trained to classify each frame [Cho and Chen2014]; however, such a network cannot explore temporal dependencies very well. In contrast, a gesture recognition system [Lefebvre et al.2013] employs a shallow bidirectional LSTM with only one forward hidden layer and one backward hidden layer to explore long-range temporal dependencies. A deep recurrent neural network architecture with handcrafted subnets is utilized for skeleton based action recognition [Du, Wang, and Wang2015]. However, the handcrafted hierarchical subnets and their fusion ignore the inherent co-occurrences of joints. This motivates us to design a deep fully connected neural network which is capable of fully exploiting the inherent correlations among skeleton joints in various actions.
2.2 Co-occurrence Exploration
An action is usually only associated with and characterized by the interactions and combinations of a subset of the skeleton joints. For example, the joints “hand”, “arm” and “head” are associated with the action “making telephone call”. An actionlet ensemble model exploits this trait by mining particular conjunctions of the features corresponding to some subsets of the joints [Wang, Liu, and Junsong2012]. Similarly, actions involving two people can be characterized by the interactions of a subset of the two persons’ joints [Yun et al.2012, Ji, Ye, and Cheng2014]. Inspired by the actionlet ensemble model, we introduce a new exploration mechanism in the deep LSTM architecture to achieve automatic co-occurrence mining, as opposed to pre-specifying which joints should be grouped.
2.3 Dropout for Recurrent Neural Networks
Dropout has been demonstrated to be quite effective in deep convolutional neural networks [Krizhevsky, Sutskever, and Hinton2012], but there has been relatively little research on applying it to RNNs. To preserve the ability of RNNs to model sequences, it has been proposed to apply dropout only to the feedforward (along layers) connections and not to the recurrent (along time) connections [Pham et al.2014]. This avoids erasing all the information stored in the units. Note that the previous work only considers dropout at the output response of an LSTM neuron [Zaremba, Sutskever, and Vinyals2014]. However, considering that an LSTM neuron consists of internal cell and gate units, we believe one should look not only at the output of the neuron but also into its internal structure to design effective dropout schemes. In this paper, we design an in-depth dropout for LSTM to address this problem.
3 Deep LSTM with Co-occurrence Exploration and In-depth Dropout
Leveraging the insights from recent successful networks, we design a fully connected deep LSTM network for skeleton based action recognition. Fig. 1 shows the architecture of the proposed network, which has three bidirectional LSTM layers, two feedforward layers, and a softmax layer that gives the predictions. The proposed full connection architecture enables one to fully exploit the inherent correlations among skeleton joints. In the network, the co-occurrence exploration is applied to the connections prior to the second LSTM layer to learn the co-occurrences of joints/features. LSTM dropout is applied to the last LSTM layer to enable more effective learning. Note that each LSTM layer uses bidirectional LSTM and we do not explicitly distinguish the forward and backward LSTM neurons in Fig. 1. At each time step, the input to the network is a vector denoting the 3D positions of the skeleton joints in a frame.
In the following, we first review LSTM briefly to make the paper self-contained. Then we introduce our method for co-occurrence exploration in the deep LSTM network. Lastly, we describe our dropout algorithm, which is designed for the LSTM neurons and enables effective learning of the model.
3.1 Overview of LSTM
The RNN is a successful model for sequential learning [Graves2012]. For the recurrent neurons at some layer, the output responses $h_t$ are calculated based on the inputs $x_t$ to this layer and the responses $h_{t-1}$ from the previous time slot

$$h_t = \theta\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right), \tag{1}$$

where $\theta(\cdot)$ denotes the activation function, $b_h$ denotes the bias vector, $W_{xh}$ is the matrix of weights between the input and hidden layer, and $W_{hh}$ is the matrix of recurrent weights from the hidden layer to itself at adjacent time steps, which is used for exploring temporal dependency.
LSTM is an advanced RNN architecture which can learn long-range dependencies [Graves, Mohamed, and Hinton2013]. Fig. 2 shows a typical LSTM neuron, which contains an input gate $i_t$, a forget gate $f_t$, a cell $c_t$, an output gate $o_t$ and an output response $h_t$. The input gate and forget gate govern the information flow into and out of the cell. The output gate controls how much information from the cell is passed to the output $h_t$. The memory cell has a self-connected recurrent edge of weight 1, ensuring that the gradient can pass across many time steps without vanishing or exploding. Therefore, it overcomes the difficulties in training the RNN model caused by the “vanishing gradient” effect. For all the LSTM neurons in some layer, at time $t$, the recursive computation of activations of the units is

$$\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right),\\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right),\\
h_t &= o_t \odot \tanh\left(c_t\right),
\end{aligned} \tag{2}$$

where $\odot$ denotes element-wise product, $\sigma(x)$ is the sigmoid function defined as $\sigma(x) = 1/(1 + e^{-x})$, $W_{\alpha\beta}$ is the weight matrix between $\alpha$ and $\beta$ (e.g., $W_{xi}$ is the weight matrix from the inputs $x_t$ to the input gates $i_t$), and $b_{\beta}$ denotes the bias term of $\beta$ with $\beta \in \{i, f, c, o\}$. Four weight matrices are associated with input $x_t$. To allow the information from both the future and the past to determine the output, bidirectional LSTM can be utilized [Graves2012].
3.2 Co-occurrence Exploration
The fully connected deep LSTM network in Fig. 1 has very powerful learning capability. However, it is difficult to learn directly due to the huge parameter space. To overcome this problem, we introduce a co-occurrence exploration process to ensure the deep model learns effective features.
The co-occurrence of some joints can intrinsically characterize a human action. Fig. 3 shows two examples. For “walking”, the joints from hands and feet have high correlations but they all have low correlations with the root joint. The sets of correlated joints for “walking” and “drinking” are very different, indicating that the discriminative subset of joints varies for different types of actions. Two main aspects have been considered in our design of the network and the specialized regularization we propose. (i) We expect the network to automatically explore the conjunctions of discriminative joints. (ii) We expect the network to explore different co-occurrences for different types of actions. Therefore, we design the fully connected network to allow each neuron to be connected to any joints (for the first layer) or responses of the previous layer (for the second or higher layers) to automatically explore the co-occurrences. Note that the output responses are also referred to as features, which are the input of the next layer. We divide the neurons in the same layer into groups to allow different groups to focus on exploring different conjunctions of discriminative joints. Taking one group of neurons as an example (see Fig. 4 (a)), the neurons will automatically connect to the discriminative joints/features. In our design, we incorporate the co-occurrence regularization into the loss function
$$\min_{W_{x\beta}} \; \mathcal{L} + \lambda_1 \sum_{\beta \in S} \left\|W_{x\beta}\right\|_1 + \lambda_2 \sum_{\beta \in S} \sum_{k=1}^{K} \left\|W_{x\beta,k}^{T}\right\|_{2,1}, \tag{3}$$

where $\mathcal{L}$ is the maximum likelihood loss function of the deep LSTM network [Graves2012]. The other two terms are used for the co-occurrence regularization, which can be applied to each layer. $W_{x\beta} \in \mathbb{R}^{N \times J}$ is the connection weight matrix from inputs to the units associated with $\beta \in S$, with $N$ indicating the number of neurons and $J$ the dimension of inputs (e.g., for the first layer, $J$ is the number of joints multiplied by 3). The $N$ neurons are partitioned into $K$ groups and the number of neurons in a group is $L = \lceil N/K \rceil$, with $\lceil \cdot \rceil$ representing the rounding up operation. $W_{x\beta,k}$ is the matrix composed of the $(L(k-1)+1)$-th to $(Lk)$-th rows of $W_{x\beta}$. $W_{x\beta,k}^{T}$ denotes its transpose. $S$ denotes the set of internal units which are directly connected to the input of a neuron. For the LSTM layer, $S = \{i, f, c, o\}$ denotes the gates and cell in LSTM neurons. For the feedforward layer, $S = \{h\}$ denotes the neuron itself.
In the third term, for each group of units, a structural $\ell_{2,1}$ norm, which is defined as $\|W\|_{2,1} = \sum_i \sqrt{\sum_j w_{i,j}^2}$ [Cotter et al.2005], is used to drive the units to select a conjunction of descriptive inputs (joints/features), since the $\ell_{2,1}$ norm can encourage the matrix $W_{x\beta,k}$ to be column sparse. Different groups explore different connection (co-occurrence) patterns in order to acquire the capability of recognizing multiple categories of actions. The $\ell_1$ norm constraint (the second term) helps to learn discriminative joints/features.
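As an illustration of the two regularization terms in (3), the following is a minimal NumPy sketch that evaluates them for one input weight matrix. It is not the authors' code: it assumes the second term is an element-wise l1 penalty and the third term sums, for every group of consecutive neurons, the l2 norms of the columns of that group's sub-matrix. The function name `cooccurrence_penalty` and the arguments `lam1` and `lam2` are hypothetical.

```python
import math
import numpy as np


def cooccurrence_penalty(W, num_groups, lam1, lam2):
    """Regularization value for one weight matrix W of shape (N, J).

    N is the number of neurons and J the input dimension. Rows (neurons) are
    split into `num_groups` consecutive groups of ceil(N / num_groups) rows.
    Assumption: l1 penalty on all weights plus a per-group l2,1 penalty that
    pushes whole input columns of a group towards zero.
    """
    n_neurons = W.shape[0]
    group_size = math.ceil(n_neurons / num_groups)

    # Second term of (3): element-wise sparsity.
    l1_term = np.abs(W).sum()

    # Third term of (3): for each group, l2 norm of every column, then sum.
    l21_term = 0.0
    for k in range(num_groups):
        block = W[k * group_size:(k + 1) * group_size]  # (<= group_size, J)
        l21_term += np.sqrt((block ** 2).sum(axis=0)).sum()

    return lam1 * l1_term + lam2 * l21_term


# Toy example: two groups of two neurons, each group using a different input.
W_toy = np.array([[3.0, 0.0],
                  [4.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])
penalty = cooccurrence_penalty(W_toy, num_groups=2, lam1=1.0, lam2=1.0)
```

In a training loop this value would be accumulated over the units in the set of internal units and added to the classification loss before computing gradients. A group whose neurons all ignore an input pays nothing for that column in the third term, which is what encourages each group to settle on its own subset of joints.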
The stochastic gradient descent method is then employed to solve (3). The advantage of the co-occurrence learning is that the model can automatically learn the discriminative joint/feature connections, avoiding the fixed a priori blocking of joint co-occurrences across human parts [Du, Wang, and Wang2015] as illustrated in Fig. 4 (b).
3.3 In-depth Dropout for LSTM
Dropout tries to combine the predictions of many “thinned” networks to boost the performance. During training, the network randomly drops some neurons to force the remaining subnetwork to compensate. During testing, the network uses all the neurons together to make predictions.
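To extend this idea to LSTM, we propose a new dropout algorithm that allows the dropping of the internal gates, cell and output response of an LSTM neuron, encouraging each unit to learn better parameters. For clarity, an LSTM neuron is shown in Fig. 5 (a) in the unfolded form, where the units are explicitly connected. For recurrent neural networks, erasing all the information from a unit is undesirable, especially when the unit remembers events that occurred many time steps back in the past [Pham et al.2014]. Therefore, we allow the influence of dropout in LSTM to flow along layers (marked by dashed arrows) but prohibit it from flowing along the time axis (marked by solid arrows), as illustrated in Fig. 5 (b). To control the influence flows, in the feedforward process, the network calculates and records two types of activations as follows. The responses of units to be transmitted along the time without dropout are

$$\begin{aligned}
\tilde{i}_t &= \sigma\left(W_{xi} x_t + W_{hi} \tilde{h}_{t-1} + b_i\right),\\
\tilde{f}_t &= \sigma\left(W_{xf} x_t + W_{hf} \tilde{h}_{t-1} + b_f\right),\\
\tilde{c}_t &= \tilde{f}_t \odot \tilde{c}_{t-1} + \tilde{i}_t \odot \tanh\left(W_{xc} x_t + W_{hc} \tilde{h}_{t-1} + b_c\right),\\
\tilde{o}_t &= \sigma\left(W_{xo} x_t + W_{ho} \tilde{h}_{t-1} + b_o\right),\\
\tilde{h}_t &= \tilde{o}_t \odot \tanh\left(\tilde{c}_t\right).
\end{aligned} \tag{4}$$

The responses of units to be transmitted across layers with dropout applied are

$$\begin{aligned}
\check{i}_t &= \tilde{i}_t \odot m_i,\\
\check{f}_t &= \tilde{f}_t \odot m_f,\\
\check{c}_t &= \left(\check{f}_t \odot \tilde{c}_{t-1} + \check{i}_t \odot \tanh\left(W_{xc} x_t + W_{hc} \tilde{h}_{t-1} + b_c\right)\right) \odot m_c,\\
\check{o}_t &= \tilde{o}_t \odot m_o,\\
\check{h}_t &= \check{o}_t \odot \tanh\left(\check{c}_t\right) \odot m_h,
\end{aligned} \tag{5}$$

where $m_i$, $m_f$, $m_c$, $m_o$, and $m_h$ are dropout binary mask vectors for input gates, forget gates, cells, output gates and output responses, respectively, with an element value of 0 indicating that dropout happens. Note that for the first LSTM layer, the inputs $x_t$ are the skeleton joints of a frame; for higher LSTM layers, the inputs $x_t$ are the response outputs of the previous layer.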
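The two activation paths described above can be illustrated with a small NumPy sketch of a single time step. This is not the authors' implementation: the exact placement of the masks (on the gates, on the whole cell, and on the output response) is an assumption consistent with the description, and the names `lstm_step_with_indepth_dropout`, `sample_masks`, and the `params` layout are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_masks(n_units, drop_prob, rng):
    """Binary masks for input gate, forget gate, cell, output gate, output.

    A value of 0 means the unit is dropped for this step.
    """
    return {k: (rng.random(n_units) >= drop_prob).astype(float) for k in "ifcoh"}


def lstm_step_with_indepth_dropout(x, h_prev, c_prev, params, masks):
    """One LSTM time step with two recorded activation paths.

    Returns (h_time, c_time, h_layer):
      h_time, c_time: dropout-free responses handed to the next time step.
      h_layer:        masked response handed to the layer above.
    params holds dicts "W" (input weights), "U" (recurrent weights) and
    "b" (biases), each keyed by 'i', 'f', 'c', 'o'.
    """
    W, U, b = params["W"], params["U"], params["b"]
    pre = {k: W[k] @ x + U[k] @ h_prev + b[k] for k in "ifco"}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = np.tanh(pre["c"])  # candidate cell input

    # Path along time: ordinary LSTM recursion, no dropout.
    c_time = f * c_prev + i * g
    h_time = o * np.tanh(c_time)

    # Path across layers: every internal unit can be dropped (assumed layout).
    i_d = i * masks["i"]
    f_d = f * masks["f"]
    c_d = (f_d * c_prev + i_d * g) * masks["c"]
    o_d = o * masks["o"]
    h_layer = o_d * np.tanh(c_d) * masks["h"]
    return h_time, c_time, h_layer
```

Because `h_time` and `c_time` never see a mask, a dropped unit only affects what the upper layer receives at the current step, while the memory carried to the next time step stays intact.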
During the training process, the errors backpropagated to the output responses are
(6) 
where denotes the vector of errors backpropagated from the upper layer at the same time slot, denotes the vector of errors from the next time slot in the same layer, and denote the dropout errors from the upper layer and recurrent errors from the next time slot, respectively.
By taking derivative of with respect to based on (5), we get the errors from to which represent the errors from upper layer with dropout involved
(7) 
Similarly, based on (4), we get the errors backpropagated from to which represent the errors from the next time slot in the same layer without dropout
(8) 
Then, the errors backpropagated to the output gates are the summation of the two types of errors
(9) 
In the same way, we derive the errors propagated to the cells, forget gates, and input gates, based on (4) and (5).
During the testing process, we use all the neurons but multiply the units of the LSTM neurons (in the LSTM layer where dropout is applied) by $1 - p$, where $p$ is the dropout probability of that unit. Note that the simple dropout which only drops the output responses [Zaremba, Sutskever, and Vinyals2014] is a special case of our proposed dropout.
3.4 Action Recognition using the Learned Model
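With the learned deep LSTM network, the probability that a sequence $X$ belongs to the class $C_k$ is

$$p(C_k \mid X) = \frac{e^{o_k}}{\sum_{i=1}^{C} e^{o_i}}, \quad k = 1, \ldots, C, \tag{10}$$

where $C$ denotes the number of classes, $T$ represents the length of the test sequence, $o = \sum_{t=1}^{T} \left(W_{\overrightarrow{h}o} \overrightarrow{h}_t + W_{\overleftarrow{h}o} \overleftarrow{h}_t\right)$, and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the output responses of the last bidirectional LSTM layer. Then, the class with the highest probability is chosen as the action class.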
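The decision rule can be sketched in a few lines of NumPy. The sketch assumes the class scores are obtained by projecting the forward and backward responses of the last bidirectional layer, summing them over all frames of the sequence, and applying a softmax; the function name `classify_sequence` and the weight names `W_fwd` and `W_bwd` are hypothetical.

```python
import numpy as np


def classify_sequence(h_fwd, h_bwd, W_fwd, W_bwd):
    """Sequence-level class probabilities from per-frame LSTM responses.

    h_fwd, h_bwd: arrays of shape (T, H), forward and backward responses.
    W_fwd, W_bwd: arrays of shape (C, H), projections to C class scores.
    Returns (probs, label) where probs has shape (C,).
    """
    # Per-frame scores, accumulated over the T frames of the sequence.
    scores = (h_fwd @ W_fwd.T + h_bwd @ W_bwd.T).sum(axis=0)
    # Softmax with the usual max-shift for numerical stability.
    shifted = np.exp(scores - scores.max())
    probs = shifted / shifted.sum()
    return probs, int(np.argmax(probs))


# Toy example: 3 frames, 2 hidden units, 2 classes.
probs, label = classify_sequence(
    h_fwd=np.ones((3, 2)),
    h_bwd=np.zeros((3, 2)),
    W_fwd=np.array([[1.0, 0.0], [0.0, 0.0]]),
    W_bwd=np.zeros((2, 2)),
)
```

Summing the scores over time before the softmax means every frame contributes to the decision, so a sequence is labeled as a whole rather than frame by frame.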
4 Experiments
We validate the proposed model on the SBU Kinect interaction dataset [Yun et al.2012], the HDM05 dataset [Müller et al.2007], and the CMU dataset [CMU2003], whose ground truth we labeled ourselves. We have also tested our model on the Berkeley MHAD action recognition dataset [Ofli et al.2013] and achieved 100% accuracy. To investigate the impact of each component in our model, we conduct experiments under the following configurations:

Deep LSTM is our basic scheme without regularizations;

Deep LSTM + Co-occurrence is the scheme with our proposed co-occurrence regularization applied;

Deep LSTM + Simple Dropout is the scheme with the dropout algorithm proposed by Zaremba et al. [Zaremba, Sutskever, and Vinyals2014] applied to our basic scheme;

Deep LSTM + In-depth Dropout is the scheme with our proposed in-depth dropout applied;

Deep LSTM + Co-occurrence + In-depth Dropout is our final scheme with both co-occurrence regularization and in-depth dropout applied.
The skeleton sequences of the HDM05 and CMU datasets are temporally downsampled to a frame rate of 30 fps. To reduce the influence of noise in the captured skeleton data, we smooth each joint’s position of the raw skeleton using a Savitzky-Golay filter in the temporal domain [Savitzky and Golay1964, Du, Wang, and Wang2015]. The number of groups in our model is set experimentally to 5, 10, and 10 for the first three layers. We set the dropout probability to 0.2 for each unit in an LSTM neuron, which makes the overall dropout probability of an LSTM neuron approach 0.5 (this can be derived based on (5)). Note that when dropout is applied, the number of neurons in the corresponding layer is doubled as suggested by previous work [Srivastava et al.2014]. We set the two regularization weights in (3) experimentally.
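The temporal smoothing step can be sketched as follows. The text does not preserve the exact filter, so this example assumes the classic 5-point quadratic Savitzky-Golay smoothing kernel with coefficients (-3, 12, 17, 12, -3)/35; the function name `smooth_joints` and the edge-replication padding at the sequence boundaries are likewise assumptions.

```python
import numpy as np

# Assumed kernel: 5-point quadratic/cubic Savitzky-Golay smoothing weights.
SG_COEFFS = np.array([-3.0, 12.0, 17.0, 12.0, -3.0]) / 35.0


def smooth_joints(seq):
    """Smooth joint coordinates along time.

    seq: array of shape (T, D), one row per frame and one column per joint
    coordinate (for example D = 3 * number_of_joints).
    Returns an array of the same shape.
    """
    seq = np.asarray(seq, dtype=float)
    # Repeat the first and last frame twice so the output keeps T frames.
    padded = np.pad(seq, ((2, 2), (0, 0)), mode="edge")
    out = np.empty_like(seq)
    for d in range(seq.shape[1]):
        # The kernel is symmetric, so convolution equals correlation here.
        out[:, d] = np.convolve(padded[:, d], SG_COEFFS, mode="valid")
    return out
```

This kernel leaves constant, linear, and quadratic trajectories unchanged away from the boundaries, so it removes jitter without flattening smooth motion. Where SciPy is available, `scipy.signal.savgol_filter` offers the same family of filters with configurable window length and polynomial order.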
4.1 SBU Kinect Interaction Dataset
The SBU Kinect interaction dataset is a Kinect-captured human activity recognition dataset depicting two-person interactions. It contains 230 sequences of 8 classes (6,614 frames) and is evaluated with subject-independent 5-fold cross validation. The smoothed positions of joints are used as the input of the deep LSTM network for recognition. The number of neurons is set to 100×2, 100, 110×2, 110, 200×2 for the first to fifth layers respectively, where ×2 indicates that bidirectional LSTM is used and thus the number of neurons is doubled.
We have compared our schemes with other skeleton based methods [Yun et al.2012, Ji, Ye, and Cheng2014, Du, Wang, and Wang2015]. Note that we add an additional layer to fuse the two subnets corresponding to the two persons when extending the Hierarchical RNN scheme to the two-person interaction scenario [Du, Wang, and Wang2015]. We summarize the results in terms of the average recognition accuracy (5-fold cross validation) in Table 1.


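Table 1: Average recognition accuracy (%) on the SBU Kinect interaction dataset.

Methods                                        Acc.(%)
(label missing)                                49.7
(label missing)                                80.3
(label missing)                                79.4
(label missing)                                86.9
Hierarchical RNN [Du, Wang, and Wang2015]      80.35
Deep LSTM                                      86.03
Deep LSTM + Co-occurrence                      89.44
Deep LSTM + Simple Dropout                     89.70
Deep LSTM + In-depth Dropout                   90.10
Deep LSTM + Co-occurrence + In-depth Dropout   90.41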

Table 1 shows that our basic scheme of Deep LSTM achieves comparable performance to the method using handcrafted complex features [Ji, Ye, and Cheng2014]. The proposed schemes of Deep LSTM + Co-occurrence and Deep LSTM + In-depth Dropout improve the recognition accuracy by 3.4% and 4.1% respectively over Deep LSTM, indicating that the co-occurrence exploration boosts the discrimination of features and that the proposed LSTM dropout is capable of learning a more effective model. Deep LSTM + In-depth Dropout is superior to Deep LSTM + Simple Dropout. Note that the deep LSTM network achieves a remarkable (5.6%) performance improvement in comparison with the hierarchical RNN network [Du, Wang, and Wang2015]. This is because allowing full connection of joints/features with neurons, rather than imposing a priori subnet constraints, facilitates the interaction among joints, especially when the joints do not belong to the same part or the same person. Our scheme with combined regularizations achieves the best performance.
4.2 HDM05 Dataset
The HDM05 dataset contains 2,337 skeleton sequences performed by 5 actors (184,046 frames after downsampling). For fair comparison, we use the same protocol (65 classes, 10-fold cross validation) as used by Cho and Chen [Cho and Chen2014]. The preprocessing is the same as that done in the hierarchical RNN scheme [Du, Wang, and Wang2015] (centralize the joints’ positions to the human center for each frame and smooth the positions). The number of neurons is 100, 110, 120, 120, 200 for the five layers respectively. Table 2 shows the results in terms of average accuracy. Our basic deep LSTM achieves better results than the Multilayer Perceptron model, which suggests that LSTM exhibits better motion modeling ability than the MLP. With the proposed co-occurrence learning and in-depth dropout regularization, our full model also performs better than the manually designed hierarchical RNN approach.


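Table 2: Average recognition accuracy (%) on the HDM05 dataset.

Methods                                        Acc.(%)
Multilayer Perceptron [Cho and Chen2014]       95.59
Hierarchical RNN [Du, Wang, and Wang2015]      96.92
Deep LSTM                                      96.80
Deep LSTM + Co-occurrence                      97.03
Deep LSTM + Simple Dropout                     97.21
Deep LSTM + In-depth Dropout                   97.25
Deep LSTM + Co-occurrence + In-depth Dropout   97.25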

4.3 CMU Dataset
We have categorized the CMU motion capture dataset into 45 classes for the purpose of skeleton based action recognition (http://www.escience.cn/people/zhuwentao/29634.html). The categorized dataset contains 2,235 sequences (987,341 frames after downsampling) and is the largest skeleton based human action dataset so far. This dataset is much more challenging because: (i) the lengths of sequences vary greatly; (ii) the within-class diversity is large, e.g., for “walking”, different people walk at different speeds and along different paths; (iii) the dataset contains complex actions such as dancing and doing yoga.
We have evaluated the performance on both the entire dataset (CMU) and a subset of the dataset (CMU subset). For this subset, we have chosen eight representative action categories containing 664 sequences (125,667 frames after downsampling), with the actions jump, walk back, run, sit, getup, pickup, basketball, and cartwheel. The same preprocessing as used for the HDM05 dataset is performed. The number of neurons is set to 100, 100, 120, 120, 100 for the five layers. Three-fold cross validation is conducted and the results in terms of average accuracy are shown in Table 3. We can see that the proposed model achieves significant performance improvement, indicating that it can better learn the discriminative features and model long-range temporal dynamics even for this challenging dataset.



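Table 3: Average recognition accuracy (%) on the CMU subset and the entire CMU dataset.

Methods                                        CMU subset  CMU
(label missing)                                83.13       75.02
Deep LSTM                                      86.00       79.53
Deep LSTM + Co-occurrence                      87.20       79.60
Deep LSTM + Simple Dropout                     87.80       80.59
Deep LSTM + In-depth Dropout                   88.25       80.99
Deep LSTM + Co-occurrence + In-depth Dropout   88.40       81.04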

4.4 Discussions
To further understand our deep LSTM network, we visualize the weights learned in the first LSTM layer on the SBU Kinect interaction dataset in Fig. 6 (a). Each element represents the absolute value of the weight between the corresponding skeleton joint and the input gate of that LSTM neuron. It is observed that the weights in the diagonal positions marked by the red ellipse have high values, which means the co-occurrence regularization helps learn the human parts automatically. In contrast to the part based subnet fusion model [Du, Wang, and Wang2015], the co-occurrences of joints learned by our model do not limit the connections to the same part, as there are many large weights outside the diagonal regions, e.g., in the regions marked by white circles, making the network more powerful for action recognition. This also signifies the importance of the proposed full connection architecture. By averaging the energy of the weights in the same group of neurons for each joint, we obtain Fig. 6 (b), which has five groups of LSTM neurons. It is observed that different groups have different weight patterns, preferring different conjunctions of joints.
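The per-group averaging behind a plot such as Fig. 6 (b) could be computed along the following lines. This is a sketch under stated assumptions: "energy" is interpreted as the squared weight, each joint is assumed to occupy three consecutive input columns (x, y, z), neurons are grouped consecutively, and the number of neurons is assumed to be divisible by the number of groups. The function name `group_joint_energy` is hypothetical.

```python
import math
import numpy as np


def group_joint_energy(W, num_groups, coords_per_joint=3):
    """Mean squared weight per (neuron group, joint).

    W: array of shape (N, n_joints * coords_per_joint), e.g. the weights from
       the joint coordinates to the input gates of N LSTM neurons.
    Returns an array of shape (num_groups, n_joints).
    """
    n_neurons, n_inputs = W.shape
    n_joints = n_inputs // coords_per_joint
    group_size = math.ceil(n_neurons / num_groups)

    energy = np.zeros((num_groups, n_joints))
    for k in range(num_groups):
        block = W[k * group_size:(k + 1) * group_size]
        # Split the input axis into (joint, coordinate) and average the
        # squared weights over the group's neurons and the coordinates.
        per_joint = (block ** 2).reshape(block.shape[0], n_joints, coords_per_joint)
        energy[k] = per_joint.mean(axis=(0, 2))
    return energy


# Toy example: 4 neurons in 2 groups, 2 joints with 3 coordinates each.
W_demo = np.zeros((4, 6))
W_demo[0:2, 0:3] = 2.0   # group 0 attends to joint 0
W_demo[2:4, 3:6] = 1.0   # group 1 attends to joint 1
energy_demo = group_joint_energy(W_demo, num_groups=2)
```

Rows of the result with clearly different profiles indicate that the groups have specialized on different conjunctions of joints.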
5 Conclusion
In this paper, we propose an end-to-end fully connected deep LSTM network for skeleton based action recognition. The proposed model facilitates the automatic learning of feature co-occurrences from the skeleton joints through our designed regularization. To ensure effective learning of the deep model, we design an in-depth dropout algorithm for the LSTM neurons, which performs dropout for the internal gates, cell, and output response of the LSTM neuron. Experimental results demonstrate the state-of-the-art performance of our model on several datasets.
Acknowledgment
We would like to thank David Wipf from Microsoft Research Asia for the valuable discussions, and thank Yong Du from Institute of Automation, Chinese Academy of Sciences for providing Hierarchical RNN code for the comparison.
References
 [Cho and Chen2014] Cho, K., and Chen, X. 2014. Classifying and visualizing motion capture sequences using deep neural networks. In International Conference on Computer Vision Theory and Applications, 122–130.
 [CMU2003] CMU. 2003. CMU graphics lab motion capture database. http://mocap.cs.cmu.edu/.
 [Cotter et al.2005] Cotter, S. F.; Rao, B. D.; Engan, K.; and Kreutz-Delgado, K. 2005. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing 53(7):2477–2488.

 [Du, Wang, and Wang2015] Du, Y.; Wang, W.; and Wang, L. 2015. Hierarchical recurrent neural network for skeleton based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 1110–1118.
 [Graves and Schmidhuber2005] Graves, A., and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5):602–610.
 [Graves, Mohamed, and Hinton2013] Graves, A.; Mohamed, A.; and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 6645–6649.
 [Graves2012] Graves, A. 2012. Supervised sequence labelling with recurrent neural networks, volume 385. Springer.
 [Ji, Ye, and Cheng2014] Ji, Y.; Ye, G.; and Cheng, H. 2014. Interactive body part contrast mining for human interaction recognition. In IEEE International Conference on Multimedia and Expo Workshops, 1–6.
 [Johansson1973] Johansson, G. 1973. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics 14(2):201–211.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

 [Lefebvre et al.2013] Lefebvre, G.; Berlemont, S.; Mamalet, F.; and Garcia, C. 2013. BLSTM-RNN based 3D gesture classification. In Proceedings of the International Conference on Artificial Neural Networks and Machine Learning, 381–388.
 [Müller et al.2007] Müller, M.; Röder, T.; Clausen, M.; Eberhardt, B.; Krüger, B.; and Weber, A. 2007. Documentation mocap database HDM05. Technical Report CG-2007-2, Universität Bonn.
 [Müller, Röder, and Clausen2005] Müller, M.; Röder, T.; and Clausen, M. 2005. Efficient content-based retrieval of motion capture data. ACM Transactions on Graphics 24(3):677–683.
 [Ofli et al.2013] Ofli, F.; Chaudhry, R.; Kurillo, G.; Vidal, R.; and Bajcsy, R. 2013. Berkeley MHAD: A comprehensive multimodal human action database. In Proceedings of the IEEE Workshop on Applications on Computer Vision, 53–60.
 [Pham et al.2014] Pham, V.; Bluche, T.; Kermorvant, C.; and Louradour, J. 2014. Dropout improves recurrent neural networks for handwriting recognition. In International Conference on Frontiers in Handwriting Recognition, 285–290.
 [Poppe2010] Poppe, R. 2010. A survey on vision-based human action recognition. Image and Vision Computing 28(6):976–990.
 [Savitzky and Golay1964] Savitzky, A., and Golay, M. J. 1964. Smoothing and differentiation of data by simplified least squares procedures. Analytical chemistry 36(8):1627–1639.
 [Sminchisescu et al.2005] Sminchisescu, C.; Kanaujia, A.; Li, Z.; and Metaxas, D. 2005. Conditional models for contextual human motion recognition. In IEEE Conference on Computer Vision, volume 2, 1808–1815.
 [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.
 [Sung et al.2011] Sung, J.; Ponce, C.; Selman, B.; and Saxena, A. 2011. Human activity detection from RGBD images. In AAAI workshop on Pattern, Activity and Intent Recognition, 47–55.
 [Sung et al.2012] Sung, J.; Ponce, C.; Selman, B.; and Saxena, A. 2012. Unstructured human activity detection from RGBD images. In IEEE International Conference on Robotics and Automation, 842–849.
 [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
 [Wang, Liu, and Junsong2012] Wang, J.; Liu, Z.; and Junsong, Y. 2012. Mining actionlet ensemble for action recognition with depth cameras. In IEEE Conference on Computer Vision and Pattern Recognition, 1290–1297.
 [Weinland, Ronfard, and Boyerc2011] Weinland, D.; Ronfard, R.; and Boyerc, E. 2011. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding 115(2):224–241.
 [Xia, Chen, and Aggarwal2012] Xia, L.; Chen, C.-C.; and Aggarwal, J. K. 2012. View invariant human action recognition using histograms of 3D joints. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, 20–27.
 [Yang and Tian2014] Yang, X., and Tian, Y. 2014. Effective 3D action recognition using EigenJoints. Journal of Visual Communication and Image Representation 25(1):2–11.
 [Yun et al.2012] Yun, K.; Honorio, J.; Chattopadhyay, D.; Berg, T. L.; and Samaras, D. 2012. Two-person interaction detection using body pose features and multiple instance learning. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 28–35.
 [Zaremba, Sutskever, and Vinyals2014] Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.