1 Introduction*

* Both authors contributed equally.
Gesture recognition has been an active and challenging research topic for several decades. There are two main kinds of source data: video and motion capture (MOCAP) data. MOCAP records human actions as human skeleton information, and its classification is very important in computer animation, sports science, human–computer interaction (HCI) and filmmaking.
Recently, RGB-D sensors such as the Kinect have become widely adopted by the game industry as well as in HCI, thanks to their low cost and high mobility. In computer vision especially, gesture recognition using data from RGB-D sensors is gaining more and more attention. However, the computational difficulty of directly processing the 3-D point cloud data from depth information often leads to using instead the human skeleton extracted from the depth information [Shotton et al., 2011].
Conventional recognition systems, however, are mostly applied to small datasets with a couple of dozen different actions, a limitation often imposed by the design of the system. Conventional designs may be classified into two categories: either a whole motion is represented by one feature matrix or vector [Raptis et al., 2011] and classified as a whole by a classifier [Müller and Röder, 2006]; or a library of key features [Wang et al., 2012] is learned from the whole dataset, and each motion is then represented as a bag or histogram of words [Raptis et al., 2008] or as a path in a graph [Barnachon et al., 2013].
In the first type of system, principal component analysis (PCA) is often used to form equal-size feature matrices or vectors from variable-length motion sequences [Zhao et al., 2013, Vieira et al., 2012]. However, due to the large inter- and intra-class variation among motions, a single feature matrix or vector is unlikely to capture all the important discriminative information, which makes these systems inadequate for a large dataset. The second type of approach also scales poorly with the number of action classes, due to a potentially excessive codebook size when a classifier such as a support vector machine is used [Ofli et al., 2013], or an overly complicated structure when a motion graph is built.
In this paper, we recognize actions from skeleton data with two major contributions:
- We propose to build the recognition system on a joint distribution model of a per-frame feature set containing the relative positions of the joints, their temporal differences and the normalized trajectory of the motion.
- We propose a novel variant of the multi-layer perceptron, called a hybrid MLP, that simultaneously classifies and reconstructs the features, and outperforms a regular MLP, extreme learning machines (ELM) and SVMs. In addition, a deep autoencoder is trained to visualize the features in two-dimensional space; compared with PCA using the two leading principal components, the autoencoder clearly extracts more distinctive information from the features.

We test our system on a publicly available database containing 65 action classes and more than 2,000 motions, with above 95% accuracy.
2 Feature Extraction
In this section we describe how to extract the proposed features from each frame. Fig. 1 (a) shows some frames from a motion sequence throwfar. Since the original coordinates depend on the performer and on the space the performer occupies, the coordinates are not directly comparable between different performers even if they all perform the same action. Hence, we normalize the orientation such that each and every skeleton has its root at the origin with the identity orientation matrix. For example, Fig. 1 (b) shows the orientation-normalized versions of the skeletons in Fig. 1 (a). We further normalize the lengths of all limb segments between connected joints to unit length to make them independent of the performer. The concatenation of the 3D coordinates of the joints forms the first feature (PO), which describes the relative relationships among the joints.
Some actions are similar to each other at the frame level. For instance, the actions of standing up and sitting down are simply reverses of each other in time, with almost identical frames, which results in almost identical PO features for the corresponding frames of those actions. Hence, we compute the temporal difference (TD) between pairs of PO features by
$$\mathbf{f}^{\text{TD}}_t = \mathbf{f}^{\text{PO}}_t - \mathbf{f}^{\text{PO}}_{t - \Delta t},$$

where $\mathbf{f}^{\text{PO}}_t$ and $\Delta t$ are the PO feature vector at the $t$-th frame and the temporal offset, respectively. The TD feature preserves the temporal relationship of each joint with itself. The normalized trajectory (NT) extracts the absolute trajectory of the motion. Fig. 2 (a) shows two motions, walk in a left circle and walk in a right circle; in this figure, however, the trajectories of the left circle and the right circle are not distinguishable. In order to incorporate the trajectory information, we set the same orientation and starting position for the root in the first frame, and use the position of the root in every remaining frame of the motion sequence relative to the initial frame, normalized in each dimension. See Fig. 2 (b) for the effect of this transformation. The final feature for each frame is the concatenation of the three features. The dimension of the feature is $6J + 3$, where $J$ is the number of joints used in PO.
For skeletons extracted from an RGB-D sensor, the rotation matrices and translation vectors associated with the joints are often not available. In this case, any skeleton can be selected as a standard template frame, and the rotation matrix between each of the other skeletons and the template can be calculated as in [Chen and Koskela, 2013]. In a similar way, the features can be extracted from skeleton data containing only the 3D joint coordinates.
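The per-frame feature pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, array layout, and zero-padding of the first TD frames are our assumptions.

```python
import numpy as np

def extract_features(joints, root, delta=1):
    """Per-frame PO + TD + NT features from a motion sequence (sketch).

    joints : (T, J, 3) array of orientation-normalized joint coordinates
             (root at the origin, identity orientation, unit bone lengths).
    root   : (T, 3) array of the original root positions.
    delta  : temporal offset (in frames) for the TD feature.
    """
    T = joints.shape[0]
    # PO: relative joint positions, flattened per frame -> (T, 3J)
    po = joints.reshape(T, -1)
    # TD: temporal difference of PO features; the first `delta` frames
    # have no predecessor, so they are zero-padded here (an assumption)
    td = np.zeros_like(po)
    td[delta:] = po[delta:] - po[:-delta]
    # NT: root trajectory relative to the first frame, normalized per dimension
    traj = root - root[0]
    scale = np.abs(traj).max(axis=0)
    nt = traj / np.where(scale > 0, scale, 1.0)
    # final per-frame feature: concatenation of the three -> (T, 6J + 3)
    return np.concatenate([po, td, nt], axis=1)
```

With $J$ joints the result has $6J + 3$ columns per frame, matching the dimension given above.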
3 Deep Neural Networks: Multi-Layer Perceptrons
A multi-layer perceptron (MLP) is a type of deep neural network that is able to perform classification (see, e.g., [Haykin, 2009]). An MLP can approximate any smooth, nonlinear mapping from a high-dimensional sample to a class through multiple layers of hidden neurons.
The output, or prediction, of an MLP having $L$ hidden layers and $q$ output neurons, given a sample $\mathbf{x}$, is typically computed by

$$\mathbf{y}(\mathbf{x}) = \phi_{L+1}\!\left(\mathbf{W}_{L+1}^\top \phi_L\!\left(\mathbf{W}_L^\top \cdots \phi_1\!\left(\mathbf{W}_1^\top \mathbf{x}\right)\right)\right), \quad (1)$$

where $\phi_1, \ldots, \phi_{L+1}$ are component-wise nonlinear functions, and $\theta = \{\mathbf{W}_1, \ldots, \mathbf{W}_{L+1}\}$ is the set of parameters. We have omitted the biases without loss of generality. A logistic sigmoid function is usually used for the last nonlinear function $\phi_{L+1}$. Each output neuron corresponds to a single class.
Given a training set $\{(\mathbf{x}^{(n)}, c^{(n)})\}_{n=1}^{N}$, an MLP is trained to approximate the posterior probability of each output class given a sample by maximizing the log-likelihood

$$\mathcal{L}_C(\theta) = \sum_{n=1}^{N} \log y_{c^{(n)}}\!\left(\mathbf{x}^{(n)}\right), \quad (2)$$

where the subscript indicates the $c^{(n)}$-th component of the output. We omitted $\theta$ to keep the equation uncluttered. Training can be done efficiently by backpropagation [Rumelhart et al., 1986].
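The forward computation and the log-likelihood objective can be sketched as below. This is an illustrative sketch, not the paper's code: it uses rectified-linear hidden units (as in the experiments) and a softmax output in place of the per-neuron logistic sigmoid, so that the outputs form a normalized posterior.

```python
import numpy as np

def mlp_forward(x, weights):
    """Forward pass of an MLP; `weights` is a list of weight matrices
    (biases omitted, as in the text)."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(0.0, h @ W)      # component-wise nonlinearity (ReLU)
    z = h @ weights[-1]
    z = z - z.max()                     # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()                  # posterior over the output classes

def log_likelihood(samples, labels, weights):
    """Sum over samples of the log posterior of the correct class."""
    return sum(np.log(mlp_forward(x, weights)[c])
               for x, c in zip(samples, labels))
```

Maximizing `log_likelihood` by gradient ascent (backpropagation) recovers the training criterion of Eq. (2).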
3.1 Hybrid Multi-Layer Perceptron
It has been noticed by many that it is not trivial to train deep neural networks to achieve good generalization performance (see, e.g., [Bengio and LeCun, 2007] and references therein), especially when there are many hidden layers between the input and output layers. One promising hypothesis explaining this difficulty is that backpropagation applied to a deep MLP tends to utilize only a few top layers [Bengio et al., 2007]. A method of layer-wise pretraining has been proposed to overcome this problem by initializing the weights of the lower layers with unsupervised learning [Hinton and Salakhutdinov, 2006].
Here, we propose another strategy that forces the backpropagation algorithm to utilize the lower layers. The strategy trains an MLP to classify and reconstruct simultaneously: a deep autoencoder sharing the same set of parameters, except for the weights between the penultimate and output layers, is trained to reconstruct each input sample as well as possible.
A deep autoencoder is a symmetric feedforward neural network consisting of an encoder

$$\mathbf{h} = f_L\!\left(\mathbf{W}_L^\top \cdots f_1\!\left(\mathbf{W}_1^\top \mathbf{x}\right)\right) \quad (3)$$

and a decoder

$$\hat{\mathbf{x}} = g_1\!\left(\mathbf{W}_1 \cdots g_L\!\left(\mathbf{W}_L \mathbf{h}\right)\right), \quad (4)$$

where $f_1, \ldots, f_L$ and $g_1, \ldots, g_L$ are component-wise nonlinear functions. The parameters of a deep autoencoder are estimated by maximizing the negative squared difference $\mathcal{L}_R(\theta) = -\sum_{n=1}^{N} \| \mathbf{x}^{(n)} - \hat{\mathbf{x}}^{(n)} \|^2$ between the inputs and their reconstructions. The hybrid objective is defined to be

$$\mathcal{L}(\theta) = \mathcal{L}_C(\theta) + \lambda \mathcal{L}_R(\theta), \quad (5)$$

where $\lambda \geq 0$ is a hyperparameter. When $\lambda$ is $0$, the trained model is purely an MLP, while as $\lambda \to \infty$ it becomes an autoencoder. We call an MLP trained with this strategy and a non-zero $\lambda$ a hybrid MLP (a similar approach was proposed in [Larochelle and Bengio, 2008] in the case of restricted Boltzmann machines).
There are two advantages to the proposed strategy. First, the weights in the lower layers naturally have to be utilized, since those weights must be adapted to reconstruct the input samples well. This may further help achieve better classification accuracy, similarly to the way unsupervised layer-wise pretraining, which also optimizes the reconstruction error when autoencoders are used, helps obtain better classification accuracy on novel samples. Secondly, in this framework it is trivial to use a vast amount of unlabeled samples in addition to the labeled ones: if stochastic backpropagation is used, one can compute the gradient of the combined objective by combining the gradients of the classification and reconstruction terms, computed on separate sets of labeled and unlabeled samples.
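The hybrid objective above can be written compactly as follows. This is a sketch under our assumptions: `clf_logprob` and `reconstruct` stand in for the MLP's class log-probability and the tied-weight autoencoder's reconstruction, and the additive combination with weight `lam` mirrors Eq. (5).

```python
import numpy as np

def hybrid_objective(x, c, theta, lam, clf_logprob, reconstruct):
    """Hybrid training objective for one labeled sample (sketch).

    x           : input feature vector
    c           : class label
    theta       : shared model parameters
    lam         : reconstruction weight; lam = 0 gives a plain MLP
    clf_logprob : callable (x, c, theta) -> log p(c | x)
    reconstruct : callable (x, theta) -> reconstruction of x
    """
    l_c = clf_logprob(x, c, theta)                   # classification term
    l_r = -np.sum((x - reconstruct(x, theta)) ** 2)  # negative squared error
    return l_c + lam * l_r                           # maximized jointly
```

For an unlabeled sample only the `l_r` term is available, which is what makes semi-supervised training straightforward in this framework.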
4 Classifying an Action Sequence
An action sequence is composed of a certain number of frames. We use a multi-layer perceptron to model the posterior distribution over classes given each frame. Let $c_a$ be a binary indicator variable: if $c_a$ is one, the sequence belongs to the action $a$, and otherwise it belongs to another action. Since each sequence consists of $T$ frames, let us further define $c_a^{(t)}$ as a binary variable indicating whether the $t$-th frame belongs to the action $a$.
When a given sequence is of an action $a$, every frame in the sequence is also of the action $a$. In other words, if $c_a = 1$, then $c_a^{(t)} = 1$ for all $t = 1, \ldots, T$. So, we may check the joint probability of all frames in the sequence to determine the action of the sequence:

$$p\left(c_a = 1 \mid \mathbf{x}_1, \ldots, \mathbf{x}_T\right) = p\left(c_a^{(1)} = 1, \ldots, c_a^{(T)} = 1 \mid \mathbf{x}_1, \ldots, \mathbf{x}_T\right). \quad (6)$$

In this paper, we assume temporal independence among the frames of a single sequence, meaning that the class of each frame depends only on the features of that frame. Then, Eq. (6) can be simplified into

$$p\left(c_a = 1 \mid \mathbf{x}_1, \ldots, \mathbf{x}_T\right) = \prod_{t=1}^{T} p\left(c_a^{(t)} = 1 \mid \mathbf{x}_t\right). \quad (7)$$
With this assumption, the problem of gesture recognition is reduced to first training a classifier to perform frame-level classification, and then combining the outputs of the classifier according to Eq. (7). A multi-layer perceptron, which approximates the posterior probability distribution over classes by Eq. (2), is naturally suited to this approach.
5 Experiments

In the experiments, we evaluated the performance of the proposed recognition system on a public dataset. We assessed the performance of the deep neural networks, including the regular and hybrid MLPs, by comparing them against extreme learning machines (ELM) [Huang et al., 2006] and support vector machines (SVM). The effectiveness of the feature set was evaluated by the classification accuracy and by visualization in 2D space with deep autoencoders.
The Motion Capture Database HDM05 [Müller et al., 2007] is a well-organized, large MOCAP dataset. It provides a set of short MOCAP clips, each containing one integral motion. In the original dataset there are 130 gesture classes. However, some gestures essentially belong to a single class; for instance, walk 2 steps and walk 4 steps both belong to a single action walk. Hence, we combined some of the classes based on the following rules:
- Motions repeated a different number of times are combined into one action.
- Motions differing only in the starting limb are combined into one action.
After the reorganization, the whole dataset, consisting of 2,337 motion sequences and 613,377 frames, is divided into 65 actions (see the appendix for the complete list).
We used 10-fold cross-validation to assess the performance of each classifier; the data was randomly split into 10 balanced partitions of sequences. The PO feature was formed from five joints: the head, the two hands and the two feet. The temporal offset $\Delta t$ in TD was set as a fixed interval, in seconds, between frames. The total dimension of the feature vector is thus $6 \cdot 5 + 3 = 33$. To test the distinctiveness of the features, we reported the classification accuracy for each frame, and evaluated the system performance by the accuracy for each sequence. The standard deviations over the 10 folds were also calculated.
We trained deep neural networks having two hidden layers of sizes 1000 and 500 with rectified linear units, whose activation is $\max(0, x)$ for an input $x$. The learning rate was selected automatically by the recently proposed ADADELTA method [Zeiler, 2012]. Usually, the optimal $\lambda$ can be selected by a grid search on a validation set; to illustrate the influence of $\lambda$ on the hybrid MLP, we selected four different values for $\lambda$. The parameters were simply initialized randomly, and no pretraining strategy was used. We used a publicly available Matlab toolbox, deepmat, for training and evaluating the deep neural networks: https://github.com/kyunghyuncho/deepmat.
When a tested classifier outputs the posterior probability of a class given a frame, we chose the class of a sequence by

$$\hat{a} = \arg\max_a \prod_{t=1}^{T} p\left(c_a^{(t)} = 1 \mid \mathbf{x}_t\right),$$

based on Eq. (7). If a classifier does not return a probability but only the chosen class, we used simple majority voting.
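Both decision rules can be sketched in a few lines; the function name and the small constant added before taking logarithms (to avoid `log(0)`) are our choices, not the paper's.

```python
import numpy as np

def classify_sequence(frame_posteriors, hard_decisions=False):
    """Combine per-frame outputs into one sequence label (sketch).

    frame_posteriors : (T, A) array; row t is the classifier's posterior
                       over the A actions for frame t.
    hard_decisions   : if True, the classifier is treated as returning
                       only per-frame class choices, and majority voting
                       is used instead of the product rule.
    """
    if hard_decisions:
        votes = frame_posteriors.argmax(axis=1)
        return int(np.bincount(votes).argmax())
    # product over frames of Eq. (7), computed in the log domain
    log_joint = np.log(frame_posteriors + 1e-12).sum(axis=0)
    return int(log_joint.argmax())
```

Summing log-probabilities is numerically safer than multiplying hundreds of per-frame posteriors directly, while yielding the same argmax.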
5.3 Qualitative Analysis: Visualization
In order to better understand what a deep neural network learns from the features, we visualized the features using a deep autoencoder with two linear neurons in the middle layer [Hinton and Salakhutdinov, 2006]. The deep autoencoder had three hidden layers of sizes 1000, 500 and 100 between the input and middle layers. It should be noted that no label information was used to train these deep autoencoders. In the experiment, we trained two deep autoencoders, with and without the normalized trajectories (NT), to see what the relative features (PO+TD) provide to the system and what the impact of the absolute feature is. Since PCA has often been used for dimensionality reduction of motion features in previous works, we also visualized the features using the two leading principal components.
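The PCA baseline used for comparison can be sketched as below; the function name is ours, and the two leading principal components are obtained from an SVD of the centered data.

```python
import numpy as np

def pca_2d(features):
    """Project per-frame features onto the two leading principal
    components (the PCA baseline for 2D visualization)."""
    x = features - features.mean(axis=0)          # center the data
    # rows of vt are principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T                           # (T, 2) embedding
```

The autoencoder plays the same role as `pca_2d` but replaces the two linear projections with a deep nonlinear encoder, which is what allows it to separate classes that PCA cannot.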
In Fig. 3, we visualized three distinct, but very similar, actions: rotateArmsRBackward, rotateArmsBothBackward and rotateArmsLBackward. The actions in the figure are clearly distinguishable when the deep autoencoder is used. However, rotateArmsRBackward and rotateArmsLBackward are not distinguishable at all when only the PO and TD features are used with PCA (see Fig. 3 (c)). Even when all three features (PO+TD+NT) are used, the PCA visualization does not help distinguish these actions clearly.
In Fig. 4, two actions, jogLeftCircle and jogRightCircle, were visualized. When only PO and TD features were used, neither the deep autoencoder nor PCA was able to capture differences between those actions. However, the deep autoencoder was able to distinguish those actions clearly when all three proposed features were used (see Fig. 4 (b)).
The former visualization shows that a deep neural network with multiple nonlinear hidden layers can learn a more discriminative structure of the data. Furthermore, the latter visualization shows that the normalized trajectories help distinguish locomotions with different traces, but only with a model as powerful as a deep neural network. Through these experiments we can see that deep neural networks are able to learn highly discriminative information from our motion features.
5.4 Quantitative Analysis: Recognition
Table 1. Frame-level (top two rows) and sequence-level (bottom two rows) classification accuracies, with standard deviations in parentheses; the columns are ELM, SVM, the regular MLP and the hybrid MLP with three non-zero settings of $\lambda$.

|Features|ELM|SVM|MLP|hybrid MLP ($\lambda_1$)|hybrid MLP ($\lambda_2$)|hybrid MLP ($\lambda_3$)|
|PO+TD|70.40% (1.32)|83.82% (0.79)|84.35% (0.91)|84.39% (0.87)|84.57% (1.56)|84.23% (1.27)|
|PO+TD+NT|74.28% (1.56)|87.06% (0.82)|87.42% (1.43)|87.96% (1.38)|87.34% (0.66)|87.28% (1.38)|
|PO+TD|91.57% (0.88)|94.95% (0.82)|95.20% (1.38)|95.46% (0.99)|95.59% (0.76)|95.55% (1.14)|
|PO+TD+NT|92.76% (1.53)|95.12% (0.58)|94.86% (0.99)|95.21% (0.86)|94.82% (1.17)|95.04% (0.86)|
Tab. 1 shows the frame-level accuracies obtained by the various classifiers with the two different feature sets. We can see that the NT feature clearly increases the frame-level classification accuracy, by around 3–4%, for all classifiers. Comparing the classifiers, the MLPs obtained significantly higher accuracies than the ELM and performed slightly better than the SVM. Furthermore, although the difference is not clearly statistically significant, a hybrid MLP often outperforms the regular MLP with the right choice of $\lambda$.
A similar trend of the MLPs outperforming the other classifiers can be observed in the sequence-level performance, also shown in Tab. 1. Again in the sequence-level classification, the hybrid MLP with the right choice of $\lambda$ marginally outperformed the regular MLP, and it also outperformed the SVM and ELM. For both the frame-level and the sequence-level accuracies, the highest accuracy was obtained by a hybrid MLP, for the PO+TD features as well as for the whole feature set.
However, the sequence-level accuracies obtained using the two feature sets (PO+TD vs. PO+TD+NT) are very close to each other. Compared to the clear difference in the frame-level classification, the differences between the two sets at the sequence level are within the standard deviations. Even though the NT feature increases the frame-level accuracy significantly, it did not have the same effect at the sequence level. One potential reason is that once a certain level of frame-level recognition is achieved, the sequence-level performance of our posterior probability model saturates.
6 Conclusions

In this paper, we proposed a gesture recognition system using multi-layer perceptrons to recognize motion sequences, with novel features based on relative joint positions (PO), temporal differences (TD) and normalized trajectories (NT).
The experiments on a large motion capture dataset (HDM05) revealed that the (hybrid) multi-layer perceptrons achieved higher recognition rates than the other classifiers, reaching an accuracy above 95% on 65 classes. Furthermore, visualizing the feature sets of the motion sequences with deep autoencoders showed the effectiveness of the proposed features and enabled us to study what the deep neural networks learned. A powerful model such as a deep neural network, combined with an informative feature set, was able to capture the discriminative structure of the motion sequences, which was confirmed by both the recognition and the visualization experiments. This suggests that a deep neural network is able to extract highly discriminative features from motion data.
One limitation of our approach is that temporal independence was assumed when combining the per-frame posterior probabilities of a sequence. In the future, it will be interesting to investigate ways of modeling the temporal dependence.
This work was funded by Aalto MIDE programme (project UI-ART), Multimodally grounded language technology (254104) and Finnish Center of Excellence in Computational Inference Research COIN (251170) of the Academy of Finland.
- [Barnachon et al., 2013] Barnachon, M., Bouakaz, S., Boufama, B., and Guillou, E. (2013). A real-time system for motion retrieval and interpretation. Pattern Recognition Letters.
- [Bengio et al., 2007] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA.
- [Bengio and LeCun, 2007] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., editors, Large Scale Kernel Machines. MIT Press.
- [Chen and Koskela, 2013] Chen, X. and Koskela, M. (2013). Classification of RGB-D and motion capture sequences using extreme learning machine. In Proceedings of the 18th Scandinavian Conference on Image Analysis.
- [Chung and Yang, 2013] Chung, H. and Yang, H.-D. (2013). Conditional random field-based gesture recognition with depth information. Optical Engineering, 52(1):017201–017201.
- [Haykin, 2009] Haykin, S. (2009). Neural Networks and Learning Machines. Pearson Education, 3rd edition.
- [Hinton and Salakhutdinov, 2006] Hinton, G. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
- [Huang et al., 2006] Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70(1–3):489–501.
- [Larochelle and Bengio, 2008] Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pages 536–543, New York, NY, USA. ACM.
- [Müller and Röder, 2006] Müller, M. and Röder, T. (2006). Motion templates for automatic classification and retrieval of motion capture data. In Proceedings of the Eurographics/ACM SIGGRAPH Symposium on Computer Animation, volume 2, pages 137–146, Vienna, Austria.
- [Müller et al., 2007] Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., and Weber, A. (2007). Documentation mocap database HDM05. Technical Report CG-2007-2, U. Bonn.
- [Ofli et al., 2013] Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., and Bajcsy, R. (2013). Berkeley mhad: A comprehensive multimodal human action database. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 53–60.
- [Raptis et al., 2011] Raptis, M., Kirovski, D., and Hoppe, H. (2011). Real-time classification of dance gestures from skeleton animation. In Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 147–156. ACM.
- [Raptis et al., 2008] Raptis, M., Wnuk, K., Soatto, S., et al. (2008). Flexible dictionaries for action classification. In Proc. MLVMA’08.
- [Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(Oct):533–536.
- [Shotton et al., 2011] Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In Proc. Computer Vision and Pattern Recognition.
- [Vieira et al., 2012] Vieira, A., Lewiner, T., Schwartz, W., and Campos, M. (2012). Distance matrices as invariant features for classifying MoCap data. In 21st International Conference on Pattern Recognition (ICPR), Tsukuba, Japan.
- [Wang et al., 2012] Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1290–1297. IEEE.
- [Zeiler, 2012] Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701 [cs.LG].
- [Zhao et al., 2013] Zhao, X., Li, X., Pang, C., and Wang, S. (2013). Human action recognition based on semi-supervised discriminant analysis with global constraint. Neurocomputing, 105:45–50.
APPENDIX: 65 Actions in HDM05 Dataset