Action recognition by learning pose representations

08/02/2017 ∙ by Alessia Saggese, et al. ∙ University of Groningen ∙ University of Salerno

Pose detection is one of the fundamental steps for the recognition of human actions. In this paper we propose a novel trainable detector for recognizing human poses based on the analysis of the skeleton. The main idea is that a skeleton pose can be described by the spatial arrangement of its joints. Starting from this consideration, we propose a trainable pose detector that can be configured on a prototype skeleton in an automatic configuration process. The result of the configuration is a model of the position of the joints in the concerned skeleton. In the application phase, the joint positions contained in the model are compared with the ones of their homologous joints in the skeleton under test. The similarity of two skeletons is computed as a combination of the position scores achieved by homologous joints. We also describe an action classification method that uses the proposed trainable detectors to extract features from the skeletons. We performed experiments on the publicly available MSRDA data set and the achieved results confirm the effectiveness of the proposed approach.






1 Introduction

Automatic estimation of poses and recognition of actions are of great interest to the computer vision research community because of their various applications: consider, for example, assisted living in robotics or intelligent surveillance systems, where action recognition is a key step for the evaluation and classification of people's behaviors.


Comprehensive reviews of methods for the recognition of human poses and actions were published in past years [18, 1]. Although it is not possible to define a strict partition of the existing methods, two categories are typically considered, depending on the type of representation of the human pose or action: methods based on local descriptors and methods based on global descriptors.

In the case of methods based on local descriptors, image patches are analyzed in a bottom-up approach. Usually, the descriptors computed on these local patches are combined in a hierarchical architecture, e.g. the bag of features approach [8, 13, 14]. One of the main advantages of local descriptors is that they can be computed on the whole image, without the need to perform preliminary detection and tracking of the objects. This approach can therefore also be employed in crowded scenes. However, this advantage comes at the cost of the high effort required for the extraction of the descriptors. Furthermore, the accuracy of methods based on local descriptors strongly depends on the number as well as on the reliability of the interest points detected by the algorithms.

In the second group, global descriptors are based on a top-down approach. The subjects of interest are detected and tracked by using traditional background subtraction and updating algorithms, and features derived from optical flow, skeletons, silhouettes or edges are extracted [7, 26, 5, 16]. In this case, only information related to the movement and to the pose of the human is taken into account. The higher the accuracy in detecting the human subject, the better the performance of these approaches. For these reasons, several methods based on global descriptors have been proposed in recent years.

One of the milestones in the research on action recognition was the work of Johansson et al. [11], which demonstrated that several human actions can be recognized by only looking at the position of the limb joints. In their experiments, these joints were marked by light-point sources and human observers could recognize actions without any additional information. Nowadays, several devices, such as the Microsoft Kinect and MOCAP systems, make it possible to reliably extract the skeleton and the limb joint points of people moving in a scene. The skeleton is a set of rigid segments representing the bones, connected to each other by joints that correspond to articulations. A particular configuration of the skeleton, expressed in terms of the spatial arrangement of joints, can be considered a pose. The temporal evolution of a set of consecutive poses can be associated with a specific action. Within this framework, the theoretical and experimental studies conducted by Johansson et al. are very important because they demonstrated that the analysis of the human skeleton is sufficient in many situations to recognize the poses and thus the actions performed by a human subject.

The importance of skeleton information has been further confirmed by a more recent experiment [31], aimed at evaluating the discriminant power of different representations based on skeletons and on more traditional low-level appearance features. Indeed, it has been demonstrated that pose features outperform low-level appearance features, even when skeletons are highly corrupted by noise. Furthermore, although skeleton extraction typically requires a computationally intensive pre-processing step, the resulting representation is less sensitive to intra-class variations. These are the main reasons why several methods for action recognition based on skeleton analysis have been proposed in recent years [25, 6, 29, 20, 19].

The problem of recognizing human actions by analyzing the skeleton can be approached at two levels: a) given two skeleton poses, evaluate their similarity; b) given two sequences of poses, encode such sequences and evaluate their similarity.

In this paper we focus on the construction of effective representations of the skeleton poses that we then employ as feature extractors in a method for recognizing human actions. Our main idea is based on the fact that a pose is described by a particular spatial arrangement of joint positions. We propose trainable features that learn a model of a given prototype skeleton pose in an automatic configuration process. The features that we propose can be used to evaluate the similarity of a skeleton pose to the ones used for training.

The concept of trainable features was originally introduced in [2], where COSFIRE filters were proposed for object recognition and keypoint detection. The trainable character of COSFIRE features lies in the fact that their structure is not fixed in the implementation but is rather learned in an automatic configuration process performed on given prototype samples. The automatic configuration of features avoids the need to design a set of hand-crafted features to transform the raw data into a suitable representation or feature vector to be used in combination with a classifier system. This is a kind of representation learning, where the important characteristics of the patterns of interest are directly learned from training samples. Trainable features derived from COSFIRE have been successfully applied to various problems in image processing, such as contour detection [3], delineation of blood vessels in medical images [4, 22] and color-object detection [10], and have been adapted to audio event detection [24].

In this paper we propose a trainable pose detector that is automatically configured by modeling the position of the joints with respect to a reference point in a given prototype skeleton (we consider the body barycenter as reference point). In the application phase, we compute the response of a pose detector by combining a score value that is computed for each joint in the model. The score of a joint indicates the similarity of the position of such joint with the one of its homologous in the prototype skeleton. The proposed pose detector introduces tolerance in the position of each joint in order to account for deformations of the skeleton with respect to the ones configured in the model.

In the proposed method, we configure a number of pose detectors from training samples and use them as feature extractors to construct a feature vector that describes the pose of a skeleton. The vectors constructed in this way can then be used to train any classifier. In our experiments, we employed a multi-class Support Vector Machine (SVM). We carried out experiments on the publicly available MSRDA data set and obtained results comparable to the ones achieved by existing methods based on skeleton pose analysis.

The paper is organized as follows: in Section 2 we provide details about the proposed pose feature detector and the classification method; in Section 3 we report the experimental results that we achieved and compare them with the ones obtained by existing methods; we draw conclusions in Section 4.

2 Method

The basic idea underlying the proposed approach is to automatically learn a representation of skeleton poses by modeling the spatial arrangement of skeleton joints with respect to a reference point (in this work we consider the barycenter of the body). Two skeletons can be considered similar if the relative positions of their homologous joints are similar. Hence, computing the similarity between two skeletons can be formulated as computing the similarity between the relative positions of their homologous joints. We propose trainable skeleton pose detectors, inspired by the principles of trainable COSFIRE filters for representation learning in pattern recognition. The proposed detectors are trained in an automatic configuration process performed on a given prototype skeleton, and in the application phase they are able to detect the same pattern and modified versions of it. In the following, we provide details of the configuration and application phases.

2.1 Configuration

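Given a prototype skeleton $P = \{(i, x_i, y_i) \mid i = 1, \dots, n\}$, the proposed pose detector learns a model of the position of its joints with respect to a reference point $(x_r, y_r)$ in a configuration process. The value $i$ is an identifier of the limb joint point that corresponds to the skeleton joint point $(x_i, y_i)$, while $n$ is the total number of joints in the prototype skeleton.

We construct a model $M_P = \{(\rho_i, \phi_i, \omega_i) \mid i = 1, \dots, n\}$ of the prototype skeleton where each joint is described by a 3-tuple $(\rho_i, \phi_i, \omega_i)$. The values $\rho_i$ and $\phi_i$ are the polar coordinates of the position of the $i$-th joint with respect to the reference point $(x_r, y_r)$:

$$\rho_i = \sqrt{(x_i - x_r)^2 + (y_i - y_r)^2}, \qquad \phi_i = \operatorname{atan2}(y_i - y_r,\; x_i - x_r)$$

The value $\omega_i$ is a parameter and represents the weight assigned to the joint $i$ as a measure of its importance in the concerned action.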
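The configuration step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `configure_detector`, the dictionary-based joint representation, the 2D coordinates and the default weight of 1.0 are all hypothetical choices made for the example.

```python
import math

def configure_detector(joints, reference, weights=None):
    """Build a model of a prototype skeleton (illustrative sketch).

    joints    -- dict mapping a joint identifier to its (x, y) position
    reference -- (x, y) position of the reference point (e.g. the barycenter)
    weights   -- optional dict of per-joint weights; unlisted joints get 1.0
    Returns a dict: joint id -> (rho, phi, weight), i.e. the polar coordinates
    of the joint relative to the reference point plus the joint weight.
    """
    weights = weights or {}
    xr, yr = reference
    model = {}
    for joint_id, (x, y) in joints.items():
        rho = math.hypot(x - xr, y - yr)   # radial distance from the reference
        phi = math.atan2(y - yr, x - xr)   # polar angle in radians
        model[joint_id] = (rho, phi, weights.get(joint_id, 1.0))
    return model

# Hypothetical toy skeleton with three joints; the foot is given weight 0.
prototype = {"head": (0.0, 2.0), "right_hand": (1.0, 1.0), "left_foot": (-0.5, -2.0)}
model = configure_detector(prototype, reference=(0.0, 0.0), weights={"left_foot": 0.0})
```

Storing the model as a dictionary keyed by joint identifier makes it straightforward to look up the homologous joint of a test skeleton in the application phase.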

We divide the skeleton into two parts, corresponding to the upper and lower parts of the body, in order to increase the selectivity of the configured detectors for actions that involve only a specific part of the body. As an example, the action of waving hands involves only the upper part of the body, regardless of whether the subject is sitting or standing up. In such a case, the weights of the joints in the lower body part are set to $0$, so that they do not contribute to the configuration phase and to the computation of the skeleton pose. This procedure helps to reduce the intra-class variations (e.g. waving a hand while sitting and while standing can be considered the same action although part of the skeleton pose is very different) and to increase the selectivity of the configured filters for specific poses.

2.2 Detection of skeleton similarity

We compute the similarity of a test skeleton to a prototype one by combining a score for each joint in the configured model. The value of the score of a joint depends on the position that it has in the test skeleton when compared to the position of the homologous joint in the model. In practice, we compare the positions relative to the reference point in the prototype and test skeletons, weighted with a Gaussian function, which allows for tolerance to spatial deformations. Formally, the score $s_i$ of the joint $i$ in the test skeleton with respect to its homologous in the prototype skeleton is computed as:

$$s_i = \exp\left(-\frac{d_i^2}{2\sigma_i^2}\right)$$

where $d_i$ is the Euclidean distance between the position of the joint $i$ in the test skeleton and that of its homologous in the prototype skeleton. The value $\sigma_i$ is the standard deviation of the Gaussian weighting function computed for the $i$-th joint. It is important to highlight that $\sigma_i$ regulates the tolerance to the position of the $i$-th joint with respect to the position of its homologous in the model determined in the configuration process. The value of $\sigma_i$ is a linear function of the skeletal distance $D_i$ between the position of the reference point and the one of the joint $i$:

$$\sigma_i = \sigma_0 + \alpha D_i$$

The value of $\sigma_i$ increases with the skeletal distance of the $i$-th joint point from the reference point (i.e. the barycenter of the body). This determines a larger tolerance in the position of further joints and is in accordance with the fact that terminal joints have more mobility than joints closer to the body. The distance $D_i$ is computed as the sum of the lengths of the segments that connect the reference point and the joint $i$ (see Fig. 4(a)). The parameters $\sigma_0$ and $\alpha$ regulate the amount of tolerance with respect to deformation from the prototype pattern in the application phase. The weighting function that we considered to account for tolerance in the position of the skeleton joints contributes to robustness to deformations of the prototype pattern. This property provides generalization capabilities to the proposed pose detector, which responds strongly to the same skeleton pattern used for configuration but also to similar (or deformed) versions of it.
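The per-joint score and its distance-dependent tolerance can be illustrated with a short Python sketch. The Gaussian form exp(-d²/(2σ²)), the parameter names `sigma0` and `alpha`, and all numeric values below are assumptions made for illustration; they are not taken from the experiments reported here.

```python
import math

def joint_tolerance(skeletal_distance, sigma0, alpha):
    """Standard deviation of the Gaussian weighting, linear in skeletal distance."""
    return sigma0 + alpha * skeletal_distance

def joint_score(test_pos, proto_pos, sigma):
    """Gaussian-weighted similarity between a joint and its homologous joint.

    Both positions are (x, y) tuples expressed relative to the reference point.
    """
    d = math.dist(test_pos, proto_pos)  # Euclidean distance
    return math.exp(-(d ** 2) / (2.0 * sigma ** 2))

# Hypothetical values: a hand far from the barycenter, a hip close to it.
sigma_hand = joint_tolerance(skeletal_distance=3.0, sigma0=0.5, alpha=0.1)
sigma_hip = joint_tolerance(skeletal_distance=0.5, sigma0=0.5, alpha=0.1)

# The same displacement of 0.2 is penalized less for the terminal joint.
score_hand = joint_score((1.2, 1.0), (1.0, 1.0), sigma_hand)
score_hip = joint_score((0.4, 0.0), (0.2, 0.0), sigma_hip)
```

With these assumed values the hand receives the higher score, which reflects the larger tolerance granted to joints that are further from the reference point along the skeleton.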

Figure 4: A prototype skeleton with examples of (a) skeletal distance between the reference point (green spot) and the joint that corresponds to the position of the right hand (in blue); the (b) considered joint positions (green lines) to configure a model of the upper part of the body; the (c) Gaussian weighting functions introduced to achieve tolerance to variations of the position of the joints in the configured model.

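We compute the similarity between a test skeleton $T$ and a prototype skeleton $P$ as the combination of the measures of similarity between the positions of homologous joints in $T$ and $P$. It is formally defined as the weighted geometric mean of the scores $s_i$ of the joint position similarity, with the joint weights $\omega_i$ as exponents:

$$r_P(T) = \left( \prod_{i=1}^{n} s_i^{\omega_i} \right)^{1 / \sum_{i=1}^{n} \omega_i}$$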
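A weighted geometric mean of joint scores can be computed as in the sketch below. The function name `detector_response` and the handling of degenerate cases (all weights equal to zero, or a zero score on a weighted joint) are assumptions of this example; the computation is carried out in the logarithmic domain for numerical stability.

```python
import math

def detector_response(scores, weights):
    """Weighted geometric mean of per-joint scores (illustrative sketch).

    Joints with weight 0 do not influence the result. Returns 0.0 when all
    weights are 0 or when a joint with positive weight has a score of 0.
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    log_sum = 0.0
    for s, w in zip(scores, weights):
        if w == 0:
            continue              # excluded joint, e.g. lower body part
        if s <= 0.0:
            return 0.0            # a zero score annihilates the product
        log_sum += w * math.log(s)
    return math.exp(log_sum / total)

# Hypothetical scores; the third joint is ignored because its weight is 0.
response = detector_response([0.9, 0.4, 0.1], [1.0, 1.0, 0.0])
```

Because the mean is geometric, a single joint that is far from its expected position lowers the response markedly, which makes the detector selective for the whole spatial arrangement rather than for individual joints.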

Figure 9: A prototype filter (a) and the responses to this filter of different skeletons (b,c,d). The blue lines represent the Euclidean distance between the joints and their homologues in the prototype.

A few examples of the application of the proposed pose detector are shown in Fig. 9, where the responses of the three skeletons in Fig. 9(b), Fig. 9(c) and Fig. 9(d) with respect to the prototype skeleton in Fig. 9(a) are shown. Note that, as expected, the response achieved by the skeleton in Fig. 9(b) is higher than the responses achieved by the skeletons in Fig. 9(c) and Fig. 9(d).

2.3 Robustness to reflection and scale

In order to achieve robustness with respect to scale and reflection transformations of the skeletons, we introduce and apply modified versions of the configured detectors. Examples are shown in Fig. 12. For each joint in the model, described by a tuple $(\rho_i, \phi_i, \omega_i)$ of polar radius, polar angle and weight, we also apply a reflected version (see Fig. 12(b)), by modifying the original tuples as follows:

$$(\rho_i, \phi_i, \omega_i) \mapsto (\rho_i, \pi - \phi_i, \omega_i)$$

The modified angle value allows us to identify those actions that can be performed in a symmetric way. As an example, a subject can perform the action of drinking with either the right or the left hand. The two actions are, however, symmetric with respect to the central axis of the body.

In order to also account for the correct detection of actions performed at various distances from the camera, we apply a scaled version of the detector (see Fig. 12(a)) by modifying the tuples in the original models as follows:

$$(\rho_i, \phi_i, \omega_i) \mapsto (\nu \rho_i, \phi_i, \omega_i)$$

The factor $\nu$ introduces tolerance to actions that are performed at different distances from the camera. In our experiments, we used two values of $\nu$: one for actions performed at a closer distance and one for actions performed at a further distance.
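The two transformations can be sketched in Python on a model stored as a dictionary of (radius, angle, weight) tuples. The representation, the function names, the reflection rule (angle mapped to π minus angle, i.e. mirroring about the vertical axis through the reference point) and the scale factors 0.8 and 1.2 are assumptions made for this example only.

```python
import math

def reflect_model(model):
    """Mirror every joint about the vertical axis through the reference point."""
    return {j: (rho, math.pi - phi, w) for j, (rho, phi, w) in model.items()}

def scale_model(model, factor):
    """Multiply the radial distance of every joint by a scale factor."""
    return {j: (rho * factor, phi, w) for j, (rho, phi, w) in model.items()}

# Hypothetical one-joint model: radius 2.0, angle 30 degrees, weight 1.0.
model = {"right_hand": (2.0, math.pi / 6, 1.0)}

reflected = reflect_model(model)
closer = scale_model(model, 1.2)    # assumed factor, subject closer to camera
further = scale_model(model, 0.8)   # assumed factor, subject further away

# Reflection flips the sign of the x coordinate and keeps the y coordinate.
rho, phi, _ = reflected["right_hand"]
x_reflected = rho * math.cos(phi)
y_reflected = rho * math.sin(phi)
```

Since both transformations act on the configured model rather than on the test skeleton, the additional detectors can be derived once, right after configuration, without any further training samples.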

Figure 12: Examples of (a) scaled and (b) reflected versions of the proposed pose detector. The green lines represent the positions of the transformed joint points in the scale- and reflection-tolerant versions of the detector.

2.4 Classification

We configure a number of pose detectors for each of the actions of interest and use each configured detector as a skeleton feature extractor. We thus construct a feature vector with one element per configured detector, whose values are computed as the responses of the configured detectors when applied to a test skeleton. The constructed feature vectors are used in combination with a classifier to associate a pose with a specific class of interest.

We employ a multi-class SVM classifier by combining the output of a pool of linear SVM classifiers, each one trained to recognize poses of a specific class of interest. We train the $j$-th SVM using as positive examples the samples from the class $j$ and as negative examples the samples from all the other classes (one-vs-all scheme). In the classification step, each classifier assigns a score to the skeleton under test and we select the class that corresponds to the classifier with the highest score. If all the classifiers compute a negative score, we assign the skeleton under test to the background class, i.e. the pose is not recognized.
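The decision rule of the one-vs-all scheme with rejection can be sketched as follows. The sketch covers only the final decision, assuming that the signed scores of the individual SVMs have already been computed; the function name `classify_pose` and the label `"background"` are hypothetical.

```python
def classify_pose(svm_scores, background_label="background"):
    """One-vs-all decision with a reject option (illustrative sketch).

    svm_scores -- dict mapping each class label to the signed score that the
                  SVM of that class assigns to the skeleton under test
    Returns the label with the highest score, or the background label when
    every score is negative.
    """
    best_label = max(svm_scores, key=svm_scores.get)
    if svm_scores[best_label] < 0:
        return background_label   # no classifier accepts the pose
    return best_label

# Hypothetical scores for three classes.
accepted = classify_pose({"drink": 0.7, "eat": 0.2, "walk": -1.3})
rejected = classify_pose({"drink": -0.4, "eat": -0.1, "walk": -1.3})
```

Checking only the highest score is sufficient for the rejection test, because if the maximum is negative then all the scores are negative.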

The classifier trained in this way is able to recognize skeleton poses in single frames, without taking into account their temporal evolution. In our experiments, we perform the classification of poses at the level of single frames and subsequently aggregate the classification outputs at action level. We classify an action by majority voting on the frame-level decisions.
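Majority voting over frame-level decisions can be sketched in a few lines of Python. How ties are broken and whether frames assigned to the background class take part in the vote are not specified in the text; in this sketch, background frames are counted like any other label and ties are resolved in favor of the label encountered first, both of which are assumptions.

```python
from collections import Counter

def classify_action(frame_labels):
    """Aggregate frame-level pose labels into one action label by majority vote."""
    if not frame_labels:
        raise ValueError("at least one frame-level decision is required")
    votes = Counter(frame_labels)
    label, _count = votes.most_common(1)[0]
    return label

# Hypothetical frame-level decisions for one video sequence.
action = classify_action(["drink", "eat", "drink", "background", "drink"])
```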

3 Experimental results

3.1 Data sets

We performed experiments on a publicly available data set, namely the MSR Daily Activity 3D dataset (hereinafter MSRDA) [27].

The MSRDA data set is composed of 16 classes of actions, which can be considered daily actions: drink (B1), eat (B2), read book (B3), call cellphone (B4), write on a paper (B5), use laptop (B6), use vacuum cleaner (B7), cheer up (B8), sit still (B9), toss paper (B10), play game (B11), lay down on sofa (B12), walk (B13), play guitar (B14), stand up (B15), sit down (B16). Each action is performed by 10 people. In turn, each person repeats the actions twice, once while sitting on a sofa and once while standing. A few example frames are shown in Fig. 19. As stated by the data set authors, the position of the skeleton joints is noisy, which implies that several samples are corrupted and are difficult to recognize by approaches based only on the analysis of the skeleton. We report two examples of noisy skeletons, superimposed on the original images, in Fig. 19(c) and Fig. 19(f).

Figure 19: Example frames extracted from the MSRDA data set from the action classes (a) Eat and (b) Read Book together with their (c,d) corresponding skeletons. Examples of (e,f) noisy skeletons.

3.2 Results and discussion

We evaluate the performance of the proposed method for action recognition by computing the average recognition rate, the error rate and the miss rate. We compute the evaluation metrics by considering the classification outputs at action-level. The error and miss rates refer to samples that are classified to the wrong class and to the background class, respectively.
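The three evaluation metrics can be computed from action-level decisions as in the following sketch. The function name `evaluate` and the label `"background"` are hypothetical, and the rates are computed over all samples, which coincides with a per-class average only when the classes contain the same number of samples.

```python
def evaluate(true_labels, predicted_labels, background_label="background"):
    """Return (recognition rate, error rate, miss rate) over all samples.

    A sample counts as recognized when the predicted label is correct, as a
    miss when it is assigned to the background class, and as an error when
    it is assigned to a wrong class of interest.
    """
    n = len(true_labels)
    if n == 0 or n != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = errors = misses = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == truth:
            correct += 1
        elif predicted == background_label:
            misses += 1
        else:
            errors += 1
    return correct / n, errors / n, misses / n

# Hypothetical decisions for four action samples.
rates = evaluate(["eat", "eat", "walk", "walk"],
                 ["eat", "walk", "background", "walk"])
```

By construction the three rates sum to one, so any two of them determine the third.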

We achieved an average recognition rate of 0.64 on the MSRDA data set. We report detailed results, including the error and miss rates, for each class of actions in the MSRDA data set in Table 1. Classification errors are mainly due to the misclassification of the following action classes: read book (B3), write on a paper (B5), use laptop (B6) and play game (B11). This is due to the fact that these actions share part of the poses that constitute them, which makes them difficult to distinguish by using only the skeleton information. The actions stand up (B15), sit down (B16) and sit still (B9) are also subject to errors. In these cases, the absence of information related to the temporal sequence of poses prevents the proposed approach from taking correct decisions.

Table 1: Results achieved by the proposed method for each class in the MSRDA data set (columns: Recognition Rate, Error Rate, Miss Rate; one row per action class).

In Table 2 we compare the average recognition rate achieved by the proposed method with the ones obtained by other existing approaches on the MSRDA data set. Note that a direct comparison is possible only with those methods that do not take into account information on the temporal sequence of poses. Other methods, instead, improve the classification performance and the inter-class robustness by including information about the temporal evolution of poses in the classification models. The method that obtains the highest recognition rate by only analyzing the skeleton information is based on Joint Position [27]. The results obtained by the proposed method are comparable to the ones reported in [27].

Method                      | Reference | Accuracy
LOP                         | [27]      |
Depth Motion Maps           | [30]      |
Dynamic Temporal Warping    | [15]      | 0.54
3DTSD                       | [12]      | 0.55
Eigen Joints                | [28]      |
TSD                         | [12]      | 0.58
TSD+3DTSD                   | [12]      | 0.63
Proposed approach           | -         | 0.64
TSD+RTD                     | [12]      | 0.65
Joint Position              | [27]      |
Hierarchical Classification | [12]      | 0.72
HON4D                       | [17]      | 0.80
Actionlet Ensemble          | [27]      |
Table 2: Comparison of the results achieved by the proposed approach with the ones achieved by existing methods on the MSRDA data set.

One advantage of the proposed approach is its trainability. The selectivity of the pose detectors is not fixed in the implementation, but it is rather learned during an automatic configuration process, which is carried out on prototype patterns. In order to configure a limited set of detectors for a given action, we sampled the skeleton pose sequence by selecting only a confined number of poses which we used as prototypes for the configuration of a number of pose estimators.

The key aspect of this approach is that the features are not designed by the user but are learned from training samples by means of an automatic configuration procedure. Hence, trainable features avoid a step of feature engineering, which usually requires extensive domain knowledge to construct a set of hand-crafted features. The automatic configuration of features for pose detection that is part of the proposed method can be considered a kind of representation learning that, similarly to deep learning approaches, constructs suitable representations from training samples. In contrast with deep learning methods, the proposed trainable pose detectors do not require a large amount of data to be configured. A single detector requires only one training sample. However, the proposed approach learns representations that are less general than the ones learned with deep learning.

Most of the classification errors are attributable to the impossibility of distinguishing actions due to the lack of information about the temporal sequence of poses. In future work, the proposed pose detection method can be employed as a basic descriptor and coupled with a methodology for the analysis of temporal sequences. In this way, a higher-level classification approach would integrate, on a larger time scale, the classifications performed at frame level. An interesting way of improving the performance of the proposed method is to reduce the number of configured detectors and select only the most relevant ones, i.e. feature selection procedures can be applied as shown in similar works on the selection of relevant trainable features [21, 23]. Reducing the number of configured detectors requires fewer computational resources and makes it possible to configure a system for the detection of a larger number of classes. Further improvements of the proposed method concern an extensive sensitivity analysis, which is meant to assess the robustness of the proposed approach with respect to the values of its parameters. Furthermore, in light of the high discriminant capability on the skeleton data demonstrated by the proposed approach, a combination with other types of descriptors using different information (for instance extracted from the depth images), as well as the introduction of temporal information, will allow the problem of recognizing human actions to be dealt with more effectively.

4 Conclusions

In this paper we proposed a trainable feature detector for the recognition of human poses based on the analysis of the skeleton. The automatic configuration process of the proposed approach can be considered a kind of representation learning, where the features are learned directly from training data instead of being hand-crafted by an expert.

The proposed detectors are able to evaluate the spatial arrangement of the joints of a skeleton in comparison with a given prototype pattern of interest. The similarity between a prototype skeleton and a test skeleton is measured by taking into account some tolerance in the relative positions of the joints. This allows for generalization capabilities and robustness to distortion and noise.

The results that we achieved on a publicly available data set (average recognition rate of 0.64 on the MSRDA data set) confirm the effectiveness of the proposed method and are comparable with the ones obtained by existing methods based on skeleton analysis.


  • [1] Aggarwal, J., Ryoo, M.: Human activity analysis: A review. ACM Comput. Surv. 43(3), 16:1–16:43 (Apr 2011)
  • [2] Azzopardi, G., Petkov, N.: Trainable cosfire filters for keypoint detection and pattern recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35(2), 490–503 (Feb 2013)
  • [3] Azzopardi, G., Petkov, N.: A CORF computational model of a simple cell that relies on lgn input outperforms the gabor function model. Biological Cybernetics 106, 177–189 (2012)
  • [4] Azzopardi, G., Strisciuglio, N., Vento, M., Petkov, N.: Trainable COSFIRE filters for vessel delineation with application to retinal images. Medical Image Analysis 19(1), 46 – 57 (2015)
  • [5] Carletti, V., Foggia, P., Percannella, G., Saggese, A., Vento, M.: Recognition of human actions from rgb-d videos using a reject option. In: ICIAP 2013, vol. 8158, pp. 436–445 (2013)
  • [6] Chaudhry, R., Ofli, F., Kurillo, G., Bajcsy, R., Vidal, R.: Bio-inspired dynamic 3d discriminative skeletal features for human action recognition. In: IEEE CVPRW. pp. 471–478 (June 2013)
  • [7] Chen, Y., Wu, Q., He, X.: Human action recognition based on radon transform. In: Multimedia Analysis, Processing and Communications. vol. 346, pp. 369–389. Springer Berlin Heidelberg (2011)
  • [8] Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: IEEE Workshop on PETS. pp. 65–72 (2005)
  • [9] Foggia, P., Saggese, A., Strisciuglio, N., Vento, M.: Exploiting the deep learning paradigm for recognizing human actions. In: IEEE AVSS (2014)
  • [10] Gecer, B., Azzopardi, G., Petkov, N.: Color-blob-based COSFIRE filters for object recognition. Image and Vision Computing 57, 165 – 174 (2017)
  • [11] Johansson, G.: Visual perception of biological motion and a model for its analysis. Perception and Psychophysics 14(2), 201–211 (1973)
  • [12] Koperski, M., Bilinski, P., Bremond, F.: 3D Trajectories for Action Recognition. In: IEEE ICIP. Paris, France (Oct 2014)
  • [13] Kovashka, A., Grauman, K.: Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: IEEE CVPR. pp. 2046–2053 (June 2010)
  • [14] Lee, H., Morariu, V., Davis, L.: Robust pose features for action recognition. In: IEEE CVPRW. pp. 365–372 (June 2014)
  • [15] Müller, M., Röder, T.: Motion templates for automatic classification and retrieval of motion capture data. In: ACM SIGGRAPH. pp. 137–146 (2006)
  • [16] Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Sequence of the most informative joints (smij): A new representation for human skeletal action recognition. In: CVPRW 2012. pp. 8–13 (2012)
  • [17] Oreifej, O., Liu, Z.: Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences. In: IEEE CVPR. pp. 716–723 (June 2013)
  • [18] Poppe, R.: A survey on vision-based human action recognition. Image and Vision Computing 28(6), 976 – 990 (2010)
  • [19] Roudposhti, K.K., Nunes, U., Dias, J.: Probabilistic social behavior analysis by exploring body motion-based patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(8), 1679–1691 (Aug 2016)
  • [20] Roudposhti, K.K., Dias, J., Peixoto, P., Metsis, V., Nunes, U.: A multilevel body motion-based human activity analysis methodology. IEEE Transactions on Cognitive and Developmental Systems PP(99), 1–1 (2016)
  • [21] Strisciuglio, N., Azzopardi, G., Vento, M., Petkov, N.: Multiscale blood vessel delineation using B-COSFIRE filters. In: CAIP, LNCS, vol. 9257, pp. 300–312 (2015)
  • [22] Strisciuglio, N., Azzopardi, G., Vento, M., Petkov, N.: Unsupervised delineation of the vessel tree in retinal fundus images. In: VIPIMAGE, pp. 149–155 (2015)
  • [23] Strisciuglio, N., Azzopardi, G., Vento, M., Petkov, N.: Supervised vessel delineation in retinal fundus images with the automatic selection of B-COSFIRE filters. Machine Vision and Applications p. 1–13 (2016)
  • [24] Strisciuglio, N., Vento, M., Petkov, N.: Bio-inspired filters for audio analysis. In: BrainComp 2015, Revised Selected Papers, pp. 101–115 (2016)
  • [25] Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3d skeletons as points in a lie group. In: IEEE CVPR. pp. 588–595 (June 2014)
  • [26] Wang, Y., Huang, K., Tan, T.: Human activity recognition based on r transform. In: CVPR 2007. pp. 1–8 (June 2007)
  • [27] Wu, Y.: Mining actionlet ensemble for action recognition with depth cameras. In: IEEE CVPR. pp. 1290–1297. CVPR ’12 (2012)
  • [28] Yang, X., Tian, Y.: Eigenjoints-based action recognition using naive-bayes-nearest-neighbor. In: IEEE CVPRW. pp. 14–19 (June 2012)
  • [29] Yang, X., Tian, Y.: Effective 3d action recognition using eigenjoints. Journal of Visual Communication and Image Representation 25(1), 2 – 11 (2014), visual Understanding and Applications with RGB-D Cameras
  • [30] Yang, X., Zhang, C., Tian, Y.: Recognizing actions using depth motion maps-based histograms of oriented gradients. In: ACM International Conference on Multimedia. pp. 1057–1060. MM ’12, ACM (2012)
  • [31] Yao, A., Gall, J., Fanelli, G., Van Gool, L.J.: Does human action recognition benefit from pose estimation? In: BMVC (2011)