Learning discriminative trajectorylet detector sets for accurate skeleton-based action recognition

04/20/2015 ∙ by Ruizhi Qiao, et al. ∙ The University of Adelaide

The introduction of low-cost RGB-D sensors has promoted the research in skeleton-based human action recognition. Devising a representation suitable for characterising actions on the basis of noisy skeleton sequences remains a challenge, however. We here provide two insights into this challenge. First, we show that the discriminative information of a skeleton sequence usually resides in a short temporal interval and we propose a simple-but-effective local descriptor called trajectorylet to capture the static and kinematic information within this interval. Second, we further propose to encode each trajectorylet with a discriminative trajectorylet detector set which is selected from a large number of candidate detectors trained through exemplar-SVMs. The action-level representation is obtained by pooling trajectorylet encodings. Evaluating on standard datasets acquired from the Kinect sensor, it is demonstrated that our method obtains superior results over existing approaches under various experimental setups.







I Introduction

The recognition of human actions has been an active research field in recent years, and much effort has been made to address this problem [21]. Intuitively, a temporal sequence of 3D skeleton joint locations captures sufficient information to distinguish between actions, but recording skeleton sequences was very expensive with traditional motion capture technology, which limited the range of applications [13]. Recently, with the advent of RGB-D cameras such as Microsoft Kinect [6], the acquisition of 3D skeleton data for action recognition has become much easier and faster [20]. This advance has promoted a number of skeleton-based action recognition approaches [22, 5, 27]. The key challenge for these approaches is how to extract discriminative features from the noisy, temporally evolving skeletons.

Fig. 1: Skeleton sequences from two action classes. Only the red skeletons show significant differences between the two sequences. In this example, only a small fraction of the frames is required to tell whether the skeleton is clapping or waving.

The trajectory of skeletal joints in space-time is the most direct representation of human actions. Earlier works [28, 19] model human actions with variable-length trajectory descriptors and classify them based on similarity matching of trajectories. In [5], an action representation is encoded as a histogram voted by the displacements of joint trajectories with respect to their orientations. In these works, a global feature is extracted from the whole trajectory. However, only a short section of the trajectory is actually distinctive and can provide usable information about the action being undertaken. For example, as illustrated in Figure 1, only the moments when the subject moves its hands during the two actions of waving and clapping are indicative of the performed action class, while all the remaining poses are irrelevant and potentially distracting. The abundant non-informative local patterns may introduce large variance into the global trajectory. Compared to the global representation, later works [30, 31] explore discriminative patterns to create local descriptors at the frame level. Despite its robustness, a frame-level descriptor, without additional temporal information, hardly depicts the movement of actions, and is insufficient for recognition.

Different from the above-mentioned approaches which either represent an action with the whole sequence or extract local features at the frame level, we argue that the discriminative information of an action is better captured by a short interval of trajectories. This interval usually consists of several frames. In other words, its temporal range is longer than a single frame but much shorter than the whole skeleton sequence. To extract features from the trajectory interval, we make our first contribution by designing a novel local descriptor called trajectorylet to capture the static and dynamic motion information within the short interval.

Furthermore, as we have observed, not all trajectorylets in a sequence are equally important for classification, and recognition performance generally benefits from focusing on the discriminative ones. In skeleton-based action recognition, recent works [31, 3] directly learn the discriminative frames from the training set. Unlike the aforementioned works, our approach does not explicitly look for the discriminative trajectorylets, but rather provides a method for creating a set of detectors that fire on specific template trajectorylets. Our approach first applies exemplar-SVM [10] to learn a large number of candidate detectors and then selects detectors according to their discriminative performance over the trajectorylets in the training set. We further group the detectors into multiple clusters, and remove the redundancy of the learned detectors by selecting one representative detector from each cluster. The selected detectors form a template detector set, and their detection scores on a trajectorylet are utilized as the coding vector of that trajectorylet. The action-level representation is then obtained by pooling all trajectorylet coding vectors, and temporal pyramid pooling can also be incorporated to capture the long-range temporal information of the action sequence. In extensive experiments, this framework brings significant performance improvements over state-of-the-art approaches for skeleton-based action recognition.

In summary, our first contribution is the trajectorylet, a novel local descriptor that captures static and dynamic information in a short interval of joint trajectories. In our second contribution, a novel framework is proposed to generate robust and discriminative representation for action instances from a set of learned template trajectorylet detectors.

After briefly reviewing related literature in Section II, we present the design of our local feature and the detector learning method in Section III. We then present the action-level representation of an action instance in Section IV. Our framework is experimentally evaluated in Section V and summarized in Section VI.

II Related work

The key challenge of skeleton-based action recognition is how to construct the action representation from a sequence of skeletal joints. Some video-based methods [12, 24] extract trajectories of multiple tracking points and compute descriptors along them, such as HOG, HOF and MBH. For skeleton-based methods, trajectories are directly obtained from the space-time evolution of the skeletal joints. The most straightforward way is to model the trajectory holistically, either by extracting statistics from the sequence or by modelling its generative process. In [5], a histogram records the displacements of joint orientations over the whole trajectory. In [16], the action is modelled with pairwise affinities of trajectories of joint angles. In [29], the action sequence is modelled by a Hidden Markov Model with a quantized histogram of spherical coordinates of joint locations as the frame-level feature. In [22], 3D geometric relationships between various body parts are modelled with a Lie group to represent the whole action.

Besides directly modelling the trajectory holistically, it has also been noted that only a small fraction of patterns of a skeletal sequence are actually distinctive and thus many approaches have been proposed to identify those discriminative patterns, whether these patterns are defined spatially or temporally.

It has been found that not all skeletal joints are informative for distinguishing one action from the others, so it is beneficial to select a subset of joints. Ofli et al. [15] select a subset of the most informative joints according to criteria such as the mean or variance of joint angles. In [25], joints are grouped into actionlets, and the most discriminative collection of them is mined via a multiple kernel learning approach. In [2], a subset of joints within a short-time interval is extracted according to the spatio-temporal hierarchy of the moving skeleton, and a linear combination of them is learned via a discriminative metric learning approach. In [23], distinctive sets of body parts are mined from their co-occurring spatial and temporal configurations. In [1], an evolutionary algorithm is employed to select an optimal subset of joints for action representation, and classification is performed by DTW-based sequence matching.

As most of the frames in an action sequence are comprised of non-distinctive static poses, features at a few discriminative temporal locations are informative enough to represent an action. In video-based action recognition, a number of key frame selection approaches have been proposed. In [32], key frames are selected by ranking the conditional entropy of the codewords assigned to the frames. In [18], the locations of key frames are modelled as latent variables and estimated for each action instance by dynamic programming. In recent works on skeleton-based action recognition, distinctive canonical poses [3] are learned via logistic regression, and discriminative frames [31] are identified by their approximated confidence of belonging to a specific action class. In [30], the distinctiveness of each frame is calculated by a measure of accumulated motion energy.

III The proposed action representation

Our model utilizes the relationships between the positions of the skeletal joints in the current and preceding frames to form a local trajectorylet. Because human skeleton size varies across action instances, we perform a skeleton size normalization on the raw skeletal joints following [31]. We also subtract the position of the hip center from each joint and concatenate the joint coordinates into a single feature column vector, making the origin of the coordinate system consistent across all frames and subjects.

III-A Trajectorylet

Although holistic trajectories of joints depict the movement of the human body, distinctive patterns are usually overwhelmed by common ones. For example, in long-term actions such as draw circle and draw tick, only the last moment of the drawing movement distinguishes them, before which both trajectories share the same long movement of raising up the hand. On the other hand, as depicted in Figure 2, frame-level local descriptors record current poses and some local dynamics, but they fail to capture movement that spans a longer temporal range. To distinguish walk from run, for instance, we need to examine the displacement and speed of the joints within a sufficient period of time, rather than the static poses. Based on these observations, we propose the trajectorylet local descriptor, which captures the static and dynamic information of trajectories in a short period of time. Compared with frame-level descriptors, the trajectorylet depicts richer dynamic information. On the other hand, its temporal range is much smaller than the whole trajectory sequence, and it is therefore less affected by potentially irrelevant frames.

Fig. 2: The joint coordinate information at frame-level may provide little information to distinguish between some action classes, such as the above drawing actions. One of the advantages of trajectorylets is their ability to focus on the dynamics of distinctive sections of individual actions.

More specifically, considering a trajectorylet of length $L$ starting from frame $t$, we extract the static positions of the joints from each frame occurring within this interval:

$$S_t = [\,s_t, s_{t+1}, \ldots, s_{t+L-1}\,],$$

where $s_t$ denotes the concatenated joint coordinates of frame $t$. In order to retrieve the dynamic information within this interval, we inspect multiple levels of temporal dynamics, such as displacement and velocity:

$$d_{t+k} = s_{t+k} - s_t, \qquad v_{t+k} = s_{t+k} - s_{t+k-1}, \qquad k = 1, \ldots, L-1,$$

where $d_{t+k}$ indicates the relative joint displacements of frame $t+k$ from the first frame, and $v_{t+k}$ indicates the joint velocities of frame $t+k$ with respect to its previous frame within the trajectorylet. The static positions in $S_t$ store the absolute spatial locations of the trajectorylet. The temporal dynamics $d$ and $v$ approximate the relative kinematic evolution within this short time interval. Combining both static and dynamic information, we define the $t$-th trajectorylet for an action instance with $N$ frames as

$$T_t = [\,S_t;\; d_{t+1}, \ldots, d_{t+L-1};\; v_{t+1}, \ldots, v_{t+L-1}\,],$$

where $t = 1, \ldots, N-L+1$.

PCA is applied to the trajectorylets to reduce their dimensionality for our detector learning module; we still refer to the final reduced descriptor as the trajectorylet. Figure 3 visualizes the components of a trajectorylet, including one static component and two dynamic components.
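To make the construction concrete, the following is a minimal NumPy sketch of how a trajectorylet could be assembled from a sequence of hip-centred joint coordinates. The array layout (frames × concatenated coordinates) and the function names are our own illustrative assumptions, not the paper's implementation, and the PCA step is omitted.

```python
import numpy as np

def trajectorylet(joints, t, L=5):
    """Build one trajectorylet descriptor starting at frame t.

    joints: (N, d) array, one row of hip-centred, concatenated joint
            coordinates per frame (hypothetical layout).
    L:      temporal length of the trajectorylet.
    Returns the concatenation of static positions, displacements from
    the first frame, and frame-to-frame velocities.
    """
    window = joints[t:t + L]          # static positions s_t .. s_{t+L-1}
    disp = window[1:] - window[0]     # d_{t+k} = s_{t+k} - s_t
    vel = window[1:] - window[:-1]    # v_{t+k} = s_{t+k} - s_{t+k-1}
    return np.concatenate([window.ravel(), disp.ravel(), vel.ravel()])

def all_trajectorylets(joints, L=5):
    """Extract every trajectorylet of length L from a sequence."""
    n = joints.shape[0] - L + 1
    return np.stack([trajectorylet(joints, t, L) for t in range(n)])
```

For a sequence of N frames with d-dimensional joint vectors, each descriptor has L·d static entries plus 2·(L−1)·d dynamic entries, and the sequence yields N−L+1 trajectorylets.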

Fig. 3: Visualization of a trajectorylet of length 5 at a single joint (left hand). The red point is the position at the starting frame, and the green points are its positions at succeeding frames in this interval. The yellow segments are joint displacements from the first frame. The black segments are joint velocities at each frame. The top trajectorylet is part of drawing circle and the bottom trajectorylet is part of high waving. The differences between them are clearly distinguished by their positions, displacements and velocities over a short period of time.

III-B Learning candidate detectors of discriminative trajectorylets using ESVM

As we have previously discussed, only a small fraction of the trajectorylets from a sequence contains sufficient information for identifying the associated action. Most of the trajectorylets, especially those that contain static postures, are shared by multiple action classes. Our aim is to learn a set of detectors that fire on the distinctive trajectorylets. To this end, we first resort to exemplar-SVM (ESVM) [10] to learn a large set of detectors for a large number of sampled trajectorylets, one for each sampled trajectorylet. Then, for each action instance, we select a few discriminative trajectorylet detectors as the candidate detectors of discriminative trajectorylets.

An ESVM learns a decision boundary that achieves the largest margin between an exemplar sample and a set of negative examples. If we take each trajectorylet $T$ as a positive exemplar of its associated class $c$, and trajectorylets that belong to other action classes as the negative examples, we can train an exemplar-SVM for it. Formally, this can be formulated as:

$$\min_{w,\,b}\; \|w\|^2 + C_1\, h(w^\top T + b) + C_2 \sum_{T' \in \mathcal{N}_c} h(-w^\top T' - b),$$

where $h(x) = \max(0, 1-x)$ is the hinge loss function, and $\mathcal{N}_c$ is the negative set of trajectorylets that do not belong to class $c$. $C_1$ and $C_2$ denote the weights of the loss for positive and negative samples respectively, and $C_1 > C_2$ ensures that a greater penalty will be applied to an incorrectly classified positive exemplar.
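As an illustration, one exemplar detector could be trained with scikit-learn's `LinearSVC` as sketched below (the paper uses liblinear directly). The class-weighting scheme mimics the asymmetric penalties $C_1 > C_2$, but the concrete parameter values and function names here are our own assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_esvm(exemplar, negatives, c_pos=0.5, c_neg=0.01):
    """Train one exemplar-SVM: a linear classifier separating a single
    positive trajectorylet from a set of negatives.

    c_pos > c_neg applies a greater penalty to misclassifying the
    positive exemplar (illustrative values, not the paper's).
    """
    X = np.vstack([exemplar[None, :], negatives])
    y = np.array([1] + [-1] * len(negatives))
    clf = LinearSVC(C=1.0, class_weight={1: c_pos, -1: c_neg})
    clf.fit(X, y)
    # Normalize w to unit norm so that detection scores from different
    # detectors are measured on the same scale.
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)
```

The normalized weight vector is then used as the detector: its inner product with any trajectorylet is that trajectorylet's detection score.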

For each ESVM, the trained detector returns higher scores on trajectorylets that are most similar to its exemplar. If the exemplar trajectorylet is common to multiple action classes, the top-scoring trajectorylets are spread across multiple classes. On the contrary, if the exemplar trajectorylet is unique to a single class, most of the top-scoring trajectorylets belong to the same class as the exemplar. Thus we can employ the distribution of action classes among the returned trajectorylets to estimate the discriminative power of a detector.

Fig. 4: Overview of our feature learning framework.

Given an action instance, we extract its trajectorylet descriptors and train the associated detectors, one per trajectorylet. A selection method is then applied to find the most discriminative trajectorylet detectors among these candidates. More specifically, we apply each detector to the trajectorylets sampled from the whole training set and compute the detection scores. (In order to measure the scores on the same scale, we adjust the trained parameters to unit norm before computing the scores.) From these scores we choose the subset with the top values, corresponding to the trajectorylets that are most compatible with the current detector. For the trajectorylets detected by a detector, we count the number belonging to each action class. The resulting class histogram gives a clear view of the distinctiveness of the detector.

If the histogram is flat across many classes, the exemplar is a common pattern shared by many classes and its detector is therefore not distinctive. If the histogram is concentrated at the correct class, the exemplar trajectorylet is a distinctive pattern for this class, and its detector is an effective detector of this distinctive pattern. In practice, we compute for each detector the ratio of correctly detected trajectorylets, i.e. those belonging to the same class as the exemplar, and detectors with higher ratios are selected because they fire primarily on trajectorylets of their own class, which verifies their distinctiveness. We summarize this approach in Algorithm 1.

Input: Training action instance of class $c$ and the trajectorylets within it; sampled training trajectorylets; number of top-scoring trajectorylets to retain per detector; maximum number of detectors to be selected for the instance.
Initialize: Set of discriminative detectors for the instance $\mathcal{D} \leftarrow \emptyset$; number of discriminative detectors selected for the instance $n \leftarrow 0$.
for each trajectorylet in the instance do
        Solve its ESVM. Compute detection scores on the sampled trajectorylet set. Compute the class histogram from the top-scored samples. Compute the ratio of correctness of the detector.
end for
Sort the ratios of correctness by magnitude, storing the resulting (sorted) indices. for each index in sorted order do
        Add the corresponding detector to $\mathcal{D}$ and increment $n$. if $n$ reaches the maximum then break.
        end if
end for
Output: Discriminative detectors $\mathcal{D}$ for the instance.
Algorithm 1 Find discriminative detectors for an action instance
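The scoring and selection steps of Algorithm 1 might be sketched in NumPy as follows; the variable names (`m`, `k_max`) and the assumption that detectors are unit-norm weight vectors are ours.

```python
import numpy as np

def detector_purity(w, sampled, labels, target_class, m=100):
    """Score one detector by the class purity of its top-m detections.

    sampled: pool of training trajectorylets (rows); labels: their
    action classes. Returns the ratio of correctly detected samples.
    """
    scores = sampled @ w
    top = np.argsort(scores)[::-1][:m]        # indices of top-m scores
    return float(np.mean(labels[top] == target_class))

def select_detectors(detectors, sampled, labels, target_class,
                     m=100, k_max=10):
    """Keep the k_max detectors whose top-m detections are purest."""
    ratios = [detector_purity(w, sampled, labels, target_class, m)
              for w in detectors]
    order = np.argsort(ratios)[::-1]          # sort by ratio, descending
    return [detectors[i] for i in order[:k_max]]
```

A detector whose top-m detections all share the exemplar's class gets purity 1.0 and is retained first.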

III-C Template detector set

As the detectors are discovered from every action instance, the size of the detector set grows with the number of training instances, which would lead to a very high-dimensional action representation and make the computation intractable. Moreover, the above method might select similar distinctive detectors multiple times, resulting in a highly redundant detector set. To control the size of the detector set and remove this redundancy, we perform spectral clustering on the candidate detectors and then select one detector from each cluster to form the final detector set used for trajectorylet encoding. To build the affinity graph for spectral clustering, we need to specify a similarity measurement between two detectors. Here we measure this similarity by considering the "active detection scores" of two detectors, i.e. the detection scores with positive values. We evaluate it by first calculating detection scores on the sampled trajectorylets and setting negative detection scores to zero. This process gives an active detection score vector for each detector, and the similarity between two detectors is measured as follows:

$$s_{ij} = \exp\!\left(-\|a_i - a_j\|_2\right),$$

where $\|\cdot\|_2$ represents the $\ell_2$ norm, and $a_i$ and $a_j$ denote the active detection score vectors of the two compared detectors. The value $s_{ij}$ measures the similarity between two detectors and is used to build the affinity matrix $A$ for the detector set, that is, $A_{ij} = s_{ij}$. We apply spectral clustering to $A$ and obtain clusters of detectors. The detectors within the same cluster fire on similar trajectorylets. From each cluster, we select as representative the detector that produces the highest score on the sampled trajectorylets. In practice, given a sufficiently large number of clusters, the collection of representative detectors can cover all discriminative trajectorylets. We call this collection the template trajectorylet detector set.
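A possible implementation of this clustering step uses scikit-learn's `SpectralClustering` on a precomputed affinity matrix. The Gaussian-of-distance affinity over active-score profiles and all names below are our assumptions, sketched to mirror the procedure described above rather than to reproduce the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def template_detector_set(detectors, sampled, K, gamma=1.0):
    """Cluster candidate detectors by their active detection scores
    and keep one representative per cluster.

    detectors: (D, d) array of unit-norm detector weights.
    sampled:   (S, d) trajectorylets used to profile the detectors.
    K:         number of template detectors to keep.
    """
    # Active detection scores: negative scores are set to zero.
    active = np.maximum(detectors @ sampled.T, 0.0)
    # Affinity decays with the distance between active-score profiles.
    sq = ((active[:, None, :] - active[None, :, :]) ** 2).sum(-1)
    affinity = np.exp(-gamma * sq)
    labels = SpectralClustering(n_clusters=K, affinity='precomputed',
                                random_state=0).fit_predict(affinity)
    reps = []
    for c in range(K):
        members = np.flatnonzero(labels == c)
        # Representative: cluster member with the highest detection score.
        reps.append(members[np.argmax(active[members].max(axis=1))])
    return detectors[np.array(reps)]
```

Detectors that fire on the same trajectorylets end up in the same cluster, so keeping one representative per cluster removes the redundancy while preserving coverage.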

IV Global descriptor and classification

For the $K$ detectors in the template detector set, we evaluate their detection scores on each trajectorylet and max-pool those detection scores to obtain the action representation. Formally, let $T^{(i)}_j$ be the $j$-th trajectorylet of the $i$-th action, and $w_k$ be the $k$-th detector in the template detector set. We define the $k$-th dimension of the action representation $F^{(i)}$ for the $i$-th action as:

$$F^{(i)}_k = \max_j \; w_k^\top T^{(i)}_j.$$

We use a one-versus-all SVM to classify actions among the action classes.

The learned feature mapping governed by the template detector set serves as a global descriptor of the action instance. It maps temporally continuous trajectorylets into a higher-level representation. Moreover, this mapping applies not only to a complete action sequence but also to any temporal sub-sequence. This allows us to build a temporal pyramid representation of the action instance: the sequence is recursively split into equal sub-sequences, and the $k$-th dimension of the sub-feature for a sub-sequence $S$ is

$$F^{S}_k = \max_{j \in S} \; w_k^\top T_j.$$

The concatenated sub-features incorporate the temporal information of the skeleton sequence. Therefore we are able to train a one-versus-all SVM with a feature that takes into account the global temporal information of the whole action sequence.
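The pooling and pyramid construction above might be sketched as follows. The convention that pyramid level $l$ splits the sequence into $2^{l-1}$ equal sub-sequences is our assumption of a common temporal-pyramid layout, not a detail confirmed by the text.

```python
import numpy as np

def max_pool(templates, trajs):
    """f_k = max over trajectorylets of the k-th detector's score.

    templates: (K, d) template detectors; trajs: (n, d) trajectorylets.
    """
    return (trajs @ templates.T).max(axis=0)

def temporal_pyramid(templates, trajs, levels=3):
    """Concatenate max-pooled codes over a temporal pyramid.

    Level l (0-based) splits the sequence into 2**l equal parts, so a
    3-level pyramid yields 1 + 2 + 4 = 7 sub-sequences.
    """
    feats = []
    n = len(trajs)
    for level in range(levels):
        parts = 2 ** level
        for p in range(parts):
            lo, hi = p * n // parts, (p + 1) * n // parts
            feats.append(max_pool(templates, trajs[lo:hi]))
    return np.concatenate(feats)
```

With $K$ template detectors, a 3-level pyramid gives a $7K$-dimensional feature, whose first $K$ dimensions coincide with the plain max-pooled representation.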

V Experiments

We organize the experimental evaluation in four parts. We first compare our proposed method against other state-of-the-art methods on two standard datasets obtained from the Kinect sensor. Then we analyze the performance of our method under different parameter settings. Since our method consists of two modules, the trajectorylet descriptor and the template-detector-based middle-level feature representation, we conduct two experiments to separately evaluate their impact on the classification performance. To examine the first module, we compare our descriptor against the descriptor of [31], which is most related to our trajectorylet descriptor, while keeping the other settings of the recognition system the same. We also compare our descriptor with several alternative variants. To examine the second module, we compare our method with alternative ways to obtain the action representation, constructed from three state-of-the-art middle-level feature representation methods: VLAD [7], LSC [9], and LLC [26].

Implementation details: The ESVMs are implemented with liblinear [4], which produces about 5 candidate detectors per second on an Intel Core i7 CPU at 3.40GHz. The same regularization parameters are used for all ESVMs, with the negative sets sampled from the training trajectorylets of MSR Action3D and MSR DailyActivity3D respectively. The dimensionality of the trajectorylets is reduced to 50% of the original by PCA. As the testing data are not known in advance, the PCA coefficients and covariance matrix are learned from the training data only. Unless indicated otherwise, the trajectorylet length is set to 5 (cf. Figure 3). The regularization parameter of the final one-versus-all SVM is determined by five-fold cross-validation. We apply a 3-level temporal pyramid on MSR DailyActivity3D only, because it contains complex actions which involve several sub-actions, and long-range temporal information can be useful in such a case.

V-A MSR Action3D

AS1                    AS2              AS3
Horizontal arm wave    High arm wave    High throw
Hammer                 Hand catch       Forward kick
Forward punch          Draw x           Side kick
High throw             Draw tick        Jogging
Hand clap              Draw circle      Tennis swing
Bend                   Two hand wave    Tennis serve
Tennis serve           Forward kick     Golf swing
Pickup & throw         Side boxing      Pickup & throw

TABLE I: The classes in the three action subsets of the MSR Action3D dataset.

Protocol of [8]
Method              AS1    AS2    AS3    Average
3DBag [8]           72.9   71.9   79.2   74.7
HO3DJ [29]          88.0   85.5   63.5   79.0
EigenJoints [30]    74.5   76.1   96.4   82.3
HOD [5]             92.4   90.1   91.4   91.2
Lie Group [22]      95.3   83.9   98.2   92.5
EJS [1]             91.6   90.8   97.3   93.2
Moving Pose [31]*   91.6   99.1   95.7

* We use the code of [31] to obtain this result, as the original work did not report results under the protocol of [8].

TABLE II: Results on the 3 subsets of the MSR Action3D dataset.

The MSR Action3D dataset consists of human actions represented by skeletons composed of 20 3D body joint positions in each frame. The 20 joints are connected by 19 limbs. There are 20 action classes performed by 10 subjects 2 or 3 times each, making up 567 action instances. Each action instance contains a temporal sequence of a moving skeleton, usually 30-50 frames long. As in [25] and [31], we drop 10 instances because they contain erroneous data. The experimental setup is that of a cross-subject test [8], i.e. instances of half of the subjects are used for training and instances of the other half for testing. We construct the class histograms from the top-responding trajectorylets, select the best detectors for each training instance, and use the clustering method of Section III-C to obtain the template trajectorylet detector set. The effect of the final number of template detectors is analysed in Section V-C.

In Table II, we compare our approach with other state-of-the-art methods using the protocol of [8], in which the 20 action classes are grouped into 3 action subsets AS1, AS2, and AS3. Training and testing are performed on each action subset separately. AS1 and AS2 group actions with similar movements, while AS3 groups complex actions. The action classes of each subset are listed in Table I. On average, our proposed method is more accurate than all other methods. On AS2, all other methods achieve only moderate accuracy, and in contrast our method outperforms the second best by 5.9%. Note that on AS3, our method achieves perfect recognition.

Protocol of [25]
Method                            Accuracy
Recurrent Neural Network [11]     42.5
Dynamic Temporal Warping [14]     54.0
Canonical Poses [3]               65.7
DBM+HMM [27]                      82.0
JAS (skeleton data only) [16]     83.5
Actionlet Ensemble [25]           88.2
HON4D [17]                        88.9
Lie Group [22]                    89.5
LDS [2]*                          90.0
Pose based [23]                   90.2
Moving Pose [31]                  91.7

* It should be noted that the result of [2] is not obtained under the same setting as ours. That approach selected a subset of 17 actions performed by 8 subjects, 5 for training and 3 for testing, consisting of 379 action instances in total.

TABLE III: Results on the entire MSR Action3D dataset.

In Table III, the more challenging protocol of [25] is used. Here the model is trained and tested over all 20 action classes. The results show that our method still obtains a highly accurate recognition rate, outperforming the current best state-of-the-art by a margin of 4.2%. The confusion matrix of our method on this dataset under the second protocol is displayed in Figure 5, where 16 of the 20 action classes are perfectly classified. The only highly misclassified class is hammer, because its distinctive pattern involves human-object interaction, which is not captured by the skeleton data.

Fig. 5: Confusion matrix of our approach on the MSR Action3D dataset: except for the hammer class, all other action classes are classified with more than 80% accuracy. 16 out of 20 action classes are perfectly classified.

V-B MSR DailyActivity3D

In MSR DailyActivity3D, there are 16 action classes performed by 20 subjects twice each, making up 320 action instances. Each subject performs an action class in two variants (e.g. sitting versus standing, or in front of versus behind an object). This dataset has longer sequences, usually of 100-300 frames. We again follow the cross-subject test protocol of [25] and [31], where training and testing are conducted over all action classes. Because this dataset contains more local information than MSR Action3D, we retain more top-responding trajectorylets when constructing the class histograms, select more detectors for each training instance, and use a smaller final number of clustered detectors.

Methods                                          Accuracy
Dynamic Temporal Warping [14]
Actionlet Ensemble (skeleton data only) [25]
Moving Pose [31]*                                70.6

* Although the reported result in [31] is 73.8%, we never achieved this accuracy with their code due to environmental factors. For a fair comparison, we use the result 70.6%, which is the best performance under the same environment and settings as our approach.

TABLE IV: Results on the MSR DailyActivity3D dataset.

We compare our approach with other state-of-the-art methods in Table IV. As the purpose of this experiment is to address skeleton-based action recognition, some of the best reported results [17, 25] on this dataset, which use additional RGB-D data, are not comparable to our method; we therefore cite the result of [25] using only skeleton data. Although MSR DailyActivity3D shares the same data structure as MSR Action3D, it is much more challenging because: 1) the activities are complex combinations of multiple sub-actions; 2) human-object interaction information is not available in the skeleton data; 3) partial occlusion by interacting objects causes the skeleton data to be highly noisy. Nevertheless, the results show that our approach still outperforms all other state-of-the-art methods. As shown in Figure 6, most of the poorly classified actions involve interaction with objects, such as read book, call cellphone, and use laptop. On the other hand, non-interactive action classes like cheer up, walk, and sit down are recognized with high accuracy. This demonstrates that our method is able to capture distinctive patterns of actions in terms of "movement", but may be confused when actions share similar "movement" patterns despite the presence of different interacting objects, because the objects are not described in the skeleton data.

Fig. 6: Confusion matrix of our approach on the MSR DailyActivity3D dataset: although this is a challenging dataset for skeleton-based action recognition, 11 out of 16 classes are classified with more than 70% accuracy.

V-C Parameter analysis

In this section we analyse how the parameter settings affect performance. Using the same protocol of [25], we provide results on the MSR Action3D dataset for other parameter settings. Figure 7 illustrates the performance of our method as the template detector set size K ranges over {25, 50, 100, 200, 300, …, 1000}, with the other parameters fixed. When the size of the detector set exceeds 500, the results converge to a value above 94.5%. Table V presents results for different pairs of the remaining two parameters with K fixed.

Fig. 7: Recognition accuracies obtained by varying K on the MSR Action3D dataset: when K exceeds 500, the results become stable.
5 91.7
10 92.7 93.4
20 93.1 93.1 94.1 94.8
30 94.8 95.2 94.1 93.8 94.8
50 95.5 95.9 95.9 94.8 94.8 94.2

TABLE V: Results for different parameter pairs on MSR Action3D: the best performance can be obtained from multiple choices.

For the MSR DailyActivity3D dataset, Figure 8 illustrates the performance of our method as K ranges over {25, 50, 100, 200, 300, …, 1000}, with the other parameters fixed. When K is set to more than 500, the results become stable. The effect of choosing different parameter pairs is listed in Table VI. When K is large enough, the variation in results becomes small. It can be observed that, on both datasets, multiple parameter choices produce the optimal result, which verifies the robustness of our approach.

Table VII shows the results under different temporal pyramid settings for the two datasets. A typical 3-level pyramid is the best choice for MSR DailyActivity3D, as lower-level pyramids fail to capture the temporal information while higher-level ones bring too much noise. On the other hand, when a temporal pyramid is applied to MSR Action3D, the performance worsens.

Fig. 8: Recognition accuracy obtained by varying K on the MSR DailyActivity3D dataset: when K exceeds 500, the results become stable.
5 68.7
10 68.1 69.4
20 68.7 71.2 70.0 69.4
30 70.0 73.1 73.8 71.2 71.2
50 73.1 74.3 75.0 75.0 74.3 71.9

TABLE VI: Results for different parameter pairs on MSR DailyActivity3D.
TP level 1 2 3 4
Action 95.9 92.4 89.7 N/A
DailyActivity 66.3 70.6 75.0 68.8

TABLE VII: Results obtained from different temporal pyramid levels on MSR Action3D and MSR DailyActivity3D datasets.

V-D Power of local trajectorylet descriptor

The moving pose descriptor proposed in [31] captures frame-level local information of human skeleton actions. Our trajectorylet can be seen as a natural extension of it, in the sense that we extend the dynamic information from the frame level to a longer temporal range. In order to demonstrate the power of our descriptor, we now apply our template detector learning framework to the moving pose descriptor and compare its performance with that of the trajectorylet.

In order to evaluate the effect of the trajectorylet length on performance, we vary the length over {3, 5, 7}. Table VIII shows that, using the same detector learning and classification approach, trajectorylets achieve better results on both datasets for all tested lengths. As seen, this extension of the moving pose descriptor is superior to the original design. It is worth noting that performance does not necessarily improve as the length of trajectorylets increases; a moderate trajectorylet length leads to the best performance.

We also test the effect of using different components of the trajectorylet descriptor. In our experiment, we examine the performance of individual components, including the static pose, the relative joint displacement, and the velocity, as well as their combinations. We further define an acceleration component analogous to (2) and (3):


The results of varying the component settings of the trajectorylet are listed in Table IX. We find that the dynamic components alone do not show promising results. However, when they are combined with the static pose, the performance is significantly improved. Table IX also shows that the additional acceleration component in (9) does not improve the performance.
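The components above can be illustrated with finite differences over a short window of joint trajectories. This is a sketch under our own assumptions (the exact normalisation and the precise forms of (2), (3) and (9) are defined earlier in the paper); `joints` is an (L, J, 3) window of 3D joint positions and all names are ours:

```python
import numpy as np

def trajectorylet(joints):
    """Build a trajectorylet descriptor from an (L, J, 3) window.

    Concatenates: the static pose of the centre frame, the displacement
    of each frame relative to the centre frame, and the frame-to-frame
    velocity (first-order difference). The acceleration (second-order
    difference) is computed for reference but, as in Table IX, left out
    of the final descriptor.
    """
    L, J, _ = joints.shape
    c = L // 2
    pose = joints[c].ravel()                      # static pose
    displacement = (joints - joints[c]).ravel()   # relative joint displacement
    velocity = np.diff(joints, n=1, axis=0).ravel()
    acceleration = np.diff(joints, n=2, axis=0).ravel()  # analogous extra component
    return np.concatenate([pose, displacement, velocity])

window = np.random.rand(5, 20, 3)  # L = 5 frames, 20 joints
desc = trajectorylet(window)       # 3*20 + 5*3*20 + 4*3*20 = 600 dims
```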

Fig. 9: Example trajectorylets firing on the template detector set of MSR Action3D (panels: high wave, hori. wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, 2-hand wave, boxing, bend, forward kick, side kick, t. swing, t. serve, g. swing, pick up & throw). The black curves represent the current trajectorylets of the red skeleton. It is clear that our approach identifies discriminative patterns of movement.
Descriptors MSR Action MSR DailyActivity
Moving Pose [31] 91.7 71.3
Ours() 93.1 72.5
Ours() 73.1

TABLE VIII: Comparison of using different descriptors.
Component MSR Action MSR DailyActivity
92.4 72.5
91.7 50.3
90.3 42.5
93.8 73.1

TABLE IX: Comparison of using different components of the trajectorylet ().
Method MSR Action MSR DailyActivity
VLAD [7]
LLC [26]
LSC [9]
TABLE X: Comparison of feature learning methods.

V-E Power of template detector learning

Our method generates the action representation from a learned set of discriminative trajectorylet detectors. In this section, we compare it with three state-of-the-art bag-of-features techniques that learn mid-level features from the same local trajectorylet descriptors: VLAD (vector of locally aggregated descriptors) [7], LLC (locality-constrained linear coding) [26], and LSC (localized soft-assignment coding) [9].

We train codebooks of the same size with k-means for all three methods, and set the neighbourhood size of codewords for LSC and LLC. The results listed in Table X show that, for the task of action recognition, our proposed feature learning framework produces the most discriminative action representation compared with these state-of-the-art methods. Figure 9 illustrates some trajectorylets that fire on the template detector set of MSR Action3D. It is clear that they show representative patterns for the corresponding action classes.

VI Conclusion

This work describes an effective skeleton-based action recognition approach that achieves high accuracy on the relevant benchmark datasets. Two factors are key to this performance. First, we propose the trajectorylet, a novel local descriptor that captures static and dynamic information in a short interval of joint trajectories. Second, we devise a novel framework that generates a robust and discriminative representation for action instances by learning a set of distinctive trajectorylet detectors. On two benchmark datasets acquired with the Kinect sensor, our method outperforms, to our knowledge, all existing approaches by a significant margin. We also separately demonstrate the validity of our local descriptor and of the template detector learning method. To further extend our framework, we plan to incorporate local temporal information to enable real-time detection, and to investigate RGB data in order to model human-object interactions.


  • [1] A. A. Chaaraoui, J. R. Padilla-López, P. Climent-Pérez, and F. Flórez-Revuelta. Evolutionary joint selection to improve human action recognition with RGB-D devices. Expert Syst. Appl., 41(3):786–794, February 2014.
  • [2] R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal. Bio-inspired dynamic 3d discriminative skeletal features for human action recognition. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., pages 471–478, June 2013.
  • [3] C. Ellis, S. Z. Masood, M. F. Tappen, J. J. Laviola, Jr., and R. Sukthankar. Exploring the trade-off between accuracy and observational latency in action recognition. Int. J. Computer Vision, 101(3):420–436, 2013.
  • [4] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. J. Machine Learning Research, 9:1871–1874, 2008.
  • [5] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban. Histogram of oriented displacements (hod): describing trajectories of human joints for action recognition. In Proc. Int. Joint Conf. Artificial Intelligence, 2013.
  • [6] J. Han, L. Shao, D. Xu, and J. Shotton. Enhanced computer vision with microsoft kinect sensor: A review. IEEE T. Cybernetics, 43(5):1318–1334, 2013.
  • [7] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3304–3311, June 2010.
  • [8] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3d points. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., 2010.
  • [9] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In Proc. IEEE Int. Conf. Comp. Vis., pages 2486–2493, Nov 2011.
  • [10] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In Proc. IEEE Int. Conf. Comp. Vis., 2011.
  • [11] J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. In Proc. Int. Conf. Mach. Learn., pages 1033–1040, New York, NY, USA, June 2011. ACM.
  • [12] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. In Proc. IEEE Int. Conf. Comp. Vis., pages 104–111, Sept 2009.
  • [13] T. B. Moeslund, A. Hilton, and V. Krüger. A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst., 104(2):90–126, 2006.
  • [14] M. Müller and T. Röder. Motion templates for automatic classification and retrieval of motion capture data. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 137–146, 2006.
  • [15] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. Sequence of the most informative joints (smij): A new representation for human skeletal action recognition. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., pages 8–13, June 2012.
  • [16] E. Ohn-bar and M. M. Trivedi. Joint angles similiarities and hog 2 for action recognition. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., 2013.
  • [17] O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4d normals for activity recognition from depth sequences. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013.
  • [18] M. Raptis and L. Sigal. Poselet key-framing: A model for human activity recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2650–2657, Washington, DC, USA, 2013. IEEE Computer Society.
  • [19] Z. Shao and Y. Li. A new descriptor for multiple 3d motion trajectories recognition. In Proc. IEEE Int. Conf. Robotics and Automation, pages 4749–4754, May 2013.
  • [20] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1297–1304, Washington, DC, USA, 2011. IEEE Computer Society.
  • [21] P. Turaga, R. Chellappa, V. Subrahmanian, and O. Udrea. Machine recognition of human activities: A survey. IEEE T. Circuits & Systems for Video Technology, 18(11):1473–1488, 2008.
  • [22] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3d skeletons as points in a Lie group. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 588–595, June 2014.
  • [23] C. Wang, Y. Wang, and A. Yuille. An approach to pose-based action recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 915–922, June 2013.
  • [24] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3169–3176, June 2011.
  • [25] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1290–1297, June 2012.
  • [26] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3360–3367, June 2010.
  • [27] D. Wu and L. Shao. Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 724–731, June 2014.
  • [28] S. Wu, Y. Li, and J. Zhang. A hierarchical motion trajectory signature descriptor. In Proc. IEEE Int. Conf. Robotics and Automation, pages 3070–3075, May 2008.
  • [29] L. Xia, C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3d joints. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., pages 20–27. IEEE, 2012.
  • [30] X. Yang and Y. Tian. Eigenjoints-based action recognition using naïve-bayes-nearest-neighbor. In Proc. Workshops of IEEE Conf. Comp. Vis. Patt. Recogn., pages 14–19, 2012.
  • [31] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The “moving pose”: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In Proc. IEEE Int. Conf. Comp. Vis., 2013.
  • [32] Z. Zhao and A. Elgammal. Information theoretic key frame selection for action recognition. In Proc. British Mach. Vis. Conf., 2008.