Learning Linear Dynamical Systems with High-Order Tensor Data for Skeleton based Action Recognition

by Wenwen Ding, et al.
Xidian University
NetEase, Inc

In recent years, there has been renewed interest in developing methods for skeleton-based human action recognition. A skeleton sequence can be naturally represented as a high-order tensor time series. In this paper, we model and analyze tensor time series with the Linear Dynamical System (LDS), which is the most common model for encoding spatio-temporal time-series data in various disciplines due to its relative simplicity and efficiency. However, the traditional LDS treats the latent and observation state at each frame of video as a column vector. Such a vector representation suffers from the curse of dimensionality and discards valuable structural information about the human action. Considering this fact, we propose a generalized Linear Dynamical System (gLDS) for modeling tensor observations in the time series and employ Tucker decomposition to estimate the LDS parameters as action descriptors. An action can therefore be represented as a subspace corresponding to a point on a Grassmann manifold. We then perform classification using dictionary learning and sparse coding over the Grassmann manifold. Experiments on the MSR Action3D Dataset, the UT-Kinect Dataset and the Northwestern-UCLA Multiview Action3D Dataset demonstrate that our proposed method achieves superior performance to state-of-the-art algorithms.




1 Introduction

Human action recognition and behavior analysis based on spatio-temporal data is one of the hottest topics in computer vision due to its many applications in smart surveillance, web-video search and retrieval, human-computer interfaces, and health care. Since the recent release of cost-effective depth sensors, we have witnessed another surge of research on 3D data. The introduction of 3D data largely alleviates the low-level difficulties of illumination changes, occlusions, and background extraction. Furthermore, the 3D positions of skeletal joints can be quickly and accurately estimated from a single depth image Shotton et al. (2013). These recent advances have resulted in a keen interest in skeleton-based human action recognition.

The coupling of spatial texture and temporal dynamics makes human action more challenging to understand than static data. The global dynamics of action sequences are usually captured by modeling the temporal variations with a Linear Dynamical System (LDS) Turaga et al. (2010). The traditional model treats the latent and observation state of each human skeleton as a vector. Such a vector representation fails to match the structural properties of the skeleton. In most previous works Aggarwal and Xia (2014); Presti and La Cascia (2016), the representation for skeleton-based action recognition concatenates all the attributes of the skeletal joint points into a single vector. In contrast, we consider a skeleton as a directed graph, in which the nodes are joint points and the edges are rigid bodies between adjacent joint points. A graph is most commonly represented and stored as a matrix, i.e., a 2-order tensor. Therefore, a tensor-based time series is the most natural way of expressing human action sequences, since tensors are multi-dimensional data objects that not only capture spatial and temporal information but also preserve higher-order dependencies.

Figure 1: The general framework of the proposed approach.

These ideas suggest a new way to model and compare the dynamics of action sequences. In order to keep the original spatio-temporal information of an action video and improve the performance of human action recognition, this paper proposes a generalized LDS (gLDS) framework, shown in Fig. 1. First, human skeletons consisting of 3D human joint points in Euclidean space are extracted from a depth camera. An action video (a time series of human skeletons) is represented as a 3-order tensor time series, while each skeleton is converted to a 2-order tensor. Based on this action representation, the tensor is decomposed into a core tensor multiplied by a matrix along each mode. Then a subspace, spanned by the columns of the observability matrix of the gLDS model, can be learned by using the gLDS parameters acquired with the mode-n matricization of the Tucker model. Therefore, an action can be represented as a subspace that corresponds to a point on a Grassmann manifold. Finally, dictionary learning and sparse coding over the Grassmann manifold are used to perform human action classification.

The contributions of the paper are the following: (1) We propose a novel skeleton-based tensor representation which not only keeps the original spatio-temporal information but also avoids the curse of dimensionality caused by vectorization. (2) We model tensor time series with the gLDS model, which generalizes vector-based states to tensor-based states via a multi-linear transformation. The gLDS models each tensor observation in the time series as the projection of the corresponding member of a sequence of latent tensors. (3) Compared to the subspace method of Doretto et al. (2003), the gLDS decomposes the tensor-based time series to reveal the principal components that construct the human skeleton. Therefore, the gLDS model achieves higher recognition accuracy on different datasets. (4) Experiments on three different datasets show that the proposed tensor-based representation performs better than many existing skeletal representations. We also show that the proposed approach outperforms various state-of-the-art skeleton-based human action recognition approaches.

The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 briefly introduces some fundamental concepts of tensors and LDS. Section 4 elaborates gLDS and describes how the gLDS parameters are learned from the tensor time series. Section 5 describes sparse coding on the Grassmann manifold. Section 6 presents our experimental results and discussion, and Section 7 concludes this paper.

2 Related Work

In this section we focus on the most recent methods of skeleton-based human action recognition from depth cameras. Two categories are reviewed: joint-based approaches and dynamics descriptors-based approaches. Joint-based approaches model the entire human skeleton by using a single feature representation, whereas dynamics descriptors-based approaches treat the skeleton sequence as 3D trajectories and model the dynamics of such time series.

Joint-based approaches: The methods belonging to this category attempt to correlate joint locations. In order to add temporal information to this representation, Yang and Tian (2014) employed the differences of 3D joint positions between the current frame and the preceding frame, and between the current frame and the initial one. Action classification was performed by using the Naive-Bayes nearest neighbor rule in a lower-dimensional space constructed by principal component analysis (PCA). Li et al. (2010) employed a bag-of-3D-points graph approach to encode actions based on body silhouette points projected onto the three orthogonal Cartesian planes. A more complex representation is introduced in Vemulapalli et al. (2014), where the relative 3D geometry between different rigid bodies is explicitly estimated. The relative geometry between rigid body parts can be described as a point in the special Euclidean group $SE(3)$ using rotation and translation. Therefore, the entire human skeleton in each frame can be described as a point in $SE(3) \times \cdots \times SE(3)$, and a sequence of skeletons is a curve in this Lie group.

Dynamics descriptors-based approaches: Methods for 3D skeleton-based action recognition in this category focus on modeling the dynamics of either subsets or all of the joints in the skeleton. This can be accomplished by considering linear dynamical systems (LDS), Hidden Markov models (HMM) or mixed approaches.

Xia et al. (2012) mapped 3D skeletal joints to a spherical coordinate system and computed a histogram of 3D joint locations (HOJ3D) to achieve a compact posture representation. Linear Discriminant Analysis (LDA) was used to project the HOJ3D and compute the visual words. The temporal evolutions of those visual words were modeled by a discrete Hidden Markov Model (HMM). The parameters obtained from the LDS modeling of the skeletal joint trajectories likely describe positions and velocities of the individual joints. In Chaudhry et al. (2013), bio-inspired shape context features were computed by considering the directions of a set of points sub-sampled over the segments in each body part. The temporal evolutions of these bio-inspired features were modeled using an LDS, and the corresponding estimated parameters were used to represent the action sequence. The Local Tangent Bundle Support Vector Machine (LTBSVM) in Slama et al. (2015) used LDS to describe an action as a collection of time series of 3D locations of the joints. The dynamics captured by the LDS during an action sequence can be represented by the observability matrix O. The subspace spanned by the columns of O corresponds to a point on a Grassmann manifold. While class samples were represented by a Grassmann point cloud, a Control Tangent (CT) space representing the mean of each action class was learned. Each observed sequence was projected on all CTs to form a Local Tangent Bundle (LTB) representation, and a linear SVM was adopted to perform classification.

3 Briefs of Fundamental Concepts

3.1 Tensor Notation and Operations

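In this paper, vectors (tensors of order one) are denoted by lowercase boldface symbols like v. Matrices (tensors of order two) are denoted by uppercase boldface symbols like A. Higher-order tensors (order three or higher) are denoted by calligraphic symbols like $\mathcal{X}$. The order of a tensor is the number of dimensions, also known as ways or modes.

Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$ be an n-order tensor with $x_{i_1 i_2 \cdots i_n}$ as the $(i_1, i_2, \ldots, i_n)$th element, where $I_1, I_2, \ldots, I_n$ are integers indicating the number of elements for each dimension. The vectorization $\mathrm{vec}(\mathcal{X}) \in \mathbb{R}^{I}$ is obtained by reshaping the tensor into a vector, where $I = \prod_{k=1}^{n} I_k$ is the product of the sizes of all dimensions. In particular, the elements of $\mathrm{vec}(\mathcal{X})$ are given by $\mathrm{vec}(\mathcal{X})_k = x_{i_1 i_2 \cdots i_n}$, where $k = 1 + \sum_{j=1}^{n} (i_j - 1) \prod_{l=1}^{j-1} I_l$. Unfolding along the p-mode is denoted as $X_{(p)} \in \mathbb{R}^{I_p \times (I_1 \cdots I_{p-1} I_{p+1} \cdots I_n)}$. The p-mode product of a tensor $\mathcal{X}$ by a matrix $U \in \mathbb{R}^{J \times I_p}$ is denoted by $\mathcal{X} \times_p U$ and is a tensor with entries

$$(\mathcal{X} \times_p U)_{i_1 \cdots i_{p-1}\, j\, i_{p+1} \cdots i_n} = \sum_{i_p=1}^{I_p} x_{i_1 i_2 \cdots i_n}\, u_{j i_p}. \qquad (1)$$

The Tucker decomposition is a higher-order generalization of the matrix singular value decomposition (SVD) and principal component analysis (PCA). It decomposes a tensor $\mathcal{X}$ into a core tensor multiplied (or transformed) by a matrix along each mode. Thus, we have

$$\mathcal{X} = \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_n U^{(n)}, \qquad (2)$$

where $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_n}$ is called the core tensor and its entries show the level of interaction between the different components. The $U^{(k)} \in \mathbb{R}^{I_k \times J_k}$ are the factor matrices (which are usually orthogonal) and can be thought of as the principal components in each mode. A matrix representation of this decomposition can be obtained by unfolding $\mathcal{X}$ and $\mathcal{G}$ as

$$X_{(p)} = U^{(p)} G_{(p)} \left( U^{(n)} \otimes \cdots \otimes U^{(p+1)} \otimes U^{(p-1)} \otimes \cdots \otimes U^{(1)} \right)^{\top}, \qquad (3)$$

where $\otimes$ denotes the Kronecker product.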
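The operations of this subsection can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names (`unfold`, `fold`, `mode_product`, `hosvd`) are hypothetical, the unfolding uses the column ordering that matches the Kronecker form above, and the Tucker factors are computed with a truncated higher-order SVD, which is one common way to obtain a Tucker decomposition but is not guaranteed to be the best low-rank approximation.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: rows index the chosen mode, the remaining
    indices are flattened with the earliest index varying fastest."""
    moved = np.moveaxis(tensor, mode, 0)
    return np.reshape(moved, (tensor.shape[mode], -1), order="F")

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor whose full shape is `shape`."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    moved = np.reshape(matrix, [shape[mode]] + rest, order="F")
    return np.moveaxis(moved, 0, mode)

def mode_product(tensor, matrix, mode):
    """p-mode product: multiply every mode-`mode` fiber of `tensor` by `matrix`."""
    shape = list(tensor.shape)
    shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, shape)

def hosvd(tensor, ranks):
    """Truncated higher-order SVD (an assumed way to compute Tucker factors).
    Returns the core tensor and one orthonormal factor matrix per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        left, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(left[:, :rank])
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)
    return core, factors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((4, 5, 6))
    core, factors = hosvd(data, ranks=(2, 3, 4))
    print(core.shape, [f.shape for f in factors])
```

With full ranks the factors are square orthogonal matrices and the decomposition reproduces the input exactly; with smaller ranks the core tensor is a compressed representation of the data.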

3.2 Linear Dynamical Systems

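Given a time series $\{y_t\}_{t=1}^{\tau}$, the LDS is a fundamental tool for encoding spatio-temporal data using the following Gauss-Markov process:

$$x_{t+1} = A x_t + w_t, \quad w_t \sim \mathcal{N}(0, Q); \qquad y_t = C x_t + v_t, \quad v_t \sim \mathcal{N}(0, R), \qquad (4)$$

where $y_t \in \mathbb{R}^{p}$ is the $p$-dimensional observed state, $x_t \in \mathbb{R}^{d}$ represents the $d$-dimensional hidden state of the LDS at each time instant $t$, and $d$ represents the order of the LDS. $A \in \mathbb{R}^{d \times d}$ is the transition matrix that linearly relates the states at time instants $t$ and $t+1$, and $C \in \mathbb{R}^{p \times d}$ is the observation matrix that linearly transforms the hidden state to the output $y_t$. $w_t$ and $v_t$ are noise components modeled as normal with zero mean and covariance matrices $Q$ and $R$, respectively. Since C describes the spatial appearance and A represents the dynamics, the tuple $(A, C)$ can be adopted as an intrinsic characteristic of the LDS model Bissacco et al. (2001). Therefore, the goal is to learn the tuple $(A, C)$ of the LDS model given by Equ. 4. Given the observed sequence, the subspace method of Doretto et al. (2003) is widely used to learn the optimal model parameters. To obtain a closed-form solution, this method uses $Y = U \Sigma V^{\top}$ as the singular value decomposition of the observations $Y = [y_1, \ldots, y_{\tau}]$, truncated to the first $d$ components as shown in Fig. 5(a). Taking $\hat{X} = \Sigma V^{\top}$ as the estimated state sequence, the estimated model parameters A and C are given by

$$\hat{C} = U, \qquad \hat{A} = \Sigma V^{\top} D_1 V \left( V^{\top} D_2 V \right)^{-1} \Sigma^{-1}, \qquad (5)$$

where $D_1 = \begin{bmatrix} 0 & 0 \\ I_{\tau-1} & 0 \end{bmatrix}$, $D_2 = \begin{bmatrix} I_{\tau-1} & 0 \\ 0 & 0 \end{bmatrix}$ and $I_{\tau-1}$ is the identity matrix of size $\tau - 1$.

Since C denotes the spatial appearance and A denotes the dynamics, the extended observability matrix $O_{\infty}^{\top} = [C^{\top}, (CA)^{\top}, (CA^{2})^{\top}, \ldots]$ can be adopted as the feature descriptor for an LDS model. The subspace spanned by the columns of a finite observability matrix corresponds to a point on a Grassmann manifold $\mathcal{G}(n, d)$, which is a quotient space of the orthogonal group $O(n)$. Points on the Grassmann manifold are equivalence classes of $n \times d$ orthogonal matrices, with $d < n$, where two matrices are equivalent if their columns span the same d-dimensional subspace.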
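A minimal NumPy sketch of this SVD-based identification follows. The function names are hypothetical, the transition matrix is obtained by least squares on consecutive state estimates (algebraically the same as the closed form with $D_1$ and $D_2$ when the states have full row rank), and a QR factorization is used as one assumed way to obtain an orthonormal basis of the observability subspace.

```python
import numpy as np

def fit_lds(observations, order):
    """Estimate (A, C) and the state sequence from a (p, tau) observation matrix.
    `order` is the hidden-state dimension d."""
    left, sing, right_t = np.linalg.svd(observations, full_matrices=False)
    C = left[:, :order]                          # spatial appearance
    states = sing[:order, None] * right_t[:order]  # X = Sigma V^T
    # Least-squares fit of x_{t+1} = A x_t on the estimated states.
    A = states[:, 1:] @ np.linalg.pinv(states[:, :-1])
    return A, C, states

def observability(A, C, m):
    """Finite observability matrix [C; CA; ...; CA^(m-1)]."""
    blocks = [C]
    for _ in range(m - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

def grassmann_point(A, C, m):
    """Orthonormal basis of the column space of the m-order observability matrix."""
    basis, _ = np.linalg.qr(observability(A, C, m))
    return basis

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((12, 40))
    A, C, X = fit_lds(Y, order=4)
    print(grassmann_point(A, C, m=5).shape)
```

The returned basis has orthonormal columns, so it can be used directly as a representative of a point on the Grassmann manifold.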

4 Spatio-temporal Modeling of Tensor-based Action Sequence

4.1 Tensor Time Series

Figure 2: An action sequence can be represented as a 4-order tensor while each skeleton has different views.

The dynamics and continuity of movement imply that an action cannot be reduced to a simple set of skeletons, because of the temporal information contained in the sequence. Instead of feature vectorization, we consider the tensor time series, which is an ordered, finite collection of tensors that all share the same dimensionality. Such a representation allows us to characterize the action by searching for tensor components that better capture the variation in each mode independently of the other modes.

A human skeleton has $M$ rigid bodies and $N$ joint points. Rigid bodies are skeleton segments between adjacent joint points $p_i$ and $p_j$, and each can be described by the direction of $\overrightarrow{p_i p_j}$ together with $p_i$ and $p_j$. Each skeleton can be described as a 2-order tensor using the following set of rigid bodies:

$$\mathcal{S} = \left\{\, e_{ij} = \left( p_i,\; p_j,\; \overrightarrow{p_i p_j} \right) \right\},$$

where $e_{ij}$ denotes the rigid body between joint point $i$ and joint point $j$, and $p_i = (x_i, y_i, z_i)$ denotes the 3D position of joint point $i$. In practice, this means that 9 parameters are needed to define the position of a 3D rigid body. A 3-order tensor time series with $\tau$ frames, called 3RB, can be conveniently constructed by arranging these skeletons in temporal order.

If a human skeleton is captured from different views, an action sequence can also be represented as a 4-order tensor, with each skeleton represented as a 3-order tensor, as shown in Fig. 2.
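The construction of the rigid-body tensor can be sketched as follows. This is only an illustration under stated assumptions: the function name, the bone list, the ordering of the 9 parameters (start joint, end joint, difference vector) and the axis layout (rigid bodies, parameters, frames) are all hypothetical choices, and the direction is stored as the unnormalized difference between the two joints.

```python
import numpy as np

def rigid_body_tensor(joints, bones):
    """Build a (M, 9, T) tensor from joints of shape (T, N, 3).

    `bones` is a list of (i, j) joint-index pairs, one per rigid body.
    Each rigid body is described by 9 numbers: start joint, end joint and
    the vector pointing from start to end (assumed encoding)."""
    joints = np.asarray(joints, dtype=float)
    start_idx, end_idx = np.asarray(bones).T
    start = joints[:, start_idx]          # (T, M, 3)
    end = joints[:, end_idx]              # (T, M, 3)
    features = np.concatenate([start, end, end - start], axis=-1)  # (T, M, 9)
    return np.transpose(features, (1, 2, 0))  # time as the last mode

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.standard_normal((30, 20, 3))    # 30 frames, 20 joints
    chain = [(k, k + 1) for k in range(19)]        # hypothetical bone list
    print(rigid_body_tensor(sequence, chain).shape)
```

For a multi-view recording, the same function could be applied per view and the results stacked along an additional mode to obtain a 4-order tensor.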

4.2 Tensor-based LDS

We present the gLDS model only for the three-way case, as shown in Fig. 3, but the generalization to $n$ ways is straightforward.

Lemma 1

Let , , , , , , , , . Then


The product is only defined if the column number of C matches the product of the dimension of . Note that this tensor product generalizes the standard matrix-vector product in the case . We shall primarily work with tensors in their vector and matrix representations. Hence, we appeal to the following


Proof. .

Figure 3: The gLDS model with three modes.

In our gLDS equations, according to Lemma 1, the observed states and hidden states can be extended to tensor time series as follows:


where , , , , , , , and denote the process and measurement noise components, respectively. The goal of system identification is to estimate the parameters A and C from the tensor time series .

Let , , and notices that


Now we apply the Tucker decomposition to the tensor time series, as shown in Fig. 4,


where , , and . According to the specified dimensions in , the Tucker decomposition computes the best rank approximation of tensor , where , and . The Tucker decomposition reveals the latent semantic associations between human skeleton changes over time. Then, we consider the special case of the mode-3 unfolding of Equ. (10) and (11).


where , and are the mode-(3) unfolding of the tensor , and respectively. Transpose on both sides of the Equ. 4.2, we have

Figure 4: Tucker decomposition of a third-order tensor . The column spaces of , and represent the subspaces for the three modes. The core tensor is non-diagonal, accounting for possibly complex interactions among tensor components.

From the definition to unfolding of tensor, we notice the equation and . Now consider the problem of finding the best estimate of in the sense of Frobenius: , = subject to Equ. 4.2. It follows immediately from the fixed rank approximation property of tucker decomposition that an estimation of the subspace mapping matrix and the underlying state sequence is given by setting

Algorithm 1 Learning the gLDS model with n-order tensor time series
Input: n-order tensor time series , dimension of subspaces and the truncation parameter of observation
Output: the subspace S corresponding to
1:  ; Tucker decomposition of
7:   compute SVD of O

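Then $A$ can be determined uniquely, again in the sense of Frobenius, by solving the linear problem $\hat{A} = \arg\min_{A} \| X_{2:\tau} - A X_{1:\tau-1} \|_F$, which is trivially done in closed form using the state estimated from Equ. 15:

$$\hat{A} = X_{2:\tau}\, X_{1:\tau-1}^{\dagger},$$

where $X_{1:\tau-1} = [x_1, \ldots, x_{\tau-1}]$ and $X_{2:\tau} = [x_2, \ldots, x_{\tau}]$ collect the estimated states and $\dagger$ denotes the Moore-Penrose inverse. Given the above estimates of $A$ and $C$, the covariance matrices $Q$ and $R$ can be estimated directly from residuals.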

Starting from the initial state $x_1$, the expected observation sequence of the gLDS is obtained as

$$E \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \end{bmatrix} = \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \end{bmatrix} x_1 = O_{\infty}\, x_1 .$$

Let $\lambda$ denote an arbitrary eigenvalue of $A$. The transition matrix $A$ needs to be stable, with all eigenvalues inside the unit circle, i.e., $|\lambda| < 1$. Therefore, we utilize the constraint generation method of Siddiqi et al. (2008), which achieves a stable result efficiently by iteratively checking the stability criterion and generating new constraints. As proposed in Bissacco et al. (2001), the extended observability matrix can be approximated by the $m$-order observability matrix, which can be written as $O_m^{\top} = [C^{\top}, (CA)^{\top}, \ldots, (CA^{m-1})^{\top}]$. In this way, an action can be identified with the $d$-dimensional subspace spanned by the columns of $O_m$, where $d$ is the dimension of the hidden state. Thus, given a database of videos, we estimate the gLDS parameters as described above for each video. Each action video is thereby identified with a subspace that corresponds to a point on a Grassmann manifold. Algorithm 1 provides the pseudo-code for extracting the subspace using gLDS with an n-order tensor time series.
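The overall procedure of this section can be sketched in NumPy for the three-way case. This is one possible reading of the text, not the authors' algorithm: the function names are hypothetical, the Tucker factors are computed with a truncated higher-order SVD, the observation matrix is assumed to be the Kronecker product of the two spatial factors, the state sequence is assumed to come from the transposed mode-3 unfolding of the core tensor, and the stability constraint on the transition matrix is omitted.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding with the earliest remaining index varying fastest."""
    moved = np.moveaxis(tensor, mode, 0)
    return np.reshape(moved, (tensor.shape[mode], -1), order="F")

def glds_subspace(series, ranks, m):
    """Sketch of subspace extraction for a (I1, I2, tau) tensor time series.

    ranks = (J1, J2, J3) are the assumed Tucker ranks, m the observability order.
    Returns an orthonormal basis S, the transition matrix A and the matrix C."""
    U1, U2, U3 = (
        np.linalg.svd(unfold(series, mode), full_matrices=False)[0][:, :rank]
        for mode, rank in enumerate(ranks)
    )
    C = np.kron(U2, U1)                       # spatial part, independent of tau
    core_3 = U3.T @ unfold(series, 2) @ C     # mode-3 unfolding of the core tensor
    states = core_3.T @ U3.T                  # (J1*J2, tau) state sequence
    A = states[:, 1:] @ np.linalg.pinv(states[:, :-1])
    O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(m)])
    S = np.linalg.svd(O, full_matrices=False)[0]
    return S, A, C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    action = rng.standard_normal((19, 9, 40))  # e.g. 19 rigid bodies, 9 parameters, 40 frames
    S, A, C = glds_subspace(action, ranks=(5, 3, 10), m=4)
    print(S.shape, A.shape, C.shape)
```

In this reading the spatial factors determine C without reference to the sequence length, and only the core tensor and the temporal factor enter the state estimate.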

4.3 Discussion

The method introduced in Section 3.2 has been shown to be a valid approach for learning the parameters of the LDS model. The order- of the LDS model is related to the best rank approximation of , which is determined by the truncation matrices that collect the first components of the SVD in the way indicated by the dashed lines in Fig. 5(a). For , the order- will not be more than . The matrix C, which denotes the spatial appearance, is not associated with . This causes a non-optimal estimation of , especially when the duration of the time series fluctuates greatly.

In this paper, we show that the standard SVD can be replaced by tensor decomposition and unfolding. This is a more natural and flexible decomposition, since it permits us to perform dimension reduction in the spatial structure and temporal components of the video sequence. As shown in Fig. 5(b), the spatial structure and are independent of . The estimation of is only associated with and . Thus the ill-conditioned estimation of C is effectively avoided.

Figure 5: (a) Standard SVD of a matrix Y and its components U, V (unitary matrices) and S (diagonal matrix). The dashed lines indicate the row/column truncation. (b) The standard SVD can be replaced by tensor decomposition and unfolding. After this, is equal with in its dimension.

5 Sparse coding on the Grassmann manifold

In order to represent a subspace as a combination of a few subspaces of a dictionary, a seemingly straightforward method Goh and Vidal (2008); Xie et al. (2013) is to embed Grassmann manifolds into Euclidean spaces via the tangent bundle of the manifold. This method not only requires intensive computation but also yields numerically inaccurate estimates. To avoid these limitations, another common extrinsic method Mehrtash and Richard (2015) performs sparse coding and dictionary learning on Grassmann manifolds by embedding the manifolds into the space of symmetric matrices. Let $\mathcal{PG}(n, d)$ be the set of idempotent and symmetric $n \times n$ matrices of rank $d$. For any $X \in \mathcal{G}(n, d)$, the projection mapping $\Pi: \mathcal{G}(n, d) \to \mathcal{PG}(n, d)$ is given by $\Pi(X) = X X^{\top}$, where the columns of $X$ form an orthonormal basis of the subspace. Since tractable arithmetic and distance computations are hard to define directly on the Grassmann manifold, the problem is recast in this more general space using an important metric, called the chordal metric:

$$\delta_c(X_1, X_2) = \left\| \Pi(X_1) - \Pi(X_2) \right\|_F = \left\| X_1 X_1^{\top} - X_2 X_2^{\top} \right\|_F .$$
This metric will be used to recast the coding and consequently dictionary-learning problem in terms of chordal distance.
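The chordal distance between two subspaces can be computed directly from orthonormal bases, as in the sketch below. The helper names are hypothetical, and the unnormalized form of the metric (the Frobenius norm of the difference of projection matrices) is assumed; some authors include an additional constant factor.

```python
import numpy as np

def orthonormal_basis(matrix):
    """Orthonormal basis of the column space of a full-column-rank matrix."""
    basis, _ = np.linalg.qr(matrix)
    return basis

def chordal_distance(X1, X2):
    """Frobenius distance between the projection matrices X1 X1^T and X2 X2^T.
    Both inputs must have orthonormal columns."""
    return np.linalg.norm(X1 @ X1.T - X2 @ X2.T, "fro")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = orthonormal_basis(rng.standard_normal((10, 3)))
    b = orthonormal_basis(rng.standard_normal((10, 3)))
    print(chordal_distance(a, b))
```

Because the distance depends only on the projection matrices, it does not change when a basis is replaced by another orthonormal basis of the same subspace.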

Formally, given a dictionary $\mathbb{D} = \{D_j\}_{j=1}^{N}$ with $D_j \in \mathcal{G}(n, d)$, a query sample $X \in \mathcal{G}(n, d)$ and coefficients $y = [y_1, \ldots, y_N]^{\top}$, the coding objective function with a penalty term can be written as:

$$\min_{y} \; \Big\| X X^{\top} - \sum_{j=1}^{N} y_j D_j D_j^{\top} \Big\|_F^2 + \lambda \| y \|_1 .$$

Here, the $\ell_1$-norm regularization is applied to the coefficients to ensure sparsity and $\lambda$ is the sparsity penalty parameter. We refer the reader to Mehrtash and Richard (2015) for a general introduction to sparse coding and further mathematical details on their extrinsic solution for sparse coding and dictionary learning in the space of linear subspaces.
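A sparse code under this kind of objective can be computed without forming the large projection matrices, because expanding the Frobenius norm leaves a quadratic in the coefficients whose terms are $\|D_i^{\top} D_j\|_F^2$ and $\|X^{\top} D_j\|_F^2$. The sketch below minimizes that quadratic plus the $\ell_1$ penalty with a plain proximal-gradient (ISTA) loop. The solver, the function name and the default parameters are assumptions chosen for illustration; the cited work may use a different solver.

```python
import numpy as np

def grassmann_sparse_code(query, atoms, lam=0.1, n_iter=500):
    """Sparse coefficients of `query` over a list of dictionary atoms.

    `query` and every atom are (n, d) matrices with orthonormal columns.
    Minimizes ||XX^T - sum_j y_j D_j D_j^T||_F^2 + lam * ||y||_1 via ISTA."""
    # Gram terms: <D_i D_i^T, D_j D_j^T> = ||D_i^T D_j||_F^2.
    K = np.array([[np.linalg.norm(Di.T @ Dj, "fro") ** 2 for Dj in atoms] for Di in atoms])
    k = np.array([np.linalg.norm(query.T @ Dj, "fro") ** 2 for Dj in atoms])
    # Step size from the Lipschitz constant of the gradient 2(Ky - k).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(K)[-1])
    y = np.zeros(len(atoms))
    for _ in range(n_iter):
        z = y - step * 2.0 * (K @ y - k)
        y = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dictionary = [np.linalg.qr(rng.standard_normal((10, 3)))[0] for _ in range(5)]
    print(grassmann_sparse_code(dictionary[2], dictionary, lam=0.01))
```

When the query coincides with one of the atoms and the penalty is small, the coefficient of that atom dominates the code, which is the behavior a classifier built on the codes relies on.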

6 Experiments

In this section, we evaluate the proposed gLDS with tensor time series on three different datasets: MSR-Action3D Li et al. (2010), UTKinect-Action Xia et al. (2012) and the Northwestern-UCLA Multiview Action3D Dataset (http://users.eecs.northwestern.edu/jwa368 /mydata.html).

6.1 Alternative Representations of Tensor time Series

To show the effectiveness of the proposed gLDS model, we compare the performance of the following four representations of tensor time series:

2-order joint positions (2JP): Each action sequence can be viewed as a 2-order tensor time series, where each frame is a vector that concatenates the 3D coordinates of all the joint points.

2-order rigid bodies (2RB): Each action sequence can be viewed as a 2-order tensor time series, where each frame is a vector that concatenates the parameters of all the rigid bodies.

3-order joint positions (3JP): Each action sequence can be viewed as a 3-order tensor time series.

3-order screw motions (3SM): As recently proposed in Vemulapalli et al. (2014), the screw motion between two rigid bodies is represented as a point in $SE(3)$. The Lie algebra of $SE(3)$ is denoted as $\mathfrak{se}(3)$. Mapping the point from $SE(3)$ to $\mathfrak{se}(3)$ yields a 6-dimensional vector representation. Therefore, each action sequence can be viewed as a 3-order tensor time series.

6.2 Evaluation Settings and Parameters

The feature dimension depends on the number of 3D joint points (20 joints for the Microsoft SDK skeleton and 15 for the PrimeSense NiTE skeleton). In the case of the MSR-Action3D, UTKinect-Action and Northwestern-UCLA Multiview Action3D datasets, each skeleton has 19 rigid bodies and 20 joint points. To make the skeletal data invariant to the absolute location of the human in the scene, all 3D joint coordinates are transformed from the world coordinate system to a person-centric coordinate system by placing the hip center at the origin.
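The person-centric normalization described above amounts to subtracting the hip-center position from every joint in each frame, as in this small sketch. The function name and the default `hip_index` are hypothetical; the actual index of the hip center depends on the skeleton format of the dataset.

```python
import numpy as np

def center_on_hip(joints, hip_index=0):
    """Translate each frame so that the hip-center joint sits at the origin.

    `joints` has shape (T, N, 3); `hip_index` is the assumed position of the
    hip center in the joint list."""
    joints = np.asarray(joints, dtype=float)
    return joints - joints[:, hip_index:hip_index + 1, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.standard_normal((30, 20, 3))
    print(center_on_hip(sequence)[:, 0].max())
```

After this step, translating the whole skeleton in the scene leaves the features unchanged.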

The dynamic of action is captured by using gLDS model with tensor time series. In this process, the size of core tensor can be significantly smaller than for the tensor time series . We set , and in order to approximate the value of original tensor time series, where is the subspace dimension and represents the truncation parameter of time series. Each action sequence is a point on the Grassmann manifold with while skeleton is represented as 3RB.

| Method | AS1 | AS2 | AS3 | Overall |
|---|---|---|---|---|
| Bag of 3D Points Li et al. (2010) | 72.9 | 71.9 | 79.2 | 74.7 |
| HOJ3D Xia et al. (2012) | 88.0 | 85.5 | 63.3 | 78.9 |
| EigenJoints Yang and Tian (2014) | 74.5 | 76.1 | 96.4 | 83.3 |
| LARP Vemulapalli et al. (2014) | 95.29 | 83.87 | 98.22 | 92.46 |
| 2JP-LDS | 88.34 | 87.82 | 98.11 | 91.42 |
| 2RB-LDS | 90.24 | 88.6 | 96.49 | 91.78 |
| 2JP-gLDS | 95.04 | 87.17 | 98.68 | 93.63 |
| 2RB-gLDS | 96.21 | 88.13 | 96.12 | 93.49 |
| 3SM-gLDS | 94.64 | 86.55 | 98.65 | 93.28 |
| 3JP-gLDS | 95.31 | 87.24 | 98.65 | 93.73 |
| 3RB-gLDS | 96.81 | 89.14 | 98.83 | 94.85 |

Table 1: Comparison: Recognition rate (%) on the MSR-Action3D dataset in cross-subject setting based on AS1, AS2, and AS3.

| Method | LARP Vemulapalli et al. (2014) | LTBSVM Slama et al. (2015) | 3RB-gLDS |
|---|---|---|---|
| Accuracy | 89.48 | 91.21 | 94.96 |

Table 2: Recognition rate (%) on the MSR-Action3D dataset based on experimental protocol of Wang et al. (2014a)

6.3 MSR-Action 3D Dataset

The MSR Action3D Dataset Li et al. (2010) is an action dataset of depth sequences captured by a depth camera. Following the experimental protocol of Li et al. (2010), the dataset was divided into subsets AS1, AS2 and AS3, each consisting of 8 actions. AS1 and AS2 group actions with similar movements, while AS3 groups complex actions with more joints engaged. We performed recognition on each subset separately using the cross-subject test setting, in which one half of the subjects was used for training and the other half for testing. We repeated the experiment with different subspace dimensions and report the mean performance. Table 1 compares the proposed approach to methods that use 3D joint positions extracted from depth videos. Our approach using gLDS achieves an overall accuracy of 94.85% on MSR-Action3D in the cross-subject experiment, outperforming the other joint-based action recognition methods, including Histogram of 3D Joints (HOJ3D) Xia et al. (2012), EigenJoints Yang and Tian (2014) and Lie Algebra Relative Pairs (LARP) Vemulapalli et al. (2014), which achieved accuracies of 78.9%, 83.3% and 92.46%, respectively. The overall accuracy of 3RB-gLDS is 3.07 percentage points higher than that of 2RB-LDS. The better performance on subsets AS1, AS2 and AS3 indicates that the proposed gLDS is better than the other methods at differentiating similar and complex actions.

Figure 6: Recognition rate variation with learning approach and subspace dimension.

Following the experimental protocol of Wang et al. (2014a), in which the method is applied to the entire dataset of 20 actions instead of dividing it into three subsets, our method achieves a total accuracy of 94.96%, as shown in Table 2. This experimental setting is more difficult than that of Li et al. (2010). To evaluate the effect of the subspace dimension, we conduct several tests on the MSR-Action3D dataset with different subspace dimensions. Fig. 6 shows the variation of recognition performance with the subspace dimension. We remark that up to dimension 16, the recognition rate generally increases with the subspace dimension. This is expected, since a small dimension causes a lack of information, whereas a large subspace dimension retains noise and causes confusion between classes. In this figure, we also compare our newly introduced learning algorithm with 3RB to 2RB and LTBSVM Slama et al. (2015).

Figure 7 shows the confusion matrix for MSR-Action3D. For most actions (about 14 of the 20 classes), the video sequences are 100% correctly classified. Classification errors occur when two actions are highly similar to each other, such as high arm wave and hand catch.

Figure 7: The confusion matrix for MSR-Action3D dataset.

6.4 UT-Kinect Dataset

Sequences of the UT-Kinect dataset are taken using a stationary Kinect sensor. To allow for comparison, we followed the experimental protocol proposed by Xia et al. (2012), which is Leave One Sequence Out Cross Validation (LOOCV) on the 199 sequences. For each test, one sequence is used for testing and the other 198 sequences are used for training. Table 3 compares the recognition accuracy of our approach with that of other methods such as HOJ3D Xia et al. (2012) and LTBSVM Slama et al. (2015). The accuracy of each action is above 80%, and the overall accuracy of our approach is 5.56 and 7.98 percentage points higher than HOJ3D Xia et al. (2012) and LTBSVM Slama et al. (2015), respectively. This means that our approach is effective and robust to changes in action type thanks to the gLDS learning approach.

| Action | LTBSVM Slama et al. (2015) | HOJ3D Xia et al. (2012) | 3RB-gLDS |
|---|---|---|---|
| Walk | 100 | 96.5 | 85 |
| Stand up | 100 | 91.5 | 100 |
| Pick up | 100 | 97.5 | 100 |
| Carry | 100 | 97.5 | 95 |
| Wave | 100 | 100 | 85 |
| Throw | 60 | 59 | 95 |
| Push | 65 | 81.5 | 100 |
| Sit down | 80 | 91.5 | 100 |
| Pull | 85 | 92.5 | 100 |
| Clap hands | 95 | 100 | 100 |
| Overall | 88.5 | 90.92 | 96.48 |

Table 3: Recognition rates (%) on UT-Kinect Dataset using the experiment protocol of Xia et al. (2012)

6.5 Northwestern-UCLA Multiview Action3D Dataset

The Northwestern-UCLA Multiview 3D event dataset contains RGB, depth and human skeleton data captured simultaneously by 3 Kinect cameras. Thus, each action sequence, having 3 different views, can be represented as a 4-order tensor time series, 4RB. This helps us to capture the embedded variation across views. We follow the experimental protocol of Wang et al. (2014b), which uses the samples from 9 subjects as training data and leaves out the samples from 1 subject as testing data. Table 4 compares the recognition accuracy of our proposed gLDS with that of other approaches. Under the cross-subject setting, our method achieves an accuracy about 11.4 percentage points higher than MST-AOG Wang et al. (2014b). In contrast, under the cross-view setting, the overall accuracy of our proposed method is not greatly improved, being only about 1.3 percentage points higher than MST-AOG Wang et al. (2014b). This can be explained by the fact that our approach, which is based only on the 3D coordinates of joint points, is not sufficient to find view-invariant features.

| Method | C-Subject | C-View |
|---|---|---|
| Virtual View Li and Zickler (2012) | 50.7 | 47.8 |
| Hankelet Li et al. (2012) | 54.2 | 45.2 |
| Poselet Sadanand and Corso (2012) | 54.9 | 24.5 |
| MST-AOG Wang et al. (2014b) | 81.6 | 73.3 |
| Proposed | 92.99 | 74.6 |

Table 4: Recognition rates (%) on Multiview-3D dataset

7 Conclusions and Future Perspectives

In this paper, we have developed a novel action representation, the gLDS model, that takes a 3D skeleton sequence as a tensor time series without unfolding the human skeletons into column vectors. It takes advantage of Tucker decomposition to estimate the parameters of the gLDS model as action descriptors. Our extensive experiments have demonstrated that gLDS significantly improves the accuracy and robustness of cross-subject action recognition. The major contributions include several skeleton-based tensor representations. The next step is to apply gLDS to multi-person interactions.


This work was supported in part by the National Natural Science Foundation of China under Grant No. 61571345, the National Natural Science Foundation of China under Grant No. 9153801, and the National Natural Science Foundation of China under Grant No. 61550110247.



  • Shotton et al. (2013) J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, R. Moore, Real-time human pose recognition in parts from single depth images, Communications of the ACM 56 (1) (2013) 116–124.
  • Turaga et al. (2010) P. Turaga, A. Veeraraghavan, A. Srivastava, R. Chellappa, Statistical analysis on manifolds and its applications to video analysis, in: Video search and mining, Springer, 115–144, 2010.
  • Aggarwal and Xia (2014) J. Aggarwal, L. Xia, Human activity recognition from 3d data: A review, Pattern Recognition Letters 48 (2014) 70–80.
  • Presti and La Cascia (2016) L. Lo Presti, M. La Cascia, 3D skeleton-based human action classification: A survey, Pattern Recognition 53 (2016) 130–147.
  • Doretto et al. (2003) G. Doretto, A. Chiuso, Y. N. Wu, S. Soatto, Dynamic textures, International Journal of Computer Vision 51 (2) (2003) 91–109.
  • Yang and Tian (2014) X. Yang, Y. Tian, Effective 3d action recognition using eigenjoints, Journal of Visual Communication and Image Representation 25 (1) (2014) 2–11.
  • Li et al. (2010) W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, IEEE, 9–14, 2010.
  • Vemulapalli et al. (2014) R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3d skeletons as points in a lie group, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 588–595, 2014.
  • Xia et al. (2012) L. Xia, C.-C. Chen, J. Aggarwal, View invariant human action recognition using histograms of 3d joints, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, IEEE, 20–27, 2012.
  • Chaudhry et al. (2013) R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, R. Vidal, Bio-inspired dynamic 3D discriminative skeletal features for human action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 471–478, 2013.
  • Slama et al. (2015) R. Slama, H. Wannous, M. Daoudi, A. Srivastava, Accurate 3D action recognition using learning on the Grassmann manifold, Pattern Recognition 48 (2) (2015) 556–567.
  • Bissacco et al. (2001) A. Bissacco, A. Chiuso, Y. Ma, S. Soatto, Recognition of human gaits, in: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 2, IEEE, II–52, 2001.
  • Siddiqi et al. (2008) S. M. Siddiqi, B. Boots, G. J. Gordon, A constraint generation approach to learning stable linear dynamical systems, in: Advances in Neural Information Processing Systems, 471–478, 2008.
  • Goh and Vidal (2008) A. Goh, R. Vidal, Clustering and dimensionality reduction on Riemannian manifolds, in: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 1–7, 2008.
  • Xie et al. (2013) Y. Xie, J. Ho, B. Vemuri, On a nonlinear generalization of sparse coding and dictionary learning, in: Journal of Machine Learning Research: Workshop and Conference Proceedings, 1480–1488, 2013.
  • Mehrtash and Richard (2015) H. Mehrtash, H. Richard, Extrinsic Methods for Coding and Dictionary Learning on Grassmann Manifolds, International Journal of Computer Vision 51 (2) (2015) 91–109.
  • Wang et al. (2014a) J. Wang, Z. Liu, Y. Wu, Learning Actionlet Ensemble for 3D Human Action Recognition, in: Human Action Recognition with Depth Cameras, Springer, 11–40, 2014a.
  • Wang et al. (2014b) J. Wang, X. Nie, Y. Xia, Y. Wu, S.-C. Zhu, Cross-view action modeling, learning and recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2649–2656, 2014b.
  • Li and Zickler (2012) R. Li, T. Zickler, Discriminative virtual views for cross-view action recognition, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2855–2862, 2012.
  • Li et al. (2012) B. Li, O. I. Camps, M. Sznaier, Cross-view activity recognition using hankelets, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 1362–1369, 2012.
  • Sadanand and Corso (2012) S. Sadanand, J. J. Corso, Action bank: A high-level representation of activity in video, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 1234–1241, 2012.