I Introduction
3D action recognition uses the information of the 3D coordinates of body joints [28]. The action recognition methods in the literature can be divided into two main categories [25]. The first category treats the joints as time series and extracts features from them, e.g., using Dynamic Time Warping (DTW) [9]. The second category classifies every frame into one of several predefined body poses and uses a probabilistic approach to recognize the sequence as an action [17].
Some action recognition methods are based on convolutional networks [22, 33]. Other methods take a manifold-based or subspace-based approach, in which the actions lie on subspaces of the data [21]. Riemannian manifolds [5] and Grassmann manifolds [30] have been used extensively for action recognition. Some methods embed actions in sparse manifolds [26] to make use of the betting-on-sparsity principle [20].
Some of the action recognition methods use basic subspace learning methods for recognizing poses or actions. For example, histograms of 3D joint locations (HOJ3D) [34] utilize Linear Discriminant Analysis (LDA), which is equivalent to Fisher Discriminant Analysis (FDA) [10]. LDA is also used in [27] for recognizing involuntary actions. Fisherposes is a method which uses FDA [20, 7] for embedding and recognizing the body poses [17]. The reader can refer to [17] and the references therein for action recognition papers which use LDA or FDA. FDA is very useful for pose embedding and action recognition because it maximizes the interclass scatter and minimizes the intraclass scatter of the body poses [13].
There exist many subspace learning methods which are based on the generalized eigenvalue problem [12]. Some examples are Principal Component Analysis (PCA) [23, 11], Supervised PCA (SPCA) [2], and FDA [13]. The PCA subspace maximizes the scatter of the projected data, while the SPCA subspace maximizes the statistical dependence between the projected data and the class labels. Roweis Discriminant Analysis (RDA) [15, 16] is a family containing an infinite number of subspace learning methods based on the generalized eigenvalue problem. It generalizes PCA, SPCA, and FDA, and also includes Double Supervised Discriminant Analysis (DSDA) [16].
Basic subspace learning methods, such as PCA, SPCA, and FDA, can be used in 3D action recognition for embedding poses and classifying actions. Various face recognition methods using basic subspace learning approaches have appeared [1]; some examples are eigenfaces [32], Fisherfaces [4], supervised eigenfaces [2, 11], and double supervised eigenfaces [16]. A similar need for methods based on basic subspace learning is felt in the action recognition literature.
In this paper, we propose Roweisposes for action recognition. This method is based on basic subspace learning approaches which make use of the generalized eigenvalue problem [12]. The method, which uses RDA [16], generalizes the Fisherposes method [17]. Some of the special cases of Roweisposes are based on PCA, SPCA, and DSDA, and we name them eigenposes, supervised eigenposes, and double supervised eigenposes, respectively. The Roweisposes method includes an infinite number of subspace learning methods for embedding the body poses for action recognition.
The remainder of this paper is organized as follows. Section II reviews FDA, how it can be restated using the total scatter, and the Fisherposes method. Section III reviews the theory of PCA and SPCA as well as RDA and the Roweis map. The Roweisposes method is proposed in Section IV, where preprocessing, training the Roweisposes subspace, pose recognition, windowing, training the hidden Markov model, and action recognition are detailed. The experiments are reported in Section V to show the effectiveness of the proposed method. Finally, Section VI concludes the paper and enumerates future work.

II Review of Fisherposes

II-A Fisher Discriminant Analysis
Let X := [x_1, …, x_n] ∈ R^{d×n} be a dataset of sample size n and dimensionality d. We denote the number of classes by |C|, the sample size of the j-th class by n_j, and the i-th instance of the j-th class by x_i^{(j)}. FDA [20, 13], first proposed in [7], maximizes the interclass (between) scatter and minimizes the intraclass (within) scatter of the projected data:
(1)   maximize_U   tr(U^T S_B U)
      subject to   U^T S_W U = I
where S_B and S_W are the between and within scatters, respectively, defined as:
(2)   S_B := Σ_{j=1}^{|C|} n_j (μ_j − μ)(μ_j − μ)^T
(3)   S_W := Σ_{j=1}^{|C|} Σ_{i=1}^{n_j} (x_i^{(j)} − μ_j)(x_i^{(j)} − μ_j)^T
and the total mean and the mean of the j-th class are μ := (1/n) Σ_{i=1}^{n} x_i and μ_j := (1/n_j) Σ_{i=1}^{n_j} x_i^{(j)}, respectively. The solution to Eq. (1) is the generalized eigenvalue problem S_B u = λ S_W u [12].
The total scatter can be considered as the summation of the between and within scatters [35]:
(4)   S_T = S_B + S_W
where the covariance matrix, or the total scatter, is defined as:
(5)   S_T := Σ_{i=1}^{n} (x_i − μ)(x_i − μ)^T = X H H^T X^T = X H X^T
where H ∈ R^{n×n} is the centering matrix:
(6)   H := I − (1/n) 1 1^T
with I as the identity matrix and 1 as the vector of ones. The last equality in Eq. (5) holds because the centering matrix is symmetric and idempotent. Hence, the Fisher criterion can be written as:
(7)   tr(U^T S_B U) / tr(U^T S_W U) = tr(U^T S_T U) / tr(U^T S_W U) − 1
The −1 is a constant and can be dropped in the optimization problem because the optimal variable, and not the optimal objective value, is the goal; therefore, the optimization in FDA can be expressed as:
(8)   maximize_U   tr(U^T S_T U)
      subject to   U^T S_W U = I
whose solution is the generalized eigenvalue problem S_T u = λ S_W u [12]. Hence, the FDA subspace is spanned by the eigenvectors of this generalized eigenvalue problem.
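The restated FDA above can be sketched numerically in a few lines. This is an illustration we add here (the toy two-class data and variable names are ours, not part of the paper): the scatters of Eqs. (2)–(5) are built and the generalized eigenvalue problem of Eq. (8) is solved with SciPy.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
# two toy Gaussian classes in 3 dimensions, separated along the first axis
X1 = rng.standard_normal((50, 3)) * 0.5 + np.array([3.0, 0.0, 0.0])
X2 = rng.standard_normal((50, 3)) * 0.5 + np.array([-3.0, 0.0, 0.0])
X = np.vstack([X1, X2])                  # rows are samples here
mu = X.mean(axis=0)

# between, within, and total scatters, Eqs. (2)-(5)
S_B = np.zeros((3, 3)); S_W = np.zeros((3, 3))
for Xc in (X1, X2):
    mc = Xc.mean(axis=0)
    S_B += len(Xc) * np.outer(mc - mu, mc - mu)
    S_W += (Xc - mc).T @ (Xc - mc)
S_T = S_B + S_W

# Eq. (8): maximize tr(U^T S_T U) s.t. U^T S_W U = I
# -> generalized eigenvalue problem S_T u = lambda S_W u
eigvals, eigvecs = eigh(S_T, S_W)        # eigenvalues in ascending order
u = eigvecs[:, -1]                       # leading FDA direction

# the leading direction is dominated by the discriminative first axis
assert abs(u[0]) > abs(u[1]) and abs(u[0]) > abs(u[2])
```

The leading generalized eigenvector recovers the axis along which the classes differ, which is exactly the discriminative behavior of FDA described above.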
II-B Fisherposes
The Fisherposes method [17] is an action recognition approach which uses 3D skeletal data as input and constructs a Fisher subspace, for the discrimination of body poses, using FDA. Instead of using the raw data, it applies some preprocessing to the 3D data. These preprocessing steps include skeleton alignment by translating the hip joint to the origin and aligning the shoulders to cancel the orientation of the body. Moreover, the scale of the skeleton is removed and some informative joints are selected amongst all the available joints.
After the preprocessing step, different body poses are selected out of the dataset such that every action can be decomposed into a sequence of some of these poses. The body poses are considered as classes, and the instances of a body pose are used as the data of that class. The information of the joints in a body pose is concatenated to form a vector. Using these data vectors, the FDA subspace is trained to discriminate the body poses of an action recognition dataset.
Using the Euclidean distance of a projected frame in the FDA subspace from the projections of the training data, the pose of a frame is recognized. Windowing is also applied to eliminate the frames which do not belong well enough to any of the poses; the distance of the projection in the subspace is used as the criterion for windowing. Finally, a Hidden Markov Model (HMM) [14] is used to learn the sequences of recognized poses as different actions. For some datasets in which some actions contain similar poses when the movement of the body is not considered, a histogram of trajectories is also used to discriminate those actions.
III Roweis Discriminant Analysis

III-A PCA and SPCA
PCA [23, 11] finds a subspace which maximizes the scatter of the projected data. Its optimization is:
(9)   maximize_U   tr(U^T S_T U)
      subject to   U^T U = I
where S_T is the total scatter defined in Eq. (5). The solution to Eq. (9) is the eigenvalue problem S_T u = λ u [12]. Hence, the PCA subspace is spanned by the eigenvectors of the total scatter.
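As a quick numerical sketch of PCA on the total scatter (an illustration we add here with toy data, not part of the original experiments), the following also verifies the centering-matrix identities of Eqs. (5) and (6):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 4
X = rng.standard_normal((d, n))          # columns are samples
X[0] *= 5.0                              # inflate the variance of the first feature

# centering matrix, Eq. (6); it is symmetric and idempotent
H = np.eye(n) - np.ones((n, n)) / n
assert np.allclose(H, H.T) and np.allclose(H @ H, H)

# total scatter, Eq. (5): X H X^T equals the scatter around the mean
mu = X.mean(axis=1, keepdims=True)
S_T = X @ H @ X.T
assert np.allclose(S_T, (X - mu) @ (X - mu).T)

# PCA directions, Eq. (9): eigenvectors of the total scatter
eigvals, eigvecs = np.linalg.eigh(S_T)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                      # leading principal direction
assert abs(u1[0]) > 0.9                  # aligns with the high-variance feature
```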
SPCA [2] uses the empirical estimation of the Hilbert-Schmidt Independence Criterion (HSIC) [19]:
(10)   HSIC := (1/(n−1)^2) tr(K_1 H K_2 H)
where K_1 and K_2 are the kernels over the first and second random variables, respectively. The idea of HSIC is to measure the dependence of two random variables by computing the correlation of their values pulled to the Hilbert space. SPCA uses HSIC to maximize the dependence of the projected data U^T X and the labels Y. It uses the linear kernel K_1 = X^T U U^T X for the projected data and an arbitrary valid kernel K_y for the labels Y. Therefore, the scaled Eq. (10) in SPCA is tr(U^T X H K_y H X^T U), where the rearrangement is because of the cyclic property of the trace. The optimization problem in SPCA is:
(11)   maximize_U   tr(U^T X H K_y H X^T U)
       subject to   U^T U = I
where K_y is the kernel matrix over the labels of the data, for either classification or regression. The solution to Eq. (11) is the eigenvalue problem for X H K_y H X^T [12]. Hence, the SPCA directions are the eigenvectors of X H K_y H X^T.
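A small sketch of SPCA for classification (our own illustration; the delta kernel over class labels used below is one common choice, assumed here for concreteness):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 4
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((d, n))
X[1] += 4.0 * y                          # only the second feature depends on the label

H = np.eye(n) - np.ones((n, n)) / n
K_y = (y[:, None] == y[None, :]).astype(float)   # delta kernel over labels

# SPCA directions, Eq. (11): eigenvectors of X H K_y H X^T
M = X @ H @ K_y @ H @ X.T
eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
u1 = eigvecs[:, -1]

# the leading direction picks the label-dependent feature
assert abs(u1[1]) > 0.9
```

With the delta kernel, X H K_y H X^T reduces to a weighted between-class scatter, so the leading SPCA direction finds the feature that carries the label information.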
III-B RDA and the Roweis Map
Comparing Eqs. (8), (9), and (11) shows that they follow a general form of optimization:
(12)   maximize_U   tr(U^T A U)
       subject to   U^T B U = I
whose solution is the generalized eigenvalue problem A u = λ B u [12].
Following this general form, RDA [15, 16] solves the optimization problem:
(13)   maximize_U   tr(U^T R_1 U)
       subject to   U^T R_2 U = I
where R_1 and R_2 are the first and second Roweis matrices, which are:
(14)   R_1 := X H P H X^T
(15)   R_2 := (1 − r_2) I + r_2 S_W
respectively, where:
(16)   P := (1 − r_1) I + r_1 K_y
The variables r_1 ∈ [0, 1] and r_2 ∈ [0, 1] are the first and second Roweis factors. Changing the Roweis factors gives a Roweis map including an infinite number of subspace learning methods. This map, as well as the supervision level r := (r_1 + r_2)/2, is illustrated in Fig. 1. As this figure shows, four extreme special cases of RDA are PCA (with r_1 = 0, r_2 = 0) [23, 11], SPCA (with r_1 = 1, r_2 = 0) [2], FDA (with r_1 = 0, r_2 = 1) [7, 13], and Double Supervised Discriminant Analysis (DSDA) (with r_1 = 1, r_2 = 1) [15, 16]. The RDA subspace is spanned by the eigenvectors of the generalized eigenvalue problem R_1 u = λ R_2 u [12].
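To make the Roweis map concrete, here is a small sketch in the spirit of RDA. It is our own illustration and it assumes an interpolated form in which the first Roweis matrix blends the total scatter with the SPCA-style supervised matrix via the first factor, and the second blends the identity with the within scatter via the second factor; the exact definitions are in the RDA papers [15, 16].

```python
import numpy as np
from scipy.linalg import eigh

def roweis_subspace(X, y, r1, r2, p=1, eps=1e-6):
    """Sketch of an RDA-style subspace (assumed interpolated Roweis matrices).

    X: (d, n) data with samples as columns; y: (n,) integer labels.
    r1, r2: Roweis factors in [0, 1]; p: subspace dimensionality.
    """
    d, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    K_y = (y[:, None] == y[None, :]).astype(float)   # delta kernel over labels
    P = (1 - r1) * np.eye(n) + r1 * K_y
    R1 = X @ H @ P @ H @ X.T                         # assumed first Roweis matrix
    S_W = np.zeros((d, d))                           # within scatter
    for c in np.unique(y):
        Xc = X[:, y == c]
        Xc = Xc - Xc.mean(axis=1, keepdims=True)
        S_W += Xc @ Xc.T
    R2 = (1 - r2) * np.eye(d) + r2 * S_W             # assumed second Roweis matrix
    R2 += eps * np.eye(d)                            # regularize for stability
    eigvals, eigvecs = eigh(R1, R2)                  # generalized eigenproblem
    return eigvecs[:, -p:]                           # leading eigenvectors

rng = np.random.default_rng(4)
y = np.repeat([0, 1], 30)
X = rng.standard_normal((3, 60))
X[0] += 5.0 * y                                     # classes differ on the first axis
U_pca = roweis_subspace(X, y, r1=0.0, r2=0.0)       # PCA-like corner of the map
U_fda = roweis_subspace(X, y, r1=0.0, r2=1.0)       # FDA-like corner of the map
assert U_pca.shape == (3, 1)
# the FDA-like corner is dominated by the discriminative axis
assert abs(U_fda[0, 0]) > abs(U_fda[1, 0])
```

Sliding the two factors between 0 and 1 traverses the Roweis map, so every intermediate pair yields another subspace learning method.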
IV Roweisposes for 3D Action Recognition

We propose Roweisposes for 3D action recognition, which generalizes the Fisherposes method (based on FDA). In addition to this generalization, it also includes new methods for action recognition, namely eigenposes (based on PCA), supervised eigenposes (based on SPCA), and double supervised eigenposes (based on DSDA). The Roweisposes method uses the 3D skeletal information of the frames, and its subspace discriminates the body poses of an action dataset.
IV-A Preprocessing and Data Preparation
Inspired by Fisherposes [17], some preprocessing steps are applied to the joints. These steps are translating the hip to the origin, shoulder alignment, removing the scale of the joints, and selecting the most informative joints. The details of these steps can be found in [17] and are not repeated here for the sake of brevity. Different body poses are selected from the actions, either manually [17] or automatically [18]. We take the body poses as the classes to be discriminated by the RDA subspace. The data for training this subspace are created by vectorizing the joint information of the frames indicating the body poses. In other words, if the number of selected joints is denoted by q and the 3D coordinates of the j-th joint are denoted by p_j ∈ R^3, the body pose in a frame is vectorized as:
(17)   x := [p_1^T, p_2^T, …, p_q^T]^T
hence, x ∈ R^d with d = 3q.
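The concatenation of Eq. (17) can be sketched as follows (the joint coordinates here are made-up numbers for illustration):

```python
import numpy as np

# hypothetical 3D coordinates of q = 4 selected joints in one frame
joints = np.array([
    [0.0, 0.9, 0.1],     # head
    [0.0, 0.5, 0.1],     # hip (translated near the origin)
    [-0.2, 0.7, 0.1],    # left shoulder
    [0.2, 0.7, 0.1],     # right shoulder
])

# Eq. (17): stack the q joint coordinates into a single 3q-dimensional vector
pose_vector = joints.reshape(-1)
assert pose_vector.shape == (3 * len(joints),)   # here, d = 12
```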
IV-B Training the Roweisposes Subspace
Using the vectorized frames of the body poses and their pose labels, we train an RDA subspace for discriminating the poses. This subspace is named the Roweisposes subspace. It is spanned by the eigenvectors of the generalized eigenvalue problem R_1 u = λ R_2 u according to Eq. (13).
After training this subspace, every training body frame, indexed by i, is projected onto this subspace:
(18)   x̃_i := U^T x_i
where x̃_i ∈ R^p denotes the projected data, p is the dimensionality of the subspace, and U ∈ R^{d×p} is the projection matrix whose columns are the p leading eigenvectors of the generalized eigenvalue problem R_1 u = λ R_2 u.
IV-C Pose Recognition
The body pose of an unknown train/test frame, x, is then recognized using the Euclidean distances from the means of the projected classes in the Roweisposes subspace. The minimum distance determines the body pose:
(19)   ĉ := arg min_{j ∈ {1, …, |C|}} ‖U^T x − μ̃_j‖_2
where ‖·‖_2 denotes the ℓ_2 norm, |C| is the number of classes (poses), and μ̃_j is the mean of the projection of the j-th class onto this subspace:
(20)   μ̃_j := (1/n_j) Σ_{i=1}^{n_j} U^T x_i^{(j)}
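Pose recognition by Eqs. (19) and (20) is a nearest-class-mean rule in the subspace. A minimal sketch (the projection matrix and the toy pose data below are stand-ins we introduce for illustration, not the trained Roweisposes directions):

```python
import numpy as np

rng = np.random.default_rng(5)
U = np.eye(6)[:, :2]    # stand-in projection matrix (d = 6 -> p = 2)

# toy training pose vectors for |C| = 3 pose classes, separated along axis 0
train = {c: rng.standard_normal((20, 6)) + np.eye(6)[0] * 5.0 * c
         for c in range(3)}

# Eq. (20): class means of the projected training data
means = {c: (Xc @ U).mean(axis=0) for c, Xc in train.items()}

def recognize_pose(x):
    """Eq. (19): nearest projected class mean in Euclidean distance."""
    x_proj = U.T @ x
    return min(means, key=lambda c: np.linalg.norm(x_proj - means[c]))

# frames near the class-2 and class-0 regions are recognized accordingly
assert recognize_pose(np.array([10.0, 0, 0, 0, 0, 0])) == 2
assert recognize_pose(np.zeros(6)) == 0
```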
IV-D Windowing
Some of the frames are middle frames, i.e., transitions between two poses. It is better to remove these frames from the sequence which is going to be fed to the HMM; this helps the HMM learn the sequences of poses more accurately. However, this elimination of frames may make some sequences very short. In order to have a trade-off between an appropriate sequence length and the removal of inaccurate frames, we use windowing: we do not let all the frames within a window be removed. A similar windowing approach to that of Fisherposes is used here (see [17, Algorithm 1]).
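The windowing idea can be sketched as follows. This is a simplified illustration of the trade-off only (the exact procedure is [17, Algorithm 1]): frames whose subspace distance to the nearest pose exceeds a threshold are candidates for removal, but each window always keeps at least its best frame.

```python
def window_filter(distances, window=5, threshold=1.0):
    """Keep indices of confident frames; never empty an entire window.

    distances[i]: distance of frame i to its nearest pose in the subspace.
    Simplified sketch of windowing; the exact rule is [17, Algorithm 1].
    """
    kept = []
    for start in range(0, len(distances), window):
        chunk = list(range(start, min(start + window, len(distances))))
        confident = [i for i in chunk if distances[i] <= threshold]
        if confident:
            kept.extend(confident)
        else:
            # keep the single best frame so the window is not emptied
            kept.append(min(chunk, key=lambda i: distances[i]))
    return kept

# frames 2-4 are transition frames with large distances and are dropped
dists = [0.2, 0.4, 3.0, 2.5, 2.8, 0.3, 0.1]
print(window_filter(dists, window=5, threshold=1.0))   # -> [0, 1, 5, 6]
```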
IV-E Training the HMM Model
The next step is to train an HMM [14] using the poses of the training body frames recognized with the Roweisposes subspace. The training frames of every sequence are projected onto the subspace and their poses are recognized using Eq. (19). The sequences of recognized poses are fed to the HMM, and expectation maximization through the Baum-Welch algorithm [3] is employed to train the model.

IV-F Action Recognition
In the test phase, the frames of the test sequence are projected onto the Roweisposes subspace and their poses are recognized using Eq. (19). The sequence of recognized poses is then fed to the HMM. For the test phase of the HMM, we use the Viterbi algorithm to determine the action of the test sequence [24]. The action with the highest likelihood is chosen as the action of that test sequence [14].
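The decoding step can be illustrated with a bare-bones Viterbi sketch (a generic textbook implementation with toy models we invent here, not the exact HMM configuration of the method): each trained action HMM scores the best state path explaining the recognized pose sequence, and the action with the highest likelihood wins.

```python
import numpy as np

def viterbi_loglik(obs, log_pi, log_A, log_B):
    """Log-likelihood of the best state path for an observation sequence.

    log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, O) emission log-probs; obs: sequence of observation indices.
    """
    delta = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # best predecessor for every state, then emit the observation
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return np.max(delta)

# two toy action models over 2 hidden states and 3 pose symbols
log = np.log
pi = log(np.array([0.5, 0.5]))
A = log(np.array([[0.8, 0.2], [0.2, 0.8]]))
B_walk = log(np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]))  # favors poses 0, 1
B_sit  = log(np.array([[0.05, 0.05, 0.9], [0.9, 0.05, 0.05]]))  # favors poses 2, 0

poses = [0, 0, 1, 1]          # recognized pose sequence of a test clip
scores = {"walk": viterbi_loglik(poses, pi, A, B_walk),
          "sit":  viterbi_loglik(poses, pi, A, B_sit)}
assert max(scores, key=scores.get) == "walk"
```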
V Experiments

V-A Datasets
For validating the effectiveness of the proposed Roweisposes, we used three publicly available datasets, following [17]. In the following, we introduce the characteristics of these datasets.
V-A1 TST Dataset
The first dataset is the TST fall detection dataset [8], which includes two categories of actions: normal and fall. The normal actions are sitting, grasping, walking, and lying down; the fall actions are falling front, falling back, falling side, and falling backward ending up sitting. The number of subjects performing the actions is 11.
V-A2 UTKinect Dataset
The UTKinect dataset [34] contains 10 actions which are walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving, and clapping hands. This dataset has 10 subjects performing these actions.
V-A3 UCFKinect Dataset
The UCFKinect dataset [6] includes 16 actions, which are balancing, climbing ladder, climbing up, ducking, hopping, kicking, leaping, punching, running, stepping back, stepping front, stepping left, stepping right, twisting left, twisting right, and vaulting. The number of subjects in this dataset is 16.
V-B Experimental Setup
The joints selected from the skeletal data of the three datasets followed the selections of the Fisherposes method; the reader can refer to [17, Fig. 2] for the selected joints in these datasets. The selected poses for the three datasets follow [17, Fig. 5]. For the experiments, we used leave-one-person-out cross validation. The hyperparameters for windowing and the histogram of trajectories were determined according to [17]. The Euclidean distance was used for recognizing the body poses after projection onto the Roweisposes subspace; employing the regularized Mahalanobis distance is deferred to future work (see Section VI).

TABLE I: Recognition accuracies on the TST, UTKinect, and UCFKinect datasets.

Method                                     | TST    | UTKinect | UCFKinect
Grassmann manifold [30]                    | –      | 88.50%   | 97.91%
HOJ3D [34]                                 | 70.83% | 90.92%   | –
Kinetic energy [29]                        | 84.09% | –        | –
CRR [31]                                   | 89.39% | 95.98%   | –
Riemannian manifold [5]                    | –      | 91.50%   | –
MDTW [9]                                   | 92.30% | 96.80%   | 97.90%
Sparseness embedding [26]                  | 94.27% | 92.00%   | –
Roweisposes (eigenposes)                   | 81.44% | 38.50%   | 87.19%
Roweisposes (Fisherposes) [17]             | 76.14% | 82.50%   | 79.22%
Roweisposes (supervised eigenposes)        | 82.20% | 70.50%   | 86.80%
Roweisposes (double supervised eigenposes) | 76.52% | 79.00%   | 71.72%
Roweisposes (middle case)                  | 79.17% | 83.50%   | 80.02%
Roweisposes (middle case)                  | 80.68% | 82.50%   | 86.25%
Roweisposes (middle case)                  | 79.92% | 41.00%   | 88.36%
Roweisposes (middle case)                  | 80.30% | 82.50%   | 69.45%
Roweisposes (middle case)                  | 81.82% | 80.50%   | 86.25%
V-C Performance of Roweisposes
The average accuracies over the cross-validation folds are reported in Table I. The first part of the table reports the related work and the state-of-the-art performances on the TST, UTKinect, and UCFKinect datasets. The second part of the table contains the performances of the four extreme cases of Roweisposes, which are eigenposes (based on PCA), Fisherposes (based on FDA) [17], supervised eigenposes (based on SPCA), and double supervised eigenposes (based on DSDA). The third part of the table reports the performances of some of the middle special cases of Roweisposes in the Roweis map.
Regarding the comparison of the special cases of Roweisposes against each other, we see that, with some exceptions, a higher supervision level mostly results in better performance. In other words, Fisherposes and supervised eigenposes usually have superior or comparable performance relative to eigenposes. This is especially true for the UTKinect dataset, while on the other datasets the performances are comparable. This is expected because a higher level of supervision makes more use of the labels and thus improves recognition by learning a more discriminative subspace for the poses. The performance of double supervised eigenposes is not necessarily much better than that of the other cases. This fact has also been shown to hold for facial images and the MNIST dataset in [16]. That paper shows that DSDA outperforms the other cases only in some regression problems, which are not the case study of this paper.
The performance of these cases is slightly different from that reported for Fisherposes in [17]. There are two small reasons for this. The first is that FDA as a special case of RDA uses the optimization problem (8), whereas the Fisherposes method in [17] uses problem (1). The second is that this paper uses the Euclidean distance in Eq. (19) for simplicity of the method and for eliminating a hyperparameter, whereas [17] uses a regularized Mahalanobis distance instead. We have deferred trying the Roweisposes method with the Mahalanobis distance to future work (see Section VI). The middle special cases of Roweisposes also show that the performance of this method is almost stable across most of the cases in the Roweis map.
The related work and state-of-the-art methods reported in Table I are the Grassmann manifold [30], HOJ3D [34], kinetic energy [29], Contributive Representation based Reconstruction (CRR) [31], the Riemannian manifold [5], Multidimensional DTW (MDTW) [9], and sparseness embedding [26]. On the TST data, Roweisposes outperforms HOJ3D and has comparable performance with the kinetic energy method. Although the Roweisposes method does not outperform the state of the art in the other cases, this does not question the value of the proposed method, because it provides an infinite number of action recognition methods using basic subspace learning methods based on the generalized eigenvalue problem. The rich action recognition literature lacks such fundamental methods, although these basic methods have been used widely in other fields such as face recognition [1].
V-D Confusion Matrices
Figure 2 depicts the confusion matrices of Roweisposes in nine special cases of the Roweis map. For the sake of brevity, we only report the matrices for the TST dataset. As this figure shows, the performance of Roweisposes is stable and acceptable across different parts of the map. However, some actions are recognized differently in different cases. For example, falling side is recognized better by increasing the first Roweis factor, which makes sense because of the greater use of the labels in the kernel over the labels (see Eq. (16)). In the other cases, this action is confused with similar actions such as falling front, falling back, and falling ending up sitting. An example of the impact of increasing the second Roweis factor is the grasping action: in most of the cases, increasing this factor improves the performance of Roweisposes for recognizing this action, which is expected because of the greater use of the labels in the within scatter (see Eq. (15)).
V-E Comparison of Subspaces
The embeddings of the body poses onto the two leading Roweisposes directions are depicted in Fig. 3. For brevity, we only show the embedding of the TST dataset. This figure shows the embeddings for nine different cases in the Roweis map. As this figure shows, the poses are more confused at smaller supervision levels, i.e., smaller values of the Roweis factors. This is expected because a higher supervision level uses the information of the pose labels more. For instance, the pose lying side is confused at small values of the factors, and increasing the supervision level discriminates this pose much better. The poses lying back and sitting on the ground are also better separated when the supervision level is increased. Increasing one Roweis factor while fixing the other has also separated the poses lying side and lying back more.
VI Conclusion and Future Work
In this paper, we proposed the Roweisposes method, which generalizes Fisherposes. This method includes an infinite number of action recognition methods which learn a discriminative subspace for embedding body poses using the generalized eigenvalue problem. Four special cases of Roweisposes are Fisherposes, eigenposes, supervised eigenposes, and double supervised eigenposes. Compared to the complicated new action recognition methods, the need for this basic action recognition method is especially felt given that similar basic approaches have been proposed for face recognition, such as Fisherfaces, eigenfaces, supervised eigenfaces, and double supervised eigenfaces.
At least two future directions exist for improving the Roweisposes method. The first is to use the Mahalanobis distance instead of the Euclidean distance for recognizing the body pose after projection onto the Roweisposes subspace; tuning the parameters of the regularized Mahalanobis distance [17] may improve the accuracy of action recognition. The other direction is to sample the training frames for learning the body poses automatically rather than by manual selection [18].
References
 [1] (2007) 2D and 3D face recognition: a survey. Pattern Recognition Letters 28(14), pp. 1885–1906.
 [2] (2011) Supervised principal component analysis: visualization, classification and regression on subspaces and submanifolds. Pattern Recognition 44(7), pp. 1357–1371.
 [3] (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics 41(1), pp. 164–171.
 [4] (1997) Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), pp. 711–720.
 [5] (2014) 3D human action recognition by shape analysis of motion trajectories on Riemannian manifold. IEEE Transactions on Cybernetics 45(7), pp. 1340–1352.
 [6] (2013) Exploring the trade-off between accuracy and observational latency in action recognition. International Journal of Computer Vision 101(3), pp. 420–436.
 [7] (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7(2), pp. 179–188.
 [8] (2015) Proposal and experimental evaluation of fall detection solution based on wearable and depth data fusion. In International Conference on ICT Innovations, pp. 99–108.
 [9] (2018) Simultaneous joint and object trajectory templates for human activity recognition from 3D data. Journal of Visual Communication and Image Representation 55, pp. 729–741.
 [10] (2019) Linear and quadratic discriminant analysis: tutorial. arXiv preprint arXiv:1906.02590.
 [11] (2019) Unsupervised and supervised principal component analysis: tutorial. arXiv preprint arXiv:1906.03148.
 [12] (2019) Eigenvalue and generalized eigenvalue problems: tutorial. arXiv preprint arXiv:1903.11240.
 [13] (2019) Fisher and kernel Fisher discriminant analysis: tutorial. arXiv preprint arXiv:1906.09436.
 [14] (2019) Hidden Markov model: tutorial. engrXiv.
 [15] (2019) Roweis discriminant analysis: a generalized subspace learning method. arXiv preprint arXiv:1910.05437.
 [16] (2020) Generalized subspace learning by Roweis discriminant analysis. In 17th International Conference on Image Analysis and Recognition.
 [17] (2018) Fisherposes for human action recognition using Kinect sensor data. IEEE Sensors Journal 18(4), pp. 1612–1627.
 [18] (2017) Automatic extraction of key-poses and key-joints for action recognition using 3D skeleton data. In 2017 10th Iranian Conference on Machine Vision and Image Processing, pp. 164–170.
 [19] (2005) Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pp. 63–77.
 [20] (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
 [21] (2019) Variant Grassmann manifolds: a representation augmentation method for action recognition. ACM Transactions on Knowledge Discovery from Data 13(2), pp. 1–23.
 [22] (2012) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), pp. 221–231.
 [23] (2011) Principal Component Analysis. Springer.
 [24] (2019) Speech and Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing. Pearson Prentice Hall.
 [25] (2019) Time-invariant 3D human action recognition with positive and negative movement memory using convolutional neural networks. In 2019 4th International Conference on Pattern Recognition and Image Analysis, pp. 26–31.
 [26] (2020) Sparseness embedding in bending of space and time; a case study on unsupervised 3D action recognition. Journal of Visual Communication and Image Representation 66, pp. 1–19.
 [27] (2020) Recognizing involuntary actions from 3D skeleton data using body states. Scientia Iranica.
 [28] (2016) 3D skeleton-based human action classification: a survey. Pattern Recognition 53, pp. 130–147.
 [29] (2014) 3D human action segmentation and recognition using pose kinetic energy. In 2014 IEEE International Workshop on Advanced Robotics and its Social Impacts, pp. 69–75.
 [30] (2015) Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recognition 48(2), pp. 556–567.
 [31] (2020) Contributive representation based reconstruction for online 3D action recognition. International Journal of Pattern Recognition and Artificial Intelligence.
 [32] (1991) Face recognition using eigenfaces. In Proceedings 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–587.
 [33] (2018) Action recognition based on joint trajectory maps with convolutional neural networks. Knowledge-Based Systems 158, pp. 43–53.
 [34] (2012) View invariant human action recognition using histograms of 3D joints. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27.
 [35] (2007) Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pp. 1087–1093.