I Introduction
In the past three decades, motion capture systems – MoCap – have been engineered with the ultimate goal of tracking and recording human motion while guaranteeing high resolutions in both the spatial and temporal domains. The acquired data consist of time series of joint/marker 3D positions and are broadly used for several different applications, e.g., studying human motion in sport sciences, inferring biometric patterns for person identification or generating realistic motion sequences in computer animation, to name a few [1]. Among these, action and activity recognition plays a crucial role in human-robot interaction, autonomous driving vehicles and video-surveillance [2]. However, devising effective methods to analyze MoCap data is demanding due to many yet unsolved problems related, for instance, to missing acquisitions of joint coordinates or to highly corrupted data.
Previous attempts to face these issues either rely on distance learning techniques (e.g., the subspace view-invariant metric [3]) or apply stochastic techniques to model the degree of uncertainty in the data. For instance, a hidden Markov model is used in [4] to produce weak classifiers which are enhanced by AdaBoost. Furthermore, [5] proposed an action graph to model the dynamics for action recognition and exploited a bag of 3D points as feature representation. Since the spatial and/or temporal dimensions of the recorded data can be large, dimensionality reduction methods [6] [7] have been devised. However, in general, the classification is subsequent to a design phase of discriminative features such as actionlets [8], random occupancy patterns [9], pose-based sets [10], space-time trajectories [11], velocity and acceleration [12], normal vectors [13] or Lie group geometry embeddings [14]. As a different paradigm from a customized class of task-specific features, generalizable representations driven by the covariance matrix were shown to be promising, either encoding spatio-temporal derivatives of joint positions [15] or producing hierarchical temporal pyramids of descriptors [16].
Recently, the new state of the art for action and activity recognition from MoCap data was set by [17], where several Gram matrices are computed to produce multiple representations of the joint positions of each trial and, once a fusion step is performed, a log-Euclidean kernel feeds the SVM classifier. Therein, the covariance is replaced by kernel matrices, motivated by the observation that the former can only capture linear relationships while the latter can model general ones. In this work, we pursue the opposite perspective, focusing on the covariance representation and rigorously devising a kernelized version to extend its discriminative power.
Indeed, by the direct usage of a kernel, we can avoid any preliminary explicit feature encoding (as, for instance, occurs in [15]) and, for a general class of kernel functions, we recover the kernel trick for covariance matrix estimation. As a result, its descriptiveness increases from modeling linear relationships to arbitrary ones, while the computational efficiency is preserved.
To the best of our knowledge, this problem had never been faced before in such a principled way in either the machine learning or pattern recognition fields.
To sum up, we highlight the contributions of this paper.

We propose a new kernelized representation for the covariance matrix, namely Kernelized-COV. By recovering the well-known kernel trick, we can capture more general interdependencies between variables, in such a way that the usual covariance descriptor becomes a particular case and the overall computational cost does not increase.

In order to prove the effectiveness of our approach for action and activity recognition on MoCap data, we compare our method against several alternatives on the MSR-Action3D [18], MSR-DailyActivity [19], MSRC-Kinect12 [20] and HDM05 [21] benchmark datasets. With respect to the state-of-the-art methods [17], the registered performance shows comparable results on the first two datasets and better scores on the remaining ones. This certifies that our kernelization is able to bridge the gap between covariance and kernel-based representations.
II Background
At an arbitrary timestamp $t$, a generic MoCap system represents the body of a human agent as the collection $\mathbf{x}_t \in \mathbb{R}^n$ of the three-dimensional locations of its joint/marker positions, stacking the $x$, $y$ and $z$ coordinates of each joint for $t = 1, \dots, T$. In order to quantify how much any pair of coordinates mutually changes in time, the notion of covariance is classically exploited in statistics [22]. However, it cannot be computed in the absence of a known probability distribution according to which the samples $\mathbf{x}_1, \dots, \mathbf{x}_T$ are drawn. Since this assumption is seldom verified in real cases, the sampling covariance matrix is usually exploited as an alternative: this is due to the fact that it is an unbiased estimator of the original covariance
(for convenience, in the following, we will concisely refer to the estimator as the covariance itself, omitting the “sampling” attribute) and can be computed using only a finite number of samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. Precisely, it is defined as

$$\hat{\Sigma} = \frac{1}{T-1} \sum_{t=1}^{T} (\mathbf{x}_t - \bar{\mathbf{x}})(\mathbf{x}_t - \bar{\mathbf{x}})^\top, \qquad (1)$$

where $X \in \mathbb{R}^{n \times T}$ represents the data matrix which stacks by columns all the temporal acquisitions, whose average is denoted by $\bar{\mathbf{x}}$. In matrix notation, (1) becomes (for a matter of space, the technical derivation of equation (2) from (1) was moved to the Supplementary Material)

$$\hat{\Sigma} = \frac{1}{T-1} X J X^\top, \qquad (2)$$

once $J$ is defined as the $T \times T$ matrix whose $(i,j)$-th entry is

$$J_{ij} = \delta_{ij} - \frac{1}{T}. \qquad (3)$$
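As a quick numeric sanity check (a sketch with illustrative names, assuming the centering matrix of eq. (3) has entries delta_ij - 1/T), the matrix form (2) reproduces the usual unbiased estimator:

```python
import numpy as np

def sampling_covariance(X):
    """Unbiased sampling covariance of an n x T data matrix X,
    computed in the matrix form of eq. (2): X J X^T / (T - 1)."""
    T = X.shape[1]
    # centering matrix of eq. (3): J = I - (1/T) * ones
    J = np.eye(T) - np.ones((T, T)) / T
    return X @ J @ X.T / (T - 1)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 100))   # e.g. 6 joint coordinates over 100 frames
C = sampling_covariance(X)
print(np.allclose(C, np.cov(X)))    # True: matches NumPy's unbiased estimator
```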
The usage of the covariance to produce descriptors for classification tasks has been intensively studied [23, 24, 25, 26, 27, 28, 29, 17]. In particular, [23] proposed patch-specific covariance descriptors, efficiently computed with integral images. Other approaches rely on covariance to systematically encode mutual relationships inside the data, and such an idea was applied to many different applications, such as face recognition [24], person identification [25] and more general classification tasks [26]. Further, covariance was proposed to measure similarities across data samples [27]. This latter direction actually grounds on the mathematical properties of positive definite matrices, exploiting Riemannian metrics on the manifold for image classification: once moved from a finite- to an infinite-dimensional space, performance improves [28, 29], and only recently have deep learning approaches been shown to be superior. However, one of the main limitations of the covariance matrix is that it can only capture linear inter-relationships [22]. For instance, principal component analysis actually exploits a covariance matrix to remove linear correlations of data points [30]. Among the attempts at modeling more complicated relationships, additional statistics, such as entropy and mutual information [26], and kernels [17] have been adopted. As a different paradigm, one can model nonlinear behaviors by first applying a preprocessing step which encodes the raw data by means of a transformation that enlarges the feature space. For instance, [15] applied such an idea to spatial and temporal derivatives for gesture recognition, [26] considered both different color spaces and edge detectors for image classification, and [25] used filter bank responses as features to estimate head orientation. In this latter approach, once the feature map $\Phi$ and the transformed data matrix $\tilde{X}$, whose $t$-th column is $\Phi(\mathbf{x}_t)$, are defined, the covariance (2) is now expressed by

$$\hat{\Sigma}_\Phi = \frac{1}{T-1} \tilde{X} J \tilde{X}^\top. \qquad (4)$$
Although $\hat{\Sigma}_\Phi$ is able to capture general relationships embedded in the raw data, the main bottleneck of (4) is the required explicit computation of $\tilde{X}$. Indeed, due to the feature space augmentation performed by $\Phi$, the higher dimensionality of such a matrix makes (4) more demanding than (2) in terms of both storage and computational cost. Additionally, although infinite-dimensional feature spaces are common for many classes of feature maps (e.g., the one corresponding to a Gaussian kernel), this case has to be excluded in (4), since $\tilde{X}$ would be infinite dimensional and therefore impossible to compute exactly. In the following section, we will face the problem of obtaining $\hat{\Sigma}_\Phi$ without involving $\Phi$.
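The storage blow-up just described can be made concrete with a small sketch (the degree-2 polynomial map and all the sizes below are illustrative assumptions): even a modest explicit feature map inflates the covariance from n x n to m x m.

```python
import numpy as np

def phi_poly2(x):
    """Explicit degree-2 polynomial feature map (illustrative choice):
    all monomials of x up to degree 2."""
    n = len(x)
    quad = np.outer(x, x)[np.triu_indices(n)]   # x_i * x_j, i <= j
    return np.concatenate(([1.0], x, quad))

rng = np.random.default_rng(1)
n, T = 60, 200                       # e.g. 20 joints x 3 coordinates, 200 frames
X = rng.standard_normal((n, T))
# explicit transformed data matrix of eq. (4), column by column
Phi = np.stack([phi_poly2(X[:, t]) for t in range(T)], axis=1)
J = np.eye(T) - np.ones((T, T)) / T
Sigma_phi = Phi @ J @ Phi.T / (T - 1)
print(X.shape[0], Phi.shape[0], Sigma_phi.shape)
# the n x n covariance becomes m x m with m = 1 + n + n(n+1)/2, i.e. 60 -> 1891
# here; an infinite-dimensional feature map could not be stored at all
```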
III Method
Leveraging on the theory of kernel methods [31], every symmetric and positive definite kernel function $k$ can be expressed as

$$k(\mathbf{x}, \mathbf{y}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle_{\mathcal{H}}, \qquad (5)$$

where the inner product is computed in the Hilbert space $\mathcal{H}$ (for additional details about $\mathcal{H}$, as well as for an extended presentation of the proposed method, please refer to the Supplementary Material) which defines the range of the feature map $\Phi$. In (5), the kernel trick [31] replaces the arbitrary relationships in the original data space with a linear reformulation in $\mathcal{H}$: most importantly, $\Phi$ can actually be skipped, since only the computation of the kernel $k$ is required (e.g., this happens for support vector machines [30]). In our case, we will employ $k$ to obtain a representation equivalent to (4), that is $\hat{\Sigma}_k = \hat{\Sigma}_\Phi$, while also skipping the computation of $\Phi$. The following statement moves the first step in this direction.

Lemma 1.
Assume that there exist $\mathbf{z}_1, \dots, \mathbf{z}_n$ such that $\Phi(\mathbf{z}_i) = \mathbf{e}_i$ for every $i = 1, \dots, n$, being $\mathbf{e}_i$ the $i$-th unitary element of the canonical base of $\mathcal{H}$ as a vectorial space. Then, there exists a matrix $\hat{\Sigma}_k$, depending only on the kernel $k$, the data $X$ and $\mathbf{z}_1, \dots, \mathbf{z}_n$, such that $\hat{\Sigma}_k = \hat{\Sigma}_\Phi$.
Proof.
Using (4), the $(i,j)$-th entry of $\hat{\Sigma}_\Phi$ rewrites as

$$[\hat{\Sigma}_\Phi]_{ij} = \frac{1}{T-1} \sum_{t,s=1}^{T} J_{ts} \, \langle \Phi(\mathbf{x}_t), \mathbf{e}_i \rangle \, \langle \Phi(\mathbf{x}_s), \mathbf{e}_j \rangle. \qquad (6)$$

In (6), once exploited the assumption that $\mathbf{e}_i = \Phi(\mathbf{z}_i)$ for some $\mathbf{z}_i$, we can define the matrix $K$ whose $(t,i)$-th entry is $K_{ti} = \langle \Phi(\mathbf{x}_t), \Phi(\mathbf{z}_i) \rangle = k(\mathbf{x}_t, \mathbf{z}_i)$ and consequently we deduce

$$[\hat{\Sigma}_\Phi]_{ij} = \frac{1}{T-1} \left[ K^\top J K \right]_{ij} =: [\hat{\Sigma}_k]_{ij}, \qquad (7)$$

which proves the thesis. ∎
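A minimal sketch of Lemma 1 in code (the pivot points z_i, the kernel choice and all names are illustrative assumptions): every entry of the kernelized covariance only requires the T x n matrix of kernel evaluations k(x_t, z_i), never the feature map itself, so the descriptor keeps the same size as the classical covariance.

```python
import numpy as np

def kernelized_covariance(X, Z, kernel):
    """n x n kernelized covariance in the spirit of Lemma 1: only the
    T x n kernel matrix K_ti = k(x_t, z_i) is needed, never the
    (possibly infinite-dimensional) feature map itself."""
    T = X.shape[1]
    K = np.array([[kernel(X[:, t], z) for z in Z.T] for t in range(T)])  # T x n
    J = np.eye(T) - np.ones((T, T)) / T          # centering matrix
    return K.T @ J @ K / (T - 1)

# exponential dot-product kernel, one of the kernels covered by the paper
k_exp = lambda x, y: np.exp(x @ y)

rng = np.random.default_rng(2)
n, T = 5, 50
X = rng.standard_normal((n, T))
Z = np.eye(n)                        # pivots z_i (illustrative choice)
S = kernelized_covariance(X, Z, k_exp)
print(S.shape, np.allclose(S, S.T))  # n x n and symmetric
```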
Lemma 1 certifies that we are able to compute the covariance $\hat{\Sigma}_\Phi$ in terms of the sole kernel $k$. However, some issues pertain to the practical feasibility of the assumption

$$\Phi(\mathbf{z}_i) = \mathbf{e}_i \qquad (8)$$

for any $i = 1, \dots, n$, which is nevertheless fundamental for our purposes.
Actually, (8) is quite restrictive, since the range of $\Phi$ is forced to contain the whole canonical base. For instance, if $\Phi$ maps into a finite-dimensional space, (8) consists in a set of equations to be solved in a finite-dimensional space, and the resulting linear system can be either underdetermined or unsolvable. Clearly, in case of a more general shape for $\Phi$, it is not trivial to check whether assumption (8) is verified. Hence, it seems natural to opt for a different feature map, which can replace $\Phi$ in generating the kernel function $k$, while also satisfying (8). Thus, in the rest of the paper, we will focus on a specific class of stochastic feature maps $\varphi$, actually fulfilling hypothesis (8), so that the induced linear kernel approximates $k$ in both a stochastic and an analytical sense. Therefore, we select the family of functions
$$k(\mathbf{x}, \mathbf{y}) = \sum_{m \geq 0} a_m \, (\mathbf{x} \cdot \mathbf{y})^m, \qquad (9)$$

where the dot product is computed in $\mathbb{R}^n$ and $a_m \geq 0$ for any $m$. It is worth noting that, due to the non-negativeness of these coefficients, and since a linear combination of positive definite kernels is still positive definite, (9) admits the representation (5). Also, (9) covers both finite and infinite linear combinations and is therefore comprehensive of a broad class of kernel functions. For instance, it is easily checked that (9) generalizes both the polynomial kernel $(c + \mathbf{x} \cdot \mathbf{y})^d$ and the exponential dot-product kernel $\exp(\sigma \, \mathbf{x} \cdot \mathbf{y})$, $\sigma > 0$. In this setting, we now introduce the following lemma, which gives the fundamental tool to construct $\varphi$.
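As a quick numerical aside (a sketch; the coefficient choices below are the standard series expansions), both examples can be checked to be instances of the series (9):

```python
import math
import numpy as np

x = np.array([0.3, -0.5, 0.2])
y = np.array([0.1, 0.4, -0.7])
s = x @ y

# truncated series of eq. (9) with a_m = 1/m! : the exponential dot-product kernel
series = sum(s**m / math.factorial(m) for m in range(30))
print(abs(series - np.exp(s)) < 1e-12)   # True

# polynomial kernel (1 + x.y)^d: finitely many nonzero a_m (binomial weights)
d = 3
poly = sum(math.comb(d, m) * s**m for m in range(d + 1))
print(abs(poly - (1 + s)**d) < 1e-12)    # True, by the binomial theorem
```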
Lemma 2.
Let $\omega_1, \dots, \omega_N$ be a collection of independent samples, jointly distributed as a mixture of discrete Dirac's deltas, and define the associated product statistic accordingly. Then, its expectation under the distribution of the samples is

(10)
Proof.
Using the definition of the statistic, the property of the mixture of Dirac's delta distributions and the linearity of the expectation, the thesis comes after a direct chain of equivalences, in which $\delta$ denotes the Kronecker symbol. ∎
Once a random number $N$ is sampled with probability $\mathbb{P}(N)$, define $\varphi$ through identical copies of the function

(11)

where the samples are independently distributed according to the Dirac mixture above. Equation (11) and Lemma 2 allow extending [32, Lemma 7] to our case, which states that the linear kernel obtained through $\varphi$ is an unbiased estimator of the original function $k$. Similarly, using the same arguments of Section 4.1 in [32], we obtain that the approximation holds uniformly over any compact set of $\mathbb{R}^n$.
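A simplified sketch of such a stochastic approximation, in the spirit of [32] (the degree distribution, the Rademacher sampling and all names are illustrative assumptions, not the paper's exact construction):

```python
import math
import numpy as np

def approx_kernel(x, y, a_fn, D, rng):
    """Monte Carlo estimate of k(x, y) = sum_m a_m (x.y)^m via random
    features in the style of [32]: for each of D features, sample a
    degree N with P(N) = 2^-(N+1), then N Rademacher vectors w_j, and
    use the reweighted product feature so the estimate is unbiased."""
    est = 0.0
    for _ in range(D):
        N = rng.geometric(0.5) - 1                      # degree, N = 0, 1, ...
        W = rng.choice([-1.0, 1.0], size=(N, x.size))   # Rademacher directions
        scale = a_fn(N) * 2.0 ** (N + 1)                # a_N / P(N)
        est += scale * np.prod(W @ x) * np.prod(W @ y)  # one unbiased sample
    return est / D

x = np.array([0.3, -0.5, 0.2])
y = np.array([0.1, 0.4, -0.7])
rng = np.random.default_rng(3)
est = approx_kernel(x, y, lambda m: 1.0 / math.factorial(m), 50_000, rng)
print(est, np.exp(x @ y))   # the estimate concentrates around exp(x.y)
```

Unbiasedness follows since, for Rademacher directions, the expectation of (w.x)(w.y) is exactly x.y, so averaging over the random degree recovers the series term by term.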
Since we proved that $\varphi$ approximates the kernel $k$ in the sense explained above, the final stage is solving the issue related to (8).
Proposition 1.
The map $\varphi$ satisfies the assumption (8), that is, for every $i = 1, \dots, n$, it results

(12)
Proof.
The relationship (12) displays a system of equations, stochastically dependent on the randomness of $\varphi$. Actually, in our case, it is enough to solve the system (12) and prove the existence of the $\mathbf{z}_i$'s under a specific realization of $N$ and of the samples, the two sources of randomness in $\varphi$. In other words, we can solve (12) in a maximum likelihood sense by considering the realizations which verify (12) with probability one. Thus, we use a prior on $N$ and, once all the multiplicative constants defining $\varphi$ are absorbed, (12) becomes

(13)

Precisely, (13) is a linear system in the unknowns $\mathbf{z}_1, \dots, \mathbf{z}_n$. If we then assume that the Dirac delta distribution is concentrated with probability 1, (13) is solvable if and only if the corresponding coefficients are nonzero for any $i$. This is actually verified once each sample is chosen to be the $i$-th element of the orthonormal basis of $\mathbb{R}^n$. ∎
With Proposition 1, all issues related to the computability of $\hat{\Sigma}_k$ are solved. Additionally, one can also easily understand that, with the previous choice of $\varphi$, once a linear kernel is selected, $\hat{\Sigma}_k$ equals $\hat{\Sigma}$, so that the classical covariance is a particular case of our framework.
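A sanity check of this closing remark (an illustrative sketch): with a linear kernel and the canonical basis vectors as pivots, the kernel matrix reduces to the transposed data matrix, and the construction of Lemma 1 collapses to the classical covariance.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 4, 30
X = rng.standard_normal((n, T))

k_lin = lambda u, v: u @ v                    # linear kernel
Z = np.eye(n)                                 # pivots: canonical basis vectors
K = np.array([[k_lin(X[:, t], Z[:, i]) for i in range(n)] for t in range(T)])
J = np.eye(T) - np.ones((T, T)) / T
Sigma_k = K.T @ J @ K / (T - 1)               # kernelized covariance

print(np.allclose(Sigma_k, np.cov(X)))        # True: classical covariance back
```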
This theoretical discussion leads to Algorithm 1, where the proposed kernelized covariance is applied to the task of action and activity recognition. For a better understanding, we also visualize the pipeline in Figure 1.
Computational cost. The complexity of computing our trial-specific kernelized covariance matches that of the classical covariance matrix. Thus, differently from previous approaches [27, 33, 28, 29], the proposed framework is very efficient compared to the cubic complexity of methods like [33], which require an eigen-decomposition. From a mathematical point of view, our kernelized covariance is a natural generalization of the classical covariance matrix, which can be retrieved as a particular case of our paradigm once the kernel function (9) is fixed to be a linear one; at the same time, the computational cost remains that of the classical covariance descriptor.
Table I: Recognition accuracy on the four benchmark datasets.

| Method | MSR-Action3D | MSR-DailyActivity | MSRC-Kinect12 | HDM05 |
| --- | --- | --- | --- | --- |
| RegionCOV [23] | 74.0% | 85.0% | 89.2% | 91.5% |
| Hierarchy of COVs [16] | 90.5% | – | 91.7% | – |
| COV-SVM [29] | 80.4% | 75.5% | 89.2% | 82.5% |
| Ker-RP-POL [17] | 96.2% | 96.9% | 90.5% | 93.6% |
| Ker-RP-RBF [17] | 96.9% | 96.3% | 92.3% | 96.8% |
| Kernelized-COV (proposed) | 96.2% | 96.3% | 95.0% | 98.1% |
IV Experimental results
In this section, we present the experimental results obtained with our Kernelized-COV method on different publicly available MoCap datasets for action recognition. Precisely, the following algorithms were compared in our experiments: RegionCOV [23] (covariance region descriptor), a temporal pyramid of covariance descriptors (Hierarchy of COVs) [16] and, finally, an infinite-dimensional covariance operator which exploits Bregman divergences, namely COV-SVM [29]. Furthermore, we also report the comparison against the recent state-of-the-art methods, namely Ker-RP-POL and Ker-RP-RBF [17].
In all the experiments, we followed [17] in performing SVM classification by means of a global log-Euclidean kernel applied upon the Gram matrices, directly computed over joint coordinates, encoding each single trial. Nevertheless, differently from [17], in order to represent each multivariate time series of joint trajectories, the data encoding of any trial was realized through our kernelized covariance matrix $\hat{\Sigma}_k$, where $k$ is the exponential dot-product kernel (see Section III). For a fair comparison, our kernelization was plugged into the publicly available code (http://www.uow.edu.au/~leiw/) and, for classification, we used the SVM and Kernel Methods Matlab Toolbox (http://asi.insa-rouen.fr/enseignants/~arakoto/toolbox/index.html) through the wrapper directly provided by the authors. Finally, as done by [17], the kernel parameter was chosen by cross-validation.
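A possible sketch of the global log-Euclidean kernel used at the classification stage (a simplified illustration, assuming the common form k(A, B) = exp(-gamma * ||logm(A) - logm(B)||_F^2); this is not the authors' released code):

```python
import numpy as np

def spd_logm(A):
    """Matrix logarithm of a symmetric positive definite matrix,
    via eigendecomposition (safe for SPD inputs)."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def log_euclidean_kernel(A, B, gamma=1.0):
    """Gaussian kernel on the log-Euclidean distance between two SPD
    descriptors, the kind of global kernel that feeds the SVM."""
    d = np.linalg.norm(spd_logm(A) - spd_logm(B), ord='fro')
    return np.exp(-gamma * d ** 2)

# two toy SPD descriptors (e.g. trial-level covariance representations)
rng = np.random.default_rng(5)
M1, M2 = rng.standard_normal((2, 4, 4))
A = M1 @ M1.T + 4 * np.eye(4)
B = M2 @ M2.T + 4 * np.eye(4)

print(log_euclidean_kernel(A, A))             # 1.0 on identical descriptors
print(0.0 < log_euclidean_kernel(A, B) <= 1.0)
```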
In all the experiments, we only used the 3D skeleton coordinates available in the following datasets:

MSR-Action3D [18], where there are 20 classes of mostly sport-related actions (e.g., jogging or tennis-serve) involving 10 subjects. Since each subject performs each action 2 or 3 times, the overall number of trials is 567. For each of them, a Kinect sensor is used to acquire depth maps, from which 20 joints are extracted to model the pose of each human agent.

MSR-DailyActivity [19], captured using a Kinect device and composed of 16 different classes of everyday actions, such as read book or lie down on sofa, all performed by 10 subjects. The main difficulty of this dataset originates from the fact that any activity class is performed in either a standing or a sitting position, with consequently misleading motion patterns that hamper the classification.

MSRC-Kinect12 [20], consisting of sequences of human movements, represented as body-part locations, and the associated gesture to be recognized by the system. 594 sequences, with a total length of approximately six hours and 40 minutes, are collected from 30 people performing 12 gestures: in total, 6,244 gesture instances. The motion files contain Kinect-estimated trajectories of 20 joints.

HDM05 [21], containing more than three hours of systematically recorded and well-documented MoCap data, acquired with a 240 Hz VICON system capturing the gestures of 5 non-professional actors via 31 markers. Motion clips have been manually cut out and annotated into roughly 100 different motion classes: on average, 10-50 realizations per class are available.
In all cases, we used the same splits adopted in [17]: for MSR-Action3D, MSR-DailyActivity and MSRC-Kinect12, training is performed on the odd-index subjects, while the even-index ones are left for testing (cross-subject pipeline of [18]); in HDM05, the training split exploits all the data from the “bd” and “mm” subjects, and testing is performed on “bk”, “dg” and “tr”. Furthermore, for the HDM05 dataset, we removed some severely corrupted samples [16] and, as done by [17], selected only the following classes: clap above head, deposit floor, elbow to knee, grab high, hop both legs, jog, kick forward, lie down floor, rotate both arms backward, sit down chair, sneak, squat, stand up lie and throw basketball. All the data are preprocessed in a common way. In particular, for MSR-Action3D and MSR-DailyActivity, we computed velocity and acceleration from the raw joint positions, adopting first- and second-order finite difference schemes, respectively, as in [12].
Table I shows the results of Kernelized-COV on the four different datasets in comparison with all the other methods. In the case of MSR-Action3D and MSR-DailyActivity, our proposed method achieves comparable results with a small deviation from the state of the art [17], while outperforming all the other competitors. More impressively, on MSRC-Kinect12, Kernelized-COV improves the state of the art [17] by 2.7%. On the last dataset, namely HDM05, the accuracy of the proposed method is 1.3% higher than the best score achieved by the other competitors. In this case, referring to [16], we did not report the accuracy of Hierarchy of COVs on HDM05 due to the different experimental settings, since it was evaluated on a simplified class problem. Furthermore, it is worth noting that, on all the considered datasets, our Kernelized-COV works even better than a recent infinite-dimensional covariance operator [29], thus encoding the data more discriminatively.
The improvements in classification accuracy demonstrate the effectiveness of Kernelized-COV. Moreover, our proposed principled way of encoding the nonlinearities conveyed by the data is always superior to classical covariance-based methods such as [23, 16, 29] and does not suffer from the gap in performance shown by the covariance representation in [17].
Table II: Comparison with non-covariance-based methods on MSR-Action3D.

| Method | MSR-Action3D |
| --- | --- |
| Action Graph [5] | 79.0% |
| Random Occupancy Patterns [9] | 86.0% |
| Actionlets [8] | 88.2% |
| Pose Set [10] | 90.0% |
| Moving Pose [12] | 91.7% |
| Lie Group [14] | 92.5% |
| Normal Vectors [13] | 93.1% |
| Kernelized-COV (proposed) | 96.2% |
As a final remark, it is interesting to compare the performance of our Kernelized-COV with other, non-covariance-based methods. To this aim, we consider the MSR-Action3D dataset and compare against many previous approaches in the literature, already introduced in Section I. The results presented in Table II give further evidence of the effectiveness of the proposed kernelized covariance, which is able to overcome [13], the best score previously reported, by a margin of 3.1%.
V Conclusions & Future Perspectives
This paper presents a principled mathematical paradigm to recover the applicability of the kernel trick for the covariance matrix, in order to model more general classes of relationships than the linear ones. This enhances the descriptiveness of the classical covariance matrix, which is retrievable as a particular case of our general theoretical framework. Experimentally, Kernelized-COV closes the gap between covariance and kernel-based representations on several action recognition datasets, namely MSR-Action3D, MSR-DailyActivity, MSRC-Kinect12 and HDM05. The proposed method improves the previous best accuracies, setting the new state-of-the-art performance on the last two datasets.
As future work, we will tackle the applicability of this novel framework to other classification problems and investigate how a similar pipeline can be extended to more general classes of kernel functions.
References
 [1] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in visionbased human motion capture and analysis,” CVIU, vol. 104, no. 2, pp. 90–126, 2006.
 [2] M. Vrigkas, C. Nikou, and I. Kakadiaris, “A review of human activity recognition methods,” Frontiers in Robotics and AI, vol. 2, no. 28, 2015.
 [3] Y. Sheikh, M. Sheikh, and M. Shah, “Exploring the space of a human action,” in ICCV, 2005.
 [4] F. Lv and R. Nevatia, “Recognition and segmentation of 3d human action using hmm and multiclass adaboost,” in ECCV, 2006.
 [5] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3d points,” in CVPR workshop, 2010.
 [6] X. Yang and Y. Tian, “Eigenjoints-based action recognition using naive Bayes nearest neighbor,” in CVPR workshop, 2012.
 [7] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, “Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition,” in CVPR workshop, 2013.
 [8] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in CVPR, 2012.
 [9] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3d action recognition with random occupancy patterns,” in ECCV, 2012.
 [10] C. Wang, Y. Wang, and A. L. Yuille, “An approach to posebased action recognition.” in CVPR, 2013.
 [11] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo, “Spacetime Pose Representation for 3D Human Action Recognition,” in ICIAP workshop, 2013.
 [12] M. Zanfir, M. Leordeanu, and C. Sminchisescu, “The moving pose: An efficient 3d kinematics descriptor for lowlatency action recognition and detection,” in ICCV, 2013.
 [13] X. Yang and Y. Tian, “Super normal vector for activity recognition using depth sequences,” in CVPR, 2014.
 [14] R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in CVPR, 2014.
 [15] A. Sanin, C. Sanderson, M. T. Harandi, and B. C. Lovell, “Spatiotemporal covariance descriptors for action and gesture recognition,” CoRR, vol. abs/1303.6021, 2013.
 [16] M. Hussein, M. Torki, M. Gowayyed, and M. ElSaban., “Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations,” IJCAI, 2013.
 [17] L. Wang, J. Zhang, L. Zhou, C. Tang, and W. Li, “Beyond covariance: Feature representation with nonlinear kernel matrices,” in ICCV, 2015.
 [18] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3d points,” in CVPR workshop, 2010.
 [19] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in CVPR, 2012.
 [20] S. Fothergill, H. M. Mentis, P. Kohli, and S. Nowozin, “Instructing people for training gestural interactive systems,” in ACMCHI, 2012.
 [21] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, and A. Weber, “Documentation mocap database HDM05,” Universität Bonn, Tech. Rep. CG20072, June 2007.
 [22] J. D. Hamilton, Time series analysis. Princeton University Press, 1994.
 [23] O. Tuzel, F. Porikli, and P. Meer, “Region covariance: A fast descriptor for detection and classification,” in ECCV, 2006.
 [24] Y. Pang, Y. Yuan, and X. Li, “Gaborbased region covariance matrices for face recognition,” TCSVT, vol. 18, no. 7, pp. 989–993, 2008.
 [25] D. Tosato, M. Spera, M. Cristani, and V. Murino, “Characterizing humans on riemannian manifolds,” TPAMI, vol. 35, no. 8, pp. 1972–1984, 2013.
 [26] M. San Biagio, M. Crocco, M. Cristani, S. Martelli, and V. Murino, “Heterogeneous AutoSimilarities of Characteristics (HASC): Exploiting Relational Information for Classification,” in ICCV, 2013.
 [27] M. San Biagio, S. Martelli, M. Crocco, M. Cristani, and V. Murino, “Encoding classes of unaligned objects using structural similarity cross-covariance tensors,” in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2013, pp. 133–140.
 [28] M. Ha Quang, M. San Biagio, and V. Murino, “Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces,” in NIPS, 2014.
 [29] M. Harandi, M. Salzmann, and F. Porikli, “Bregman divergences for infinite dimensional covariance matrices,” in CVPR, 2014.
 [30] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
 [31] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, ser. Adaptive Computation and Machine Learning, 2002.
 [32] P. Kar and H. Karnick, “Random feature maps for dot product kernels,” in JMLR, 2012.
 [33] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi, “Kernel Methods on the Riemannian Manifold of Symmetric Positive Definite Matrices,” in CVPR, 2013.