1 Introduction
Hand gesture recognition is an important research topic with applications in many fields, e.g., assisted living, human-robot interaction or sign language interpretation. A large family of hand gesture recognition methods is based on low-level features extracted from images, e.g., spatio-temporal interest points. However, with the introduction of affordable depth cameras, e.g., Intel RealSense or Microsoft Kinect, and the availability of highly accurate joint tracking algorithms, skeletal data can be obtained efficiently and with good precision. Skeletal data provides a rich and high-level description of the hand. This has led to an extensive development of approaches for skeleton-based hand gesture recognition in recent years.
Early works on the recognition of hand gestures or human actions from skeletal data model the skeleton's movement as time series [49, 52]. The recognition step is thus based on the comparison of sequences of features describing the skeleton's movement using, e.g., Dynamic Time Warping [49] or the Fourier Temporal Pyramid [52].
Such approaches ignore the high correlations existing between the movements of two adjacent hand joints (e.g., two joints of the same finger) within a hand gesture. Taking this information into account is a crucial step for hand gesture recognition and requires the definition and appropriate processing of hand joints' neighborhoods.
Recently, graph convolutional networks for action recognition [27, 56] have shown excellent performance by taking into account the physical connections of body joints defined by the underlying structure of the body skeleton. While the use of physical connections of skeleton joints is important for capturing discriminating cues in hand gesture and action recognition, the identification of other connections induced by the performed gestures and actions is also useful and can greatly improve recognition accuracy [45].
Motivated by this observation, we model in this work the hand skeleton as a 2D grid where connections, different from the classical physical connections of hand joints, are added to better capture patterns defined by hand joints' movements. Figure 1(a) shows the hand joint positions estimated by an Intel RealSense camera. Since the hand skeleton has an irregular geometric structure that differs from grid-shaped structures, the 2D grid is constructed from the hand skeleton by removing some hand joints and adding connections between neighboring joints. Figure 1(b) shows a 2D grid corresponding to the hand skeleton in Fig. 1(a). This 2D grid integrates adjacency relationships between hand joints that often have correlated movements. Moreover, this modeling allows us to use a classical convolutional layer instead of a graph convolutional operator on an arbitrary geometric graph [56].
Our approach relies on SPD matrices to aggregate features resulting from the convolutional layer. The SPD matrices considered in this work combine mean and covariance information, which have been shown effective in various vision tasks [15, 21, 22]. Since SPD matrices are known to lie on a Riemannian manifold, specific layers for deep neural networks of SPD matrices must be designed [19, 60]. Despite good performance on action recognition tasks, these networks do not focus on the spatial and temporal relationships of skeleton joints. This motivates us to design a neural network model for learning an SPD matrix-based gesture representation from skeletal data with special attention to those relationships. In our work, the encoding of spatial and temporal relationships of hand joints is performed using different network architectures. This allows capturing relevant statistics for individual hand joints as well as for groups of hand joints whose movements are highly correlated with each other. The experimental evaluation shows that our method significantly outperforms state-of-the-art methods on two standard datasets.
2 Related Works
This section presents representative works on skeleton-based hand gesture recognition (Sec. 2.1) and deep neural networks for SPD manifold learning (Sec. 2.2).
2.1 Skeleton-Based Gesture Recognition
Most approaches can be categorized as handcrafted feature-based approaches or deep learning approaches. Handcrafted feature-based approaches describe relationships of hand and body joints in different forms to represent gestures and actions. The simplest proposed relationships are relative positions between pairs of joints [35, 46, 57]. More complex relationships have also been exploited, e.g., skeletal quads [8] or 3D geometric relationships of body parts in a Lie group [49]. Temporal relationships have also been taken into account and proven effective [50]. While all joints are involved in the performed gestures and actions, only a subset of key joints is important for the recognition task. These are called informative joints, and they can be automatically identified using information theory [37]. This avoids considering non-informative joints that often bring noise and degrade performance.
Motivated by the success of deep neural networks in various vision tasks [13, 17, 26], deep learning approaches for action and gesture recognition have been extensively studied in recent years. To capture spatial and temporal relationships of hand and body joints, they rely mainly on Convolutional Neural Networks (CNN) [4, 25, 32, 33, 36, 53], Recurrent Neural Networks (RNN) [6, 51] and Long Short-Term Memory (LSTM) [31, 36, 44]. While handcrafted feature-based approaches have used informative joints to improve recognition accuracy, deep learning approaches rely on attention mechanisms to selectively focus on relevant parts of skeletal data [30, 55]. Recently, deep learning on manifolds and graphs has attracted increasing attention. Approaches following this line of research have also been successfully applied to skeleton-based action recognition [19, 20, 23, 27, 56]. By extending classical operations like convolutions to manifolds and graphs while respecting the underlying geometric structure of the data, they have demonstrated superior performance over other approaches.
2.2 Deep Learning of SPD Matrices
In recent years, the deep learning community has shifted its focus towards developing approaches that deal with data in non-Euclidean domains, e.g., Lie groups [20], SPD manifolds [19] or Grassmann manifolds [23]. Among them, those that deal with SPD manifolds have received particular attention. This stems from the widespread use of SPD matrices in many vision problems [1, 14, 16, 58].
Deep neural networks for SPD matrix learning aim at projecting a high-dimensional SPD matrix into a more discriminative low-dimensional one. Unlike classical CNNs, their layers are designed to preserve the geometric structure of the input SPD matrices, i.e., their outputs are also SPD matrices. In [5], a 2D fully connected layer was proposed for the projection, while in [19] it was achieved by a BiMap layer. Inspired by ReLU layers in CNNs, different types of layers that perform non-linear transformations of SPD matrices were also introduced [5, 7, 19]. To classify the final SPD matrix, a layer is generally required to map it to a Euclidean space. Most approaches rely on two widely used operations in machine learning, i.e., singular value decomposition (SVD) and eigenvalue decomposition (EIG), for constructing this type of layer [19, 29, 54]. As the gradients involved in SVD and EIG cannot be computed by traditional backpropagation, these approaches exploit the chain rule established by Ionescu et al. [24] for the backpropagation of matrix functions in deep learning.
3 The Proposed Approach
In this section, we present our network model, referred to as Spatial-Temporal and Temporal-Spatial Hand Gesture Recognition Network (ST-TS-HGR-NET). An overview of our network is given in Section 3.1. The different components of our network are explained in Sections 3.2, 3.3, 3.4, and 3.5. In Section 3.6, we show how our network is trained for gesture recognition. Finally, Section 3.7 points out the relations of our approach with previous approaches.
3.1 Overview of the Proposed Network
Our network, illustrated in Fig. 2, is made up of three components. The first component, referred to as CONV, is a convolutional layer applied on the 2D grid encoding the hand skeletal data (Fig. 1). Filter weights are shared over all frames of the sequence.
The second component is based on the Gaussian embedding method of [34] and is used to capture first- and second-order statistics. This component is composed of two different architectures for feature aggregation, referred to as Spatial-Temporal Gaussian Aggregation Sub-Network (ST-GA-NET) and Temporal-Spatial Gaussian Aggregation Sub-Network (TS-GA-NET).
The third component, referred to as SPD Matrix Learning and Classification Sub-Network (SPDC-NET), learns an SPD matrix from a set of SPD matrices and maps the resulting SPD matrix, which lies on a Riemannian manifold, to a Euclidean space for classification.
In the following, we explain in detail each component of our network. The backpropagation procedures of our network’s layers are given in Appendix A.
3.2 Convolutional Layer
The convolutional layer (Fig. 3), used at the front of our network, allows joints with correlated variations to be combined (Section 1). Let N and T be respectively the number of hand joints and the length of the skeleton sequence, and let p_{i,t} ∈ R^3 denote the 3D coordinates of hand joint i at frame t. We define a 2D grid where each node represents a hand joint at a frame (Section 1). The grid has three channels corresponding to the x, y, and z coordinates of hand joints. Fig. 1(b) shows the 2D grid corresponding to the hand skeleton in Fig. 1(a), where each node has at most 9 neighbors including itself. Let d be the output dimension of the convolutional layer, and let f_{i,t} ∈ R^d denote the output of the convolutional layer at node (i,t). This output feature vector is computed as:

f_{i,t} = Σ_{j ∈ N(i)} W_j p̃_{j,t}    (1)

where N(i) is the set of neighbors of node i (including i itself), W_j ∈ R^{d×3} is the filter weight matrix associated with the relative grid position of neighbor j, and p̃_{j,t} is defined as:

p̃_{j,t} = p_{j,t} if node j lies inside the grid, and 0 otherwise (zero padding at the grid borders).    (2)
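As an illustration of this layer, the following sketch (toy shapes and random weights, not the trained filters of our network) applies a 3×3-neighborhood convolution with zero padding to a per-frame grid of 3D joint coordinates:

```python
import numpy as np

def grid_conv(grid, weights):
    """Convolution over one frame of the 2D joint grid.

    grid:    (H, W, 3) array of joint coordinates (3 input channels).
    weights: (3, 3, 3, d) filters, one (3, d) weight matrix per relative
             position in a 3x3 window (at most 9 neighbors incl. self).
    Nodes outside the grid contribute zero (zero padding).
    Returns an (H, W, d) array of output feature vectors.
    """
    H, W, _ = grid.shape
    d = weights.shape[-1]
    out = np.zeros((H, W, d))
    for i in range(H):
        for j in range(W):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # neighbor's 3D coords times its weight matrix
                        out[i, j] += grid[ni, nj] @ weights[di + 1, dj + 1]
    return out

rng = np.random.default_rng(0)
grid = rng.standard_normal((4, 5, 3))        # hypothetical 4x5 grid
weights = rng.standard_normal((3, 3, 3, 8))  # hypothetical d = 8 channels
feat = grid_conv(grid, weights)
assert feat.shape == (4, 5, 8)
```

Since the filter weights are shared over all frames, the same call is applied independently to each frame of the sequence.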
3.3 Spatial-Temporal Gaussian Aggregation Sub-Network
To capture the temporal ordering of a skeleton sequence, a number of subsequences are constructed and then fed to different branches of ST-GA-NET (see Fig. 4). A branch of ST-GA-NET is designed to aggregate features for a subsequence of a specific finger. In this paper, we construct six subsequences for each skeleton sequence. The first subsequence is the original sequence. The next two subsequences are obtained by dividing the sequence into two subsequences of equal length. The last three subsequences are obtained by dividing the sequence into three subsequences of equal length. This results in 30 branches for ST-GA-NET (6 subsequences × 5 fingers).
To aggregate features in the branch associated with subsequence c and finger g, g ∈ {1, …, 5}, each frame of subsequence c is processed through a series of layers. Let J_g be the set of hand joints belonging to finger g, t_b and t_e be the beginning and ending frames of subsequence c, and t ∈ [t_b, t_e] be a given frame of subsequence c. The output feature vectors f_{i,t'} of the convolutional layer with i ∈ J_g are the features fed to the branch. Let us finally consider a sliding window of 2Δ + 1 frames (Δ being a time interval parameter) centered on frame t. Following previous works [28, 43], we assume that the f_{i,t'}, i ∈ J_g, t' ∈ [t − Δ, t + Δ], are independent and identically distributed samples from a Gaussian distribution (hereafter abbreviated as Gaussian for simplicity):
N(x; μ_t, Σ_t) = (2π)^{−d/2} |Σ_t|^{−1/2} exp( −(1/2)(x − μ_t)^T Σ_t^{−1} (x − μ_t) )    (3)

where |Σ_t| is the determinant of Σ_t, μ_t is the mean vector and Σ_t is the covariance matrix. The parameters of the Gaussian can be estimated as:

μ_t = (1/L) Σ_{i ∈ J_g} Σ_{t'=t−Δ}^{t+Δ} f_{i,t'}    (4)

Σ_t = (1/L) Σ_{i ∈ J_g} Σ_{t'=t−Δ}^{t+Δ} (f_{i,t'} − μ_t)(f_{i,t'} − μ_t)^T    (5)

where L = |J_g|(2Δ + 1) is the number of samples in the window. Based on the method in [34] that embeds the space of Gaussians in the Riemannian symmetric space, the Gaussian N(μ_t, Σ_t) can be identified with the SPD matrix:

G_t = |Σ_t|^{−1/(d+1)} [ Σ_t + μ_t μ_t^T , μ_t ; μ_t^T , 1 ]    (6)
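The estimation and embedding steps above can be sketched as follows (the small diagonal jitter is our addition for numerical stability, not part of the method):

```python
import numpy as np

def gauss_agg(X):
    """Fit a Gaussian to samples X of shape (L, n) and embed it as an
    (n+1) x (n+1) SPD matrix, following the Riemannian embedding of
    Lovric et al.: |Sigma|^(-1/(n+1)) [[Sigma + mu mu^T, mu], [mu^T, 1]]."""
    L, n = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    sigma = Xc.T @ Xc / L + 1e-6 * np.eye(n)  # jitter: our addition
    top = np.hstack([sigma + np.outer(mu, mu), mu[:, None]])
    bot = np.hstack([mu[None, :], np.ones((1, 1))])
    return np.linalg.det(sigma) ** (-1.0 / (n + 1)) * np.vstack([top, bot])

X = np.random.default_rng(1).standard_normal((50, 3))  # toy samples
P = gauss_agg(X)
assert P.shape == (4, 4)
assert np.allclose(P, P.T)
assert np.all(np.linalg.eigvalsh(P) > 0)  # the embedding is SPD
```

The positivity check reflects the property used throughout the network: the embedded matrix is symmetric positive definite whenever the covariance is.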
The GaussAgg layer is designed to perform the computation of Eq. 6, that is:

G_t = f_GA( { f_{i,t'} : i ∈ J_g, t' ∈ [t − Δ, t + Δ] } )    (7)

where f_GA is the mapping of the GaussAgg layer and G_t is the output of the GaussAgg layer.
The next layer, ReEig [19], introduces non-linear transformations of SPD matrices via a mapping defined as:

X_out = f_RE(X_in) = U max(εI, Λ) U^T    (8)

where f_RE is the mapping of the ReEig layer, X_in and X_out are the input and output SPD matrices, X_in = U Λ U^T is the eigendecomposition of X_in, ε is a rectification threshold, I is the identity matrix, and max(εI, Λ) is a diagonal matrix whose diagonal elements are defined as:

max(εI, Λ)(i,i) = Λ(i,i) if Λ(i,i) > ε, and ε otherwise.    (9)
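A minimal numpy sketch of this eigenvalue rectification:

```python
import numpy as np

def re_eig(X, eps=1e-4):
    """ReEig: clamp the eigenvalues of a symmetric matrix X at eps.
    X = U diag(l) U^T  ->  U diag(max(l, eps)) U^T."""
    lam, U = np.linalg.eigh(X)
    # column-wise scaling of U implements U @ diag(max(l, eps)) @ U^T
    return (U * np.maximum(lam, eps)) @ U.T

A = np.diag([2.0, 1e-9, 0.5])  # nearly rank-deficient SPD matrix
B = re_eig(A)
assert np.all(np.linalg.eigvalsh(B) >= 1e-4 - 1e-12)
assert np.allclose(re_eig(np.eye(3)), np.eye(3))  # eigenvalues above eps pass through
```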
After the ReEig layer, the LogEig layer [19] is used to map SPD matrices to a Euclidean space. Formally, the mapping of this layer is defined as:

X_out = f_LE(X_in) = U log(Λ) U^T    (10)

where f_LE is the mapping of the LogEig layer, X_in and X_out are the input and output matrices, and X_in = U Λ U^T is the eigendecomposition of X_in, as before.
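A corresponding sketch of the matrix logarithm computed through the eigendecomposition:

```python
import numpy as np

def log_eig(X):
    """LogEig: matrix logarithm of an SPD matrix via eigendecomposition,
    mapping the SPD manifold to the space of symmetric matrices."""
    lam, U = np.linalg.eigh(X)
    return (U * np.log(lam)) @ U.T

P = np.array([[2.0, 0.5], [0.5, 1.0]])
L = log_eig(P)
assert np.allclose(L, L.T)
# the matrix exponential of the result recovers P
lam, U = np.linalg.eigh(L)
assert np.allclose((U * np.exp(lam)) @ U.T, P)
```

The output is symmetric but generally not positive definite, which is exactly why it can be treated as a point in a Euclidean space.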
The next layer, referred to as VecMat, vectorizes symmetric matrices by the following mapping [48]:

v = f_VM(X) = [ X(1,1), √2 X(1,2), …, √2 X(1,n), X(2,2), √2 X(2,3), …, X(n,n) ]^T    (11)

where f_VM is the mapping of the VecMat layer, X is the input n × n matrix and v is the output vector; the diagonal entries X(i,i) are kept as-is, while the off-diagonal entries X(i,j), i < j, are scaled by √2.
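The half-vectorization can be sketched as follows; the √2 scaling makes the Euclidean norm of the vector equal to the Frobenius norm of the matrix:

```python
import numpy as np

def vec_mat(X):
    """Half-vectorize a symmetric n x n matrix: diagonal entries kept
    as-is, off-diagonal entries scaled by sqrt(2)."""
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)          # strict upper triangle
    return np.concatenate([np.diag(X), np.sqrt(2.0) * X[iu]])

X = np.array([[1.0, 2.0], [2.0, 3.0]])
v = vec_mat(X)
assert v.shape == (3,)
assert np.isclose(np.linalg.norm(v), np.linalg.norm(X))  # norms match
```

Note that the entry ordering here (all diagonal entries first, then the upper triangle) is one convenient convention; any fixed ordering with the same scaling preserves the norm identity.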
Let v_t denote the output of the VecMat layer at frame t. We again assume that the v_t, t ∈ [t_b, t_e], are independent and identically distributed samples from a Gaussian whose parameters can be estimated as:

μ = (1 / (t_e − t_b + 1)) Σ_{t=t_b}^{t_e} v_t    (12)

Σ = (1 / (t_e − t_b + 1)) Σ_{t=t_b}^{t_e} (v_t − μ)(v_t − μ)^T    (13)

The second GaussAgg layer then performs the mapping:

G_{c,g} = f_GA( { v_t : t ∈ [t_b, t_e] } ) = |Σ|^{−1/(q+1)} [ Σ + μμ^T , μ ; μ^T , 1 ]    (14)

where q denotes the dimension of the vectors v_t. The resulting SPD matrix G_{c,g} describes the variations of finger g along subsequence c.
3.4 Temporal-Spatial Gaussian Aggregation Sub-Network
Similarly to ST-GA-NET, TS-GA-NET is composed of 30 branches, where each branch aggregates features for a subsequence of a specific finger. The subsequences are constructed in exactly the same way as for ST-GA-NET. However, the feature aggregation performed by the first and second GaussAgg layers differs. More precisely, consider the branch associated with subsequence c and finger g. First, subsequence c is further divided into M subsequences of equal length. Let t_b^m and t_e^m, m ∈ {1, …, M}, be the beginning and ending frames of these subsequences. Then, for a given hand joint i ∈ J_g and subsequence m, the first GaussAgg layer computes an SPD matrix given as:

G_{i,m} = |Σ_{i,m}|^{−1/(d+1)} [ Σ_{i,m} + μ_{i,m} μ_{i,m}^T , μ_{i,m} ; μ_{i,m}^T , 1 ]    (15)

where μ_{i,m} and Σ_{i,m} are the mean vector and covariance matrix estimated from the samples f_{i,t}, t ∈ [t_b^m, t_e^m].

Note that G_{i,m} encodes the first- and second-order statistics of hand joint i computed within subsequence m. This temporal variation of individual joints is not captured by the first GaussAgg layer of ST-GA-NET. The resulting SPD matrices are processed through the ReEig, LogEig and VecMat layers. Let v_{i,m}, i ∈ J_g, m ∈ {1, …, M}, be the output vectors of the VecMat layer of the branch. The second GaussAgg layer of TS-GA-NET then performs the following mapping:

G_{c,g} = f_GA( { v_{i,m} : i ∈ J_g, m ∈ {1, …, M} } ) = |Σ|^{−1/(q+1)} [ Σ + μμ^T , μ ; μ^T , 1 ]    (16)

where q denotes the dimension of the vectors v_{i,m}, and μ and Σ can be estimated as:

μ = (1 / (M |J_g|)) Σ_{i ∈ J_g} Σ_{m=1}^{M} v_{i,m}    (17)

Σ = (1 / (M |J_g|)) Σ_{i ∈ J_g} Σ_{m=1}^{M} (v_{i,m} − μ)(v_{i,m} − μ)^T    (18)
3.5 SPD Matrix Learning and Classification Sub-Network
The outputs of the sub-networks ST-GA-NET and TS-GA-NET are sets of SPD matrices. The objective of the classification sub-network (see Fig. 6) is to transform those sets into a new SPD matrix, then map it to a Euclidean space for classification. The mapping of the SPDAgg layer is defined as:

X_out = Σ_{k=1}^{K} W_k X_k W_k^T    (19)

where X_1, …, X_K are the input SPD matrices, W_1, …, W_K are the transformation matrices, and X_out is the output matrix.
To guarantee that the output is SPD, we remark that the right-hand side of Eq. (19) can be rewritten as:

X_out = W̃ X̃ W̃^T    (20)

where W̃ = [W_1, …, W_K] is the horizontal concatenation of the transformation matrices and X̃ is the block-diagonal matrix whose diagonal blocks contain the input matrices X_1, …, X_K:

X̃ = diag(X_1, …, X_K)    (21)
It can be easily seen that X̃ is a valid SPD matrix: for any nonzero vector z = [z_1^T, …, z_K^T]^T, where the blocks z_k have equal sizes, one has z^T X̃ z = Σ_{k=1}^{K} z_k^T X_k z_k. The right-hand side is strictly positive since the X_k are SPD and there must exist k such that z_k ≠ 0 (as z ≠ 0), which implies that z^T X̃ z > 0.
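As a quick numerical check of this argument, the sketch below (hypothetical sizes, random inputs) builds the SPDAgg output both as the sum of Eq. (19) and as the factored form of Eq. (20), and verifies that the result is SPD:

```python
import numpy as np

rng = np.random.default_rng(2)

def rand_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)  # well-conditioned SPD matrix

K, n, m = 4, 6, 3                   # K inputs of size n, reduced size m
Xs = [rand_spd(n) for _ in range(K)]
Ws = [rng.standard_normal((m, n)) for _ in range(K)]

# SPDAgg: sum_k W_k X_k W_k^T  ==  Wtil @ blockdiag(X_1..X_K) @ Wtil^T
X_out = sum(W @ X @ W.T for W, X in zip(Ws, Xs))
Wtil = np.hstack(Ws)                # (m, K*n), full row rank a.s.
Xtil = np.zeros((K * n, K * n))
for k, X in enumerate(Xs):
    Xtil[k*n:(k+1)*n, k*n:(k+1)*n] = X  # block-diagonal stacking
assert np.allclose(X_out, Wtil @ Xtil @ Wtil.T)
assert np.all(np.linalg.eigvalsh(X_out) > 0)  # the output is SPD
```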
Inspired by [19], we assume that the combined matrix W̃ is a full row-rank matrix. Optimal solutions for the transformation matrices are then obtained by additionally assuming that W̃^T resides on a compact Stiefel manifold.¹ The transformation matrices W_k are updated by optimizing W̃ and splitting the optimal W̃ into its column blocks. Note that this imposes the constraint that the dimension of the output X_out is smaller than the total dimension of the stacked input matrices.

¹A compact Stiefel manifold St(m, n) is the set of n × m matrices with orthonormal columns (m ≤ n).
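The optimization procedure is not spelled out here; as a generic sketch only (our assumption, not necessarily the exact algorithm of this network or of [19]), one Riemannian gradient step with a QR retraction keeps the rows of a matrix like W̃ orthonormal:

```python
import numpy as np

def stiefel_step(W, egrad, lr=0.01):
    """One update of a matrix W with orthonormal rows (W W^T = I):
    project the Euclidean gradient onto the tangent space of the
    Stiefel manifold, take a gradient step, and retract back onto
    the manifold with a QR decomposition."""
    Wt, G = W.T, egrad.T                           # orthonormal columns
    rgrad = G - Wt @ (Wt.T @ G + G.T @ Wt) / 2.0   # tangent projection
    Q, _ = np.linalg.qr(Wt - lr * rgrad)           # QR retraction
    return Q.T

rng = np.random.default_rng(3)
W0, _ = np.linalg.qr(rng.standard_normal((10, 3)))
W = W0.T                                           # (3, 10), W W^T = I
W1 = stiefel_step(W, rng.standard_normal((3, 10)))
assert np.allclose(W1 @ W1.T, np.eye(3), atol=1e-8)
```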
To map the output SPD matrix of the SPDAgg layer to a Euclidean space, we use the LogEig layer, followed by a fully connected (FC) layer and a softmax layer.
3.6 Gesture Recognition
The SPDAgg layer outputs a matrix X_out for each gesture sequence (see Fig. 6). This matrix is transformed to its matrix logarithm Y = log(X_out) and finally vectorized. The final representation of the gesture sequence is the vector [Y(1,1), √2 Y(1,2), …, Y(n,n)]^T, where the diagonal entries Y(i,i) are kept as-is and the off-diagonal entries Y(i,j), i < j, are scaled by √2.
3.7 Relation with Previous Works
Our approach is closely related to [19, 54]. We point out in the following paragraphs the relations between the proposed network and those introduced in [19, 54].

Our network takes the 3D coordinates of hand joints directly as input, while in [19], covariance matrices must be computed beforehand as input to their network.

Our network relies not only on second-order information (covariance) as in [19] but also on first-order information (mean). First-order information has been proven useful for capturing extra distribution information of low-level features [42]. Moreover, we consider the first- and second-order information of different subsets of hand joints, while [19] uses the whole set of joints to compute statistics. Our network is thus based on a finer granularity than [19].
4 Experiments
We conducted experiments using the Dynamic Hand Gesture (DHG) dataset [46, 47] and the First-Person Hand Action (FPHA) dataset [12]. In all experiments, the dimension d of the output feature vectors of the convolutional layer and the dimensions of the transformation matrices of the SPDAgg layer were kept fixed. All sequences of the two datasets were normalized to the same number of frames T.² The batch size and the learning rate were set to 30 and 0.01, respectively. The rectification threshold ε for the ReEig layer was set to 0.0001 [19]. The network obtained after a fixed number of training epochs was used to create the final gesture representation. The classifier was learned using the LIBLINEAR library [9] with L2-regularized L2-loss (dual), where C was set to 1, the tolerance of the termination criterion was set to 0.1 and no bias term was added. For the FPHA dataset, the non-optimized CPU implementation of our network on a 3.4GHz machine with 24GB RAM and Matlab R2015b takes about 22 minutes per epoch for training and 7 minutes for testing. In the following, we provide details on the experimental settings and the results obtained for each dataset.

²We tested other values of T and the differences between the obtained results were marginal.

4.1 Datasets and Experimental Settings
DHG dataset. The DHG dataset contains 14 gestures performed in two ways: using one finger and using the whole hand. Each gesture is executed several times by different actors. Gestures are subdivided into fine and coarse categories. The dataset provides the 3D coordinates of 22 hand joints, as illustrated in Fig. 1(a). It has been split into training sequences (70% of the dataset) and test sequences (30% of the dataset) [47].
FPHA dataset. This dataset contains action videos belonging to different action categories, recorded in 3 different scenarios and performed by 6 actors. Action sequences present high inter-subject and intra-subject variability of style, speed, scale, and viewpoint. The dataset provides the 3D coordinates of 21 hand joints, the same as the DHG dataset except for the palm joint. We used the 1:1 setting proposed in [12], which splits the action sequences into training and test sets.
Table 1: Recognition accuracy (%) of our network for different numbers of neighbors per hand joint in the convolutional layer.
Num. of a hand joint's neighbors  FPHA  DHG (14 gestures)  DHG (28 gestures)
3 (physical connections only)  91.65  93.10  88.33
9 (proposed 2D grid)  93.22  94.29  89.40
Table 2: Recognition accuracy (%) of our network for three different settings of the time interval Δ.
FPHA  DHG (14 gestures)  DHG (28 gestures)
93.22  94.29  89.40
93.04  94.17  89.04
93.04  94.29  89.40
4.2 Ablation Study
In this section, we examine the influence of the different components of our network on its accuracy. The default values of the time interval Δ and the number of subsequences M per branch of TS-GA-NET are set to 1 and 15, respectively.
Hand modeling. We evaluate the performance of our network when only physical connections of hand joints are used for the computations at the convolutional layer, i.e., connections between hand joints belonging to neighboring fingers are removed from the graph in Fig. 1(b). Each joint is then connected to at most three joints, including itself. Results shown in Tab. 1 confirm that the use of connections other than the physical connections of hand joints brings a performance improvement.
Time interval Δ. In this experiment, we vary Δ and keep the other components of our network unchanged. To ensure that the computation of covariance matrices is numerically stable, Δ cannot be chosen freely. Tab. 2 shows the performance of our network for three different settings of Δ. Results suggest that using 3 consecutive frames (Δ = 1) as input to the first GaussAgg layer of ST-GA-NET is sufficient to obtain good performance.
Number of subsequences M in a branch. This experiment is performed by varying M while keeping the other components of our network unchanged. For the same reason related to the computation of covariance matrices, M must lie in a certain interval. We tested three settings of M. Results given in Tab. 3 indicate that our network is not sensitive to the setting of M.
Contribution of ST-GA-NET and TS-GA-NET. We evaluate the performance of two networks, referred to as ST-HGR-NET and TS-HGR-NET, obtained by removing the sub-networks TS-GA-NET and ST-GA-NET from our network, respectively. Results shown in Tab. 4 reveal that neither ST-GA-NET nor TS-GA-NET alone always provides the best performance on all datasets. This motivates their combination via the SPDC-NET component, which contributes to the overall performance of our global network combining both sub-networks.
In the following, we report results obtained with the default settings of Δ and M, i.e., Δ = 1 and M = 15.
Table 3: Recognition accuracy (%) of our network for three different settings of the number of subsequences M in a branch of TS-GA-NET.
FPHA  DHG (14 gestures)  DHG (28 gestures)
93.33  94.29  89.40
92.87  94.05  88.93
92.70  94.29  89.04
Table 4: Recognition accuracy (%) of ST-HGR-NET, TS-HGR-NET and the combined network ST-TS-HGR-NET.
Network  FPHA  DHG (14 gestures)  DHG (28 gestures)
ST-HGR-NET  91.83  93.21  89.29
TS-HGR-NET  90.96  93.33  88.21
ST-TS-HGR-NET  93.22  94.29  89.40
4.3 Comparison with State-of-the-Art
DHG dataset. The comparison of our method with state-of-the-art methods on the DHG dataset is given in Tab. 5. The accuracy of the method of [19] was obtained by using the implementation provided by the authors with their default parameter settings. Our method significantly outperforms the competing ones. The network of [19] also learns an SPD matrix-based representation from skeletal data, which is similar in spirit to our network. However, they concatenate the 3D coordinates of the joints at each frame to create the feature vector of that frame, and their network's input is the covariance matrix computed from the feature vectors over the whole skeleton sequence. Thus, spatial and temporal relationships of joints are not effectively taken into account. By exploiting these relationships, our network improves the recognition accuracy by 19.05% and 19.76% over the results of [19] for the experiments with 14 and 28 gestures, respectively. For further comparison with existing methods, we conducted experiments using the leave-one-subject-out experimental protocol. Results in Tabs. 6 (14 gestures) and 7 (28 gestures) demonstrate that our method achieves the best results under this protocol. In particular, our method outperforms the most recent work [55] by 1.5 and 3 percentage points for the experiments with 14 and 28 gestures, respectively.
Table 5: Comparison of recognition accuracy (%) with state-of-the-art methods on the DHG dataset [47].
Method  Year  Color  Depth  Pose  14 gestures  28 gestures
Oreifej and Liu [40]  2013  ✗  ✓  ✗  78.53  74.03
Devanne et al. [3]  2015  ✗  ✗  ✓  79.61  62.00
Huang et al. [19]  2017  ✗  ✗  ✓  75.24  69.64
Ohn-Bar and Trivedi [38]  2013  ✗  ✗  ✓  83.85  76.53
Chen et al. [2]  2017  ✗  ✗  ✓  84.68  80.32
De Smedt et al. [46]  2016  ✗  ✗  ✓  88.24  81.90
Devineau et al. [4]  2018  ✗  ✗  ✓  91.28  84.35
ST-TS-HGR-NET    ✗  ✗  ✓  94.29  89.40
Table 6: Comparison of recognition accuracy (%) on the DHG dataset (14 gestures) using the leave-one-subject-out protocol.
Method  Year  Color  Depth  Pose  Accuracy (%)
De Smedt et al. [46]  2016  ✗  ✗  ✓  83.1
CNN+LSTM [36]  2018  ✗  ✗  ✓  85.6
Weng et al. [55]  2018  ✗  ✗  ✓  85.8
ST-TS-HGR-NET    ✗  ✗  ✓  87.3
Table 7: Comparison of recognition accuracy (%) on the DHG dataset (28 gestures) using the leave-one-subject-out protocol.
Method  Year  Color  Depth  Pose  Accuracy (%)
De Smedt et al. [46]  2016  ✗  ✗  ✓  80.0
CNN+LSTM [36]  2018  ✗  ✗  ✓  81.1
Weng et al. [55]  2018  ✗  ✗  ✓  80.4
ST-TS-HGR-NET    ✗  ✗  ✓  83.4
FPHA dataset. Tab. 8 shows the accuracies of our method and state-of-the-art methods on the FPHA dataset. The accuracies of the methods of [19] and [23] were obtained by using the implementations provided by the authors with their default parameter settings. Despite the simplicity of our network compared to the competing deep neural networks, it is superior to them on this dataset. The best performing method among the state-of-the-art methods is Gram Matrix, which gives 85.39% accuracy, 7.83 percentage points below our method. The remaining methods are outperformed by our method by more than 10 percentage points. We observe that the method of [19] performs well on this dataset. However, since this method does not fully exploit the spatial and temporal relationships of skeleton joints, it gives a significantly lower accuracy than our method. These results again confirm the effectiveness of the proposed network architecture for hand gesture recognition.
Table 8: Comparison of recognition accuracy (%) with state-of-the-art methods on the FPHA dataset.
Method  Year  Color  Depth  Pose  Accuracy (%)
Two stream-color [10]  2016  ✓  ✗  ✗  61.56
Two stream-flow [10]  2016  ✓  ✗  ✗  69.91
Two stream-all [10]  2016  ✓  ✗  ✗  75.30
HOG-depth [39]  2013  ✗  ✓  ✗  59.83
HOG-depth+pose [39]  2013  ✗  ✓  ✓  66.78
HON4D [40]  2013  ✗  ✓  ✗  70.61
Novel View [41]  2016  ✗  ✓  ✗  69.21
1-layer LSTM [62]  2016  ✗  ✗  ✓  78.73
2-layer LSTM [62]  2016  ✗  ✗  ✓  80.14
Moving Pose [59]  2013  ✗  ✗  ✓  56.34
Lie Group [49]  2014  ✗  ✗  ✓  82.69
HBRNN [6]  2015  ✗  ✗  ✓  77.40
Gram Matrix [61]  2016  ✗  ✗  ✓  85.39
TF [11]  2017  ✗  ✗  ✓  80.69
JOULE-color [18]  2015  ✓  ✗  ✗  66.78
JOULE-depth [18]  2015  ✗  ✓  ✗  60.17
JOULE-pose [18]  2015  ✗  ✗  ✓  74.60
JOULE-all [18]  2015  ✓  ✓  ✓  78.78
Huang et al. [19]  2017  ✗  ✗  ✓  84.35
Huang et al. [23]  2018  ✗  ✗  ✓  77.57
ST-TS-HGR-NET    ✗  ✗  ✓  93.22
5 Conclusion
We have presented a new neural network for hand gesture recognition that learns a discriminative SPD matrix encoding first-order and second-order statistics. Experimental evaluation on two benchmark datasets shows that our method outperforms state-of-the-art methods.
Acknowledgments.
This material is based upon work supported by the European Union and the Region Normandie under the project IGIL. We thank Guillermo Garcia-Hernando for providing access to the FPHA dataset [12].
References
 [1] P. Bilinski and F. Bremond. Video Covariance Matrix Logarithm for Human Action Recognition in Videos. In IJCAI, pages 2140–2147, 2015.
 [2] X. Chen, H. Guo, G. Wang, and L. Zhang. Motion Feature Augmented Recurrent Neural Network for Skeleton-based Dynamic Hand Gesture Recognition. CoRR, abs/1708.03278, 2017.
 [3] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. D. Bimbo. 3D Human Action Recognition by Shape Analysis of Motion Trajectories on Riemannian Manifold. IEEE Transactions on Cybernetics, 45(7):1340–1352, 2015.
 [4] G. Devineau, F. Moutarde, W. Xi, and J. Yang. Deep Learning for Hand Gesture Recognition on Skeletal Data. In IEEE International Conference on Automatic Face Gesture Recognition, pages 106–113, May 2018.

 [5] Z. Dong, S. Jia, C. Zhang, M. Pei, and Y. Wu. Deep Manifold Learning of Symmetric Positive Definite Matrices with Application to Face Recognition. In AAAI, pages 4009–4015, 2017.
 [6] Y. Du, W. Wang, and L. Wang. Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition. In CVPR, pages 1110–1118, 2015.
 [7] M. Engin, L. Wang, L. Zhou, and X. Liu. DeepKSPD: Learning Kernel-matrix-based SPD Representation for Fine-grained Image Recognition. CoRR, abs/1711.04047, 2017.
 [8] G. Evangelidis, G. Singh, and R. Horaud. Skeletal Quads: Human Action Recognition Using Joint Quadruples. In ICPR, pages 4513–4518, 2014.
 [9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
 [10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional Two-Stream Network Fusion for Video Action Recognition. In CVPR, pages 1933–1941, 2016.
 [11] G. Garcia-Hernando and T.-K. Kim. Transition Forests: Learning Discriminative Temporal Transitions for Action Recognition. In CVPR, pages 407–415, 2017.
 [12] G. Garcia-Hernando, S. Yuan, S. Baek, and T.-K. Kim. First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations. In CVPR, 2018.
 [13] R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
 [14] K. Guo, P. Ishwar, and J. Konrad. Action Recognition From Video Using Feature Covariance Matrices. IEEE Transactions on Image Processing, 22(6):2479–2494, 2013.
 [15] M. Harandi, M. Salzmann, and R. Hartley. Dimensionality Reduction on SPD Manifolds: The Emergence of Geometry-Aware Methods. TPAMI, 40:48–62, 2018.
 [16] M. T. Harandi, C. Sanderson, A. Sanin, and B. C. Lovell. Spatiotemporal Covariance Descriptors for Action and Gesture Recognition. In WACV, pages 103–110, 2013.
 [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, June 2016.
 [18] J. Hu, W. Zheng, J. Lai, and J. Zhang. Jointly Learning Heterogeneous Features for RGB-D Activity Recognition. In CVPR, pages 5344–5352, 2015.
 [19] Z. Huang and L. V. Gool. A Riemannian Network for SPD Matrix Learning. In AAAI, pages 2036–2042, 2017.
 [20] Z. Huang, C. Wan, T. Probst, and L. V. Gool. Deep Learning on Lie Groups for SkeletonBased Action Recognition. In CVPR, pages 6099–6108, 2017.
 [21] Z. Huang, R. Wang, X. Li, W. Liu, S. Shan, L. V. Gool, and X. Chen. GeometryAware Similarity Learning on SPD Manifolds for Visual Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2513–2523, 2018.
 [22] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen. Log-Euclidean Metric Learning on Symmetric Positive Definite Manifold with Application to Image Set Classification. In ICML, pages 720–729, 2015.
 [23] Z. Huang, J. Wu, and L. V. Gool. Building Deep Networks on Grassmann Manifolds. In AAAI, pages 3279–3286, 2018.
 [24] C. Ionescu, O. Vantzos, and C. Sminchisescu. Matrix Backpropagation for Deep Networks with Structured Layers. In ICCV, pages 2965–2973, 2015.
 [25] Q. Ke, M. Bennamoun, S. An, F. A. Sohel, and F. Boussaïd. A New Representation of Skeleton Sequences for 3D Action Recognition. In CVPR, pages 4570–4579, 2017.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, pages 1097–1105, 2012.
 [27] C. Li, Z. Cui, W. Zheng, C. Xu, and J. Yang. SpatioTemporal Graph Convolution for Skeleton Based Action Recognition. In AAAI, pages 3482–3489, 2018.
 [28] P. Li, Q. Wang, H. Zeng, and L. Zhang. Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. TPAMI, 39(4):803–817, 2017.
 [29] P. Li, J. Xie, Q. Wang, and W. Zuo. Is Second-order Information Helpful for Large-scale Visual Recognition? In ICCV, pages 2070–2078, 2017.
 [30] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In ECCV, pages 816–833, 2016.
 [31] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In CVPR, pages 3671–3680, 2017.
 [32] M. Liu, H. Liu, and C. Chen. Enhanced Skeleton Visualization for View Invariant Human Action Recognition. Pattern Recognition, 68:346–362, 2017.
 [33] M. Liu and J. Yuan. Recognizing Human Actions as The Evolution of Pose Estimation Maps. In CVPR, 2018.

 [34] M. Lovrić, M. Min-Oo, and E. A. Ruh. Multivariate Normal Distributions Parametrized As a Riemannian Symmetric Space. Journal of Multivariate Analysis, 74(1):36–48, 2000.
 [35] J. Luo, W. Wang, and H. Qi. Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps. In ICCV, pages 1809–1816, Dec 2013.
 [36] J. C. Núñez, R. Cabido, J. J. Pantrigo, A. S. Montemayor, and J. F. Vélez. Convolutional Neural Networks and Long Short-Term Memory for Skeleton-based Human Activity and Hand Gesture Recognition. Pattern Recognition, 76(C):80–94, 2018.
 [37] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. Sequence of The Most Informative Joints (SMIJ): A New Representation for Human Skeletal Action Recognition. Journal of Visual Communication and Image Representation, 25(1):24–38, 2014.
 [38] E. Ohn-Bar and M. M. Trivedi. Joint Angles Similarities and HOG2 for Action Recognition. In CVPRW, pages 465–470, 2013.
 [39] E. Ohn-Bar and M. M. Trivedi. Hand Gesture Recognition in Real Time for Automotive Interfaces: A Multimodal Vision-Based Approach and Evaluations. IEEE Transactions on Intelligent Transportation Systems, 15(6):2368–2377, 2014.
 [40] O. Oreifej and Z. Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences. In CVPR, pages 716–723, June 2013.
 [41] H. Rahmani and A. Mian. 3D Action Recognition from Novel Viewpoints. In CVPR, pages 1506–1515, June 2016.
 [42] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image Classification with the Fisher Vector: Theory and Practice. IJCV, 105(3):222–245, 2013.
 [43] G. Serra, C. Grana, M. Manfredi, and R. Cucchiara. GOLD: Gaussians of Local Descriptors for Image Representation. CVIU, 134:22–32, 2015.
 [44] A. Shahroudy, J. Liu, T. T. Ng, and G. Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In CVPR, pages 1010–1019, 2016.
 [45] L. Shi, Y. Zhang, J. Cheng, and H. Lu. Non-Local Graph Convolutional Networks for Skeleton-Based Action Recognition. CoRR, abs/1805.07694, 2018.
 [46] Q. De Smedt, H. Wannous, and J.-P. Vandeborre. Skeleton-Based Dynamic Hand Gesture Recognition. In CVPRW, pages 1206–1214, June 2016.
 [47] Q. De Smedt, H. Wannous, J.-P. Vandeborre, J. Guerry, B. Le Saux, and D. Filliat. 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset. In Eurographics Workshop on 3D Object Retrieval, pages 33–38, 2017.
 [48] O. Tuzel, F. Porikli, and P. Meer. Pedestrian Detection via Classification on Riemannian Manifolds. TPAMI, 30(10):1713–1727, 2008.
 [49] R. Vemulapalli, F. Arrate, and R. Chellappa. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In CVPR, pages 588–595, 2014.
 [50] C. Wang, Y. Wang, and A. L. Yuille. An Approach to PoseBased Action Recognition. In CVPR, pages 915–922, 2013.
 [51] H. Wang and L. Wang. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In CVPR, pages 3633–3642, 2017.
 [52] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining Actionlet Ensemble for Action Recognition with Depth Cameras. In CVPR, pages 1290–1297, 2012.
 [53] P. Wang, Z. Li, Y. Hou, and W. Li. Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks. In ACM MM, pages 102–106, 2016.
 [54] Q. Wang, P. Li, and L. Zhang. G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition. In CVPR, pages 2730–2739, 2017.
 [55] J. Weng, M. Liu, X. Jiang, and J. Yuan. Deformable Pose Traversal Convolution for 3D Action and Gesture Recognition. In ECCV, 2018.
 [56] S. Yan, Y. Xiong, and D. Lin. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI, pages 7444–7452, 2018.
 [57] X. Yang and Y. L. Tian. EigenJoints-based Action Recognition Using Naive-Bayes-Nearest-Neighbor. In CVPRW, pages 14–19, 2012.
 [58] C. Yuan, W. Hu, X. Li, S. Maybank, and G. Luo. Human Action Recognition Under Log-Euclidean Riemannian Metric. In ACCV, pages 343–353, 2010.
 [59] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection. In ICCV, pages 2752–2759, 2013.
 [60] T. Zhang, W. Zheng, Z. Cui, and C. Li. Deep Manifold-to-Manifold Transforming Network. CoRR, abs/1705.10732, 2017.
 [61] X. Zhang, Y. Wang, M. Gou, M. Sznaier, and O. Camps. Efficient Temporal Sequence Comparison and Classification Using Gram Matrix Embeddings on a Riemannian Manifold. In CVPR, pages 4498–4507, 2016.
 [62] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie. Co-occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks. In AAAI, pages 3697–3703, 2016.
Appendix A Backpropagation Procedures
This part provides details on the backpropagation procedures used during the training of our network. Our network can be encoded as a pair $(f, \theta)$ where $f = f^{(l)} \circ f^{(l-1)} \circ \ldots \circ f^{(1)}$ is a composition of $l$ layers, $\theta = (\theta^{(1)}, \ldots, \theta^{(l)})$ represents the network parameters, and $\theta^{(k)}$ are the parameters of layer $f^{(k)}$. Let $L^{(k)} = \ell \circ f^{(l)} \circ \ldots \circ f^{(k)}$ be the loss as a function of layer $f^{(k)}$. In the following, we omit the superscript $(k)$ of $L^{(k)}$ for the sake of convenience.
A.1 SPDAgg layer
We present in this section a method based on the chain rule of [24] for the computation of the partial derivatives. For more details on the established theory, we refer readers to [24]. Writing the layer as $Y = W^T X W$, the variation of $Y$ is given by:

$$dY = dW^T X W + W^T\, dX\, W + W^T X\, dW \qquad (22)$$

The chain rule in this case is:

$$\frac{\partial L}{\partial Y} : dY = \frac{\partial L}{\partial X} : dX + \frac{\partial L}{\partial W} : dW \qquad (23)$$

By replacing $dY$ in Eq. (23) with its expression in Eq. (22), the left-hand side of Eq. (23) becomes:

$$\frac{\partial L}{\partial Y} : \big(dW^T X W + W^T\, dX\, W + W^T X\, dW\big) \qquad (24)$$

Using the properties [24] of the matrix inner product ":" and the fact that $X$ and $\frac{\partial L}{\partial Y}$ are symmetric, we have:

$$\frac{\partial L}{\partial Y} : dW^T X W = X W \frac{\partial L}{\partial Y} : dW \qquad (25)$$

$$\frac{\partial L}{\partial Y} : W^T\, dX\, W = W \frac{\partial L}{\partial Y} W^T : dX \qquad (26)$$

$$\frac{\partial L}{\partial Y} : W^T X\, dW = X W \frac{\partial L}{\partial Y} : dW \qquad (27)$$

The expression (24) now becomes:

$$W \frac{\partial L}{\partial Y} W^T : dX + 2\, X W \frac{\partial L}{\partial Y} : dW \qquad (28)$$

Since the last expression is equal to the right-hand side of Eq. (23), we obtain the partial derivatives:

$$\frac{\partial L}{\partial W} = 2\, X W \frac{\partial L}{\partial Y} \qquad (29)$$

$$\frac{\partial L}{\partial X} = W \frac{\partial L}{\partial Y} W^T \qquad (30)$$
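As a sanity check, the partial derivatives of Eqs. (29) and (30) can be verified numerically. The sketch below is not the paper's implementation: it assumes the bilinear form $Y = W^T X W$ and an arbitrary toy loss, and compares Eq. (29) against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)                       # SPD input
W = np.linalg.qr(rng.standard_normal((n, m)))[0]  # orthonormal columns (Stiefel point)

def loss(X, W):
    Y = W.T @ X @ W
    return np.sum(Y ** 2)                         # toy loss with dL/dY = 2Y (symmetric)

dLdY = 2 * (W.T @ X @ W)
dLdW = 2 * X @ W @ dLdY                           # Eq. (29)
dLdX = W @ dLdY @ W.T                             # Eq. (30)

# central finite differences on each entry of W
eps = 1e-6
num = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(X, Wp) - loss(X, Wm)) / (2 * eps)
assert np.allclose(num, dLdW, atol=1e-4)
```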
To learn the weights of this layer, we use the method proposed in [19]. The weight $W$ is updated in two steps. First, the component tangential to the Stiefel manifold is obtained by subtracting the normal component of the Euclidean gradient:

$$\tilde\nabla_{W^{(t)}} L = \nabla_{W^{(t)}} L - W^{(t)} \big(\nabla_{W^{(t)}} L\big)^T W^{(t)} \qquad (31)$$

where $W^{(t)}$ is the weight at iteration $t$ and $W^{(t)} \big(\nabla_{W^{(t)}} L\big)^T W^{(t)}$ is the normal component of the Euclidean gradient $\nabla_{W^{(t)}} L$. Following Eq. (29), the Euclidean gradient is given by:

$$\nabla_{W_i^{(t)}} L = 2\, X_i\, W_i^{(t)}\, \frac{\partial L}{\partial Y} \qquad (32)$$
where $X = \operatorname{diag}(X_1, \ldots, X_n)$, and $W_i$ is the projection of $W$ on its columns corresponding to $X_i$.
Then a retraction operation is used to map the updated weight from the tangent space of the Stiefel manifold back onto the Stiefel manifold as:

$$W^{(t+1)} = \Gamma\big(W^{(t)} - \lambda\, \tilde\nabla_{W^{(t)}} L\big) \qquad (33)$$

where $\Gamma$ is the retraction operation and $\lambda$ is the learning rate.
The updated weights $W_i$ at iteration $t+1$ can be computed as:

$$W_i^{(t+1)} = \Gamma\big(W_i^{(t)} - \lambda\, \tilde\nabla_{W_i^{(t)}} L\big) \qquad (34)$$
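The two-step update of Eqs. (31) and (33) can be sketched as follows. The QR-based retraction is one common choice and is an assumption here, since the concrete form of $\Gamma$ is not specified in this appendix.

```python
import numpy as np

def stiefel_update(W, egrad, lr):
    # Tangential component (Eq. (31)): subtract the normal component
    # W @ egrad.T @ W of the Euclidean gradient egrad.
    rgrad = egrad - W @ egrad.T @ W
    # Retraction (Eq. (33)): here the Q factor of a QR decomposition
    # (an assumed choice of Gamma), with column signs fixed so the
    # result is uniquely defined.
    Q, R = np.linalg.qr(W - lr * rgrad)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
W = np.linalg.qr(rng.standard_normal((6, 3)))[0]   # point on the Stiefel manifold
W_new = stiefel_update(W, rng.standard_normal((6, 3)), lr=1e-2)
assert np.allclose(W_new.T @ W_new, np.eye(3))     # columns remain orthonormal
```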
A.2 LogEig and ReEig layers
To make this document self-contained for readers, we present here the computation of the partial derivatives for the LogEig and ReEig layers. For more details, we refer readers to [19, 24]. For the LogEig layers, the first step receives a matrix $X$ as input and produces matrices $U$ and $\Sigma$ such that $X = U \Sigma U^T$. The partial derivative $\frac{\partial L}{\partial X}$ can be computed from those of the outputs $U$ and $\Sigma$ as [24]:

$$\frac{\partial L}{\partial X} = U \Big( K^T \circ \Big( U^T \frac{\partial L}{\partial U} \Big) + \Big( \frac{\partial L}{\partial \Sigma} \Big)_{\mathrm{diag}} \Big) U^T \qquad (35)$$

where $\circ$ denotes the Hadamard product, $P_{\mathrm{diag}}$ is $P$ with all off-diagonal elements being 0, and $K$ is defined as:

$$K_{ij} = \begin{cases} \dfrac{1}{\sigma_i - \sigma_j}, & i \neq j \\ 0, & i = j \end{cases} \qquad (36)$$

with $\sigma_1, \ldots, \sigma_n$ the eigenvalues on the diagonal of $\Sigma$. The second step receives matrices $U$ and $\Sigma$ as input and produces the matrix $Y = U \log(\Sigma)\, U^T$. The partial derivatives $\frac{\partial L}{\partial U}$ and $\frac{\partial L}{\partial \Sigma}$ can be computed from those of the output $Y$ as [19]:

$$\frac{\partial L}{\partial U} = 2\, \frac{\partial L}{\partial Y}\, U \log(\Sigma) \qquad (37)$$

$$\frac{\partial L}{\partial \Sigma} = \Sigma^{-1}\, U^T\, \frac{\partial L}{\partial Y}\, U \qquad (38)$$

The ReEig layers can be decomposed into two steps like the LogEig layers, where the partial derivatives of the first step are computed in the same way. For the second step, which produces $Y = U \max(\epsilon I, \Sigma)\, U^T$, the partial derivatives $\frac{\partial L}{\partial U}$ and $\frac{\partial L}{\partial \Sigma}$ can be computed from those of the output $Y$ as:

$$\frac{\partial L}{\partial U} = 2\, \frac{\partial L}{\partial Y}\, U \max(\epsilon I, \Sigma) \qquad (39)$$

$$\frac{\partial L}{\partial \Sigma} = Q\, U^T\, \frac{\partial L}{\partial Y}\, U \qquad (40)$$

where $\max(\epsilon I, \Sigma)$ is defined in Eq. (9), and $Q$ is the gradient of $\max(\epsilon I, \Sigma)$ with diagonal elements defined as:

$$Q_{ii} = \begin{cases} 1, & \sigma_i > \epsilon \\ 0, & \sigma_i \leq \epsilon \end{cases} \qquad (41)$$
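Composing Eqs. (35)–(38) gives the full backward pass of the LogEig layer, which can be checked against a finite difference along a symmetric direction. The sketch below is a NumPy illustration, not the paper's implementation; the final symmetrization reflects the fact that the layer's input is a symmetric matrix.

```python
import numpy as np

def logm_spd(M):
    # Matrix logarithm of an SPD matrix via eigendecomposition.
    s, U = np.linalg.eigh(M)
    return U @ np.diag(np.log(s)) @ U.T

def logeig_backward(X, dLdY):
    # Backprop through Y = U log(Sigma) U^T: Eqs. (37)-(38) for the
    # second step, then Eqs. (35)-(36) through the eigendecomposition.
    sig, U = np.linalg.eigh(X)
    dLdU = 2 * dLdY @ U @ np.diag(np.log(sig))          # Eq. (37)
    dLdSig = np.diag(1.0 / sig) @ U.T @ dLdY @ U        # Eq. (38)
    diff = sig[:, None] - sig[None, :]                  # sigma_i - sigma_j
    off = ~np.eye(len(sig), dtype=bool)
    K = np.zeros_like(diff)
    K[off] = 1.0 / diff[off]                            # Eq. (36)
    inner = K.T * (U.T @ dLdU) + np.diag(np.diag(dLdSig))
    G = U @ inner @ U.T                                 # Eq. (35)
    return 0.5 * (G + G.T)                              # symmetric input

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
X = A @ A.T + 5 * np.eye(5)                             # SPD input
C = rng.standard_normal((5, 5)); C = 0.5 * (C + C.T)    # symmetric dL/dY
V = rng.standard_normal((5, 5)); V = 0.5 * (V + V.T)    # symmetric direction

G = logeig_backward(X, C)
f = lambda M: np.sum(C * logm_spd(M))                   # L = <C, log(M)>
eps = 1e-5
fd = (f(X + eps * V) - f(X - eps * V)) / (2 * eps)
assert np.isclose(fd, np.sum(G * V), rtol=1e-4)
```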
A.3 VecMat layer
For the VecMat layer, the expression of the partial derivatives is straightforward and can be written as:

$$\frac{\partial L}{\partial X} = \operatorname{mat}\Big( \frac{\partial L}{\partial y} \Big) \qquad (42)$$

where $y = \operatorname{vec}(X)$ is the output of the VecMat layer and $\operatorname{mat}(\cdot) = \operatorname{vec}^{-1}(\cdot)$ maps a vector back to its matrix form.
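The exact vectorization used by VecMat is defined in the main text; a common choice for symmetric matrices, assumed in the sketch below, keeps the upper triangle and scales off-diagonal entries by $\sqrt{2}$ so that the Euclidean inner product is preserved.

```python
import numpy as np

def vecmat(X):
    # Upper-triangular vectorization of a symmetric matrix. The sqrt(2)
    # off-diagonal scaling (an assumption here) makes the map isometric:
    # <vecmat(A), vecmat(B)> = <A, B>_F for symmetric A, B.
    iu = np.triu_indices(X.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return scale * X[iu]

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); A = 0.5 * (A + A.T)
B = rng.standard_normal((4, 4)); B = 0.5 * (B + B.T)
assert np.isclose(vecmat(A) @ vecmat(B), np.sum(A * B))
```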
A.4 GaussAgg layer
The general form of the mapping of the GaussAgg layers can be written as:

$$Y = \begin{pmatrix} \Sigma + \mu \mu^T & \mu \\ \mu^T & 1 \end{pmatrix} \qquad (43)$$

where $X = [x_1, \ldots, x_T]$ is the input of the GaussAgg layer, $Y$ is the output of the GaussAgg layer, and $\mu = \frac{1}{T} \sum_{t=1}^{T} x_t$ and $\Sigma = \frac{1}{T} \sum_{t=1}^{T} (x_t - \mu)(x_t - \mu)^T$ are the mean vector and covariance matrix of $x_1, \ldots, x_T$.
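The Gaussian embedding of Eq. (43) and the factorization identity it relies on can be checked numerically; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 50
X = rng.standard_normal((d, T))            # columns are the inputs x_t
mu = X.mean(axis=1, keepdims=True)
Sigma = (X - mu) @ (X - mu).T / T          # (biased) covariance matrix

# Gaussian embedding of Eq. (43)
Y = np.block([[Sigma + mu @ mu.T, mu],
              [mu.T, np.ones((1, 1))]])

# Identity (1/T) * sum_t x_t x_t^T = Sigma + mu mu^T, applied to the
# augmented matrix Xbar = [X; 1 ... 1], gives Y = (1/T) * Xbar Xbar^T.
Xbar = np.vstack([X, np.ones((1, T))])
assert np.allclose(Y, Xbar @ Xbar.T / T)
```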
By the identity $\Sigma + \mu \mu^T = \frac{1}{T} \sum_{t=1}^{T} x_t x_t^T$, $Y$ can be expressed as a function of $X$ as [54]: