GeomNet: A Neural Network Based on Riemannian Geometries of SPD Matrix Space and Cholesky Space for 3D Skeleton-Based Interaction Recognition

11/25/2021
by Xuan Son Nguyen, et al.
ENSEA

In this paper, we propose a novel method for representation and classification of two-person interactions from 3D skeleton sequences. The key idea of our approach is to use Gaussian distributions to capture statistics on $\mathbb{R}^n$ and those on the space of symmetric positive definite (SPD) matrices. The main challenge is how to parametrize those distributions. Towards this end, we develop methods for embedding Gaussian distributions in matrix groups based on the theory of Lie groups and Riemannian symmetric spaces. Our method relies on the Riemannian geometry of the underlying manifolds and has the advantage of encoding high-order statistics from 3D joint positions. We show that the proposed method achieves competitive results in two-person interaction recognition on three benchmarks for 3D human activity understanding.

1 Introduction

3D skeleton-based action recognition has been an active research topic in recent years with many potential applications. In this work, we focus on 3D skeleton-based two-person interaction recognition (3DTPIR). Compared to a large number of general-purpose methods for 3D skeleton-based action recognition, methods for 3DTPIR are much less studied in the literature. Recent works [46, 63] have shown that state-of-the-art action recognition methods do not always perform well on human interaction recognition. This is because they lack an effective mechanism for capturing intra-person and inter-person joint relationships [46].

In two-person interactions, arm and leg movements are highly correlated. However, most existing works for 3DTPIR [17, 18, 42, 65] encode these correlations simply by distances between joints. This motivates us to use high-order statistics, i.e. covariance matrices, to better capture these interactions. It is known that covariance matrices lie on a special type of Riemannian manifold, i.e. SPD manifolds. A large body of work has been developed for classification of SPD-valued data. Recently, SPD neural networks have demonstrated impressive results [13]. One of the core issues that remains open is finding effective and efficient methods for modeling probability distributions on SPD manifolds. Since Gaussian distributions (abbreviated as Gaussians) on $\mathbb{R}^n$ are the most popular probability distributions used in statistics, existing works mainly focused on generalizing them to SPD manifolds. Such a generalization was first given in [44] in the more general context of Riemannian manifolds. However, the asymptotic formulae of Riemannian Gaussian distributions (abbreviated as Riemannian Gaussians) proposed in that work make them hard to evaluate and apply in practice. Some works aim to address this shortcoming by introducing notions of Riemannian Gaussians in symmetric spaces [49, 50] and homogeneous spaces [6]. These have been successfully applied to classification problems. In this work, we are also interested in Riemannian Gaussians for classification. However, differently from the above works, we seek methods for embedding Riemannian Gaussians in matrix groups. This allows us to perform classification of Riemannian Gaussians without having to resort to an exact expression of their probability density function as in [6, 49, 50].

In summary, the main contributions of this work are:

  • We propose an embedding method for Gaussians by mapping them diffeomorphically to Riemannian symmetric spaces.

  • We consider representing a 3D skeleton sequence by a set of SPD matrices, which leads us to the study of statistics on SPD manifolds. We show that the product space of mean and covariance on SPD manifolds can be viewed as a Lie group with an appropriate group product. Moreover, we point out a connection between this space and the group of lower triangular matrices with positive diagonal entries.

  • Based on the theory described above, we introduce a neural network for learning a geometric representation from a 3D skeleton sequence.

  • Experiments on three benchmarks for 3D human activity understanding demonstrate the competitiveness of our method with state-of-the-art methods.

2 Related Works

We will briefly discuss representative works for 3DTPIR (Section 2.1), embeddings of Gaussians (Section 2.2), and probability distributions on SPD manifolds (Section 2.3).

2.1 Two-person Interaction Recognition from 3D Skeleton Sequences

A variety of approaches have been proposed for 3D skeleton-based action recognition. These are based on hand-crafted features [10, 35, 54, 58, 64, 67] and deep learning [9, 19, 28, 30, 31, 32, 33, 40, 52, 57, 60, 68]. Recent works focus on neural networks on manifolds [13, 14, 15, 41] and on graphs [7, 21, 22, 53, 62]. Due to space limits, we refer the interested reader to [48] for a more comprehensive survey. Below we focus our discussion on 3DTPIR.

Approaches for 3DTPIR are much less studied. Hand-crafted feature based methods mainly rely on distances [17, 18, 42, 65] or moving similarity [27] between joints of two persons. Li and Leung [23] applied a multiple kernel learning method to an interaction graph constructed from the relative variance of joint relative distances. Two-stream RNNs are proposed in [37, 57], where interactions between two persons are modeled by concatenating the 3D coordinates of their corresponding joints, or by augmenting the input sequence with distances between their joints. In [46], Relational Network [51] is extended to automatically infer intra-person and inter-person joint relationships. The recent work [63] deals with graph construction in graph convolutional networks for 3DTPIR.

2.2 Embedding of Gaussians

Methods for embedding Gaussians are widely used in statistics, e.g. for measuring the distance between probability distributions. The work of [47] first proposed a distance function based on the Fisher information as a Riemannian metric. However, in the general case of multivariate Gaussians, an exact formula for the distance function is difficult to obtain. In computer vision, one of the most widely used embeddings is derived from [34]. The key idea is to identify Gaussians with SPD matrices by parametrizing the space of Gaussians as a Riemannian symmetric space. The work of [5] shares a similar idea of identifying Gaussians with SPD matrices. However, it is based on embedding Gaussians into the Siegel group. In [12], a connection is established between Gaussians and a subspace of affine matrices. The method of [25] relies on the Log-Euclidean metrics [1] for embedding Gaussians in linear spaces.

2.3 Probability Distributions on SPD Manifolds

Existing works mainly focused on generalizing Gaussians to SPD manifolds due to their popularity in statistics. Generalizations of Gaussians are proposed in Riemannian manifolds [44, 66], symmetric spaces [49, 50], and homogeneous spaces [6]. In [2, 4], Riemannian Gaussians are derived from the definition of maximum entropy on exponential families. The family of alpha-divergences and other related divergences has also been extensively studied [8].

3 Background Theory

3.1 Statistics on Riemannian Manifolds

The theory presented here is based on [44]. In this framework, the structure of a manifold is specified by a Riemannian metric. Let be a point of as a local reference and be the tangent space at . A Riemannian metric is a continuous collection of dot products on . The distance between two points of a connected Riemannian manifold is the minimum length among the smooth curves joining them. The curves realizing this minimum for any two points are called geodesics. Let be a tangent vector at . We define the exponential map at as the function that maps to the point that is reached after a unit time by the geodesic starting at with this tangent vector. This map is defined in the whole tangent space but it is generally one-to-one only locally around 0 in the tangent space (i.e. around in the manifold). Let be the inverse of the exponential map that is the smallest vector as measured by the Riemannian metric such that . The exponential chart at can be seen as the development of in the tangent space at a given point along the geodesics.

The definitions of mean and covariance on a Riemannian manifold are given below.

Definition 1

Let be a random point of probability density function . Denote by the distance between induced by the Riemannian metric of . The set of expected or mean values is:

(1)

where is the volume measure induced by the Riemannian metric of , and we assume that the integral is finite for all points (which is true for a density with a compact support).

Definition 2

Let be a random point, be a mean value that we assume to be unique. The covariance is defined as:

(2)

where is the maximal definition domain for the exponential chart at .

4 Proposed Approach

In what follows, we are interested in matrices over the field of real numbers, unless otherwise stated.

4.1 Embedding of Gaussians

Let be the space of variate Gaussians, be the space of SPD matrices with determinant 1. Lemma 1 gives an embedding of Gaussians in .

Lemma 1

We can identify the Gaussian with the following matrix in :

(3)

where and are the mean and covariance of the Gaussian, is the identity matrix, is a matrix with identical column vectors .

Proof. See supplementary material.

When , the embedding in (3) becomes the one introduced in [34]. The natural symmetric Riemannian metric resulting from the above embedding is given in Lemma 2.

Lemma 2

The Riemannian metric is given by:

(4)

where and are two tangent vectors at .

Proof. See supplementary material.

It turns out that the Riemannian metric given in (4) belongs to the family of affine-invariant metrics proposed in [45]. Consequently, the exponential map at a point can be obtained [43] as:

(5)

where is a SPD matrix, is a tangent vector at , and is the matrix exponential. By inverting the exponential map, we obtain the logarithmic map:

(6)

where and are two SPD matrices, and is the matrix logarithm.
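The exponential and logarithmic maps above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming that (5) and (6) take the standard closed forms for affine-invariant metrics, i.e. Exp_P(W) = P^{1/2} exp(P^{-1/2} W P^{-1/2}) P^{1/2} and Log_P(Q) = P^{1/2} log(P^{-1/2} Q P^{-1/2}) P^{1/2}. The function names and the test matrices are illustrative and do not come from the paper.

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, U = np.linalg.eigh((S + S.T) / 2)  # symmetrise to absorb rounding errors
    return (U * fun(w)) @ U.T

def spd_exp_map(P, W):
    """Map the symmetric tangent vector W at the SPD point P onto the manifold."""
    P_half = _sym_fun(P, np.sqrt)
    P_inv_half = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return P_half @ _sym_fun(P_inv_half @ W @ P_inv_half, np.exp) @ P_half

def spd_log_map(P, Q):
    """Map the SPD point Q to a tangent vector at the SPD point P."""
    P_half = _sym_fun(P, np.sqrt)
    P_inv_half = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return P_half @ _sym_fun(P_inv_half @ Q @ P_inv_half, np.log) @ P_half

# Illustrative data (assumption): a random SPD base point and a random tangent vector.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
P = A @ A.T + 4.0 * np.eye(4)
B = rng.standard_normal((4, 4))
W = (B + B.T) / 2
Q = spd_exp_map(P, W)          # a new SPD matrix
W_back = spd_log_map(P, Q)     # recovers W up to rounding
```

Both maps only require symmetric eigenvalue decompositions, which is why layers built on them can be differentiated with the matrix backpropagation framework mentioned in Section 4.5.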

4.2 Statistics on

Suppose that we are given a set of matrices . From (1), one can define the empirical or discrete mean value of as:

(7)

The mean can be computed by an iterative procedure consisting of: (1) projecting the SPD matrices onto the tangent space at the current mean; (2) estimating the arithmetic mean in that space; (3) projecting the mean back onto the manifold. These steps are iterated until convergence [39].
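The three-step iteration can be sketched as follows. This is an illustrative implementation under the affine-invariant exponential and logarithmic maps, not the authors' code. The initialisation with the arithmetic mean, the stopping tolerance, and the function names are assumptions.

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return (U * fun(w)) @ U.T

def spd_frechet_mean(mats, n_iter=50, tol=1e-10):
    """Iteratively estimate the Riemannian mean of a list of SPD matrices."""
    mean = np.mean(mats, axis=0)  # starting point (assumption)
    for _ in range(n_iter):
        half = _sym_fun(mean, np.sqrt)
        inv_half = _sym_fun(mean, lambda w: 1.0 / np.sqrt(w))
        # (1) project every matrix onto the tangent space at the current mean
        #     (expressed in coordinates whitened by the current mean)
        logs = [_sym_fun(inv_half @ P @ inv_half, np.log) for P in mats]
        # (2) arithmetic mean in the tangent space
        T = np.mean(logs, axis=0)
        # (3) project the tangent mean back onto the manifold
        mean = half @ _sym_fun(T, np.exp) @ half
        if np.linalg.norm(T) < tol:
            break
    return mean

# For commuting matrices the result is the entry-wise geometric mean.
mats = [np.diag([1.0, 4.0]), np.diag([4.0, 1.0])]
mean = spd_frechet_mean(mats)
```

In this toy case the arithmetic mean would be diag(2.5, 2.5), whereas the Riemannian mean is diag(2, 2), which illustrates how the two notions differ.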

Similarly, the empirical covariance of a set of SPD matrices of mean is defined using the discrete version of the expectation operator in (2):

(8)

where $\otimes$ denotes the tensor product.

We propose to learn a transformation of from to another tangent space so that the covariance computed in this space is more discriminative for classification. The transformation is performed by parallel transport (PT). We need Lemma 3 for our transformation.

Lemma 3

Let . Let the Riemannian metric be the one given in (4). The PT from to along geodesics connecting and of a tangent vector is given by:

(9)

Proof. See supplementary material.

The formula of PT in (9) is the same as those in [11, 55, 61] which are all based on the Riemannian metric of the following form:

(10)

where and are two tangent vectors at .
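A short sketch of parallel transport on SPD matrices is given here for illustration. It assumes the closed form commonly reported for affine-invariant metrics, W ↦ E W Eᵀ with E = (Q P⁻¹)^{1/2}, and the trace form tr(P⁻¹ V P⁻¹ W) for the metric. Both are assumptions about the content of (9) and (10), and the function names are hypothetical.

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return (U * fun(w)) @ U.T

def spd_parallel_transport(P, Q, W):
    """Transport the tangent vector W from the SPD point P to the SPD point Q."""
    P_half = _sym_fun(P, np.sqrt)
    P_inv_half = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    M_half = _sym_fun(P_inv_half @ Q @ P_inv_half, np.sqrt)
    E = P_half @ M_half @ P_inv_half  # equals (Q P^{-1})^{1/2}
    return E @ W @ E.T

def ai_inner(P, V, W):
    """Affine-invariant inner product of tangent vectors V and W at P."""
    P_inv = np.linalg.inv(P)
    return np.trace(P_inv @ V @ P_inv @ W)

# Illustrative data (assumption): two random SPD points and one tangent vector.
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))
P = A @ A.T + 3.0 * np.eye(3)
Q = B @ B.T + 3.0 * np.eye(3)
W = (C + C.T) / 2
W_at_Q = spd_parallel_transport(P, Q, W)
```

Under these assumptions the transported vector stays symmetric and its norm at the target point equals the norm of the original vector at the source point, which is the defining property of parallel transport along a geodesic.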

In [4], the authors also use PT for designing Riemannian batch normalization (RBN) layers. Our method differs from theirs in three main aspects. First, their method learns the parameters of RBN layers from the statistics of mini-batches, while our method deals with the statistics within each sequence. Note that an RBN layer can also be designed in our framework and can potentially improve the accuracy of our network. Second, their formulation of Riemannian Gaussians involves only a Riemannian mean without a notion of variance. Third, their method does not aim to leverage the second-order statistics (covariance) on SPD manifolds.

Now suppose that are transported to another tangent space at (the target point of PT), then the covariance can be estimated as:

(11)

Let be a mapping that vectorizes a symmetric matrix by taking its lower triangular part and applying a coefficient on its off-diagonal entries in order to preserve the norm [43]. Then, the covariance is given by:

(12)
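The norm-preserving vectorization can be illustrated as follows. This is a sketch in which the off-diagonal coefficient is taken to be √2, the usual choice that makes the Euclidean norm of the vector equal to the Frobenius norm of the matrix. The 1/N normalisation of the covariance and the function names are assumptions, since (11) and (12) are not reproduced here.

```python
import numpy as np

def vec_sym(S):
    """Vectorize a symmetric matrix: lower triangle, off-diagonals scaled by sqrt(2)."""
    rows, cols = np.tril_indices(S.shape[0])
    coef = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return coef * S[rows, cols]

def tangent_covariance(tangent_vectors):
    """Covariance of symmetric tangent vectors after vectorization (1/N, assumption)."""
    V = np.stack([vec_sym(W) for W in tangent_vectors])  # shape (N, n(n+1)/2)
    return V.T @ V / V.shape[0]

# Illustrative data (assumption): five random symmetric 3x3 tangent vectors.
rng = np.random.default_rng(2)
tangents = []
for _ in range(5):
    B = rng.standard_normal((3, 3))
    tangents.append((B + B.T) / 2)
cov = tangent_covariance(tangents)  # a 6x6 symmetric matrix
```

For n × n tangent vectors the vectors have length n(n+1)/2, so the covariance is a square matrix of that size.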

If is a matrix of size , then we use a point that lies on the product manifold to parametrize the distribution of the given set of SPD matrices. Next, we propose an embedding of this point based on the Lie group theory.

4.3 Embedding of Riemannian Gaussians

We first define an appropriate group product on the product manifold .

Definition 3

Let be the product manifold . Let where , , and be the Cholesky decomposition of . Denote by a smooth bijective mapping with a smooth inverse where is a subset of the set of matrices. The group product between two elements of is defined as:

(13)

Theorem 1 shows that forms a Lie group.

Theorem 1

is a Lie group under product .

Proof. See supplementary material.

Based on Theorem 1, we can establish a Lie group isomorphism between and a subgroup of a group of lower triangular matrices with positive diagonal entries.

Theorem 2

Denote by the group of lower triangular matrices with positive diagonal entries, the matrix with all elements equal to zero. Let

(14)

where , and

(15)

where , . Then is a Lie group isomorphism.

Proof. See supplementary material.

We now can give the embedding matrix of a point where as follows:

(16)

The embedding matrix in (16) depends on the choice of function . In this work, we set where is the transpose of , is a matrix with identical column vectors obtained from , and is given by:

(17)

where is the eigenvalue decomposition of , and is the diagonal matrix of eigenvalue logarithms.

4.4 A Neural Network for 3DTPIR

We are now ready to introduce a neural network (GeomNet) for 3DTPIR based on the theory developed in the previous sections. Let and be the number of joints and that of frames in a given sequence, respectively. Let be the feature vector (3D coordinates) of joint at frame . Two joints and are neighbors if they are connected by a bone. Denote by the set of neighbors of joint . Let be the two joints selected as the roots of the first and second skeleton, respectively (see Fig. 1). For any two joints and that belong to the same skeleton, the distance between them is defined as the number of bones connecting them (see Fig. 1). The first layer of GeomNet is a convolutional layer written as:

(18)

where is the output feature vector of joint at frame , and is defined as:

(19)
Figure 1: Illustration of body joints’ positions (only the first skeleton is shown). The joint at the hip is selected as the root of the skeleton. The distance between joints 1 and 6 is 4. The joints 22,23,24,25 are not used in the convolution operation. The arms then contain the following joints: . The legs contain the following joints: (figure reproduced from [52]).

Here, the set of weights completely defines the convolution filters in Eq. (18). Let and be the numbers of joints belonging to the arms and legs of two skeletons, respectively (see Fig. 1). Let and respectively of size and be the data associated with the arms and legs of two persons. The motivation behind this partition is that the interaction between two persons often involves interactions among their arms and among their legs. For each , the set of dim feature vectors from is partitioned into subsets using K-means clustering. Let be the feature vectors in the subset. We assume that are i.i.d. samples from a Gaussian whose parameters can be estimated as:

(20)
(21)

Based on the theory developed in Section 4.1, the Gaussian can be identified with the following matrix:

(22)

The above computations can be performed by a layer as:

(23)
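To make the computation of this layer concrete, the sketch below estimates the Gaussian parameters of one cluster and embeds them as an SPD matrix with determinant 1. It covers only the special case mentioned after Lemma 1, in which the embedding reduces to the one of [34]. The block form used here, det(Σ)^{-1/(n+1)} [[Σ + μμᵀ, μ], [μᵀ, 1]], is the form commonly attributed to that embedding and is an assumption, as is the 1/N normalisation of the covariance. The general embedding in (3) is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def estimate_gaussian(X):
    """Estimate mean and covariance from samples X of shape (N, n) (1/N, assumption)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    return mu, Xc.T @ Xc / X.shape[0]

def embed_gaussian_special_case(mu, Sigma):
    """Embed N(mu, Sigma) as an (n+1) x (n+1) SPD matrix with unit determinant."""
    n = mu.shape[0]
    top = np.hstack([Sigma + np.outer(mu, mu), mu[:, None]])
    bottom = np.hstack([mu[None, :], np.ones((1, 1))])
    M = np.vstack([top, bottom])
    # det(M) = det(Sigma) by the Schur complement, so this scaling gives det = 1
    return np.linalg.det(Sigma) ** (-1.0 / (n + 1)) * M

# Illustrative data (assumption): 200 samples of 3-dimensional feature vectors.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3)) @ np.diag([1.0, 2.0, 0.5]) + np.array([0.3, -1.0, 2.0])
mu, Sigma = estimate_gaussian(X)
G = embed_gaussian_special_case(mu, Sigma)
```

A cluster with fewer samples than feature dimensions would yield a singular covariance, so a practical implementation would likely need some regularisation, which this sketch omits.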

The next layer is designed to compute statistics on SPD manifolds and can be written as:

(24)

where are the parameters corresponding to the target points of PT (see Section 4.2). Specifically, is the mean of , and is given by:

(25)

The next layer computes the embeddings of statistics and can be written as:

(26)

where is the embedding matrix of given in the right-hand side of (16).

The next layer transforms to some matrices in as:

(27)

where , are the parameters that are required to be in so that the outputs are also in . The network then performs a projection:

(28)

where

. Finally, a fully-connected (FC) layer and a softmax layer are used to obtain class probabilities:

(29)

where are the parameters of the FC layer, the operator concatenates the two column vectors and vertically, and are the output class probabilities. We use the cross-entropy loss for training GeomNet.

4.5 Geometry Aware Constrained Optimization

Some layers of GeomNet rely on the eigenvalue decomposition. To derive the backpropagation updates for these layers, we follow the framework of [16] for computation of the involved partial derivatives. The optimization procedure for the parameters is based on the Adam algorithm in Riemannian manifolds [3]. The Riemannian Adam update rule is given by:

(30)

where and are respectively the parameters updated at timesteps and , , , is a momentum term, is an adaptivity term, is the gradient evaluated at timestep , are constant values. The squared Riemannian norm corresponds to the squared gradient value in Riemannian settings. Here, is the dot product for the Riemannian metric of the manifold in consideration, as discussed in Section 3.1. After updating in Eq. (30), we update as the PT of along geodesics connecting and , i.e. .
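The structure of this update can be sketched on the SPD manifold as follows. This is a hedged illustration of the steps described in the text (momentum term, adaptivity term from the squared Riemannian norm of the gradient, exponential-map step, parallel transport of the momentum), not a reproduction of Eq. (30). The omission of bias correction, the hyperparameter values (the usual Adam defaults with a step size of 0.01), the toy objective, and all function names are assumptions.

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return (U * fun(w)) @ U.T

def _roots(P):
    return _sym_fun(P, np.sqrt), _sym_fun(P, lambda w: 1.0 / np.sqrt(w))

def exp_map(P, W):
    h, ih = _roots(P)
    return h @ _sym_fun(ih @ W @ ih, np.exp) @ h

def log_map(P, Q):
    h, ih = _roots(P)
    return h @ _sym_fun(ih @ Q @ ih, np.log) @ h

def transport(P, Q, W):
    h, ih = _roots(P)
    E = h @ _sym_fun(ih @ Q @ ih, np.sqrt) @ ih  # (Q P^{-1})^{1/2}
    return E @ W @ E.T

def sq_norm(P, W):
    """Squared affine-invariant norm tr(P^{-1} W P^{-1} W) of W at P."""
    A = np.linalg.solve(P, W)
    return np.trace(A @ A)

def riemannian_adam_step(P, grad, tau, v, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update; tau is the momentum transported from the previous step."""
    m = beta1 * tau + (1.0 - beta1) * grad             # momentum term
    v = beta2 * v + (1.0 - beta2) * sq_norm(P, grad)   # adaptivity term (scalar)
    P_new = exp_map(P, -lr * m / (np.sqrt(v) + eps))   # step along the geodesic
    tau_new = transport(P, P_new, m)                   # move momentum to the new point
    return P_new, tau_new, v

# Toy objective (assumption): 0.5 * dist(P, A)^2, whose Riemannian gradient is -Log_P(A).
A = np.diag([1.0, 2.0, 3.0])
P = np.eye(3)
tau, v = np.zeros((3, 3)), 0.0
d0 = np.sqrt(sq_norm(P, log_map(P, A)))
for _ in range(300):
    P, tau, v = riemannian_adam_step(P, -log_map(P, A), tau, v)
d_final = np.sqrt(sq_norm(P, log_map(P, A)))
```

Because every iterate is produced by the exponential map, the parameter stays SPD without any explicit projection, which is the main reason for using a Riemannian optimizer for constrained parameters.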

The update rule in Eq. (30) requires the computation of the exponential map and the PT. For SPD manifolds, these operations are given in Eqs. (5) and (9). It remains to define these operations for the update of the parameters . To this aim, we rely on the Riemannian geometry of studied in the recent work [26]. By considering the following metric:

(31)

where , , is the element on the row and column of , Lin has shown [26] that the space (referred to as Cholesky space) equipped with the above metric forms a Riemannian manifold. On this manifold, the exponential map at a point can be computed as:

(32)

where , , is a matrix of the same size as whose element is if and is zero otherwise, is a diagonal matrix whose element is . Also, the PT of a tangent vector to a tangent vector at is given by:

(33)

where .
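The two operations on the Cholesky space can be sketched as below. The formulas are the ones usually stated for the log-Cholesky geometry of [26]: the strictly lower triangular parts are handled additively, while the diagonal parts are handled multiplicatively. They are assumed here to correspond to (31), (32), and (33), and the function names are hypothetical.

```python
import numpy as np

def chol_exp_map(L, X):
    """Exponential map at L (lower triangular, positive diagonal) applied to X."""
    strict = np.tril(L, -1) + np.tril(X, -1)   # strictly lower parts add up
    dL, dX = np.diag(L), np.diag(X)
    return strict + np.diag(dL * np.exp(dX / dL))  # diagonal stays positive

def chol_parallel_transport(L, K, X):
    """Transport the tangent vector X from L to K."""
    scaled_diag = np.diag(K) / np.diag(L) * np.diag(X)
    return np.tril(X, -1) + np.diag(scaled_diag)

def chol_inner(L, X, Y):
    """Inner product of the lower triangular tangent vectors X and Y at L."""
    strict = np.sum(np.tril(X, -1) * np.tril(Y, -1))
    diag = np.sum(np.diag(X) * np.diag(Y) / np.diag(L) ** 2)
    return strict + diag

# Illustrative data (assumption): two points of the Cholesky space and one tangent vector.
L = np.array([[1.0, 0.0, 0.0], [0.5, 2.0, 0.0], [-0.3, 0.7, 1.5]])
K = np.array([[3.0, 0.0, 0.0], [0.1, 0.5, 0.0], [0.2, -0.4, 1.0]])
X = np.array([[0.4, 0.0, 0.0], [-1.0, -0.8, 0.0], [0.6, 0.2, 1.1]])
L_new = chol_exp_map(L, X)
X_at_K = chol_parallel_transport(L, K, X)
```

Both operations are element-wise, so updating lower triangular parameters is considerably cheaper than updating SPD parameters, which require eigenvalue decompositions.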

5 Experiments

Our network was implemented with the TensorFlow deep learning framework and the experiments were conducted using two NVIDIA GeForce GTX 1080 GPUs. We used the Geomstats library [38] for geometric computations. The dimension of output vectors at the convolutional layer, the number of clusters , and the learning rate were set to 9, 180, and , respectively. The batch sizes were set respectively to 30 and 256 for the experiments on SBU Interaction dataset and those on NTU datasets. The values of the pair (see (3) and (16)) were set to and for the experiments on SBU Interaction and NTU datasets, respectively. The values of , and in the Riemannian Adam algorithm (our code deals with constrained and unconstrained parameters) were set to , and , respectively [20]. In our experiments, GeomNet converged well after 600 epochs. For more details on our experiments, we refer the interested reader to the supplementary material.

5.1 Datasets and Experimental Settings

SBU Interaction dataset. This dataset [65] contains 282 sequences in 8 action classes created from 7 subjects. Each action is performed by two subjects where each subject has 15 joints. The joints 4,21,1,5,6,7,9,10,11,13,14,15,17,18,19 in Fig. 1 correspond respectively to the joints 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 of the first skeleton of SBU Interaction dataset. We followed the experimental protocol based on 5-fold cross validation with the provided training/testing splits [65].

NTU RGB+D 60 dataset. This dataset [52] contains 56,880 sequences created from 40 subjects with three camera views and categorized into 60 classes. We followed the two experimental protocols cross-subject (X-subject) and cross-view (X-view) [52].

NTU RGB+D 120 dataset. This dataset [29] contains 114,480 sequences in 120 action classes, captured from 106 subjects with three camera views. We followed the two experimental protocols cross-subject (X-subject) and cross-setup (X-setup) [29].

5.2 Ablation Study

In this section, we study the impact of different components of GeomNet on its accuracy on SBU Interaction and NTU RGB+D 60 datasets. The results of GeomNet are averaged over 3 runs.

Embedding dimensions. Here we investigate the impact of the parameters and (see (3) and (16)). Fig. 2 shows the accuracies of GeomNet on SBU Interaction dataset with different settings of , i.e. and . Note that when , the layer relies only on the covariance information. Also, when , the outputs of the layer are simply obtained by the Cholesky decomposition of , i.e. . It is interesting to note that GeomNet achieves the best accuracy with , i.e. none of and is equal to 1. This is opposed to previous works [12, 24, 41, 59], where -variate Gaussians are always identified with SPD matrices. To the best of our knowledge, this is the first work that shows the benefits of identifying -variate Gaussians with SPD matrices where . The results also reveal that the setting of has a non-negligible impact on the accuracy of GeomNet. Namely, the performance gap between two settings (94.54%) and (96.33%) is 1.79%. We can also notice that when is fixed, GeomNet always performs best with . This shows the effectiveness of our parameterization of Riemannian Gaussians in (16).

Figure 2: Accuracy of GeomNet on SBU Interaction dataset with different settings of .

Setting        SBU Interaction    NTU RGB+D 60 X-Subject    NTU RGB+D 60 X-View
Without PT     71.51              62.18                     66.83
PT             96.33              93.62                     96.32

Table 1: Effectiveness of PT on SBU Interaction and NTU RGB+D 60 datasets.

To investigate the effectiveness of our proposed embedding of Gaussians outside of our framework, we used it to improve the state-of-the-art neural network on SPD manifolds SPDNet [13]. In [13], the authors performed action recognition experiments by representing each sequence by a joint covariance descriptor. The covariance descriptor is computed from the second order statistics of the 3D coordinates of all body joints in each frame. For SBU Interaction dataset, the size of the covariance matrix is (30 body joints in each frame). In our experiment, we combined the covariance matrix and the mean vector using the proposed embedding of Gaussians to represent each sequence. Each sequence is then represented by an SPD matrix. We used the code of SPDNet published by the authors (https://github.com/zhiwu-huang/SPDNet). Fig. 3 shows the accuracies of SPDNet, averaged over 10 runs, on SBU Interaction dataset with different settings of . As can be observed, SPDNet gives the best accuracy with the setting . The performance gap between two settings (90.5%) and (92.38%) is 1.88%. The accuracy of SPDNet when using only the covariance () is 79.48%, which is significantly worse than its accuracy with the setting . The results confirm that our proposed embedding of Gaussians is effective in the framework of SPDNet and that it is advantageous over the one of [34]. This suggests that our method could also be beneficial to previous works that rely on Gaussians to capture local feature distribution, e.g. [12, 24, 25, 36, 41, 59].

Figure 3: Accuracy of SPDNet on SBU Interaction dataset with different settings of .

Setting         SBU Interaction    NTU RGB+D 60 X-Subject    NTU RGB+D 60 X-View
Without LTML    94.90              92.30                     95.05
LTML            96.33              93.62                     96.32

Table 2: Effectiveness of lower triangular matrix learning (LTML) on SBU Interaction and NTU RGB+D 60 datasets.

Parallel transport. Tab. 1 gives the accuracies of GeomNet without using PT on SBU Interaction and NTU RGB+D 60 datasets. The accuracies of GeomNet are also shown for comparison purposes. When PT is not used, the covariance in Eq. (12) is computed as:

(34)

It can be seen that the use of PT is crucial for obtaining high accuracy. Specifically, on NTU RGB+D 60 dataset, computing the covariance without PT results in a loss of 31.44% on X-Subject protocol and a loss of 29.49% on X-View protocol. On SBU Interaction dataset, a significant reduction in accuracy (24.82%) can also be observed when PT is not used. These results highlight the importance of learning the parameters in GeomNet.

Lower triangular matrix learning. Tab. 2 gives the accuracies of GeomNet without using the layer on SBU Interaction and NTU RGB+D 60 datasets. Again, the accuracies of GeomNet are also shown for comparison purposes. We can note that the introduction of the layer brings performance improvement, i.e. 1.43% on SBU Interaction dataset, and 1.32% on X-Subject protocol and 1.27% on X-View protocol on NTU RGB+D 60 dataset.

5.3 Results on SBU Interaction Dataset

Results of GeomNet and state-of-the-art methods on SBU Interaction dataset are given in Tab. 3. For SPDNet, we report its best accuracy using the embedding in (3) with . We can remark that the accuracies of most of the hand-crafted feature based methods [18, 56] are lower than 90%. The state-of-the-art method [9] for skeleton-based action recognition only gives a modest accuracy of 80.35%, the second worst accuracy among the competing methods. GeomNet achieves the best accuracy of 96.33%, which is 3.95% better than the best accuracy of SPDNet (92.38%) and 16.85% better than the accuracy of SPDNet when using only the covariance (79.48%).

Method                           Accuracy
Lie Group [56]                   47.92
Contrast Mining [18]             86.90
Interaction Graph [23]           92.56
Trust Gate LSTM [28]             93.30
Hierarchical RNN [9]             80.35
Deep LSTM+Co-occurrence [68]     90.41
SPDNet [13]                      92.38
GeomNet                          96.33

Table 3: Recognition accuracy (%) of GeomNet and state-of-the-art methods on SBU Interaction dataset.

5.4 Results on NTU RGB+D 60 Dataset

Tab. 4 shows the results of GeomNet and state-of-the-art methods on NTU RGB+D 60 dataset. For ST-GCN and AS-GCN, we used the code published by the authors (https://github.com/yysijie/st-gcn, https://github.com/limaosen0/AS-GCN). For SPDNet, we report its best accuracy using the embedding in (3) with . We can observe that GeomNet gives the best results on this dataset. Since ST-GCN is based on fixed skeleton graphs which might miss implicit joint correlations, AS-GCN improves it by learning actional links to capture the latent dependencies between joints. AS-GCN also extends the skeleton graphs to represent structural links. However, AS-GCN does not achieve significant improvements over ST-GCN. This indicates that actional and structural links in AS-GCN are still not able to cope with complex patterns in 3DTPIR. As can be seen, GeomNet outperforms ST-GCN and AS-GCN by large margins. We can also note a large performance gap between GeomNet and SPDNet. This can probably be explained by the fact that: (1) GeomNet aims to learn inter-person joint relationships; (2) GeomNet leverages the covariance information on SPD manifolds.

Method               X-Subject    X-View
ST-LSTM [30]         83.0         87.3
ST-GCN [62]          86.75        91.17
AS-GCN [22]          87.08        92.04
LSTM-IRN [46]        90.5         93.5
SPDNet [13]          74.85        76.07
GeomNet              93.62        96.32

Table 4: Recognition accuracy (%) of GeomNet and state-of-the-art methods on NTU RGB+D 60 dataset.

Method               X-Subject    X-Setup
ST-LSTM [30]         63.0         66.6
ST-GCN [62]          78.60        79.92
AS-GCN [22]          77.83        79.30
LSTM-IRN [46]        77.7         79.6
ST-GCN-PAM [63]      83.28        -
SPDNet [13]          60.72        62.08
GeomNet              86.49        87.58

Table 5: Recognition accuracy (%) of GeomNet and state-of-the-art methods on NTU RGB+D 120 dataset.

5.5 Results on NTU RGB+D 120 Dataset

Results of GeomNet and state-of-the-art methods on NTU RGB+D 120 dataset are given in Tab. 5. For SPDNet, we report its best accuracy using the embedding in (3) with . As can be observed, GeomNet performs best on this dataset. Note that LSTM-IRN performs significantly worse than GeomNet on this most challenging dataset. By adapting the graph structure in ST-GCN to involve connections between two skeletons, ST-GCN-PAM achieves significant improvements. However, ST-GCN-PAM is still outperformed by GeomNet by 3.21% on X-Subject protocol (the authors did not report its accuracy on X-Setup protocol). The results indicate that: (1) without any prior knowledge, automatic inference of intra-person and inter-person joint relationships is difficult; (2) even with prior knowledge, the state-of-the-art ST-GCN performs worse than GeomNet. Compared to the results on NTU RGB+D 60 dataset, the performance gap between GeomNet and SPDNet is more pronounced on this dataset. Notice that our method is based only on the assumption that the joints of the arms of two persons and those of their legs are highly correlated during their interaction. Therefore, no explicit assumption on pairwise joint connections is required for interaction recognition.

6 Conclusion

We have presented GeomNet, a neural network based on embeddings of Gaussians and Riemannian Gaussians for 3DTPIR. To improve the accuracy of GeomNet, we have proposed the use of PT and a layer that learns lower triangular matrices with positive diagonal entries. Finally, we have provided experimental results on three benchmarks showing the effectiveness of GeomNet.

Acknowledgments. We thank the authors of NTU RGB+D datasets for providing access to their datasets.

References

  • [1] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache (2005) Fast and Simple Computations on Tensors with Log-Euclidean Metrics. Technical Report RR-5584, INRIA. Cited by: §2.2.
  • [2] F. Barbaresco (2019) Jean-Louis Koszul and the Elementary Structures of Information Geometry. In Geometric Structures of Information, F. Nielsen (Ed.), pp. 333–392. Cited by: §2.3.
  • [3] G. Bécigneul and O. Ganea (2019) Riemannian Adaptive Optimization Methods. In ICLR, Cited by: §4.5.
  • [4] D. A. Brooks, O. Schwander, F. Barbaresco, J. Schneider, and M. Cord (2019) Riemannian Batch Normalization for SPD Neural Networks. In NeurIPS, pp. 15463–15474. Cited by: §2.3, §4.2.
  • [5] M. Calvo and J. M. Oller (1990) A Distance between Multivariate Normal Distributions Based in an Embedding into the Siegel Group. Journal of Multivariate Analysis 35 (2), pp. 223–242. Cited by: §2.2.
  • [6] R. Chakraborty and B. Vemuri (2017) Statistics on the (compact) Stiefel manifold: Theory and Applications. CoRR abs/1708.00045. External Links: 1708.00045 Cited by: §1, §2.3.
  • [7] K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu (2020) Skeleton-Based Action Recognition With Shift Graph Convolutional Network. In CVPR, pp. 180–189. Cited by: §2.1.
  • [8] A. Cichocki and S. Amari (2010) Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities. Entropy 12 (6), pp. 1532–1568. Cited by: §2.3.
  • [9] Y. Du, W. Wang, and L. Wang (2015) Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition. In CVPR, pp. 1110–1118. Cited by: §2.1, §5.3, Table 3.
  • [10] G. Evangelidis, G. Singh, and R. Horaud (2014) Skeletal Quads: Human Action Recognition Using Joint Quadruples. In ICPR, pp. 4513–4518. Cited by: §2.1.
  • [11] R. Ferreira, J. Xavier, J. P. Costeira, and V. Barroso (2006) Newton Method for Riemannian Centroid Computation in Naturally Reductive Homogeneous Spaces. In ICASSP, pp. 704–707. Cited by: §4.2.
  • [12] L. Gong, T. Wang, and F. Liu (2009) Shape of Gaussians as Feature Descriptors. In CVPR, pp. 2366–2371. Cited by: §2.2, §5.2, §5.2.
  • [13] Z. Huang and L. V. Gool (2017) A Riemannian Network for SPD Matrix Learning. In AAAI, pp. 2036–2042. Cited by: §1, §2.1, §5.2, Table 3, Table 4, Table 5.
  • [14] Z. Huang, C. Wan, T. Probst, and L. V. Gool (2017) Deep Learning on Lie Groups for Skeleton-Based Action Recognition. In CVPR, pp. 6099–6108. Cited by: §2.1.
  • [15] Z. Huang, J. Wu, and L. V. Gool (2018) Building Deep Networks on Grassmann Manifolds. In AAAI, pp. 3279–3286. Cited by: §2.1.
  • [16] C. Ionescu, O. Vantzos, and C. Sminchisescu (2015) Matrix Backpropagation for Deep Networks with Structured Layers. In ICCV, pp. 2965–2973. Cited by: §4.5.
  • [17] Y. Ji, H. Cheng, Y. Zheng, and H. Li (2015) Learning Contrastive Feature Distribution Model for Interaction Recognition. Journal of Visual Communication and Image Representation 33 (C), pp. 340–349. Cited by: §1, §2.1.
  • [18] Y. Ji, G. Ye, and H. Cheng (2014) Interactive Body Part Contrast Mining for Human Interaction Recognition. In 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1–6. Cited by: §1, §2.1, §5.3, Table 3.
  • [19] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaïd (2017) A New Representation of Skeleton Sequences for 3D Action Recognition. In CVPR, pp. 4570–4579. Cited by: §2.1.
  • [20] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In ICLR, Cited by: §5.
  • [21] C. Li, Z. Cui, W. Zheng, C. Xu, and J. Yang (2018) Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition. In AAAI, pp. 3482–3489. Cited by: §2.1.
  • [22] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian (2019) Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In CVPR, pp. 3595–3603. Cited by: §2.1, Table 4, Table 5.
  • [23] M. Li and H. Leung (2016) Multiview Skeletal Interaction Recognition Using Active Joint Interaction Graph. IEEE Transactions on Multimedia 18 (11), pp. 2293–2302. Cited by: §2.1, Table 3.
  • [24] P. Li, J. Xie, Q. Wang, and W. Zuo (2017) Is Second-order Information Helpful for Large-scale Visual Recognition?. In ICCV, pp. 2070–2078. Cited by: §5.2, §5.2.
  • [25] P. Li, Q. Wang, H. Zeng, and L. Zhang (2017) Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. TPAMI 39 (4), pp. 803–817. Cited by: §2.2, §5.2.
  • [26] Z. Lin (2019) Riemannian Geometry of Symmetric Positive Definite Matrices via Cholesky Decomposition. SIAM Journal on Matrix Analysis and Applications 40 (4), pp. 1353–1370. Cited by: §4.5.
  • [27] B. Liu, Z. Ju, and H. Liu (2018) A Structured Multi-Feature Representation for Recognizing Human Action and Interaction. Neurocomputing 318, pp. 287–296. Cited by: §2.1.
  • [28] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang (2018) Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates. TPAMI 40 (12), pp. 3007–3021. Cited by: §2.1, Table 3.
  • [29] J. Liu, A. Shahroudy, M. Perez, G. Wang, L. Duan, and A. C. Kot (2019) NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. TPAMI 42 (10), pp. 2684–2701. Cited by: §5.1.
  • [30] J. Liu, A. Shahroudy, D. Xu, and G. Wang (2016) Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In ECCV, pp. 816–833. Cited by: §2.1, Table 4, Table 5.
  • [31] J. Liu, G. Wang, P. Hu, L. Duan, and A. C. Kot (2017) Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In CVPR, pp. 3671–3680. Cited by: §2.1.
  • [32] M. Liu, H. Liu, and C. Chen (2017) Enhanced Skeleton Visualization for View Invariant Human Action Recognition. Pattern Recognition 68, pp. 346–362. Cited by: §2.1.
  • [33] M. Liu and J. Yuan (2018) Recognizing Human Actions as The Evolution of Pose Estimation Maps. In CVPR, pp. 1159–1168. Cited by: §2.1.
  • [34] M. Lovrić, M. Min-Oo, and E. A. Ruh (2000) Multivariate Normal Distributions Parametrized As a Riemannian Symmetric Space. Journal of Multivariate Analysis 74 (1), pp. 36–48. Cited by: §2.2, §4.1, §5.2.
  • [35] J. Luo, W. Wang, and H. Qi (2013) Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps. In ICCV, pp. 1809–1816. Cited by: §2.1.
  • [36] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato (2016) Hierarchical Gaussian Descriptor for Person Re-identification. In CVPR, pp. 1363–1372. Cited by: §5.2.
  • [37] Q. Men, E. S. L. Ho, H. P. H. Shum, and H. Leung (2020) A Two-Stream Recurrent Network for Skeleton-based Human Interaction Recognition. In ICPR, Cited by: §2.1.
  • [38] N. Miolane, J. Mathe, C. Donnat, M. Jorda, and X. Pennec (2018) geomstats: a Python Package for Riemannian Geometry in Machine Learning. CoRR abs/1805.08308. External Links: 1805.08308 Cited by: §5.
  • [39] M. Moakher (2005) A Differential Geometric Approach to the Geometric Mean of Symmetric Positive-definite Matrices. SIAM J. Matrix Anal. Appl. 26 (3), pp. 735–747. Cited by: §4.2.
  • [40] J. C. Núñez, R. Cabido, J. J. Pantrigo, A. S. Montemayor, and J. F. Vélez (2018) Convolutional Neural Networks and Long Short-Term Memory for Skeleton-based Human Activity and Hand Gesture Recognition. Pattern Recognition 76 (C), pp. 80–94. Cited by: §2.1.
  • [41] X. S. Nguyen, L. Brun, O. Lézoray, and S. Bougleux (2019) A Neural Network Based on SPD Manifold Learning for Skeleton-based Hand Gesture Recognition. In CVPR, pp. 12036–12045. Cited by: §2.1, §5.2, §5.2.
  • [42] O. Ouyed and M. S. Allili (2020) Group-of-features Relevance in Multinomial Kernel Logistic Regression and Application to Human Interaction Recognition. Expert Systems with Applications 148, pp. 113247. Cited by: §1, §2.1.
  • [43] X. Pennec, P. Fillard, and N. Ayache (2004) A Riemannian Framework for Tensor Computing. Technical Report RR-5255, INRIA. Cited by: §4.1, §4.2.
  • [44] X. Pennec (2004) Probabilities and Statistics on Riemannian Manifolds: A Geometric Approach. Technical Report RR-5093, INRIA. Cited by: §1, §2.3, §3.1.
  • [45] X. Pennec (2006) Statistical Computing on Manifolds for Computational Anatomy. Habilitation à diriger des recherches, Université Nice Sophia-Antipolis. Cited by: §4.1.
  • [46] M. Perez, J. Liu, and A. C. Kot (2019) Interaction Relational Network for Mutual Action Recognition. CoRR abs/1910.04963. External Links: 1910.04963 Cited by: §1, §2.1, Table 4, Table 5.
  • [47] C. R. Rao (1992) Information and the Accuracy Attainable in the Estimation of Statistical Parameters. In Breakthroughs in Statistics: Foundations and Basic Theory, S. Kotz and N. L. Johnson (Eds.), pp. 235–247. Cited by: §2.2.
  • [48] B. Ren, M. Liu, R. Ding, and H. Liu (2020) A Survey on 3D Skeleton-Based Action Recognition Using Learning Method. CoRR abs/2002.05907. External Links: 2002.05907 Cited by: §2.1.
  • [49] S. Said, L. Bombrun, Y. Berthoumieu, and J. Manton (2017) Riemannian Gaussian Distributions on the Space of Symmetric Positive Definite Matrices. IEEE Trans. Inf. Theor. 63 (4), pp. 2153–2170. Cited by: §1, §2.3.
  • [50] S. Said, H. Hajri, L. Bombrun, and B. C. Vemuri (2018) Gaussian Distributions on Riemannian Symmetric Spaces: Statistical Learning With Structured Covariance Matrices. IEEE Trans. Inf. Theor. 64 (2), pp. 752–772. Cited by: §1, §2.3.
  • [51] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A Simple Neural Network Module for Relational Reasoning. In NIPS, pp. 4967–4976. Cited by: §2.1.
  • [52] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In CVPR, pp. 1010–1019. Cited by: §2.1, Figure 1, §5.1.
  • [53] C. Si, W. Chen, W. Wang, L. Wang, and T. Tan (2019) An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition. In CVPR, pp. 1227–1236. Cited by: §2.1.
  • [54] Q. D. Smedt, H. Wannous, and J. Vandeborre (2016) Skeleton-Based Dynamic Hand Gesture Recognition. In CVPRW, pp. 1206–1214. Cited by: §2.1.
  • [55] S. Sra and R. Hosseini (2015) Conic Geometric Optimization on the Manifold of Positive Definite Matrices. SIAM Journal on Optimization 25 (1), pp. 713–739. Cited by: §4.2.
  • [56] R. Vemulapalli, F. Arrate, and R. Chellappa (2014) Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In CVPR, pp. 588–595. Cited by: §5.3, Table 3.
  • [57] H. Wang and L. Wang (2017) Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In CVPR, pp. 3633–3642. Cited by: §2.1, §2.1.
  • [58] J. Wang, Z. Liu, Y. Wu, and J. Yuan (2012) Mining Actionlet Ensemble for Action Recognition with Depth Cameras. In CVPR, pp. 1290–1297. Cited by: §2.1.
  • [59] Q. Wang, P. Li, and L. Zhang (2017) G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition. In CVPR, pp. 2730–2739. Cited by: §5.2, §5.2.
  • [60] J. Weng, M. Liu, X. Jiang, and J. Yuan (2018) Deformable Pose Traversal Convolution for 3D Action and Gesture Recognition. In ECCV, pp. 142–157. Cited by: §2.1.
  • [61] O. Yair, M. Ben-Chen, and R. Talmon (2019) Parallel Transport on the Cone Manifold of SPD Matrices for Domain Adaptation. IEEE Transactions on Signal Processing 67 (7), pp. 1797–1811. Cited by: §4.2.
  • [62] S. Yan, Y. Xiong, and D. Lin (2018) Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI, pp. 7444–7452. Cited by: §2.1, Table 4, Table 5.
  • [63] C. Yang, A. Setyoko, H. Tampubolon, and K. Hua (2020) Pairwise Adjacency Matrix on Spatial Temporal Graph Convolution Network for Skeleton-Based Two-Person Interaction Recognition. In ICIP, pp. 2166–2170. Cited by: §1, §2.1, Table 5.
  • [64] X. Yang and Y. L. Tian (2012) EigenJoints-based Action Recognition Using Naive-Bayes-Nearest-Neighbor. In CVPRW, pp. 14–19. Cited by: §2.1.
  • [65] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras (2012) Two-person Interaction Detection Using Body-pose Features And Multiple Instance Learning. In CVPRW, pp. 28–35. Cited by: §1, §2.1, §5.1.
  • [66] M. Zhang and P. T. Fletcher (2013) Probabilistic Principal Geodesic Analysis. In NIPS, pp. 1178–1186. Cited by: §2.3.
  • [67] X. Zhang, Y. Wang, M. Gou, M. Sznaier, and O. Camps (2016) Efficient Temporal Sequence Comparison and Classification Using Gram Matrix Embeddings on a Riemannian Manifold. In CVPR, pp. 4498–4507. Cited by: §2.1.
  • [68] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie (2016) Co-occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks. In AAAI, pp. 3697–3703. Cited by: §2.1, Table 3.