Action recognition is an important area of computer vision that aims to understand human behaviour with a computational system. It has applications in industrial systems [tran20113, nozaki2012recognition], medical software [bahrepour2011sensor], and multimedia [zhang2012microsoft, wang2017two]. Because of its industrial and practical importance, interest in the field has grown rapidly in recent years, and numerous methods have been proposed. In general, various modalities, such as appearance [Feichtenhofer_2016_CVPR], depth [zhang2016rgb, liu2019learning], motion flow [wang2019hallucinating], and skeleton features [si2019attention], are used to recognize human actions. With the advances in deep learning, which learns useful representations automatically, many approaches employ convolutional neural networks (CNNs) [kim2017interpretable, ding2017investigation, carreira2017quo] and recurrent neural networks (RNNs) [liu2016spatio, zhang2018fusing, zhang2017view] to learn spatio-temporal information and recognize human actions. These CNN- and RNN-based approaches, which use RGB images and motion flows (e.g., optical flow), outperform earlier methods based on hand-crafted features [wang2013action, xia2012view]. Their drawback is that the learnt representations may not focus on the actions, since entire video frames are used to learn the representations [shi2019two, fernando2015modeling]. Skeleton features provide quantized information about people's joints and bones. Compared with RGB images and motion flows, skeleton features provide more compact and useful information in dynamic circumstances and against complicated backgrounds [vemulapalli2014human, fernando2015modeling, du2015hierarchical, ke2017new].
Early deep-learning approaches using skeleton features manually arrange skeleton data as a sequence of joint-coordinate vectors [du2015hierarchical, shahroudy2016ntu, liu2016spatio, song2017end, zhang2017view] or as a pseudo-image [ke2017new, kim2017interpretable, liu2017enhanced], and feed the data to RNNs or CNNs to infer the corresponding action classes. However, these approaches cannot represent the dependency between correlated joints [shi2019two]. Intuitively, skeleton features can be represented as a graph, since the components of the two structures correspond to each other: the joints and bones of the skeleton can be defined as the vertices and edges of the graph. Recently, Graph Convolutional Networks (GCNs), which generalize convolutional neural networks to graph-structured data, have achieved considerable success in skeleton-based action recognition [shi2019two, si2019attention, li2019actional]. In graph convolution, nodes are filtered in one of two ways: spatial or spectral. The spectral approach filters nodes based on the Laplacian matrix and its eigenvectors, while the spatial approach filters each node together with its local neighbourhood. ST-GCN [yan2018spatial] was the first work to apply spatial GCNs to the skeleton model and showed impressive improvements. However, the spatial graph in ST-GCN is predefined and relies only on the physical structure of the human body. This makes it hard to capture the relationship between closely related joints, such as the two hands in hand-related actions. To tackle this limitation, many methods [shi2019two, shi2019skeleton, song2019richly, li2019actional, si2019attention] have been proposed that build an adaptive graph to pay dynamic attention to each joint depending on the action being performed.
Unfortunately, all these approaches assume that complete skeleton features are provided. In a real-world system, it is almost impossible to guarantee that perfect skeleton samples can be extracted. Song et al. [song2019richly] proposed a GCN-based method that can deal with 'incomplete skeletons', defined as spatially occluded or temporally missing skeleton features. Even though recent studies on pose estimation [cao2018openpose, wang2019densefusion] and on constructing skeleton features [ding2017investigation, ke2017new, liu2017enhanced] have shown precise performance that is invariant to scene conditions, the extracted skeleton may still contain inaccurate information. The RA-GCN model of Song et al. [song2019richly] learns distinctive features of currently unactivated joints (missed joints) in multiple streams by utilizing class activation maps (CAM), but it remains problematic. To improve the performance of action recognition using skeleton features, it is necessary to address how a model processes noisy skeleton samples.
We present Predictively encoded Graph Convolutional Networks (PeGCNs), which learn noise-robust representations for skeleton-based action recognition using GCNs. The key insight of our model is to learn such representations by predicting the perfect sample from the noisy sample in latent space via an autoregressive model. We use a probabilistic contrastive loss to capture the information most useful for predicting a perfect sample. To demonstrate the effectiveness of PeGCNs on skeleton-based action recognition with noisy samples, we conducted various experiments on the NTU-RGB+D [shahroudy2016ntu] and Kinetics-skeleton [yan2018spatial] datasets. The experimental results show that PeGCNs provide noise-robust action recognition performance using skeleton features and surpass existing methods.
The key contributions of our work can be summarized as follows. First, we propose a novel method for noise-robust skeleton-based action recognition, called Predictively Encoded Graph Convolutional Network (PeGCN), which performs better than existing state-of-the-art methods on both general skeleton-based action recognition and recognition with noisy skeleton samples. Second, the predictive encoding loss on the latent space captures representations useful for predicting complete skeleton features from noisy skeletons and improves action recognition performance on noisy samples. In addition to these contributions, we provide comprehensive experiments on skeleton-based action recognition with noisy samples, including various ablation studies and comparisons with existing GCN-based methods.
2 Skeleton-based action recognition using deep learning
The recent success of deep-learning techniques has had a significant impact on research into human action recognition. To model the spatio-temporal features of actions, many works [liu2016spatio, si2018skeleton, lee2017ensemble, zhang2017view, zhu2016co, shahroudy2016ntu] extract appearance information with Convolutional Neural Networks (CNNs) and temporal information with Recurrent Neural Networks (RNNs). TS-LSTM [lee2017ensemble] uses multiple temporal windows to handle short-, mid-, and long-range actions dynamically. Zhang et al. [zhang2017view] proposed a view-adaptive action model (VA-LSTM) that is robust to viewpoint changes. However, CNN/LSTM-based methods usually represent the skeleton data as a sequence of vectors, which cannot adequately express the dependencies between related joints. The skeleton model can instead be seen as a graph in which joints and bones correspond to the vertices and edges, respectively.
Recently, ST-GCN [yan2018spatial] successfully adopted graph convolutional networks (GCNs), which handle graphs of arbitrary form, and it was the first method to apply GCNs to skeleton-based action recognition. After Yan et al. [yan2018spatial] proposed ST-GCN, many GCN-based works followed. GCNs follow two main approaches: the spectral approach [li2018spatio] and the spatial approach [shi2019two, shi2019skeleton, li2019actional, si2019attention]. The spectral approach first performs an eigendecomposition of the graph Laplacian matrix to obtain its eigenvalues and eigenvectors. With these, graph convolution is performed on sub-graphs via the graph Fourier transform. In this way, no partitioning into locally connected nodes is required. The spatial approach, on the other hand, performs graph convolution directly on each graph node and its neighbouring nodes. This approach is widely adopted in action recognition (e.g., ST-GCN) since it has a lower computational cost than the spectral approach.
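To make the spatial approach concrete, the following is a minimal Python sketch of one spatial graph convolution step on a toy skeleton. It is an illustration under stated assumptions, not the ST-GCN implementation: the five-joint chain, the symmetric normalization of the adjacency matrix with self-loops, and all function names are hypothetical choices.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph (assumed normalization)."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0                      # self-loop so a joint keeps its own feature
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def spatial_graph_conv(features, adj_norm, weight):
    """Aggregate each node with its neighbours, then apply a shared linear map."""
    aggregated = matmul(adj_norm, features)   # shape (V, C_in)
    return matmul(aggregated, weight)         # shape (V, C_out)

# Toy skeleton: 5 joints connected in a chain (hypothetical example data).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj = normalized_adjacency(5, edges)
features = [[float(v), 1.0] for v in range(5)]        # 2 input channels per joint
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]           # maps 2 channels to 3
out = spatial_graph_conv(features, adj, weight)
```

Because the aggregation only touches a node's direct neighbours, no eigendecomposition of the Laplacian is needed, which is the cost advantage mentioned above.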
The main drawback of ST-GCN is that its spatial graph is predefined, relying only on the physical structure of the human body, and is fixed across all GCN layers. This makes it hard to capture not only the relationships between closely related joints, such as the two hands in hand-related actions, but also the dynamics of each action. To tackle these limitations, many methods [shi2019two, shi2019skeleton, song2019richly, li2019actional, si2019attention] have been proposed that build an adaptive graph to attend dynamically to each joint depending on the action being performed. The adaptive graph is a trainable mask that learns relationships between arbitrary joints, which increases both flexibility and generality in constructing the graph. Shi et al. [shi2019two] proposed the 2s-AGCN model, which has two adaptive graphs: 1) a global graph and 2) a local graph. Both are trained and updated jointly with the CNNs in an end-to-end manner. The global graph learns patterns common to all samples, while the local graph learns patterns unique to each individual sample. Li et al. [li2019actional] proposed actional links (A-links) to learn action-specific dependencies and structural links (S-links) for higher-order relationships between joints.
While most works use undirected graphs, Shi et al. [shi2019skeleton] proposed a directed-graph-based model (DGNN) in which the direction of the graph plays an important role in graph convolution when updating the features of edges and vertices. Si et al. [si2019attention] combine LSTM with GCN (AGC-LSTM) to learn spatio-temporal representations from sequential skeletons, whereas most GCN-based action recognition models acquire temporal information with 1D convolution along the temporal axis. Spatial GCNs usually divide a graph into multiple sub-graphs with the distance partitioning or spatial configuration partitioning proposed in [yan2018spatial]. In contrast to these common partitioning strategies, Thakkar et al. [thakkar2018part] proposed a part-based GCN (PB-GCN) that learns relationships between five body parts.
3 Predictively Encoded Graph Convolutional Networks
3.1 Motivation and Intuition
To develop a precise action recognition method, it is important to learn a global representation that captures every detail of a given video clip over its entire duration. To learn a suitable global representation, a model needs strong generalization ability that is robust to diverse types of noise. Variation of skeleton features due to geometric conditions, such as the viewpoint of the camera or of the acting subject, can be regarded as one kind of noise in skeleton features. Missing skeleton features (a.k.a. incomplete skeleton features [song2019richly]; see Fig. 1) caused by spatial or temporal occlusions are another kind of noisy skeleton features. These noise patterns are inherently unpredictable. It is therefore intractable to model noise explicitly in a data-driven approach.
Deep learning is well known as an effective way to improve the generalization performance of a model in various visual recognition studies [krizhevsky2012imagenet, du2015hierarchical, badrinarayanan2017segnet, girshick2015fast]. A GCN unifies a graph structure with deep learning, so it shares this advantage of improved generalization. Accordingly, the dominant approach to training GCN-based skeleton action recognition methods is to first extract information from skeleton samples using GCNs and then compute a unimodal loss, e.g., cross-entropy [yan2018spatial, shi2019two, shi2019skeleton, song2019richly, li2019actional, si2019attention]. This can be regarded as direct end-to-end learning of a model between skeleton samples and the corresponding action classes. However, this approach, which directly derives a mapping from a complete or an incomplete sample to the class label, is computationally intensive and wastes the representation capacity of the model. For example, mapping directly between input samples and class labels can be thought of as using every detail of the input samples all the time, whether it is necessary or not. Slight noise, which can be alleviated during generalization by a non-linear network structure, does not need to be considered seriously. As a result, directly deriving such a mapping may not be suitable for obtaining the optimal global representation.
The key insight of PeGCN for noise-robust skeleton-based action recognition is to learn representations that encode the underlying information shared between the complete sample and the noisy sample by predicting missing information in the latent space. This idea is inspired by predictive coding [1055126, atal1970adaptive, oord2018representation], one of the oldest techniques in signal processing for data compression, which has more recently been applied to unsupervised learning of word representations [mikolov2013efficient] by predicting neighbouring words. Working in the latent space has the following advantages. First, since action recognition processes relatively longer samples than other tasks such as event detection [yu2018joint, 8580568] or change detection [hussain2013change], action recognition models need to infer more global structure. When inferring global structure, high-level information, i.e., the latent space, is more suitable than low-level information. Second, global noise in the latent representation is likely to be serious noise that affects recognition performance more severely than local noise, which can be reduced by the non-linear weighted kernel structures of deep learning.
When predicting proper information from noisy skeleton features, we first map the normal skeleton features and the noisy skeleton features into compact distributed vector representations (a.k.a. latent features) via a non-linear mapping function, and train the model so that it maximally preserves the mutual information between the two latent features (Eq. 1). By maximizing the mutual information between the two encoded representations (which is bounded by the MI between the input signals), we extract the underlying latent variable that is robust to global noise.
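For reference, the mutual information referred to as Eq. 1 follows the standard definition. Writing $z$ and $\hat{z}$ for the latent features of the normal and noisy skeleton samples (symbol names assumed here for illustration), it can be written as:

```latex
I(z; \hat{z}) = \sum_{z, \hat{z}} p(z, \hat{z}) \, \log \frac{p(z \mid \hat{z})}{p(z)}
```

This quantity is zero when the two latent features are independent and grows as one becomes more predictable from the other.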
3.2 Structural details
Fig. 2 illustrates the PeGCN pipeline for training and inference. PeGCN consists of a GCN module and an autoregressive module. The GCN module encodes skeleton samples into a latent space; the input, and hence its latent representation, is of one of two types: normal or noisy. The autoregressive module summarizes the latent representation and produces a context latent representation, which is likewise either the normal or the noisy one depending on the corresponding input skeletons.
In the training step, the normal skeleton samples and the corresponding noisy skeleton samples are both provided. First, the GCN module produces a latent representation for each of them. Next, the autoregressive module extracts the context latent representation from the latent representation of the noisy sample only. As argued in the previous section, we do not train the model by deriving a direct mapping. Instead, PeGCN is trained to maximize the mutual information (Eq. 1) between the latent representations of the normal and noisy skeleton samples, by modelling a density ratio that preserves this mutual information.
By using the density ratio together with the autoregressive module, we relieve the model from modelling the high-dimensional distributions of the samples. Although we cannot evaluate these distributions directly, we can use samples from them, which allows us to use techniques such as Noise-Contrastive Estimation [gutmann2010noise, mnih2012fast, jozefowicz2016exploring] and Importance Sampling [bengio2008adaptive]. The output of the autoregressive module can be used if extra context from the representation is useful. One such example is speech recognition, where the receptive field of the latent representation may not contain enough information to capture phonetic content. In other cases, where no additional context is required, the latent representation itself might be better.
The noisy skeleton features are generated by adding noise to randomly picked joints in the original skeleton samples. The noise is generated within the bounding box computed from the minimum and maximum values of the x, y, and z coordinates of the skeleton samples (Fig. 3(a)). When generating the noisy samples in the training and test steps, we set the noise level, a parameter that decides how many joints are corrupted. Noisy samples generated at different noise levels are shown in Fig. 3(b). The noisy skeleton samples considered in this paper differ from the spatially or temporally occluded skeleton samples considered by Song et al. [song2019richly] (see Fig. 1). In real scenarios, the missed joints in occluded skeleton samples can be defined as the set of joints whose likelihoods or confidences are lower than a pre-defined threshold. However, noise is inherently unpredictable, so that assumption may not be practical.
The backbone network for our GCN module is the GCN part of Js-AGCN [shi2019two], which is composed of adaptive graph convolutional layers that allow the topology of the graph to be optimized together with the other parameters of the network in an end-to-end manner. The adaptive graph convolutional layer combines three matrices: the original normalized adjacency matrix of the GCN, a trainable matrix for global attention, and a data-dependent graph that learns a unique graph for each sample. We employ the GCN of 2s-AGCN without the fully connected network that follows it.
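As a hedged reconstruction, the adaptive graph convolutional layer of 2s-AGCN [shi2019two], to which the three matrices described above belong, is commonly written as below. The symbols are assumptions following that paper's notation: $f_{\mathrm{in}}$ and $f_{\mathrm{out}}$ are the input and output feature maps, $K_v$ is the number of spatial partitions, $W_k$ is a learnable weight, $A_k$ is the normalized adjacency matrix, $B_k$ is the trainable global matrix, and $C_k$ is the data-dependent graph.

```latex
f_{\mathrm{out}} = \sum_{k=1}^{K_v} W_k \, f_{\mathrm{in}} \, (A_k + B_k + C_k)
```

Setting $B_k$ and $C_k$ to zero recovers a convolution over the fixed, physically defined skeleton graph.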
We use an RNN with GRUs [chung2014empirical] for the autoregressive module. It can easily be replaced by a linear transformation or another non-linear network. The dimensionalities of the GCN module and the autoregressive module of PeGCNs are detailed in Appendix A.1. Note that any type of GCN model and autoregressive model can be used in the proposed method. More recent advances in GCNs and autoregressive modelling could probably help improve the results further.
3.3 Training and inference
The GCN and autoregressive modules are trained jointly to optimize a loss that maximizes the mutual information between the latent representations of the normal and noisy skeleton features; we call it the predictive encoding loss. It is computed over a given set of normal skeleton samples and the corresponding noisy skeleton samples. Optimizing this loss results in estimating the density ratio introduced in Section 3.2, as demonstrated theoretically and experimentally by Oord et al. [oord2018representation].
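The following Python sketch illustrates a contrastive loss of the kind described above, in the style of the InfoNCE loss of Oord et al. [oord2018representation]. It is a sketch under stated assumptions rather than the PeGCN implementation: the dot-product score, the use of the other samples in the batch as negatives, and all names (`info_nce_loss`, `contexts_noisy`, `latents_normal`) are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(contexts_noisy, latents_normal):
    """Mean InfoNCE-style loss over a batch.

    contexts_noisy[i]: context vector computed from the i-th noisy sample.
    latents_normal[i]: latent vector of the matching normal sample.
    The positive pair is (i, i); the other latents in the batch act as negatives.
    """
    n = len(contexts_noisy)
    total = 0.0
    for i in range(n):
        scores = [dot(contexts_noisy[i], latents_normal[j]) for j in range(n)]
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        total += -(scores[i] - log_denominator)
    return total / n

# Hypothetical toy batch: contexts aligned with their own latents vs. swapped.
latents = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss([[4.0, 0.0], [0.0, 4.0]], latents)
shuffled = info_nce_loss([[0.0, 4.0], [4.0, 0.0]], latents)
```

The loss is small when each noisy context scores its own normal latent above the others, and large when the pairing is wrong, which is the behaviour the predictive encoding loss relies on.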
Action recognition must identify the action class of a given skeleton sample. The predictive encoding loss alone cannot achieve this goal, since it focuses only on maximizing the mutual information between the two latent representations. Therefore, as in other studies [shi2019two, song2019richly, shi2019multi], the cross-entropy loss is also used. It is computed over the action classes from the given annotation of an action sample and the output of the fully connected network used for classification at inference.
Consequently, to train the noise-robust skeleton-based action recognition model, the total loss function is defined simply as the sum of the cross-entropy loss and the proposed predictive encoding loss, with a balancing weight. In all our experiments, the balancing weight is set to 0.1, which gave the best performance.
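Under the description above, the two losses and their combination can be sketched as follows. The symbol names, the assumption that the classifier output $\tilde{y}$ is normalized to a probability distribution over the $C$ action classes, and the placement of the weight $\lambda$ on the predictive encoding term are illustrative assumptions, since the text only states that a balancing weight is used.

```latex
\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{C} y_i \log \tilde{y}_i,
\qquad
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda \, \mathcal{L}_{\mathrm{pred}},
\qquad
\lambda = 0.1
```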
Action recognition using PeGCN is straightforward. In the test step, the GCN module encodes an input skeleton sample into the latent space, and the autoregressive module summarizes the latent feature and generates the context latent representation. This context representation is used as the input of a fully connected network for action recognition (Fig. 2).
4.1 Experimental setting
To evaluate the action recognition performance of PeGCN and other methods on noisy skeleton samples, we use the NTU-RGB+D dataset [shahroudy2016ntu], one of the largest datasets for skeleton-based action recognition, and the Kinetics-skeleton (a.k.a. Kinetics) dataset, generated from the Kinetics dataset [kay2017kinetics] containing 34,000 video clips. Two experimental protocols, 1) Cross-view (CV) and 2) Cross-subject (CS), are applied in the experiments on the NTU-RGB+D dataset. Detailed descriptions of the two datasets are given in Appendix A.2.
The common hyperparameters for training PeGCNs are set as follows. The numbers of epochs are 50 and 65 for the NTU-RGB+D and Kinetics-skeleton datasets, respectively. Since our computational resources are limited, the batch size is reduced to 32, half of the original batch size of our backbone network [shi2019two], which may negatively affect the action recognition performance of PeGCNs. Stochastic gradient descent with weight decay is used for optimization. The source code of PeGCN is publicly available at https://github.com/andreYoo/PeGCNs.git. It includes the feed function that generates the noisy skeleton samples. The experiments are divided into two parts: an ablation study and a comparison with existing state-of-the-art methods.
4.2 Ablation study
Experimental protocol. We analysed how the performance of PeGCN depends on its hyperparameter settings. The hyperparameters that significantly affect the action recognition performance of PeGCNs are the noise level and the composition of the loss function. The analysis proceeds as follows. First, we construct two kinds of PeGCN model, one trained with the cross-entropy loss only and one trained with the total loss, and train each with noise levels 1, 3, and 5. The other parameters are set exactly as described in the previous section. Next, we evaluate these models at noise levels from 0 to 5. We observe the trends of the cross-entropy losses and the predictive encoding losses of these models and compare their action recognition accuracies. For efficiency, the ablation study is conducted only with the CV protocol of the NTU-RGB+D dataset.
Experimental results. Table 1 shows the action recognition accuracy for each noise level and each loss-function setting used in training. The best accuracy is achieved by the PeGCN trained with the total loss at noise level 5: it reaches 93.21 accuracy at test noise level 1, 92.24 at noise level 5, and 89.39 at noise level 10. In contrast, the PeGCN trained with the cross-entropy loss only at noise level 5 reaches 92.87 at noise level 1 and 87.43 at noise level 5. These quantitative results show that, when models are trained at the same noise level, the model trained with the total loss function usually performs better, and that the performance of the PeGCNs trained with the cross-entropy loss only degrades much faster. Beyond the quantitative results, the trend of each loss also shows the effectiveness of the predictive encoding loss for learning noise-robust representations. The cross-entropy loss curves in the ablation study (see Fig. 4(a)) show that the PeGCNs trained with the cross-entropy loss only usually converge faster than those trained with the total loss. This suggests that a PeGCN trained with the cross-entropy loss only converges more easily to a poor local optimum.
Interestingly, the trend of the predictive encoding loss (see Fig. 4(b)) shows that the curve of the PeGCN trained with noise level 5 is relatively lower than that of the PeGCNs trained with noise level 1 or 3. In this graphical comparison, the PeGCNs trained with the two loss settings are compared with each other. These trends can be interpreted in terms of the difficulty of learning from highly noisy samples. In the training step, a higher noise level provides more diversity in the training samples than a lower one. Consequently, the ablation study demonstrates that a higher noise level in training can improve action recognition performance at test time, although the improvement is not linearly proportional to the noise level. For efficiency, the subsequent comparisons with existing state-of-the-art methods are conducted only with the PeGCN trained with noise level 5.
4.3 Comparison with existing state-of-the-art methods
Experimental protocol. Our experiments cover both normal skeleton samples and noisy samples. We basically follow the general experimental protocols of the NTU-RGB+D [shahroudy2016ntu] and Kinetics-skeleton [kay2017kinetics] datasets. For both datasets, top-1 and top-5 accuracies are computed for the performance comparison. In the experiments on the NTU-RGB+D dataset, both the CV and CS protocols are applied. To evaluate action recognition performance in the noisy setting, we artificially generate noisy samples as follows. First, the number of joints to which noise values are assigned (a.k.a. the noise level) is set manually. Second, according to the noise level, the joints to be corrupted are picked at random. The selected joints stay the same for all frames of the video clip.
Once the joints to be corrupted have been decided, random values drawn from the bounding box are assigned to each selected joint in every frame (see Fig. 3(a)). To reduce the volatility of performance caused by the randomness of the corrupted joints, all experiments are repeated 10 times, and the mean and standard deviation of the results are used for the comparison. Examples of the artificially generated noisy skeleton samples are shown in Appendix A.3.
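The noise-generation procedure described above can be sketched in Python as follows. This is a minimal illustration, not the feed function from the released source code: the uniform sampling inside the bounding box, the computation of the bounding box over the whole clip, the data layout (a list of frames, each a list of `[x, y, z]` joints), and the name `add_joint_noise` are all assumptions.

```python
import random

def add_joint_noise(sample, noise_level, rng=None):
    """Return a noisy copy of `sample` and the indices of the corrupted joints.

    `noise_level` joints are picked once per clip and replaced in every frame
    by points drawn uniformly (an assumption) from the clip's bounding box.
    """
    rng = rng or random.Random()
    num_joints = len(sample[0])
    # Bounding box from the min/max of each coordinate over all frames and joints.
    lows = [min(joint[d] for frame in sample for joint in frame) for d in range(3)]
    highs = [max(joint[d] for frame in sample for joint in frame) for d in range(3)]
    # The corrupted joints stay the same for the whole clip.
    noisy_joints = set(rng.sample(range(num_joints), noise_level))
    noisy = []
    for frame in sample:
        new_frame = []
        for j, joint in enumerate(frame):
            if j in noisy_joints:
                # A fresh random position is drawn in every frame.
                new_frame.append([rng.uniform(lows[d], highs[d]) for d in range(3)])
            else:
                new_frame.append(list(joint))
        noisy.append(new_frame)
    return noisy, sorted(noisy_joints)

# Hypothetical clip: 4 frames of 25 joints with random coordinates in [0, 1).
rng = random.Random(0)
clip = [[[rng.random(), rng.random(), rng.random()] for _ in range(25)] for _ in range(4)]
noisy_clip, picked = add_joint_noise(clip, noise_level=5, rng=rng)
```

Repeating the evaluation with different seeds and summarizing the accuracies with `statistics.mean` and `statistics.stdev` would mirror the 10-run protocol described above.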
We mainly compare PeGCN with recently proposed state-of-the-art methods. For efficient experiments and a fair comparison, methods that were proposed before 2018 or whose performance is more than 5% lower than ours in the normal-skeleton evaluation (e.g., [fernando2015modeling, du2015hierarchical, shahroudy2016ntu, liu2016spatio, zhang2017view]) are excluded from the comparison (see Table 2). In particular, in the experiments using noisy skeleton samples, methods whose source code was not released by the paper's authors are excluded [shi2019skeleton, peng2019learning]. Even when source code exists, some methods are excluded according to the following criteria. First, the source code was released by someone other than the authors [shi2019skeleton]. Second, the paper has not yet been officially published in a journal or at a conference [peng2019learning]. Third, other people have reported that they cannot reproduce the performance reported in the paper with the source code [shi2019skeleton]. Details of the source code and the pre-trained weights for each model are given in Appendix A.4.
Experiment with normal skeletons. We first compare PeGCN with existing state-of-the-art methods on normal skeleton samples. For consistency, we tested several methods ourselves using publicly available source code [yan2018spatial, song2019richly, thakkar2018part, shi2019two]. Table 2 contains the top-1 accuracies on the CS and CV protocols of the NTU-RGB+D dataset and the top-1 and top-5 accuracies on the Kinetics dataset. In the experiments, PeGCN achieves 85.6 and 93.4 accuracy on the CS and CV protocols of the NTU-RGB+D dataset, respectively, and 34.8 and 57.2 top-1 and top-5 accuracy on the Kinetics-skeleton dataset. The state-of-the-art performance is achieved by MS-AAGCN [shi2019multi], with 90.0 for the CS protocol and 96.2 for the CV protocol. The second-highest performance is achieved by DGNN [shi2019skeleton], which records 89.9 and 96.1 for the CS and CV protocols, respectively. On the Kinetics-skeleton dataset, MS-AAGCN [shi2019multi] scores 37.8 for top-1 and 61.0 for top-5.
|Method||Approach||NTU-RGB+D CS||NTU-RGB+D CV||Kinetics top-1||Kinetics top-5|
|Synthesized CNN [liu2017enhanced]||CNN||80.0||87.2||-||-|
|3scale ResNet152 [li2017skeleton]||CNN||85.0||92.3||-||-|
|2s RA-GCN [song2019richly]||GCN||85.8||93.0||-||-|
|3s RA-GCN [song2019richly]||GCN||85.9||93.5||-||-|
|Js-AGCN (Backbone) [shi2019two]||GCN||85.4||93.1||34.4||57.0|
Compared with the state of the art, PeGCN performs better than or comparably to several methods. Js-AGCN [shi2019two], the backbone network of PeGCN, achieves 85.4 and 93.1 accuracy on the CS and CV protocols of the NTU-RGB+D dataset, respectively. These figures are slightly lower than ours: PeGCN achieves 85.6 and 93.4 accuracy on the two protocols.
Nevertheless, the performance of PeGCN is lower than that of a few methods, such as MS-AAGCN [shi2019multi], DGNN [shi2019skeleton], GCN-NAS [peng2019learning], and AS-GCN [li2019actional]. The performance gap between these methods and PeGCN can be interpreted as follows. MS-AAGCN [shi2019multi] has additional attention modules (e.g., spatial, temporal, and channel-wise attention) and exploits four modalities, comprising joint and bone information and the motion information of each. In training, its batch size is twice ours, and the adaptive graphs are fixed for the first 5 epochs for better learning, as explained in DGNN [shi2019skeleton]. MS-AAGCN achieves higher top-1 accuracy than ours by 5.5%, 2.8%, and 4.0% on CS, CV, and Kinetics, respectively. Although DGNN [shi2019skeleton] has the same batch size of 32, it is trained for longer, 120 epochs, whereas we train for 50 epochs on NTU-RGB+D and 65 epochs on Kinetics-skeleton. Besides, DGNN utilizes both joint and bone information with a directed acyclic graph. This leads to top-1 accuracy improvements of 5.4%, 2.7%, and 3.1% on CS, CV, and Kinetics, respectively. Other methods (such as GCN-NAS [peng2019learning] and AS-GCN [li2019actional]) are trained for more epochs than ours and decay the learning rate more frequently.
Experiment with noisy skeletons. The experimental results on skeleton-based action recognition with noisy samples clearly demonstrate the effectiveness of PeGCNs in recognizing actions from noisy skeleton samples. In contrast to other approaches, whose performance degrades rapidly as the noise level increases, PeGCN shows noise-robust action recognition performance. As shown in Table 3, PeGCN achieves 84.21 and 82.20 accuracy in the experiments with noise level 1 and noise level 5, respectively. The gap between these two figures is less than 3%, significantly smaller than for the other methods. Shi et al. [shi2019two], which achieves the state-of-the-art performance in the experiments with normal skeleton samples (see Table 2), produces 84.31 and 51.27 accuracy in the noise-1 and noise-5 experiments, a gap larger than 30. On the Kinetics-skeleton dataset, Js-AGCN [shi2019two] achieves high accuracy on normal samples, 35.1 and 57.1 for top-1 and top-5, respectively. However, its performance drops when noisy samples are given: it records 23.06 in the noise-1 and 3.81 in the noise-5 experiments, a gap larger than 19. In the experiments with noise level 10 on the CS protocol of NTU-RGB+D, the accuracies of the other methods are all lower than 25%, while PeGCN obtains 77.92%.
|3s RA-GCN [song2019richly]||85.87||98.10||72.02(0.26)||89.89(0.20)||45.12(0.29)||68.79(0.33)||25.59(0.25)||48.71(0.42)||6.11(0.24)||20.55(0.31)|
|2s RA-GCN [song2019richly]||85.83||98.19||71.97(0.18)||91.00(0.20)||44.41(0.23)||70.81(0.34)||25.35(0.33)||50.54(0.23)||6.41(0.23)||21.10(0.19)|
|3s RA-GCN [song2019richly]||93.51||99.30||79.77(0.18)||92.74(0.18)||53.59(0.32)||76.41(0.29)||32.71(0.19)||59.08(0.37)||8.88(0.24)||29.53(0.24)|
|2s RA-GCN [song2019richly]||92.97||99.28||79.58(0.16)||92.72(0.11)||53.34(0.36)||75.09(0.24)||32.46(0.24)||55.84(0.32)||8.59(0.11)||24.98(0.20)|
The experimental results on the CV protocol of the NTU-RGB+D dataset likewise suggest that PeGCN provides more noise-robust skeleton-based action recognition and surpasses existing state-of-the-art methods. As shown in Table 4, PeGCN achieves accuracies of 99.21 and 89.39 in the experiments with noise levels 1 and 10, respectively, whereas no other method exceeds 90% accuracy even at noise level 1. Js-AAGCN [shi2019multi] produces an accuracy of 87.87 at noise level 1, but its recognition performance degrades steeply as the noise level increases. At noise level 10, the accuracy of Js-AAGCN is 18.99, which is lower than the 23.63 of ST-GCN [yan2018spatial], which obtains an accuracy of 83.14 at noise level 1.
4.4 Analysis and discussion
The overall results indicate that PeGCN provides skeleton-based action recognition that is more robust to noisy samples than existing state-of-the-art methods. The accuracies of PeGCN at every noise level on the NTU-RGB+D and Kinetics datasets are higher than those of the compared methods. The performance gap between PeGCN and the other methods widens as the noise level increases. In the experiment with noise level 10, the accuracies of almost all methods except PeGCN degrade by more than 90% relative to their results on normal samples. In addition to the accuracies, the standard deviations also indicate the advantage of PeGCN for noise-robust skeleton-based action recognition. In the experiments on the CV protocol of the NTU-RGB+D dataset, for noise levels 1 to 5, the other methods usually produce standard deviations above 0.2, whereas the standard deviation of the proposed method ranges from 0.02 to 0.11.
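The "degraded over 90%" statement is a relative measure: the accuracy lost under noise divided by the accuracy on normal samples. The sketch below illustrates this with the 3s RA-GCN rows of the noisy-skeleton tables. It assumes that the first value in each row is the top-1 accuracy on normal samples and the ninth value is the top-1 accuracy at noise level 10; the column headers are not available here, so this reading is an assumption, and `relative_degradation` is a hypothetical helper name.

```python
def relative_degradation(normal_acc, noisy_acc):
    """Fraction of the normal-sample accuracy that is lost under noise."""
    if normal_acc <= 0:
        raise ValueError("normal_acc must be positive")
    return (normal_acc - noisy_acc) / normal_acc


# 3s RA-GCN [song2019richly] top-1 values read from the table rows.
# ASSUMPTION: first column = normal samples, ninth column = noise level 10.
ra_gcn_first_table = relative_degradation(85.87, 6.11)
ra_gcn_second_table = relative_degradation(93.51, 8.88)

print(f"first table:  {ra_gcn_first_table:.1%}")
print(f"second table: {ra_gcn_second_table:.1%}")
```

If the column reading is right, both values exceed 0.9, which is consistent with the claim in the text.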
Interestingly, RA-GCN [song2019richly], which was proposed for recognizing actions from incomplete skeletons, achieves lower accuracies (Table 3 and Table 4) than the other methods [shi2019two, yan2018spatial, shi2019skeleton], which do not consider noisy skeletons. This may be caused by a difference in the definition of 'noise' on skeleton features. As shown in Fig. 1, Song et al. [song2019richly] assign 0 to the noised joints, which they define as joints missed because of spatial or temporal occlusion. In our experiments, however, each noised joint is assigned an arbitrary value drawn randomly within the bounding box (see Fig. 3). Overall, the experimental results demonstrate the effectiveness of PeGCN for skeleton-based action recognition with noisy skeleton samples.
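The two noise definitions contrasted above can be sketched as follows. This is an illustration under stated assumptions, not the noising code used in the experiments: the function names, the toy skeleton, the choice of two noisy joints, and the use of the skeleton's own axis-aligned bounding box with a uniform distribution are all assumptions made for the example.

```python
import random


def zero_out_joints(skeleton, joint_ids):
    """Occlusion-style noise: selected joints are replaced by zeros."""
    noisy = [tuple(joint) for joint in skeleton]
    for j in joint_ids:
        noisy[j] = tuple(0.0 for _ in skeleton[j])
    return noisy


def randomize_joints_in_bbox(skeleton, joint_ids, rng):
    """Random noise: selected joints are resampled uniformly inside the
    axis-aligned bounding box of the original skeleton (an assumption)."""
    dims = len(skeleton[0])
    lows = [min(joint[d] for joint in skeleton) for d in range(dims)]
    highs = [max(joint[d] for joint in skeleton) for d in range(dims)]
    noisy = [tuple(joint) for joint in skeleton]
    for j in joint_ids:
        noisy[j] = tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
    return noisy


rng = random.Random(0)  # fixed seed so the example is repeatable
# Toy skeleton with four (X, Y, Z) joints; values are made up.
skeleton = [(0.0, 1.7, 2.0), (0.1, 1.4, 2.1), (-0.3, 1.0, 2.0), (0.3, 0.2, 1.9)]
noisy_ids = rng.sample(range(len(skeleton)), k=2)  # hypothetical: 2 noisy joints

zeroed = zero_out_joints(skeleton, noisy_ids)
randomized = randomize_joints_in_bbox(skeleton, noisy_ids, rng)
```

With the zeroing scheme a model can learn to treat the origin as a "missing" marker, whereas randomly placed joints remain plausible coordinates, which may explain why a method designed for the former does not transfer to the latter.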
In this work, we have presented a noise-robust skeleton-based action recognition method based on graph convolutional networks with predictive encoding in latent space, called Predictively encoded Graph Convolutional Networks (PeGCNs). During training, PeGCN learns to improve its representation ability for noise-robust skeleton-based action recognition by predicting complete samples from noisy samples in latent space. PeGCN increases the flexibility of GCNs and is more suitable for action recognition tasks using skeleton features. PeGCN is evaluated on two large-scale action recognition datasets, NTU-RGB+D and Kinetics, and it achieves state-of-the-art performance on both.
a.1 Dimensional details for the kernels on the GCN module and the autoregressive module
a.2 NTU-RGB+D dataset and Kinetics dataset
The NTU-RGB+D dataset [shahroudy2016ntu] is one of the largest datasets for skeleton-based action recognition. It contains around 56,000 samples in four modalities: depth maps, RGB videos, IR images, and skeleton sequences. The samples are captured by Microsoft Kinect v2 from three different angles (-45°, 0°, 45°) with 40 volunteers. In the skeleton sequences, the 3D spatial coordinates (X, Y, Z) of 25 joints are provided for each human action. The actions are performed by one or two performers and cover 60 indoor activities such as hand clapping or drinking water. Shahroudy et al. [shahroudy2016ntu] also provide two benchmark protocols: 1) cross-view and 2) cross-subject. In the cross-view protocol, samples are split into training and test sets according to camera angle; the two sets contain 37,920 and 18,960 samples, respectively. In the cross-subject protocol, samples are split into training and test sets according to subject: some subjects are assigned to the training set and the remaining subjects to the test set. The training and test subsets contain 40,320 and 16,560 samples, respectively. We follow these protocols and report the top-1 accuracy on both benchmarks.
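Both protocols amount to partitioning the samples by one metadata field, the camera for cross-view and the performer for cross-subject. A minimal sketch follows; the record layout, the field names `subject` and `camera`, and the ID assignments are hypothetical and chosen only for illustration, so the actual ID lists should be taken from the dataset release.

```python
def split_samples(samples, key, train_values):
    """Split samples into (train, test) by whether sample[key] is in train_values."""
    train = [s for s in samples if s[key] in train_values]
    test = [s for s in samples if s[key] not in train_values]
    return train, test


# Toy metadata records; field names and IDs are hypothetical.
samples = [
    {"name": "a", "subject": 1, "camera": 1},
    {"name": "b", "subject": 1, "camera": 2},
    {"name": "c", "subject": 2, "camera": 3},
    {"name": "d", "subject": 3, "camera": 1},
]

# Cross-subject holds out performers; cross-view holds out a camera.
cs_train, cs_test = split_samples(samples, "subject", train_values={1, 2})
cv_train, cv_test = split_samples(samples, "camera", train_values={2, 3})

# Sanity check on the split sizes quoted in the text: both protocols
# partition the same pool of samples.
total_cross_view = 37_920 + 18_960
total_cross_subject = 40_320 + 16_560
print(total_cross_view, total_cross_subject)
```

The two totals agree, which is consistent with the "around 56,000 samples" quoted for the dataset.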
The Kinetics-skeleton dataset is a large-scale skeleton action dataset generated from Kinetics [kay2017kinetics], which contains 34,000 video clips collected from YouTube to cover a wide variety of conditions (such as illumination changes and background colors); each video clip is labeled with one of 400 action classes. Before the skeletons are estimated, the resolution and frame rate of the video clips are converted. The skeletons are estimated with the publicly available OpenPose toolbox [cao2017realtime], which gives the 2D location and a 1D confidence score for each of 18 joints. When multiple people are in the scene, the two people with the highest average joint confidence are selected. The length of each skeleton sequence is fixed to 300 frames by repeating or sampling the sequence. Yan et al. [yan2018spatial] released this dataset (Kinetics-skeleton), which contains 240,000 samples in the training set and 20,000 samples in the validation set. We follow the evaluation protocol of [yan2018spatial] and report top-1 and top-5 recognition accuracies.
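The two preprocessing steps described above, keeping the two most confident people and fixing the sequence length to 300, can be sketched as follows. The helper names and data layout are hypothetical, and the exact repetition and sampling scheme (cyclic repetition for short clips, uniform index sampling for long ones) is an assumption; the text only states that sequences are repeated or sampled.

```python
def select_top_people(people, k=2):
    """Return the ids of the k people with the highest mean joint confidence.

    `people` maps a person id to a list of joint confidence scores.
    """
    def mean_confidence(item):
        scores = item[1]
        return sum(scores) / len(scores)

    ranked = sorted(people.items(), key=mean_confidence, reverse=True)
    return [person_id for person_id, _ in ranked[:k]]


def fix_length(frames, target_len=300):
    """Repeat a short sequence cyclically, or sample a long one uniformly,
    so that exactly target_len frames are returned."""
    n = len(frames)
    if n == 0:
        raise ValueError("cannot fix the length of an empty sequence")
    if n <= target_len:
        return [frames[i % n] for i in range(target_len)]
    return [frames[i * n // target_len] for i in range(target_len)]


# Toy confidences for three detected people (values are made up).
kept = select_top_people({"p0": [0.9, 0.8], "p1": [0.2, 0.1], "p2": [0.6, 0.7]})

short_clip = fix_length(list(range(120)))  # 120 frames, repeated up to 300
long_clip = fix_length(list(range(900)))   # 900 frames, sampled down to 300
```

Fixing the length this way lets every sample share one tensor shape, at the cost of duplicated frames in short clips and dropped frames in long ones.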
a.3 Examples of noise skeleton samples
a.4 Experiment: Source codes and pre-trained models
All models used in our experiments, including our PeGCN model, are publicly available on GitHub. The GitHub links are listed in Table 1. We also provide all weight files of the models via Google Drive: https://drive.google.com/open?id=1Q-S-JAJwURPH7cy9-h25Mo0p15w_ALsb. Each model has multiple weight files depending on three factors: 1) dataset, 2) evaluation protocol, and 3) data type (e.g., joint and bone). Details of each weight file are described in Table 2.
a.5 Extended comparison on skeleton-based action recognition performance using normal skeletons on NTU-RGB+D dataset
|Feature Enc [fernando2015modeling]||2015||Hand||-||-||-||-||14.9||25.8|
|Deep LSTM [shahroudy2016ntu]||2016||LSTM||60.7||-||67.3||-||16.4||35.3|
|Synthesized CNN [liu2017enhanced]||2017||CNN||80.0||-||87.2||-||-||-|
|3scale ResNet152 [li2017skeleton]||2017||CNN||85.0||-||92.3||-||-||-|
|2s RA-GCN [song2019richly]||2019||GCN||85.8(85.8)||98.2||93.0(93.0)||99.3||-||-|
|3s RA-GCN [song2019richly]||2019||GCN||85.9(85.9)||98.1||93.5(93.5)||99.3||-||-|
a.6 Additional comparison on skeleton-based action recognition performance using noisy skeletons on Kinetics-skeleton dataset