RA-GCN v1 for incomplete skeleton-based action recognition, accepted by ICIP2019
Current methods for skeleton-based human action recognition usually assume completely observed skeletons. In real scenarios, however, the captured skeletons are often incomplete and noisy, which degrades the performance of traditional models. To make action recognition models more robust to incomplete skeletons, we propose a multi-stream graph convolutional network (GCN) that explores sufficient discriminative features distributed over all skeleton joints. Each stream of the network is responsible only for learning features from currently unactivated joints, which are identified by the class activation maps (CAM) obtained from preceding streams, so the proposed method activates noticeably more joints than traditional methods. The proposed method is therefore termed richly activated GCN (RA-GCN), where the richly discovered features improve the robustness of the model. Compared with state-of-the-art methods, the RA-GCN achieves comparable performance on the NTU RGB+D dataset. Moreover, on a synthetic occlusion dataset, the RA-GCN significantly alleviates the performance deterioration.
Skeleton-based human action recognition methods have become increasingly important in many applications and have made great progress, owing to their adaptability to varying backgrounds, robustness to light intensity and lower computational cost. Skeleton data is composed of the 3D coordinates of multiple skeleton joints over space and time, which can be either collected by multimodal sensors such as Kinect or directly estimated from 2D images by pose estimation methods. Traditional methods usually deal with skeleton data in two ways. One way is to connect these joints into a whole vector, then model temporal information using RNN-based methods [1, 2, 3, 4, 5]. The other way is to treat or expand temporal sequences of joints into images, then utilize CNN-based methods to recognize actions [6, 7, 8, 9, 10]. However, the spatial structure information among skeleton joints is hard to utilize effectively with either RNN-based or CNN-based methods, even though researchers have proposed additional constraints and dedicated network structures to encode the spatial structure of skeleton joints. Recently, Yan et al. were the first to apply graph-based methods to skeleton-based action recognition, proposing the spatial temporal graph convolutional networks (ST-GCN) to extract features embedded in the spatial configuration and the temporal dynamics. There are also some action recognition methods using graph techniques, such as [12, 13, 14], but these are mainly based on RGB videos instead of skeleton data.
All of the above methods assume that complete skeleton joints can be well captured, and the incomplete case is not considered. However, it is often difficult to obtain a complete skeleton sequence in real scenarios. For example, pedestrians may be occluded by parked vehicles or other objects in the scene. When faced with incomplete skeletons, traditional methods suffer varying degrees of performance deterioration. Therefore, how to recognize actions from incomplete skeletons is a challenging problem.
Many researchers are exploring how to extract non-local features from all positions in the input feature maps. Inspired by this, we propose a multi-stream graph convolutional network (GCN) to explore sufficient discriminative features for robust action recognition. Here, class activation maps (CAM) are utilized to distinguish the discriminative skeleton joints activated by each stream. The activation maps obtained by preceding streams are accumulated as a mask matrix that informs the new stream which joints have already been activated. The new stream is then forced to explore discriminative features from unactivated joints. Therefore, the proposed method is called richly activated GCN (RA-GCN), where the richly discovered discriminative features improve the robustness of the model to incomplete skeletons. The experimental results on the NTU RGB+D dataset show that the RA-GCN achieves comparable performance to the state-of-the-art methods. Furthermore, for the case of incomplete skeletons, we construct a synthetic occlusion dataset, where the joints in the NTU dataset are partially occluded over both spatial and temporal dimensions. Some examples of the two types of occlusion are shown in Fig. 1. On the new dataset, the RA-GCN significantly alleviates the performance deterioration.
In order to enhance the robustness of action recognition models, we propose the RA-GCN to explore sufficient discriminative features from the training skeleton sequences. The overview of RA-GCN is presented in Fig. 2. Suppose that $V$ is the number of joints in one skeleton, $M$ is the number of skeletons in one frame and $T$ is the number of frames in one sequence. Then, the size of the input data $x$ is $C \times T \times V \times M$, where $C = 3$ denotes the 3D coordinates of each joint.
The proposed network consists of three main steps. Firstly, in the preprocessing module, to extract more informative features, the input data $x$ is transformed into $x'$, whose size is $3C \times T \times V \times M$. The preprocessing module is composed of two parts. The first part extracts motion features by computing the temporal difference $x_t = x[t+1] - x[t]$, where $x[t]$ means the input data of the $t$-th frame. The second part calculates the relative coordinates $x_r$ between all joints and the center joint (center trunk) in each frame. Then, $x'$ is obtained by concatenating $x$, $x_t$ and $x_r$.

Secondly, for each stream, the skeleton joints in $x'$ are filtered by the element-wise product with a mask matrix, which records the currently unactivated joints. These joints are distinguished by accumulating the activation maps obtained by the activation modules of preceding streams. Here, the mask matrix is initialized to an all-one matrix with the same size as $x'$. After the masking operation, the input data of each stream only contains unactivated joints, and subsequently passes through an ST-GCN network to obtain a feature representation based on partial skeleton joints.

Finally, the features of all streams are concatenated in the output module, and a softmax layer is used to obtain the final class of the input.
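The preprocessing and masking steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the `center_joint` index, and the zero motion assigned to the last frame are all hypothetical choices made for the example.

```python
import numpy as np

def preprocess(x, center_joint=1):
    """Sketch of the preprocessing module.

    x has shape (C, T, V, M): channels, frames, joints, skeletons.
    Returns an array of shape (3C, T, V, M).
    center_joint is a hypothetical index for the center trunk joint.
    """
    # Motion features: difference between consecutive frames. The last frame
    # has no successor, so its motion is left at zero (assumption).
    motion = np.zeros_like(x)
    motion[:, :-1] = x[:, 1:] - x[:, :-1]
    # Relative coordinates of every joint with respect to the center joint.
    relative = x - x[:, :, center_joint:center_joint + 1, :]
    # Concatenate raw, motion and relative features along the channel axis.
    return np.concatenate([x, motion, relative], axis=0)

def mask_stream_input(x_pre, mask):
    """Keep only unactivated joints: element-wise product with the mask."""
    return x_pre * mask

# Example: 3D coordinates, 10 frames, 25 joints, 2 skeletons.
x = np.random.default_rng(0).normal(size=(3, 10, 25, 2))
x_pre = preprocess(x)
first_stream_input = mask_stream_input(x_pre, np.ones_like(x_pre))
```

With an all-one mask, the first stream receives the full preprocessed sequence; later streams would receive a copy in which previously activated joints are zeroed.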
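The baseline model is the ST-GCN, which is composed of several spatial convolutional blocks and temporal convolutional blocks. Concretely, the spatial graph convolutional block can be implemented by the following formulation:

$$f_{out} = \sum_{d=0}^{D_{max}} W_d \, f_{in} \left( \Lambda_d^{-\frac{1}{2}} A_d \Lambda_d^{-\frac{1}{2}} \otimes M_d \right), \qquad \Lambda_d^{ii} = \sum_{j} A_d^{ij} + \alpha \tag{1}$$

where $D_{max}$ is the predefined maximum distance, $f_{in}$ and $f_{out}$ are the input and output feature maps respectively, $W_d$ is the convolution weight, $\otimes$ is the element-wise product, $A_d$ denotes the adjacency matrix for graphic distance $d$, $\Lambda_d$ is the normalized diagonal matrix, $A_d^{ij}$ denotes the element of the $i$-th row and $j$-th column of $A_d$, and $\alpha$ is set to a small value to avoid empty rows in $A_d$. For each adjacency matrix, we accompany it with a learnable matrix $M_d$, which expresses the importance of each edge.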
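The spatial block described here sums, over graph distances, the input features aggregated through a symmetrically normalized adjacency matrix that is scaled element-wise by a learnable edge-importance matrix. The NumPy sketch below illustrates that computation for a single skeleton. The per-distance channel weights, the function names and the default `alpha=1e-3` are assumptions made for illustration; the source only states that the value is small.

```python
import numpy as np

def normalize_adjacency(A, alpha=1e-3):
    """Symmetric normalization: Lambda^{-1/2} A Lambda^{-1/2}.

    Lambda is diagonal with Lambda[i, i] = sum_j A[i, j] + alpha, where the
    small alpha (hypothetical default) keeps empty rows from dividing by zero.
    """
    degree = A.sum(axis=1) + alpha
    inv_sqrt = 1.0 / np.sqrt(degree)
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

def spatial_graph_conv(f_in, adjacencies, weights, edge_importance, alpha=1e-3):
    """f_in: (C_in, T, V). One (V, V) adjacency, one (C_out, C_in) weight and
    one (V, V) edge-importance matrix per graph distance."""
    f_out = 0.0
    for A, W, M in zip(adjacencies, weights, edge_importance):
        A_hat = normalize_adjacency(A, alpha) * M           # element-wise edge weighting
        mixed = np.einsum("oc,ctv->otv", W, f_in)           # channel mixing (1x1 conv)
        f_out = f_out + np.einsum("otv,vw->otw", mixed, A_hat)  # aggregate over joints
    return f_out

# Example: a 3-joint chain with distances 0 (self) and 1 (direct neighbours).
A0 = np.eye(3)
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
f_in = np.ones((2, 4, 3))
out = spatial_graph_conv(f_in, [A0, A1], [np.eye(2)] * 2, [np.ones((3, 3))] * 2)
```

In a trained model the edge-importance matrices would be learned parameters; here they are fixed to ones so that the example reduces to plain normalized aggregation.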
After the spatial graph convolutional block, an $L \times 1$ Conv layer is used to extract temporal information from the feature map $f_{out}$, where $L$ is the temporal window size. Both the spatial and temporal convolutional blocks are followed by a BatchNorm layer and a ReLU layer, and each complete ST-GCN layer contains a residual connection. In addition, an adaptive dropout layer is added between the spatial and temporal convolutional blocks to avoid overfitting. More details can be found in the original ST-GCN paper.
The activation module in the RA-GCN is constructed to distinguish the activated joints of each stream, and then to guide the learning process of the new stream by accumulating the activation maps of preceding streams. This procedure is implemented mainly with the CAM technique. The original CAM technique localizes class-specific image regions in a single forward pass, and $M_c$ is defined as the activation map for class $c$, where each spatial point is

$$M_c(x, y) = \sum_{k} w_k^c \, f_k(x, y) \tag{2}$$

In this formulation, $f_k$ is the $k$-th channel of the feature map before the global average pooling operation, and $w_k^c$ is the weight of the $k$-th channel for class $c$. In this paper, we replace the coordinates $(x, y)$ in an image with the frame number and the joint number in a skeleton sequence, by which we are able to locate the activated joints. These joints can also be regarded as the attention joints of the corresponding streams. Here, the class $c$ is selected as the true class. Then, the mask matrix of stream $s$ is calculated as follows:

$$mask_s = \left( \prod_{i=1}^{s-1} mask_i \right) \otimes \left( 1 - M_c^{s-1} \right) \tag{3}$$

where $\prod$ denotes the element-wise product of all mask matrices before the $s$-th stream and $M_c^{s-1}$ is the activation map of stream $s-1$. Specially, the mask matrix of the first stream is an all-one matrix. Finally, the input of stream $s$ is obtained by

$$x_s = x' \otimes mask_s \tag{4}$$

where $x'$ is the skeleton data after preprocessing.
Eq. 4 illustrates that the input of each stream only consists of the joints which are not activated by previous streams. Thus the RA-GCN explores discriminative features from all joints sufficiently. Fig. 3 shows an example of the activated joints of all streams for the baseline model and for the RA-GCN with 2 streams and 3 streams, from which we can observe that the RA-GCN discovers significantly more activated joints than the baseline model. The code of RA-GCN is available at https://github.com/yfsong0709/RA-GCNv1
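The activation module can be illustrated with a small NumPy sketch: a class activation map is computed as a weighted sum of feature channels over the (frame, joint) grid, and joints it activates are removed from the mask passed to the next stream. The normalization and the `threshold` used to decide which joints count as activated are assumptions of this sketch, as are the function names; the released code may accumulate the maps differently.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features: (K, T, V) feature map before global average pooling.
    fc_weights: (num_classes, K) classifier weights.
    Returns the (T, V) activation map for the chosen class."""
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

def next_mask(accumulated_mask, cam, threshold=0.5):
    """Zero out joints activated by the latest stream.

    Binarizing the map with a relative threshold is an assumption made here
    so that the mask stays in {0, 1}.
    """
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    if peak > 0:
        activated = (cam / peak) > threshold
    else:
        activated = np.zeros(cam.shape, dtype=bool)
    return accumulated_mask * (1.0 - activated.astype(float))

# Example: 2 channels, 3 frames, 4 joints, 2 classes.
features = np.zeros((2, 3, 4))
features[0, 1, 2] = 1.0                      # channel 0 fires at frame 1, joint 2
fc_weights = np.array([[1.0, 0.0], [0.0, 1.0]])
mask_1 = np.ones((3, 4))                     # first stream sees every joint
cam_1 = class_activation_map(features, fc_weights, class_idx=0)
mask_2 = next_mask(mask_1, cam_1)            # second stream loses frame 1, joint 2
```

The resulting (T, V) mask would be broadcast over channels and skeletons before being multiplied with the preprocessed input of the next stream.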
In this section, we evaluate the performance of the RA-GCN on the large-scale NTU RGB+D dataset, which is currently the largest indoor action recognition dataset. This dataset contains 56880 video samples collected by Microsoft Kinect v2, and consists of 60 action classes performed by 40 subjects. The maximum frame number is set to 300 for simplicity. The authors of this dataset recommend two benchmarks: (1) cross-subject (CS), which splits the 40 subjects into training and evaluation groups and contains 40320 and 16560 samples for training and evaluation, respectively; (2) cross-view (CV), which uses cameras 2 and 3 for training and camera 1 for evaluation and contains 37920 and 18960 samples, respectively.
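As a quick consistency check on the sample counts quoted above, the following snippet verifies that both benchmark splits add up to the full dataset and prints the resulting train/evaluation proportions. The variable names are illustrative only.

```python
total_samples = 56880
splits = {
    "CS": (40320, 16560),  # cross-subject: (training, evaluation)
    "CV": (37920, 18960),  # cross-view: (training, evaluation)
}

fractions = {}
for name, (train, evaluation) in splits.items():
    # Each benchmark partitions the whole dataset.
    assert train + evaluation == total_samples
    fractions[name] = (train / total_samples, evaluation / total_samples)
    print(f"{name}: train {fractions[name][0]:.1%}, eval {fractions[name][1]:.1%}")
# CS: train 70.9%, eval 29.1%
# CV: train 66.7%, eval 33.3%
```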
Before training the RA-GCN, we pre-train an ST-GCN network on the transformed skeleton data to obtain the baseline model, using the same training setting as the original ST-GCN. We then initialize all streams of the RA-GCN with this baseline model and, finally, fine-tune the whole RA-GCN model. All of our experiments are run on two TITAN X GPUs.
|Method|Year|CS (%)|CV (%)|
|ST-GCN (baseline)|2018|81.5|88.3|
|*: 2s denotes two streams and 3s denotes three streams|
We compare the performance of the RA-GCN against previous state-of-the-art methods on the NTU RGB+D dataset. As shown in Table 1, our method achieves better performance than the other models on the CV benchmark, while it is only 1.6% below PB-GCN on the CS benchmark. Compared to the baseline, our method improves accuracy by 4.4% and 5.2%, respectively. The RA-GCN only achieves comparable performance to the state-of-the-art methods because it aims to discover more discriminative joints, whereas most actions can be recognized from only a few main joints. However, when these main joints are occluded, the performance of traditional methods deteriorates significantly.
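The accuracies implied by these margins can be derived from the baseline row of Table 1. The snippet below is only arithmetic on the numbers quoted in the text, and it assumes that "respectively" refers to the CS and CV benchmarks in that order.

```python
baseline = {"CS": 81.5, "CV": 88.3}   # ST-GCN baseline accuracy (%)
gain = {"CS": 4.4, "CV": 5.2}         # stated improvement; CS/CV order is assumed

# Implied RA-GCN accuracy on each benchmark.
ra_gcn = {k: round(baseline[k] + gain[k], 1) for k in baseline}

# PB-GCN is stated to be 1.6 points ahead on CS.
pb_gcn_cs = round(ra_gcn["CS"] + 1.6, 1)

print(ra_gcn, pb_gcn_cs)
```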
In Section 2.1, we introduce two hyperparameters for the baseline model: $D_{max}$ for the maximum distance and $L$ for the temporal window size. These two hyperparameters have a great impact on our model. We test six groups of parameters, and the experimental results are given in Table 2, which reports the setting that achieves the best accuracy on the CS benchmark. On the CV benchmark, $D_{max}$ and $L$ are optimally set to 3 and 9, respectively.
|temporal occlusion|occluded frame number|
|*: the difference between 3s RA-GCN and the baseline model|
To validate the robustness of our method to incomplete skeletons, we construct a synthetic occlusion dataset based on the NTU RGB+D dataset, where some joints are selected to be occluded (set to 0) over both spatial and temporal dimensions. For spatial occlusion, we train the models on complete skeletons, then evaluate them on skeletons without part 1, 2, 3, 4 or 5, which denote the left arm, right arm, two hands, two legs and trunk, respectively. For temporal occlusion, we randomly occlude a block of frames within the first 100 frames, because many sequences are shorter than 100 frames. Some examples from the occlusion dataset are shown in Fig. 1. On the synthetic occlusion dataset, we test the baseline model, SR-TSL, and the RA-GCN with 2 streams and 3 streams. The experimental results are displayed in Table 3, which shows that the 3s RA-GCN greatly outperforms the other models in most occlusion experiments, except when part 2 is occluded. In the temporal occlusion experiments in particular, the advantage of the 3s RA-GCN over the other models grows. Thus the proposed method alleviates the performance deterioration. We also find that when some important joints, such as the right arm, are occluded, some action categories, e.g. handshaking, cannot be inferred from the other joints. The proposed method fails in such cases.
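The two kinds of synthetic occlusion can be sketched in NumPy as below. This is an illustration under assumptions rather than the dataset-generation code: the function names are hypothetical, the joint indices passed to `occlude_joints` are placeholders (the mapping from body parts to NTU joint indices is not given in the text), and the block start is drawn uniformly.

```python
import numpy as np

def occlude_joints(x, joint_indices):
    """Spatial occlusion: set the given joints to 0 in every frame.
    x has shape (C, T, V, M)."""
    occluded = x.copy()
    occluded[:, :, joint_indices, :] = 0.0
    return occluded

def occlude_frames(x, num_frames, rng, window=100):
    """Temporal occlusion: set a random contiguous block of frames to 0,
    chosen within the first `window` frames. Returns the occluded copy and
    the start index of the block."""
    occluded = x.copy()
    limit = min(window, x.shape[1])
    num_frames = min(num_frames, limit)
    start = int(rng.integers(0, limit - num_frames + 1))  # upper bound is exclusive
    occluded[:, start:start + num_frames] = 0.0
    return occluded, start

rng = np.random.default_rng(0)
sequence = np.ones((3, 120, 25, 2))                  # dummy skeleton sequence
no_part = occlude_joints(sequence, [4, 5, 6])        # placeholder joint indices
no_block, block_start = occlude_frames(sequence, num_frames=30, rng=rng)
```

Both functions return copies, so the same evaluation sequence can be reused across several occlusion settings.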
In this paper, we have proposed a novel model named RA-GCN, which performs much better than the baseline model and is more robust to incomplete skeletons. With extensive experiments on the NTU RGB+D dataset, we verify the effectiveness of our model in occlusion scenarios.
In the future, we will add an attention module to our model, in order to make each stream focus more on certain discriminative joints.
“View adaptive recurrent neural networks for high performance human action recognition from skeleton data,” in ICCV, 2017, pp. 2136–2145.
“Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks,” in ICCV, 2017, pp. 1012–1020.
“Skeleton-based action recognition with convolutional neural networks,” in ICME Workshops. IEEE, 2017, pp. 597–600.
“Deep progressive reinforcement learning for skeleton-based action recognition,” in CVPR, 2018, pp. 5323–5332.
“Learning deep features for discriminative localization,” in CVPR, 2016, pp. 2921–2929.