Focusing and Diffusion: Bidirectional Attentive Graph Convolutional Networks for Skeleton-based Action Recognition

A collection of approaches based on graph convolutional networks have proven successful in skeleton-based action recognition by exploring neighborhood information and dense dependencies between intra-frame joints. However, these approaches usually ignore the spatial-temporal global context as well as the local relations between inter-frame and intra-frame joints. In this paper, we propose a focusing and diffusion mechanism to enhance graph convolutional networks by paying attention to the kinematic dependence of articulated human pose in a frame and their implicit dependencies over frames. In the focusing process, we introduce an attention module to learn a latent node over the intra-frame joints to convey spatial contextual information. In this way, the sparse connections between joints in a frame can be well captured, while the global context over the entire sequence is further captured by these hidden nodes with a bidirectional LSTM. In the diffusion process, the learned spatial-temporal contextual information is passed back to the spatial joints, leading to a bidirectional attentive graph convolutional network (BAGCN) that facilitates skeleton-based action recognition. Extensive experiments on the challenging NTU RGB+D and Skeleton-Kinetics benchmarks demonstrate the efficacy of our approach.


1 Introduction

Figure 1: Previous methods ignore the implicit inter-frame and intra-frame connections, which inspires us to learn a latent transformer node and then convey the global spatial-temporal contextual information back to the spatial joints in each frame.

Action recognition is a challenging vision problem that has been studied for years. It plays an essential role in various applications, such as video surveillance [24], patient monitoring [20], robotics [40], human-machine interaction [10], and so on [37, 40, 28]. Previous methods mainly rely on two different data modalities: RGB images [11, 33] and skeleton points [28, 39, 40]. Compared with 2D images, 3D (skeleton) points directly encode object shape and motion [26, 27, 38, 5, 23, 34, 6] and thus are more robust to nuisances like fast motions, camera viewpoint changes, etc.

Conventional attempts [35, 4, 22] take skeletons as a sequence of vectors generated by encoding all joints in the same frame together. Some deep-learning-based methods, such as CNNs, apply convolution directly to skeleton data even though it lies on an irregular grid. These approaches ignore the intrinsic kinematic dependencies and structural information of human pose, leading to unsatisfactory performance.

To exploit the structure of skeleton data, Yan et al. [39] first introduce the spatial-temporal Graph Convolutional Network (GCN) based on coordinated joints and the natural topology of the human body. However, this predefined graph structure suffers from two defects, as shown in Fig. 1: 1) the graph only considers physical structural information but ignores the semantic connections among intra-frame joints. For example, the implicit correlation between hands and head is crucial to recognizing the action of "drinking". 2) The inter-frame edges only connect the same joint over consecutive frames, failing to model connections between different joints across frames. Another challenge is that a complete action lasts from several seconds to minutes, so it is crucial to design a model that can suppress noise from irrelevant joints and focus on informative ones. Temporal context modeling is usually expected to help address this issue. Inspired by this, in this paper we intend to capture both the physical connections and semantic dependencies beyond the spatial and temporal neighborhood restrictions of GCN.

Several methods, such as AS-GCN [18] and 2s-AGCN [30], attempt to overcome the first drawback above, but they introduce additional challenges. To obtain kinematic dependencies, 2s-AGCN introduces a module that learns an additional data-driven morphing graph over layers, and AS-GCN proposes an encoder-decoder structure to capture action-specific dependencies combined with the physical structural links between joints. Nevertheless, it is inefficient to directly model the relationships between non-neighboring joints over a fully-connected graph. These potentially dense connections are also tricky to learn because of noise from irrelevant ones, which leads to difficulty in optimization and unsatisfactory performance. Besides, these methods ignore the correlations between inter-frame joints, so they cannot take advantage of temporal contextual information.

To overcome such limitations, we introduce a focusing and diffusion mechanism that enhances graph convolutional networks to receive information beyond the spatial and temporal neighborhood restrictions. We first represent the human skeleton in each frame by constructing a focus graph and a diffusion graph. Then we propose an efficient way to propagate information over these graphs through a learnable latent transformer node. This captures a sparse but informative set of connections between joints in each frame, since not all implicit correlations are activated in an action. It also agrees with the fact that the informative connections may change over frames. During training, the latent transformer node learns how to connect the action-specific related joints and passes messages back and forth through the focusing and diffusion graphs.

However, there remains the problem of how to model the relations between intra-frame and inter-frame joints. To solve this, we take advantage of the latent transformer node again. We empirically show that this node learns the informative spatial context in each frame, generating a sequence of latent spatial nodes. To capture the temporal contextual information over frames, we introduce a context-aware module consisting of bidirectional LSTM cells to model temporal dynamics and dependencies based on the learned spatial latent nodes.

With the focusing and diffusion mechanism, a graph convolutional network can selectively focus on the informative joints in each frame, capture the global spatial-temporal contextual information, and further convey it back to augment the node embeddings in each frame. We develop a Bidirectional Attentive Graph Convolutional Network (BAGCN) as an exemplar and evaluate its effectiveness on two challenging public benchmarks: NTU-RGB+D and Skeleton-Kinetics. In summary, our main contributions are three-fold:

  • We propose a novel representation of the human skeleton data in a single frame by constructing two opposite-direction graphs and also introduce an efficient way of message passing in the graph.

  • We design a new network architecture, the Bidirectional Attentive Graph Convolutional Network (BAGCN), that learns spatial-temporal context from human skeleton sequences by leveraging a focusing and diffusion mechanism built on graph convolutional networks.

  • Our BAGCN model is compared against other state-of-the-art methods on the challenging NTU-RGB+D and Skeleton-Kinetics benchmarks and demonstrates superior performance.

2 Related Work

Skeleton-based action recognition: Existing methods can be divided into handcrafted-feature-based [36, 35, 9] and deep-learning-based [28, 13, 16, 39] ones. The former struggle to design representative features, which limits recognition accuracy, whereas the latter are data-driven and improve performance by a large margin. Three types of deep models are widely used: RNNs, CNNs, and GCNs. RNN-based methods [3, 28, 21, 32] usually take the skeletons as a sequence of vectors and model their temporal dependencies over frames. CNN-based approaches [13, 12, 22, 16, 17] apply convolution to skeleton data by regarding it as a pseudo-image.

Recently, graph-based methods [29, 30, 39, 18] achieve superior classification accuracy. ST-GCN [39] is the first to employ graph convolution for this task. SR-TSL [31] uses a gated recurrent unit (GRU) to propagate messages on graphs and an LSTM to learn temporal features. AS-GCN [18] learns actional links simultaneously with structural links in its graph model. 2s-AGCN [30] introduces an adaptive module to construct data-driven morphing graphs over layers. However, these two methods meet the same difficulty in optimization due to noise from irrelevant joints or bones. Further, DGNN [29] constructs a directed graph and defines message-passing rules for node and edge updating. However, these methods ignore the implicit correlations between intra-frame and inter-frame joints.

Graph convolutional networks: Graph convolutional networks (GCNs), which generalize convolution from images to graphs, have been successfully applied in many applications. Approaches to constructing GCNs can be split into two categories: the spectral [7, 19, 15] and the spatial [1, 25] perspective. The former uses the graph Fourier transform to perform convolution in the frequency domain, whereas the latter directly applies convolution filters on the graph according to predefined rules and neighborhoods. We follow the second way and construct convolution filters in the spatial domain.

3 Method

To capture both physical and semantic connections sparsely, we introduce a focusing and diffusion mechanism that enhances the ability of graph convolutional networks to receive information from other intra-frame and inter-frame joints. First, we define focusing and diffusion graphs based on the skeletons. Then we formulate how messages are conveyed over the constructed graphs. Finally, we introduce an exemplar of our focusing and diffusion mechanism, developing a bidirectional attentive graph convolutional network.

Figure 2: Undirected graphs (left) can be modeled by explicitly assigning two directed edges in opposite directions to each undirected edge, leading to two directed graphs (right). The black node denotes the center node and the arrows indicate the passing direction.

3.1 Graph Definition

The raw skeleton data consist of a sequence of articulated human pose coordinates. Previous works [16, 30] construct an undirected spatial-temporal graph in the same way as [39]. Given a spatial-temporal graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of body joints and $\mathcal{E} = \mathcal{E}_S \cup \mathcal{E}_T$ is the set of intra-skeleton and inter-frame bones, $\mathcal{E}_S$ contains the natural connections in the articulated human pose and $\mathcal{E}_T$ consists of the joints' trajectories between two consecutive frames. As shown in the left part of Fig. 2, the vertexes denote body joints and the natural connections between them represent the bones.

Different from previous works, and inspired by [14], we model the undirected spatial-temporal graph as two directed graphs with opposite directions for each edge, i.e., $\mathcal{G} = \{\mathcal{G}_{f}, \mathcal{G}_{d}\}$ with the focus graph $\mathcal{G}_{f}$ and the diffusion graph $\mathcal{G}_{d}$, as shown in the right part of Fig. 2. The message passing between joints is bidirectional, which is expected to exploit the kinematic dependencies and help the center joint and the peripheral joints receive information from each other.

ST-GCN [39] indicates that a graph with the spatial configuration partitioning strategy exploits more location information. We follow this conclusion and split the neighbors into three subsets according to their distance from the gravity center of the skeleton: 1) the root node itself; 2) the centripetal group; 3) the centrifugal group, i.e., the partition set is $\{\text{root}, \text{closer}, \text{far}\}$. Hence, we define the message passing in focus graphs to flow from joints close to the center of gravity to joints far from it, and this direction is reversed in diffusion graphs. The bone information is obtained by calculating the coordinate difference between the two endpoint joints.

Let $\mathbf{A} = \{\mathbf{A}_{f}, \mathbf{A}_{d}\}$ be the adjacency matrices of our graph $\mathcal{G}$, where $\mathbf{A}_{f}$ and $\mathbf{A}_{d}$ correspond to the focus graph $\mathcal{G}_{f}$ and the diffusion graph $\mathcal{G}_{d}$, respectively. For each sub-graph, we have $\mathbf{A}_{f} = \sum_{p} \mathbf{A}_{f}^{p}$ and $\mathbf{A}_{d} = \sum_{p} \mathbf{A}_{d}^{p}$, where $p \in \{\text{root}, \text{closer}, \text{far}\}$ indexes the partition groups.
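To make this construction concrete, below is a minimal NumPy sketch of how focus and diffusion adjacency matrices could be built from a bone list. The five-joint bone list, the row-as-receiver convention, and the simple degree normalization are illustrative assumptions; the actual NTU-RGB+D graph has 25 joints, and the spatial configuration strategy would further split each matrix into root/closer/far subsets.

```python
# A toy focus/diffusion adjacency construction (assumptions: 5 joints, joint 0 is
# the joint closest to the gravity center, each bone is stored center-to-periphery).
import numpy as np

num_joints = 5
bones = [(0, 1), (1, 2), (0, 3), (3, 4)]   # (closer-to-center, farther-from-center)

def normalize(adj):
    """Row-normalize so each receiving joint averages its incoming messages."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    return adj / deg

# Focus graph: messages flow away from the gravity center (row = receiver).
A_focus = np.zeros((num_joints, num_joints), dtype=np.float32)
for i, j in bones:
    A_focus[j, i] = 1.0

# Diffusion graph: every edge reversed, so messages flow back toward the center.
A_diff = A_focus.T.copy()

# Self-loops correspond to the "root" subset of the partitioning strategy.
A_root = np.eye(num_joints, dtype=np.float32)

A_focus_hat, A_diff_hat = normalize(A_focus + A_root), normalize(A_diff + A_root)
print(A_focus_hat)
```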

Figure 3: Illustration of our bidirectional attentive graph convolutional network for skeleton-based action recognition. Besides the stacked spatial-temporal graph convolutions, we introduce a focusing and diffusion mechanism to learn a latent transformer node and bidirectionally convey the local context over frames. Once the global spatial-temporal contextual information is captured, the spatial joints learn how much context should be absorbed when updating node embeddings in each frame.

3.2 Building Block with Graph Convolution

We have defined the sequential skeletons as two directed spatial-temporal graphs; the problem becomes how to pass messages over these directed graphs. We adopt an implementation of spatial graph convolution similar to [15], which uses a weighted average of neighboring features to update the intra-frame joints' embeddings. With this design, each graph convolution layer is followed by a temporal convolution, so we only need to implement the graph convolution operation for the single-frame case.

For message passing, we first convey information through the focus graphs and then transform the updated features back via the diffusion graphs. Let $\mathbf{X}$ be the nodes' embedding of our focus graph from a certain intermediate layer. Before the focus stage, the node embedding is first updated via a graph convolution

$$\mathbf{X}^{f} = \sum_{p} \mathbf{W}^{f}_{p}\, \mathbf{X}\, \bigl(\hat{\mathbf{A}}^{f}_{p} \odot \mathbf{M}^{f}_{p}\bigr), \qquad (1)$$

where $\hat{\mathbf{A}}^{f}_{p}$ is the normalized adjacency matrix for each group $p$, $\mathbf{M}^{f}_{p}$ is a learnable edge importance weighting, and $\odot$ denotes element-wise multiplication between two matrices. $\mathbf{W}^{f}_{p}$ is the weight matrix, which is implemented via a convolution operation.
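For reference, here is a minimal PyTorch sketch of the partitioned graph convolution in Eq. (1). The tensor layout (batch, channels, frames, joints), the use of a single 1x1 convolution to realize all $\mathbf{W}^{f}_{p}$, and the einsum-based aggregation are implementation assumptions, not the paper's released code.

```python
# A minimal partitioned spatial graph convolution: sum_p W_p X (A_hat_p * M_p).
import torch
import torch.nn as nn

class PartitionedGraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, A_hat):
        """A_hat: tensor of shape (P, N, N), one normalized adjacency per partition,
        with rows as receivers (A_hat[p, m, n] lets joint m receive from joint n)."""
        super().__init__()
        self.register_buffer("A_hat", A_hat)
        P, N, _ = A_hat.shape
        # Learnable edge-importance mask M_p, multiplied element-wise with A_hat_p.
        self.M = nn.Parameter(torch.ones(P, N, N))
        # All W_p realized as one 1x1 convolution producing P * out_channels maps.
        self.conv = nn.Conv2d(in_channels, out_channels * P, kernel_size=1)
        self.P, self.out_channels = P, out_channels

    def forward(self, x):                       # x: (B, C_in, T, N)
        B, _, T, N = x.shape
        y = self.conv(x).view(B, self.P, self.out_channels, T, N)
        # Aggregate neighbors per partition and sum over the P groups.
        return torch.einsum("bpctn,pmn->bctm", y, self.A_hat * self.M)

# Usage with a toy 5-joint, 3-partition graph:
A_hat = torch.rand(3, 5, 5)
gcn = PartitionedGraphConv(3, 64, A_hat)
out = gcn(torch.randn(2, 3, 30, 5))             # -> (2, 64, 30, 5)
```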

Then, these feature maps are transformed to learn a latent node during the focus stage, which acts as a transformer that passes messages to arbitrary nodes beyond the temporal and spatial restrictions. This latent node can also be regarded as the spatial context over a single-frame graph. To capture the temporal contextual information, we propose to learn the context through a context-aware module, denoted as $\mathrm{CAM}(\cdot)$. Afterwards, the learned spatial-temporal contextual information is conveyed back to the spatial nodes in each frame, and every node learns how much context should be used for its embedding during the diffusion stage. For notation, $\mathcal{F}(\cdot)$ and $\mathcal{D}(\cdot)$ represent the focusing and diffusion functions, respectively. Hence, the latent node and temporal context during the focusing and diffusion can be formulated as

$$\mathbf{z}_{t} = \mathcal{F}\bigl(\mathbf{X}^{f}_{t}\bigr), \qquad \mathbf{s}_{t} = \phi\bigl(\bigl[\,\mathbf{z}_{t} \,\|\, \mathrm{CAM}(\mathbf{z}_{1}, \dots, \mathbf{z}_{T})_{t}\,\bigr]\bigr), \qquad (2)$$

where $\phi(\cdot)$ indicates a simple fully-connected layer and $\|$ denotes the concatenation operation.

The latent node models global spatial-temporal contextual information, which is selectively transferred to update the node features in each single-frame graph. Given the embedding $\tilde{\mathbf{X}}$ enhanced with our focusing and diffusion mechanism, another graph convolution operates on the diffusion graph so that the node embedding is updated as

$$\mathbf{X}^{d} = \sum_{p} \mathbf{W}^{d}_{p}\, \tilde{\mathbf{X}}\, \bigl(\hat{\mathbf{A}}^{d}_{p} \odot \mathbf{M}^{d}_{p}\bigr), \qquad (3)$$

where $\hat{\mathbf{A}}^{d}_{p}$ is the normalized adjacency matrix for each group in the diffusion graphs, $\mathbf{M}^{d}_{p}$ is a learnable edge importance weighting, $\odot$ denotes the element-wise multiplication operator, and $\mathbf{W}^{d}_{p}$ is the weight matrix. Similar to the conventional building block, we add a BN layer and a ReLU layer after each graph convolution. With this building block, we construct our bidirectional attentive graph convolutional network as shown in Fig. 3, the implementation details of which are described in the next section.

3.3 Context-aware Graph Focusing and Diffusion

In this section, we describe an exemplar of our focusing and diffusion mechanism in detail, as shown in Fig. 4. First, we introduce the attentive focus module and then the temporal context-aware module. After learning the spatial-temporal contextual information with a latent node, we develop a dynamic diffusion to convey the global context back to the spatial nodes in each frame.

Figure 4: Implementation details of the context-aware focusing and diffusion.

3.3.1 Dynamic Attentive Focusing

The purpose of the focusing stage is to learn a latent node that acts as a transformer, carrying information between nodes in a single-frame graph beyond spatial neighborhood restrictions. Besides, this latent transformer node endows the model with the ability to selectively focus on the informative joints in each frame.

Specifically, given an input $\mathbf{X}^{f}_{t}$ for frame $t$, our attentive focus learns a latent node $\mathbf{z}_{t}$, which satisfies

$$\mathbf{z}_{t} = \theta\bigl(\mathbf{X}^{f}_{t}\bigr)\, \boldsymbol{\alpha}_{t}, \qquad (4)$$

where $\theta(\cdot)$ is a convolution operation for node embedding and $\boldsymbol{\alpha}_{t}$ is a learnable attention matrix that focuses the information over a single-frame graph. This is equivalent to an attention mechanism combining features from all nodes over a spatial graph. In our spatial-temporal graphs, it is also temporally dynamic: each spatial graph in the sequence has its own attention, yielding a series of per-frame spatial contexts. The model learns how to selectively transfer features to the latent node via our attentive focus and to receive their feedback in the diffusion stage.
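A minimal sketch of this attentive focusing is given below; computing the attention weights from the features with a 1x1 convolution followed by a softmax over the joints is one plausible realization of the learnable focus matrix, and the layer sizes are assumptions.

```python
# Attentive focusing: embed the joints and pool them into one latent node per frame.
import torch
import torch.nn as nn

class AttentiveFocus(nn.Module):
    def __init__(self, in_channels, latent_channels):
        super().__init__()
        self.theta = nn.Conv2d(in_channels, latent_channels, kernel_size=1)  # node embedding
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)                # attention logits

    def forward(self, x):                              # x: (B, C, T, N)
        alpha = torch.softmax(self.score(x), dim=-1)   # (B, 1, T, N), sums to 1 over joints
        z = (self.theta(x) * alpha).sum(dim=-1)        # (B, C_latent, T): one latent node per frame
        return z, alpha

focus = AttentiveFocus(64, 64)
z, alpha = focus(torch.randn(2, 64, 30, 25))           # z: (2, 64, 30)
```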

3.3.2 Temporal Context-aware Module

We exploit the temporal dependencies between the learned latent nodes in the graph sequence via a Context-Aware Module (CAM) that transforms the global spatial-temporal context. Once each spatial graph obtains its latent node $\mathbf{z}_{t}$, the remaining problem is how to model the temporal contextual information based on these nodes. Long Short-Term Memory (LSTM) [8] has validated strengths in modeling the dependencies and dynamics of sequential data. Hence, we utilize it to build our context-aware module. An LSTM cell contains an input gate $\mathbf{i}_{t}$, a forget gate $\mathbf{f}_{t}$, an output gate $\mathbf{o}_{t}$ and an input cell $\mathbf{g}_{t}$, which are computed by the following functions:

$$\begin{aligned}
\mathbf{i}_{t} &= \sigma\bigl(\mathbf{W}_{i}\,[\tilde{\mathbf{z}}_{t},\, \mathbf{h}_{t-1}] + \mathbf{b}_{i}\bigr),\\
\mathbf{f}_{t} &= \sigma\bigl(\mathbf{W}_{f}\,[\tilde{\mathbf{z}}_{t},\, \mathbf{h}_{t-1}] + \mathbf{b}_{f}\bigr),\\
\mathbf{o}_{t} &= \sigma\bigl(\mathbf{W}_{o}\,[\tilde{\mathbf{z}}_{t},\, \mathbf{h}_{t-1}] + \mathbf{b}_{o}\bigr),\\
\mathbf{g}_{t} &= \tanh\bigl(\mathbf{W}_{g}\,[\tilde{\mathbf{z}}_{t},\, \mathbf{h}_{t-1}] + \mathbf{b}_{g}\bigr),
\end{aligned} \qquad (5)$$

where $\mathbf{h}_{t-1}$ is the previous hidden state at time $t$, each $\mathbf{W}_{*}$ is a transformation matrix with learnable parameters, and $\sigma(\cdot)$, $\tanh(\cdot)$ are the activation functions. $\tilde{\mathbf{z}}_{t}$ is the projected latent node feature, obtained with a weight matrix $\mathbf{W}_{z}$ as $\tilde{\mathbf{z}}_{t} = \mathbf{W}_{z}\,\mathbf{z}_{t}$. The memory cell $\mathbf{c}_{t}$ and hidden state $\mathbf{h}_{t}$ are updated by

$$\mathbf{c}_{t} = \mathbf{f}_{t} \odot \mathbf{c}_{t-1} + \mathbf{i}_{t} \odot \mathbf{g}_{t}, \qquad \mathbf{h}_{t} = \mathbf{o}_{t} \odot \tanh(\mathbf{c}_{t}), \qquad (6)$$

where $\odot$ represents the element-wise multiplication operator.

Considering the directed message passing, we stack two bidirectional LSTM cells as our temporal context-aware module. Thus, the learned spatial-temporal context can be formulated as $\mathrm{CAM}(\mathbf{z}_{1}, \dots, \mathbf{z}_{T})_{t} = \bigl[\,\overrightarrow{\mathbf{h}}_{t} \,\|\, \overleftarrow{\mathbf{h}}_{t}\,\bigr]$, the concatenation of the forward and backward hidden states.
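Below is a minimal sketch of such a context-aware module: the per-frame latent nodes are projected (the $\mathbf{W}_{z}$ step) and fed to a bidirectional LSTM whose forward and backward hidden states form the temporal context. The single recurrent layer and the hidden size are assumptions.

```python
# Temporal context-aware module: bidirectional LSTM over the per-frame latent nodes.
import torch
import torch.nn as nn

class ContextAwareModule(nn.Module):
    def __init__(self, latent_channels, hidden_channels):
        super().__init__()
        self.proj = nn.Linear(latent_channels, hidden_channels)   # W_z projection
        self.bilstm = nn.LSTM(hidden_channels, hidden_channels,
                              batch_first=True, bidirectional=True)

    def forward(self, z):                        # z: (B, C_latent, T)
        z = self.proj(z.transpose(1, 2))         # (B, T, hidden)
        s, _ = self.bilstm(z)                    # (B, T, 2 * hidden): forward || backward states
        return s.transpose(1, 2)                 # (B, 2 * hidden, T)

cam = ContextAwareModule(64, 64)
s = cam(torch.randn(2, 64, 30))                  # -> (2, 128, 30)
```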

3.3.3 Dynamic Attentive Diffusion

For the diffusion stage, we expect the network to learn how much spatial-temporal context should be passed back when updating node features. With this diffusion, a node in a single-frame graph can receive information from inter-frame and intra-frame nodes, which indicates that the learned latent node can indeed be regarded as a transformer.

In detail, the network first learns a weight matrix $\boldsymbol{\beta}$ for conveying the global context to every spatial node in each frame. Then each node absorbs the diffused contextual information and concatenates it with its node embedding $\mathbf{x}^{f}_{t,i}$. The combined node features are embedded via a convolution operation with a learnable weight $\mathbf{W}$. Hence, the diffusion is formulated as

$$\tilde{\mathbf{x}}_{t,i} = \mathbf{W}\,\bigl[\,\mathbf{x}^{f}_{t,i} \,\|\, \beta_{t,i}\, \mathbf{s}_{t}\,\bigr], \qquad (7)$$

where $\|$ denotes concatenation along the channel dimension.
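A minimal sketch of this diffusion step follows; computing the absorb weight from the node features with a 1x1 convolution and a sigmoid is an assumed realization of $\boldsymbol{\beta}$, and the channel sizes are illustrative.

```python
# Dynamic attentive diffusion: each joint absorbs a weighted share of the global context.
import torch
import torch.nn as nn

class AttentiveDiffusion(nn.Module):
    def __init__(self, node_channels, context_channels, out_channels):
        super().__init__()
        self.beta = nn.Conv2d(node_channels, 1, kernel_size=1)    # per-joint absorb weight
        self.fuse = nn.Conv2d(node_channels + context_channels, out_channels, kernel_size=1)

    def forward(self, x, s):
        # x: (B, C_node, T, N) joint embeddings; s: (B, C_ctx, T) spatial-temporal context.
        beta = torch.sigmoid(self.beta(x))                        # (B, 1, T, N)
        ctx = s.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))      # broadcast context to joints
        return self.fuse(torch.cat([x, beta * ctx], dim=1))       # (B, C_out, T, N)

diff = AttentiveDiffusion(64, 128, 64)
y = diff(torch.randn(2, 64, 30, 25), torch.randn(2, 128, 30))
```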

4 Experiments

To evaluate the effectiveness of our approach, we conduct extensive experiments and compare with state-of-the-art methods on two challenging benchmarks: NTU-RGB+D [28] and Skeleton-Kinetics [39]. Considering the limited computing resources, we choose the relatively smaller NTU-RGB+D dataset to analyze the effect of each component in our Bidirectional Attentive Graph Convolutional Network (BAGCN).

4.1 Datasets and Setting

NTU-RGB+D: NTU-RGB+D [28] is collected by Microsoft Kinect v2 with joint annotations. It consists of 56,880 video samples from 60 categories of human actions. 40 distinct subjects perform various actions, which are captured simultaneously by three cameras at different horizontal angles: -45°, 0°, and 45°. The human pose is articulated as 25 joints, whose coordinates serve as annotations. Evaluation on this dataset follows two standard protocols: Cross-Subject (X-sub) and Cross-View (X-view). In the former configuration, half of the 40 subjects constitute the training set and the other half the test set. In the latter, training and test groups are split by camera view, where the training group has 37,920 video samples from cameras 2 and 3. Only top-1 accuracy is reported for both protocols. Data preprocessing follows [29, 28].

Skeleton-Kinetics: The DeepMind Kinetics [11] human action video dataset contains 400 human action classes, with at least 400 video clips for each action, taken from the Internet. Its activity categories cover various indoor and outdoor daily actions, such as driving a car and dying hair. The Skeleton-Kinetics data were obtained by estimating the locations of 18 articulated joints in every frame with the publicly available OpenPose [2] toolbox. The annotation of each joint consists of its 2D coordinates and a 1D confidence score. We follow the same data preparation strategy as ST-GCN [39] for fair comparison and select two bodies in multi-person cases according to the highest average joint confidence. All samples are resized to 300 frames by padding. Top-1 and Top-5 classification accuracies serve as the evaluation metrics, with 240,000 and 20,000 clips for training and testing, respectively.
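A small sketch of this data preparation under assumed array layouts is shown below: each candidate person is an OpenPose track of shape (frames, 18, 3) holding (x, y, confidence), the two bodies with the highest average confidence are kept, and clips are zero-padded to 300 frames.

```python
# Skeleton-Kinetics clip preparation sketch (array layout and zero-padding are assumptions).
import numpy as np

MAX_FRAMES, NUM_JOINTS, MAX_BODIES = 300, 18, 2

def prepare_clip(people: dict) -> np.ndarray:
    # Rank candidate bodies by average joint confidence over the whole clip.
    ranked = sorted(people.values(),
                    key=lambda p: p[:, :, 2].mean(), reverse=True)[:MAX_BODIES]
    clip = np.zeros((MAX_BODIES, MAX_FRAMES, NUM_JOINTS, 3), dtype=np.float32)
    for m, pose in enumerate(ranked):
        t = min(len(pose), MAX_FRAMES)
        clip[m, :t] = pose[:t]          # shorter clips stay zero-padded to 300 frames
    return clip                          # (bodies, frames, joints, (x, y, score))

# Toy usage with two random "people":
clip = prepare_clip({0: np.random.rand(240, 18, 3), 1: np.random.rand(240, 18, 3)})
```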

Implementation details: Here, we describe how to implement our Bidirectional Attentive Graph Convolutional Network (BAGCN) and the training details. Following [30], we concatenate joints and bones along the channel dimension and feed them into the network for most of the experiments.

Our BAGCN consists of 9 building blocks, and the first layer has 64 output channels. In the 4-th and 7-th blocks, we double the number of channels and downsample the temporal length by a factor of 2, the same as in ST-GCN [39]. A data BN layer performs normalization at the beginning, and a global average pooling layer follows the last block to generate a 256-dimensional vector for each sequence, which is then classified by a softmax classifier. For the context-aware focusing and diffusion, the channel widths of the latent node and the context follow the input and output channels of the corresponding blocks. Besides, the focusing weight $\boldsymbol{\alpha}$ and the diffusion weight $\boldsymbol{\beta}$ share parameters and are learned directly from the node features using a convolution. Stochastic gradient descent with momentum 0.9, weight decay 0.0001, and the cross-entropy loss are used for training. We initialize the base learning rate as 0.1 and decay it by a factor of 10 twice during training, with 50 epochs in total for the NTU-RGB+D dataset and 65 epochs for the Skeleton-Kinetics dataset. Besides, we also explore the spatial and motion information referred to in DGNN [29]. The motion stream takes the movements of joints and the deformations of bones as input and is exploited only in the last part of the ablation study; all other experiments are based on the spatial stream for fair comparison.
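The optimization setup described above can be sketched in PyTorch as follows; the decay milestones are placeholders rather than the paper's exact schedule, and `model` is a stand-in for the full BAGCN.

```python
# Training setup sketch: SGD with momentum, weight decay, cross-entropy, step LR decay.
import torch
import torch.nn as nn

model = nn.Linear(256, 60)                  # placeholder for the BAGCN classifier head
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Two step decays by a factor of 10; the milestone epochs here are illustrative only.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 40], gamma=0.1)

for epoch in range(50):                     # 50 epochs for NTU-RGB+D in the paper
    # ... iterate over the training loader, compute the loss, call optimizer.step() ...
    scheduler.step()
```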

4.2 Comparison with State-of-the-art Methods

Methods Cross-Subject Cross-View
Lie Group [35] 50.1% 82.8%
H-RNN[3] 59.1% 64.0%
Deep LSTM[28] 60.7% 67.3%
ST-LSTM[21] 69.2% 77.7%
STA-LSTM[32] 73.4% 81.2%
Temporal Conv[13] 74.3% 84.1%
Clip+CNN+MTLN[12] 79.6% 84.8%
Synthesized CNN[22] 80.0% 87.2%
ST-GCN[39] 81.5% 88.3%
Two-Stream CNN[16] 83.2% 89.3%
SR-TSL[31] 84.8% 92.4%
HCN [17] 86.5% 91.1%
AS-GCN[18] 86.8% 94.2%
2s-AGCN[30] 88.5% 95.1%
DGNN[29] 89.9% 96.1%
BAGCN (Ours) 90.3% 96.3%
Table 1: Top-1 accuracy of action recognition on NTU-RGB+D compared with state-of-the-art methods.

To validate the superior performance of the BAGCN, we compare it with state-of-the-arts on both NTU-RGB+D and Skeleton-Kinetics datasets.

For the former, we report results under the cross-subject and cross-view protocols for all compared methods in Tab. 1. These approaches can be divided into handcrafted-feature-based [35], RNN-based [3, 28, 21, 32], CNN-based [13, 17, 16, 12, 22], and recent GCN-based [39, 31, 18, 30, 29] methods. Our model outperforms previous methods. Compared with our baseline ST-GCN, the context-aware focusing and diffusion mechanism improves performance by a large margin, from 81.5% to 90.3% on cross-subject and from 88.3% to 96.3% on cross-view. This performance gain verifies the superiority and effectiveness of BAGCN. The effect of each component in BAGCN is explored in the later ablation study.

For the Skeleton-Kinetics dataset, we compare with other methods in terms of top-1 and top-5 accuracies in Tab. 2. Our BAGCN again achieves the best performance, consistent with the results on NTU-RGB+D. From these two tables, we find that graph-based methods exploit kinematic dependencies better than RNN/CNN-based ones, and that our focusing and diffusion mechanism enables message passing between intra-frame and inter-frame joints, yielding better recognition accuracy.

Methods Top-1 Top-5
Feature Enc. [4] 14.9% 25.8%
Deep LSTM [28] 16.4% 35.3%
Temporal Conv[13] 20.3% 40.0%
ST-GCN [39] 30.7% 52.8%
AS-GCN [18] 34.8% 56.5%
2s-AGCN [30] 36.1% 58.7%
DGNN [29] 36.9% 59.6%
BAGCN(Ours) 37.3% 60.2%
Table 2: Comparison with state-of-the-art methods on Skeleton-Kinetics in terms of Top-1 and Top-5 classification accuracies.

4.3 Ablation Study

In this section, we evaluate the effect of each component in our bidirectional attentive graph convolutional network. In detail, we analyze each component by testing the recognition performance of BAGCN without that part on NTU-RGB+D. Besides, we also examine the impact of different modalities. Compared with the X-view protocol, the X-sub protocol requires a model to recognize actions performed by unseen subjects and is therefore more challenging.

Methods Accuracy
ST-GCN* 81.5%
ST-GCN [39] 84.8%
A-link GCN† 83.2%
S-link GCN† 84.2%
AS-GCN [18] 86.1%
BAGCN (wo/F) 86.4%
BAGCN (max) 87.4%
BAGCN (avg) 87.8%
BAGCN (att) 89.4%

* Original results reported in [39] with only joint information.
† Actional links (A-link) and structural links (S-link) learned in [18].

Table 3: Comparison with the baseline ST-GCNs and different focusing strategies on NTU-RGB+D under the Cross-Subject protocol.

4.3.1 Attentive Focusing and Diffusion

We analyze the importance of our attentive focusing and diffusion by exploring different focusing strategies. First, we remove it from our Bidirectional Attentive Graph Convolutional Network (BAGCN), and then investigate three variants of the focusing stage; the results are shown in Tab. 3.

Without attentive focusing and diffusion. The original ST-GCN [39] performance is increased to 84.8% with the concatenated joints and bones information, serving as our baseline. By our graph construction, BAGCN degrades to a bidirectional ST-GCN without the focusing and diffusion, denoted as BAGCN (wo/F). Messages first flow out from the center joint to peripheral joints and then flow back after node updating. This message-passing scheme improves the performance from 84.8% to 86.4%, which indicates the necessity for a joint node to receive information from non-neighboring joints. AS-GCN [18] also captures intra-frame dependencies beyond physical connections to update node embeddings. It learns actional links (A-link) and structural links (S-link) simultaneously, which causes difficulty in optimization. The limited performance of AS-GCN (86.1%) illustrates this problem and, in turn, demonstrates the effectiveness and efficiency of our proposed message-passing scheme. It also indicates that dense dependencies make it hard for the network to suppress noise from irrelevant connections.

With other ways of focusing and diffusion.

For the focus stage, a latent transformer node first gathers the spatial contextual information in a frame and then captures temporal dynamics over frames to form the spatial-temporal context. This global context is conveyed back to each spatial skeleton graph in the diffusion stage. Two more focusing variants, average and max-pooling, are investigated to demonstrate the importance of the attentive focus. Recognition performances are shown in the last three rows of Tab. 3.

It is observed that attentive focusing and diffusion can significantly improve performance. In the max-pool configuration, BAGCN takes the embedding of the maximum-response node to initialize the latent node in each frame and then captures the temporal contextual information between those maximum-response nodes. This captured spatial-temporal context proves useful for augmenting the spatial node embeddings in every frame, as the performance increases by 1.0% compared with BAGCN (wo/F). However, it is not reasonable to focus only on the maximum-response node and ignore the other physically or semantically associated joint nodes in an action. Hence, the other two configurations bring larger improvements. The average focusing, denoted as BAGCN (avg), emphasizes the mean statistics of the joint nodes' responses in every graph convolutional layer and achieves 87.8%. Further, BAGCN (att) learns dynamic spatial attention over graph layers for each frame to selectively pass messages back and forth between informative joints, leading to around a 6% performance gain compared with our baseline ST-GCN. This attentive focusing and diffusion learns a sequence of latent nodes to capture spatial and temporal contextual information from implicit connections and allows the spatial nodes to learn how much context should be used when updating node embeddings.

Figure 5: Illustration of the joints associated with our learned latent transformer node in three action samples. The actions from left to right are reading, hand waving, and face wiping, respectively. The red points represent the informative joint nodes activated by the specific action. A larger circle denotes a more informative joint; relatively irrelevant joints are omitted from this figure.

4.3.2 Latent Edge Visualization

Different actions activate different joints while ignoring the irrelevant ones. Fig. 5 shows several action samples and the joints associated with our learned latent transformer node. We choose the last graph convolutional layer to display the implicit sparse dependencies between joint nodes. The attention matrix from Eq. (4) measures how informative the connection between the latent transformer node and each joint node is. In the attentive focusing, if an attention value is larger than 0.8, the connection between the latent node and that joint is drawn. The connected joints reflect the joint nodes activated by a certain action.

It can be seen that the reading action (left part of Fig. 5) involves the hand and head joints, which agrees with the kinematic dependency. The middle part of Fig. 5 displays the informative joints in hand waving, and the right part illustrates the action-specific body parts in face wiping.

4.3.3 Temporal Context-aware or not

The latent node captures spatial contextual information in a frame and passes its message back after node updating. In this section, we evaluate the effect of the temporal context-aware module, which is used to extract the global context along the temporal dimension. First, we remove the context-aware module, and then employ two different types of LSTM for modeling temporal dynamics. The configuration that only learns a latent node over the focusing and diffusion graphs obtains 86.6% classification accuracy in Tab. 4, only 0.2% higher than BAGCN (wo/F). In contrast, the results in the last two rows demonstrate the power of the temporal context-aware module: BAGCN (1-Ca), a unidirectional temporal context module, achieves 87.3%, and BAGCN (2-Ca), the bidirectional one, obtains 89.4% recognition accuracy, a larger improvement. These improvements validate the effectiveness and necessity of capturing spatial-temporal contextual information for action recognition.

Methods Accuracy
ST-GCN [39] 84.8%
BAGCN (wo/F) 86.4%
BAGCN (wo/Ca) 86.6%
BAGCN (1-Ca) 87.3%
BAGCN (2-Ca) 89.4%
Table 4: Comparison of classification performance with and without the Context-aware module (Ca) on the NTU-RGB+D validation set under the Cross-Subject protocol.

4.3.4 Two-stream or not

Inspired by previous works [16, 17, 29], we test the complementary effect of motion information in our bidirectional attentive graph convolutional network. We train the spatial and motion branches separately and then apply late fusion on the NTU-RGB+D dataset under the Cross-Subject and Cross-View protocols, as shown in Tab. 5. From this table, the spatial branch achieves higher classification performance than the motion one, while fusing them still improves the accuracy from 89.4% to 90.3% on cross-subject and from 95.6% to 96.3% on cross-view.

Methods X-Sub (NTU) X-View (NTU) Top-1 (Kinetics) Top-5 (Kinetics)
Spatial 89.4% 95.6% 36.4% 58.9%
Motion 86.3% 94.1% 31.8% 54.9%
Fusion 90.3% 96.3% 37.3% 60.2%
Table 5: Comparison of classification performance with spatial information, motion information, and their fusion on the NTU-RGB+D validation set (Cross-Subject and Cross-View protocols) and on Skeleton-Kinetics (Top-1 and Top-5 accuracies).
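The late fusion of the two streams can be sketched as follows; summing the softmax scores with equal weights is an assumption about the fusion rule.

```python
# Two-stream late fusion sketch: fuse softmax scores from the spatial and motion streams.
import torch

def late_fusion(spatial_logits: torch.Tensor, motion_logits: torch.Tensor) -> torch.Tensor:
    """Both inputs: (batch, num_classes) raw scores from the two separately trained streams."""
    scores = torch.softmax(spatial_logits, dim=1) + torch.softmax(motion_logits, dim=1)
    return scores.argmax(dim=1)             # fused top-1 prediction per sample

preds = late_fusion(torch.randn(4, 60), torch.randn(4, 60))
```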

5 Conclusion

In this paper, we introduce a focusing and diffusion mechanism that enhances graph convolutional networks to receive information beyond the spatial and temporal neighborhood restrictions. Besides, we propose a context-aware module to capture the global spatial-temporal contextual information for modeling the implicit correlations between inter-frame and intra-frame joints. We then develop a Bidirectional Attentive Graph Convolutional Network (BAGCN) as an exemplar, a graph-based model enhanced with our context-aware focusing and diffusion mechanism. Its superior performance is demonstrated on two challenging public skeleton-based action recognition benchmarks: NTU-RGB+D and Skeleton-Kinetics.

References