1 Introduction
Action recognition is a challenging vision problem that has been studied for years. It plays an essential role in various applications, such as video surveillance [24], patient monitoring [20], robotics [40], human-machine interaction [10], and so on [37, 40, 28]. Previous methods mainly rely on two data modalities: RGB images [11, 33] and skeleton points [28, 39, 40]. Compared with 2D images, 3D (skeleton) points directly encode object shape and motion [26, 27, 38, 5, 23, 34, 6] and are thus more robust to nuisances such as fast motion and camera viewpoint changes.
Conventional attempts [35, 4, 22] take skeletons as a sequence of vectors generated by encoding all joints in the same frame together. Some deep-learning-based methods, such as CNNs, apply convolution operations directly to skeletons, which are irregular, non-grid data. These approaches ignore the intrinsic kinematic dependencies and structural information of human pose, leading to unsatisfactory performance.
To exploit the structure of skeleton data, Yan et al. [39] first introduced the spatial-temporal Graph Convolutional Network (GCN) based on joint coordinates and the natural topology of the human body. However, this predefined graph structure suffers from two defects, as shown in Fig. 1: 1) the graph considers only physical structural information but ignores the semantic connections among intra-frame joints; for example, the implicit correlation between the hands and the head is crucial for recognizing the action of "drinking." 2) The inter-frame edges only connect the same joints across consecutive frames and thus fail to model connections between different inter-frame joints. Another challenge is that a complete action lasts from several seconds to minutes, so it is crucial to design a model capable of reducing noise from irrelevant joints and focusing on informative ones. Temporal context modeling is usually expected to help address this issue, which inspires us to capture both the physical connections and the semantic dependencies beyond the spatial and temporal neighborhood restrictions of GCNs.
Several methods, such as AS-GCN [18] and 2s-AGCN [30], attempt to overcome the first drawback, but they introduce additional challenges. To capture kinematic dependency, 2s-AGCN introduces a module that learns an additional data-driven morphing graph over the layers, and AS-GCN proposes an encoder-decoder structure to capture action-specific dependencies combined with the physical structural links between joints. Nevertheless, it is inefficient to directly model the relationships between non-neighboring joints over a fully connected graph. These potentially dense connections are also tricky to learn because of the noise from irrelevant ones, which leads to difficult optimization and unsatisfactory performance. Besides, these methods ignore the correlations between inter-frame joints, so they cannot take advantage of temporal contextual information.
To overcome such limitations, we introduce a focusing and diffusion mechanism that enhances graph convolutional networks to receive information beyond the spatial and temporal neighborhood restrictions. We first represent the human skeleton in each frame by constructing a focus graph and a diffusion graph. Then we propose an efficient way to propagate information over these graphs through a learnable latent transformer node. This captures sparse but informative connections between joints in each frame, since not all implicit correlations are activated in an action. It also agrees with the fact that informative connections may change over the frames. During training, the latent transformer node learns how to connect action-specific related joints and passes messages back and forth through the focusing and diffusion graphs.
However, there remains the problem of how to model the relations between intra-frame and inter-frame joints. To solve this, we take advantage of the latent transformer node again. We empirically show that the node learns the informative spatial context in each frame, generating a sequence of latent spatial nodes. To capture the temporal contextual information over frames, we introduce a context-aware module consisting of bidirectional LSTM cells that models the temporal dynamics and dependencies of the learned spatial latent nodes.
With the focusing and diffusion mechanism, a graph convolutional network can selectively focus on the informative joints in each frame, capture the global spatial-temporal contextual information, and convey it back to augment the node embeddings in each frame. We also develop a Bidirectional Attentive Graph Convolutional Network (BAGCN) as an exemplar and evaluate its effectiveness on two challenging public benchmarks: NTU-RGB+D and Skeleton-Kinetics. In summary, our main contributions are threefold:

1. We propose a novel representation of human skeleton data in a single frame by constructing two opposite-direction graphs, and we introduce an efficient way of passing messages over these graphs.

2. We design a new network architecture, the Bidirectional Attentive Graph Convolutional Network (BAGCN), which learns spatial-temporal context from human skeleton sequences by leveraging a GCN-based focusing and diffusion mechanism.

3. Our BAGCN model is compared against other state-of-the-art methods on the challenging NTU-RGB+D and Skeleton-Kinetics benchmarks and demonstrates superior performance.
2 Related Work
Skeleton-based action recognition: Existing methods can be divided into handcrafted-feature-based [36, 35, 9] and deep-learning-based [28, 13, 16, 39] ones. The former encounter challenges in designing representative features, which results in limited recognition accuracy, while the latter are data-driven and improve performance by a great margin. Three types of deep models are widely used: RNNs, CNNs, and GCNs. RNN-based methods [3, 28, 21, 32] usually take the skeletons as a sequence of vectors and model their temporal dependencies over frames. CNN-based approaches [13, 12, 22, 16, 17] apply convolution to skeleton data by regarding it as a pseudo-image.
Recently, graph-based methods [29, 30, 39, 18] have achieved superior classification accuracy. ST-GCN [39] is the first to employ graph convolution. SR-TSL [31] uses gated recurrent units (GRUs) to propagate messages on graphs and LSTMs to learn the temporal features. AS-GCN [18] learns actional links simultaneously with structural links in its graph model. 2s-AGCN [30] introduces an adaptive module to construct data-driven morphing graphs over the layers. However, these two methods face the same difficulty in optimization due to noise from irrelevant joints or bones. Further, DGNN [29] constructs a directed graph and defines the message passing rules for node and edge updating. However, these methods ignore the implicit correlations between intra-frame and inter-frame joints.
Graph convolutional networks: Graph convolutional networks (GCNs), which generalize convolution from images to graphs, have been successfully applied in many applications. The ways of constructing GCNs can be split into two categories: the spectral [7, 19, 15] and the spatial [1, 25] perspective. The former utilizes the graph Fourier transform to perform convolution in the frequency domain, whereas the latter directly applies convolution filters on the graph according to predefined rules and neighborhoods. We follow the latter and construct our convolution filters in the spatial domain.
3 Method
To sparsely capture both physical and semantic connections, we introduce a focusing and diffusion mechanism that enhances the ability of graph convolutional networks to receive information from both intra-frame and inter-frame joints. First, we define focusing and diffusion graphs based on the skeletons. Then we formulate how messages are conveyed over the constructed graphs. Afterwards, we introduce an exemplar of our focusing and diffusion mechanism by developing a bidirectional attentive graph convolutional network.
3.1 Graph Definition
The raw skeleton data consist of sequential articulated human pose coordinates. Previous works [16, 30] construct an undirected spatial-temporal graph in the same way as [39]: a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of body joints and $\mathcal{E}$ is the set of intra-skeleton and inter-frame bones. The edge set is $\mathcal{E} = \mathcal{E}_S \cup \mathcal{E}_T$, where $\mathcal{E}_S$ contains the natural connections of the articulated human pose and $\mathcal{E}_T$ consists of the joints' trajectories between two consecutive frames. As shown in the left part of Fig. 2, the vertices denote body joints and the natural connections between them represent the bones.
Different from previous works, and inspired by [14], we define the undirected spatial-temporal graph as two directed graphs, a focus graph $\mathcal{G}^F$ and a diffusion graph $\mathcal{G}^D$, whose edges have opposite directions, as shown in the right part of Fig. 2. The message passing between joints is thus bidirectional, which is expected to exploit the kinematic dependencies and help the center joint and the peripheral joints receive information from each other.
ST-GCN [39] indicates that a graph with the spatial-configuration partitioning strategy exploits more location information. We follow this conclusion and split the neighbors into three subsets: 1) the root node itself; 2) the centripetal group; and 3) the centrifugal group, i.e., the partition is {root, centripetal, centrifugal}, determined by the distance from the gravity center of the skeleton. Hence, messages in the focus graph pass from a joint close to the center of gravity to a joint far from it, and the direction is reversed in the diffusion graph. The bone information is obtained by computing the coordinate difference between the two joints an edge connects.
Let $\mathbf{A} = \{\mathbf{A}^F, \mathbf{A}^D\}$ denote the adjacency matrices of our graphs, where $\mathbf{A}^F$ and $\mathbf{A}^D$ correspond to the focus graph $\mathcal{G}^F$ and the diffusion graph $\mathcal{G}^D$, respectively. Since the two graphs share the same edges with opposite directions, for each subgraph we have $\mathbf{A}^D = (\mathbf{A}^F)^{\top}$, and $\mathbf{A}^F + \mathbf{A}^D$ recovers the adjacency of the original undirected graph.
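To make the construction concrete, the following minimal NumPy sketch builds the focus and diffusion adjacency matrices for a hypothetical five-joint chain skeleton. The edge list, hop distances, and the row-normalization choice are illustrative assumptions, not the paper's released code.

```python
import numpy as np

# Toy 5-joint skeleton: joint 0 is the gravity center, with two chains
# 0-1-2 and 0-3-4. Hypothetical edge list for illustration only;
# the real NTU-RGB+D graph has 25 joints.
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
num_joints = 5
hops = np.array([0, 1, 2, 1, 2])  # hop distance to the gravity center

# Focus graph A^F: edges point from joints closer to the gravity
# center toward farther ones; the diffusion graph is its transpose.
A_focus = np.zeros((num_joints, num_joints))
for i, j in edges:
    src, dst = (i, j) if hops[i] < hops[j] else (j, i)
    A_focus[dst, src] = 1.0              # message flows src -> dst
A_diffusion = A_focus.T

def partition(A):
    """Split an adjacency matrix into the {root, centripetal,
    centrifugal} subsets of the spatial-configuration strategy."""
    root = np.eye(num_joints)
    centripetal = np.zeros_like(A)       # neighbor closer to the center
    centrifugal = np.zeros_like(A)       # neighbor farther from the center
    for dst in range(num_joints):
        for src in range(num_joints):
            if A[dst, src] > 0:
                if hops[src] < hops[dst]:
                    centripetal[dst, src] = 1.0
                else:
                    centrifugal[dst, src] = 1.0
    return [root, centripetal, centrifugal]

def normalize(A, eps=1e-6):
    """Row-normalize by in-degree (Lambda^{-1} A)."""
    return A / (A.sum(axis=1, keepdims=True) + eps)

A_focus_parts = np.stack([normalize(a) for a in partition(A_focus)])
A_diff_parts = np.stack([normalize(a) for a in partition(A_diffusion)])
```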
3.2 Building Block with Graph Convolution
We have defined the sequential skeletons as two directed spatial-temporal graphs; the problem becomes how to pass messages over these directed graphs. We adopt an implementation of spatial graph convolution similar to [15], which uses a weighted average of neighboring features to update the intra-frame joint embeddings. Following the design in [39], each graph convolution layer is followed by a temporal convolution, so we only need to implement the graph convolution operation for the single-frame case.
For message passing, we first convey information through the focus graph and then transform the updated features back via the diffusion graph. Let $\mathbf{X}$ be the node embeddings of our focus graph at a certain intermediate layer. Before the focus stage, the node embeddings are first updated via a graph convolution:

$$\hat{\mathbf{X}} = \sum_{k} \left(\mathbf{M}^F_k \odot \tilde{\mathbf{A}}^F_k\right) \mathbf{X} \mathbf{W}^F_k \qquad (1)$$

where $\tilde{\mathbf{A}}^F_k$ is the normalized adjacency matrix for each group $k \in \{\text{root}, \text{centripetal}, \text{centrifugal}\}$, $\mathbf{M}^F_k$ is a learnable edge-importance weighting, $\odot$ denotes element-wise multiplication between two matrices, and $\mathbf{W}^F_k$ is the weight matrix, implemented via a $1 \times 1$ convolution operation.
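A minimal PyTorch sketch of Eq. (1) may help: the module below implements the partition-wise weighted aggregation with a learnable edge-importance mask, assuming the adjacency subsets constructed in Sec. 3.1. All names and the tensor layout are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DirectedGraphConv(nn.Module):
    """Sketch of Eq. (1): X_hat = sum_k (M_k * A_k) X W_k, summed over
    the {root, centripetal, centrifugal} subsets."""

    def __init__(self, in_channels, out_channels, A_parts):
        super().__init__()
        # A_parts: (K, N, N) stack of normalized adjacency subsets.
        self.register_buffer("A", torch.as_tensor(A_parts, dtype=torch.float32))
        K, N, _ = self.A.shape
        # Learnable edge-importance weighting M_k, initialized to ones.
        self.M = nn.Parameter(torch.ones(K, N, N))
        # W_k realized as one 1x1 conv producing K groups of channels.
        self.conv = nn.Conv2d(in_channels, out_channels * K, kernel_size=1)
        self.out_channels = out_channels

    def forward(self, x):
        # x: (batch, C, T, N) -- features per frame and joint.
        B, C, T, N = x.shape
        K = self.A.shape[0]
        f = self.conv(x).view(B, K, self.out_channels, T, N)
        # Weighted average of neighbors: out[dst] = sum_src A[dst, src] f[src].
        return torch.einsum("bkctn,kmn->bctm", f, self.M * self.A)
```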
Then, these feature maps are transformed to learn a latent node during the focus stage, which acts as a transformer that passes messages to arbitrary nodes beyond the temporal and spatial restrictions. This latent node can also be regarded as the spatial context of a single-frame graph. To capture the temporal contextual information, we propose to learn the context through a context-aware module, denoted as $\mathcal{C}$. Afterwards, the learned spatial-temporal contextual information is conveyed back to the spatial nodes in each frame, and every node learns how much context should be used for its embedding during the diffusion stage. For notation, $\mathcal{F}$ and $\mathcal{D}$ represent the focusing and diffusion functions, respectively. Hence, the latent nodes $\mathbf{z}_t$ and the temporal context $\mathbf{c}_t$ during the focusing and diffusion can be formulated as:

$$\mathbf{z}_t = \mathcal{F}(\hat{\mathbf{X}}_t), \qquad [\mathbf{c}_1, \ldots, \mathbf{c}_T] = \mathcal{C}\big(\mathrm{FC}(\mathbf{z}_1) \,\|\, \cdots \,\|\, \mathrm{FC}(\mathbf{z}_T)\big) \qquad (2)$$

where $\mathrm{FC}$ indicates a simple fully-connected layer and $\|$ denotes the concatenation operation.
The latent node models the global spatial-temporal contextual information, which is selectively transferred to update the node features in each single-frame graph. Given the embedding $\tilde{\mathbf{X}} = \mathcal{D}(\hat{\mathbf{X}}, \mathbf{c})$ enhanced with our focusing and diffusion mechanism, another graph convolution operates on the diffusion graph so that the updated node embedding is:

$$\mathbf{Y} = \sum_{k} \left(\mathbf{M}^D_k \odot \tilde{\mathbf{A}}^D_k\right) \tilde{\mathbf{X}} \mathbf{W}^D_k \qquad (3)$$

where $\tilde{\mathbf{A}}^D_k$ is the normalized adjacency matrix for each group in the diffusion graph, $\mathbf{M}^D_k$ is a learnable edge-importance weighting, $\odot$ denotes the element-wise multiplication operator, and $\mathbf{W}^D_k$ is the weight matrix. As in the conventional building block, we add a BN layer and a ReLU layer after each graph convolution. With these building blocks, we construct our bidirectional attentive graph convolutional network as shown in Fig. 3; the implementation details are given in the next section.
3.3 Context-aware Graph Focusing and Diffusion
In this section, we illustrate an exemplar of our focusing and diffusion mechanism in detail, as shown in Fig. 4. First, we introduce the attentive focusing module and then the temporal context-aware module. After learning the spatial-temporal contextual information with a latent node, we develop a dynamic diffusion to convey the global context back to the spatial nodes in each frame.
3.3.1 Dynamic Attentive Focusing
The purpose of the focusing stage is to learn a latent node that acts as a transformer, carrying information between nodes in a single-frame graph beyond the spatial neighborhood restrictions. Besides, this latent transformer node endows the model with the ability to selectively focus on the informative joints in each frame.
Specifically, given an input $\hat{\mathbf{X}}_t \in \mathbb{R}^{N \times C}$, our attentive focusing learns a latent node $\mathbf{z}_t$, which satisfies

$$\mathbf{z}_t = \boldsymbol{\alpha}_t\, \theta(\hat{\mathbf{X}}_t) \qquad (4)$$

where $\theta(\cdot)$ is a convolution operation for node embedding and $\boldsymbol{\alpha}_t \in \mathbb{R}^{1 \times N}$ is a learnable matrix that focuses the information over a single-frame graph. This is equivalent to an attention mechanism that combines features from all nodes of a spatial graph. In our spatial-temporal graphs, it is also temporally dynamic: each spatial graph in a sequence has its own attention, yielding a series of global spatial contexts. Through this attentive focusing, the model learns how to selectively transfer features to the latent node and receives their feedback in the diffusion stage.
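As a sketch of Eq. (4), the module below predicts one attention weight per joint and frame with a $1 \times 1$ convolution and pools the embedded node features into a per-frame latent node. The softmax normalization of $\boldsymbol{\alpha}_t$ is our assumption about how the attention is constrained; the names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveFocus(nn.Module):
    """Sketch of Eq. (4): z_t = alpha_t * theta(X_t). One attention
    vector over the N joints is predicted per frame, so the latent
    node is temporally dynamic."""

    def __init__(self, in_channels, latent_channels):
        super().__init__()
        self.theta = nn.Conv2d(in_channels, latent_channels, kernel_size=1)
        # One attention logit per joint, per frame.
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):
        # x: (B, C, T, N)
        feats = self.theta(x)                      # (B, C', T, N)
        alpha = F.softmax(self.score(x), dim=-1)   # (B, 1, T, N), sums to 1 over joints
        z = (alpha * feats).sum(dim=-1)            # (B, C', T): one latent node per frame
        return z, alpha
```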
3.3.2 Temporal Context-aware Module
We exploit the temporal dependencies between the learned latent nodes in the graph sequence via a Context-Aware Module (CAM) that produces the global spatial-temporal context. Once each spatial graph obtains its latent node $\mathbf{z}_t$, the remaining problem is how to model the temporal contextual information based on them. Long Short-Term Memory (LSTM) [8] has been validated for its strength in modeling the dependencies and dynamics of sequential data; hence, we use it to build our context-aware module. An LSTM cell contains an input gate $\mathbf{i}_t$, a forget gate $\mathbf{f}_t$, an output gate $\mathbf{o}_t$, and an input cell $\mathbf{g}_t$, computed by the following functions:

$$
\begin{aligned}
\mathbf{i}_t &= \sigma\big(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t]\big), \quad
\mathbf{f}_t = \sigma\big(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t]\big),\\
\mathbf{o}_t &= \sigma\big(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t]\big), \quad
\mathbf{g}_t = \tanh\big(\mathbf{W}_g [\mathbf{h}_{t-1}, \mathbf{x}_t]\big),
\end{aligned} \qquad (5)
$$

where $\mathbf{h}_{t-1}$ is the hidden state at the previous time step, $\mathbf{W}_{*}$ are transformation matrices with learnable parameters, and $\sigma(\cdot)$ and $\tanh(\cdot)$ are the activation functions. The input $\mathbf{x}_t = \mathrm{FC}(\mathbf{z}_t)$ is the projected latent node feature, obtained with a weight matrix. The memory cell $\mathbf{m}_t$ and the hidden state $\mathbf{h}_t$ are updated by:

$$\mathbf{m}_t = \mathbf{f}_t \odot \mathbf{m}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t, \qquad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{m}_t) \qquad (6)$$

where $\odot$ represents the element-wise multiplication operator.
Considering the directed message passing, we stack two bidirectional LSTM layers as our temporal context-aware module. Thus, the learned spatial-temporal context can be formulated as $\mathbf{c}_t = [\overrightarrow{\mathbf{h}}_t \,\|\, \overleftarrow{\mathbf{h}}_t]$, the concatenation of the forward and backward hidden states.
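A possible realization of the CAM is sketched below, assuming the FC projection of Eq. (2) feeds a two-layer bidirectional `nn.LSTM` whose forward and backward hidden states form $\mathbf{c}_t$. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ContextAwareModule(nn.Module):
    """Sketch of the CAM: project the per-frame latent nodes with a
    fully-connected layer, then run two stacked bidirectional LSTM
    layers; forward and backward hidden states are concatenated to
    form the spatial-temporal context c_t."""

    def __init__(self, latent_channels, hidden_channels):
        super().__init__()
        self.fc = nn.Linear(latent_channels, hidden_channels)
        self.lstm = nn.LSTM(hidden_channels, hidden_channels,
                            num_layers=2, bidirectional=True,
                            batch_first=True)

    def forward(self, z):
        # z: (B, C', T) latent nodes from the focusing stage.
        x = self.fc(z.transpose(1, 2))     # (B, T, H)
        c, _ = self.lstm(x)                # (B, T, 2H): [h_fwd || h_bwd]
        return c
```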
3.3.3 Dynamic Attentive Diffusion
In the diffusion stage, we expect the network to learn how much spatial-temporal context should be passed back when updating the node features. With this diffusion, a node in a single-frame graph can receive information from both inter-frame and intra-frame nodes, which means the learned latent node indeed acts as a transformer.
In detail, the network first learns a weight matrix $\boldsymbol{\beta}_t$ that conveys the global context to every spatial node in each frame. Each node then absorbs the diffused contextual information, which is concatenated with its node embedding $\hat{\mathbf{X}}_t$. The combined node features are embedded via a convolution operation with learnable weights $\mathbf{W}_d$. Hence, the diffusion is formulated as:

$$\tilde{\mathbf{X}}_t = \mathbf{W}_d \big[\hat{\mathbf{X}}_t \,\|\, \boldsymbol{\beta}_t\, \mathbf{c}_t\big] \qquad (7)$$

where $\|$ denotes concatenation along the channel dimension.
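A sketch of Eq. (7) follows, under the assumption that $\boldsymbol{\beta}_t$ is predicted per joint from the node features by a $1 \times 1$ convolution and squashed by a sigmoid; names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveDiffusion(nn.Module):
    """Sketch of Eq. (7): each joint learns a weight beta deciding how
    much global context to absorb; the weighted context is concatenated
    with the node embedding and fused by a 1x1 conv W_d."""

    def __init__(self, in_channels, context_channels, out_channels):
        super().__init__()
        # One diffusion weight per joint and frame.
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.fuse = nn.Conv2d(in_channels + context_channels,
                              out_channels, kernel_size=1)

    def forward(self, x, c):
        # x: (B, C, T, N) node embeddings; c: (B, T, C_ctx) context.
        B, C, T, N = x.shape
        beta = torch.sigmoid(self.score(x))           # (B, 1, T, N)
        ctx = c.transpose(1, 2).unsqueeze(-1)         # (B, C_ctx, T, 1)
        ctx = beta * ctx.expand(-1, -1, -1, N)        # broadcast to every joint
        return self.fuse(torch.cat([x, ctx], dim=1))  # (B, C_out, T, N)
```

Chaining these pieces, the focus-graph convolution of Eq. (1), `AttentiveFocus`, `ContextAwareModule`, `AttentiveDiffusion`, and the diffusion-graph convolution of Eq. (3), reproduces one building block of Sec. 3.2.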
4 Experiments
To evaluate the effectiveness of our approach, we conduct extensive experiments comparing with state-of-the-art methods on two challenging benchmarks: NTU-RGB+D [28] and Skeleton-Kinetics [39]. Considering limited computing resources, we choose the relatively smaller NTU-RGB+D dataset to analyze the effect of the components in our Bidirectional Attentive Graph Convolutional Network (BAGCN).
4.1 Datasets and Setting
NTU-RGB+D: NTU-RGB+D [28] is collected by Microsoft Kinect v2 with joint annotations. It consists of 56,880 video samples from 60 categories of human actions. 40 distinct subjects perform the actions, which are captured simultaneously by three cameras at different horizontal angles: $-45°$, $0°$, and $45°$. Each human pose is articulated as 25 joints whose coordinates serve as annotations. Evaluation on this dataset follows two standard protocols: Cross-Subject (X-Sub) and Cross-View (X-View). In the former, the samples of half of the 40 subjects constitute the training set and the rest are used for testing. In the latter, the training and testing groups are split by camera view, where the training group has 37,920 video samples from cameras 2 and 3. Only top-1 accuracy is reported for both protocols. Data preprocessing follows [29, 28].
Skeleton-Kinetics: The DeepMind Kinetics [11] human action video dataset contains 400 human action classes, with at least 400 video clips per action taken from the Internet. Its activity categories cover various indoor and outdoor daily actions, such as driving a car and dyeing hair. The Skeleton-Kinetics data were obtained by estimating the locations of 18 articulated joints on every frame with the publicly available OpenPose [2] toolbox. The annotation of each joint consists of its 2D coordinates and a 1D confidence score. We follow the same data preparation strategy as ST-GCN [39] for a fair comparison and select two bodies from multi-person cases according to the highest average joint confidence. All samples are resized to 300 frames by padding. Top-1 and top-5 classification accuracies serve as the evaluation metrics, with 240,000 and 20,000 clips for training and testing, respectively.
Implementation details: Here we describe how our Bidirectional Attentive Graph Convolutional Network (BAGCN) is implemented and trained. Following [30], we concatenate joints and bones along the channel dimension and feed them into the network for almost all experiments. Our BAGCN consists of 9 building blocks, and the first layer has 64 output channels. In the 4th and 7th blocks, we double the number of channels while downsampling the temporal length by a factor of 2, the same as in ST-GCN [39]. A data BN layer normalizes the input at the beginning, and a global average pooling layer after the last block generates a 256-dimensional vector for each sequence, which is then classified by a softmax classifier. For the context-aware focusing and diffusion, the latent node and context dimensions match the input and output channels of the corresponding blocks; the focusing weights $\boldsymbol{\alpha}$ and the diffusion weights $\boldsymbol{\beta}$ share parameters and are learned directly from the node features using a $1 \times 1$ convolution. Stochastic gradient descent with momentum 0.9, weight decay 0.0001, and a cross-entropy loss are used for training. We initialize the base learning rate as 0.1 and decay it by a factor of 10 twice during training, over 50 epochs in total for NTU-RGB+D and 65 epochs for Skeleton-Kinetics.
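The optimizer settings translate directly into PyTorch. In the sketch below, `BAGCN()` is a hypothetical constructor standing in for the 9-block network described above, and the decay milestones are placeholders, since the exact decay epochs are not recoverable from this copy of the text.

```python
import torch

model = BAGCN()  # hypothetical constructor for the 9-block network

# From the paper: SGD with momentum 0.9, weight decay 1e-4,
# base learning rate 0.1 decayed by a factor of 10, cross-entropy loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 40], gamma=0.1)  # placeholder milestones
criterion = torch.nn.CrossEntropyLoss()
```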
Besides, we also explore the spatial and motion information as in DGNN [29]. The motion stream takes the movements of joints and the deformations of bones as input and is exploited only in the last part of the ablation study; all other experiments use the spatial stream alone for a fair comparison.
4.2 Comparison with State-of-the-art Methods
Table 1: Comparison with state-of-the-art methods on the NTU-RGB+D dataset (top-1 accuracy).

Methods  Cross-Subject  Cross-View
Lie Group [35]  50.1%  82.8%
HRNN [3]  59.1%  64.0%
Deep LSTM [28]  60.7%  67.3%
ST-LSTM [21]  69.2%  77.7%
STA-LSTM [32]  73.4%  81.2%
Temporal Conv [13]  74.3%  84.1%
Clip+CNN+MTLN [12]  79.6%  84.8%
Synthesized CNN [22]  80.0%  87.2%
ST-GCN [39]  81.5%  88.3%
Two-Stream CNN [16]  83.2%  89.3%
SR-TSL [31]  84.8%  92.4%
HCN [17]  86.5%  91.1%
AS-GCN [18]  86.8%  94.2%
2s-AGCN [30]  88.5%  95.1%
DGNN [29]  89.9%  96.1%
BAGCN (Ours)  90.3%  96.3%
To validate the superior performance of BAGCN, we compare it with state-of-the-art methods on both the NTU-RGB+D and Skeleton-Kinetics datasets.
For the former, we report the Cross-Subject and Cross-View results of all compared methods in Tab. 1. These approaches can be divided into handcrafted-feature-based [35], RNN-based [3, 28, 21, 32], CNN-based [13, 17, 16, 12, 22], and recent GCN-based [39, 31, 18, 30, 29] methods. Our model outperforms all previous methods. Compared with our baseline ST-GCN, the context-aware focusing and diffusion mechanism improves performance by a great margin, from 81.5% to 90.3% on Cross-Subject and from 88.3% to 96.3% on Cross-View. This performance gain verifies the superiority and effectiveness of BAGCN; the effect of each component is explored in the ablation study below.
For the Skeleton-Kinetics dataset, we compare with other methods in terms of top-1 and top-5 accuracies in Tab. 2. Our BAGCN again achieves the best performance, consistent with the results on NTU-RGB+D. From these two tables, we find that graph-based methods exploit kinematic dependencies better than RNN/CNN-based ones, and that our focusing and diffusion mechanism achieves message passing between intra-frame and inter-frame joints for better recognition accuracy.
Table 2: Comparison with state-of-the-art methods on the Skeleton-Kinetics dataset.

Methods  Top-1  Top-5
Feature Enc. [4]  14.9%  25.8%
Deep LSTM [28]  16.4%  35.3%
Temporal Conv [13]  20.3%  40.0%
ST-GCN [39]  30.7%  52.8%
AS-GCN [18]  34.8%  56.5%
2s-AGCN [30]  36.1%  58.7%
DGNN [29]  36.9%  59.6%
BAGCN (Ours)  37.3%  60.2%
4.3 Ablation Study
In this section, we evaluate the effect of each component in our bidirectional attentive graph convolutional network. Specifically, we analyze each component by testing the recognition performance of BAGCN without that part on NTU-RGB+D. Besides, we also explore the impact of different input modalities. Compared with the X-View protocol, X-Sub requires a model to recognize actions performed by unseen subjects and is therefore more challenging; we use it for the ablations.
Table 3: Ablation study on different focusing strategies, evaluated on the NTU-RGB+D Cross-Subject protocol.

Methods  Accuracy
ST-GCN [39]*  81.5%
ST-GCN (joints+bones)  84.8%
A-link GCN†  83.2%
S-link GCN†  84.2%
AS-GCN [18]  86.1%
BAGCN (wo/F)  86.4%
BAGCN (max)  87.4%
BAGCN (avg)  87.8%
BAGCN (att)  89.4%

* Original result reported in [39], using joint information only.
† Actional links (A-link) and structural links (S-link) learned in [18].
4.3.1 Attentive Focusing and Diffusion
We analyze the importance of our attentive focusing and diffusion by exploring different focusing strategies. First, we remove the mechanism from our Bidirectional Attentive Graph Convolutional Network (BAGCN), and then we investigate three variants of the focusing stage; the results are shown in Tab. 3.
Without attentive focusing and diffusion. With concatenated joint and bone information, the original ST-GCN [39] performance increases to 84.8%, which serves as our baseline. Under our graph construction, BAGCN without focusing and diffusion degrades to a bidirectional ST-GCN, denoted BAGCN (wo/F): messages first flow from the center joint out to the peripheral joints and then flow back after the node updates. This message passing scheme improves performance from 84.8% to 86.4%, indicating the necessity of a joint node receiving information from spatially non-neighboring joints. AS-GCN [18] also captures intra-frame dependencies beyond the physical connections to update node embeddings; it learns actional links (A-link) and structural links (S-link) simultaneously, which causes difficulty in optimization. The limited performance of AS-GCN (86.1%) illustrates this problem and, in turn, demonstrates the effectiveness and efficiency of our proposed message passing. It also indicates that dense dependencies make it hard for the network to overcome noise from irrelevant connections.
With other focusing and diffusion strategies. In the focus stage, a latent transformer node first aggregates the spatial contextual information within a frame and then captures the temporal dynamics over frames to form the spatial-temporal context; this global context is conveyed back to each spatial skeleton graph in the diffusion stage. To prove the importance of attentive focusing, we investigate two further focusing strategies, average pooling and max pooling, with recognition performance shown in the last three rows of Tab. 3. We observe that attentive focusing and diffusion significantly improves performance. In the max-pool configuration, BAGCN takes the embedding of the maximum-response node to initialize the latent node in each frame and then captures the temporal contextual information between those maximum-response nodes. This captured spatial-temporal context is demonstrably useful for augmenting the spatial node embeddings in every frame, as performance increases by 1.0% compared with BAGCN (wo/F). However, it is not reasonable to focus only on the maximum-response node and ignore the other physically or semantically associated joints in an action; hence, the other two configurations bring larger improvements. The average strategy, denoted BAGCN (avg), emphasizes the mean statistics of the joint responses in every graph convolutional layer and achieves 87.8%. Further, BAGCN (att) learns dynamic spatial attention over the graph layers for each frame to selectively pass messages back and forth between informative joints, leading to around 6% performance gain over our ST-GCN baseline. This attentive focusing and diffusion learns a sequence of latent nodes that capture the spatial and temporal context from implicit connections and allows the spatial nodes to learn how much context should be used when updating their embeddings.
4.3.2 Latent Edge Visualization
Different actions may activate different joints while ignoring irrelevant ones. Fig. 5 shows some action samples and their associated joints under our learned latent transformer node. We choose the last graph convolutional layer to display the implicitly sparse dependencies between joint nodes. The attention matrix $\boldsymbol{\alpha}$ measures how informative the connection between the latent transformer node and each joint node is: in the attentive focusing, if the value of $\alpha_i$ is larger than 0.8, the connection between the latent node and joint $i$ is drawn. The connected joints reflect the joint nodes activated by a certain action.
4.3.3 Temporal Context-aware or not
The latent node captures the spatial contextual information within a frame and passes its message back after the node update. In this section, we evaluate the effect of the temporal context-aware module, which extracts the global context along the temporal dimension. First, we remove the context-aware module; then we employ two different LSTM configurations for modeling the temporal dynamics. As shown in Tab. 4, learning a latent node with only the focusing and diffusion graphs obtains 86.6% classification accuracy, merely 0.2% higher than BAGCN (wo/F). In contrast, the results in the last two rows demonstrate the power of the temporal context-aware module: BAGCN (1Ca), with a unidirectional temporal context module, achieves 87.3%, and BAGCN (2Ca), the bidirectional one, obtains 89.4% recognition accuracy, a larger improvement. These gains validate the effectiveness and necessity of capturing spatial-temporal contextual information for action recognition.
Table 4: Ablation study on the temporal context-aware module (NTU-RGB+D Cross-Subject).

Methods  Accuracy
ST-GCN [39]  84.8%
BAGCN (wo/F)  86.4%
BAGCN (wo/Ca)  86.6%
BAGCN (1Ca)  87.3%
BAGCN (2Ca)  89.4%
4.3.4 Two-stream or not
Inspired by previous works [16, 17, 29], we test the supplementary effect of motion information in our bidirectional attentive graph convolutional network. We train the spatial and motion branches separately and then apply late fusion; Tab. 5 reports the results on NTU-RGB+D under the Cross-Subject and Cross-View protocols, together with Skeleton-Kinetics. The spatial branch achieves higher classification performance than the motion branch, while fusing them further improves accuracy from 89.4% to 90.3% on Cross-Subject and from 95.6% to 96.3% on Cross-View.
Table 5: Two-stream fusion results on NTU-RGB+D (X-Sub, X-View) and Skeleton-Kinetics (Top-1, Top-5).

Methods  X-Sub  X-View  Top-1  Top-5
Spatial  89.4%  95.6%  36.4%  58.9%
Motion  86.3%  94.1%  31.8%  54.9%
Fusion  90.3%  96.3%  37.3%  60.2%
5 Conclusion
In this paper, we introduce a focusing and diffusion mechanism that enhances graph convolutional networks to receive information beyond the spatial and temporal neighborhood restrictions. Besides, we propose a context-aware module to capture the global spatial-temporal contextual information, modeling the implicit correlations between inter-frame and intra-frame joints. We then develop a Bidirectional Attentive Graph Convolutional Network (BAGCN) as an exemplar, a graph-based model enhanced with our context-aware focusing and diffusion mechanism. Its superior performance is demonstrated on two challenging public skeleton-based action recognition benchmarks: NTU-RGB+D and Skeleton-Kinetics.
References
 [1] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 [2] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008, 2018.

 [3] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1110–1118, 2015.
 [4] Basura Fernando, Efstratios Gavves, Jose M Oramas, Amir Ghodrati, and Tinne Tuytelaars. Modeling video evolution for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5378–5387, 2015.
 [5] Tong He, Haibin Huang, Li Yi, Yuqian Zhou, Chihao Wu, Jue Wang, and Stefano Soatto. Geonet: Deep geodesic networks for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6888–6897, 2019.
 [6] Tong He and Stefano Soatto. Mono3d++: Monocular 3d vehicle detection with twoscale 3d hypotheses and task priors. In AAAI, 2019.
 [7] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graphstructured data. arXiv preprint arXiv:1506.05163, 2015.
 [8] Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.

 [9] Mohamed E Hussein, Marwan Torki, Mohammad A Gowayyed, and Motaz El-Saban. Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
 [10] Yu-Gang Jiang, Qi Dai, Wei Liu, Xiangyang Xue, and Chong-Wah Ngo. Human action recognition in unconstrained videos by explicit motion modeling. IEEE Transactions on Image Processing, 24(11):3781–3795, 2015.
 [11] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
 [12] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3288–3297, 2017.
 [13] Tae Soo Kim and Austin Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW), pages 1623–1631. IEEE, 2017.
 [14] Thomas Kipf, Ethan Fetaya, KuanChieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.
 [15] Thomas N Kipf and Max Welling. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

 [16] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Skeleton-based action recognition with convolutional neural networks. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 597–600. IEEE, 2017.
 [17] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055, 2018.
 [18] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actionalstructural graph convolutional networks for skeletonbased action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3595–3603, 2019.
 [19] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

 [20] Fang Liu, Xiangmin Xu, Shuoyang Qiu, Chunmei Qing, and Dacheng Tao. Simple to complex transfer learning for action recognition. IEEE Transactions on Image Processing, 25(2):949–960, 2015.
 [21] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
 [22] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
 [23] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 [24] Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 935–942. IEEE, 2009.

 [25] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
 [26] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
 [27] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
 [28] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1010–1019, 2016.
 [29] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeletonbased action recognition with directed graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7912–7921, 2019.
 [30] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Twostream adaptive graph convolutional networks for skeletonbased action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12026–12035, 2019.
 [31] Chenyang Si, Ya Jing, Wei Wang, Liang Wang, and Tieniu Tan. Skeletonbased action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–118, 2018.

 [32] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [33] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
 [34] Lorenzo Torresani, Aaron Hertzmann, and Christoph Bregler. Learning nonrigid 3d shape from 2d motion. In Advances in Neural Information Processing Systems, pages 1555–1562, 2004.
 [35] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 588–595, 2014.
 [36] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1290–1297. IEEE, 2012.
 [37] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
 [38] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 [39] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeletonbased action recognition. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [40] Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L Berg, and Dimitris Samaras. Twoperson interaction detection using bodypose features and multiple instance learning. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 28–35. IEEE, 2012.