1 Introduction
Human action recognition is one of the most challenging and important computer vision tasks. It has received considerable attention from both academia and industry, due to its wide application in video surveillance, virtual reality, human-computer interaction and robotics. Classical studies investigated action recognition based on monocular RGB videos [1, 2, 3, 4, 5, 6], from which it is difficult to comprehensively represent actions in 3D space. Given the fast development of low-cost devices to capture 3D data (e.g. camera arrays and Kinect), an increasing amount of research is being conducted on 3D action recognition [7, 8, 9, 10, 11].
Skeletons generated from depth maps are often invariant to viewpoint or appearance. Most importantly, as a high-level abstraction of human actions, skeleton data greatly simplifies the representation and understanding of different action categories. Recently, different techniques have been developed to estimate the temporal motion of the skeleton by tracking and analyzing the motion of human joints. One straightforward approach is to concatenate the coordinates of joints into a long feature vector at each time step, and then feed it into temporal analysis models [12, 13]. However, the spatial relationship between joints, an indispensable part of the human skeleton, is neglected. To exploit connections between joints, [14] used a covariance matrix of skeleton joint locations over time as a discriminative descriptor; [12] proposed to represent an action by a weighted summation of actionlets. To automatically capture the patterns embedded in the spatial configuration and temporal dynamics of joints, [15] proposed spatial-temporal graph convolutional networks to conduct convolution on the human skeleton in both spatial and temporal domains [16, 17, 18, 19, 20]. [21] proposed a two-stream CNN to process both raw coordinates and motion data obtained by subtracting joint coordinates in consecutive frames; [22] designed a model that first transforms a skeleton sequence into three clips corresponding to the three cylindrical coordinates of the sequence, and then applies a deep CNN.
These methods have brought impressive performance improvements in action recognition. They mostly focus on human joints, since movements originate at joints. But it is also instructive to note that a joint connects two bones, whose shapes, lengths and positions dictate how the human body moves in practice. In most cases, some joints without obvious changes of coordinates in 3D space are still able to drive significant motion of bones, e.g. the hip and shoulder joints during walking or running. On the other hand, when we humans identify the actions of a person, we may ignore movements of particular joints, but we always notice the obvious motions of bones. Hence, rather than only calculating subtle changes of body joints, the movements magnified on bones deserve more attention.
In this paper, we revisit skeletal data from the perspective of bones, and propose an edge convolutional neural network for action recognition. Different from conventional graph convolutional neural networks that concentrate on graph nodes [22, 23, 15], we focus on the information carried by graph edges (e.g. bones generated from skeleton data), while graph nodes only indicate connections between edges. We develop graph edge convolution, which represents an edge by learning weights to integrate its spatial neighboring edges (see Fig. 1). To handle temporal changes of skeleton data in continuous video frames, we extend graph edge convolution by taking temporal neighboring edges into consideration. In addition, graph node and edge convolutions process skeleton data from two different perspectives, and they are complementary to each other. We design two different hybrid networks to integrate the results of these two types of graph convolution: one uses a shared dense layer and the other adds two shared convolutional layers. We evaluate the proposed graph edge CNN on two datasets, Kinetics and NTU-RGB+D. The results suggest that the proposed graph edge convolution has advantages over existing state-of-the-art techniques, and that combining graph edge and node convolutions can lead to further performance improvement.
The organization of this paper is as follows. In Section 2, we review related work. In Section 3, we introduce our graph edge convolutional neural networks as well as two hybrid models integrating node and edge convolution. In Section 4, the experimental results are presented. Finally, we draw some conclusions in Section 5.
2 Related Work
In this section, we will briefly review related works on skeleton based action recognition and graph convolutional neural networks.
2.1 Graph Convolutional Networks
Generalizing convolutional networks from images to graph-structured data has been an active topic in recent years [18, 19, 21, 20]. These works mainly fall into two categories. 1) Spectral perspective, which converts graph data into its spectrum and applies CNNs in the spectral domain. For example, [18] defines a spectral convolutional layer to apply filters in the frequency domain; [19] develops an extension of spectral networks which incorporates a graph estimation procedure, and applies the model to the ImageNet dataset. There are many similar works in this stream [24, 16, 25], but a fundamental limitation of the spectral method must be stressed: the spectral construction is limited to a single domain. The reason is that the spectral filter coefficients are basis dependent. If we learn a filter with respect to a certain basis, this filter cannot be applied to another domain with a different basis. This problem can be solved if we can construct compatible orthogonal bases across different domains. However, such a construction requires knowledge of some correspondence between domains, which is extremely hard to obtain in most cases. For example, in computer graphics applications, finding correspondence between shapes is itself a very hard problem. 2) Spatial perspective. Works in this stream directly design the convolution operation in the spatial domain, which closely resembles convolution on images [16, 17]. Applying graph CNNs designed in the spatial domain to computer vision problems is also attracting more and more attention: [26] utilizes polynomials of functions of the graph adjacency matrix as filters to conduct graph convolution, and applies it to image classification and 3D mesh classification; [27] designs depthwise separable graph convolution and applies it to image classification on the CIFAR datasets; [28] formulates a convolution-like operation on graph signals performed in the spatial domain, where filter weights are conditioned on edge labels and dynamically generated for each specific input sample. Our work also proposes a new graph convolution model, and we apply it to human action recognition.
2.2 Skeleton Based Human Action Recognition
The development of highly accurate depth sensors and pose estimation algorithms [29, 30] makes it easy to obtain skeletal data of human actions, and skeleton based human action recognition is attracting more and more interest. The work on this topic mainly follows two streams. 1) Handcrafted features. Work belonging to this stream leverages the dynamics of joint motion through handcrafted features. For example, [14] used a covariance matrix of skeleton joint locations over time as a discriminative descriptor for a sequence, and used multiple covariance matrices to encode the temporal dependency of joint locations; [12] used an actionlet ensemble obtained by data mining to represent actions, and designed the LOP feature to overcome intra-class variance caused by the imperfection of raw data; [31] utilized histograms of 3D joint locations (HOJ3D) as a compact representation of postures, then reprojected HOJ3D using LDA and clustered it into k posture visual words; [32] proposed to explicitly model 3D geometric relationships between various body parts using rotations and translations in 3D space to represent the human skeleton, and then modeled human actions as curves in a Lie group; [33] describes a new 3D saliency prediction model, in which salient 3D space-time regions in a video are detected and segmented, and then the saliency strength of each segment is calculated using different attributes including motion, disparity, texture, and the predicted degree of visual discomfort experienced. 2) Deep learning approaches, which process the skeletal data with deep learning methods. [34] divides the human skeleton into five parts and feeds them into five RNNs; [35] proposes a new RNN structure to model the long-term temporal correlation of the features for each body part; [36] extends RNNs to spatio-temporal domains to analyze the hidden sources of action related information within the input data over both domains concurrently; [37] proposes a fully connected deep LSTM for skeleton based recognition; [38] selects a set of geometric features to describe joint relations and applies an LSTM to them. Besides RNN and LSTM, CNNs have also been explored for recognizing human actions: [21] proposes a two-stream CNN in which one stream's input is the raw coordinates and the other stream's input is motion data obtained by subtracting joint coordinates in each pair of consecutive frames; [22] first transforms a skeleton sequence into three clips corresponding to the three cylindrical coordinates of the sequence, and then applies a deep CNN to them; [23] proposes temporal convolutional neural networks to explicitly learn readily interpretable spatio-temporal representations for 3D human action recognition; [15] first constructs a spatial-temporal human skeleton graph, and then applies CNNs. Our model is also a CNN based model that utilizes skeletal data. But unlike the existing models, our convolution is performed on edges instead of nodes.
3 Graph Edge Convolutional Neural Networks
In this section, we first review classical convolution operation conducted on images, then we move to our graph edge convolution and graph edge convolutional neural networks as well as its application to skeleton based action recognition. Finally, two hybrid models in which we integrate graph edge convolution and graph node convolution into one single neural network are introduced.
3.1 Classical Image Convolution
To make our graph edge convolution more understandable and to exhibit its connection with image convolution, we first present a reformulation of classical convolution on image pixels. Given a grayscale image, we center the convolution kernel at pixel $p_{ij}$, where $i$ and $j$ denote the row index and column index, respectively. The convolution output can be derived as follows:
$$ f_{out}(p_{ij}) = \sum_{p_{mn} \in N(p_{ij})} f_{in}(p_{mn}) \, \omega(L(p_{mn})) \tag{1} $$
where $f_{out}(p_{ij})$ is the output of this convolution operation, $N(p_{ij})$ represents the set of neighboring pixels of $p_{ij}$, and $p_{mn}$ denotes the pixels in $N(p_{ij})$. In practice, for a grayscale image, $N(p_{ij})$ is usually composed of the eight pixels next to $p_{ij}$ and $p_{ij}$ itself, which is exactly the receptive field of this convolution. $L$ is a labeling function that assigns a different order to each individual element in $N(p_{ij})$. For example, given a $3 \times 3$ convolutional kernel on a grayscale image, $L$ will assign $1, 2, \dots, 9$ respectively to the nine pixels in $N(p_{ij})$ from left to right and top to bottom. The weight function $\omega$ gives each pixel a weight according to the order of this pixel. The value of each pixel is then multiplied by its corresponding weight, and the sum of these weighted pixel values is regarded as the convolution output at the center pixel. A similar convolution operation can be applied to RGB color images. Since there are three different channels in an RGB image, the value of each pixel is a three-dimensional feature vector and its corresponding weight becomes a three-dimensional weight vector as well. The multiplication between pixels and weights is extended to an inner product.
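The reformulation above can be illustrated with a minimal Python sketch. It treats a 3×3 convolution as a weighted sum in which each neighbor is first given a label from 1 to 9 (left to right, top to bottom) and the label selects the weight. The function name `conv_at_pixel` and the choice to skip out-of-range neighbors at the image border are assumptions made for illustration; the text does not discuss border handling.

```python
def conv_at_pixel(image, weights, i, j):
    """Weighted sum over the 3x3 neighborhood of pixel (i, j).

    image   : list of rows of numbers (a grayscale image)
    weights : list of 9 numbers; weights[k - 1] belongs to label k
    Neighbors that fall outside the image are skipped (an assumption).
    """
    total = 0.0
    label = 0  # labels 1..9 run left to right, top to bottom
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            label += 1
            m, n = i + di, j + dj
            if 0 <= m < len(image) and 0 <= n < len(image[0]):
                total += image[m][n] * weights[label - 1]
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box_filter = [1.0] * 9  # every label receives weight 1
print(conv_at_pixel(image, box_filter, 1, 1))  # 45.0, the sum of all nine pixels
```

Because the weight is looked up through the label rather than through the pixel position, the same code structure carries over to graphs once a different labeling function is supplied.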
Convolutional neural networks (CNNs) have achieved impressive performance in various scenarios, e.g. image classification and object detection [39, 40, 41, 42, 43, 44, 45]. However, it is not straightforward to apply pixel convolution to graph data. This is because graph data does not have the gridded array structure of image, video, and signal data. Each pixel in a gridded image has the same number of neighbors and the same relationship to a neighbor in a given direction. Non-gridded graphs are not subject to these constraints. A non-gridded graph can vary in the number of neighbors from vertex (edge) to vertex (edge), and there is not necessarily a geometrical interpretation for any given connection between two vertices (edges).
3.2 Graph Edge Convolution
We represent a graph by $G = (V, E)$, in which $V = \{v_i \mid i = 1, \dots, N\}$ is the node set with $N$ elements, and $E = \{e_{ij} \mid v_i \text{ and } v_j \text{ are connected}\}$ is the edge set containing all edges of $G$. Several concepts have to be introduced to facilitate the explanation of graph convolution. 1) Path between two edges: A path is a set of distinct nodes and edges connecting two edges in a graph (Fig. 3a). More than one path may exist between two edges. 2) Length of path between two edges: Given a specific pair of edges, more than one path may exist between the two edges, and these different paths contain different numbers of nodes and edges. For a certain path, the number of nodes contained in it is defined as the length of the path. Paths with different lengths between two specific edges are illustrated in Fig. 3b. 3) Shortest path between two edges: Given a certain pair of edges, among all paths connecting the two edges, the path with the smallest length is defined as the shortest path between the two edges. An illustration of the shortest path is given in Fig. 3b. 4) Distance between two edges: Given a certain pair of edges $e_{ij}$ and $e_{mn}$, the length of the shortest path is defined as the distance between these two edges, denoted as $D(e_{ij}, e_{mn})$. An illustration of the distance between a specific pair of edges is given in Fig. 3b.
To conduct the convolution operation, we need a receptive field, which is also called a neighborhood in the following. Here we define the neighborhood of a certain edge $e_{ij}$ as $N(e_{ij}) = \{e_{mn} \mid e_{mn} \in E \text{ and } D(e_{ij}, e_{mn}) \le R\}$, where $D(e_{ij}, e_{mn})$ is the distance between root edge $e_{ij}$ and edge $e_{mn}$, and $R$ is an integer denoting the maximal distance between the root edge and neighboring edges. Thus we can also express the neighborhood of a root edge as the edges within a certain distance from the root edge. In our convolution, we set $R$ to 1 (Fig. 4a), and the resulting neighborhood only contains the edges that are directly connected with the root edge by one node, i.e. in our convolution, the neighborhood of a certain edge is $N(e_{ij}) = \{e_{mn} \mid e_{mn} \in E \text{ and } D(e_{ij}, e_{mn}) \le 1\}$.
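The distance and neighborhood definitions above can be sketched in a few lines of Python. The sketch runs a breadth-first search over edges, counting one step each time two edges share a node, so the returned value equals the number of nodes on the shortest path. The function names, the toy graph, and the convention that an edge is at distance 0 from itself (so the root belongs to its own neighborhood, as the center pixel does in image convolution) are assumptions for illustration.

```python
from collections import deque


def edge_distance(edges, src, dst):
    """Number of nodes on the shortest path between edges src and dst.

    Edges are (u, v) tuples. Two edges that share a node are at distance 1.
    An edge is at distance 0 from itself (an assumed convention).
    Returns None if the two edges are not connected.
    """
    if src == dst:
        return 0
    queue = deque([(src, 0)])
    seen = {src}
    while queue:
        edge, dist = queue.popleft()
        for other in edges:
            if other not in seen and set(edge) & set(other):
                if other == dst:
                    return dist + 1
                seen.add(other)
                queue.append((other, dist + 1))
    return None


def neighborhood(edges, root, max_dist=1):
    """Edges within max_dist of root, including root itself."""
    result = []
    for e in edges:
        d = edge_distance(edges, root, e)
        if d is not None and d <= max_dist:
            result.append(e)
    return result


# A toy chain 0-1-2-3 with one branch 1-4 (not a real skeleton).
toy_edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
print(edge_distance(toy_edges, (0, 1), (2, 3)))  # 2: the path passes nodes 1 and 2
print(neighborhood(toy_edges, (0, 1)))           # [(0, 1), (1, 2), (1, 4)]
```

With `max_dist=1` the neighborhood reduces to the edges that touch the root edge at one of its two end nodes, which is the setting used in the text.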
The convolution at a certain edge is a weighted summation over all the neighboring edges. Formally, the output of the convolution operation at edge $e_{ij}$ can be written as (Fig. 1):
$$ f_{out}(e_{ij}) = \sum_{e_{mn} \in N(e_{ij})} f_{in}(e_{mn}) \, \omega(L(e_{mn})) \tag{2} $$
The $f_{out}(e_{ij})$ on the left of eq. 2 is the output of the convolution conducted at edge $e_{ij}$, and on the right of the equation is a weighted summation, in which $e_{mn}$ represents the neighboring edges being summed and $\omega(L(e_{mn}))$ denotes the weight assigned to edge $e_{mn}$.
As the neighboring edges in a graph do not have the implicit spatial order that neighboring pixels in an image have, to assign weights to neighboring edges we first have to assign an order to them, and then match each edge with a weight value according to its order. The labeling function $L$ is the function that assigns this order to the neighboring edges. For each edge $e_{mn}$ in the neighborhood, the labeling function assigns a labeling value $L(e_{mn})$ to it, denoting the order of this edge, and the weight assigned to $e_{mn}$ depends on the labeling value $L(e_{mn})$. Unlike in images, in which each pixel has a fixed number of neighboring pixels, edges in graphs do not have a fixed number of neighboring edges. Thus we cannot prepare a fixed number of weights to assign to edges, as the number of neighboring edges may vary. So, instead of giving each neighboring edge a unique labeling value, our labeling function maps neighboring edges into a fixed number $K$ of subsets, and edges in the same subset have the same labeling value, i.e. $L: N(e_{ij}) \to \{1, \dots, K\}$. Every edge in the neighborhood is labeled with an integer ranging from 1 to $K$, and this integer is the edge's order, which determines which weight value is assigned to this edge. Thus, even if the number of neighboring edges is not fixed, we can always assign them weight values because the edges are always divided into $K$ subsets. $\omega$ in eq. 2 is the weight function that assigns weights to edges according to their labeling values, and edges with the same labeling value are given the same weight. After the edge features are multiplied by their weights, the summation is conducted over all of the neighboring edges, which completes the convolution operation.
Similar to image convolution mentioned above, by replacing edge feature and weight with edge feature vector and weight vector, we can easily extend our model to graphs with multiple channels.
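To balance the contribution of edges with different weights, we add a normalizing term to eq. 2:
$$ f_{out}(e_{ij}) = \sum_{e_{mn} \in N(e_{ij})} \frac{1}{Z_{ij}(e_{mn})} f_{in}(e_{mn}) \, \omega(L(e_{mn})) \tag{3} $$
$Z_{ij}(e_{mn})$ is the number of neighboring edges with the same labeling value as $e_{mn}$ in edge $e_{ij}$'s neighborhood, i.e.
$$ Z_{ij}(e_{mn}) = \left| \{ e_{kl} \mid e_{kl} \in N(e_{ij}) \text{ and } L(e_{kl}) = L(e_{mn}) \} \right| \tag{4} $$
Thus, when the neighboring edges are not divided evenly into subsets, $Z_{ij}(e_{mn})$ is the term that balances the contribution of neighboring edges with different labeling values. The convolution operation on edges has now been fully described, and we can use it to build our Graph Edge Convolutional Neural Networks.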
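As a hedged illustration of the normalized edge convolution, the following Python sketch computes the output at one root edge for scalar edge features. Each neighbor's feature is multiplied by the weight selected by its label and divided by the number of neighbors sharing that label. The function name, the dictionary-based data layout, and the example values are assumptions; a practical implementation would use feature vectors and tensor operations.

```python
def edge_conv(features, neighbors, label_of, weights):
    """Normalized edge convolution at one root edge.

    features  : dict edge -> scalar input feature
    neighbors : list of edges in the root edge's neighborhood
    label_of  : dict edge -> label in 1..K (the labeling function)
    weights   : list of K weights; weights[k - 1] belongs to label k
    """
    # Normalizing term: how many neighbors carry each label.
    counts = {}
    for e in neighbors:
        counts[label_of[e]] = counts.get(label_of[e], 0) + 1

    out = 0.0
    for e in neighbors:
        k = label_of[e]
        out += features[e] * weights[k - 1] / counts[k]
    return out


# Three neighbors, two of which share label 1 (made-up values).
features = {"a": 2.0, "b": 4.0, "c": 6.0}
labels = {"a": 1, "b": 1, "c": 2}
print(edge_conv(features, ["a", "b", "c"], labels, [1.0, 0.5]))  # 6.0
```

In the example, the two edges with label 1 are averaged before being weighted (contributing 3.0), so a subset with many members does not dominate a subset with a single member.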
3.3 Graph Edge Convolutional Neural Networks
For each video clip containing human actions, the extracted skeletal data are stored as a sequence of frames. In each frame, human bodies are represented as skeleton graphs, in which nodes denote human joints and edges denote human bones. Our edge convolution can be directly conducted on a single frame of human skeleton graph. But in human action recognition, a sequence of human skeleton graphs has to be dealt with, and skeleton graphs in different frames have to be processed concurrently to capture the dynamics of human body. Thus we have to redefine the neighborhood of an edge to incorporate edges in different frames to get our convolution conducted in both spatial and temporal domains.
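For a certain edge in frame $t$, we denote it as $e_{ij}^{t}$, and the neighboring edges of $e_{ij}^{t}$ are defined from two aspects: 1) in frame $t$, edges connected to $e_{ij}^{t}$ via only one node are its spatial neighbors, which is just the definition of neighborhood mentioned above; 2) in another frame $t'$, if $t'$ is within a certain distance from $t$, i.e. $|t' - t| \le \lfloor \tau / 2 \rfloor$, the spatial neighbors of $e_{ij}^{t'}$ are also regarded as neighbors of $e_{ij}^{t}$ and are called temporal neighbors of $e_{ij}^{t}$, i.e.
$$ N(e_{ij}^{t}) = \{ e_{mn}^{t'} \mid e_{mn}^{t'} \in E \text{ and } |t' - t| \le \lfloor \tau / 2 \rfloor \text{ and } D(e_{ij}^{t'}, e_{mn}^{t'}) \le 1 \} $$
Here $t$ and $t'$ denote two different frames. $\tau$ is an integer called the temporal kernel size, which restricts the temporal neighbors of $e_{ij}^{t}$ to be no more than $\lfloor \tau / 2 \rfloor$ frames away. The constraint $D(e_{ij}^{t'}, e_{mn}^{t'}) \le 1$ defines spatial neighbors.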
The labeling function defined in Section 3.2 is for spatial neighbors in a single frame. Considering the temporal neighbors, for a root edge in frame $t$ and a neighbor $e_{mn}^{t'}$ in frame $t'$, the labeling function is rewritten as:
$$ L_T(e_{mn}^{t'}) = L(e_{mn}^{t'}) + \left( t' - t + \lfloor \tau / 2 \rfloor \right) \times K \tag{5} $$
Here $L_T$ is the rewritten labeling function, $L$ is the labeling function in a single frame for spatial neighbors, $\tau$ is the temporal kernel size, and $K$ is the number of subsets produced by the labeling function described in Section 3.2. Moreover, adding $\lfloor \tau / 2 \rfloor$ to $t' - t$ ensures that $(t' - t + \lfloor \tau / 2 \rfloor)$ is nonnegative, and multiplying by $K$ at the end ensures that the labeling values for temporal neighbors differ from those of spatial neighbors.
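The following Python sketch illustrates how a single-frame label can be extended to temporal neighbors. It assumes, in the style of ST-GCN [15], that the temporal window is centered on the root frame with a half-width equal to the temporal kernel size divided by two (rounded down), and that single-frame labels run from 1 to the number of subsets. Both points, as well as the function and parameter names, are assumptions rather than details confirmed by the text.

```python
def temporal_label(spatial_label, t_root, t_neighbor, kernel_size, num_subsets):
    """Label of a neighbor in frame t_neighbor for a root edge in frame t_root.

    spatial_label : single-frame label in 1..num_subsets
    kernel_size   : temporal kernel size (window assumed centred on t_root)
    num_subsets   : number of subsets produced by the single-frame labeling
    """
    half = kernel_size // 2
    offset = t_neighbor - t_root
    if abs(offset) > half:
        raise ValueError("frame lies outside the temporal window")
    # The shifted offset is nonnegative; scaling by num_subsets keeps the
    # label ranges of different frames from overlapping.
    return spatial_label + (offset + half) * num_subsets


# Three subsets and a temporal kernel size of 3: every (label, frame) pair
# receives its own label.
all_labels = [temporal_label(k, 10, 10 + dt, 3, 3)
              for dt in (-1, 0, 1) for k in (1, 2, 3)]
print(all_labels)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Under these assumptions, the number of distinct weights needed is the product of the temporal kernel size and the number of spatial subsets.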
As our model convolves edges instead of nodes, and the original skeletal data are coordinates of human joints (nodes), we first have to transform the skeletal data from node form into edge form. The data of each bone (edge) is calculated from the coordinates of the two joints at the bone's ends. For each bone, we first get the coordinates of the bone's center by averaging the coordinates of the joints at its two ends, and then subtract one joint's coordinates from the other's to obtain a vector whose length and orientation represent the length and orientation of the bone between the two joints. Denoting the bone (edge) being computed as $e_{ij}$, and the two joints (nodes) at its ends as $v_i$ and $v_j$, we can formally write the calculation of the center coordinates of $e_{ij}$ as:
$$ x_c(e_{ij}) = \tfrac{1}{2} \left( x(v_i) + x(v_j) \right) \tag{6} $$
$$ y_c(e_{ij}) = \tfrac{1}{2} \left( y(v_i) + y(v_j) \right) \tag{7} $$
$$ z_c(e_{ij}) = \tfrac{1}{2} \left( z(v_i) + z(v_j) \right) \tag{8} $$
in which $x_c(e_{ij})$, $y_c(e_{ij})$, and $z_c(e_{ij})$ denote the three coordinates of the center of bone $e_{ij}$, and $x(v_i)$, $y(v_i)$, and $z(v_i)$ are the coordinates of node $v_i$, which also applies to node $v_j$. The orientation vector, whose length and orientation denote the length and orientation of bone $e_{ij}$, can be formulated as:
$$ \vec{e}_{ij} = \left( x(v_j) - x(v_i),\; y(v_j) - y(v_i),\; z(v_j) - z(v_i) \right) \tag{9} $$
Each bone is therefore represented by a set of coordinates denoting the center of the bone and a vector denoting the bone's length and orientation.
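The joint-to-bone transformation can be sketched in Python as follows. The function returns the center coordinates followed by the orientation vector, and works for 2D or 3D joints. The function name, the example coordinates, and the sign convention of the orientation vector (second joint minus first joint) are assumptions, since the text only states that one joint's coordinates are subtracted from the other's.

```python
def bone_features(joint_a, joint_b):
    """Return the bone's center coordinates followed by its orientation vector.

    joint_a, joint_b : coordinate tuples of equal length (2-D or 3-D).
    The orientation is joint_b - joint_a (the sign convention is assumed).
    """
    if len(joint_a) != len(joint_b):
        raise ValueError("joints must have the same dimensionality")
    center = tuple((a + b) / 2 for a, b in zip(joint_a, joint_b))
    orientation = tuple(b - a for a, b in zip(joint_a, joint_b))
    return center + orientation


hip = (0.0, 1.0, 2.0)   # made-up coordinates
knee = (0.0, 0.5, 2.0)
print(bone_features(hip, knee))  # (0.0, 0.75, 2.0, 0.0, -0.5, 0.0)
```

For 3D joints the result is a six-value feature per bone; for 2D joints it is a four-value feature to which further per-bone values can be appended.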
Motions of different human body parts can be roughly categorized as concentric and eccentric, thus we choose spatial configuration labeling as our labeling function for spatial neighbors. Specifically, for convolution at a certain edge $e_{ij}$, this labeling function divides the neighboring edges into three subsets: 1) edges that are closer to the gravity center than $e_{ij}$; 2) edges with the same distance to the gravity center as $e_{ij}$; 3) edges that are farther from the gravity center than $e_{ij}$, i.e.:
$$ L(e_{mn}) = \begin{cases} 1 & \text{if } d(e_{mn}, g) < d(e_{ij}, g) \\ 2 & \text{if } d(e_{mn}, g) = d(e_{ij}, g) \\ 3 & \text{if } d(e_{mn}, g) > d(e_{ij}, g) \end{cases} \tag{10} $$
In eq. 10, $g$ is the gravity center of the human body, calculated by averaging the coordinates of all body parts, and $d(e_{mn}, g)$ denotes the distance from edge $e_{mn}$ to $g$. Given the labeling function for spatial neighbors, the labeling function for temporal neighbors can be formulated according to eq. 5.
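A minimal Python sketch of spatial configuration labeling is given below. It assumes that the distance from an edge to the gravity center is measured from the bone's center point, that the three subsets are numbered 1 (closer), 2 (equal) and 3 (farther) in the order they are listed in the text, and that "equal" is tested with a small tolerance. These choices and all names are assumptions for illustration.

```python
import math


def gravity_center(points):
    """Average of a list of coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def spatial_label(root_center, neighbor_center, center, tol=1e-6):
    """1: neighbor closer to the gravity center than the root edge,
    2: equally distant (within tol), 3: farther away."""
    d_root = math.dist(root_center, center)
    d_neighbor = math.dist(neighbor_center, center)
    if abs(d_neighbor - d_root) <= tol:
        return 2
    return 1 if d_neighbor < d_root else 3


g = gravity_center([(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)])  # (1.0, 1.0)
root = (2.0, 1.0)  # made-up bone center at distance 1 from g
print(spatial_label(root, (1.5, 1.0), g))  # 1: closer
print(spatial_label(root, (1.0, 2.0), g))  # 2: same distance
print(spatial_label(root, (4.0, 1.0), g))  # 3: farther
```

The root edge itself always falls into the "equal" subset under this scheme, so the three labels cover every edge in the neighborhood.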
3.4 Combining Graph Edge and Node Convolutions
A convolutional network extracts a set of high-level features from the original skeleton sequence, but the node convolutional network and the edge convolutional network extract features from different perspectives, thus the two sets of features (the outputs of the convolutional layers) represent the same video sequence from different viewpoints. Both edge convolution and node convolution have unique advantages that cannot be replaced by each other. As mentioned before, the edge convolutional network leverages the dynamics of bones, while the node convolutional network leverages the dynamics of joints.
This raises the question of whether the two can be combined into hybrid models that utilize the advantages of both edge and node convolution. As the dynamics of joints and bones are complementary to each other, we expect that a model utilizing the two sets of features together can leverage human skeleton dynamics from both the joint perspective and the bone perspective, and further improve performance on the human action recognition task. We therefore design two different hybrid models that combine edge and node convolution models using features of different levels.
1) Sequence level hybrid model: A straightforward realization of this idea is our sequence level hybrid model. This is a two-stream model. In one stream we apply a stack of edge convolution layers to the skeleton sequence, and in the other stream we apply a stack of node convolution layers. Given the input sequence $X$, the output of the convolutional layers in the edge convolution stream can be formulated as:
$$ F_{s}^{e} = \mathrm{Conv}_{e}(X) \tag{11} $$
in which $\mathrm{Conv}_{e}$ represents the layers of edge convolution, and $F_{s}^{e}$ is the output from the last convolutional layer of the edge convolution stream. The output of the convolutional layers in the node convolution stream can be formulated as:
$$ F_{s}^{n} = \mathrm{Conv}_{n}(X) \tag{12} $$
in which $\mathrm{Conv}_{n}$ represents the layers of node convolution, and $F_{s}^{n}$ is the output from the last convolutional layer of the node convolution stream.
In the sole edge convolution or node convolution networks, we have only one set of features, $F_{s}^{e}$ or $F_{s}^{n}$. We apply global pooling to it to get a representation for the whole sequence, and then input this representation into a fully connected layer to output the final class scores, denoting the probabilities for the sequence to be classified into each class. But now we have two sets of features and two different representations for the same sequence. We concatenate these two representations into a single tensor to get the input for the last fully connected layer, i.e.
$$ H_{s} = \mathrm{GP}(F_{s}^{e}) \oplus \mathrm{GP}(F_{s}^{n}) \tag{13} $$
in which $H_{s}$ denotes the input to the fully connected layer in the sequence level hybrid model, $\mathrm{GP}$ is the global pooling layer, and $\oplus$ is a concatenation operator. The final output of the whole model is:
$$ O_{s} = \mathrm{FC}(H_{s}) \tag{14} $$
in which $\mathrm{FC}$ denotes the fully connected layer, and $O_{s}$ is the final output of the sequence level hybrid model.
By concatenating the outputs from edge and node convolution streams, features extracted from both edge and node convolutional networks contribute to final classification result, i.e. dynamics of both joints and bones are leveraged in our classification.
In this model, the inputs to the shared layer are features of the whole skeleton sequence, thus we call the first hybrid model the sequence level hybrid model.
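The fusion step of the sequence level hybrid model can be sketched in plain Python. Each stream's feature map is pooled into one vector, the two vectors are concatenated, and a fully connected layer maps the result to class scores. The use of average pooling is an assumption (the text only says "global pooling"), and the function names, feature sizes and weights are made up for illustration.

```python
def global_average_pool(feature_map):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(feature_map)
    return [sum(column) / n for column in zip(*feature_map)]


def dense(vector, weight_rows, biases):
    """Fully connected layer: one weight row and one bias per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight_rows, biases)]


def sequence_level_scores(edge_features, node_features, weight_rows, biases):
    """Pool each stream, concatenate the two vectors, then classify."""
    pooled = (global_average_pool(edge_features)
              + global_average_pool(node_features))  # list concatenation
    return dense(pooled, weight_rows, biases)


edge_features = [[1.0, 3.0], [3.0, 5.0]]   # two bones, two channels
node_features = [[2.0], [4.0], [6.0]]      # three joints, one channel
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]                # two classes, three inputs
print(sequence_level_scores(edge_features, node_features, weights, [0.0, 1.0]))
# [2.0, 9.0]
```

Because pooling happens before concatenation, the two streams may have different numbers of elements (bones versus joints) and different channel widths without affecting the fusion.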
2) Body-part level hybrid model: In the sequence level hybrid model, two sets of features complement each other in a shared fully connected layer after being separately extracted by the node and edge convolutional networks. But we also want the two convolutional networks to supplement each other during feature extraction rather than after it, i.e. we want to merge the two networks in a convolutional layer rather than a fully connected layer. Thus we design another hybrid model, whose most distinctive part is the shared convolutional layer.
In a shared convolutional layer, our convolution operation is neither edge convolution nor node convolution, but a combination of both. As mentioned before, we conduct convolution on a human skeleton graph. When we perform edge convolution, the data are information about bones and are attached to edges; when we perform node convolution, the data are information about joints and are attached to nodes. In a shared convolutional layer, however, both nodes and edges carry information, so during convolution the information of edges can be passed to nodes, and vice versa. Through this kind of convolution, the dynamics of bones and joints can supplement each other during feature extraction in the shared convolutional layer. An illustration of the convolution operation in the shared convolutional layer can be found in Fig. 6.
Similar to the sequence level hybrid model, the first part of the body-part level hybrid model consists of two separate streams, an edge convolutional network and a node convolutional network. The outputs of the edge and node convolutional networks are two sets of features denoted as $F_{b}^{e}$ and $F_{b}^{n}$, which can be formulated as follows:
$$ F_{b}^{e} = \mathrm{Conv}_{e}(X) \tag{15} $$
$$ F_{b}^{n} = \mathrm{Conv}_{n}(X) \tag{16} $$
in which $X$, $\mathrm{Conv}_{e}$, and $\mathrm{Conv}_{n}$ have exactly the same meanings as in the sequence level hybrid model. Compared to the sequence level hybrid model, the subscripts are changed from $s$ to $b$ to indicate that they are parts of the body-part level hybrid model. More specifically, the feature set extracted by the edge convolutional network contains feature vectors for each bone in each frame, and the feature set extracted by the node convolutional network contains feature vectors for each joint in each frame. After getting these two feature sets, we construct a new human skeleton graph, then we fill in each edge with the corresponding feature vector extracted by the edge convolutional network and fill in each node with the corresponding feature vector extracted by the node convolutional network. In this new human skeleton graph, both edges and nodes are filled with feature vectors, which are the high-level features of bones and joints containing the dynamic information extracted by the two separate convolutional networks. Then the shared convolutional layer is applied thereon and the input can be formulated as follows:
(17) 
in which denotes the input to shared convolutional layer in the bodypart level hybrid model. In the shared convolutional layer, the high level features of bones and joints communicate with each other and get furnished by each other. The output of the shared convolutional layer is also feature vectors of bones and joints like the output we get from the two separate streams. The difference is that in the two separate streams, the features of joints and bones are separately extracted, but in our shared convolutional layer, each time we perform convolution operation on the human body graph, the information of joints get chances to flow to the neighboring bones, thus the features of bones are informed and get trimmed by dynamics of joints, and vice versa, making the final features more representative. We can formally write it as:
(18) 
in which represents the output from a shared convolutional layer in the bodypart level hybrid model, and denotes the shared convolutional layer. After two layers of shared convolution, we apply a global pooling layer and a fully connected layer to get final classification result just like we do in sequence level hybrid model:
(19) 
(20) 
is the output from global pooling layer and is the input to the final fully connected layer. The final output from the fully connected layer can be formulated as:
(21) 
in which is the output of the whole bodypart hybrid model, and denotes the fully connected layer.
In our hybrid models, the high-level features extracted by edge convolution and node convolution can corroborate each other. As mentioned above, in many situations the movements of the joints are too subtle to capture, but the movements of bones are always obvious. In these cases, the hybrid models can utilize the power of edge convolution to classify the human actions. However, among some classes of human action, the movements of bones may be too similar to be distinguished from each other. In these cases, the dynamics of joints may help the hybrid model to perform the classification. That is to say, the dynamics of both joints and bones are leveraged and fused together to recognize human actions, thus the skeletal data is fully utilized in our hybrid models.
In our second hybrid model, the features input into the shared layers are features of each specific human body joint and bone, which are body-part level features. Thus we name this model the body-part level hybrid model.
4 Experiments
First, we conduct an ablation study on the Kinetics dataset to explore the influence of temporal kernel size on action recognition and to choose a proper temporal kernel size. We then evaluate our model's performance on two datasets: the Kinetics human action dataset (Kinetics) [46], which is to date the largest unconstrained action recognition dataset, and the NTU-RGB+D dataset [35], the largest in-house captured human action recognition dataset. We compare our performance with previous state-of-the-art algorithms. On Kinetics, we compare our results with the feature encoding approach [13], Deep LSTM [35], Temporal ConvNet [23], and ST-GCN [15]. On NTU-RGB+D, we compare with the Lie Group model [32], H-RNN [34], Deep LSTM [35], PA-LSTM [35], ST-LSTM + TS [36], Temporal Conv [23], C-CNN + MTLN [22], and ST-GCN [15]. To further analyze the improvement brought by our edge convolution, we visualize the classification accuracy for each class on the NTU-RGB+D dataset, and compare the accuracy of the node convolution model and the edge convolution model on some representative classes.
4.1 Datasets and Settings
4.1.1 Kinetics
Kinetics is a dataset of raw videos retrieved from YouTube. It consists of 300,000 video clips, and all the samples fall into 400 classes with at least 400 video clips for each action class. To get the skeletal data, [15] used the toolbox [30] to obtain 2D coordinates of joints in each video. Each joint located by the toolbox is in the form of $(x, y, c)$, and each body is represented by 18 joints (Fig. 7). $x$ and $y$ are the two coordinates of the joint, while $c$ is the confidence score. For each video clip, the two bodies with the highest average joint confidence scores are retained. Thus each video is represented by a tensor of shape $(3, T, 18, 2)$, where 3 denotes $(x, y, c)$, $T$ denotes the number of frames in the video, 18 is the number of joints of each body, and 2 is the number of bodies in the video.
As mentioned before, we have to transform the skeletal data from joint form into bone form. In this dataset, the original skeletal data consists of the 2D coordinates and confidence scores of joints, and our human skeleton graph is composed of 18 joints and 17 bones. We represent each bone by five values: the first two values are the 2D coordinates of the bone center, obtained by averaging the coordinates of the joints at the bone's two ends; the third value is the confidence score of the bone, obtained by averaging the confidence scores of the two joints at the bone's two ends; and the last two values are the components of a two-dimensional orientation vector, obtained by subtracting the coordinates of the joints at the bone's ends, which denotes the length and orientation of the bone.
The whole skeletal dataset retrieved from Kinetics contains 266,440 samples. 246,534 of the dataset are used for training and 19,906 samples are used for testing in our setting. And on Kinetics dataset, we report both top1 and top5 accuracies.
4.1.2 NTU-RGB+D
By now, this is the largest dataset with 3D joint annotations for the human action recognition task. The dataset was captured concurrently by 3 Microsoft Kinect v2 cameras [35]. It contains 56,880 action samples in the form of RGB videos, depth map sequences, 3D skeletal data, and infrared videos. All samples in this dataset fall into 60 action classes. Among the multiple modalities in this dataset, the skeletal data is the only one we use. In the skeletal data, each human body is represented by 25 joints (Fig. 7), and each joint is represented by X, Y, Z coordinates.
For this dataset, we construct a graph containing 25 joints and 24 bones. For each bone, we use a six-dimensional feature vector to denote the bone's location and orientation. The first three values of the feature vector are the 3D coordinates of the bone center, calculated by averaging the 3D coordinates of the two joints at the bone's ends. The last three values are the components of the orientation vector of the bone, obtained by subtracting the coordinates of those two joints.
The authors of this dataset recommend dividing the data in two ways. 1) Cross-subject (X-Sub): the data are separated into 40,320 samples and 16,560 samples, where the former part comes from one group of actors and is used for training, and the latter part comes from the other group of actors and is used for testing. 2) Cross-view (X-View): the training set consists of 37,920 samples captured by one set of cameras, and the testing set consists of 18,960 samples captured by the other set of cameras. On the NTU-RGB+D dataset, we report only top-1 accuracy.
4.2 Pipeline
4.2.1 Graph Edge Convolutional Neural Network (GECNN)
The skeletal data (coordinates of joints) input to our model is first normalized by a batch normalization layer. The center coordinates and orientations of the bones are then calculated from the normalized data and fed into 9 graph edge convolutional layers. The first three convolutional layers have 64 output channels, the next three have 128 output channels, and the last three have 256 output channels. In the fourth and seventh convolutional layers, the stride is set to 2 to conduct pooling. The output tensor of the convolutional layers is then fed into a global pooling layer, which produces a 256-dimensional feature vector for each video sequence. Finally, this vector is fed into a dense layer to get the class scores for each video sequence (Fig. 8).
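The layer layout above can be written down as a small configuration and traced to see how the sequence length shrinks. This is only a sketch under assumptions that the text does not state: that the stride acts on the temporal axis, that padding preserves the length at stride 1, and a hypothetical input length of 300 frames.

```python
# (output channels, stride) for the 9 edge convolutional layers described above.
LAYERS = [
    (64, 1), (64, 1), (64, 1),
    (128, 2), (128, 1), (128, 1),   # layer 4 uses stride 2
    (256, 2), (256, 1), (256, 1),   # layer 7 uses stride 2
]

def trace_shapes(t_in, layers=LAYERS):
    """Return (channels, temporal length) after each layer.

    Assumes length-preserving padding, so a layer with stride s maps a
    length t to ceil(t / s).
    """
    shapes = []
    t = t_in
    for channels, stride in layers:
        t = -(-t // stride)          # ceiling division
        shapes.append((channels, t))
    return shapes

shapes = trace_shapes(300)           # 300 frames is a hypothetical input length
for index, (channels, length) in enumerate(shapes, start=1):
    print(f"layer {index}: channels={channels}, frames={length}")
```

Under these assumptions the global pooling layer receives 256 channels, which matches the 256-dimensional feature vector mentioned in the text.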
4.2.2 Sequence level hybrid model
As mentioned before, the first part of the sequence level hybrid model consists of two separate streams: an edge convolutional stream and a node convolutional stream. The edge convolutional stream includes a normalization layer, followed by 9 edge convolutional layers and a global pooling layer. The node convolutional stream has the same structure: a normalization layer, 9 node convolutional layers, and a global pooling layer.
After the global pooling layer in the first part, each stream represents a sequence as a 256-dimensional feature vector. In the second part of the sequence level hybrid model, we concatenate the two 256-dimensional outputs of the global pooling layers into a single tensor and input it into a fully connected layer to get the final classification result, namely the class scores indicating the probability of assigning the sequence to each class (Fig. 10).
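The fusion step of the sequence level hybrid model can be illustrated with a short NumPy sketch. The function name, the random placeholder features and weights, the batch size of 4, the choice of 60 classes, and the use of a softmax to turn scores into probabilities are assumptions made for illustration only.

```python
import numpy as np

def sequence_level_fusion(edge_feat, node_feat, weight, bias):
    """Concatenate per-sequence features from both streams and classify.

    edge_feat, node_feat: arrays of shape (N, 256) from the two global poolings.
    weight: (512, num_classes), bias: (num_classes,) of the fully connected layer.
    Returns class probabilities of shape (N, num_classes).
    """
    fused = np.concatenate([edge_feat, node_feat], axis=-1)   # (N, 512)
    logits = fused @ weight + bias                            # (N, num_classes)
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)              # softmax (assumed)

# Placeholder inputs standing in for the outputs of the two trained streams.
rng = np.random.default_rng(0)
edge_feat = rng.normal(size=(4, 256))
node_feat = rng.normal(size=(4, 256))
weight = rng.normal(scale=0.01, size=(512, 60))
bias = np.zeros(60)
probs = sequence_level_fusion(edge_feat, node_feat, weight, bias)
```

Because the two streams only meet after global pooling, information is exchanged between whole-sequence features rather than between individual joints and bones.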
4.2.3 Body-part level hybrid model
The body-part level hybrid model is also divided into two parts. The first part is the same as the first part of the sequence level hybrid model, except that the global pooling layer is removed. The second part includes 4 layers: the output tensors of the two streams are first fed into two common convolutional layers, then global pooling is conducted on the output of these common layers, and finally the result is fed into a fully connected layer to get the final classification result (Fig. 11).
4.3 Implementation of Edge Convolution
The implementation of edge convolution consists of two stages: a temporal convolution and a spatial convolution. Each input sequence can be represented as a tensor of shape (C, E, T), in which C denotes the number of channels of the feature vector, E is the number of edges, and T is the length of the sequence. We first perform a traditional 2D convolution along the third dimension to conduct the temporal convolution; the kernel has size 1 across edges, so it mixes only temporal neighbors, and its temporal extent is the temporal kernel size studied in Section 4.4. We then multiply the resulting tensor with the adjacency matrix of the human skeleton graph to conduct the spatial convolution. Conducting spatial convolution by multiplying the tensor with the adjacency matrix is inspired by [47], in which the concept of graph shift corresponds to spatial convolution on graphs.
4.4 Hyperparameter analysis
To figure out to what extent incorporating temporal neighbors improves action recognition, we choose six different temporal kernel sizes for our edge convolution model and compare the results on the Kinetics dataset.
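The two-stage procedure described above (temporal convolution followed by multiplication with an adjacency matrix) can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: the helper names, the rule that two edges are adjacent when they share a joint, the self-connections, the zero padding, the odd kernel size, and the absence of any normalization of the adjacency matrix are all assumptions.

```python
import numpy as np

def edge_adjacency(bones):
    """Adjacency between edges: 1 if two bones share a joint (assumed rule).

    Self-connections are included so each edge keeps its own features.
    """
    num_edges = len(bones)
    adj = np.zeros((num_edges, num_edges))
    for i, bone_i in enumerate(bones):
        for j, bone_j in enumerate(bones):
            if i == j or set(bone_i) & set(bone_j):
                adj[i, j] = 1.0
    return adj

def temporal_conv(x, w):
    """Convolve along time only. x: (C_in, E, T), w: (C_out, C_in, K), K odd."""
    k = w.shape[2]
    pad = k // 2
    t = x.shape[2]
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))            # zero padding in time
    # windows[c, e, t, k] holds the k-th temporal neighbor of frame t
    windows = np.stack([xp[:, :, i:i + t] for i in range(k)], axis=-1)
    return np.einsum("ock,cetk->oet", w, windows)

def edge_conv(x, w, adj):
    """Temporal convolution followed by aggregation over adjacent edges."""
    y = temporal_conv(x, w)                                  # (C_out, E, T)
    return np.einsum("oet,ef->oft", y, adj)                  # sum over neighbors

# Toy example: 3 bones, bones 0 and 1 share joint 1, bone 2 is isolated.
toy_bones = [(0, 1), (1, 2), (3, 4)]
adj = edge_adjacency(toy_bones)
x = np.ones((1, 3, 4))          # (C, E, T)
w = np.ones((2, 1, 3))          # 2 output channels, temporal kernel size 3
out = edge_conv(x, w, adj)
```

Keeping the kernel at size 1 across edges means that all mixing between edges happens in the adjacency multiplication, which is what makes the operation follow the skeleton graph rather than an arbitrary ordering of edges.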
Table 1: Top-1 and top-5 accuracy on Kinetics for different temporal kernel sizes.
Temporal kernel size   Top-1     Top-5
3                      29.30%    51.71%
5                      30.58%    52.84%
7                      31.17%    53.95%
9                      31.41%    53.90%
11                     31.49%    54.38%
13                     31.43%    53.86%
Table 1 shows that as the temporal kernel size rises from 3 to 9, the top-1 accuracy increases from 29.30% to 31.41%, while further enlargement to 11 and 13 changes it by less than 0.1 percentage points. This demonstrates the necessity of incorporating enough consecutive frames to capture the motion when conducting the convolution. However, incorporating too many frames brings little benefit and costs more computation. Thus, after comparison, we set the temporal kernel size to 9 to obtain satisfying accuracy at a lower computational cost.
4.5 Experiment Results
4.5.1 Kinetics
On this dataset, 246,534 samples are used for training and 19,906 samples are used for testing, and we report both top-1 and top-5 accuracies. We compare our graph edge convolutional neural network (GECNN) model, sequence level hybrid model (SLHM), and body-part level hybrid model (BPLHM) with four other representative skeleton-based models: the feature encoding approach [13], Deep LSTM [35], Temporal ConvNet [23], and ST-GCN [15]. Table 2 shows that our models outperform these previous approaches.
On this dataset, our edge convolution network surpasses the previous state-of-the-art model by 0.7 percentage points in top-1 accuracy and 1.1 percentage points in top-5 accuracy. Our sequence level hybrid model further improves the performance by 0.4 percentage points in top-1 accuracy and 0.3 percentage points in top-5 accuracy. The body-part level hybrid model performs the best, with 2.7 percentage points higher top-1 accuracy and 3.4 percentage points higher top-5 accuracy.
4.5.2 NTU-RGB+D
On this dataset, we compare our edge convolution model with eight previous state-of-the-art models: Lie Group [32], H-RNN [34], Deep LSTM [35], PA-LSTM [35], ST-LSTM + TS [36], Temporal Conv [23], C-CNN + MTLN [22], and ST-GCN [15]. Table 3 reports the results and shows the improvement over these earlier models. On this dataset, only top-1 accuracy is reported.
For this dataset, we divide the data in two ways, as recommended by the authors of the dataset. The first setting is cross-subject (X-Sub), in which 40,320 samples coming from one group of actors are used for training and 16,560 samples coming from the other group of actors are used for testing. In this setting, our edge convolution model improves the accuracy by 2.5 percentage points over the previous state-of-the-art model. Moreover, our sequence level hybrid model and body-part level hybrid model further improve on the edge convolution model by 0.7 and 1.4 percentage points, respectively. Overall, the body-part level hybrid model performs the best, with an accuracy 3.9 percentage points higher than the previous state-of-the-art model.
The second setting is cross-view (X-View), in which the training set is 37,920 samples captured by one set of cameras and the testing set is 18,960 samples captured by the other set of cameras. In this setting, our edge convolution network achieves an accuracy 1.1 percentage points higher than the previous model, and our sequence level hybrid model and body-part level hybrid model further improve on it by 0.3 and 1.7 percentage points, respectively. The body-part level hybrid model again performs the best, with an accuracy 2.8 percentage points higher than the previous state-of-the-art model.
Table 3: Top-1 accuracy on NTU-RGB+D under the X-Sub and X-View settings.
Model                 X-Sub    X-View
Lie Group [32]        50.1%    52.8%
H-RNN [34]            59.1%    64.0%
Deep LSTM [35]        60.7%    67.3%
PA-LSTM [35]          62.9%    70.3%
ST-LSTM + TS [36]     69.2%    77.7%
Temporal Conv [23]    74.3%    83.1%
C-CNN + MTLN [22]     79.6%    84.8%
ST-GCN [15]           81.5%    88.3%
GECNN                 84.0%    89.4%
SLHM                  84.7%    89.7%
BPLHM                 85.4%    91.1%
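The margins quoted in the text for NTU-RGB+D follow directly from the accuracies in the table above. As a small worked check, the snippet below recomputes them; the dictionary simply copies the four relevant rows, and the helper name margin is hypothetical.

```python
# Top-1 accuracies (X-Sub, X-View) copied from the NTU-RGB+D results table.
table3 = {
    "ST-GCN": (81.5, 88.3),
    "GECNN": (84.0, 89.4),
    "SLHM": (84.7, 89.7),
    "BPLHM": (85.4, 91.1),
}

def margin(model, baseline, setting):
    """Difference in percentage points between two models for one setting."""
    index = {"X-Sub": 0, "X-View": 1}[setting]
    return round(table3[model][index] - table3[baseline][index], 1)

for setting in ("X-Sub", "X-View"):
    print(setting,
          "GECNN vs ST-GCN:", margin("GECNN", "ST-GCN", setting),
          "| SLHM vs GECNN:", margin("SLHM", "GECNN", setting),
          "| BPLHM vs GECNN:", margin("BPLHM", "GECNN", setting),
          "| BPLHM vs ST-GCN:", margin("BPLHM", "ST-GCN", setting))
```

Note that the hybrid-model margins in the text are measured against the edge convolution model, while the final figure for the body-part level hybrid model is measured against the previous state of the art.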
Overall, on all datasets and in all settings, our graph edge convolutional neural network model outperforms the previous state-of-the-art model. This result preliminarily corroborates our conjecture that in some circumstances the movements of human bones are more obvious than those of joints, and that on average, recognizing human actions by analyzing bone dynamics is more effective than analyzing the joints. Our hybrid models further improve on the graph edge convolutional network model, which means that combining edge and node convolution does help improve the performance. The fact that the body-part level hybrid model performs better than the sequence level hybrid model supports our guess that, instead of simply combining the extracted features, merging the node and edge convolution models during feature extraction enables the two models to benefit more from each other. To further validate our analysis, we analyze the classification accuracy on each class of the NTU-RGB+D dataset.
Advantages of GECNN on certain classes: In Fig. 12, we report the classification accuracy for each class. A closer look at the classes in which the edge convolution model greatly outperforms the node convolution model reveals further advantages of recognizing human actions from the dynamics of bones. For example, in classes 3, 10, 44, 47, and 53, the edge convolution model significantly outperforms the node convolution model. In these action classes, some joints are likely to overlap with other joints: in class 3, the hand overlaps with the head when brushing teeth (the head is also represented as a joint in the skeletal data); in class 10, the two hands overlap with each other when clapping; in class 44, the hand overlaps with the head when touching the head; in class 47, the hand overlaps with the neck when touching the neck (the neck is either represented as a joint or ignored in the skeletal data); in class 53, when patting another person on the back, the hand is likely to overlap with a joint located on the back. We suspect that overlap between joints lowers the accuracy of the node convolution model. Joints are small relative to the whole environment in which actors perform actions, so if the distance between two joints is too small, the locations and dynamics of those joints become difficult to capture (Fig. 11).
Things are different for bones. Because bones are long parts that extend a certain distance in 3D space, it is difficult for two bones to overlap completely. Two bones may cross at a point, but their major parts remain separated. Thus, in these cases, although some of the joints are difficult to distinguish from each other, the bones are still easy to separate. Besides the advantages of recognizing human actions by bone dynamics mentioned in the Introduction, robustness against joint overlap is therefore another advantage of using bone dynamics for human action recognition.
In the paragraphs above, we analyze the advantages of the edge convolution model by finding the classes in which it significantly outperforms the node convolution model and then examining the characteristics of these classes. A similar analysis could be conducted on other classes to find other advantages of the different models, but we do not pursue it further here, because this kind of analysis is not rigorous and needs further validation to be fully convincing. Our discussion above is meant to suggest one possible way to analyze the characteristics of different models, and more detailed validation is left for future work.
Model comparison on class accuracies: Besides the accuracy reported in Table 3, we can also compare the performance of the four models from another point of view. From Fig. 12, we count that our body-part level hybrid model outperforms the other three models in 25 classes, the sequence level hybrid model performs best in 18 classes, the edge convolution model gets the highest accuracy in 15 classes, and the node convolution model gets the best result in only 8 classes. (In some classes, two or more of the four models achieve the same best accuracy, so the four numbers 25, 18, 15, and 8 do not sum to 60, the total number of classes.) The fact that GECNN achieves higher accuracy than the node convolution model in more classes also agrees with our analysis in the Introduction that in many cases the movements of bones are more obvious and easier to capture. However, we also note that on some classes the node convolution model gets the highest accuracy, demonstrating the indispensable role of joint dynamics, which is the reason we design hybrid models.
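The per-class count described above, including the effect of ties, can be reproduced with a few lines of NumPy. The sketch below uses made-up accuracies for two models and three classes purely to show why the counts can exceed the number of classes; the function name and the tolerance parameter are assumptions.

```python
import numpy as np

def count_best_classes(acc, names, tol=1e-9):
    """Count, per model, the classes where it attains the best accuracy.

    acc: array of shape (num_models, num_classes).
    A class is credited to every model that ties for the best accuracy,
    so the counts may sum to more than the number of classes.
    """
    best = acc.max(axis=0)                        # best accuracy per class
    is_best = np.abs(acc - best) <= tol           # (num_models, num_classes)
    return {name: int(is_best[m].sum()) for m, name in enumerate(names)}

# Hypothetical accuracies: both models tie on the first class.
toy_acc = np.array([
    [0.90, 0.80, 0.70],   # model A
    [0.90, 0.70, 0.80],   # model B
])
counts = count_best_classes(toy_acc, ["A", "B"])
total = sum(counts.values())
```

In this toy case each model is best on two classes, giving a total of four for three classes, which is the same effect that makes the reported counts add up to more than 60.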
The hybrid models outperform the two models based solely on joints or bones in both overall accuracy and number of best-performance classes, which demonstrates that our hybrid models successfully inherit the advantages of both the bone-based model and the joint-based model. Moreover, the fact that the body-part level hybrid model outperforms the sequence level hybrid model tells us that combining node and edge convolution earlier lets the data be utilized more thoroughly. Letting information flow between connected joints and bones by involving them in the same convolution layers makes joints and bones corroborate each other more accurately. In the sequence level hybrid model, by contrast, the feature sets of bones and joints are extracted separately, and information is exchanged only between features of whole video sequences, which is coarser than in the body-part level hybrid model.
5 Conclusion
In this paper, we propose the graph edge convolutional neural network, a novel way to conduct convolution on graph data. Our model captures the relationships and dependencies between edges by conducting convolution on edges, which differs from previous graph convolution methods that convolve nodes. Convolution on edges also enables us to leverage the dynamics of bones. We apply our model to human action recognition on two datasets, where it outperforms existing state-of-the-art methods.
Moreover, we combine node convolution and edge convolution into two hybrid models that inherit advantages from both, and the experimental results show that our hybrid models further improve the performance.
References
 [1] L. Wang and D. Suter, “Learning and matching of dynamic shape manifolds for human action recognition,” IEEE Transactions on Image Processing, vol. 16, no. 6, pp. 1646–1661, 2007.
 [2] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
 [3] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, “Action recognition by dense trajectories,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3169–3176.
 [4] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013, pp. 3551–3558.
 [5] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3361–3368.
 [6] H. Liu, N. Shu, Q. Tang, and W. Zhang, “Computational model based on neural network of visual cortex for human action recognition,” IEEE transactions on neural networks and learning systems, vol. 29, no. 5, pp. 1427–1440, 2018.
 [7] A. W. Vieira, E. R. Nascimento, G. L. Oliveira, Z. Liu, and M. F. Campos, “Stop: Space-time occupancy patterns for 3d action recognition from depth map sequences,” in Iberoamerican Congress on Pattern Recognition. Springer, 2012, pp. 252–259.
 [8] C. Chen, K. Liu, and N. Kehtarnavaz, “Real-time human action recognition based on depth motion maps,” Journal of real-time image processing, vol. 12, no. 1, pp. 155–163, 2016.
 [9] J. Luo, W. Wang, and H. Qi, “Group sparsity and geometry constrained dictionary learning for action recognition from depth maps,” in Computer vision (ICCV), 2013 IEEE international conference on. IEEE, 2013, pp. 1809–1816.
 [10] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3d action recognition with random occupancy patterns,” in Computer vision–ECCV 2012. Springer, 2012, pp. 872–885.
 [11] X. Yang and Y. L. Tian, “Eigenjoints-based action recognition using naive-bayes-nearest-neighbor,” in Computer vision and pattern recognition workshops (CVPRW), 2012 IEEE computer society conference on. IEEE, 2012, pp. 14–19.
 [12] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1290–1297.
 [13] B. Fernando, S. Gavves, O. Mogrovejo, J. Antonio, A. Ghodrati, and T. Tuytelaars, “Modeling video evolution for action recognition,” in Proceedings CVPR 2015, 2015, pp. 5378–5387.
 [14] M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, “Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations.” in IJCAI, vol. 13, 2013, pp. 2466–2472.
 [15] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” arXiv preprint arXiv:1801.07455, 2018.
 [16] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
 [17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
 [18] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
 [19] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
 [20] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning, 2016, pp. 2014–2023.
 [21] C. Li, Q. Zhong, D. Xie, and S. Pu, “Skeleton-based action recognition with convolutional neural networks,” in Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on. IEEE, 2017, pp. 597–600.
 [22] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A new representation of skeleton sequences for 3d action recognition,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 4570–4579.
 [23] T. Soo Kim and A. Reiter, “Interpretable 3d human action analysis with temporal convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 20–28.
 [24] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, 2015, pp. 2224–2232.
 [25] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [26] F. P. Such, S. Sah, M. A. Dominguez, S. Pillai, C. Zhang, A. Michael, N. D. Cahill, and R. Ptucha, “Robust spatial filtering with graph convolutional neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 884–896, 2017.
 [27] G. Lai, H. Liu, and Y. Yang, “Learning graph convolution filters from data manifold,” arXiv preprint arXiv:1710.11577, 2017.
 [28] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proc. CVPR, 2017.
 [29] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1297–1304.
 [30] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, vol. 1, no. 2, 2017, p. 7.
 [31] L. Xia, C.-C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3d joints,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 20–27.
 [32] R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 588–595.
 [33] H. Kim, S. Lee, and A. C. Bovik, “Saliency prediction on stereoscopic videos,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1476–1490, 2014.
 [34] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1110–1118.
 [35] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A large scale dataset for 3d human activity analysis,” arXiv preprint arXiv:1604.02808, 2016.
 [36] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 816–833.
 [37] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie et al., “Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks.” in AAAI, vol. 2, 2016, p. 8.
 [38] S. Zhang, X. Liu, and J. Xiao, “On geometric features for skeleton-based action recognition using multilayer lstm networks,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 148–157.
 [39] L. Ge, H. Liang, J. Yuan, and D. Thalmann, “Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3593–3601.
 [40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
 [42] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
 [43] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [44] X. Zhang, Y. Zhuang, H. Hu, and W. Wang, “3d laser-based multiclass and multiview object detection in cluttered indoor scenes,” IEEE transactions on neural networks and learning systems, vol. 28, no. 1, pp. 177–190, 2017.
 [45] H. Xiong, W. Yu, X. Yang, M. Swamy, and Q. Yu, “Learning the conformal transformation kernel for image recognition,” IEEE transactions on neural networks and learning systems, vol. 28, no. 1, pp. 149–163, 2017.
 [46] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
 [47] A. Sandryhaila and J. M. Moura, “Discrete signal processing on graphs,” IEEE transactions on signal processing, vol. 61, no. 7, pp. 1644–1656, 2013.