
Skeleton-Based Relational Modeling for Action Recognition

by   Lin Li, et al.

With the fast development of effective and low-cost human skeleton capture systems, skeleton-based action recognition has attracted much attention recently. Most existing methods use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to extract the spatio-temporal information embedded in skeleton sequences for action recognition. However, these approaches are limited in their ability to model relations within a single skeleton, because important structural information is lost when the raw skeleton data is converted to fit the CNN or RNN input. In this paper, we propose an Attentional Recurrent Relational Network-LSTM (ARRN-LSTM) to simultaneously model spatial configurations and temporal dynamics in skeletons for action recognition. The spatial patterns embedded in a single skeleton are learned by a Recurrent Relational Network, followed by a multi-layer LSTM to extract temporal features in the skeleton sequences. To exploit the complementarity between different geometries in the skeleton for sufficient relational modeling, we design a two-stream architecture that learns the relationships among joints and explores the underlying patterns among lines simultaneously. We also introduce an adaptive attentional module for focusing on potentially discriminative parts of the skeleton for a certain action. Extensive experiments are performed on several popular action recognition datasets, and the results show that the proposed approach is competitive with state-of-the-art methods.





1 Introduction

Action recognition provides a reasonable approach to video understanding and is in great demand, especially in intelligent surveillance and human-computer interaction. Traditional approaches are mainly based on modeling appearance and optical flow. However, noise in RGB video dramatically obstructs the extraction of high-level features for action recognition.

Thanks to the advent of affordable depth sensors and efficient algorithms, the dynamic human skeleton has become an available and effective modality for action recognition. Compared with RGB video, skeletons offer a high-level representation and are robust to viewpoint, appearance and background noise, which gives them advantages in action recognition. As a result, many early skeleton-based methods were proposed and showed encouraging improvements, such as [1, 2, 3]. However, these approaches were significantly limited either by not exploring spatial structures [1] or by depending on hand-crafted features to analyze the spatial patterns [2, 3].

Figure 1: The expanded structure of the Recurrent Relational Network (RRN), which is used to learn the spatial pattern in a single skeleton frame by modeling joints and lines separately.

Recently, various deep learning based methods have been proposed for skeleton-based action recognition. In general, these approaches are mainly based on CNNs and RNNs for capturing spatio-temporal information in skeletons. Specifically, the CNN-based methods exploit the powerful representation ability of CNNs and achieve better performance than methods based on hand-crafted features, and the RNN-based models have shown great advantages in capturing temporal dynamics in sequential skeletons. However, CNNs usually lose important structural information in the process of encoding skeletons into spatial-temporal images [4, 5, 6], and RNNs have the same weakness when learning the spatial features in a single skeleton [7, 8, 9, 10]. Thus, when the raw skeleton data is converted to match the CNN or RNN input format, the original structures among the skeleton joints and lines are destroyed, which makes it difficult to extract robust spatial features in a single skeleton and remains the main weakness of these frameworks.

In this paper, we propose an Attentional Recurrent Relational Network [11]-LSTM (ARRN-LSTM) to model temporal dynamics and spatial configurations in skeletons for action recognition. Our approach is based on a two-stream architecture to learn sufficient relational information from both joints and lines in the skeleton. In each stream, we use the Recurrent Relational Network to learn the spatial patterns in a single skeleton and exploit a multi-layer LSTM to extract temporal information in skeleton sequences. Between the two modules, we design an adaptive attentional module for focusing on potential discriminative parts of a skeleton towards a certain action. Compared with other graph networks [12, 13], we believe Recurrent Relational Network is a better framework for learning spatial information in a single skeleton, as depicted in Fig.1, because it can ensure the flexible flow of information and build long-range dependencies among all joints or lines, which is important for learning robust features from graph structure data. Overall, our contributions can be summarized as follows:

  • We introduce the Recurrent Relational Network to the domain of skeleton-based action recognition and show that it is an effective framework for learning the spatial features in a single skeleton.

  • We design an organic framework, the two-stream ARRN-LSTM, to conduct skeleton-based action recognition, and achieve better results than most mainstream methods on popular skeleton datasets.

2 Related Work

2.1 Relational Network

Santoro et al. [14] propose a simple plug-and-play neural network module for relational reasoning. With this module, a neural network gains the ability of handling unstructured inputs and inferring their hidden relationship, which achieves state-of-the-art results on visual question answering datasets. Based on this work, Palm et al. [11] propose the Recurrent Relational Networks for complex relational reasoning, such as learning an iterative strategy to solve Sudoku.

2.2 Skeleton-based Action Recognition with Deep Networks

To utilize the powerful representation ability of CNNs, skeletons are usually encoded into spatial-temporal images to fit the CNN input. Hou et al. [4] accumulate the raw skeleton frames directly and encode the color based on temporal information. Liu et al. [6] exploit a 3D CNN to extract spatio-temporal features and avoid the loss of information in the projection process. Du et al. [5] divide the joints into five main parts according to human physical structure (four limbs and one trunk) and take the 3D coordinates of joints as the 3 channels of an RGB image. In addition, Yan et al. [13] use a Graph Convolutional Network to form a hierarchical representation of skeletons and achieve good results.

RNN is good at processing sequential data due to the extraordinary ability of capturing structural information in sequences. Du et al. [7] divide the human skeleton into five parts according to human physical structure and separately feed them into different RNNs. Song et al. [8] modify RNN to design an attentional module and use multi-layer LSTM to learn spatial and temporal information. Shahroudy et al. [9] propose a Part-Aware LSTM unit that builds full connections between all the memory cells and all the input features for acquiring richer information. Liu et al. [10] transform the joints in the form of tree structure based on traversal and propose a spatio-temporal LSTM framework to learn spatio-temporal information in joint sequences.

3 Proposed Method

3.1 Pipeline Overview

The pipeline of our framework is depicted in Fig.2. Our framework consists of two streams for learning structural features from joints and lines separately. In each stream, an embedding operation is first performed on each joint or line to increase its dimension. Then, the embedding results of joints or lines are sent to RRN for capturing the spatial patterns in the single skeleton. To focus more attention on potential discriminative parts in the skeleton, we generate a learnable mask and then use it to perform point-wise multiplication with the outputs of RRN. After that, we use a multi-layer LSTM to learn temporal features in skeleton sequences. Finally, we take the weighted average operation as our fusion strategy to combine the predictions from both streams.

Figure 2: Framework of the proposed two-stream ARRN-LSTM model. It is recommended to view the digital version.

3.2 The Construction of the Two-Stream ARRN-LSTM

3.2.1 Embedding

The raw skeleton is defined by a fixed number of joints in the form of 3D coordinates, with the number denoted as $N$. To improve the representation of joints and strengthen the discrimination among them, we use a fully-connected layer to map the 3D coordinates to a high-dimensional space. Thus, given a joint $J_{t,k} \in \mathbb{R}^{3}$ that denotes the 3D coordinate of the $k$-th joint in the $t$-th frame, the $d$-dimensional embedding result is:

$$e_{t,k} = \mathrm{FC}_{J}(J_{t,k}), \quad e_{t,k} \in \mathbb{R}^{d}$$

Besides the original joints, we believe that the lines between pair-wise joints are also important geometric structures in the skeleton and contain rich structural or relational information. Specifically, joints emphasize absolute position, which helps identify the discriminative moving parts of the body in an action by analyzing the distribution or local density of joints. Lines emphasize relative position, which can build angles that help identify specific poses and actions. Thus, the two geometric structures are potentially complementary for action recognition. We calculate the lines between $J_{t,k}$ and the other joints as:

$$L_{t,k} = \{\, J_{t,i} - J_{t,k} \mid i = 1, \dots, N,\ i \neq k \,\}$$

Here $L_{t,k}$ denotes all lines that connect the joint $J_{t,k}$ with all the other joints, excluding the one between $J_{t,k}$ and itself. Thus, if the dimension of one joint is $3$, the dimension of the corresponding lines is $3(N-1)$. Similarly, the line embedding process is:

$$e_{t,k} = \mathrm{FC}_{L}(L_{t,k}), \quad e_{t,k} \in \mathbb{R}^{d}$$

Because the output dimensions of both joint and line embeddings are equal, we use the same symbol $e_{t,k}$ to denote the embedding result of lines in the following expressions.
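The joint and line inputs described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the helper names `lines_from_joints` and `make_linear`, the toy coordinates, the embedding size of 8, the random weights, and the sign convention (other joint minus the reference joint) are not taken from the paper.

```python
import random

def lines_from_joints(joints, k):
    """Concatenate the difference vectors from joint k to every other joint.

    joints: list of N [x, y, z] coordinates for one frame.
    Returns a flat list of length 3 * (N - 1); the self-pair is skipped.
    The sign convention (other joint minus joint k) is an assumption.
    """
    base = joints[k]
    feats = []
    for i, other in enumerate(joints):
        if i == k:
            continue
        feats.extend(o - b for o, b in zip(other, base))
    return feats

def make_linear(in_dim, out_dim, seed=0):
    """Return a randomly initialised fully-connected layer as a closure."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def forward(x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(weight, bias)]
    return forward

# Toy frame with N = 4 joints (hypothetical coordinates).
frame = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
embed_dim = 8  # hypothetical embedding size
embed_joint = make_linear(3, embed_dim)
embed_line = make_linear(3 * (len(frame) - 1), embed_dim, seed=1)

joint_embeddings = [embed_joint(j) for j in frame]
line_embeddings = [embed_line(lines_from_joints(frame, k))
                   for k in range(len(frame))]
```

Both streams end up with one vector of the same size per joint, which is why a single downstream network shape can serve either stream.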

3.2.2 Recurrent Relational Network

In the domain of skeleton-based action recognition, some graph-network-based frameworks [12, 13] have been proposed recently and achieved substantial improvements over mainstream methods. However, these frameworks apply the notion of receptive fields to the graph structure and extract features only from a small adjacent range. Compared with them, we believe the Recurrent Relational Network is a better framework for learning spatial features in a single skeleton, because it ensures the flexible flow of information among joints and builds long-range dependencies among them, which could help to learn more robust features from graph-structured data. Besides, the implementation of the RRN is much easier than that of previous graph networks.

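Specifically, we feed all joints or lines from a single skeleton to the RRN. Each joint or line embedding is the input to one node of the RRN, so the number of nodes in the RRN is $N$, and the node inputs can be denoted as:

$$x_{k} = e_{t,k}, \quad k = 1, \dots, N$$

where $e_{t,k}$ is the embedding of the $k$-th joint or line in the $t$-th frame. As for the detailed process, if we denote the states of nodes $i$ and $j$ in the $s$-th iteration of the RRN as $h_{i}^{s}$ and $h_{j}^{s}$, we can define the information flowing from node $i$ to node $j$ as:

$$m_{ij}^{s+1} = f(h_{i}^{s}, h_{j}^{s})$$

where $f$ is a message function. The messages from all neighbouring nodes to node $j$ are then summed up as:

$$m_{j}^{s+1} = \sum_{i \in \mathcal{N}(j)} m_{ij}^{s+1}$$

Then the output of node $j$ is updated by a trainable node function $g$ as:

$$h_{j}^{s+1} = g(h_{j}^{s}, x_{j}, m_{j}^{s+1})$$

Given the number of iterations in the RRN as $S$, the output of node $j$ after the final iteration can be expressed as:

$$o_{j} = h_{j}^{S}$$

At this point, we obtain the relational modeling results of the joints or lines in a skeleton as $O_{t} = [o_{1}, \dots, o_{N}]$, of dimension $N \times d_{h}$, where $d_{h}$ is the size of a node state.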
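The message-passing loop of a recurrent relational network can be sketched as follows. This is a simplified illustration rather than the authors' implementation: a single tanh layer stands in for both the message function and the node function (the paper reports three fully-connected layers and a GRU unit for these), the graph is assumed fully connected, the initial node state is assumed to equal the node input, and all names and weights are hypothetical.

```python
import math
import random

def make_layer(in_dim, out_dim, rng):
    """Single tanh layer standing in for the trainable message/node functions."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
              for _ in range(out_dim)]

    def forward(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weight]
    return forward

def rrn_forward(inputs, steps, seed=0):
    """Run `steps` rounds of message passing over all node pairs.

    inputs: list of N node input vectors (joint or line embeddings).
    Returns the N node states after the final iteration.
    """
    rng = random.Random(seed)
    dim = len(inputs[0])
    message_fn = make_layer(2 * dim, dim, rng)  # f(h_i, h_j)
    node_fn = make_layer(3 * dim, dim, rng)     # g(h_j, x_j, m_j)
    hidden = [list(x) for x in inputs]          # assumption: h^0 = x
    n = len(inputs)
    for _ in range(steps):
        new_hidden = []
        for j in range(n):
            # Sum the messages sent to node j by every other node i.
            total = [0.0] * dim
            for i in range(n):
                if i == j:
                    continue
                msg = message_fn(hidden[i] + hidden[j])
                total = [t + m for t, m in zip(total, msg)]
            new_hidden.append(node_fn(hidden[j] + inputs[j] + total))
        hidden = new_hidden  # all nodes are updated synchronously
    return hidden

# Three toy nodes with 2-dimensional inputs, two iterations.
outputs = rrn_forward([[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]], steps=2)
```

Because every node exchanges messages with every other node in each iteration, information from distant joints reaches a node without passing through a fixed local neighbourhood.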

3.2.3 Attentional Module

For a certain action, humans usually recognize it by focusing on the most discriminative parts. For example, the action of kicking can be identified through the legs, while the drinking action can be recognized by the arms. However, some different actions, such as flipping and reading a book, cannot be distinguished until the subtle differences in the hand part are identified. Thus, we design an attentional module to address these problems, with the module following the RRN. Specifically, we first generate a learnable mask $M$, which can be initialized as a random vector, and then we use it to perform point-wise multiplication with the node outputs of the RRN. After that, we use a fully-connected layer to reduce the dimension of the product to avoid overfitting in the following multi-layer LSTM. We express the process as follows:

$$A_{t} = \mathrm{FC}(M \odot O_{t})$$

where $O_{t}$ denotes the node outputs of the RRN for the $t$-th frame. The mask $M$ is trained with the back-propagation algorithm and aims to emphasize the impact of some joints or lines while neglecting unimportant ones by allocating different weights. The output can be expressed as $A_{t}$.
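A minimal sketch of the masking step is given below, assuming one mask weight per node that scales that node's whole output vector; the actual mask shape used by the authors is not recoverable from the text, and the helper names, the random projection weights and the toy sizes are hypothetical.

```python
import random

def apply_mask(node_outputs, mask):
    """Scale each node's output vector by its mask weight (point-wise product)."""
    return [[mask[k] * v for v in vec] for k, vec in enumerate(node_outputs)]

def reduce_fc(masked, out_dim, seed=0):
    """Flatten the masked node outputs and project them to out_dim values."""
    rng = random.Random(seed)
    flat = [v for vec in masked for v in vec]
    weight = [[rng.uniform(-0.1, 0.1) for _ in flat] for _ in range(out_dim)]
    return [sum(w * v for w, v in zip(row, flat)) for row in weight]

# Two nodes with 2-dimensional outputs; the second node is emphasised.
node_outputs = [[1.0, 2.0], [3.0, 4.0]]
mask = [0.5, 2.0]  # would be learned by back-propagation in practice
frame_feature = reduce_fc(apply_mask(node_outputs, mask), out_dim=3)
```

In a trained model the mask values would be parameters updated together with the rest of the network, so the weighting of joints or lines adapts to the actions in the training data.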

3.2.4 Multi-Layer LSTM

Following the RRN, we exploit a multi-layer LSTM to learn the temporal dynamics in skeleton sequences. We first concatenate all joint or line features from a single skeleton as the input of one cell in the LSTM. The number of cells in the LSTM is equal to the skeleton sequence length $T$, so all frames in a skeleton sequence can exchange information through the internal connections of the LSTM. This process is denoted as follows:

$$H = \mathrm{LSTM}(A_{1}, A_{2}, \dots, A_{T})$$

where $A_{t}$ denotes the per-frame feature fed to the $t$-th cell and $H$ denotes the output of the multi-layer LSTM for the whole sequence.
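To make the temporal stage concrete, the sketch below stacks standard LSTM layers over a sequence of per-frame feature vectors. It uses the textbook LSTM gate equations with biases omitted for brevity; the layer sizes, random weights and function names are hypothetical and not taken from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm_layer(in_dim, hid_dim, rng):
    """Return a function that runs one LSTM layer over a whole sequence."""
    # One weight matrix per gate: input (i), forget (f), candidate (g), output (o).
    gates = {name: [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hid_dim)]
                    for _ in range(hid_dim)]
             for name in "ifgo"}

    def run(sequence):
        h = [0.0] * hid_dim
        c = [0.0] * hid_dim
        outputs = []
        for x in sequence:
            z = x + h  # concatenate current input and previous hidden state
            pre = {n: [sum(w * v for w, v in zip(row, z)) for row in gates[n]]
                   for n in "ifgo"}
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            g = [math.tanh(v) for v in pre["g"]]
            o = [sigmoid(v) for v in pre["o"]]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            outputs.append(h)
        return outputs
    return run

def multi_layer_lstm(sequence, layer_sizes, seed=0):
    """Feed the outputs of each LSTM layer into the next one."""
    rng = random.Random(seed)
    in_dim = len(sequence[0])
    for hid_dim in layer_sizes:
        sequence = make_lstm_layer(in_dim, hid_dim, rng)(sequence)
        in_dim = hid_dim
    return sequence

# Three frames of 2-dimensional features through layers of size 4 and 3.
hidden_states = multi_layer_lstm([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [4, 3])
```

The result holds one hidden vector per frame from the last layer; a classifier would typically read the final one or pool over all of them, a choice the text does not specify.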

3.2.5 Score Fusion

To get the final prediction from each stream, we connect a fully-connected layer after the multi-layer LSTM to map the extracted spatial-temporal features to the categories, whose number is denoted as $C$. Then we run a softmax operation on the output to obtain the predicted probabilities. We take the joint stream as an example:

$$p_{J} = \mathrm{softmax}(\mathrm{FC}(H_{J}))$$

where $H_{J}$ is the output of the multi-layer LSTM in the joint stream. To exploit the complementarity between joints and lines, we take the weighted average as the fusion strategy and get the final prediction. We use $p_{J}$ and $p_{L}$ to denote the scores of the joint and line streams, respectively. The final prediction is:

$$p = \alpha\, p_{J} + \beta\, p_{L}$$

$\alpha$ and $\beta$ are the relative weights of the two stream predictions, and their sum is equal to 1.
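The score fusion amounts to a convex combination of the two softmax outputs. A small sketch follows; the logits and the weight of 0.5 are made-up values, and `fuse` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(joint_scores, line_scores, joint_weight):
    """Weighted average of two score vectors; the two weights sum to 1."""
    line_weight = 1.0 - joint_weight
    return [joint_weight * a + line_weight * b
            for a, b in zip(joint_scores, line_scores)]

# Hypothetical logits for a 3-class problem from each stream.
joint_probs = softmax([2.0, 1.0, 0.1])
line_probs = softmax([0.5, 2.5, 0.2])
final_probs = fuse(joint_probs, line_probs, joint_weight=0.5)
predicted_class = max(range(len(final_probs)), key=final_probs.__getitem__)
```

Because both inputs are probability vectors and the weights sum to 1, the fused vector is again a probability vector, so the two streams can disagree and still yield a well-defined prediction.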

4 Experiments and Results

In our experiments, we first perform a detailed ablation study to validate our framework, with results shown in Tables 1 and 2. Then we train the proposed two-stream Attentional RRN-LSTM model and test it on the NTU RGB+D, Florence 3D and MSRAction3D datasets, with results and comparisons shown in Tables 3, 4 and 5, respectively.

4.1 Implementation Details

We normalize the joint coordinates by subtracting the average value of the 5 joints close to the hip joint. The lines are calculated from the normalized joints. We perform zero padding on videos with fewer than $T$ frames and random sampling on videos with more than $T$ frames, so that all videos are fixed at $T$ frames. Because the frame counts differ across datasets, we set $T$ separately for NTU RGB+D, Florence 3D and MSRAction3D. Besides, we conduct the embedding with a fully-connected layer and set its output size for the three datasets based on cross-validation. The following RRN executes a fixed number of iterations per frame, with each node function realized by a GRU unit and the message function constructed with 3 fully-connected layers. The attentional module is built with a trainable mask and a fully-connected layer that reduces the product to a 256-dimensional vector. We use a multi-layer LSTM to extract temporal information in skeleton sequences, with a 256-dimensional input. The mid-layer LSTM has an output size of 512, 256 and 256 for NTU RGB+D, Florence 3D and MSRAction3D, respectively. The two stream weights used in score fusion are set to the optimal values on the validation set in our experiments. We use Stochastic Gradient Descent to train our model from scratch on NTU RGB+D with an initial learning rate of 0.01, and we multiply the learning rate by 0.1 when the accuracy saturates. On the other datasets, we use the Adam optimizer to train our model from scratch. Our model is trained on an NVIDIA TITAN X GPU with PyTorch.
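The preprocessing steps (centering on hip-area joints, then padding or sampling to a fixed length) might look like the sketch below. The function names, the joint indices passed as `center_ids` and the target length are hypothetical, and keeping the sampled frames in temporal order is an assumption, since the text only says that random sampling is used.

```python
import random

def center_on_joints(frame, center_ids):
    """Subtract the mean of the selected joints from every joint in a frame."""
    center = [sum(frame[i][axis] for i in center_ids) / len(center_ids)
              for axis in range(3)]
    return [[c - m for c, m in zip(joint, center)] for joint in frame]

def fix_length(frames, target_len, rng=None):
    """Zero-pad short sequences; randomly sample frames from long ones.

    frames must contain at least one frame. Sampled frames keep their
    original temporal order (an assumption of this sketch).
    """
    rng = rng or random.Random(0)
    if len(frames) >= target_len:
        keep = sorted(rng.sample(range(len(frames)), target_len))
        return [frames[i] for i in keep]
    zero_frame = [[0.0] * len(joint) for joint in frames[0]]
    return frames + [zero_frame for _ in range(target_len - len(frames))]

# A 2-frame clip with a single joint, padded to 4 frames.
short_clip = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]
padded_clip = fix_length(short_clip, 4)
```

Fixing the length this way lets clips of different durations be batched together, at the cost of discarding some frames from long clips.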

4.2 Datasets

NTU RGB+D Dataset. This is the most popular and largest depth-based skeleton action recognition dataset currently, with more than 56 thousand video samples and 4 million frames collected from 40 different subjects. It consists of 60 different action classes. We evaluate our model according to the metrics proposed in [9], including Cross-Subject (CS) and Cross-View (CV) evaluations.

Florence 3D. This dataset includes 215 action sequences performed by 10 subjects, each action being repeated 2 to 3 times. It is made up of 9 activities, and each skeleton is represented by 15 joints. The difficulty of this dataset lies in the similarity between actions, such as drinking from a bottle, answering the phone and reading a watch. We follow the standard metric, i.e., leave-one-subject-out cross-validation, to evaluate our model.

MSRAction3D. This dataset contains 20 actions performed by ten subjects three times, for a total of 4020 action samples. The dataset is divided into three subsets, and each subset has 8 actions. In each subset, the samples of subjects 1, 3, 5, 7, 9 are used for training while the samples of subjects 2, 4, 6, 8, 10 are used for testing. The final accuracy is calculated as the average of the accuracies on the three subsets.
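The two small-dataset protocols can be expressed as index splits over a list of per-sample subject IDs. The sketch below is illustrative only; the function names and the toy subject list are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

def odd_even_split(subject_ids):
    """Odd-numbered subjects for training, even-numbered ones for testing."""
    train = [i for i, s in enumerate(subject_ids) if s % 2 == 1]
    test = [i for i, s in enumerate(subject_ids) if s % 2 == 0]
    return train, test

def mean_accuracy(accuracies):
    """Average the accuracies obtained on the individual folds or subsets."""
    return sum(accuracies) / len(accuracies)

# Toy example: six samples recorded from four subjects.
subjects = [1, 1, 2, 3, 3, 4]
folds = list(leave_one_subject_out(subjects))
train_idx, test_idx = odd_even_split(subjects)
```

Both protocols keep every subject entirely on one side of the split, so the reported accuracy reflects generalisation to unseen people rather than to unseen repetitions.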

4.3 Ablation Study

In this section, we examine the effectiveness of the proposed ARRN-LSTM framework and study the impact of each part through an ablation study, with results on NTU RGB+D shown in Tables 1 and 2.

| Methods | CS | CV |
| --- | --- | --- |
| 2-Layer LSTM [9] | 60.7 | 67.3 |
| 2-Layer P-LSTM [9] | 62.9 | 70.3 |
| MT-3D-CNN [6] | 66.9 | 72.6 |
| STA-LSTM [8] | 73.4 | 81.2 |
| VA-LSTM [15] | 79.4 | 87.6 |
| Joint-ARRN-LSTM | 79.6 | 87.8 |

Table 1: Comparisons with Baselines on NTU RGB+D.

| Model | CS | CV |
| --- | --- | --- |
| Line-RRN-LSTM | 74.5 | 83.3 |
| Joint-RRN-LSTM | 74.6 | 83.1 |
| Line-ARRN-LSTM | 76.4 | 87.2 |
| Joint-ARRN-LSTM | 79.6 | 87.8 |
| Two-Stream RRN-LSTM | 77.6 | 84.2 |
| Two-Stream ARRN-LSTM | 81.8 | 89.6 |

Table 2: Results of Ablation Study on NTU RGB+D.

To illustrate the effectiveness of the RRN in learning spatial features in a single skeleton, we compare the results of our joint stream with several baselines in Table 1; these methods also learn features from joints. Our method uses the RRN and a multi-layer LSTM to learn spatial and temporal information separately. In contrast, the 2-layer LSTM and 2-layer P-LSTM use a pure LSTM or its variant to extract spatio-temporal features from joints, and our method outperforms them substantially. Furthermore, MT-3D-CNN [6] uses a CNN to build a two-stream model for learning spatial and temporal features from skeletons, while both STA-LSTM [8] and VA-LSTM [15] use a multi-layer LSTM as the backbone to learn spatial and temporal information from joints. Our method achieves better results than all of these baselines, which supports the claim that the RRN can model spatial features in a single skeleton effectively.

From Table 2, we analyze the performance of each stream and of the attentional module. Each single-stream ARRN-LSTM achieves relatively good performance, which validates the effectiveness of the basic ARRN-LSTM. The attentional module increases the single-stream accuracy by roughly 2 to 5 points under both metrics (for example, from 74.6 to 79.6 on CS for the joint stream), validating the insight behind the attention mechanism. Furthermore, combining both streams yields an increase of about 2 points over the better single stream (from 79.6 to 81.8 on CS and from 87.8 to 89.6 on CV), which indicates the complementarity between joints and lines. Finally, the overall framework increases the accuracy by roughly 6 to 7 points over the single-stream RRN-LSTM models, which supports the soundness and effectiveness of our framework.
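The gains discussed here can be recomputed directly from the ablation table. The short sketch below copies the CS and CV accuracies from Table 2 and takes per-metric differences; the dictionary and the `gain` helper are only illustrative names.

```python
# Accuracies (%) copied from Table 2 as (CS, CV) pairs.
table2 = {
    "Line-RRN-LSTM": (74.5, 83.3),
    "Joint-RRN-LSTM": (74.6, 83.1),
    "Line-ARRN-LSTM": (76.4, 87.2),
    "Joint-ARRN-LSTM": (79.6, 87.8),
    "Two-Stream RRN-LSTM": (77.6, 84.2),
    "Two-Stream ARRN-LSTM": (81.8, 89.6),
}

def gain(after, before):
    """Per-metric accuracy difference in percentage points, rounded to 0.1."""
    return tuple(round(a - b, 1) for a, b in zip(table2[after], table2[before]))

attention_gain_joint = gain("Joint-ARRN-LSTM", "Joint-RRN-LSTM")       # (5.0, 4.7)
attention_gain_line = gain("Line-ARRN-LSTM", "Line-RRN-LSTM")          # (1.9, 3.9)
fusion_gain = gain("Two-Stream ARRN-LSTM", "Joint-ARRN-LSTM")          # (2.2, 1.8)
overall_gain = gain("Two-Stream ARRN-LSTM", "Joint-RRN-LSTM")          # (7.2, 6.5)
```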

For the embedding layer, we cannot ensure the convergence of the framework if we remove it, so we do not show an ablation study of this part. For the fusion of the two streams, we tried early fusion and several operations such as max and multiplication, but they either converged with difficulty or failed to achieve good performance.

| Methods | CS (%) | CV (%) |
| --- | --- | --- |
| Lie Group [2] | 50.1 | 52.8 |
| Deep LSTM [9] | 60.7 | 67.3 |
| PA-LSTM [9] | 62.9 | 70.3 |
| ST-LSTM+TS [10] | 69.2 | 77.7 |
| ST-NBMIM [16] | 80.0 | 84.2 |
| Deep STGCK [12] | 74.9 | 86.3 |
| VA-LSTM [15] | 79.4 | 87.6 |
| C-CNN + MTLN [17] | 79.6 | 84.8 |
| ST-GCN [13] | 81.5 | 88.3 |
| ARRN-LSTM | 81.8 | 89.6 |

Table 3: Comparison of Accuracies on NTU RGB+D.

| Methods | Accuracy (%) |
| --- | --- |
| Lie Group [2] | 90.88 |
| Graph-Based [18] | 91.63 |
| P-LSTM [9] | 95.35 |
| STGCK [12] | 97.67 |
| Deep STGCK [12] | 99.07 |

Table 4: Comparison of Accuracies on Florence 3D.

4.4 Comparisons with Mainstream Methods

NTU RGB+D Dataset. We compare our ARRN-LSTM model with mainstream methods on this dataset in Table 3. It is clear that our ARRN-LSTM method achieves better results than the CNN- or RNN-based methods. Specifically, ST-GCN [13] uses graph convolution to learn the spatial and temporal features in skeleton sequences, while our method uses the RRN to perform relational modeling in a single skeleton and a multi-layer LSTM to obtain the temporal information in skeleton sequences. The results support the strong ability of the RRN in modeling spatial features in a single skeleton and the effectiveness of the whole framework.

| Methods | Accuracy (%) |
| --- | --- |
| Lie Group [2] | 92.5 |
| HBRNN [7] | 94.5 |
| ST-LSTM [10] | 94.8 |
| Graph-Based [18] | 94.8 |
| ST-NBNN [16] | 94.8 |
| ST-NBMIM [16] | 95.3 |

Table 5: Comparison of Accuracies on MSRAction3D.

Florence 3D. As shown in Table 4, the proposed ARRN-LSTM framework is superior to most mainstream methods that are based on LSTM, CNN and traditional algorithms. The performance of our model is very close to that of the state-of-the-art Deep STGCK [12]. Like Deep STGCK, our model is based on a graph network, but the multi-layer LSTM makes its complexity much larger than that of Deep STGCK, which makes it prone to overfitting on small datasets.

MSRAction3D. As shown in Table 5, our method outperforms most mainstream methods, and the result is also competitive with the state-of-the-art method ST-NBMIM [16]. Although our framework again suffers from overfitting on small datasets, the result still supports the effectiveness of our method on small datasets.

5 Conclusion

In this paper, we introduced the Recurrent Relational Network to the domain of skeleton-based action recognition, designed a two-stream ARRN-LSTM framework, and achieved better results than most mainstream methods. The experimental results demonstrated the strong modeling ability of the RRN in a single skeleton and the effectiveness of the whole framework. However, we believe it is still possible to model both the single skeleton and skeleton sequences with a relational network for action recognition, which may provide a brand new perspective on this problem and a potential direction for further progress.


This work was supported in part by the National Key R&D Program of China (No. 2018YFB1004600), the Beijing Municipal Natural Science Foundation (No. Z181100008918010), the National Natural Science Foundation of China (No. 61836014, No. 61761146004, No. 61773375), and in part by the Microsoft Collaborative Research Project.