Action recognition (AR) is the task of identifying the behavior of individuals from observations of their activity. Assembling profiles of regular activities, such as activities of daily living (ADL), can support identifying trends in the data during critical events, including actions that might compromise a person’s life. For that reason, human action recognition has become an active research area. Generally, human activity is characterized by different modalities, including optical flow, appearance, and body skeletons [kinoshita2006tracking, bobick1997movement, yan2018spatial]. Among these modalities, dynamic human skeletons (DHS) usually carry vital information that encompasses other modalities. One of the main benefits of this approach is that it minimizes the need for wearable sensors. To collect the data, surveillance cameras can be mounted on the ceiling or walls of the environment of interest, ensuring an efficient indoor monitoring system [foroughi2008intelligent]. However, DHS modeling has not yet been fully explored.
A performed action is typically described by a time series of the 2D or 3D coordinates of human joint positions [li2019actional, yan2018spatial], and the action is recognized by examining the motion patterns. A skeleton representation of the human body has proven effective for this task. It is robust to noise, and it is efficient in terms of computation and storage [li2019actional]. Additionally, it provides a background-free data representation to the classification algorithms. This allows the algorithms to focus only on recognizing human body patterns, without being concerned about the environment surrounding the performed action. This work aims to develop a unique and efficient approach for modeling the DHS for human action recognition.
I-A OpenPose
There are multiple sources of camera-based skeleton data. Recently, Cao et al. [cao2019openpose] released the open-source library OpenPose, which allows real-time skeleton-based human detection. Their algorithm outputs the skeleton graph as an array of 2D or 3D coordinates. The array contains 18 tuples with values (X, Y, C) for 2D and (X, Y, Z, C) for 3D, where C is the confidence score of the detected joint, and X, Y and Z are the coordinates on the X-axis, Y-axis and Z-axis of the video frame, respectively.
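As an illustration of the 2D output format described above, the following minimal sketch groups a flat list of values into 18 (X, Y, C) tuples. The flat `[x0, y0, c0, x1, y1, c1, ...]` layout, the `Keypoint` type, and the function name are assumptions made for this example, not part of the library's API, and the sample values are synthetic.

```python
from typing import List, NamedTuple


class Keypoint(NamedTuple):
    x: float  # coordinate on the X-axis of the video frame
    y: float  # coordinate on the Y-axis of the video frame
    c: float  # confidence score of the detected joint


def parse_keypoints_2d(flat: List[float], num_joints: int = 18) -> List[Keypoint]:
    """Group an assumed flat [x0, y0, c0, x1, y1, c1, ...] list into per-joint tuples."""
    if len(flat) != 3 * num_joints:
        raise ValueError(f"expected {3 * num_joints} values, got {len(flat)}")
    return [Keypoint(*flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Synthetic data: 18 joints on a line, with decreasing confidence.
flat_example = [v for j in range(18) for v in (float(j), float(2 * j), 1.0 - j / 18)]
joints = parse_keypoints_2d(flat_example)
```

Each entry of `joints` can then be used directly as the three-channel feature vector of one skeleton node.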
I-B Spatial Temporal Graph Neural Network
New techniques have been proposed recently to exploit the connections between the joints of a skeleton. Among these, Convolutional Neural Networks (CNNs) are used to address human action modeling tasks due to their ability to automatically capture the patterns contained in the spatial configuration of the joints and their temporal dynamics [tu2018skeleton]. However, skeletons are represented as graphs, which makes it difficult to use conventional CNNs to model the dynamics of human actions. Thanks to the recent evolution of Graph Convolutional Neural Networks (GCNs), it is possible to analyze non-structured data in an end-to-end manner. These techniques generalize CNNs to graph structures [si2019attention].
To achieve accurate ADL recognition, the temporal dimension has to be considered. An action can be considered as a time-dependent pattern of a set of joints in motion [li2019actional]. Given the advantages of GCNs mentioned previously, numerous approaches for skeleton-based action recognition using this architecture have been proposed. The first GCN-based solution for action recognition using skeleton data was presented by Yan et al. [yan2018spatial]. They considered both the spatial and temporal dimensions of skeleton joint movements at the modeling stage. This approach is called the Spatiotemporal Graph Convolutional Network (ST-GCN) model. In the ST-GCN model, every joint has a set of edges for the spatial and temporal dimensions independently, as illustrated in Fig. 1. Given a sequence of frames with skeleton joint coordinates, the spatial edges connect each joint with its neighbors within each frame. The temporal edges, on the other hand, connect each joint with the same joint in the consecutive frame, so that the temporal edge set represents the joint trajectory over time [yan2018spatial]. However, the topology of the graph is not implicitly structured like Euclidean data. For instance, most of the nodes have different numbers of neighbors. Therefore, multiple strategies for applying the convolution operation upon skeleton joints have been proposed.
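The two edge sets described above can be sketched as follows. This is only an illustrative construction on a hypothetical three-joint chain; the function name, the `(frame, joint)` node encoding, and the toy bone list are assumptions, not the ST-GCN implementation.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (frame index, joint index)
Edge = Tuple[Node, Node]


def build_st_edges(num_frames: int, bones: List[Tuple[int, int]]) -> Tuple[List[Edge], List[Edge]]:
    """Return (spatial_edges, temporal_edges) for a skeleton sequence."""
    # Spatial edges: the skeleton bones, repeated inside every frame.
    spatial = [((t, a), (t, b)) for t in range(num_frames) for a, b in bones]
    # Temporal edges: each joint linked to the same joint in the next frame.
    joints = sorted({j for bone in bones for j in bone})
    temporal = [((t, j), (t + 1, j)) for t in range(num_frames - 1) for j in joints]
    return spatial, temporal


toy_bones = [(0, 1), (1, 2)]  # hypothetical 3-joint chain
spatial_edges, temporal_edges = build_st_edges(num_frames=3, bones=toy_bones)
```

Following the temporal edges of a single joint from the first frame to the last traces that joint's trajectory over time.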
In their work, Yan et al. [yan2018spatial] presented multiple solutions to perform the convolution operation over the skeleton graph. They first divided the skeleton graph into fixed subsets of nodes (the skeleton joints), which they called neighbor sets. Every neighbor set consists of a central node (the root node) and its adjacent nodes. Subsequently, each neighbor set is partitioned into a fixed number of subsets, and a numeric label (which we call priority) is assigned to each of them. Formally, each adjacent node $v_{tj}$ in the neighbor set $B(v_{ti})$ of a root node $v_{ti}$ is mapped to a label $l_{ti}(v_{tj})$. Each filter of the CNN, in turn, has $K$ subsets of values. Therefore, each subset of values of a filter performs the convolution operation upon the feature vector of its corresponding node. Given that the skeleton data has been obtained using the OpenPose toolbox [cao2019openpose], each feature vector consists of the 2D coordinates of the joint and a confidence value. These ideas are illustrated in Fig. 2.
Spatial configuration partitioning strategy
In this strategy, the partitioning for the label mapping is performed according to the distance of each node in the neighbor set with respect to the center of gravity of the skeleton graph. According to [yan2018spatial], each neighbor set is divided into three subsets (filter size K = 3). Therefore, each kernel has three subsets of values: one for the root node, one for the joints closer to the center of gravity than the root node, and another for the joints located farther from it. As can be seen in Fig. 3, each filter with three subsets of values is applied to the node feature vectors in order to create the output feature map. In this technique, the filter size is $K = 3$, and the mapping is defined as follows [yan2018spatial]:
\[
l_{ti}(v_{tj}) =
\begin{cases}
0 & \text{if } r_j = r_i \\
1 & \text{if } r_j < r_i \\
2 & \text{if } r_j > r_i
\end{cases}
\]
where $l_{ti}(v_{tj})$ represents the label map for each joint $v_{tj}$ in the neighbor set of the root node $v_{ti}$, $r_i$ is the average distance from the center of gravity to the root node, and $r_j$ is the average distance from the center of gravity to the joint $v_{tj}$, both taken over every frame across the whole training set. Once each node in the neighbor set has been labeled, the convolution operation is performed to produce the output feature maps, as shown in Fig. 3.
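The three-way labeling of the spatial configuration strategy can be sketched as below, assuming the average distances to the center of gravity have already been computed. The function name, the tolerance used for the equality test, and the example distances are hypothetical.

```python
import math
from typing import Dict


def spatial_configuration_label(r_root: float, r_joint: float, tol: float = 1e-9) -> int:
    """Label a joint by comparing its average distance to the center of
    gravity (r_joint) with that of the root node (r_root)."""
    if math.isclose(r_joint, r_root, abs_tol=tol):
        return 0  # same distance as the root (the root itself)
    return 1 if r_joint < r_root else 2  # 1: closer to the center, 2: farther


# Hypothetical average distances for one neighbor set.
avg_dist: Dict[str, float] = {"root": 1.0, "a": 0.6, "b": 1.4}
sc_labels = {name: spatial_configuration_label(avg_dist["root"], r) for name, r in avg_dist.items()}
```

With these values the root is labeled 0, joint `a` (closer to the center of gravity) is labeled 1, and joint `b` (farther away) is labeled 2, so each node is convolved with a different subset of the filter.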
I-C Learnable edge importance weighting
It is important to note that complex movements can be inferred from a small set of representative bright spots on the joints of the human body [johansson1973visual]. However, not all the joints provide the same quality and quantity of information regarding the movement performed. Therefore, it is intuitive to assign a different level of importance to every joint in the skeleton.
In the ST-GCN framework proposed by Yan et al. [yan2018spatial], the authors added a mask M (or M-mask) to each layer of the GCN to express the importance of each joint. The mask scales the contribution of each joint of the skeleton according to the learned weights of the spatial graph network. According to the authors, the M-mask considerably improves the architecture’s performance. Therefore, the M-mask is applied to the ST-GCN network throughout their experiments.
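One way to picture the effect of the M-mask is as an element-wise scaling of the skeleton adjacency matrix. The sketch below assumes this element-wise form and uses a hypothetical three-joint adjacency matrix with hand-picked mask values; in the actual model the mask entries are learned during training.

```python
from typing import List

Matrix = List[List[float]]


def apply_edge_mask(adjacency: Matrix, mask: Matrix) -> Matrix:
    """Scale every edge of the adjacency matrix by its mask weight (element-wise product)."""
    return [[a * m for a, m in zip(a_row, m_row)] for a_row, m_row in zip(adjacency, mask)]


# Hypothetical 3-joint chain with self-connections.
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [0.0, 1.0, 1.0]]

ones_mask = [[1.0] * 3 for _ in range(3)]  # an all-ones mask leaves the graph unchanged
example_mask = [[1.0, 0.5, 1.0],
                [0.5, 1.0, 2.0],
                [1.0, 2.0, 1.0]]           # illustrative values only

unchanged = apply_edge_mask(A, ones_mask)
reweighted = apply_edge_mask(A, example_mask)
```

Note that the mask adds one trainable weight per matrix entry in every layer, which is the extra parameter cost the text refers to.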
I-D Our contribution
This work proposes an improved set of label mapping methods for the ST-GCN framework by introducing three split processes (full distance split, connection split, and index split) as an alternative approach for the convolution operation. It is based upon the ST-GCN framework proposed by Yan et al. [yan2018spatial]. Our results indicate that all of our proposed split strategies outperform the baseline model. Furthermore, the proposed frameworks are more stable during training. Finally, our proposals do not require the additional training parameters of the edge importance weighting applied by the ST-GCN model. This suggests that our proposal can provide a more suitable solution for real-time applications focused on recognizing activities of daily living in indoor environments.
The contributions are summarized below:
We present an improved set of label mapping methods for the ST-GCN framework by introducing three split processes (full distance split, connection split, and index split) as an alternative approach for the convolution operation.
Instead of the traditional way of extracting information from the skeleton without considering the relations between the joints, we exploit the relationship between the joints during the action execution to provide valuable and accurate information about the action performed.
We find that an extensive analysis of the inner skeleton joint information, obtained by partitioning the skeleton graph into as many pieces as possible, results in more accurate recognition.
We propose split strategies that focus on capturing the patterns in the relationship between the skeleton joints by carefully analyzing the partition strategies utilized to perform the movement modeling using the ST-GCN framework.
The rest of the paper is structured as follows: Section II presents a review of the state of the art in skeleton-graph-based action recognition. The details of the proposed skeleton partition strategies are presented in Section III. Section IV describes the experimental settings we use to obtain the results. The results and discussion are presented in Section V. Finally, Section VI concludes the paper.
II Related Literature
There has been previous work on activity recognition upon skeleton data. Due to the emergence of low-cost depth cameras, access to skeleton data has become relatively easy [vemulapalli2014human]. Therefore, there has been an increasing interest in using skeleton representations to recognize human activity in general. For the sake of conciseness, only a few of the most recent and relevant works are mentioned. Zhang et al. [zhang2019constructing] combined skeleton data with machine learning methods (such as logistic regression) upon benchmark datasets. They demonstrated that skeleton representations provide better performance in terms of accuracy than other forms of motion representation. In order to model the dependencies between joints and bones, Shi et al. [shi2019skeleton] presented a variety of graph network based on the Directed Acyclic Graph (DAG). Later, Cheng et al. [cheng2020skeleton] presented a shift-CNN-inspired method called Shift-GCN. Their approach aims to reduce the computational complexity of previous ST-GCN-based methods, and their results showed a 10× reduction in computational complexity. However, to the best of our knowledge, no unique partition strategies have been proposed to enhance the performance of AR using the ST-GCN model presented in [yan2018spatial].
III Proposed Split Strategies
In this section, we present a new set of methods to create the label mapping for the nodes in the neighbor sets of the skeleton graph. The techniques are modifications of the spatial configuration partitioning presented in [yan2018spatial]. As in the baseline model, a maximum distance of one node with respect to the root node defines the neighbor sets in the skeleton graph. However, every node in the neighbor set is labeled separately in every strategy presented in this section. Therefore, in every proposed approach, the filter size is K = 4. When a neighbor set has fewer than four nodes, the unused subsets of the kernel are filled with zeros. For instance, for a neighbor set consisting only of the root node and a single adjacent node, the third and fourth subsets of values of the kernel are set to zero. Each of the proposed split strategies is computed individually for each frame of a training video sample.
Fig. 4 illustrates our proposed partitioning strategy. As can be seen, a different label mapping is assigned to each node in the neighbor set. Therefore, a different subset of values of each filter is applied to each joint feature vector. The remaining challenge is to define the order (split criterion) of the nodes in the neighbor set. We propose three different approaches to address this issue: full distance split, connection split, and index split. These proposals are explained in the following sections.
III-A Full Distance Split
In this method, the partitioning for the label mapping is performed according to the distance of each node in the neighbor set with respect to the center of gravity of the skeleton, denoted $c_g$. This solution is similar to the spatial configuration partitioning approach explained previously. However, here we consider the distance of every node in the neighbor set; thus, we name it the full distance split method. Depending on the neighbor set in the skeleton, each kernel can have up to four subsets of values. Fig. 5(i) shows that each filter with four subsets of values is applied to the node feature vectors, in the order defined by their relative distances with respect to $c_g$, to create the output feature map. To explain this strategy, we define the set $D$ as the Euclidean distances of the adjacent nodes of the root node $v_{ti}$ with respect to $c_g$, sorted in ascending order:
\[
D = \{\, d_1, d_2, \dots, d_N \mid d_1 \le d_2 \le \dots \le d_N \,\}
\]
where $N$ is the number of adjacent nodes to the root node $v_{ti}$. For instance, $d_1$ and $d_N$ have the minimum and maximum values in $D$, respectively. In this strategy, the label mapping is given by:
\[
l_{ti}(v_{tj}) =
\begin{cases}
0 & \text{if } \lVert v_{tj} - c_g \rVert_2 = x_r \\
m & \text{if } \lVert v_{tj} - c_g \rVert_2 = d_m
\end{cases}
\]
where $l_{ti}(v_{tj})$ represents the label map for each joint $v_{tj}$ in the neighbor set of the root node $v_{ti}$, $x_r$ is the Euclidean distance from the root node $v_{ti}$ to $c_g$, and $m \in \{1, \dots, N\}$ is the position of the joint's distance in $D$.
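A minimal per-frame sketch of the full distance split is given below. It assumes 2D joint coordinates, a precomputed center of gravity, and the convention that the root node receives label 0 while its adjacent nodes receive labels 1, 2, ... in order of increasing distance; the function name, joint names, and coordinates are hypothetical.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def full_distance_labels(root: str, neighbors: Dict[str, Point], center: Point) -> Dict[str, int]:
    """Rank the adjacent nodes of `root` by Euclidean distance to `center`."""
    labels = {root: 0}  # assumed convention: the root node takes label 0
    ordered = sorted(neighbors, key=lambda name: math.dist(neighbors[name], center))
    for rank, name in enumerate(ordered, start=1):
        labels[name] = rank  # closest adjacent node gets 1, farthest gets N
    return labels


# Hypothetical frame: distances to the center are B = 5, C = 1, D = 2.
center_of_gravity = (0.0, 0.0)
adjacent = {"B": (3.0, 4.0), "C": (1.0, 0.0), "D": (0.0, 2.0)}
fd_labels = full_distance_labels("A", adjacent, center_of_gravity)
```

Because the ranking depends on the joint positions, the labels may change from one frame to the next, which is why the split is computed per frame.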
III-B Connection Split
In this approach, the number of adjacent joints of each joint (i.e., the joint degree) represents the split criterion in the neighbor set. Thus, the more connections the joint has, the higher priority is assigned to it. Fig. 5(ii) shows that the joint with label A represents the root node, and B is the joint with the highest priority since it has three adjacent joints connected. We observe that both C and D joints have two connections. Hence, the priority for these nodes is set randomly. Once the joint priorities have been set, the convolution operation is performed with a subset of values of each filter for every joint in the neighbor set independently.
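To define the label mapping in this approach, we first define the neighbor set $B(v_{ti})$ of a root node $v_{ti}$ and its adjacent nodes $v_{tj}$ as in [yan2018spatial]. We also define the degree matrix of the skeleton graph as $\Lambda$, a diagonal matrix whose entry $\Lambda^{jj}$ is the degree of the node $v_{tj}$. Therefore, the entries $\Lambda^{jj}$ with $v_{tj} \in B(v_{ti})$ contain the degree value of each of the adjacent nodes of the root node $v_{ti}$. Similarly, we define the set $C$ as the degree values of the adjacent nodes of the root node $v_{ti}$, sorted in descending order:
\[
C = \{\, c_1, c_2, \dots, c_N \mid c_1 \ge c_2 \ge \dots \ge c_N \,\}
\]
For instance, $c_1$ and $c_N$ have the maximum and minimum values of $C$, respectively. Finally, the label mapping is defined as:
\[
l_{ti}(v_{tj}) =
\begin{cases}
0 & \text{if } \Lambda^{jj} = c_r \\
m & \text{if } \Lambda^{jj} = c_m
\end{cases}
\]
where $l_{ti}(v_{tj})$ represents the label map for each adjacent joint $v_{tj}$ to the root node $v_{ti}$ in the neighbor set, $c_r$ is the degree of the root node $v_{ti}$, and $m \in \{1, \dots, N\}$ is the position of the joint's degree in $C$.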
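The connection split can be sketched as follows for a hypothetical skeleton fragment that mirrors the example in the text (B has three connections, C and D have two each). The function name, the bone list, and the convention that the root takes label 0 are assumptions; ties are broken at random, as described above.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Tuple


def connection_split_labels(root: str, bones: List[Tuple[str, str]],
                            rng: Optional[random.Random] = None) -> Dict[str, int]:
    """Rank the adjacent nodes of `root` by joint degree, highest degree first."""
    rng = rng or random.Random()
    degree = Counter(joint for bone in bones for joint in bone)
    neighbors = [b if a == root else a for a, b in bones if root in (a, b)]
    rng.shuffle(neighbors)  # random order first, so that ties end up broken at random
    neighbors.sort(key=lambda joint: degree[joint], reverse=True)  # stable sort keeps tie order
    labels = {root: 0}  # assumed convention: the root node takes label 0
    for rank, joint in enumerate(neighbors, start=1):
        labels[joint] = rank
    return labels


# Hypothetical fragment: A is the root; B has 3 connections, C and D have 2 each.
toy_bones = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "E"), ("B", "F"), ("C", "G"), ("D", "H")]
cs_labels = connection_split_labels("A", toy_bones, rng=random.Random(0))
```

Unlike the distance-based splits, the degrees depend only on the skeleton topology, so the ranking does not change with the joint positions (apart from the random tie-breaking).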
III-C Index Split
The skeleton data utilized for our study is gathered using the OpenPose [cao2019openpose] library. According to the library documentation, the output file with the skeleton information consists of key points. The output skeleton provided by the OpenPose toolbox is shown in Fig. 6.
In this approach, the value of the index of each keypoint defines the priority criterion of the neighbor set. An illustrative example is shown in Fig. 5(iii). For instance, joint B is assigned the highest priority since it has a keypoint index value of 1, and C is the joint with the second priority since it has a keypoint index value of 3. Finally, D is the joint with the lowest priority since it has a keypoint index value of 8. Therefore, we define the set $P$ as the keypoint indexes of the adjacent nodes of the root node $v_{ti}$, sorted in ascending order:
\[
P = \{\, p_1, p_2, \dots, p_N \mid p_1 < p_2 < \dots < p_N \,\}
\]
where $N$ is the number of adjacent nodes to the root node $v_{ti}$. The label mapping is therefore defined as:
\[
l_{ti}(v_{tj}) =
\begin{cases}
0 & \text{if } \operatorname{ind}(v_{tj}) = p_r \\
m & \text{if } \operatorname{ind}(v_{tj}) = p_m
\end{cases}
\]
where $l_{ti}(v_{tj})$ represents the label map for each joint $v_{tj}$ in the neighbor set of the root node $v_{ti}$, $\operatorname{ind}(v_{tj})$ is the keypoint index of the joint $v_{tj}$, $p_r$ is the index of the keypoint corresponding to the root node $v_{ti}$, and $m \in \{1, \dots, N\}$ is the position of the joint's index in $P$.
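The index split reduces to sorting the keypoint indexes of the adjacent nodes. The sketch below reuses the indexes from the example in the text (1, 3, and 8); the root index of 2, the function name, and the convention that the root takes label 0 are assumptions made for illustration.

```python
from typing import Dict, Iterable


def index_split_labels(root_index: int, neighbor_indices: Iterable[int]) -> Dict[int, int]:
    """Rank the adjacent nodes of the root by ascending keypoint index."""
    labels = {root_index: 0}  # assumed convention: the root node takes label 0
    for rank, index in enumerate(sorted(neighbor_indices), start=1):
        labels[index] = rank  # lowest keypoint index gets the highest priority
    return labels


# Keypoint indexes of joints B, C and D from the example; root index 2 is hypothetical.
is_labels = index_split_labels(root_index=2, neighbor_indices=[8, 1, 3])
```

Since keypoint indexes are fixed by the skeleton layout, this ordering needs no distance computation and is identical in every frame.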
To evaluate the performance of our proposed partitioning schemes, we train our models on two benchmark datasets: the NTU RGB+D [shahroudy2016ntu] and the Kinetics [kay2017kinetics] dataset. These two datasets were considered in order to provide a valid comparison with the original ST-GCN framework.
To date, the NTU-RGB+D is known to be the most extensive dataset with 3D joint annotations for human action recognition tasks [yan2018spatial]. The samples have been recorded using the Microsoft Kinect V2 camera. To take full advantage of the chosen camera device, each action sample consists of a depth map modality, 3D joint information, RGB frames, and IR sequences. The joint information provided by this dataset consists of the three-dimensional location of the 25 main joints of the human body.
In their study, Shahroudy et al. [shahroudy2016ntu] proposed two evaluation criteria for the NTU-RGB+D dataset: the Cross-Subject (X-sub) and the Cross-View (X-view) evaluations. In the first approach, the train/test split for evaluation is based upon groups of subjects performing the action; the data corresponding to 20 participants is used for training and the remaining samples for testing. The Cross-View evaluation approach, on the other hand, uses the camera view as the criterion for the train/test split; the data collected by camera 1 is used for testing and the data collected by the other two cameras is used for training.
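The Cross-View criterion can be sketched as a simple filter on the camera identifier. The sketch assumes that each sample name embeds the camera as the letter `C` followed by three digits (for instance `S001C002P003R002A013`); this naming pattern, the function name, and the sample names are assumptions made for the example.

```python
import re
from typing import Iterable, List, Tuple

CAMERA_PATTERN = re.compile(r"C(\d{3})")  # assumed: 'C' followed by a three-digit camera ID


def cross_view_split(sample_names: Iterable[str], test_camera: int = 1) -> Tuple[List[str], List[str]]:
    """Send samples from `test_camera` to the test set and all others to the training set."""
    train, test = [], []
    for name in sample_names:
        match = CAMERA_PATTERN.search(name)
        if match is None:
            raise ValueError(f"no camera ID found in sample name: {name}")
        (test if int(match.group(1)) == test_camera else train).append(name)
    return train, test


# Hypothetical sample names: the same action seen by cameras 1, 2 and 3.
names = ["S001C001P001R001A001", "S001C002P001R001A001", "S001C003P001R001A001"]
train_names, test_names = cross_view_split(names)
```

The Cross-Subject criterion could be written the same way by filtering on the subject identifier instead of the camera identifier.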
The NTU-RGB+D dataset provides a total of 56,880 action clips covering 60 different actions classified into three major groups: daily actions, health-related actions, and mutual actions. Forty participants performed the action samples. Each sample has been captured simultaneously with 3 different cameras located at the same height but at different angles. Later, this dataset was extended to twice its size by adding 60 more classes and another 57,600 video samples [shahroudy2016ntu]. This extended version is called NTU RGB+D 120 (120-class NTU RGB+D dataset). By considering only the 3D skeleton modality of the NTU-RGB+D dataset, the storage is reduced from 136 GB to 5.8 GB. Therefore, the computation time is reduced considerably.
While the NTU-RGB+D dataset is widely known to be the largest in-house captured action recognition dataset, the DeepMind Kinetics human action dataset is the largest set of unconstrained action recognition samples. The 306,245 videos provided by the Kinetics dataset are obtained from YouTube. Each video sample is supplied without previous editing, so that resolution and frame rate vary across samples, and is classified into one of 400 different action classes.
Due to the vast number of classes, one video sample can be classified into more than one class. For instance, a video sample of a person texting while driving a car can be classified with the “texting” label or the “driving a car” label. Therefore, the authors in [kay2017kinetics] suggest considering a top-5 performance evaluation rather than a top-1 approach. That is, a labeled sample is considered a true positive if its ground truth label appears within the 5 classes with the highest scores predicted by the model (top-5), as opposed to considering only the predicted class with the highest score (top-1).
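The top-k criterion described above can be written compactly as follows. This is a minimal sketch with a hypothetical function name and made-up scores for three samples over four classes; it is not the evaluation code used in the experiments.

```python
from typing import List, Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose ground truth label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda cls: row[cls], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Made-up class scores for three samples and four classes.
example_scores: List[List[float]] = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.3, 0.4, 0.1],
]
example_labels = [1, 2, 0]

top1 = top_k_accuracy(example_scores, example_labels, k=1)
top2 = top_k_accuracy(example_scores, example_labels, k=2)
```

In this toy example only the first sample is correct under top-1, while the second sample also counts as correct under top-2 because its ground truth class has the second-highest score.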
The Kinetics dataset provides the raw RGB format videos. Therefore, it requires the skeleton information to be extracted from the sample videos. Accordingly, we use the dataset that contains the Kinetics-skeleton information provided by Yan et al. [yan2018spatial] for our experiments.
IV-B Model Implementation
The experimental process comprises three stages: data splitting, ST-GCN model setup, and model training. These stages are explained as follows:
Data Splitting
Each dataset is divided into two subsets: the training set and the validation set. In our experiments, we use a 3:1 ratio for the training and validation split, respectively.
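A 3:1 split of this kind can be sketched as below. Whether the samples are shuffled before splitting is not stated in the text, so the seeded shuffle, like the function name, is an assumption of this example.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_3_to_1(samples: Iterable[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and keep three quarters for training, one quarter for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment with a fixed seed
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]


train_part, val_part = split_3_to_1(range(8))
```

With eight samples this yields six training samples and two validation samples.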
ST-GCN Model Setup
The ST-GCN model uses the baseline architecture. It consists of a stack of 9 layers divided into three blocks of 3 layers each. The layers of the first block have 64 output channels each, and those of the second and third blocks have 128 and 256 output channels, respectively. Finally, the 256-dimensional feature vector output by the last layer is fed into a Softmax classifier to predict the performed action [yan2018spatial].
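The channel layout of the nine layers can be enumerated with a short sketch. The input width of 3 channels (for example two coordinates plus a confidence value per joint) and the function name are assumptions made for illustration; the block widths follow the description above.

```python
from typing import List, Sequence, Tuple


def stgcn_channel_plan(in_channels: int = 3,
                       block_widths: Sequence[int] = (64, 128, 256),
                       layers_per_block: int = 3) -> List[Tuple[int, int]]:
    """List the (input channels, output channels) pair of every layer in the stack."""
    plan = []
    current = in_channels
    for width in block_widths:
        for _ in range(layers_per_block):
            plan.append((current, width))
            current = width  # the next layer consumes this layer's output
    return plan


channel_plan = stgcn_channel_plan()
```

The last pair in `channel_plan` has 256 output channels, which is the size of the feature vector passed to the classifier.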
Model Training
The models are trained using stochastic gradient descent (SGD) with learning rate decay as the optimization algorithm. The initial learning rate is 0.1. The number of epochs and the decay schedule vary depending on the dataset used. For the NTU-RGB+D dataset, we train the models for 80 epochs, and the learning rate decays by a factor of 0.1 on the 10th and the 50th epochs. For the Kinetics dataset, we train the models for 50 epochs, and the learning rate decays by a factor of 0.1 every 10 epochs. Similarly, the batch size varies according to the dataset: for the NTU-RGB+D dataset, the batch sizes for training and testing are 32 and 64, respectively, whereas for the Kinetics dataset they are 128 and 256, respectively. To avoid overfitting, a weight decay value of 0.0001 is used. Additionally, a dropout value of 0.5 is set for the NTU-RGB+D experiments. To provide a valid comparison with the baseline model, an M-mask implementation is considered in the experiments presented in this study.
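The two decay schedules can be expressed with one step function, sketched below. The sketch assumes that epochs are counted from 0 and that the decay takes effect once the epoch index reaches a milestone; the function name and this indexing convention are assumptions, not details given in the text.

```python
import math
from typing import Sequence


def step_lr(epoch: int, base_lr: float = 0.1,
            milestones: Sequence[int] = (10, 50), gamma: float = 0.1) -> float:
    """Learning rate after multiplying by `gamma` once for every milestone already reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


NTU_MILESTONES = (10, 50)                       # decay on the 10th and 50th epochs
KINETICS_MILESTONES = tuple(range(10, 50, 10))  # decay every 10 epochs over 50 epochs

lr_ntu_final = step_lr(79, milestones=NTU_MILESTONES)
lr_kinetics_final = step_lr(49, milestones=KINETICS_MILESTONES)
```

Under these assumptions the NTU-RGB+D schedule ends at a learning rate of about 0.001 and the Kinetics schedule at about 0.00001.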
V Experimental Results and Discussion
This section discusses the performance of our proposals against the benchmark ST-GCN models based on [yan2018spatial] using the spatial configuration partition approach. This strategy provides the best performance in terms of accuracy in [yan2018spatial]. Therefore, it has been chosen as the baseline to demonstrate the effectiveness of the partition strategies introduced in this study.
V-A Results Evaluation on NTU-RGB+D
Note that we aim to recognize ADL in an indoor environment. Therefore, the NTU-RGB+D dataset serves as a more accurate reference than the Kinetics dataset since it was recorded under the same conditions. Hence, we focus on the results obtained with this dataset. We use the 3D joint information provided in [shahroudy2016ntu] in our experiments. Table I compares the performance of our proposals with that of the state-of-the-art ST-GCN framework. It can be observed that all of our partition strategies outperform the spatial configuration strategy of the ST-GCN. For the X-sub benchmark, the connection split achieves the highest performance with 82.6% accuracy, more than 1 percentage point higher than the ST-GCN performance. On the X-view benchmark, the index split outperforms the rest of the strategies with 90.5% accuracy, more than 2 percentage points higher than the ST-GCN performance.
|Model|Partitioning strategy|X-sub|X-view|
|---|---|---|---|
|ST-GCN|Spatial configuration partitioning|81.5%|88.3%|
|Ours|Full distance split|81.6%|89.3%|
Figs. 6(a), 6(b), 6(c) and 6(d) show the training behavior of the models using the spatial configuration partitioning of the ST-GCN framework and the proposed connection split on both the X-sub and X-view benchmarks without the M-mask implementation. The blue and orange plots show the performance of the models on the training and validation sets, respectively. The training score plots show that the learning performance of the proposed connection split is more stable as it increases over time than that of the ST-GCN. This is a considerable advantage of our proposals over the benchmark framework, because it demonstrates that the M-mask is not required to yield satisfactory performance. Omitting the M-mask reduces the computational complexity. Hence, our proposal can provide a more suitable solution for real-time applications. Moreover, given its advantage in accuracy and time consumption, our proposed method offers a practical solution for an ADL recognition system.
V-B Performance on the Kinetics Dataset
The recognition performance on the Kinetics dataset has been evaluated using the top-1 and top-5 criteria. We compare the performance of our proposed methods with that of the ST-GCN framework, as shown in Table II.
|Model|Partitioning strategy|Top-1|Top-5|
|---|---|---|---|
|ST-GCN|Spatial configuration partitioning|30.7%|52.8%|
|Ours|Full distance split|31.7%|54.5%|
As the results indicate, all of our partition strategies outperform the spatial configuration strategy of the ST-GCN under the top-5 criterion. The full distance split approach achieves 54.5% accuracy, which is 1.7 percentage points higher than the performance obtained with the baseline model. Under the top-1 criterion, our proposals perform at least as well as the ST-GCN model. The highest top-1 performance is 31.7% accuracy, again achieved by the full distance split approach, which is 1 percentage point higher than the result obtained with the ST-GCN model. Therefore, we can conclude that the performance metrics presented in Table II validate the superiority of the proposed full distance split method on the Kinetics dataset.
In this work, we propose an improved set of label mapping methods for the ST-GCN framework (full distance split, connection split, and index split) as an alternative approach for the convolution operation. Our results indicate that all of our split processes outperform the previous partitioning strategies for the ST-GCN framework. Moreover, they prove to be more stable during training without using the additional training parameters of the edge importance weighting applied by the baseline model. Therefore, our current split proposals can provide a more suitable solution than the baseline strategies of the ST-GCN framework for real-time applications focused on recognizing activities of daily living in indoor environments.
A significant computational effort is involved in using heterogeneous methods to calculate the distances between the joints and the center of gravity for each frame of the video sample for the full distance split and the spatial configuration partitioning. It would be computationally less demanding to use a homogeneous technique to calculate these distances for both splitting strategies. Furthermore, while our current methodology considers greater distances from the root node to perform the skeleton partitioning, additional flexibility can be gained by increasing the number of joints per neighbor set. This may make it possible to cover larger body sections (such as limbs) and thus to find more complex relationships between the joints during the execution of the actions.