SIA-GCN: A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation

09/25/2020
by   Deying Kong, et al.

Graph Neural Networks (GNNs) generalize neural networks from applications on regular structures to applications on arbitrary graphs, and have shown success in many application domains such as computer vision, social networks and chemistry. In this paper, we extend GNNs along two directions: a) allowing features at each node to be represented by 2D spatial confidence maps instead of 1D vectors; and b) proposing an efficient operation to integrate information from neighboring nodes through 2D convolutions with different learnable kernels at each edge. The proposed SIA-GCN can efficiently extract spatial information from 2D maps at each node and propagate them through graph convolution. By associating each edge with a designated convolution kernel, the SIA-GCN could capture different spatial relationships for different pairs of neighboring nodes. We demonstrate the utility of SIA-GCN on the task of estimating hand keypoints from single-frame images, where the nodes represent the 2D coordinate heatmaps of keypoints and the edges denote the kinetic relationships between keypoints. Experiments on multiple datasets show that SIA-GCN provides a flexible and yet powerful framework to account for structural constraints between keypoints, and can achieve state-of-the-art performance on the task of hand pose estimation.

1 Introduction

Hand pose estimation is a long-standing research area in computer vision, given its vast potential applications in computer interaction, augmented reality, virtual reality and so on [DoostiSurvey]. It aims to infer the 2D or 3D positions of hand keypoints from a single input image or a sequence of images, which may be RGB, RGB-D or grayscale. Although 3D hand pose estimation is drawing increasing attention in the research community [wang2019geometric, malik2020handvoxnet, xiong2019a2j, wan2019self, yuan2018depth, ge2018point], 2D hand pose estimation remains a valuable and challenging problem [simon2017hand, wang2018mask, kong2019adaptive]. Many 3D hand pose estimation algorithms rely on their 2D counterparts [cai2018weakly, zimmermann2017learning], attempting to lift 2D predictions to 3D space. In this paper, we investigate the problem of 2D hand pose estimation from a single RGB image.

Progress in hand pose estimation research has been boosted greatly by deep Convolutional Neural Networks (CNNs). Deep CNN models like Convolutional Pose Machine [wei2016convolutional] and Stacked Hourglass [newell2016stacked] have been successfully applied to 2D hand pose estimation, though they were originally proposed for human pose estimation. Some methods [kong2020rotation, kong2019adaptive, chen2014articulated] also integrate deep CNNs with probabilistic graphical models to combine the powerful representation ability of deep CNNs with the graphical model's capability of explicitly expressing spatial relationships.

In contrast to CNNs, graph neural networks can handle irregularly structured data. The joints of a human body and the keypoints of a hand can be conveniently treated as irregular graphs, which makes it possible to apply the Graph Convolutional Network (GCN) [kipf2016semi] to human/hand pose estimation tasks. However, in the vanilla GCN [kipf2016semi], all nodes share the same one-hop propagation weight matrix, which makes it ill-suited to pose estimation because different body joints and bones carry different semantics. The authors of [doosti2020hope, zhao2019semantic, cai2019exploiting] have proposed variants of the vanilla GCN for human or hand pose estimation. However, all these methods take a one-dimensional vector as input for each node, and the node feature at each layer is always a one-dimensional vector, so they cannot directly process 2D confidence maps. Moreover, although [doosti2020hope, zhao2019semantic, cai2019exploiting] modify the vanilla GCN, they still do not allow full independence among the edges.

In this paper we propose the Spatial Information Aware Graph Neural Network with 2D convolutions (SIA-GCN). In SIA-GCN, the feature of each node is a two-dimensional matrix, and information propagation to neighboring nodes is carried out via 2D convolutions along each edge. By using 2D convolutions instead of flattening the 2D feature map to a 1D vector and then performing linear multiplications, the spatial information encoded in the feature map is preserved and appropriately exploited. We also propose to use different 2D convolutional kernels on different edges, aiming to capture different spatial relationships for different pairs of neighboring nodes. The SIA-GCN is flexible and can be easily combined with off-the-shelf 2D pose estimators. In this work, we demonstrate the efficacy of SIA-GCN on 2D hand pose estimation. For this application, the 2D feature maps at the nodes are the confidence maps of the hand keypoint positions. With a designated kernel for each edge, the SIA-GCN can capture various spatial relationships between different pairs of hand keypoints.

Our main contributions are threefold:

  • We propose the novel SIA-GCN which can process 2D confidence maps for each node efficiently and effectively, by integrating graph neural networks and 2D convolutions. Using 2D convolutions, our SIA-GCN can exploit and harvest the spatial information provided in the 2D feature maps.

  • By assigning different convolutional kernels on different edges, the SIA-GCN has the property of full edge-awareness. Distinct spatial relationships can be learned on different edges.

  • We deploy SIA-GCN in the task of hand pose estimation. Utilizing SIA-GCN, the constructed neural network can achieve state-of-the-art performance.

2 Related Work

There exists a vast body of research on human/hand pose estimation [simon2017hand, wang2019geometric, malik2020handvoxnet, xiong2019a2j, wan2019self, wang2020predicting, yuan2018depth, baek2018augmented, wan2018dense, ge2018point, mueller2017real, ge2016robust] and graph neural networks [simon2017hand, doosti2020hope, zhao2019semantic, cai2019exploiting]. Here, we focus on 2D hand pose estimation from single RGB images and on applications of graph convolutional networks [simon2017hand] to pose estimation tasks.

2D hand pose estimation. Studies of RGB-image-based 2D hand pose estimation have long benefited from those of human pose estimation, where deep Convolutional Neural Networks (CNNs) have enjoyed great success [toshev2014deeppose, wei2016convolutional, newell2016stacked, xiao2018simple, chen2018cascaded, sun2019deep]. Among these deep CNN models, Convolutional Pose Machines [wei2016convolutional] and Stacked Hourglass [newell2016stacked] are commonly used in various RGB-based 2D hand pose estimation methods [simon2017hand, kong2020rotation, kong2019adaptive, chen2020nonparametric, wang2018mask]. Alongside deep CNNs, Graphical Models (GMs) have also played a significant role in solving the pose estimation task, since they can model spatial constraints among the joints explicitly. Recently, several works in pose estimation combine GMs and neural networks to fully exploit the structural information [tompson2014joint, chen2014articulated, song2017thin, yang2016end, kong2019adaptive, kong2020rotation]. Traditionally, GMs with fixed parameters [tompson2014joint, song2017thin, chen2014articulated] are applied to the pose estimation task, while the most recent works [kong2019adaptive, kong2020rotation] adopt GMs with adaptive parameters conditioned on the input image. Although both lines of work take advantage of structural information, our proposed method is based on a graph convolutional network, whereas these previous works [kong2019adaptive, kong2020rotation] are based on graphical models.

Graph convolutional network. Graph Convolutional Networks (GCNs), which generalize deep CNNs to graph-structured data, have attracted increasing attention in recent years. One main research direction defines graph convolutions from the spectral perspective [shuman2013emerging], while the other works in the spatial domain [kipf2016semi]. For a comprehensive survey on GCNs, we refer readers to [wu2020comprehensive]. The works most closely related to ours are [doosti2020hope, zhao2019semantic, cai2019exploiting], in which variants of spatial GCNs have been proposed and applied to human/hand pose estimation tasks in the computer vision field. In the following, we discuss the key differences between our SIA-GCN and those in [doosti2020hope, zhao2019semantic, cai2019exploiting].

In [cai2019exploiting], the authors propose to classify neighboring nodes according to their semantic meanings and use different kernels for different classes of neighboring nodes. The purpose of their GCN is to regress 3D position vectors from 2D position vectors, and the input to the GCN for each node is a one-dimensional vector representing the predicted 2D position of the corresponding body joint. In contrast, our proposed SIA-GCN handles two-dimensional confidence maps at each node. A confidence map inherently contains much more information than a two-element position vector. Our goal is to refine the final 2D predictions rather than to lift 2D predictions to 3D space. Besides, instead of classifying nodes into different classes, we treat every edge independently and attach a designated weight kernel to each edge.

In [doosti2020hope], the authors directly adopt the propagation rule from [kipf2016semi] with the modification that, instead of using a predefined adjacency matrix, they have proposed to use an adaptive adjacency matrix which could be learned from data. The feature for each node is a one dimensional vector. Our method differs from [doosti2020hope] in that edge-dependent weights are considered explicitly and our SIA-GCN works on 2D confidence maps for each node.

In [zhao2019semantic], the proposed Semantic Graph Convolution (SemGConv) adds a learnable weighting matrix to conventional graph convolutions from [kipf2016semi]. The weight matrix serves as a weighting mask on the edges of a node when information aggregation is performed. The SemGConv is inherited from ST-GCN [yan2018spatial], but is equipped with additional important features such as softmax non-linearity and channel wise masks. The weighting mask adds a scalar importance weight (or a vector if it’s channel wise) to each edge. However, in SIA-GCN, we directly attach to each edge a fully independent convolution matrix. Besides, our SIA-GCN works on 2D node features with spatial information awareness.

3 Methodology

In this section, we present the SIA-GCN and its application to hand pose estimation. We refer to the resulting pose estimator as SiaPose, which is illustrated in Fig. 1.

Figure 1: System diagram of the SiaPose, utilizing SIA-GCN.

The SiaPose takes as input an RGB image, to which a preliminary pose estimator is applied. The preliminary pose estimator could be any 2D pose estimator, such as the Convolutional Pose Machine [wei2016convolutional] or Stacked Hourglass [newell2016stacked], which outputs a set of confidence maps of keypoint positions. Then, in the top branch, the confidence maps are fed into a block of multi-head SIA-GCNs. Each SIA-GCN processes a copy of the confidence maps in parallel and independently. Meanwhile, in the bottom branch, the input image goes through a pointer network, which produces a weight vector indicating how important each head of the multi-head SIA-GCNs is. Finally, at the information fusion stage, the confidence maps output by the multi-head SIA-GCNs are aggregated according to the weight vector.

In the following subsections, we revisit the graph convolutional network first, and discuss the motivation for our SIA-GCN. Then, we present a compact formulation of our proposed edge-aware graph convolutional layers in SIA-GCN, and demonstrate how to implement it efficiently using 2D convolutional operations. Finally, we describe the training procedure of the SiaPose.

3.1 Revisiting Graph Convolutional Network

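The Graph Convolutional Network (GCN) proposed in [kipf2016semi] has enjoyed great success on a variety of applications since its advent. Given a graph $G = (V, E)$ with $N$ nodes $v_i \in V$, edges $(v_i, v_j) \in E$, adjacency matrix $A \in \mathbb{R}^{N \times N}$, and a degree matrix $D$ with $D_{ii} = \sum_j A_{ij}$, the layer-wise propagation rule is characterized by the following equation

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \qquad (1)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph with self-connections [kipf2016semi], $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $\sigma(\cdot)$ is a non-linear activation function. $H^{(l)} \in \mathbb{R}^{N \times d}$ is the matrix of activations in the $l$-th layer, or input feature matrix of the $l$-th layer. The parameter $W^{(l)}$ is the trainable weight matrix of layer $l$.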
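To make the propagation rule in Eq. (1) concrete, the following is a minimal NumPy sketch of a single vanilla GCN layer on a toy three-node path graph. The function name `gcn_layer`, the choice of ReLU as the activation, and the toy sizes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One vanilla GCN propagation step as in Eq. (1); ReLU is an assumed activation."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                  # adjacency with self-connections
    d = A_tilde.sum(axis=1)                  # diagonal of the degree matrix of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D_tilde^{-1/2}
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy path graph 0 - 1 - 2 with 4-dimensional node features (hypothetical sizes).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
H_next = gcn_layer(A, H, W)
print(H_next.shape)  # (3, 2)
```

Note that the same weight matrix `W` is applied to every node, which is the limitation discussed in the remainder of this subsection.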

In the scenario of human and hand pose estimation, it is well studied that probabilistic graphical models could be deployed to enhance structural consistency [tompson2014joint, kong2020rotation, chen2014articulated]. The graphical model could take in some preliminarily generated 2D confidence maps of each body joint or hand points. These confidence maps are usually considered as the unary potential functions by the graphical model. Then the graphical model could impose some learned pairwise potential functions on the initial confidence maps, thus enforcing spatial consistency of the body joints/keypoints. Can we also apply GCN to the confidence maps and then enhance spatial consistency?

The answer is positive, but it is not trivial. To apply the above GCN to pose estimation, some modifications are needed due to the dimensionality. In Eq. (1), the activation matrix $H^{(l)}$ is two-dimensional: its rows correspond to the $N$ nodes, and each node is associated with a 1D feature vector of size $d$. However, in 2D pose estimation, each graph node (usually corresponding to a joint or keypoint) is associated with a two-dimensional confidence map. This discrepancy could be handled by flattening the confidence map into a single long vector and then performing layer propagation according to Eq. (1). However, this would result in a very large feature size and significantly increase the computational complexity (for instance, a $64 \times 64$ map would become a one-dimensional vector of size 4096). Besides, flattening would corrupt the spatial information encoded in the confidence map. Thus, we propose to use 2D convolutional operations directly on 2D confidence maps when propagating information along the edges.
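The cost argument can be illustrated with a few lines of arithmetic. The sketch below assumes an illustrative $64 \times 64$ confidence map, a dense weight matrix that keeps the flattened feature size unchanged, and a square $45 \times 45$ kernel (the kernel size of 45 is reported later in the implementation details; treating it as square is an assumption).

```python
# Parameter count: flattening + dense weight matrix vs. one 2D kernel per edge.
map_h, map_w = 64, 64                 # illustrative confidence-map size
flat_dim = map_h * map_w              # length of the flattened node feature
dense_params = flat_dim * flat_dim    # square weight matrix acting on flattened features
kernel_side = 45                      # assumed square kernel
conv_params = kernel_side * kernel_side   # parameters of a single per-edge kernel

print(flat_dim, dense_params, conv_params)  # 4096 16777216 2025
```

Under these assumptions, a single dense weight matrix on flattened maps has several thousand times more parameters than one 2D kernel, even before counting one kernel per edge.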

Moreover, in Eq. (1), since all the nodes share the same weight matrix and information aggregation is controlled only by the adjacency relationships between nodes, it would be difficult for the propagation rule in Eq. (1) to characterize different positional relationships for different pairs of neighboring joints. For example, the positional information propagated between two neighboring thumb joints should differ from that between neighboring joints on the middle finger. One simple reason is that the bones of the thumb and the middle finger have different lengths.

3.2 SIA-GCN

To resolve the above mentioned concerns, we propose the spatial information aware graph neural network with 2D convolutions (SIA-GCN), where each edge of the graph is associated with an individual learnable 2D convolutional kernel. A toy example of a graph consisting of four nodes is shown in Fig. 2, where green matrices represent 2D features (heatmaps) at each node and red matrices represent designated 2D kernels associated with each edge.

Figure 2: A simple illustration of SIA-GCN.

For the task of hand pose estimation, we define a graph $G = (V, E)$, where $V$ is the set of $K$ nodes corresponding to hand keypoints, and $E$ is the set of edges encoding the neighboring relationships among the keypoints. Each node $v_k \in V$ is associated with a 2D confidence map $F_k \in \mathbb{R}^{H \times W}$, which encodes the positional information of the $k$-th keypoint. We stack all $F_k$ for $k = 1, \dots, K$ into a 3D matrix and denote it as $F \in \mathbb{R}^{K \times H \times W}$.

One important feature of our SIA-GCN is that each edge $e \in E$ is associated with an individual weight matrix, or 2D convolutional kernel, $W_e \in \mathbb{R}^{h \times w}$. Again, we stack all $W_e$ into a single matrix $W \in \mathbb{R}^{N_e \times h \times w}$, where $N_e$ is the number of directed edges; $W$ is the set of learnable parameters of the edge-aware graph convolutional layer. The information propagated from node $v_i$ to node $v_j$ along edge $e = (v_i, v_j)$ is obtained by calculating the 2D convolution $F_i \circledast W_e$. Then, all the information propagated into node $v_j$ is aggregated according to the adjacency matrix. The propagation rule can be written compactly with matrix multiplications and convolutions as

$$F^{(l+1)} = \sigma\left(\hat{A}\left(\left(B F^{(l)}\right) \circledast W^{(l)}\right)\right) \qquad (2)$$

where the superscripts $(l)$ and $(l+1)$ denote the $l$-th and $(l+1)$-th layers respectively, $\circledast$ is the channel-wise 2D convolution operator, and $\sigma(\cdot)$ is the non-linear activation function. The matrix $B \in \mathbb{R}^{N_e \times K}$ is the broadcast matrix, which broadcasts node features to their outgoing edges. Note that the matrix multiplication $B F^{(l)}$ results in a shape of $N_e \times H \times W$, whereas originally the dimension of $F^{(l)}$ is $K \times H \times W$. In other words, the operation $B F^{(l)}$ simply prepares the input along each edge for the following channel-wise convolution, $\left(B F^{(l)}\right) \circledast W^{(l)}$. Finally, the matrix $\hat{A} \in \mathbb{R}^{K \times N_e}$ is the aggregation matrix, which harvests all the information from the incoming edges to the graph nodes.

It is worth pointing out that, in Eq. (2), only $W^{(l)}$ is learnable, while the broadcast matrix $B$ and the aggregation matrix $\hat{A}$ are both determined and constructed from the graph's adjacency matrix by Algorithm 1. In Algorithm 1, we assume the input adjacency matrix already includes self-connections.

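procedure ConstructMatrices($A$)    ▷ Input $A$ is the adjacency matrix
    Find the number of directed edges, $N_e$, from $A$
    Find the number of nodes, $K$, from $A$
    Initialize    ▷ Initialization for $B$ and $\hat{A}$
        $B$ as a zero matrix of size $N_e \times K$
        $\hat{A}$ as a zero matrix of size $K \times N_e$
        $t$ as a zero vector of size $N_e$
        $e \leftarrow 1$
    for $i$ in $1, \dots, K$ do    ▷ Calculate for $B$
        for $j$ in $1, \dots, K$ do
            if $A_{ij} = 1$ then    ▷ If $v_i$ is the starting node of edge $e$
                $B_{ei} \leftarrow 1$
                $t_e \leftarrow j$    ▷ Record the end node of edge $e$
                $e \leftarrow e + 1$
    for $e$ in $1, \dots, N_e$ do    ▷ Calculate for $\hat{A}$
        $\hat{A}_{t_e e} \leftarrow 1$
    Construct the diagonal degree matrix $D$, with $D_{ii} = \sum_j \hat{A}_{ij}$.
    Set $\hat{A} \leftarrow D^{-1} \hat{A}$    ▷ Normalize $\hat{A}$
    return $B$, $\hat{A}$
Algorithm 1 Broadcast and Aggregation Matrices Construction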
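The construction in Algorithm 1 and the edge-aware propagation of Eq. (2) can be sketched together in NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, the row normalization of the aggregation matrix, the zero-padded "same"-size convolution (computed as cross-correlation, as is common in deep learning libraries), and the ReLU activation are all assumptions.

```python
import numpy as np

def construct_matrices(A):
    """Build the broadcast matrix B (E x N) and aggregation matrix A_hat (N x E).

    A is a binary adjacency matrix that already contains self-connections;
    every nonzero entry A[i, j] is treated as a directed edge i -> j.
    """
    N = A.shape[0]
    edges = [(i, j) for i in range(N) for j in range(N) if A[i, j] != 0]
    B = np.zeros((len(edges), N))
    A_hat = np.zeros((N, len(edges)))
    for e, (i, j) in enumerate(edges):
        B[e, i] = 1.0        # edge e reads the feature map of its start node i
        A_hat[j, e] = 1.0    # edge e delivers its message to its end node j
    deg = np.maximum(A_hat.sum(axis=1, keepdims=True), 1.0)  # incoming edges per node
    return B, A_hat / deg    # row normalization (assumed form)

def conv2d_same(x, k):
    """Zero-padded, same-size 2D cross-correlation of map x with an odd-sized kernel k."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for u in range(kh):
        for v in range(kw):
            out += k[u, v] * xp[u:u + x.shape[0], v:v + x.shape[1]]
    return out

def sia_gcn_layer(F, W, B, A_hat):
    """One edge-aware layer: broadcast, per-edge convolution, aggregation, ReLU."""
    edge_in = np.tensordot(B, F, axes=1)                           # (E, H, W)
    edge_out = np.stack([conv2d_same(x, k) for x, k in zip(edge_in, W)])
    return np.maximum(np.tensordot(A_hat, edge_out, axes=1), 0.0)  # (N, H, W)

# Toy path graph 0 - 1 - 2 with self-connections and 8 x 8 maps (hypothetical sizes).
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
B, A_hat = construct_matrices(A)
rng = np.random.default_rng(0)
F = rng.random((3, 8, 8))
W = rng.normal(size=(B.shape[0], 3, 3))   # one independent kernel per directed edge
F_next = sia_gcn_layer(F, W, B, A_hat)
print(B.shape, A_hat.shape, F_next.shape)  # (7, 3) (3, 7) (3, 8, 8)
```

Because `B` and `A_hat` depend only on the adjacency matrix, they can be computed once and reused across layers and heads; only the kernel stack `W` would be trained.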

3.3 SiaPose and its training procedure

With SIA-GCN, we propose the SiaPose for 2D hand pose estimation, as in Fig. 1. The preliminary pose estimator could be any off-the-shelf 2D hand pose estimator. Using multiple SIA-GCN heads helps capture the different positional relationships that arise from different hand shapes in the input images. Assume there are $M$ heads in the multi-head SIA-GCNs. We denote the output of the multi-head SIA-GCNs as $\{\tilde{F}_m\}_{m=1}^{M}$, where $\tilde{F}_m$ is the output of the $m$-th SIA-GCN. The pointer network, whose input is the image, is a regression network which generates a soft pointer vector $p \in \mathbb{R}^{M}$. The weight vector $p$ indicates the importance of the information generated at different heads. Finally, at the information fusion stage, the aggregated confidence map is given by

$$\hat{F} = \sum_{m=1}^{M} p_m \tilde{F}_m \qquad (3)$$

which is a weighted sum of the $\tilde{F}_m$. The final predictions of the keypoint positions are obtained by taking the argmax of $\hat{F}$.
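The fusion step and the final argmax can be sketched as follows. The function names and the toy inputs are hypothetical, and the sketch makes no assumption about how the pointer network normalizes its weights.

```python
import numpy as np

def fuse_heads(head_maps, weights):
    """Weighted sum over heads: head_maps is (M, K, H, W), weights is (M,)."""
    return np.tensordot(weights, head_maps, axes=1)   # (K, H, W)

def argmax_keypoints(conf_maps):
    """Return a (K, 2) array of (row, col) locations of each map's maximum."""
    K, H, W = conf_maps.shape
    flat_idx = conf_maps.reshape(K, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_idx, (H, W))
    return np.stack([rows, cols], axis=1)

# Two heads, one keypoint, 4 x 4 maps (toy values).
heads = np.zeros((2, 1, 4, 4))
heads[0, 0, 1, 2] = 1.0    # head 0 votes for (row 1, col 2)
heads[1, 0, 3, 0] = 1.0    # head 1 votes for (row 3, col 0)
pointer = np.array([0.8, 0.2])
fused = fuse_heads(heads, pointer)
keypoints = argmax_keypoints(fused)
print(keypoints)  # [[1 2]]
```

In this toy case the head with the larger pointer weight determines the predicted location, which is the intended effect of the soft pointer vector.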

The training procedure of the SiaPose is simple and can be conducted in an end-to-end fashion. The total loss function is defined as

$$L = \alpha \sum_{s} \sum_{k=1}^{K} \left\| S_k^{(s)} - S_k^{*} \right\|_2^2 + \sum_{k=1}^{K} \left\| \hat{F}_k - S_k^{*} \right\|_2^2 \qquad (4)$$

The first loss is responsible for the output of the preliminary pose estimator, while the second loss is added at the final output. The preliminary pose estimator itself (e.g. CPM and Stacked Hourglass) might consist of multiple stages. The term $S_k^{(s)}$ is the confidence map of the $k$-th keypoint generated by the $s$-th stage of the preliminary pose estimator, while $\hat{F}_k$ is the final confidence output of the SiaPose as in Eq. (3). Besides, $S_k^{*}$ is the ground truth confidence map of the $k$-th keypoint, created by placing a Gaussian peak at its ground truth position. The coefficient $\alpha$ serves as a balancing weight between the two loss functions.
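A minimal sketch of the ground truth construction and of a two-term loss of this kind is given below. The Gaussian width `sigma`, the use of a squared L2 distance, and the placement of the balancing weight on the preliminary term are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def gaussian_heatmap(height, width, center_row, center_col, sigma=1.0):
    """Ground truth confidence map with a Gaussian peak at the keypoint position."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    d2 = (rows - center_row) ** 2 + (cols - center_col) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def total_loss(stage_maps, final_maps, gt_maps, alpha):
    """stage_maps: (S, K, H, W); final_maps and gt_maps: (K, H, W).

    Assumed form: alpha * (preliminary-stage loss) + (final-output loss),
    each a sum of squared differences to the ground truth maps.
    """
    prelim = np.sum((stage_maps - gt_maps[None]) ** 2)
    final = np.sum((final_maps - gt_maps) ** 2)
    return alpha * prelim + final

gt = gaussian_heatmap(8, 8, 3, 5)[None]        # (K=1, H, W)
stages = np.stack([gt, np.zeros_like(gt)])     # two stages: one perfect, one empty
final = gt.copy()                              # a perfect final prediction
loss = total_loss(stages, final, gt, alpha=0.1)
```

Here only the empty preliminary stage contributes to the loss, scaled by `alpha`.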

4 Experiments

Datasets. We evaluate our proposed method on three public hand pose datasets: the CMU Panoptic Hand Dataset (Panoptic) [simon2017hand], the MPII+NZSL Hand Dataset [simon2017hand] and the Large-scale Multiview 3D Hand Pose Dataset (MHP) [Francisco2017]. For Panoptic (~15k images) and MHP (~82k images), we follow the setting of [kong2020rotation] and randomly split all samples into a training set (70%), a validation set (15%) and a test set (15%). Since our contribution mainly focuses on pose estimation instead of detection, we crop square image patches of annotated hands from the original images. A square bounding box which is 2.2 times the size of the hand is applied for cropping as in [simon2017hand, kong2020rotation, kong2019adaptive].

Evaluation metrics. The Probability of Correct Keypoint (PCK) [simon2017hand] is utilized as our evaluation metric. In this paper, we use a threshold $\sigma$ normalized with respect to the size of the square bounding box. We report the performance under different thresholds, $\sigma$ = {0.01, 0.02, 0.03, 0.04, 0.05, 0.06}, and also their average (mPCK). More formally, for a single cropped input image of size $R \times R$, the PCK at $\sigma$ can be defined as

$$\mathrm{PCK}@\sigma = \frac{N_\sigma}{K} \qquad (5)$$

where $N_\sigma$ is the number of predicted keypoints which are within a distance threshold of $\sigma R$ of their correct locations and $K$ is the total number of keypoints.
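The metric can be computed in a few lines. The sketch below assumes Euclidean pixel distance and a threshold equal to the normalized value times the side length of the square crop; the function name `pck` and the toy coordinates are hypothetical.

```python
import numpy as np

def pck(pred, gt, box_size, sigma):
    """Fraction of keypoints whose Euclidean error is at most sigma * box_size."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= sigma * box_size))

# Four keypoints in a 100 x 100 crop with errors of 0.5, 2.5, 1.5 and about 7.07 pixels.
gt = np.zeros((4, 2))
pred = np.array([[0.5, 0.0], [0.0, 2.5], [1.5, 0.0], [5.0, 5.0]])
thresholds = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
scores = [pck(pred, gt, box_size=100, sigma=s) for s in thresholds]
m_pck = sum(scores) / len(scores)
print(scores)  # [0.25, 0.5, 0.75, 0.75, 0.75, 0.75]
```

Averaging the six values gives the mPCK for this toy example (0.625); on a dataset, the per-image counts would be accumulated over all test images.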

Implementation details. In the experiments, two baselines, i.e., a six-stage Convolutional Pose Machine (CPM) as in [simon2017hand] and an eight-stage Stacked Hourglass (SHG), are used as preliminary pose estimators in our SiaPose. For the SIA-GCN, we use 5 edge-aware graph convolutional layers as defined in Eq. (2), which adopt a tree-structured graph following the kinematic structure of the hand skeleton, with self-connections added. The size of the convolutional kernels in Eq. (2) is set to 45. ResNet-18 is used as the backbone of the pointer network. The input image is resized to a fixed resolution specific to each of the CPM and SHG cases. Images are then scaled to [0,1], and normalized with a mean of (0.485, 0.456, 0.406) and a standard deviation of (0.229, 0.224, 0.225). We use Adam as our optimizer. For the SHG-based SiaPose, the initial learning rate is set to 7.5e-4, while for the CPM-based SiaPose we set it to 1e-4. In both cases, we train the model for 100 epochs, with the learning rate reduced by a factor of 0.5 at the 60th and 80th epochs. The weight coefficient in the loss function, Eq. (4), is set to drop from 1.0 to 0.1 at the 40th epoch.
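The preprocessing and the two schedules described above can be written down as small helper functions. The function names, the zero-based epoch counting, and the convention that a change takes effect from the milestone epoch onward are assumptions; the numeric values are those stated in the text.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(image_uint8):
    """Scale an (H, W, 3) uint8 RGB image to [0, 1], then standardize per channel."""
    x = image_uint8.astype(np.float64) / 255.0
    return (x - MEAN) / STD

def learning_rate(epoch, base_lr, milestones=(60, 80), gamma=0.5):
    """Step decay: multiply by gamma once for every milestone already reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

def loss_weight(epoch, switch_epoch=40, before=1.0, after=0.1):
    """Balancing weight of Eq. (4): drops from `before` to `after` at switch_epoch."""
    return before if epoch < switch_epoch else after

white = np.full((2, 2, 3), 255, dtype=np.uint8)
normalized = normalize(white)
schedule = [(e, learning_rate(e, 7.5e-4), loss_weight(e)) for e in (0, 40, 60, 80)]
```

For the SHG-based setting this gives a learning rate of 7.5e-4 up to epoch 59, half of that from epoch 60, and a quarter from epoch 80, under the stated counting convention.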

Comparison with baselines. In Table 1 and Table 2, we compare the performance of our SiaPose with two baselines, CPM and SHG. (1) First, we conduct an experiment with an edge-unaware GCN, where a shared weight matrix is used for all the edges. Interestingly, it performs worse than the baseline models. This is reasonable, because it is not appropriate to assume that the relative positions of neighboring keypoints are always the same. For example, the index finger and the thumb naturally have bones of different lengths. (2) Then we conduct experiments with our edge-aware SIA-GCNs, where different numbers of heads are explored. The results demonstrate that our proposed SiaPose consistently improves both baselines. The ablation study on the number of heads validates the benefit of multiple heads and the effectiveness of the proposed SIA-GCN. For SHG, the largest improvement is 2.12 percentage points at threshold 0.01, and for CPM it is 1.95 percentage points at threshold 0.04. (3) Also, inspired by the state-of-the-art algorithm [kong2020rotation], by adding a rotation network into our SiaPose (R-SiaPose) and using a similar training strategy, the performance of our method is further boosted, leading to larger improvements over the baselines. Improvements of up to about 5 percentage points for SHG and nearly 4 percentage points for CPM are observed. We also compare our model with that proposed in [kong2020rotation] in the comparison with state-of-the-art methods below.

PCK@ 0.01 0.02 0.03 0.04 0.05 0.06 mPCK
SHG Baseline 35.85 71.47 83.15 88.21 91.10 92.92 77.12
SharedWeight GCN 34.76 69.66 81.33 86.19 89.14 90.95 75.34
1-head SiaPose 35.78 71.16 83.57 88.98 92.00 93.84 77.55
5-head SiaPose 37.53 73.07 84.60 89.51 92.14 93.85 78.45
10-head SiaPose 37.97 73.53 84.95 89.70 92.26 93.91 78.72
Improvement 2.12 2.06 1.80 1.49 1.16 0.99 1.60
10-head R-SiaPose 39.46 77.22 88.45 92.97 94.85 96.09 81.48
Improvement 3.61 5.75 5.30 4.76 3.75 3.17 4.36
Table 1: SHG based SiaPose on Panoptic Dataset.
PCK@ 0.01 0.02 0.03 0.04 0.05 0.06 mPCK
CPM Baseline 25.73 62.77 77.80 84.35 88.11 90.57 71.55
SharedWeight GCN 25.14 61.76 77.13 83.60 86.97 89.20 70.63
1-head SiaPose 25.90 63.36 78.98 85.69 89.44 91.90 72.55
5-head SiaPose 26.36 64.05 79.11 85.74 89.38 91.78 72.74
10-head SiaPose 26.45 64.19 79.67 86.30 89.83 92.20 73.11
Improvement 0.72 1.42 1.87 1.95 1.72 1.63 1.56
10-head R-SiaPose 26.62 65.80 81.60 88.02 91.39 93.36 74.47
Improvement 0.89 3.03 3.80 3.67 3.28 2.79 2.92
Table 2: CPM based SiaPose on Panoptic Dataset.
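As a quick consistency check on how the tables are read, the mPCK column is the mean of the six per-threshold values and each Improvement row is the difference from the corresponding baseline row. The sketch below reproduces these for the SHG baseline and the 10-head SiaPose rows of Table 1; the variable names are illustrative.

```python
# PCK values copied from Table 1 (SHG on the Panoptic dataset).
shg_baseline = [35.85, 71.47, 83.15, 88.21, 91.10, 92.92]
shg_10_head = [37.97, 73.53, 84.95, 89.70, 92.26, 93.91]

def mpck(values):
    """Mean PCK over the reported thresholds, rounded to two decimals."""
    return round(sum(values) / len(values), 2)

improvement = [round(b - a, 2) for a, b in zip(shg_baseline, shg_10_head)]
print(mpck(shg_baseline), mpck(shg_10_head))  # 77.12 78.72
print(improvement)  # [2.12, 2.06, 1.8, 1.49, 1.16, 0.99]
```

The printed values match the mPCK column and the first Improvement row of Table 1.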

Comparison with state-of-the-art methods. We further compare our approach with the current state-of-the-art methods [kong2020rotation, kong2019adaptive]. Probabilistic graphical models are deployed in [kong2020rotation] and [kong2019adaptive], where the output confidence maps from CPM are utilized as unary potential functions. The CPM used in [kong2020rotation] and [kong2019adaptive] is the version in which each large convolutional kernel is replaced by three smaller convolutional kernels. For a fair comparison, we follow their configurations and use their version of CPM as our preliminary pose estimator. The fundamental difference between our method and [kong2020rotation] is that we adopt our SIA-GCN instead of graphical models. As observed from Table 3, our method outperforms both [kong2020rotation, kong2019adaptive] on the Panoptic dataset. On the MHP dataset, our SiaPose also achieves state-of-the-art level performance. The MHP dataset is about five times the size of Panoptic, which makes it an easier task and leaves less room for improvement. Methods focused on modeling structural relationships between keypoints benefit more from smaller and more challenging datasets, which require models to extrapolate beyond the pose templates seen in the training data.

Complexity analysis. Regarding the size of the proposed models, the 5-head and 10-head models increase the model size by about 30% and 40%, respectively, compared to the 1-head model. The increase in model size from 1-head to multiple heads is primarily due to the added pointer network, which is drawn in Fig. 1. However, going from 5-head to 10-head does not significantly increase model complexity. This is because the pointer network only needs to output 5 more scalars and the overall overhead mostly comes from adding more GCN layers, which are shallow and have few parameters (note that we use "channel-wise" 2D convolutions). It is also worth pointing out that, using a 10-head SIA-GCN, our model is about 80% and 60% the size of those in [kong2019adaptive] and [kong2020rotation], respectively.

Domain generalization of our model. Table 4 demonstrates the domain generalization ability of our model. All the models in Table 4 are pretrained on the Panoptic dataset, and then finetuned for about 40 epochs on the MPII+NZSL dataset. Consistent improvements over the baselines are seen across the whole range of PCK thresholds.

Qualitative results. Some qualitative examples are given in Fig. 3, which indeed shows that the SIA-GCN helps to enhance structural consistency and alleviate the spatial ambiguity. For example, in the third column, although the right hand is partially occluded by the earphone, our model could still correctly predict the position of all keypoints. We also show some failure cases of our model in Fig. 4, which are due to very heavy occlusion and foreshortened view of a fist.

PCK@ 0.01 0.02 0.03 0.04 0.05 0.06 mPCK
CMU Panoptic Hand Dataset
R-MGMN [kong2020rotation] 23.67 60.12 76.28 83.14 86.91 89.47 69.93
AGMN [kong2019adaptive] 23.90 60.26 76.21 83.70 87.72 90.27 70.34
R-SiaPose (Ours) 24.94 62.08 77.83 84.91 88.78 91.34 71.65
Large-scale Multiview 3D Hand Pose Dataset (MHP)
R-MGMN [kong2020rotation] 41.51 85.97 93.71 96.33 97.51 98.17 85.53
AGMN [kong2019adaptive] 41.38 85.67 93.96 96.61 97.77 98.42 85.63
R-SiaPose (Ours) 41.27 85.89 93.82 96.43 97.61 98.29 85.56
Table 3: Comparison to state-of-the-art methods.
PCK@ 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08
CPM 8.05 23.78 37.74 48.00 55.65 61.68 66.58 70.82
R-SiaPose (Ours) 8.40 24.71 39.33 50.31 59.04 66.01 71.29 75.63
Improvement 0.35 0.93 1.59 2.31 3.39 4.33 4.71 4.81
SHG 11.72 30.85 44.82 54.71 62.35 68.48 73.47 77.61
R-SiaPose (Ours) 12.19 33.34 49.13 59.86 67.83 73.69 78.26 81.72
Improvement 0.47 2.49 4.31 5.15 5.48 5.21 4.79 4.11
Table 4: Domain generalization of our model to MPII+NZSL from Panoptic Dataset.
Figure 3: Qualitative results of baseline (top) and our model (bottom) on Panoptic and MPII.
Figure 4: Failure cases of our model. Each pair contains an input image and its prediction.

5 Conclusion

In this paper, we propose a novel spatial information aware graph neural network with 2D convolutions (SIA-GCN), which can process 2D spatial features at each node and can additionally learn different spatial relationships for different pairs of neighboring nodes. We show the efficacy of our SIA-GCN in the 2D hand pose estimation task by implementing a network which achieves state-of-the-art performance. The SIA-GCN has the potential to generalise well to other tasks.

References