1 Introduction
Hand pose estimation is a long-standing research area in computer vision, given its vast potential applications in human-computer interaction, augmented reality, virtual reality and so on [DoostiSurvey]. It aims to infer the 2D or 3D positions of hand keypoints from a single input image or a sequence of images, which may take the form of RGB, RGB-D or grayscale data. Although 3D hand pose estimation is drawing increasing attention in the research community [wang2019geometric, malik2020handvoxnet, xiong2019a2j, wan2019self, yuan2018depth, ge2018point], 2D hand pose estimation remains a valuable and challenging problem [simon2017hand, wang2018mask, kong2019adaptive]. A plethora of 3D hand pose estimation algorithms rely on their 2D counterparts [cai2018weakly, zimmermann2017learning], attempting to lift 2D predictions to 3D space. In this paper, we investigate the problem of 2D hand pose estimation from a single RGB image.
Progress in hand pose estimation research has been greatly boosted by the invention of deep Convolutional Neural Networks (CNNs). Deep CNN models like the Convolutional Pose Machine [wei2016convolutional] and the Stacked Hourglass [newell2016stacked] have been successfully applied to 2D hand pose estimation, although they were originally proposed for human pose estimation. Some methods [kong2020rotation, kong2019adaptive, chen2014articulated] also integrate deep CNNs with probabilistic graphical models, to harvest both the powerful representation ability of deep CNNs and the capability of explicitly expressing spatial relationships attributed to graphical models. In contrast to CNNs, graph neural networks have the ability to handle irregularly structured data. The joints of a human body and the keypoints of a hand can be conveniently considered as irregular graphs, opening the possibility of applying the Graph Convolutional Network (GCN) [kipf2016semi] to human/hand pose estimation tasks. However, in the vanilla GCN [kipf2016semi], all nodes share the same one-hop propagation weight matrix, which makes it ill-suited to pose estimation, since different human body joints and bones should carry different semantics. The authors of [doosti2020hope, zhao2019semantic, cai2019exploiting] have proposed different variants of the vanilla GCN of [kipf2016semi] for human or hand pose estimation. However, all of these methods take as input a one-dimensional vector for each node, and the node feature at each layer is always a one-dimensional vector; thus, they cannot readily process 2D confidence maps. Moreover, although [doosti2020hope, zhao2019semantic, cai2019exploiting] modify the vanilla GCN, they still do not allow full independence among the edges.
In this paper we propose the Spatial Information Aware Graph Neural Network with 2D convolutions (SIA-GCN). In SIA-GCN, the feature of each node is a two-dimensional matrix, and the information propagation to neighboring nodes is carried out via 2D convolutions along each edge. By using 2D convolutions, instead of flattening the 2D feature map to a 1D vector and then performing linear multiplications, the spatial information encoded in the feature map is preserved and appropriately exploited. We also propose to use different 2D convolutional kernels on different edges, aiming to capture different spatial relationships for different pairs of neighboring nodes. The SIA-GCN is very flexible and can easily be combined with off-the-shelf 2D pose estimators. In this work, we demonstrate the efficacy of SIA-GCN on 2D hand pose estimation. For this application, the 2D feature maps at the nodes are actually the confidence maps of the hand keypoint positions. With a designated matrix for each edge, the SIA-GCN has the ability to capture the various spatial relationships between different pairs of hand keypoints.
Our main contributions are threefold:


We propose the novel SIA-GCN, which can process 2D confidence maps for each node efficiently and effectively by integrating graph neural networks and 2D convolutions. Using 2D convolutions, our SIA-GCN can exploit and harvest the spatial information provided in the 2D feature maps.

By assigning different convolutional kernels to different edges, the SIA-GCN has the property of full edge-awareness. Distinct spatial relationships can be learned on different edges.

We deploy SIA-GCN in the task of hand pose estimation. Utilizing SIA-GCN, the constructed neural network achieves state-of-the-art performance.
2 Related Work
There exists a vast amount of research on human/hand pose estimation [simon2017hand, wang2019geometric, malik2020handvoxnet, xiong2019a2j, wan2019self, wang2020predicting, yuan2018depth, baek2018augmented, wan2018dense, ge2018point, mueller2017real, ge2016robust] and on graph neural networks [kipf2016semi, doosti2020hope, zhao2019semantic, cai2019exploiting]. In this section, we focus on 2D hand pose estimation from single RGB images and on applications of graph convolutional networks [kipf2016semi] to pose estimation tasks.
2D hand pose estimation. Studies of RGB-image-based 2D hand pose estimation have long benefited from those of human pose estimation, where deep Convolutional Neural Networks (CNNs) have enjoyed great success [toshev2014deeppose, wei2016convolutional, newell2016stacked, xiao2018simple, chen2018cascaded, sun2019deep]. Among these deep CNN models, Convolutional Pose Machines [wei2016convolutional] and the Stacked Hourglass [newell2016stacked] are commonly used in various RGB-based 2D hand pose estimation methods [simon2017hand, kong2020rotation, kong2019adaptive, chen2020nonparametric, wang2018mask]. Alongside deep CNNs, the Graphical Model (GM) has also played a significant role in solving the pose estimation task. A GM has the power to model spatial constraints among the joints explicitly. Recently, several works in pose estimation have combined GMs and neural networks to fully exploit structural information [tompson2014joint, chen2014articulated, song2017thin, yang2016end, kong2019adaptive, kong2020rotation]. Traditionally, GMs with fixed parameters [tompson2014joint, song2017thin, chen2014articulated] were applied to the pose estimation task, while the most recent works [kong2019adaptive, kong2020rotation] adopt GMs with adaptive parameters conditioned on the input image. Although all of these approaches take advantage of structural information, our proposed method is based on graph convolutional networks, whereas the previous works [kong2019adaptive, kong2020rotation] are based on graphical models.
Graph convolutional network. The Graph Convolutional Network (GCN), which generalizes deep CNNs to graph-structured data, has attracted increasing attention in recent years. One main research direction defines graph convolutions from the spectral perspective [shuman2013emerging], while the other works in the spatial domain [kipf2016semi]. For a comprehensive survey on GCNs, we refer readers to [wu2020comprehensive]. The works most related to ours are [doosti2020hope, zhao2019semantic, cai2019exploiting], in which variants of spatial GCNs have been proposed and applied to human/hand pose estimation tasks in computer vision. In the following, we discuss the key differences between our SIA-GCN and those of [doosti2020hope, zhao2019semantic, cai2019exploiting].
In [cai2019exploiting], the authors propose to classify neighboring nodes according to their semantic meanings and to use different kernels for different classes of neighboring nodes. The purpose of their GCN is to regress 3D position vectors from 2D position vectors, and the input to the GCN for each node is a one-dimensional vector representing the predicted 2D position of the corresponding body joint. In contrast, our proposed SIA-GCN handles a two-dimensional confidence map for each node. The confidence map inherently contains much more information than a two-element position vector. Moreover, our goal is to refine the final 2D predictions, rather than to lift 2D predictions to 3D space. Besides, instead of classifying nodes into different classes, we treat every edge independently and attach a designated weight kernel to each edge. In [doosti2020hope], the authors directly adopt the propagation rule from [kipf2016semi] with the modification that, instead of using a predefined adjacency matrix, they use an adaptive adjacency matrix learned from data. The feature for each node is a one-dimensional vector. Our method differs from [doosti2020hope] in that edge-dependent weights are considered explicitly and our SIA-GCN works on a 2D confidence map for each node.
In [zhao2019semantic], the proposed Semantic Graph Convolution (SemGConv) adds a learnable weighting matrix to the conventional graph convolutions of [kipf2016semi]. The weighting matrix serves as a mask on the edges of a node when information aggregation is performed. SemGConv is inherited from ST-GCN [yan2018spatial], but is equipped with additional important features such as a softmax nonlinearity and channel-wise masks. The weighting mask adds a scalar importance weight (or a vector, if channel-wise) to each edge. In SIA-GCN, by contrast, we directly attach a fully independent convolution matrix to each edge. Besides, our SIA-GCN works on 2D node features with spatial information awareness.
3 Methodology
In this section, we present the SIA-GCN and its application to hand pose estimation. We refer to the resulting pose estimator as SiaPose, which is illustrated in Fig. 1.
SiaPose takes as input an RGB image, to which a preliminary pose estimator is applied. The preliminary pose estimator can be any 2D pose estimator, such as the well-known Convolutional Pose Machine [wei2016convolutional] or Stacked Hourglass [newell2016stacked], and outputs a set of confidence maps of keypoint positions. Then, in the top branch, the confidence maps are fed into a block of multi-head SIA-GCNs. Each SIA-GCN processes a copy of the confidence maps in parallel and independently. Meanwhile, in the bottom branch, the input image goes through a pointer network, which produces a weight vector indicating the importance of each head in the multi-head SIA-GCNs. Finally, at the information fusion stage, the confidence maps output by the multi-head SIA-GCNs are aggregated according to the weight vector.
In the following subsections, we first revisit the graph convolutional network and discuss the motivation for our SIA-GCN. Then, we present a compact formulation of the proposed edge-aware graph convolutional layers in SIA-GCN, and demonstrate how to implement them efficiently using 2D convolutional operations. Finally, we describe the training procedure of SiaPose.
3.1 Revisiting Graph Convolutional Network
The Graph Convolutional Network (GCN) proposed in [kipf2016semi] has enjoyed great success in a variety of applications since its advent. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes $v_i \in \mathcal{V}$, edges $(v_i, v_j) \in \mathcal{E}$, adjacency matrix $A \in \mathbb{R}^{N \times N}$, and degree matrix $\tilde{D}$ with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, the layer-wise propagation rule is characterized by the following equation

$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big)$, (1)

where $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph with self-connections [kipf2016semi], $I_N$ is the identity matrix, and $\sigma(\cdot)$ is a nonlinear activation function. $H^{(l)}$ is the matrix of activations in the $l$-th layer, i.e., the input feature matrix of the $(l+1)$-th layer. The parameter $W^{(l)}$ is the trainable weight matrix of layer $l$.

In the scenario of human and hand pose estimation, it is well studied that probabilistic graphical models can be deployed to enhance structural consistency [tompson2014joint, kong2020rotation, chen2014articulated]. The graphical model takes in preliminarily generated 2D confidence maps of each body joint or hand keypoint. These confidence maps are usually treated as the unary potential functions of the graphical model. The graphical model then imposes learned pairwise potential functions on the initial confidence maps, thereby enforcing spatial consistency of the joints/keypoints. Can we also apply a GCN to the confidence maps and thereby enhance spatial consistency?
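Equation (1) itself is straightforward to implement. A minimal NumPy sketch of one vanilla GCN layer follows; the toy chain graph, fixed feature values, and tanh activation are illustrative choices, not part of the original formulation.

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One vanilla GCN layer [kipf2016semi]:
    H' = activation(D_tilde^{-1/2} A_tilde D_tilde^{-1/2} H W)."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # add self-connections
    d = A_tilde.sum(axis=1)                    # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D_tilde^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return activation(A_hat @ H @ W)

# Toy 3-node chain graph with 4-dim node features mapped to 2 dims.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.arange(12, dtype=float).reshape(3, 4) / 10.0
W = np.ones((4, 2))
out = gcn_layer(A, H, W)   # shape (3, 2)
```

Note that every node is transformed by the same weight matrix `W`; the limitations this causes for pose estimation are discussed next.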
The answer is positive, but the extension is not trivial. To apply the above GCN to pose estimation, some modifications are needed due to dimensionality. In Eq. (1), the activation matrix $H^{(l)}$ is a two-dimensional matrix, corresponding to $N$ nodes each associated with a 1D feature vector. However, in 2D pose estimation, each graph node (usually corresponding to a joint or keypoint) is associated with a two-dimensional confidence map. This discrepancy could be handled by flattening the two-dimensional confidence map into a single long vector and then performing layer propagation according to Eq. (1). However, this would result in a very large feature size and significantly increase the computational complexity (a $64 \times 64$ confidence map, for instance, would be flattened into a vector of size 4096). Moreover, flattening the confidence map corrupts the spatial information encoded in it. Thus, we propose to use 2D convolutional operations directly on the 2D confidence maps when propagating information along the edges.
Moreover, in Eq. (1), since all the nodes share the same weight matrix and information aggregation is controlled only by the adjacency relationships between nodes, it is difficult for the propagation rule in Eq. (1) to characterize different positional relationships for different pairs of neighboring joints. For example, the positional information propagated between two neighboring thumb joints should differ from that between neighboring joints on the middle finger. One simple reason is that the bones of the thumb and the middle finger have different lengths.
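The intuition that an edge-specific 2D kernel can encode an expected displacement between two neighboring keypoints can be illustrated with a small sketch. The delta-shaped kernel and the map sizes below are our own illustrative assumptions, not values from the paper.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded 2D cross-correlation (minimal loop implementation)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

# A confidence map with a single peak at (8, 8) ...
cmap = np.zeros((16, 16))
cmap[8, 8] = 1.0
# ... convolved with a kernel whose single nonzero weight is offset from
# the center: the peak of the outgoing "message" is translated by a fixed
# amount, i.e., the kernel encodes a relative displacement between joints.
kernel = np.zeros((7, 7))
kernel[6, 6] = 1.0          # offset (+3, +3) from the kernel center
msg = conv2d_same(cmap, kernel)
peak = np.unravel_index(msg.argmax(), msg.shape)   # peak moved to (5, 5)
```

A learned kernel would distribute its mass over a region rather than a single cell, expressing both the expected offset and its uncertainty; different edges (thumb vs. middle finger) can then learn different offsets.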
3.2 SIA-GCN
To resolve the above concerns, we propose the spatial information aware graph neural network with 2D convolutions (SIA-GCN), in which each edge of the graph is associated with an individual learnable 2D convolutional kernel. A toy example of a graph consisting of four nodes is shown in Fig. 2, where the green matrices represent the 2D features (heatmaps) at each node and the red matrices represent the designated 2D kernels associated with each edge.
For the task of hand pose estimation, we define a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes corresponding to hand keypoints, and $\mathcal{E}$ is the set of edges encoding the neighboring relationships among the keypoints. Each node $v_k$ is associated with a 2D confidence map $C_k$, which encodes the positional information of the $k$-th keypoint. We stack all $C_k$ for $v_k \in \mathcal{V}$ into a 3D tensor, denoted as $C$.
One important feature of our SIA-GCN is that each edge $e_{ij} \in \mathcal{E}$ is associated with an individual weight matrix, or 2D convolutional kernel, $W_{ij}$. Again, we collect all $W_{ij}$ into a single tensor $W$, which constitutes the set of learnable parameters of the edge-aware graph convolutional layer. The information propagated from node $v_i$ to node $v_j$ along edge $e_{ij}$ is obtained by computing the 2D convolution of $C_i$ with $W_{ij}$. Then, all the information propagated into node $v_j$ is aggregated according to the adjacency matrix. The propagation rule can be presented compactly in matrix multiplications and convolutions as

$C^{(l+1)} = \sigma\big(A_{\mathcal{E}} \big(W^{(l)} \circledast (B\, C^{(l)})\big)\big)$, (2)

where the superscripts $(l)$ and $(l+1)$ denote the $l$-th and $(l+1)$-th layers respectively, $\circledast$ is the channel-wise 2D convolution operator, and $\sigma(\cdot)$ is the nonlinear activation function. The matrix $B$ is the broadcast matrix, which broadcasts node features to their outgoing edges. Note that the matrix multiplication $B\, C^{(l)}$ results in a tensor with $|\mathcal{E}|$ channels, whereas $C^{(l)}$ has $|\mathcal{V}|$ channels. In other words, the operation $B\, C^{(l)}$ simply prepares the input along each edge for the following channel-wise convolution with $W^{(l)}$. Finally, the matrix $A_{\mathcal{E}}$ is the aggregation matrix, which harvests all the information from the incoming edges of each graph node.

It is worth pointing out that, in Eq. (2), only $W^{(l)}$ is learnable, while the broadcast matrix $B$ and the aggregation matrix $A_{\mathcal{E}}$ are both determined and constructed from the graph's adjacency matrix by Algorithm 1. In Algorithm 1, we assume the input adjacency matrix already includes self-connections.
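Since Algorithm 1 is not reproduced here, the following NumPy sketch shows one plausible way to construct broadcast and aggregation matrices from a self-looped adjacency matrix and to apply edge-aware propagation in the spirit of Eq. (2). The mean aggregation, the ReLU activation, and the delta kernels in the usage example are our own assumptions.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded 2D cross-correlation."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def build_broadcast_aggregate(adj):
    """Derive the broadcast matrix B (edges x nodes) and the aggregation
    matrix A (nodes x edges) from an adjacency matrix that already
    includes self-connections (a sketch of what Algorithm 1 might do)."""
    N = adj.shape[0]
    edges = [(i, j) for i in range(N) for j in range(N) if adj[i, j]]
    B = np.zeros((len(edges), N))
    A = np.zeros((N, len(edges)))
    for e, (src, dst) in enumerate(edges):
        B[e, src] = 1.0      # copy the source node's map onto edge e
        A[dst, e] = 1.0      # edge e feeds into its destination node
    A /= A.sum(axis=1, keepdims=True)   # average the incoming messages
    return B, A, edges

def edge_aware_layer(cmaps, kernels, B, A):
    """cmaps: (N, H, W) node confidence maps; kernels: one 2D kernel per
    edge. Computes sigma(A (W conv (B C)))."""
    edge_in = np.tensordot(B, cmaps, axes=(1, 0))        # (E, H, W)
    edge_out = np.stack([conv2d_same(edge_in[e], k)      # per-edge conv
                         for e, k in enumerate(kernels)])
    return np.maximum(np.tensordot(A, edge_out, axes=(1, 0)), 0.0)

# Usage: a 3-keypoint chain with self-loops; delta kernels make each
# edge an identity map, so each node receives the mean of its neighbors.
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
B, A, edges = build_broadcast_aggregate(adj)
delta = np.zeros((3, 3)); delta[1, 1] = 1.0
cmaps = np.zeros((3, 8, 8))
cmaps[0, 2, 2] = cmaps[1, 4, 4] = cmaps[2, 6, 6] = 1.0
out = edge_aware_layer(cmaps, [delta] * len(edges), B, A)
```

In a trained network each edge would of course hold its own learned kernel rather than a delta, which is exactly what gives the layer its full edge-awareness.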
3.3 SiaPose and its training procedure
With SIA-GCN, we propose SiaPose for 2D hand pose estimation, as shown in Fig. 1. The preliminary pose estimator can be any off-the-shelf 2D hand pose estimator. Multiple heads of SIA-GCNs help capture the different positional relationships induced by the different hand shapes in the input images. Assume there are $M$ heads in the multi-head SIA-GCNs; we denote the output of the $m$-th SIA-GCN as $\hat{S}_m$, for $m = 1, \dots, M$. The pointer network, whose input is the image, is a regression network that generates a soft pointer vector $\alpha = (\alpha_1, \dots, \alpha_M)$. The weight vector $\alpha$ indicates the importance of the information generated by the different heads. Finally, at the information fusion stage, the aggregated confidence map is given by
$S = \sum_{m=1}^{M} \alpha_m \hat{S}_m$, (3)

which is a weighted sum of $\hat{S}_1, \dots, \hat{S}_M$. The final predictions of the keypoint positions are obtained by taking the argmax of $S$.
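The fusion step of Eq. (3), together with the final argmax decoding, can be sketched as follows; the softmax normalization of raw pointer scores is an assumption about how the soft pointer vector might be produced.

```python
import numpy as np

def fuse_heads(head_maps, logits):
    """head_maps: (M, K, H, W) confidence maps from M SIA-GCN heads for
    K keypoints; logits: (M,) raw pointer-network scores."""
    w = np.exp(logits - logits.max())
    w /= w.sum()                                     # soft pointer vector
    fused = np.tensordot(w, head_maps, axes=(0, 0))  # Eq. (3): (K, H, W)
    # Final predictions: argmax location of each fused confidence map.
    coords = [np.unravel_index(m.argmax(), m.shape) for m in fused]
    return fused, coords

# Two heads, one keypoint: the pointer strongly favors head 0.
head_maps = np.zeros((2, 1, 4, 4))
head_maps[0, 0, 1, 1] = 1.0      # head 0 votes for (1, 1)
head_maps[1, 0, 3, 3] = 1.0      # head 1 votes for (3, 3)
fused, coords = fuse_heads(head_maps, np.array([10.0, 0.0]))
```

With near-uniform logits the fusion degenerates to an average over heads, so the pointer network is what lets the model specialize heads to particular hand shapes.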
The training procedure of SiaPose is simple and can be conducted in an end-to-end fashion. The total loss function is defined as

$\mathcal{L} = \lambda \sum_{t=1}^{T} \sum_{k=1}^{K} \big\lVert C_t^k - C_{gt}^k \big\rVert_2^2 + \sum_{k=1}^{K} \big\lVert S^k - C_{gt}^k \big\rVert_2^2$. (4)

The first loss supervises the output of the preliminary pose estimator, while the second loss is imposed on the final output. The preliminary pose estimator itself (e.g., CPM or Stacked Hourglass) may consist of multiple stages. The term $C_t^k$ is the confidence map of the $k$-th keypoint generated by the $t$-th stage of the preliminary pose estimator, while $S^k$ is the final confidence output of SiaPose as in Eq. (3). Besides, $C_{gt}^k$ is the ground-truth confidence map of the $k$-th keypoint, created by placing a Gaussian peak at its ground-truth position. The coefficient $\lambda$ serves as a balancing weight between the two loss terms.
4 Experiments
Datasets. We evaluate our proposed method on three public hand pose datasets: the CMU Panoptic Hand Dataset (Panoptic) [simon2017hand], the MPII+NZSL Hand Dataset [simon2017hand] and the Large-scale Multiview 3D Hand Pose Dataset (MHP) [Francisco2017]. For Panoptic (~15k images) and MHP (~82k images), we follow the setting of [kong2020rotation] and randomly split all samples into a training set (70%), a validation set (15%) and a test set (15%). Since our contribution mainly focuses on pose estimation rather than detection, we crop square image patches of annotated hands from the original images. A square bounding box 2.2 times the size of the hand is used for cropping, as in [simon2017hand, kong2020rotation, kong2019adaptive].
Evaluation metrics. The Probability of Correct Keypoint (PCK) [simon2017hand] is utilized as our evaluation metric. In this paper, we use a threshold normalized with respect to the size of the square bounding box. We report the performance under different thresholds, $\sigma$ = {0.01, 0.02, 0.03, 0.04, 0.05, 0.06}, as well as their average (mPCK). More formally, for a single cropped input image of size $L \times L$, the PCK at threshold $\sigma$ is defined as

$\text{PCK}@\sigma = \frac{m_\sigma}{K}$, (5)

where $m_\sigma$ is the number of predicted keypoints that lie within a distance of $\sigma \cdot L$ of their correct locations, and $K$ is the total number of keypoints.
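A direct implementation of this metric might look as follows (a sketch; the array shapes are our own convention).

```python
import numpy as np

def pck(pred, gt, box_size, sigma):
    """PCK at normalized threshold sigma: the fraction of the K predicted
    keypoints (pred, gt: (K, 2) arrays of pixel coordinates) lying within
    sigma * box_size of their ground-truth locations."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= sigma * box_size))

# Two keypoints in a 100 px box: one exact hit, one ~14.1 px away.
pred = np.array([[10.0, 10.0], [20.0, 20.0]])
gt = np.array([[10.0, 10.0], [10.0, 10.0]])
score = pck(pred, gt, box_size=100, sigma=0.05)   # threshold = 5 px
```

Averaging this score over images, keypoints, and the listed thresholds yields the mPCK values reported in the tables.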
Implementation details. In the experiments, two baselines, namely the six-stage Convolutional Pose Machine (CPM) as in [simon2017hand] and the eight-stage Stacked Hourglass (SHG), are used as preliminary pose estimators in our SiaPose. For the SIA-GCN, we use 5 edge-aware graph convolutional layers as defined in Eq. (2), adopting a tree-structured graph that follows the kinematic structure of the hand skeleton, with self-connections added. The convolutional kernels in Eq. (2) are of size $45 \times 45$. ResNet18 is used as the backbone of the pointer network. The input image is resized to the required input resolutions of CPM and SHG, respectively. Images are then scaled to [0, 1] and normalized with mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225). We use Adam as our optimizer. For the SHG-based SiaPose, the initial learning rate is set to 7.5e-4, while for the CPM-based SiaPose we set it to 1e-4. In both cases, we train the model for 100 epochs, with the learning rate reduced by a factor of 0.5 at milestones of the 60th and 80th epochs. The weight coefficient $\lambda$ in the loss function of Eq. (4) is dropped from 1.0 to 0.1 at the 40th epoch.

Comparison with baselines. In Table 1 and Table 2, we compare the performance of our SiaPose with the two baselines, CPM and SHG. (1) First, we conduct an experiment with an edge-unaware GCN, where a shared weight matrix is used for all edges. Interestingly, it performs worse than the baseline models. This is reasonable, because it is not appropriate to assume that the relative positions of neighboring keypoints are always the same; for example, the index finger and the thumb naturally have bones of different lengths. (2) Then, we conduct experiments with our edge-aware SIA-GCNs, exploring different numbers of heads. The results demonstrate that our proposed SiaPose consistently and noticeably improves both baselines. The ablative study on the number of heads validates the benefit of multiple heads and the effectiveness of the proposed SIA-GCN. For SHG, there is a 2.12 percent improvement at threshold $\sigma = 0.01$, and for CPM, a 1.95 percent improvement is seen at threshold $\sigma = 0.04$. (3) Also, inspired by the state-of-the-art algorithm [kong2020rotation], by adding a rotation network to our SiaPose (RSiaPose) and using a similar training strategy, the performance of our method is further boosted, leading to significant improvements over the baselines: about 5 percent for SHG and nearly 4 percent for CPM. We compare our model with that of [kong2020rotation] in the next subsection.
PCK@  0.01  0.02  0.03  0.04  0.05  0.06  mPCK 

SHG Baseline  35.85  71.47  83.15  88.21  91.10  92.92  77.12 
SharedWeight GCN  34.76  69.66  81.33  86.19  89.14  90.95  75.34 
1head SiaPose  35.78  71.16  83.57  88.98  92.00  93.84  77.55 
5head SiaPose  37.53  73.07  84.60  89.51  92.14  93.85  78.45 
10head SiaPose  37.97  73.53  84.95  89.70  92.26  93.91  78.72 
Improvement  2.12  2.06  1.80  1.49  1.16  0.99  1.60 
10head RSiaPose  39.46  77.22  88.45  92.97  94.85  96.09  81.48 
Improvement  3.61  5.75  5.30  4.76  3.75  3.17  4.36 
PCK@  0.01  0.02  0.03  0.04  0.05  0.06  mPCK 

CPM Baseline  25.73  62.77  77.80  84.35  88.11  90.57  71.55 
SharedWeight GCN  25.14  61.76  77.13  83.60  86.97  89.20  70.63 
1head SiaPose  25.90  63.36  78.98  85.69  89.44  91.90  72.55 
5head SiaPose  26.36  64.05  79.11  85.74  89.38  91.78  72.74 
10head SiaPose  26.45  64.19  79.67  86.30  89.83  92.20  73.11 
Improvement  0.72  1.42  1.87  1.95  1.72  1.63  1.56 
10head RSiaPose  26.62  65.80  81.60  88.02  91.39  93.36  74.47 
Improvement  0.89  3.03  3.80  3.67  3.28  2.79  2.92 
Comparison with state-of-the-art methods. We further compare our approach with the current state-of-the-art methods [kong2020rotation, kong2019adaptive]. Probabilistic graphical models are deployed in [kong2020rotation] and [kong2019adaptive], where the output confidence maps from a CPM are utilized as unary potential functions. The CPM used in [kong2020rotation] and [kong2019adaptive] is a modified version in which each of the large convolutional kernels is replaced by three smaller convolutional kernels. To make a fair comparison, we follow their configurations and use their version of the CPM as our preliminary pose estimator. The fundamental difference between our method and [kong2020rotation] is that we adopt our SIA-GCN instead of graphical models. As observed in Table 3, our method outperforms both [kong2020rotation, kong2019adaptive] on the Panoptic dataset. On the MHP dataset, our SiaPose also achieves state-of-the-art-level performance. The MHP dataset is about five times the size of Panoptic, which makes it an easier task and leaves less room for improvement. Methods focused on modeling structural relationships between keypoints benefit more from smaller and more challenging datasets, which require models to extrapolate beyond the pose templates seen in the training data.
Complexity analysis. Regarding model size, the 5-head and 10-head models increase the model size by about 30% and 40%, respectively, compared to the 1-head model. The increase in model size from 1 head to multiple heads is primarily due to the added pointer network, which is drawn in Fig. 1. However, going from 5 heads to 10 heads does not significantly increase model complexity: the pointer network only needs to output 5 more scalars, and the remaining overhead comes from the additional SIA-GCN heads, which are shallow and contain relatively few parameters (note that we use channel-wise 2D convolutions). It is also worth pointing out that, even with a 10-head SIA-GCN, our model is only about 80% and 60% of the size of those in [kong2019adaptive] and [kong2020rotation], respectively.
Domain generalization of our model. Table 4 demonstrates the domain generalization ability of our model. All the models in Table 4 are pretrained on the Panoptic dataset and then fine-tuned for about 40 epochs on the MPII+NZSL dataset. Consistent improvements over the baselines are observed across the whole range of PCK thresholds.
Qualitative results. Some qualitative examples are given in Fig. 3, which show that the SIA-GCN indeed helps to enhance structural consistency and alleviate spatial ambiguity. For example, in the third column, although the right hand is partially occluded by the earphone, our model still correctly predicts the positions of all keypoints. We also show some failure cases of our model in Fig. 4, which are due to very heavy occlusion and a foreshortened view of a fist.
PCK@  0.01  0.02  0.03  0.04  0.05  0.06  mPCK 

CMU Panoptic Hand Dataset  
RMGMN [kong2020rotation]  23.67  60.12  76.28  83.14  86.91  89.47  69.93 
AGMN [kong2019adaptive]  23.90  60.26  76.21  83.70  87.72  90.27  70.34 
RSiaPose (Ours)  24.94  62.08  77.83  84.91  88.78  91.34  71.65 
Largescale Multiview 3D Hand Pose Dataset (MHP)  
RMGMN [kong2020rotation]  41.51  85.97  93.71  96.33  97.51  98.17  85.53 
AGMN [kong2019adaptive]  41.38  85.67  93.96  96.61  97.77  98.42  85.63 
RSiaPose (Ours)  41.27  85.89  93.82  96.43  97.61  98.29  85.56 
PCK@  0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08 

CPM  8.05  23.78  37.74  48.00  55.65  61.68  66.58  70.82 
RSiaPose (Ours)  8.40  24.71  39.33  50.31  59.04  66.01  71.29  75.63 
Improvement  0.35  0.93  1.59  2.31  3.39  4.33  4.71  4.81 
SHG  11.72  30.85  44.82  54.71  62.35  68.48  73.47  77.61 
RSiaPose (Ours)  12.19  33.34  49.13  59.86  67.83  73.69  78.26  81.72 
Improvement  0.47  2.49  4.31  5.15  5.48  5.21  4.79  4.11 
5 Conclusion
In this paper, we have proposed a novel spatial information aware graph neural network with 2D convolutions (SIA-GCN), which has the advantage of processing 2D spatial features for each node, with the additional capability of learning different spatial relationships for different pairs of neighboring nodes. We show the efficacy of our SIA-GCN on the 2D hand pose estimation task by implementing a network that achieves state-of-the-art performance. The SIA-GCN has the potential to generalize well to other tasks.