3D human pose estimation has attracted much attention in computer vision in recent years due to its numerous practical applications, such as action recognition weng2017spatio; yan2018spatial; li2019actional; jiang2021skeleton and virtual reality gan2019air; lu2020fmkit. The 2D-to-3D human pose estimation task takes 2D joint coordinates as inputs and directly outputs the target 3D pose, which remains a challenging problem because of the very limited knowledge contained in 2D joint coordinates. Prior works martinez2017simple; mehta2017monocular have shown that the 2D kinematic structure in the coordinates is vital for learning feature representations for 3D pose estimation. However, it is difficult for CNN-based methods to directly process such graph-structured data.
To conquer this difficulty, recent literature doosti2020hope; zhao2019semantic; liu2020comprehensive; xu2021graph employs graph convolutional networks (GCNs) to learn representations of graph-structured data because of their capability to learn structural information. These techniques achieve good performance but suffer from a limited receptive field when learning representations, because they restrict the graph convolution filter to operate only on first-order neighbor nodes in the manner of kipf2016semi, as illustrated in Fig. 1 (a). This issue can be alleviated by stacking multiple GCN layers, but the performance may then degrade due to the over-smoothing problem. Other attempts alleviate the receptive-field problem and reach the state of the art: zhao2019semantic utilizes non-local modules to increase the network receptive field, and lin2021end utilizes the recently popular transformer model to capture global visual information over the entire RGB image. The self-attention modules of transformers facilitate the interaction among all 2D joints, and thus their relations are exploited. However, as illustrated in Fig. 1 (b), the self-attention mechanism builds upon calculating the similarities of these joints, which actually ignores or weakens the graph structure information among the 2D joint coordinates.
To address the aforementioned problems, we propose the GraAttention module to fill in these gaps by combining the transformer with a graph convolution filter that has a learnable adjacency matrix (referred to as LAM-Gconv for short) kipf2016semi; doosti2020hope. In particular, we replace the multi-layer perceptron (MLP) of transformers with LAM-Gconv, which facilitates the training process and boosts the performance. The self-attention module facilitates the interaction among 2D joints based on their coordinate values; as a complement, LAM-Gconv utilizes the graph structure to further boost interaction. With the superior properties of both self-attention and LAM-Gconv, our GraAttention is able to learn robust features by capturing global visual information without losing the graph structure details.
Furthermore, we claim that graph-structured data contains richer joint relations than the apparent connections. For example, there indeed exist implicit connections between the two knee joints of a human, though they are not visually connected. We note that self-attention is able to model this kind of relationship, yet only through the similarity calculations noted above. Therefore, we propose to adopt the second module of GraFormer, the ChebGConv block defferrard2016convolutional, to conquer this difficulty. The ChebGConv block can exchange information according to the structural relations of nodes and thus attains a larger receptive field than vanilla graph convolutions kipf2016semi. As a result, ChebGConv enables 2D joints to exploit these implicit connections by formulating interactions in a higher-order sphere. As shown in Fig. 1 (c), we verify the effectiveness of ChebGConv with visualizations of implicit connections and ablation studies.
We demonstrate the effectiveness of our method by conducting comprehensive evaluation and ablation studies on standard 3D benchmarks. Experimental results show that our GraFormer outperforms state-of-the-art methods on Human3.6M ionescu2013human3 with only 0.65M parameters. In particular, we achieve 13 first-place results out of 15 categories on Human3.6M. In addition, the superiority of GraFormer is also verified on ObMan hasson2019learning, FHAD garcia2018first, and GHD mueller2018ganerated. The proposed method is task-independent and thus can be easily extended to other graph regression tasks.
In brief, the core contributions of our work are:
We claim that the relations of 2D joints have so far been poorly exploited for 3D pose estimation, and that an in-depth study of the relationships between 2D joints can further improve performance.
We propose the novel GraFormer which comprises two modules, GraAttention and ChebGConv block. Compared to prior methods, both modules aim at further exploiting relations among 2D joints for 3D pose estimation.
We have conducted extensive experiments on public benchmarks, which show that our GraFormer outperforms state-of-the-art techniques for the 3D pose estimation task with far fewer parameters.
3D Pose Estimation
One-stage 3D pose estimation methods usually estimate 3D pose directly from image features. tekin2016direct regresses 3D pose from a spatio-temporal volume of bounding boxes in the central frame. pavlakos2017coarse first predicts a 3D heat map and then yields the 3D pose. mehta2017monocular utilizes transfer learning to produce multi-modal data, which is fused to predict 3D pose. tekin2019h+ utilizes a 3D YOLO redmon2017yolo9000 model combined with temporal information to predict the 3D pose of hand and object simultaneously. li2021exploiting groups and predicts the joint points in a multi-task manner. lin2021end takes features extracted from the images as inputs of a transformer to predict 3D pose.
Differently, multi-stage methods first adopt CNNs to detect 2D joint coordinates, which are then used as inputs for 3D pose estimation. To yield 3D pose, chen20173d proposes to match 2D coordinates with a 3D pose database. martinez2017simple regresses 3D pose from 2D coordinates with a simple fully-connected network. simon2017hand proposes a multi-view method that estimates 3D hand pose by triangulating multiple 2D joint coordinates. hossain2018exploiting treats the 2D coordinate information as a sequence and utilizes temporal information to predict the 3D coordinates in a sequential manner.
Instead of aggregating neighbor information in equal proportions according to the adjacency matrix as in other works kipf2016semi; ge20193d, GAT velivckovic2017graph proposes to use self-attention to learn the weight of each node when aggregating neighbor information. aGCN yang2018graph works in the same way as GAT and learns the weights of neighbor joints through the self-attention mechanism; the difference is that aGCN uses different activation functions and transformation matrices. Although GAT and aGCN improve effectiveness by aggregating neighbor joints, the limited receptive field remains a challenging problem. To attain a global receptive field, zhao2019semantic utilizes the non-local layer wang2018non to learn the relationships between 2D joints, and lin2021end modifies the standard transformer encoder and adjusts the dimension of the encoder layers. However, such interaction ignores the graph structure, as noted above. Our method enlarges the receptive field through the self-attention mechanism and rediscovers the graph structure through GCNs to effectively improve 3D pose estimation performance.
Recently, some works have achieved state-of-the-art results on the 3D pose estimation task using graph convolutional networks ge20193d; zhao2019semantic; doosti2020hope; liu2020comprehensive; xu2021graph. ge20193d uses a stacked hourglass network newell2016stacked to extract features from images, which are later reshaped into a graph structure; a 3D mesh is predicted from the extracted features by graph convolution and then used to predict the 3D pose. zhao2019semantic proposes semantic graph convolution, which learns the weights among neighbor joints, and uses non-local modules to enhance interaction among 2D joints. doosti2020hope proposes modified graph pooling and unpooling operations, which make the up-sampling and down-sampling procedures trainable for graph-structured data. xu2021graph proposes a graph hourglass network and adopts the SE block hu2018squeeze to fuse features extracted from different layers of the network. In this paper, we combine graph convolution operations with a transformer to solve the 3D pose estimation problem.
The proposed framework is shown in Fig. 2. Our method takes 2D joint coordinates as inputs and predicts the target 3D poses. GraFormer is a combination of transformer and graph convolution, which further boosts the interaction of 2D joints to exploit the relations among them compared to prior works zhao2019semantic; doosti2020hope; liu2020comprehensive; xu2021graph. GraFormer is formed by stacking GraAttention and ChebGConv blocks, which are described in detail below.
Graph convolution operation and multi-head self-attention are the fundamental building elements of GraFormer. GCNs have the ability to handle graph-structured data because of the specific design described below. Assume that $X^{(l)} \in \mathbb{R}^{N \times D_l}$ is the input of the $l$-th layer of the GCN, which contains $N$ joints, where $D_l$ denotes the input dimension. The initial input of GraFormer, $X^{(0)}$, is the 2D joint coordinates of the human body or hand. The output of a GCN layer can be formulated as:
$$X^{(l+1)} = \sigma\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X^{(l)} W^{(l)}\big),$$
where $\sigma$ is the activation function, $W^{(l)}$ denotes a learnable weight matrix, $\hat{A}$ is the adjacency matrix with self-connections, and $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$. We make $\hat{A}$ learnable for LAM-Gconv in GraAttention.
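As a concrete illustration, one LAM-Gconv layer can be sketched in NumPy as follows. This is a minimal sketch under our own assumptions (function names, ReLU as the activation, and a non-negative random adjacency are illustrative; the exact normalization and initialization of the learnable adjacency may differ in practice):

```python
import numpy as np

def lam_gconv(X, A_learn, W):
    """One LAM-Gconv layer: a Kipf & Welling style graph convolution whose
    adjacency matrix A_learn is a trainable parameter instead of the fixed
    skeleton graph. X: (N, D_in), A_learn: (N, N), W: (D_in, D_out)."""
    A_hat = A_learn + np.eye(A_learn.shape[0])        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))     # diagonal of D^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ X @ W)            # ReLU activation

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 2))            # 16 human joints, 2D coordinates
A = np.abs(rng.standard_normal((16, 16)))   # learnable adjacency (non-negative here)
W = rng.standard_normal((2, 96))
print(lam_gconv(X, A, W).shape)             # (16, 96)
```

During training, `A_learn` would be updated by gradient descent together with `W`, letting the layer discover edge weights beyond the fixed skeleton.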
The gains achieved by transformer-based methods lin2021end mainly come from the global view, which benefits from the self-attention mechanism described below. Using the same notation, the input $X$ is put into an MLP layer to produce a feature of dimension $3D$, which is then cut into three portions, referred to as the query $Q$, key $K$ and value $V$ respectively. Thereafter, the attention output is attained by:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{\top}}{\sqrt{d}}\Big)V,$$
where $d$ is the per-head feature dimension.
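This computation can be sketched as below. The sketch assumes standard multi-head attention with illustrative shapes and names; it is not the exact implementation:

```python
import numpy as np

def multi_head_self_attention(X, W_qkv, num_heads=4):
    """Standard multi-head self-attention over N joint features.
    X: (N, D); W_qkv: (D, 3D) produces Q, K, V in one linear map."""
    N, D = X.shape
    Q, K, V = np.split(X @ W_qkv, 3, axis=-1)   # cut the 3D-dim feature into 3 portions
    d_h = D // num_heads
    heads = []
    for h in range(num_heads):
        q, k, v = (M[:, h * d_h:(h + 1) * d_h] for M in (Q, K, V))
        scores = q @ k.T / np.sqrt(d_h)                      # joint-to-joint similarities
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)   # row-wise softmax
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)       # (N, D)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 96))
W_qkv = rng.standard_normal((96, 288)) * 0.1
print(multi_head_self_attention(X, W_qkv).shape)  # (16, 96)
```

Note that each output row mixes information from all 16 joints, which is exactly the global interaction discussed in the text.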
From this section, we start to introduce the two core modules of GraFormer, GraAttention and the ChebGConv block. As illustrated in Fig. 2, the 16 2D joint coordinates are first pre-processed by a ChebGConv layer (introduced below), followed by the stack of GraAttention and ChebGConv blocks. This yields 16 feature vectors of dimension $D$ (referred to as the feature dimension), which are then put into GraAttention. Before the multi-head self-attention block in GraAttention, the input feature vectors are first normalized by layer normalization (LN), which is commonly used in transformer models vaswani2017attention.
The multi-head self-attention block is inherited from the transformer encoder layers vaswani2017attention. Different from other applications of transformers to 3D pose estimation such as lin2021end, we remove the MLP layer from the standard transformer, because we observe that the MLP impedes the learning procedure, as shown in our experiments. The self-attention output is then regularized by a dropout layer srivastava2014dropout. Each element of the output of the multi-head attention block contains information from all 2D joints, because the global interaction on the graph involves all of them, such that non-local relationships can be exploited.
Next, the output is normalized by an LN layer followed by a GCN layer and a ReLU activation. Different from the vanilla graph convolution operation, we make the adjacency matrix learnable so that the GCN layer is more flexible in learning graph-structured data; this GCN layer is the LAM-Gconv noted above. A dropout layer follows. GraAttention is thus a combination of a multi-head self-attention block and a GCN layer, and both blocks include a shortcut connection, as shown in Fig. 2.
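Putting the pieces together, one GraAttention block can be sketched as follows. For brevity the sketch uses single-head attention and omits dropout; all names and shapes are illustrative assumptions rather than the exact implementation:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Per-row layer normalization (no learnable scale/shift for brevity)."""
    return (X - X.mean(-1, keepdims=True)) / np.sqrt(X.var(-1, keepdims=True) + eps)

def gra_attention(X, W_qkv, A_learn, W_g):
    """GraAttention sketch: LN -> self-attention (+ shortcut),
    then LN -> LAM-Gconv with ReLU (+ shortcut). X: (N, D)."""
    # self-attention sub-block with residual connection
    Z = layer_norm(X)
    Q, K, V = np.split(Z @ W_qkv, 3, axis=-1)
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    S = np.exp(S - S.max(-1, keepdims=True))
    S /= S.sum(-1, keepdims=True)                     # softmax over joints
    X = X + S @ V
    # LAM-Gconv sub-block with residual connection
    Z = layer_norm(X)
    A_hat = A_learn + np.eye(A_learn.shape[0])        # learnable adjacency + self-loops
    d = 1.0 / np.sqrt(A_hat.sum(1))
    A_norm = A_hat * d[:, None] * d[None, :]
    return X + np.maximum(0.0, A_norm @ Z @ W_g)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 96))
out = gra_attention(X, rng.standard_normal((96, 288)) * 0.1,
                    np.abs(rng.standard_normal((16, 16))),
                    rng.standard_normal((96, 96)) * 0.1)
print(out.shape)  # (16, 96)
```

The two residual shortcuts mirror the transformer encoder layout, with the LAM-Gconv standing in for the removed MLP.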
The second graph convolution operation we use is the Chebyshev graph convolution defferrard2016convolutional, referred to as ChebGConv for short. Compared with traditional GCN layers, ChebGConv is more powerful at handling graph-structured data, as formulated below. The normalized graph Laplacian is computed as:
$$L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}},$$
where $A$ is the adjacency matrix and $D$ is the diagonal degree matrix.
The formula of Chebyshev graph convolution is:
$$X' = \sum_{k=0}^{K-1} T_k(\tilde{L})\, X\, \Theta_k,$$
where $T_k$ denotes the Chebyshev polynomial of degree $k$, defined by the recurrence $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$ with $T_0 = 1$ and $T_1 = x$, and $\tilde{L}$ denotes the rescaled Laplacian, $\tilde{L} = 2L/\lambda_{\max} - I_N$, where $\lambda_{\max}$ is the maximum eigenvalue of $L$. $\Theta_k$ denotes the trainable parameters of the graph convolutional layer. Since the convolution kernel is a $K$-order polynomial of the graph Laplacian, the ChebGConv block is able to fuse information among the $K$-hop neighbors of a joint, which brings a larger receptive field. Coupled with the characteristics of the graph convolution filter, the ChebGConv block boosts the performance, as verified in our experiments.
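The formula above can be sketched as a NumPy function. The toy chain-shaped graph and the weight shapes below are illustrative assumptions, not the skeleton used in the paper:

```python
import numpy as np

def cheb_gconv(X, A, thetas):
    """K-order Chebyshev graph convolution (ChebGConv) sketch.
    X: (N, D_in); A: (N, N) adjacency; thetas: list of K weight
    matrices of shape (D_in, D_out), one per Chebyshev order."""
    N = A.shape[0]
    d = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(N) - A * d[:, None] * d[None, :]       # normalized Laplacian
    lam_max = np.linalg.eigvalsh(L).max()
    L_t = 2.0 * L / lam_max - np.eye(N)               # rescaled Laplacian
    Tk_prev, Tk = np.eye(N), L_t                      # T_0 and T_1
    out = X @ thetas[0]                               # T_0 term (identity)
    for k in range(1, len(thetas)):
        out = out + Tk @ X @ thetas[k]
        Tk_prev, Tk = Tk, 2.0 * L_t @ Tk - Tk_prev    # Chebyshev recurrence
    return out

rng = np.random.default_rng(0)
A = np.zeros((16, 16))
for i in range(15):                                   # toy chain graph as a stand-in skeleton
    A[i, i + 1] = A[i + 1, i] = 1.0
X = rng.standard_normal((16, 2))
thetas = [rng.standard_normal((2, 96)) for _ in range(3)]  # K = 3
print(cheb_gconv(X, A, thetas).shape)                 # (16, 96)
```

With `K = 1` the layer degenerates to a per-node linear map; each additional order lets a joint aggregate features from one more hop of neighbors.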
Although ChebGConv is computationally more expensive than a traditional GCN layer, the complexity of the ChebGConv block increases only slightly because the graph is simple, with only 16 joints.
To train GraFormer, we apply a loss function to the final output to minimize the error between the 3D predictions and the ground truth. We are given a dataset $\{(x_i, y_i)\}_{i=1}^{T}$, where $x_i \in \mathbb{R}^{N \times 2}$ contains the 2D joints of the human body or hand ($N$ is set to 16 for the human datasets and 21 for the hand datasets), $y_i \in \mathbb{R}^{N \times 3}$ is the 3D ground-truth coordinates, and $T$ denotes the total number of training samples.
We use the mean squared error (MSE) to minimize the error between the 3D predictions and the ground-truth coordinates:
$$\mathcal{L} = \frac{1}{T} \sum_{i=1}^{T} \left\| \tilde{y}_i - y_i \right\|_2^2,$$
where $\tilde{y}_i$ denotes the predicted 3D coordinates. All 3D coordinate errors are calculated in millimeters.
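In code, the loss is a one-liner. This is a sketch; whether the squared error is summed or averaged over joints is an implementation detail we do not fix here:

```python
import numpy as np

def mse_loss(pred, gt):
    """MSE between predicted and ground-truth 3D joints (millimeters).
    pred, gt: (T, N, 3) for T samples and N joints."""
    return np.mean(np.sum((pred - gt) ** 2, axis=-1))

gt = np.zeros((4, 16, 3))
pred = np.ones((4, 16, 3))      # every coordinate off by 1 mm
print(mse_loss(pred, gt))       # 3.0 (1^2 summed over x, y, z)
```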
In this section, we first introduce the experimental details and training settings. Next, we compare GraFormer with other state-of-the-art methods and analyze the results. Finally, we conduct ablation studies to verify the effectiveness of GraFormer.
We use 3 popular hand datasets, including ObMan hasson2019learning, FHAD garcia2018first, GHD mueller2018ganerated, and 1 human pose dataset Human3.6M ionescu2013human3 to evaluate our GraFormer. ObMan is a large synthetic dataset of hand-object interaction scenarios. The hands are generated from MANO romero2017embodied and the objects are selected from the Shapenet chang2015shapenet dataset. The ObMan dataset contains 141K training frames and 6K evaluation frames. Each frame contains an RGB image, a depth image, a 3D mesh of the hand and object, and 3D coordinates for the hand.
FHAD garcia2018first contains videos of manipulating different objects from the first-person perspective. There are a total of 21,501 frames of images, of which 11,019 frames are used for training and 10,482 frames for testing.
There are 143,449 frames without objects, and 188,050 frames with objects in the GHD mueller2018ganerated dataset. The dataset is divided into groups and each group consists of 1024 frames. There are a total of 141 groups without objects and we use the first 130 groups for training and the last 11 groups for testing.
Human3.6M ionescu2013human3 is the most widely used dataset in the 3D human pose estimation task. It provides 3.6M accurate 3D poses captured by a MoCap system in an indoor environment. It contains 15 actions performed by seven actors and captured from four cameras. There are two common evaluation protocols for splitting the training and testing sets in prior methods martinez2017simple; zhao2019semantic; liu2020comprehensive; xu2021graph. The first protocol uses subjects S1, S5, S6, S7 and S8 for training, and S9 and S11 for testing; errors are calculated after the ground truth and predictions are aligned at the root joints. The second protocol uses subjects S1, S5, S6, S7, S8 and S9 for training, and S11 for testing. We conduct experiments using the first protocol. In this way, there are 1,559,752 frames for training and 543,344 frames for testing.
We follow the same evaluation metric as in zhao2019semantic. The evaluation metric is the Mean Per Joint Position Error (MPJPE) in millimeters, which is calculated between the ground truth and the predicted 3D coordinates across all cameras and joints after aligning the predefined root joints (the pelvis joint).
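The metric can be sketched as below (the root joint index and array shapes are illustrative assumptions):

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Mean Per Joint Position Error in millimeters: mean Euclidean distance
    over all joints after root (pelvis) alignment. pred, gt: (T, N, 3)."""
    pred = pred - pred[:, root:root + 1]    # align each prediction to its root joint
    gt = gt - gt[:, root:root + 1]          # align each ground truth to its root joint
    return np.linalg.norm(pred - gt, axis=-1).mean()

rng = np.random.default_rng(0)
gt = rng.standard_normal((8, 16, 3))
offset = rng.standard_normal((8, 1, 3))
print(mpjpe(gt + offset, gt))   # ~0: a global translation vanishes after root alignment
```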
For the three hand datasets ObMan, FHAD and GHD, we directly take 2D image coordinates and 3D camera coordinates as the inputs and ground truth. The 2D coordinates and 3D ground truth provided by ObMan can be used by simply converting the ground truth from meters to millimeters. The 3D ground truth provided by FHAD is in world coordinates, and we use the extrinsic matrix to compute the corresponding camera coordinates. The 3D ground truth of GHD is already in the camera coordinate system, and the 2D coordinates need to be cropped, scaled and restored. We use the hand coordinates as input but not the object coordinates for all hand datasets. For Human3.6M, because of the multiple camera views, the data needs to be normalized according to zhao2019semantic before training and evaluation.
In our experiments, we set N in Fig. 2 to 5 and adopt 4 heads for self-attention. Different from the feature dimension of 64 or 128 in prior works zhao2019semantic; xu2021graph, we set the middle feature dimension of the model to 96, with a dropout rate of 0.25. We adopt the Adam kingma2014adam optimizer with an initial learning rate of 0.001 and mini-batches of 64. For Human3.6M, we multiply the learning rate by 0.9 every 75,000 steps; for the hand datasets, the learning rate decays by 0.9 every 30 epochs. We train GraFormer for 50 epochs on Human3.6M, 900 epochs on ObMan and GHD, and 3000 epochs on FHAD.
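The step-wise schedule for Human3.6M can be written as a small function (the function and parameter names are illustrative; only the constants come from the settings above):

```python
def learning_rate(step, base_lr=1e-3, decay=0.9, decay_steps=75_000):
    """Learning rate after a given training step: start at 0.001 and
    multiply by 0.9 every 75,000 steps (the Human3.6M setting)."""
    return base_lr * decay ** (step // decay_steps)

print(learning_rate(0))         # 0.001
print(learning_rate(150_000))   # 0.001 * 0.9 ** 2
```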
Performance and Comparison
In this section, we evaluate our GraFormer on several popular datasets and analyze its performance in comparison with other state-of-the-art methods.
Performance on Hand Datasets
We first verify GraFormer on the hand datasets by comparing it with Graph U-Net doosti2020hope and the Linear model martinez2017simple, since all of these methods regress 3D poses by taking 2D ground-truth coordinates as inputs. The results on the 3 hand datasets are shown in Tab. 2. One can see that GraFormer achieves the best performance across all 3 hand datasets; in particular, it surpasses Graph U-Net by large margins on ObMan, FHAD and GHD. We note that small datasets are not friendly to self-attention; even so, GraFormer still beats Graph U-Net on the very small FHAD dataset.
Performance on Human Pose Dataset
Prior results on Human3.6M can be divided into two groups, as shown in Tab. 1. The top-group methods take images as inputs and then yield the 3D poses using the learned features. For our method, we adopt the 2D coordinate detection results provided by zhao2019semantic, which are detected by a stacked hourglass network newell2016stacked. In the top group, we only compare our method with the methods before zhao2019semantic, because the newer methods use more advanced 2D detectors that provide more accurate 2D coordinates, but their code is not publicly available. In the bottom group, we compare with the most advanced methods using the same 2D inputs. The results show that GraFormer surpasses all these methods, which demonstrates the superiority of our approach.
The bottom-group methods take 2D ground truth as inputs to predict the 3D pose coordinates directly. Compared with previous methods, GraFormer achieves the best performance, which demonstrates the effectiveness of our method. In particular, GraFormer clearly improves the scores on Direction, Eating, Greet, Phone, Pose, Smoke and Wait with only 0.65M parameters (18% of xu2021graph). Interestingly, we find that these actions involve large-magnitude motions, which implies longer distances among 2D joints than actions such as Photo and SittingD. This suggests that GraFormer has a more powerful capability to capture information from 2D graph joints under larger-magnitude motions.
Discussion on model parameters
We start our ablation experiments by comparing GraFormer under different parameter configurations with other methods on the Human3.6M dataset, as shown in Tab. 3. We report the results of two configurations of our model to show that our method achieves better results with fewer parameters than other methods. GraFormer-small has only 2 layers, and its feature dimension is 64 with a dropout rate of 0.1. The full GraFormer sets N to 5, with a feature dimension of 96 and a dropout rate of 0.25. Our GraFormer achieves better results with much fewer parameters than GraphSH and other methods. Our lightweight version, GraFormer-small, with far fewer parameters than SemGCN, still beats SemGCN.
Effects of GraFormer Modules
Next, we test the GraFormer modules on Human3.6M ionescu2013human3 and ObMan hasson2019learning. We design 5 models by removing or replacing GraFormer's modules to test their effects. All parameters are kept the same for the 5 models unless particularly indicated. Specifically, Model-T is formed by replacing the stack of GraAttention and ChebGConv blocks with a transformer encoder. Model-C removes GraAttention from GraFormer. Model-M replaces GraAttention with self-attention. Model-AT removes the ChebGConv block from GraFormer. Model-AM retains the MLP compared to GraFormer. The results are shown in Tab. 4.
From the results of Model-T, we find that a plain transformer performs poorly for 2D-to-3D pose estimation, because it ignores the graph structure information. The results of Model-C and Model-M are worse than GraFormer's, which proves that GraAttention is necessary and more effective than plain self-attention. Removing the ChebGConv block in Model-AT brings worse performance, which illustrates the effectiveness of the ChebGConv block. The performance also degrades when the MLP layer is plugged in after self-attention in GraAttention, which verifies that the MLP layer actually impedes the learning of 3D poses. In Fig. 4, we show the test errors of Model-AM and GraFormer. We find that the test error of Model-AM stalls at about 250 epochs, while the test error of GraFormer keeps decreasing until below 12. This reconfirms that the MLP layer can be a deterrent in the 2D-to-3D pose estimation task.
Fig. 3 shows visualizations of the learned adjacency matrices of the LAM-Gconv layers in GraAttention. In sub-figure (1), the color of some 3×3 regions is obviously brighter than that of other regions, which indicates that interaction among these joints takes greater weight and that these joints are closely connected. Interestingly, we note that joints 2-4, 5-7, 11-13 and 14-16, which are activated in these regions, are the four limbs of the human. This implies that the relations of joints on a limb are strongly connected, and our GraFormer is able to find these relations effectively. The activated regions in sub-figures (2) to (4) become larger, which shows that GraFormer finds long-range relationships in these layers. Sub-figure (5) shows that the interaction regions become much smaller, which implies that joints mainly retain their own information, with only a little information interacting among joints.
In Fig. 5, we show the predicted 3D hand results on ObMan hasson2019learning (top row), FHAD garcia2018first (middle row) and GHD mueller2018ganerated (bottom row). The images in columns 1, 3 and 5 are skeleton figures drawn using the 2D ground truth. In columns 2, 4 and 6, the colored skeletons are drawn using the 3D predictions, and the black skeletons are drawn using the 3D ground truth. We can see that our method is able to estimate the 3D poses accurately from 2D coordinates, showing that it effectively learns 3D poses by exploiting the relationships among 2D joints.
Fig. 6 visualizes the graph Laplacian at different orders of the Chebyshev graph convolution. The top row shows schematic diagrams of joint information aggregation corresponding to the bottom row; the width of a line indicates the weight between two 2D joints. The bottom row shows the corresponding Laplacian matrices of order 0 (a), 1 (b) and 2 (c). It is easy to see that higher-order matrices activate more 2D joints, which implies that higher orders of the Laplacian matrix have the capability to find more implicit relations.
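This growth of the receptive field with the polynomial order can also be checked numerically. The sketch below uses a toy 5-joint graph of our own choosing (a chain with one extra edge, made non-bipartite so that the rescaled Laplacian has a non-zero diagonal) and counts the non-zero entries of the first three Chebyshev terms:

```python
import numpy as np

# Toy 5-joint graph: chain 0-1-2-3-4 plus an extra edge 0-2 (non-bipartite).
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]:
    A[i, j] = A[j, i] = 1.0
d = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(5) - A * d[:, None] * d[None, :]               # normalized Laplacian
L_t = 2.0 * L / np.linalg.eigvalsh(L).max() - np.eye(5)   # rescaled Laplacian

T = [np.eye(5), L_t, 2.0 * L_t @ L_t - np.eye(5)]         # T_0, T_1, T_2
counts = [int((np.abs(Tk) > 1e-9).sum()) for Tk in T]
print(counts)  # non-zero entries spread to more joint pairs as the order grows
```

Each `T_k` is non-zero only for joint pairs within k hops, so the count of activated entries grows with the order, matching the visual observation in Fig. 6.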
The 2D-to-3D pose estimation task takes graph-structured 2D joint coordinates as inputs. We claim that the relations of 2D joints remain under-exploited and that performance can be further boosted by exploiting these relations. We propose a novel model combining graph convolution and the transformer, called GraFormer, for 3D pose estimation, which aims at better exploiting the relations among graph-structured 2D joints. Extensive experimental results on popular benchmarks show that our method achieves state-of-the-art performance compared with previous works while using much fewer parameters.