TFPose: Direct Human Pose Estimation with Transformers

by   Weian Mao, et al.

We propose a human pose estimation framework that solves the task in a regression-based fashion. Unlike previous regression-based methods, which often fall behind state-of-the-art methods, we formulate the pose estimation task as a sequence prediction problem that can be solved effectively by transformers. Our framework is simple and direct, bypassing the drawbacks of heatmap-based pose estimation. Moreover, with the attention mechanism in transformers, our proposed framework is able to adaptively attend to the features most relevant to the target keypoints, which largely overcomes the feature misalignment issue of previous regression-based methods and considerably improves the performance. Importantly, our framework can inherently take advantage of the structured relationship between keypoints. Experiments on the MS-COCO and MPII datasets demonstrate that our method can significantly improve the state-of-the-art of regression-based pose estimation and perform comparably with the best heatmap-based pose estimation methods.



1 Introduction

Human pose estimation requires the computer to obtain the human keypoints of interest in an input image and plays an important role in many computer vision tasks such as human behavior understanding.

Existing mainstream methods for this task can be generally categorized into heatmap-based (Figure 1 top) and regression-based methods (Figure 1 bottom). Heatmap-based methods often first predict a heatmap or a classification score map with fully convolutional networks (FCNs), and then locate the body joints at the peaks of the heatmap or the score map. Most pose estimation methods are heatmap-based because they achieve relatively higher accuracy. However, heatmap-based methods may suffer from the following issues. 1) A post-processing step (e.g., the “taking-maximum” operation) is needed. The post-processing might not be differentiable, making the framework not end-to-end trainable. 2) The resolution of the heatmaps predicted by the FCNs is usually lower than the resolution of the input image. The reduced resolution results in a quantization error and limits the precision of the keypoint localization. This quantization error might be reduced by shifting the output coordinates according to the values of the pixels near the peak, but this makes the framework much more complicated and introduces more hyper-parameters. 3) The ground-truth heatmaps need to be manually designed and heuristically tuned, which might introduce noise and ambiguity into the ground-truth maps, as shown in [31, 41, 21].
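The quantization issue in point 2) can be illustrated with a small toy example. The sketch below is not taken from any heatmap-based method; the helper names `make_heatmap` and `decode_argmax`, the stride of 4 and the grid size are assumptions chosen for illustration. It decodes a keypoint by taking the maximum of a low-resolution score map and shows that the recovered coordinate can only fall on the grid defined by the stride.

```python
def make_heatmap(x, y, height, width, stride):
    """Toy score map (hypothetical): highest at the cell nearest to image point (x, y)."""
    return [
        [-((c * stride - x) ** 2 + (r * stride - y) ** 2) for c in range(width)]
        for r in range(height)
    ]


def decode_argmax(heatmap, stride):
    """'Taking-maximum' decoding: map the peak cell back to image coordinates."""
    cells = ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0])))
    row, col = max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])
    return (col * stride, row * stride)


# Ground-truth keypoint in image coordinates (assumed values).
true_xy = (37.0, 21.0)
heatmap = make_heatmap(*true_xy, height=16, width=12, stride=4)
pred_xy = decode_argmax(heatmap, stride=4)

# The prediction is snapped to a multiple of the stride, so an error remains
# even though the score map itself is "perfect".
error = (abs(pred_xy[0] - true_xy[0]), abs(pred_xy[1] - true_xy[1]))
print(pred_xy, error)
```

In this toy setting the per-axis error is bounded by half the stride, which is why lower-resolution heatmaps limit localization precision unless an extra sub-pixel refinement step is added.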

(a) Heatmap-based method
(b) Regression-based method
Figure 1: Comparison of mainstream pose estimation pipelines. (a) Heatmap-based methods. (b) Regression-based methods.

In contrast, regression-based methods usually map the input image directly to the coordinates of body joints with an FC (fully-connected) prediction layer, eliminating the need for heatmaps. The pipeline of regression-based methods is much more straightforward than that of heatmap-based methods since, in principle, pose estimation is a regression task, like object detection. Moreover, regression-based methods can bypass the aforementioned drawbacks of heatmap-based methods, and are thus more promising.

However, only a few research works focus on regression-based methods because they often perform worse than heatmap-based methods. The reasons may be three-fold. First, in order to reduce the network parameters in the FC layer, DeepPose [46] applies a global average pooling to reduce the feature map resolution before the FC layer. This global average pooling destroys the spatial structure of the convolutional feature maps and significantly deteriorates the performance. Next, as shown in DirectPose [44] and SPM [35], in regression-based methods the convolutional features and predictions are misaligned, which results in low localization precision of the keypoints. Moreover, regression-based methods only regress the coordinates of body joints and do not take into account the structured dependency between these keypoints [41].

Recently, we have witnessed the rise of vision transformers [54, 15, 6]. Transformers were originally designed for sequence-to-sequence tasks, which inspires us to formulate single-person pose estimation as the problem of predicting a K-length sequence of coordinates, where K is the number of body joints for one person. This leads to a simple and novel regression-based pose estimation framework, termed TFPose (i.e., Transformer-based Pose Estimation). As shown in Figure 2, taking the feature maps of CNNs as inputs, the transformer predicts the sequence of coordinates. TFPose can bypass the aforementioned difficulties of regression-based methods. First, it does not need the global average pooling used in DeepPose [46]. Second, due to the multi-head attention mechanism, our method can avoid the feature misalignment between the convolutional features and predictions. Third, since we predict the keypoints as a sequence, the transformer can naturally capture the structured dependency between the keypoints, resulting in improved performance.

We summarize the main contributions as follows.

  • TFPose is the first transformer-based pose estimation framework. Our proposed framework adapts to the simple and straightforward regression-based methods, which is end-to-end trainable and can overcome many drawbacks of the heatmap-based methods.

  • Moreover, our TFPose can naturally learn to exploit the structured dependency between the keypoints without heuristic designs, e.g., in [41]. This results in improved performance and better interpretability.

  • TFPose greatly advances the state-of-the-art of regression-based methods, making regression-based methods comparable to the state-of-the-art heatmap-based ones. For example, we improve on the previously best regression-based method of Sun et al. [42] by 4.4% AP on the COCO keypoint detection task, and on Aiden et al. [34] by 0.9% PCK on the MPII benchmark.

Figure 2: Overall pipeline of TFPose. The model directly predicts a sequence of keypoint coordinates in parallel by combining a common CNN with a transformer architecture. A transformer decoder takes as input a fixed number of keypoint queries and the encoder output. Then, we pass the output embedding of the decoder to a multi-layer feed-forward network that predicts the final keypoint coordinates.

2 Related Work

Transformers in computer vision. After being proposed in [47], Transformers have achieved significant progress in NLP (Natural Language Processing) [14, 2]. Recently, Transformers have also attracted much attention in the computer vision community. For the basic image classification task, ViT [15] applies a pure Transformer to sequences of image patches. Apart from image classification, vision Transformers are also widely applied to object detection [6, 54], segmentation [48, 49], pose estimation [22, 23, 28] and low-level vision tasks [8]. For more details, we refer to [16]. In particular, DETR [6] and Deformable DETR [54] formulate object detection as the prediction of a set of boxes so that the detection model can be trained end-to-end; the Transformer applications in both 3D hand pose estimation [22, 23] and 3D human pose estimation [22, 23] show that the Transformer is suitable for modeling human pose.

Heatmap-based 2D pose estimation. Heatmap-based 2D pose estimation methods [9, 51, 40, 4, 26, 5, 10, 17, 33] achieve state-of-the-art accuracy in 2D human pose estimation. Recently, most works, both top-down and bottom-up, are heatmap-based. [33] first proposes the hourglass-style framework, which has since been widely applied, e.g., in [4, 26]. [40] proposes a novel network architecture for heatmap-based 2D pose estimation and achieves excellent performance. [10] proposes a new bottom-up method that achieves impressive performance on the CrowdPose dataset [25] and is further improved by [31]. [4] proposes an efficient network that achieves state-of-the-art performance on the COCO keypoint detection dataset [29]. However, [42, 44] argue that heatmap-based methods cannot be trained end-to-end due to the “taking-maximum” operation. Recently, noise and ambiguity in the ground-truth heatmaps were identified by [31, 41]. [21] finds that the heatmap data processing applied by most previous works is biased and proposes a new unbiased data processing method.

Regression-based 2D pose estimation. 2D human pose estimation is naturally a regression problem [42]. However, regression-based methods are not as accurate as heatmap-based methods, so there are only a few works [46, 42, 41, 7, 44, 35] on them. Apart from that, although some methods, such as G-RMI [36], apply regression to reduce the quantization errors caused by heatmaps, they are essentially heatmap-based methods. Some works point out the reasons for the poor performance of regression-based methods. DirectPose [44] points out the feature misalignment issue and proposes a mechanism to align the features and the predictions; [41] indicates that regression-based methods cannot learn structure-aware information well and proposes a hand-designed model for pose estimation that forces regression-based methods to learn the structure-aware information better; Sun et al. [42] propose integral regression, which shares the merits of both the heatmap representation and regression approaches, to avoid the non-differentiable post-processing and quantization error issues.

3 Our Approach

3.1 TFPose Architecture

This work focuses on the single-person pose estimation task. Following previous works, we first apply a person detector to obtain the bounding boxes of persons. Then, according to the detected boxes, each person is cropped from the input image. We denote the cropped image by $I \in \mathbb{R}^{h \times w \times 3}$, where $h$ and $w$ are the height and the width of the image, respectively. Given the cropped image of a single person, previous heatmap-based methods apply a convolutional neural network $\mathcal{F}$ to the patch to predict the keypoint heatmaps $H$ ($H_k$ for the $k$-th joint, $k = 1, \dots, K$) of this person, where $K$ is the number of predicted keypoints. Formally, we have

$$H = \mathcal{F}(I).$$

Each pixel of $H_k$ represents the probability that the $k$-th body joint is located at this pixel. To obtain the joints’ coordinates $J \in \mathbb{R}^{K \times 2}$ ($J_k$ for the $k$-th joint), those methods usually use the “taking-maximum” operation to obtain the locations with peak activations. Formally, let $p$ be the spatial locations on $H$; the operation can be formulated as

$$J_k = \arg\max_{p} H_k(p).$$

Note that in heatmap-based methods, the localization precision of $p$ is limited by the resolution of $H$, which is often much lower than the resolution of the input and thus causes quantization errors. Moreover, the $\arg\max$ operation is not differentiable, making the pipeline not end-to-end trainable. In TFPose, we instead treat $J$ as a $K$-length sequence and directly map the input $I$ to the body joints’ coordinates $J$. Formally,

$$J = \mathcal{F}(I),$$

where $\mathcal{F}$ is here composed of three main components: a standard CNN backbone to extract multi-level feature representations, a feature encoder to capture and fuse the multi-level features, and a coarse-to-fine decoder to generate a sequence of keypoint coordinates. It is illustrated in Figure 2. Note that TFPose is fully differentiable and its localization precision is not limited by the resolution of the feature maps.

3.2 Transformer Encoder

Figure 3: Positional encoding. This figure illustrates the positional embeddings added to the input of the transformer. $E_L$ represents the level embeddings, which indicate the level a feature vector comes from. $E_P$ represents the pixel embedding, which indicates the spatial location of a feature vector on the feature maps. We use $\hat{F}_0$ to denote $F_0$ with position embedding. Following [54], both $F_0$ and $\hat{F}_0$ are the inputs of the transformer.

As shown in Figure 3, the backbone extracts multi-level features of the input image. The multi-level feature maps are denoted by $C_2$, $C_3$, $C_4$ and $C_5$, respectively, whose strides are 4, 8, 16 and 32, respectively. We separately apply a $1 \times 1$ convolution to these feature maps so that they have the same number of output channels. These feature maps are flattened and concatenated together, which results in the input to the first encoder layer of the transformer, $F_0 \in \mathbb{R}^{n \times c}$, where $n$ is the number of pixels in $F_0$ and $c$ is the number of channels. Here, we use $F_i$ to denote the output of the $i$-th encoder layer of the transformer. Following [54, 47], $F_0$ is added with the positional embeddings, and we denote $F_0$ with the positional embeddings by $\hat{F}_0$. The details of the positional embeddings will be discussed in Section 3.2. Afterwards, both $F_0$ and $\hat{F}_0$ are sent to the transformer to compute the memory $M \in \mathbb{R}^{n \times c}$. With the memory $M$, a query matrix $Q \in \mathbb{R}^{K \times c}$ will be used in the transformer decoder to obtain the $K$ body joints’ coordinates $J \in \mathbb{R}^{K \times 2}$.

We follow Deformable DETR [54] to design the encoder in our transformer. As mentioned before, before $F_0$ is taken as input, each feature vector of $F_0$ is added with the positional embeddings. Following Deformable DETR, we use both level embeddings $E_L$ and pixel position embeddings $E_P$. The former encode the level that the feature vector comes from, and the latter encode the feature vector’s spatial location on the feature maps. As shown in Figure 3, all the feature vectors from level $l$ are added with the level embedding $E_L^{l}$, and then the feature vectors are added with their pixel position embeddings $E_P$, where $E_P$ is the 2-D cosine positional embedding corresponding to the 2-D location of the feature vector on the feature maps.

In TFPose, we use $N_E$ encoder layers. For the $e$-th encoder layer, as shown in Figure 4, the previous encoder layer’s outputs are taken as the input of this layer. Following Deformable DETR, we also compute the pixel-to-pixel attention between the output vectors of each encoder layer (denoted by ‘p2p attention’). After $N_E$ transformer encoder layers are applied, we obtain the memory $M$.
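To make the construction of the encoder input more concrete, the following minimal sketch flattens four feature levels into one token list and builds, for each token, a positional embedding as the sum of a pixel embedding and a level embedding. It is an illustration under stated assumptions, not the authors' implementation: the 256×192 input size, the embedding width of 8, the fixed level-embedding values (learned in practice) and the DETR-style sine/cosine formula in `sine_position_embedding` are all assumed here.

```python
import math


def sine_position_embedding(x, y, dim):
    """2-D sinusoidal embedding (assumed form): half the channels encode x, half encode y."""
    quarter = dim // 4
    emb = []
    for coord in (x, y):
        for i in range(quarter):
            freq = 1.0 / (10000 ** (i / quarter))
            emb.extend([math.sin(coord * freq), math.cos(coord * freq)])
    return emb


def build_positional_tokens(level_shapes, level_embeddings, dim):
    """Flatten all levels; each token gets pixel embedding + level embedding."""
    tokens = []
    for level, (height, width) in enumerate(level_shapes):
        for row in range(height):
            for col in range(width):
                pixel = sine_position_embedding(col, row, dim)
                pos = [p + e for p, e in zip(pixel, level_embeddings[level])]
                tokens.append((level, row, col, pos))
    return tokens


dim = 8
# Hypothetical 256x192 crop; strides 4, 8, 16 and 32 as stated in the text.
level_shapes = [(256 // s, 192 // s) for s in (4, 8, 16, 32)]
# Stand-in for learned level embeddings: one constant vector per level.
level_embeddings = [[0.1 * level] * dim for level in range(len(level_shapes))]

tokens = build_positional_tokens(level_shapes, level_embeddings, dim)
print(len(tokens))  # total number of feature vectors n fed to the encoder
```

For the assumed input size, the flattened sequence contains 64·48 + 32·24 + 16·12 + 8·6 = 4080 feature vectors, most of which come from the stride-4 level.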

3.3 Transformer Decoder

Figure 4: Transformer architecture. During training, deconvolution modules are used to upsample the transformer encoder output for the auxiliary loss. During testing, only the output of the Transformer decoder is used. ‘Norm’ represents normalization; ($x_k$, $y_k$) represents the coordinates of the $k$-th keypoint.

In the decoder, we aim to decode the desired keypoint coordinates from the memory $M$. As mentioned before, we use a query matrix $Q \in \mathbb{R}^{K \times c}$ to achieve this. $Q$ is essentially an extra learnable matrix, which is jointly updated with the model parameters during training and each row of which corresponds to a keypoint. In TFPose, we have $N_D$ transformer decoder layers. As shown in Figure 4, the $d$-th decoder layer takes as input the memory $M$ and the output of the previous decoder layer, $Q_{d-1}$. The first layer takes as inputs $M$ and the matrix $Q$ (i.e., $Q_0 = Q$). Similarly, $Q_{d-1}$ is added with the positional embeddings. The result is denoted by $\hat{Q}_{d-1}$. $Q_{d-1}$ and $\hat{Q}_{d-1}$ are sent to the query-to-query attention module (denoted as ‘q2q attention’), which aims to model the dependency between human body joints. The q2q attention module uses $Q_{d-1}$, $\hat{Q}_{d-1}$ and $\hat{Q}_{d-1}$ as values, queries and keys, respectively. Later, the memory $M$ and the output of the q2q attention module are used to compute the pixel-to-query attention (denoted as ‘p2q attention’), with the former providing the values and the latter the queries. Then, an MLP is applied to the output of the p2q attention to obtain the output of the decoder layer, $Q_d$. The keypoint coordinates are predicted by applying an MLP with 2 output channels to each row of $Q_d$.
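The query-to-query attention described above can be sketched in a few lines. The example below is a simplified, single-head version written for illustration: it omits the learned query, key and value projections and the multi-head split used in practice, and the function name `q2q_attention` as well as the toy numbers are assumptions. It keeps the one property stated in the text, namely that queries and keys carry the positional embeddings while the values do not.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def q2q_attention(queries, pos):
    """Single-head self-attention over keypoint queries (no learned projections).

    Queries and keys are the keypoint embeddings plus positional embeddings;
    values are the keypoint embeddings without positional embeddings.
    """
    dim = len(queries[0])
    with_pos = [[q + p for q, p in zip(q_row, p_row)] for q_row, p_row in zip(queries, pos)]
    scale = 1.0 / math.sqrt(dim)
    weights = [
        softmax([scale * sum(a * b for a, b in zip(q_vec, k_vec)) for k_vec in with_pos])
        for q_vec in with_pos
    ]
    output = [
        [sum(w * value[j] for w, value in zip(w_row, queries)) for j in range(dim)]
        for w_row in weights
    ]
    return output, weights


# Three toy keypoint queries with four channels each (values are arbitrary).
queries = [[0.2, -0.1, 0.4, 0.0], [0.1, 0.3, -0.2, 0.5], [-0.3, 0.2, 0.1, 0.1]]
pos = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
output, weights = q2q_attention(queries, pos)
```

Each row of `weights` is a distribution over all keypoints, so every updated keypoint embedding is a mixture of the other keypoints' embeddings; this is the mechanism through which dependencies between joints can be modeled.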

Instead of simply predicting the keypoint coordinates in the final decoder layer, inspired by [7, 20, 54], we require all the decoder layers to predict the keypoint coordinates. Specifically, we let the first decoder layer directly predict the target coordinates. Then, every other decoder layer refines the predictions of its previous decoder layer by predicting refinements $\Delta \hat{J}_d$. In that way, the keypoint coordinates can be progressively refined. Formally, let $\hat{J}_{d-1}$ be the keypoint coordinates predicted by the $(d-1)$-th decoder layer; the predictions of the $d$-th decoder layer are

$$\hat{J}_d = \sigma\left(\sigma^{-1}(\hat{J}_{d-1}) + \Delta \hat{J}_d\right),$$

where $\sigma$ and $\sigma^{-1}$ denote the sigmoid and inverse sigmoid function, respectively. $\hat{J}_0$ is a randomly-initialized matrix and is jointly updated with the model parameters during training.
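The layer-by-layer refinement can be sketched as follows. This is a minimal illustration of the update described above (add the predicted refinement in inverse-sigmoid space, then map back with the sigmoid), assuming coordinates normalized to the range (0, 1); the function names and the clamping constant `eps` are assumptions, not details from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    """Logit function, with clamping so that values at 0 or 1 stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def refine(prev_coords, deltas):
    """One refinement step: add offsets in logit space, then squash back to (0, 1)."""
    return [
        (sigmoid(inverse_sigmoid(x) + dx), sigmoid(inverse_sigmoid(y) + dy))
        for (x, y), (dx, dy) in zip(prev_coords, deltas)
    ]


# Normalized (x, y) predictions of the previous layer for two keypoints.
prev = [(0.50, 0.25), (0.90, 0.10)]
# Refinements predicted by the current layer (arbitrary illustrative values).
deltas = [(0.2, -0.1), (10.0, -10.0)]
refined = refine(prev, deltas)
print(refined)
```

Because the offsets are applied before the sigmoid, the refined coordinates always remain inside the normalized image range, even when a predicted refinement is large.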

3.4 Training Targets and Loss Functions

The loss functions of TFPose consist of two parts. The first part is the $L_1$ regression loss. Let $J^{gt} \in \mathbb{R}^{K \times 2}$ be the ground-truth coordinates. The regression loss is formulated as

$$L_{reg} = \sum_{d=1}^{N_D} \left\lVert J^{gt} - \hat{J}_d \right\rVert_1,$$

where $N_D$ is the number of decoder layers, and every decoder layer is supervised with the target keypoint coordinates. The second part is an auxiliary loss $L_{aux}$. Following DirectPose [44], we use auxiliary heatmap learning during training (the heatmap branch is removed in inference), which can result in better performance. In order to use the heatmap learning, we gather the feature vectors of one feature level from the encoder output and reshape these vectors into the original spatial shape. Similar to SimpleBaseline [51], we apply deconvolution to the reshaped feature map to upsample it and generate the heatmaps $\hat{H}$. Then, we compute the mean square error (MSE) loss between the predicted and ground-truth heatmaps. The ground-truth heatmaps $H^{gt}$ are generated by following [33, 52]. Formally, the auxiliary loss function is

$$L_{aux} = \mathrm{MSE}\left(\hat{H}, H^{gt}\right).$$

We sum the two loss functions to obtain the final overall loss

$$L = L_{reg} + \lambda L_{aux},$$

where $\lambda$ is a constant used to balance the two losses.
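As a rough illustration of how the two parts combine, the sketch below sums a per-layer regression loss over all decoder layers and adds a weighted heatmap MSE term. It is a simplified sketch under assumptions: the regression term is taken as a summed absolute error, the reductions (sum versus mean) and the balancing weight `lam` are chosen for illustration, and all function names are hypothetical.

```python
def l1_loss(pred, target):
    """Sum of absolute coordinate errors over all keypoints."""
    return sum(abs(px - tx) + abs(py - ty) for (px, py), (tx, ty) in zip(pred, target))


def mse_loss(pred_heatmap, gt_heatmap):
    """Mean squared error between two heatmaps given as nested lists."""
    squared = [
        (p - g) ** 2
        for p_row, g_row in zip(pred_heatmap, gt_heatmap)
        for p, g in zip(p_row, g_row)
    ]
    return sum(squared) / len(squared)


def total_loss(layer_preds, target, pred_heatmap, gt_heatmap, lam):
    """Regression loss summed over every decoder layer, plus weighted auxiliary loss."""
    regression = sum(l1_loss(pred, target) for pred in layer_preds)
    return regression + lam * mse_loss(pred_heatmap, gt_heatmap)


target = [(0.5, 0.5)]
# Predictions of two decoder layers for a single keypoint (toy values).
layer_preds = [[(0.25, 0.5)], [(0.5, 0.5)]]
loss = total_loss(layer_preds, target, [[0.0, 1.0]], [[0.0, 0.0]], lam=2.0)
print(loss)
```

Supervising every decoder layer, rather than only the last one, means that the early layers also receive a direct training signal for their coarse predictions.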

4 Experiments

4.1 Implementation details.

Datasets. We conduct a number of ablation experiments on two mainstream pose estimation datasets.

Our experiments are mainly conducted on the COCO2017 Keypoint Detection [50] benchmark, which contains person instances annotated with 17 keypoints. Following common settings, we use the same person detector as in Simple Baseline [52] for COCO evaluation. We report results on the val set for ablation studies and compare with other state-of-the-art methods on the test-dev set. The Average Precision (AP) based on Object Keypoint Similarity (OKS) is employed as the evaluation metric.
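For readers unfamiliar with the metric, the following sketch computes OKS for one person using the commonly cited COCO definition: each labeled keypoint contributes exp(-d² / (2 s² k²)), where d is the distance between prediction and ground truth, s² is the object area and k is a per-keypoint constant, and the contributions are averaged over labeled keypoints. The function name and the toy values are assumptions; the official evaluation code should be used for reported numbers.

```python
import math


def oks(pred, gt, visible, area, kappas):
    """Object Keypoint Similarity for one person.

    pred, gt: lists of (x, y); visible: visibility flags (> 0 means labeled);
    area: object segment area (s^2); kappas: per-keypoint constants k_i.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), flag, kappa in zip(pred, gt, visible, kappas):
        if flag > 0:
            dist_sq = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-dist_sq / (2.0 * area * kappa ** 2))
            count += 1
    return total / count if count else 0.0


gt = [(10.0, 10.0), (20.0, 20.0)]
kappas = [0.1, 0.1]  # illustrative constants, not the official COCO values

perfect = oks(gt, gt, [2, 2], area=400.0, kappas=kappas)
shifted = oks([(12.0, 10.0), (20.0, 20.0)], gt, [2, 2], area=400.0, kappas=kappas)
# An unlabeled keypoint is ignored, no matter how far off the prediction is.
ignored = oks([(10.0, 10.0), (99.0, 99.0)], gt, [2, 0], area=400.0, kappas=kappas)
print(perfect, shifted, ignored)
```

AP is then obtained by thresholding OKS at several values, in the same way that box AP thresholds IoU.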

Besides COCO dataset, we also report results on MPII dataset [1]. MPII is a popular benchmark for single person 2D pose estimation, which has images. In total, there are annotated poses for training, and another poses for testing. The Percentage of Correct Keypoints (PCK) metric is used for evaluation.
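The PCK metric can be sketched in the same spirit. In the sketch below a keypoint counts as correct when its distance to the ground truth is at most a fraction `alpha` of a reference length; for the PCKh variant reported on MPII, the reference length is the head segment size and `alpha = 0.5` is the usual setting. The function name and the toy values are assumptions for illustration.

```python
import math


def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of keypoints whose error is within alpha * ref_length."""
    threshold = alpha * ref_length
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


gt = [(0.0, 0.0), (0.0, 0.0)]
pred = [(3.0, 4.0), (10.0, 0.0)]  # errors of 5 and 10 pixels
score = pck(pred, gt, ref_length=10.0, alpha=0.5)
print(score)
```

With a reference length of 10 pixels the threshold is 5 pixels, so only the first keypoint is counted as correct.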

Model settings. Unless specified, ResNet-18 [18] is used as the backbone in the ablation study. Two input image sizes are used. Weights pre-trained on ImageNet [13] are used to initialize the ResNet backbone. The remaining parts of our network are initialized with random parameters. For the Transformer, we adopt the Deformable Attention Module proposed in [54] and use the same hyper-parameters.

Training. All the models are optimized by AdamW[30] with a base learning rate of . and are set to and . Weight decay is set to . is set to by default for balancing the regression loss and auxiliary loss. Unless specified, all the experiments use a cosine learning schedule with base learning rate . Learning rate of the Transformers and the linear projections for predicting keypoints offsets is decreased by a factor of . For data augmentation, random rotation (), random scale (), flipping and half body data augmentation[50] are applied. For auxiliary loss, we follow Unbiased Data Processing (UDP) [21] to generate unbiased ground truth.
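A cosine learning-rate schedule of the kind mentioned above can be written as a small function. The sketch below is a generic formulation rather than the exact schedule used in the experiments: the base learning rate of 1e-3, the number of steps and the floor `min_lr` are hypothetical values chosen only to make the example runnable.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings for illustration only.
base_lr, total_steps = 1e-3, 100
schedule = [cosine_lr(step, total_steps, base_lr) for step in range(total_steps + 1)]
print(schedule[0], schedule[50], schedule[100])
```

A group-specific learning rate, such as the reduced rate for the Transformer mentioned in the text, can be obtained by multiplying the scheduled value by a constant factor for the corresponding parameter group.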

4.2 Ablation Study

Figure 5: Convergence curves of TFPose supervised by different kinds of losses on the COCO val2017 set. ”wo aux loss” indicates that only the regression loss is employed. ”with aux loss” indicates that both the regression loss and the auxiliary loss are employed.

Query-to-query attention. In the proposed TFPose, query-to-query attention is designed to capture structure-aware information across all the keypoints. Unlike [41], which uses a hand-designed method to explicitly force the model to learn the structure-aware information, query-to-query attention models the human body structure implicitly. To study the effect of query-to-query attention, we report the results of removing the query-to-query attention in all decoder layers. As shown in Table 1, the proposed query-to-query attention improves the performance by 1.3% AP with only 0.1 GFLOPs of additional computational cost.

| q2q | GFLOPs | AP | AP$_{50}$ | AP$_{75}$ | AP$_{M}$ | AP$_{L}$ |
|---|---|---|---|---|---|---|
| no | 3.51 | 63.2 | 85.1 | 69.9 | 60.3 | 70.2 |
| yes | 3.61 | 64.5 | 85.2 | 71.2 | 61.5 | 71.5 |

Table 1: The effect of query-to-query attention in decoder layers on the COCO val2017 set. ”q2q” indicates whether query-to-query attention is added in the decoder. In this experiment, the numbers of transformer encoder layers and transformer decoder layers are fixed. As shown in the table, the decoder with query-to-query attention performs better.

Configurations of Transformer decoder. Here we study the effect of the width and depth of the decoder. Specifically, we conduct experiments by varying the number of channels of the input features and the number of decoder layers in the Transformer decoder.

As shown in Table 2, the Transformer with 256-channel feature maps is 1.3% AP higher than the one with 128 channels. Moreover, we change the number of decoder layers. As shown in Table 3, the performance grows over the first three layers and saturates at the fourth decoder layer.

| Channels | GFLOPs | AP | AP$_{50}$ | AP$_{75}$ | AP$_{M}$ | AP$_{L}$ |
|---|---|---|---|---|---|---|
| 128 | 2.28 | 63.2 | 85.0 | 69.8 | 60.6 | 69.8 |
| 256 | 3.61 | 64.5 | 85.2 | 71.2 | 61.5 | 71.5 |

Table 2: The effect of the number of channels of the input features to the Transformer encoder on the COCO val2017 set. The first column gives the number of channels. In this experiment, the numbers of transformer encoder layers and transformer decoder layers are fixed.

| $N_D$ | GFLOPs | AP | AP$_{50}$ | AP$_{75}$ | AP$_{M}$ | AP$_{L}$ |
|---|---|---|---|---|---|---|
| 1 | 6.32 | 65.7 | 86.3 | 73.4 | 63.0 | 72.4 |
| 2 | 6.41 | 66.9 | 86.5 | 74.0 | 64.2 | 73.8 |
| 3 | 6.50 | 67.1 | 86.6 | 74.2 | 64.5 | 73.9 |
| 4 | 6.59 | 67.2 | 86.6 | 74.2 | 64.6 | 74.0 |
| 5 | 6.68 | 67.2 | 86.6 | 74.2 | 64.6 | 74.1 |
| 6 | 6.77 | 67.2 | 86.6 | 74.2 | 64.6 | 74.1 |

Table 3: Ablation study of different numbers of decoder layers on the COCO val2017 set. $N_D$ is the number of decoder layers used for refining the locations of the keypoints. ”GFLOPs” indicates the computational cost. In this experiment, the number of transformer encoder layers is fixed.
Figure 6: Visualization of the attention weights of the q2q attention. We average the attention maps over the whole COCO 2017val dataset. The left map is the attention weights of the second decoder layer. The right map is the attention weights of the third decoder layer. ‘L’ means the joints are in the left. ‘R’ means the joints are in the right. The horizontal axis and the vertical axis represent the input query and key of the attention module, respectively. Multi-head attention computes the attention weights between each pair of the queries and keys. The query attends more to the key with a higher attention weight.

Auxiliary loss. As shown in previous works [15, 54], transformer modules may converge slowly. To mitigate this issue, we adopt the deformable attention module proposed in [54]. Apart from that, we propose an auxiliary loss to accelerate the convergence of TFPose. Here, we investigate the effect of the auxiliary loss. In this experiment, the first model is supervised only by the regression loss; the second model is supervised by both the regression loss and the auxiliary loss. As shown in Figure 5 and Table 4, the auxiliary loss significantly accelerates the convergence of TFPose and boosts the performance by a large margin (2.3% AP, from 67.2% to 69.5%).

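| aux | GFLOPs | AP | AP$_{50}$ | AP$_{75}$ | AP$_{M}$ | AP$_{L}$ |
|---|---|---|---|---|---|---|
| no | 6.76 | 67.2 | 86.6 | 74.2 | 64.6 | 74.1 |
| yes | 6.76 | 69.5 | 87.5 | 76.5 | 66.1 | 77.0 |

Table 4: Ablation study of the effectiveness of the auxiliary loss on the COCO val2017 set. ”aux” indicates whether the auxiliary loss is used. In this experiment, the numbers of transformer encoder layers and transformer decoder layers are fixed.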

4.3 Discussions on TFPose

Visualization of sampling keypoints. To study how the Deformable Attention Module locates the body joints, we visualize the sampling locations of the module on the feature maps. In the Deformable Attention Module, there are 8 attention heads and every head samples 4 points on every feature map. So for each feature map, there are 32 sampling points. As shown in Figure 7, the sampling points (red dots) are all densely located near the ground truth (yellow circle). This visualization shows that TFPose can address the feature misalignment issue to some extent, and that it supervises the CNN with dense pixel information.

Figure 7: Visualization of the sampling points on the feature map. There are 17 queries for 17 keypoints. We visualize the queries of 12 body joints (not including facial joints). Each image corresponds to one body joint. Red dots represent the sampling points; the yellow circle represents the ground truth.

Visualization of query-to-query attention. To further study how the query-to-query self-attention module works, we visualize the attention weights of the query-to-query self-attention. As shown in Figure 6, there are two obvious patterns of attention: the first attention pattern is that the symmetric joints (e.g. left shoulder and right shoulder) are more likely to attend to each other, and the second attention pattern is that the adjacent joints (e.g. eyes, nose, and mouth) are more likely to attend to each other.

To have a better understanding of this attention pattern, we also visualize the attention graph between each keypoint according to the attention maps in the supplementary. This attention pattern suggests that TFPose can employ the context and structured relationship between the body joints to locate and classify the types of body joints.

4.4 Comparison with State-of-the-art Methods

| Models | Backbone | Input Size | GFLOPs | AP (OKS) |
|---|---|---|---|---|
| DeepPose [46] | ResNet-101 | | 7.6 | 56.0 |
| DeepPose [46] | ResNet-152 | | 11.3 | 58.3 |
| 8-stage Hourglass [33] | - | | 19.5 | 66.9 |
| 8-stage Hourglass [33] | - | | 25.9 | 67.1 |
| CPN [9] | ResNet-50 | | 6.2 | 68.6 (69.4) |
| CPN [9] | ResNet-50 | | 13.9 | 70.6 (71.6) |
| SimpleBaseline [52] | ResNet-50 | | 8.9 | 70.4 |
| SimpleBaseline [52] | ResNet-50 | | 20.0 | 72.2 |
| Ours ($N_E = 4$) | ResNet-50 | | 7.7 | 70.5 |
| Ours ($N_E = 6$) | ResNet-50 | | 9.2 | 71.0 |
| Ours () | ResNet-50 | | 20.4 | 72.4 |

Table 5: Comparisons with previous works on the COCO val split. For CPN, the results in brackets are obtained with online hard keypoint mining. All the reported methods use person detectors with similar performance. Specifically, Hourglass and CPN use a person detector with 55.3% AP on COCO. The others use a person detector with 56.4% AP. DeepPose is implemented with mmpose [12]. Flipping test is applied for all models. $N_E$ represents the number of encoder layers.

In this section, we compare TFPose with previous state-of-the-art 2D pose estimation methods on the COCO val split, the COCO test-dev split and MPII [1]. We compare these methods in terms of both accuracy and computational cost. The results of our proposed TFPose and other state-of-the-art methods are listed in Table 5, Table 6 and Table 7.

| Method | Backbone | Input Size | GFLOPs | AP | AP$_{50}$ | AP$_{75}$ | AP$_{M}$ | AP$_{L}$ |
|---|---|---|---|---|---|---|---|---|
| Heatmap-based methods | | | | | | | | |
| AE [32] | HourGlass [33] | - | - | 56.6 | 81.8 | 61.8 | 49.8 | 67.0 |
| Mask R-CNN [17] | ResNet-50 | - | - | 62.7 | 87.0 | 68.4 | 57.4 | 71.1 |
| CMU-Pose [5] | VGG-19 [39] | - | - | 64.2 | 86.2 | 70.1 | 61.0 | 68.8 |
| G-RMI [36] | ResNet-101 | - | - | 64.9 | 85.5 | 71.3 | 62.3 | 70.0 |
| HigherHRNet [10] | HRNet-W48 | - | - | 70.5 | 89.3 | 77.2 | 66.6 | 75.8 |
| CPN [9] | ResNet-Ince. | | 29.2 | 72.1 | 91.4 | 80.0 | 68.7 | 77.2 |
| SimpleBaseline [51] | ResNet-50 | | 8.9 | 70.0 | 90.9 | 77.9 | 66.8 | 75.8 |
| SimpleBaseline [51] | ResNet-152 | | 35.6 | 73.7 | 91.9 | 81.1 | 70.3 | 80.0 |
| HRNet [40] | HRNet-W32 | | 16.0 | 74.9 | 92.5 | 82.8 | 71.3 | 80.9 |
| Regression-based methods | | | | | | | | |
| DeepPose [46] | ResNet-101 | | 7.69 | 57.4 | 86.5 | 64.2 | 55.0 | 62.8 |
| DeepPose [46] | ResNet-152 | | 11.34 | 59.3 | 87.6 | 66.7 | 56.8 | 64.9 |
| Directpose [44] | ResNet-50 | - | - | 62.2 | 86.4 | 68.2 | 56.7 | 69.8 |
| Directpose [44] | ResNet-101 | - | - | 63.3 | 86.7 | 69.4 | 57.8 | 71.2 |
| SPM [35] | HourGlass [33] | - | - | 66.9 | 88.5 | 72.9 | 62.6 | 73.1 |
| Int. Reg. [42] | ResNet-101 | | 11.0 | 67.8 | 88.2 | 74.8 | 63.9 | 74.0 |
| Ours ($N_E = 4$) | ResNet-50 | | 7.7 | 70.5 | 90.4 | 78.7 | 67.6 | 76.8 |
| Ours ($N_E = 6$) | ResNet-50 | | 9.2 | 70.9 | 90.5 | 79.0 | 68.1 | 77.0 |
| Ours () | ResNet-50 | | 20.4 | 72.2 | 90.9 | 80.1 | 69.1 | 78.8 |

Table 6: Comparisons with state-of-the-art methods on the COCO test-dev set. and denote flipping and multi-scale testing, respectively. Input size and GFLOPs are shown for the single-person pose estimation methods. ’ResNet-Ince.’ represents ResNet-Inception. SimpleBaseline (ResNet-50) is tested with the official code. $N_E$ represents the number of encoder layers.
| Method | Head | Sho. | Elb. | Wri. | Hip | Knee | Ank. | Total |
|---|---|---|---|---|---|---|---|---|
| Heatmap-based methods | | | | | | | | |
| Pishchulin et al. [37] | 74.3 | 49.0 | 40.8 | 34.1 | 36.5 | 34.4 | 35.2 | 44.1 |
| Tompson et al. [45] | 95.8 | 90.3 | 80.5 | 74.3 | 77.6 | 69.7 | 62.8 | 79.6 |
| Hu et al. [19] | 95.0 | 91.6 | 83.0 | 76.6 | 81.9 | 74.5 | 69.5 | 82.4 |
| Lifshitz et al. [27] | 97.8 | 93.3 | 85.7 | 80.4 | 85.3 | 76.6 | 70.2 | 85.0 |
| Raf et al. [38] | 97.2 | 93.9 | 86.4 | 81.3 | 86.8 | 80.6 | 73.4 | 86.3 |
| Bulat et al. [3] | 97.9 | 95.1 | 89.9 | 85.3 | 89.4 | 85.7 | 81.7 | 89.7 |
| Chu et al. [11] | 98.5 | 96.3 | 91.9 | 88.1 | 90.6 | 88.0 | 85.0 | 91.5 |
| Ke et al. [24] | 98.5 | 96.8 | 92.7 | 88.4 | 90.6 | 89.3 | 86.3 | 92.1 |
| Tang et al. [43] | 98.4 | 96.9 | 92.6 | 88.7 | 91.8 | 89.4 | 86.2 | 92.3 |
| Zhang et al. [53] | 98.6 | 97.0 | 92.8 | 88.8 | 91.7 | 89.8 | 86.6 | 92.5 |
| Regression-based methods | | | | | | | | |
| Carreira et al. [7] | 95.7 | 91.7 | 81.7 | 72.4 | 82.8 | 73.2 | 66.4 | 81.3 |
| Sun et al. [41] | 97.5 | 94.3 | 87.0 | 81.2 | 86.5 | 78.5 | 75.4 | 86.4 |
| Aiden et al. (ResNet-50) [34] | 97.8 | 96.0 | 90.0 | 84.3 | 89.8 | 85.2 | 79.7 | 89.5 |
| Ours (ResNet-50) | 98.0 | 95.9 | 91.0 | 86.0 | 89.8 | 86.6 | 82.6 | 90.4 |

Table 7: MPII human pose test set PCKh accuracies. For our model, the number of encoder layers is set to 6.

Results on COCO val. set. As shown in Table 5, with similar computational cost, TFPose with 4 encoder layers and ResNet-50 surpasses the previous regression-based method DeepPose with ResNet-101 (70.5% AP vs. 56.0% AP) by a large margin, and even performs much better than DeepPose with ResNet-152 (70.5% AP vs. 58.3% AP). Besides, TFPose also outperforms many heatmap-based methods, for example, 8-stage Hourglass [33] (70.5% AP vs. 67.1% AP) and CPN [9] (70.5% AP vs. 69.4% AP). It is also important to note that TFPose with 4 encoder layers and ResNet-50 outperforms the strong baseline SimpleBaseline [52] with ResNet-50 (70.5% AP vs. 70.4% AP) with lower computational cost (7.68 GFLOPs vs. 8.9 GFLOPs).

Results on COCO test-dev set. As shown in Table 6, TFPose achieves the best result among regression-based methods. In particular, TFPose with 6 encoder layers and ResNet-50 achieves 70.9% AP, which is higher than Int. Reg. [42] (67.8% AP), and our computational cost is lower than that of Int. Reg. (9.15 GFLOPs vs. 11.0 GFLOPs). Moreover, with the same ResNet-50 backbone, TFPose even outperforms the strong heatmap-based method SimpleBaseline (70.5% vs. 70.0% AP) with lower computational cost (7.7 GFLOPs vs. 8.9 GFLOPs). Additionally, the results of TFPose are also close to the best reported pose estimation results. For example, the performance of TFPose (72.2% AP) is close to that of the ResNet-Inception based CPN (72.1% AP) and the ResNet-152 based SimpleBaseline (73.7% AP). Note that they use much larger backbones than ours.

Results on MPII test set. On the MPII benchmark, TFPose also achieves the best results among the regression-based methods. As shown in Table 7, TFPose with ResNet-50 is higher than the method proposed by Aiden et al.[34] (90.4% vs. 89.5%) with the same backbone. TFPose is also comparable to heatmap-based methods.

5 Conclusion

We have proposed a novel pose estimation framework named TFPose built upon Transformers, which largely improves the performance of the regression-based pose estimation and bypasses the drawbacks of heatmap-based methods such as the non-differentiable post-processing and quantization error. We have shown that with the attention mechanism, TFPose can naturally capture the structured relationship between the body joints, resulting in improved performance. Extensive experiments on the MS-COCO and MPII benchmarks show that TFPose can achieve state-of-the-art performance among regression-based methods and is comparable to the best heatmap-based methods.


The authors would like to thank Alibaba Group for the donation of GPU cloud computing resources.


Appendix A Qualitative Results of TFPose

We show more qualitative results in Figure 10. TFPose works reliably under various challenging cases.

Figure 8: The pattern of symmetric joints. As shown in the right graph, the left shoulder and the right shoulder are symmetric joints and they attend to each other. The same pattern can be found in other body joints, including the left elbow and right elbow, and the left hip and right hip.
Figure 9: The pattern of adjacent joints. As shown in the right graph, the left shoulder attends to its adjacent joints, including the right shoulder, left elbow and head. The same pattern can be found in other body joints, e.g., the elbow and wrist.
Figure 10: Qualitative results of TFPose with ResNet-50 on the COCO2017 val set (single-model and single-scale testing). The joints in the upper body are shown in green and the joints in the lower body are shown in blue.

Appendix B Visualization of Transformer Attentions

B.1 Query-to-query Attention

We observe two obvious query-to-query attention patterns in different decoder layers, termed the symmetric pattern and the adjacent pattern, respectively. Both patterns exist in all decoder layers; we illustrate them separately for convenience. For the symmetric pattern, Figure 8 demonstrates the correlation between all pairs of symmetric joints in the third decoder layer. For the adjacent pattern, Figure 9 explicitly shows that adjacent joints attend to each other in the second decoder layer.

B.2 Multi-scale Deformable Attention

We visualize the learned multi-scale deformable attention modules for better understanding. As shown in Figure 11 and Figure 12, the visualization indicates that TFPose looks at the context information surrounding the ground truth joint. More concretely, the sampling points near the ground truth joint have higher attention weights (denoted in red), while the sampling points far from the ground truth joint have lower attention weights (denoted in blue).

Figure 11: Visualization of right shoulder’s pixel-to-query attention in the last decoder layer. For readability, we draw the sampling points and attention weights from feature map in different pictures. Each sampling point is marked as a filled circle whose color indicates its corresponding weight. The ground truth joint is shown as yellow cross marker.
Figure 12: Visualization of right knee’s pixel-to-query attention in the last decoder layer. For readability, we draw the sampling points and attention weights from feature map in different pictures. Each sampling point is marked as a filled circle whose color indicates its corresponding weight. The ground truth joint is shown as yellow cross marker.