1 Introduction
Human pose estimation requires the computer to obtain the human keypoints of interest in an input image and plays an important role in many computer vision tasks such as human behavior understanding.
Existing mainstream methods for this task can be broadly categorized into heatmap-based methods (Figure 1, top) and regression-based methods (Figure 1, bottom). Heatmap-based methods usually first predict a heatmap or a classification score map with fully convolutional networks (FCNs), and then locate the body joints at the peaks of the heatmap or score map. Most pose estimation methods are heatmap-based because they achieve relatively higher accuracy. However, heatmap-based methods may suffer from the following issues. 1) A post-processing step (e.g., the “taking-maximum” operation) is needed. This post-processing might not be differentiable, so the framework is not end-to-end trainable. 2) The resolution of the heatmaps predicted by the FCNs is usually lower than the resolution of the input image. The reduced resolution results in a quantization error and limits the precision of keypoint localization. This quantization error can be reduced by shifting the output coordinates according to the values of the pixels near the peak, but doing so makes the framework much more complicated and introduces more hyper-parameters. 3) The ground-truth heatmaps need to be manually designed and heuristically tuned, which may introduce noise and ambiguity into the ground-truth maps, as shown in [31, 41, 21].
In contrast, regression-based methods usually map the input image directly to the coordinates of body joints with an FC (fully-connected) prediction layer, eliminating the need for heatmaps. The pipeline of regression-based methods is much more straightforward than that of heatmap-based methods since, in principle, pose estimation is a regression task, like object detection. Moreover, regression-based methods can bypass the aforementioned drawbacks of heatmap-based methods, and are thus more promising.
However, only a few research works focus on regression-based methods because they often perform worse than heatmap-based methods. There are several likely reasons. First, to reduce the number of parameters in the FC layer, DeepPose [46] applies global average pooling to reduce the feature map resolution before the FC layer. This global average pooling destroys the spatial structure of the convolutional feature maps and significantly degrades performance. Second, as shown in DirectPose [44] and SPM [35], the convolutional features and the predictions are misaligned in regression-based methods, which results in low localization precision for the keypoints. Third, regression-based methods only regress the coordinates of body joints and do not take into account the structured dependency between these keypoints [41].
Recently, we have witnessed the rise of vision transformers [54, 15, 6]. Transformers were originally designed for sequence-to-sequence tasks, which inspires us to formulate single-person pose estimation as the problem of predicting a K-length sequence of coordinates, where K is the number of body joints for one person. This leads to a simple and novel regression-based pose estimation framework, termed TFPose (i.e., Transformer-based Pose Estimation). As shown in Figure 2, the transformer takes the feature maps of a CNN as input and sequentially predicts the keypoint coordinates. TFPose can bypass the aforementioned difficulties of regression-based methods. First, it does not need global average pooling as in DeepPose [46]. Second, thanks to the multi-head attention mechanism, our method can avoid the misalignment between the convolutional features and the predictions. Third, since we predict the keypoints in a sequential way, the transformer can naturally capture the structured dependency between the keypoints, resulting in improved performance.
We summarize the main contributions as follows.

TFPose is the first transformer-based pose estimation framework. The proposed framework follows the simple and straightforward regression-based paradigm, is end-to-end trainable, and can overcome many drawbacks of heatmap-based methods.

Moreover, TFPose can naturally learn to exploit the structured dependency between the keypoints without heuristic designs such as those in [41]. This results in improved performance and better interpretability.

TFPose greatly advances the state of the art of regression-based methods, making them comparable to state-of-the-art heatmap-based ones. For example, we improve the previously best regression-based method of Sun et al. [42] by 4.4% AP on the COCO keypoint detection task, and that of Nibali et al. [34] by 0.9% PCK on the MPII benchmark.
2 Related Work
Transformers in computer vision. After being proposed in [47], Transformers have achieved significant progress in NLP (Natural Language Processing) [14, 2]. Recently, Transformers have also attracted much attention in the computer vision community. For the basic image classification task, ViT [15] applies a pure Transformer to sequences of image patches. Beyond image classification, vision Transformers have also been widely applied to object detection [6, 54], segmentation [48, 49], pose estimation [22, 23, 28], and low-level vision tasks [8]. For more details, we refer readers to [16]. In particular, DETR [6] and Deformable DETR [54] formulate object detection as predicting a set of boxes so that the detection model can be trained end-to-end; the applications of Transformers to both 3D hand pose estimation [22, 23] and 3D human pose estimation [22, 23] show that Transformers are suitable for modeling human pose.
Heatmap-based 2D pose estimation. Heatmap-based methods [9, 51, 40, 4, 26, 5, 10, 17, 33] achieve state-of-the-art accuracy in 2D human pose estimation. Most recent work, both top-down and bottom-up, is heatmap-based. [33] first proposes the hourglass-style framework, which has since been widely adopted, e.g., in [4, 26]. [40] proposes a novel network architecture for heatmap-based 2D pose estimation and achieves excellent performance. [10] proposes a new bottom-up method that achieves impressive performance on the CrowdPose dataset [25] and is further improved by [31]. [4] proposes an efficient network that achieves state-of-the-art performance on the COCO keypoint detection dataset [29]. However, [42, 44] argue that heatmap-based methods cannot be trained end-to-end due to the “taking-maximum” operation. More recently, [31, 41] identify noise and ambiguity in the ground-truth heatmaps. [21] finds that the heatmap data processing used by most previous work is biased and proposes a new unbiased data processing method.
Regression-based 2D pose estimation. 2D human pose estimation is naturally a regression problem [42]. However, regression-based methods are not as accurate as heatmap-based methods, so only a few works [46, 42, 41, 7, 44, 35] pursue them. In addition, although some methods, such as G-RMI [36], apply regression to reduce the quantization error caused by heatmaps, they are essentially heatmap-based. Several works point out reasons for the poor performance of regression-based methods. DirectPose [44] points out the feature misalignment issue and proposes a mechanism to align the features and the predictions; [41] indicates that regression-based methods cannot learn structure-aware information well and proposes a hand-designed model that forces them to learn it better; Sun et al. [42] propose integral regression, which shares the merits of both the heatmap representation and regression approaches, to avoid the non-differentiable post-processing and quantization error issues.
3 Our Approach
3.1 TFPose Architecture
This work focuses on the single-person pose estimation task. Following previous works, we first apply a person detector to obtain the bounding boxes of persons. Then, according to the detected boxes, each person is cropped from the input image. We denote the cropped image by $I \in \mathbb{R}^{h \times w \times 3}$, where $h$ and $w$ are the height and the width of the image, respectively. Given the cropped image of a single person, previous heatmap-based methods apply a convolutional neural network $\mathcal{F}$ to the patch to predict the keypoint heatmaps $H$ ($H_k$ for the $k$-th joint) of this person, where $K$ is the number of predicted keypoints. Formally, we have
$$H = \mathcal{F}(I). \tag{1}$$
Each pixel of $H$ represents the probability that the body joint is located at this pixel. To obtain the joints’ coordinates $J$ ($J_k$ for the $k$-th joint), those methods usually use the “taking-maximum” operation to obtain the locations with peak activations. Formally, let $p$ denote the spatial locations on $H$; the operation can be formulated as
$$J_k = \arg\max_{p} H_k(p). \tag{2}$$
Note that in heatmap-based methods, the localization precision of $p$ is bounded by the resolution of $H$, which is often much lower than the resolution of the input and thus causes quantization errors. Moreover, the $\arg\max$ operation here is not differentiable, making the pipeline not end-to-end trainable. In TFPose, we instead treat $J$ as a $K$-length sequence and directly map the input $I$ to the body joints’ coordinates $J$. Formally,
$$J = \mathcal{F}(I), \tag{3}$$
where $\mathcal{F}$ is composed of three main components: a standard CNN backbone to extract multi-level feature representations, a feature encoder to capture and fuse the multi-level features, and a coarse-to-fine decoder to generate a sequence of keypoint coordinates. It is illustrated in Figure 2. Note that TFPose is fully differentiable and its localization precision is not limited by the resolution of the feature maps.
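To make the quantization issue of the “taking-maximum” operation concrete, the following minimal Python sketch decodes a toy heatmap for a single joint. The heatmap values, the stride of 4, the ground-truth location and the name `decode_argmax` are all assumptions chosen for illustration; mapping a cell to its top-left corner is also a simplification and not part of any specific method.

```python
def decode_argmax(heatmap, stride):
    """Return (x, y) in input-image pixels for the peak of a 2D heatmap."""
    best_val, best_xy = float("-inf"), (0, 0)
    for row_idx, row in enumerate(heatmap):
        for col_idx, val in enumerate(row):
            if val > best_val:
                best_val, best_xy = val, (col_idx, row_idx)
    # Map the heatmap cell back to input-image coordinates (top-left corner).
    return (best_xy[0] * stride, best_xy[1] * stride)


# Toy 4x4 heatmap for one joint, predicted at 1/4 of the input resolution.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
true_xy = (6.0, 5.0)  # hypothetical ground-truth joint location in pixels
pred_xy = decode_argmax(heatmap, stride=4)
error = (abs(pred_xy[0] - true_xy[0]), abs(pred_xy[1] - true_xy[1]))
print(pred_xy, error)
```

Because the decoded location can only be a multiple of the stride, the prediction cannot reach the hypothetical ground truth exactly, whereas a regression head outputs continuous coordinates and has no such grid constraint.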
3.2 Transformer Encoder
As shown in Figure 3, the backbone extracts multi-level features of the input image. The four levels of feature maps have strides of 4, 8, 16 and 32, respectively. We separately apply a $1 \times 1$ convolution to each of these feature maps so that they all have the same number of output channels. The feature maps are then flattened and concatenated, which yields the input to the first encoder layer of the transformer: a sequence with one feature vector per feature-map pixel. Following [54, 47], positional embeddings are added to this input; the details of the positional embeddings are discussed below. Afterwards, both the input and its position-embedded version are sent to the transformer encoder to compute the memory. With the memory, a query matrix is used in the transformer decoder to obtain the coordinates $J$ of the $K$ body joints.
We follow Deformable DETR [54] to design the encoder in our transformer. As mentioned above, each feature vector of the encoder input is added with positional embeddings before it is fed to the encoder. Following Deformable DETR, we use both a level embedding and a pixel position embedding. The former encodes the level that the feature vector comes from, and the latter encodes the feature vector’s spatial location on the feature maps. As shown in Figure 3, all the feature vectors from a given level are added with that level’s embedding, and then each feature vector is added with its pixel position embedding, which is the 2D cosine positional embedding corresponding to the 2D location of the feature vector on the feature maps.
TFPose stacks several encoder layers. As shown in Figure 4, each encoder layer takes the outputs of the previous encoder layer as its input. Following Deformable DETR, we also compute the pixel-to-pixel attention between the output vectors of each encoder layer (denoted by ‘p2p attention’). After all the transformer encoder layers are applied, we obtain the memory.
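The flattening of multi-level feature maps and the two positional terms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 64×64 input, the 8-channel width, the sine/cosine formulation with base 10000, the zero-valued level embeddings (learnable in practice), and all function names are hypothetical choices, not the authors' implementation.

```python
import math

STRIDES = (4, 8, 16, 32)  # strides of the four feature levels


def build_tokens(height, width, strides=STRIDES):
    """Flatten every level and concatenate: one (level, row, col) per feature vector."""
    tokens = []
    for level, s in enumerate(strides):
        for row in range(height // s):
            for col in range(width // s):
                tokens.append((level, row, col))
    return tokens


def sincos(pos, dim):
    """Sine/cosine embedding of a scalar position into `dim` (even) channels."""
    out = []
    for i in range(dim // 2):
        freq = 10000.0 ** (2 * i / dim)
        out += [math.sin(pos / freq), math.cos(pos / freq)]
    return out


def positional_embedding(token, level_table, channels):
    """Level embedding plus 2D pixel-position embedding for one token."""
    level, row, col = token
    # Half of the channels encode the row, the other half the column.
    pixel = sincos(row, channels // 2) + sincos(col, channels // 2)
    return [l + p for l, p in zip(level_table[level], pixel)]


channels = 8
# Level embeddings are learnable in practice; zeros are placeholders here.
level_table = [[0.0] * channels for _ in STRIDES]
tokens = build_tokens(64, 64)
emb = positional_embedding(tokens[0], level_table, channels)
print(len(tokens), len(emb))
```

For a 64×64 input, the four levels contribute 256, 64, 16 and 4 feature vectors, so the encoder sequence in this toy setting has 340 entries; the finest level dominates the sequence length.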
3.3 Transformer Decoder
In the decoder, we aim to decode the desired keypoint coordinates from the memory. As mentioned above, we use a query matrix to achieve this. The query matrix is essentially an extra learnable matrix, which is jointly updated with the model parameters during training and each row of which corresponds to a keypoint. TFPose stacks several transformer decoder layers. As shown in Figure 4, each decoder layer takes as input the memory and the outputs of the previous decoder layer; the first layer takes the memory and the query matrix. As in the encoder, the output of the previous decoder layer is added with positional embeddings. Both versions, with and without positional embeddings, are sent to the query-to-query attention module (denoted as ‘q2q attention’), which aims to model the dependency between human body joints. The q2q attention module uses the version without positional embeddings as values and the version with positional embeddings as both queries and keys. The memory and the output of the q2q attention module are then used to compute the pixel-to-query attention (denoted as ‘p2q attention’), with the memory providing the values and the q2q output providing the queries. Then, an MLP is applied to the output of the p2q attention to produce the output of the decoder layer. The keypoint coordinates are predicted by applying an MLP with 2 output channels to each row of the decoder output.
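The query-to-query attention step can be illustrated with a single-head scaled dot-product attention over the keypoint queries. The sketch below is a simplification under stated assumptions: it omits the linear projections and multiple heads of a real Transformer layer, it assumes that queries and keys carry positional embeddings while values do not, and the toy numbers and function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns outputs and weights."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# K = 3 keypoint queries with 2 channels each (toy values).
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # output of the previous decoder layer
pos = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]   # positional embeddings (learned in practice)
with_pos = add(prev, pos)

# Queries and keys use the position-embedded rows; values use the plain rows.
out, weights = attention(with_pos, with_pos, prev)
print(weights)
```

Each row of `weights` is a distribution over the K keypoints, which is the quantity one would inspect to see which joints attend to each other.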
Instead of simply predicting the keypoint coordinates in the final decoder layer, inspired by [7, 20, 54], we require all the decoder layers to predict the keypoint coordinates. Specifically, we let the first decoder layer directly predict the target coordinates. Then, every other decoder layer refines the predictions of its previous decoder layer by predicting refinements $\Delta \hat{y}_d$. In this way, the keypoint coordinates are progressively refined. Formally, let $\hat{y}_{d-1}$ be the keypoint coordinates predicted by the $(d-1)$-th decoder layer; the predictions of the $d$-th decoder layer are
$$\hat{y}_d = \sigma\left(\sigma^{-1}(\hat{y}_{d-1}) + \Delta \hat{y}_d\right), \tag{4}$$
where $\sigma$ and $\sigma^{-1}$ denote the sigmoid and inverse sigmoid functions, respectively.
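The coarse-to-fine refinement described above (apply the inverse sigmoid to the previous prediction, add the predicted refinement, and apply the sigmoid again) can be sketched as follows. The starting coordinates, the refinement values and the clamping constant `eps` are hypothetical; in the real model the refinements come from the decoder layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def refine(prev_coords, deltas):
    """One refinement step applied to normalized (x, y) keypoint coordinates."""
    return [
        tuple(sigmoid(inverse_sigmoid(c) + d) for c, d in zip(coord, delta))
        for coord, delta in zip(prev_coords, deltas)
    ]


# Two keypoints with normalized coordinates in (0, 1).
coords = [(0.50, 0.50), (0.20, 0.80)]
# Hypothetical refinements predicted by two successive decoder layers.
layer_deltas = [
    [(0.4, -0.2), (0.0, 0.1)],
    [(0.1, 0.0), (-0.05, 0.0)],
]
for deltas in layer_deltas:
    coords = refine(coords, deltas)
print(coords)
```

Working in logit space keeps every refined coordinate inside the unit interval regardless of the size of the refinement, and a zero refinement leaves the previous prediction unchanged.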
The initial estimate $\hat{y}_0$ is a randomly initialized matrix that is jointly updated with the model parameters during training.
3.4 Training Targets and Loss Functions
The loss function of TFPose consists of two parts. The first part is a regression loss on the keypoint coordinates. Given the ground-truth coordinates, the regression loss sums, over all decoder layers, the error between the ground truth and the coordinates predicted by that layer, so that every decoder layer is supervised with the target keypoint coordinates. The second part is an auxiliary loss. Following DirectPose [44], we use auxiliary heatmap learning during training (the heatmap branch is removed at inference), which results in better performance. To enable heatmap learning, we gather the encoder feature vectors that come from a single feature level and reshape them into their original spatial shape. Similar to SimpleBaseline [51], we apply deconvolution to the reshaped feature map to upsample it and generate the heatmaps. Then, we compute the mean squared error (MSE) loss between the predicted and ground-truth heatmaps. The ground-truth heatmaps are generated following [33, 52]. The auxiliary loss is this MSE.
The final overall loss is a weighted sum of the two loss functions, where the weight is a constant used to balance the two losses.
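As a rough illustration of how the two loss terms combine, the sketch below sums a per-layer coordinate error over all decoder layers and adds a weighted heatmap MSE. Several details are assumptions made only for this example: the absolute-error form of the regression term, the choice to apply the balancing weight to the auxiliary term, the toy values, and all function names.

```python
def regression_loss(per_layer_preds, target):
    """Sum over decoder layers of the mean absolute coordinate error (assumed form)."""
    total = 0.0
    for preds in per_layer_preds:
        err = sum(
            abs(p - t)
            for pred_joint, target_joint in zip(preds, target)
            for p, t in zip(pred_joint, target_joint)
        )
        total += err / len(target)
    return total


def heatmap_mse(pred, gt):
    """Mean squared error between a predicted and a ground-truth heatmap."""
    flat_p = [v for row in pred for v in row]
    flat_g = [v for row in gt for v in row]
    return sum((p - g) ** 2 for p, g in zip(flat_p, flat_g)) / len(flat_p)


def total_loss(per_layer_preds, target, pred_heatmap, gt_heatmap, weight=1.0):
    # Assumption: the balancing constant scales the auxiliary term.
    return regression_loss(per_layer_preds, target) + weight * heatmap_mse(
        pred_heatmap, gt_heatmap
    )


target = [(0.5, 0.5)]                         # one ground-truth keypoint
per_layer_preds = [[(0.6, 0.4)], [(0.5, 0.5)]]  # predictions of two decoder layers
pred_heatmap = [[0.0, 0.5], [0.0, 0.0]]
gt_heatmap = [[0.0, 1.0], [0.0, 0.0]]
loss = total_loss(per_layer_preds, target, pred_heatmap, gt_heatmap, weight=2.0)
print(loss)
```

Supervising every decoder layer means that early, coarse predictions also receive a gradient signal, while the auxiliary heatmap term only affects training since the heatmap branch is dropped at inference.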
4 Experiments
4.1 Implementation details.
Datasets. We conduct a number of ablation experiments on two mainstream pose estimation datasets.
Our experiments are mainly conducted on the COCO2017 Keypoint Detection [50] benchmark, in which person instances are annotated with 17 keypoints. Following common settings, we use the same person detector as SimpleBaseline [52] for COCO evaluation. We report results on the val set for ablation studies and compare with other state-of-the-art methods on the test-dev set. Average Precision (AP) based on Object Keypoint Similarity (OKS) is employed as the evaluation metric.
Besides the COCO dataset, we also report results on the MPII dataset [1]. MPII is a popular benchmark for single-person 2D pose estimation, with annotated poses split into training and testing sets. The Percentage of Correct Keypoints (PCK) metric is used for evaluation.
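The two evaluation metrics mentioned above can be sketched as follows. This is a simplified illustration, not the official evaluation code: the per-keypoint constants `kappas`, the object `area`, the reference length and the threshold `alpha` are placeholder inputs, and the function names are hypothetical. On COCO, AP is then obtained by thresholding OKS at several levels and averaging precision.

```python
import math


def oks(pred, gt, visible, area, kappas):
    """Object Keypoint Similarity between one predicted and one ground-truth pose."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visible, kappas):
        if v <= 0:
            continue  # unlabeled keypoints are ignored
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        count += 1
    return total / count if count else 0.0


def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of keypoints whose error is within alpha * ref_length."""
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= alpha * ref_length
    )
    return hits / len(gt)


gt_pose = [(10.0, 10.0), (20.0, 30.0)]
pred_pose = [(11.0, 10.0), (40.0, 30.0)]
print(oks(pred_pose, gt_pose, visible=[2, 2], area=400.0, kappas=[0.1, 0.1]))
print(pck(pred_pose, gt_pose, ref_length=10.0))
```

OKS decays smoothly with the distance between prediction and ground truth, normalized by object scale, whereas PCK is a hard threshold on the distance relative to a reference length.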
Model settings. Unless specified, ResNet-18 [18] is used as the backbone in the ablation study. Two input image sizes are used. Weights pre-trained on ImageNet [13] are used to initialize the ResNet backbone. The remaining parts of our network are randomly initialized. For the Transformer, we adopt the Deformable Attention Module proposed in [54] with the same hyper-parameters.
Training. All the models are optimized by AdamW [30] with a fixed base learning rate and weight decay. A constant weight balances the regression loss and the auxiliary loss. Unless specified, all experiments use a cosine learning rate schedule. The learning rate of the Transformer and of the linear projections that predict keypoint offsets is reduced by a constant factor relative to the base learning rate. For data augmentation, random rotation, random scaling, flipping and half-body data augmentation [50] are applied. For the auxiliary loss, we follow Unbiased Data Processing (UDP) [21] to generate unbiased ground truth.
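A cosine learning rate schedule with a reduced rate for one parameter group can be sketched as below. All numbers in the example (base learning rate, number of steps, reduction factor) and the group names are hypothetical placeholders and are not the values used in the experiments.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def group_lrs(step, total_steps, base_lr, transformer_factor):
    """Learning rates per parameter group; the second group is scaled down."""
    lr = cosine_lr(step, total_steps, base_lr)
    return {"backbone": lr, "transformer": lr * transformer_factor}


# Hypothetical settings, for illustration only.
for step in (0, 50, 100):
    print(step, group_lrs(step, total_steps=100, base_lr=1e-3, transformer_factor=0.1))
```

The schedule starts at the base learning rate, passes through half of it at the midpoint, and decays to the minimum at the end of training, with the scaled group following the same shape at a lower level.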
4.2 Ablation Study
Query-to-query attention. In the proposed TFPose, query-to-query attention is designed to capture structure-aware information across all the keypoints. Unlike [41], which uses a hand-designed method to explicitly force the model to learn structure-aware information, query-to-query attention models the human body structure implicitly. To study the effect of query-to-query attention, we report the results of removing the query-to-query attention from all decoder layers. As shown in Table 1, the proposed query-to-query attention improves the performance by 1.3% AP with only 0.1 GFLOPs of additional computational cost.
| q2q | GFLOPs | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|---|
|  | 3.51 | 63.2 | 85.1 | 69.9 | 60.3 | 70.2 |
| ✓ | 3.61 | 64.5 | 85.2 | 71.2 | 61.5 | 71.5 |
Configurations of Transformer decoder. Here we study the effect of the width and depth of the decoder. Specifically, we conduct experiments by varying the number of channels of the input features and the number of decoder layers in the Transformer decoder.
As shown in Table 2, the Transformer with 256-channel feature maps is 1.3% AP higher than the one with 128-channel feature maps. Moreover, we vary the number of decoder layers. As shown in Table 3, the performance improves over the first three layers and saturates at the fourth decoder layer.
| C | GFLOPs | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|---|
| 128 | 2.28 | 63.2 | 85.0 | 69.8 | 60.6 | 69.8 |
| 256 | 3.61 | 64.5 | 85.2 | 71.2 | 61.5 | 71.5 |

| Decoder layers | GFLOPs | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|---|
| 1 | 6.32 | 65.7 | 86.3 | 73.4 | 63.0 | 72.4 |
| 2 | 6.41 | 66.9 | 86.5 | 74.0 | 64.2 | 73.8 |
| 3 | 6.50 | 67.1 | 86.6 | 74.2 | 64.5 | 73.9 |
| 4 | 6.59 | 67.2 | 86.6 | 74.2 | 64.6 | 74.0 |
| 5 | 6.68 | 67.2 | 86.6 | 74.2 | 64.6 | 74.1 |
| 6 | 6.77 | 67.2 | 86.6 | 74.2 | 64.6 | 74.1 |
Auxiliary loss. As shown in previous works [15, 54], transformer modules may converge slowly. To mitigate this issue, we adopt the deformable attention module proposed in [54]. In addition, we propose an auxiliary loss to accelerate the convergence of TFPose. Here, we investigate the effect of the auxiliary loss. In this experiment, the first model is supervised only by the regression loss; the second model is supervised by both the regression loss and the auxiliary loss. As shown in Figure 5 and Table 4, the auxiliary loss significantly accelerates the convergence of TFPose and boosts the performance by a large margin (2.3% AP, from 67.2% to 69.5%).
| aux | GFLOPs | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|---|
|  | 6.76 | 67.2 | 86.6 | 74.2 | 64.6 | 74.1 |
| ✓ | 6.76 | 69.5 | 87.5 | 76.5 | 66.1 | 77.0 |
4.3 Discussions on TFPose
Visualization of sampling keypoints. To study how the Deformable Attention Module locates the body joints, we visualize the sampling locations of the module on the feature maps. In the Deformable Attention Module, there are 8 attention heads and every head samples 4 points on every feature map, so there are 32 sampling points on each feature map. As shown in Figure 7, the sampling points (red dots) are all densely located near the ground truth (yellow circle). This visualization shows that TFPose can, to some extent, address the feature misalignment issue, and that it supervises the CNN with dense pixel information.
Visualization of query-to-query attention. To further study how the query-to-query self-attention module works, we visualize its attention weights. As shown in Figure 6, there are two obvious attention patterns: first, symmetric joints (e.g., the left shoulder and the right shoulder) are more likely to attend to each other; second, adjacent joints (e.g., eyes, nose, and mouth) are more likely to attend to each other.
To better understand these attention patterns, we also visualize the attention graph between the keypoints according to the attention maps in the supplementary material. These patterns suggest that TFPose can exploit the context and the structured relationship between the body joints to locate the body joints and classify their types.
4.4 Comparison with State-of-the-art Methods
| Models | Backbone | Input Size | GFLOPs | AP (OKS) |
|---|---|---|---|---|
| DeepPose [46] | ResNet-101 |  | 7.6 | 56.0 |
| DeepPose [46] | ResNet-152 |  | 11.3 | 58.3 |
| 8-stage Hourglass [33] |  |  | 19.5 | 66.9 |
| 8-stage Hourglass [33] |  |  | 25.9 | 67.1 |
| CPN [9] | ResNet-50 |  | 6.2 | 68.6 (69.4) |
| CPN [9] | ResNet-50 |  | 13.9 | 70.6 (71.6) |
| SimpleBaseline [52] | ResNet-50 |  | 8.9 | 70.4 |
| SimpleBaseline [52] | ResNet-50 |  | 20.0 | 72.2 |
| Ours (4 encoder layers) | ResNet-50 |  | 7.7 | 70.5 |
| Ours (6 encoder layers) | ResNet-50 |  | 9.2 | 71.0 |
| Ours | ResNet-50 |  | 20.4 | 72.4 |
In this section, we compare TFPose with previous state-of-the-art 2D pose estimation methods on the COCO val split, the COCO test-dev split and MPII [1]. We compare these methods in terms of both accuracy and computational cost. The results of TFPose and other state-of-the-art methods are listed in Table 5, Table 6 and Table 7.
| Method | Backbone | Input Size | GFLOPs | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|---|---|---|
| Heatmap-based methods |  |  |  |  |  |  |  |  |
| AE [32] | Hourglass [33] |  |  | 56.6 | 81.8 | 61.8 | 49.8 | 67.0 |
| Mask R-CNN [17] | ResNet-50 |  |  | 62.7 | 87.0 | 68.4 | 57.4 | 71.1 |
| CMU-Pose [5] | VGG-19 [39] |  |  | 64.2 | 86.2 | 70.1 | 61.0 | 68.8 |
| G-RMI [36] | ResNet-101 |  |  | 64.9 | 85.5 | 71.3 | 62.3 | 70.0 |
| HigherHRNet [10] | HRNet-W48 |  |  | 70.5 | 89.3 | 77.2 | 66.6 | 75.8 |
| CPN [9] | ResNet-Ince. |  | 29.2 | 72.1 | 91.4 | 80.0 | 68.7 | 77.2 |
| SimpleBaseline [51] | ResNet-50 |  | 8.9 | 70.0 | 90.9 | 77.9 | 66.8 | 75.8 |
| SimpleBaseline [51] | ResNet-152 |  | 35.6 | 73.7 | 91.9 | 81.1 | 70.3 | 80.0 |
| HRNet [40] | HRNet-W32 |  | 16.0 | 74.9 | 92.5 | 82.8 | 71.3 | 80.9 |
| Regression-based methods |  |  |  |  |  |  |  |  |
| DeepPose [46] | ResNet-101 |  | 7.69 | 57.4 | 86.5 | 64.2 | 55.0 | 62.8 |
| DeepPose [46] | ResNet-152 |  | 11.34 | 59.3 | 87.6 | 66.7 | 56.8 | 64.9 |
| DirectPose [44] | ResNet-50 |  |  | 62.2 | 86.4 | 68.2 | 56.7 | 69.8 |
| DirectPose [44] | ResNet-101 |  |  | 63.3 | 86.7 | 69.4 | 57.8 | 71.2 |
| SPM [35] | Hourglass [33] |  |  | 66.9 | 88.5 | 72.9 | 62.6 | 73.1 |
| Int. Reg. [42] | ResNet-101 |  | 11.0 | 67.8 | 88.2 | 74.8 | 63.9 | 74.0 |
| Ours (4 encoder layers) | ResNet-50 |  | 7.7 | 70.5 | 90.4 | 78.7 | 67.6 | 76.8 |
| Ours (6 encoder layers) | ResNet-50 |  | 9.2 | 70.9 | 90.5 | 79.0 | 68.1 | 77.0 |
| Ours | ResNet-50 |  | 20.4 | 72.2 | 90.9 | 80.1 | 69.1 | 78.8 |

| Method | Head | Sho. | Elb. | Wri. | Hip | Knee | Ank. | Total |
|---|---|---|---|---|---|---|---|---|
| Heatmap-based methods |  |  |  |  |  |  |  |  |
| Pishchulin et al. [37] | 74.3 | 49.0 | 40.8 | 34.1 | 36.5 | 34.4 | 35.2 | 44.1 |
| Tompson et al. [45] | 95.8 | 90.3 | 80.5 | 74.3 | 77.6 | 69.7 | 62.8 | 79.6 |
| Hu et al. [19] | 95.0 | 91.6 | 83.0 | 76.6 | 81.9 | 74.5 | 69.5 | 82.4 |
| Lifshitz et al. [27] | 97.8 | 93.3 | 85.7 | 80.4 | 85.3 | 76.6 | 70.2 | 85.0 |
| Rafi et al. [38] | 97.2 | 93.9 | 86.4 | 81.3 | 86.8 | 80.6 | 73.4 | 86.3 |
| Bulat et al. [3] | 97.9 | 95.1 | 89.9 | 85.3 | 89.4 | 85.7 | 81.7 | 89.7 |
| Chu et al. [11] | 98.5 | 96.3 | 91.9 | 88.1 | 90.6 | 88.0 | 85.0 | 91.5 |
| Ke et al. [24] | 98.5 | 96.8 | 92.7 | 88.4 | 90.6 | 89.3 | 86.3 | 92.1 |
| Tang et al. [43] | 98.4 | 96.9 | 92.6 | 88.7 | 91.8 | 89.4 | 86.2 | 92.3 |
| Zhang et al. [53] | 98.6 | 97.0 | 92.8 | 88.8 | 91.7 | 89.8 | 86.6 | 92.5 |
| Regression-based methods |  |  |  |  |  |  |  |  |
| Carreira et al. [7] | 95.7 | 91.7 | 81.7 | 72.4 | 82.8 | 73.2 | 66.4 | 81.3 |
| Sun et al. [41] | 97.5 | 94.3 | 87.0 | 81.2 | 86.5 | 78.5 | 75.4 | 86.4 |
| Nibali et al. (ResNet-50) [34] | 97.8 | 96.0 | 90.0 | 84.3 | 89.8 | 85.2 | 79.7 | 89.5 |
| Ours (ResNet-50) | 98.0 | 95.9 | 91.0 | 86.0 | 89.8 | 86.6 | 82.6 | 90.4 |
Results on the COCO val set. As shown in Table 5, with similar computational cost, TFPose with 4 encoder layers and ResNet-50 surpasses the previous regression-based method DeepPose with ResNet-101 by a large margin (70.5% AP vs. 56.0% AP) and even performs much better than DeepPose with ResNet-152 (70.5% AP vs. 58.3% AP). Besides, TFPose also outperforms many heatmap-based methods, for example, 8-stage Hourglass [33] (70.5% AP vs. 67.1% AP) and CPN [9] (70.5% AP vs. 69.4% AP). It is also important to note that TFPose with 4 encoder layers and ResNet-50 outperforms the strong baseline SimpleBaseline [52] with ResNet-50 (70.5% AP vs. 70.4% AP) at a lower computational cost (7.68 GFLOPs vs. 8.9 GFLOPs).
Results on the COCO test-dev set. As shown in Table 6, TFPose achieves the best results among regression-based methods. In particular, TFPose with 6 encoder layers and ResNet-50 achieves 70.9% AP, which is higher than Int. Reg. [42] (67.8% AP), at a lower computational cost (9.15 GFLOPs vs. 11.0 GFLOPs). Moreover, with the same ResNet-50 backbone, TFPose even achieves better performance than the strong heatmap-based method SimpleBaseline (70.5% AP vs. 70.0% AP) with lower computational cost (7.7 GFLOPs vs. 8.9 GFLOPs). Additionally, the results of TFPose are close to the best reported pose estimation results. For example, the performance of TFPose (72.2% AP) is close to that of the ResNet-Inception based CPN (72.1% AP) and the ResNet-152 based SimpleBaseline (73.7% AP). Note that they use much larger backbones than ours.
Results on the MPII test set. On the MPII benchmark, TFPose also achieves the best results among regression-based methods. As shown in Table 7, TFPose with ResNet-50 outperforms the method proposed by Nibali et al. [34] (90.4% vs. 89.5%) with the same backbone. TFPose is also comparable to heatmap-based methods.
5 Conclusion
We have proposed a novel pose estimation framework named TFPose built upon Transformers, which largely improves the performance of regression-based pose estimation and bypasses the drawbacks of heatmap-based methods such as the non-differentiable post-processing and the quantization error. We have shown that, with the attention mechanism, TFPose can naturally capture the structured relationship between the body joints, resulting in improved performance. Extensive experiments on the MS-COCO and MPII benchmarks show that TFPose achieves state-of-the-art performance among regression-based methods and is comparable to the best heatmap-based methods.
Acknowledgements
The authors would like to thank Alibaba Group for the donation of GPU cloud computing resources.
References
 [1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 3686–3693, 2014.
 [2] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. Adv. Neural Inform. Process. Syst., 2020.
 [3] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In Proc. Eur. Conf. Comput. Vis., pages 717–732, 2016.
 [4] Yuanhao Cai, Zhicheng Wang, Zhengxiong Luo, Binyi Yin, Angang Du, Haoqian Wang, Xiangyu Zhang, Xinyu Zhou, Erjin Zhou, and Jian Sun. Learning delicate local representations for multi-person pose estimation. In Proc. Eur. Conf. Comput. Vis., pages 455–472. Springer, 2020.
 [5] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell., 43(1):172–186, 2019.
 [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proc. Eur. Conf. Comput. Vis., pages 213–229. Springer, 2020.
 [7] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4733–4742, 2016.
 [8] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364, 2020.
 [9] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 7103–7112, 2018.
 [10] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S Huang, and Lei Zhang. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5386–5395, 2020.
 [11] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L Yuille, and Xiaogang Wang. Multi-context attention for human pose estimation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1831–1840, 2017.
 [12] MMPose Contributors. Openmmlab pose estimation toolbox and benchmark. https://github.com/openmmlab/mmpose, 2020.
 [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 248–255. IEEE, 2009.
 [14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
 [16] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on visual transformer. arXiv preprint arXiv:2012.12556, 2020.
 [17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proc. Int. Conf. Comput. Vis., pages 2961–2969, 2017.
 [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016.
 [19] Peiyun Hu and Deva Ramanan. Bottom-up and top-down reasoning with hierarchical rectified gaussians. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5600–5609, 2016.
 [20] Tao Hu, Honggang Qi, Jizheng Xu, and Qingming Huang. Facial landmarks detection by self-iterative regression based landmarks-attention network. volume 32, 2018.
 [21] Junjie Huang, Zheng Zhu, Feng Guo, and Guan Huang. The devil is in the details: Delving into unbiased data processing for human pose estimation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5700–5709, 2020.
 [22] Lin Huang, Jianchao Tan, Ji Liu, and Junsong Yuan. Hand-transformer: Non-autoregressive structured modeling for 3d hand pose estimation. In Proc. Eur. Conf. Comput. Vis., pages 17–33. Springer, 2020.
 [23] Lin Huang, Jianchao Tan, Jingjing Meng, Ji Liu, and Junsong Yuan. Hot-net: Non-autoregressive transformer for 3d hand-object pose estimation. In Proc. ACM Int. Conf. Multimedia, pages 3136–3145, 2020.
 [24] Lipeng Ke, Ming-Ching Chang, Honggang Qi, and Siwei Lyu. Multi-scale structure-aware network for human pose estimation. In Proc. Eur. Conf. Comput. Vis., September 2018.
 [25] Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 10863–10872, 2019.
 [26] Wenbo Li, Zhicheng Wang, Binyi Yin, Qixiang Peng, Yuming Du, Tianzi Xiao, Gang Yu, Hongtao Lu, Yichen Wei, and Jian Sun. Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148, 2019.
 [27] Ita Lifshitz, Ethan Fetaya, and Shimon Ullman. Human pose estimation using deep consensus voting. In Proc. Eur. Conf. Comput. Vis., pages 246–260, 2016.
 [28] Kevin Lin, Lijuan Wang, and Zicheng Liu. Endtoend human pose and mesh reconstruction with transformers. arXiv preprint arXiv:2012.09760, 2020.
 [29] TsungYi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Proc. Eur. Conf. Comput. Vis., pages 740–755. Springer, 2014.
 [30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. Int. Conf. Learn. Represent., 2019.
 [31] Zhengxiong Luo, Zhicheng Wang, Yan Huang, Tieniu Tan, and Erjin Zhou. Rethinking the heatmap regression for bottom-up human pose estimation. arXiv preprint arXiv:2012.15175, 2020.
 [32] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. arXiv preprint arXiv:1611.05424, 2016.
 [33] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Proc. Eur. Conf. Comput. Vis., pages 483–499. Springer, 2016.
 [34] Aiden Nibali, Zhen He, Stuart Morgan, and Luke Prendergast. Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372, 2018.
 [35] Xuecheng Nie, Jiashi Feng, Jianfeng Zhang, and Shuicheng Yan. Single-stage multi-person pose machines. In Proc. Int. Conf. Comput. Vis., pages 6951–6960, 2019.
 [36] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4903–4911, 2017.
 [37] Leonid Pishchulin, Mykhaylo Andriluka, Peter Gehler, and Bernt Schiele. Strong appearance and expressive spatial models for human pose estimation. In Proc. Int. Conf. Comput. Vis., pages 3487–3494, 2013.
 [38] Umer Rafi, Bastian Leibe, Juergen Gall, and Ilya Kostrikov. An efficient convolutional network for human pose estimation. In Proc. Brit. Mach. Vis. Conf., volume 1, page 2, 2016.
 [39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learn. Represent., 2015.
 [40] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5693–5703, 2019.
 [41] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In Proc. Int. Conf. Comput. Vis., pages 2602–2611, 2017.
 [42] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proc. Eur. Conf. Comput. Vis., pages 529–545, 2018.
 [43] Wei Tang, Pei Yu, and Ying Wu. Deeply learned compositional models for human pose estimation. In Proc. Eur. Conf. Comput. Vis., September 2018.
 [44] Zhi Tian, Hao Chen, and Chunhua Shen. Directpose: Direct end-to-end multi-person pose estimation. arXiv preprint arXiv:1911.07451, 2019.
 [45] Jonathan Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. arXiv preprint arXiv:1406.2984, 2014.

 [46] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1653–1660, 2014.
 [47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. Adv. Neural Inform. Process. Syst., pages 5998–6008, 2017.
 [48] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2021.
 [49] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2021.
 [50] Zhicheng Wang, Wenbo Li, Binyi Yin, Qixiang Peng, Tianzi Xiao, Yuming Du, Zeming Li, Xiangyu Zhang, Gang Yu, and Jian Sun. Mscoco keypoints challenge 2018. In Proc. Eur. Conf. Comput. Vis., volume 5, 2018.
 [51] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proc. Eur. Conf. Comput. Vis., pages 466–481, 2018.
 [52] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proc. Eur. Conf. Comput. Vis., September 2018.
 [53] Hong Zhang, Hao Ouyang, Shu Liu, Xiaojuan Qi, Xiaoyong Shen, Ruigang Yang, and Jiaya Jia. Human pose estimation with spatial contextual information. arXiv preprint arXiv:1901.01760, 2019.
 [54] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable Transformers for end-to-end object detection. In Proc. Int. Conf. Learn. Represent., 2021.
Appendix A Qualitative Results of TFPose
We show more qualitative results in Figure 10. TFPose works reliably under various challenging cases.
Appendix B Visualization of Transformer Attentions
B.1 Query-to-query Attention
We observe two clear query-to-query attention patterns in the decoder layers, which we term the symmetric pattern and the adjacent pattern. Both patterns exist in all decoder layers; we illustrate them in separate layers for convenience. For the symmetric pattern, Figure 8 shows the correlation between all pairs of symmetric joints in the third decoder layer. For the adjacent pattern, Figure 9 shows that adjacent joints attend to each other in the second decoder layer.
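One way to quantify these two patterns, rather than only inspecting them visually, is to measure how much attention each query places on its symmetric or adjacent partner. The following is a minimal sketch and not the procedure used to produce Figures 8 and 9. It assumes the 17-keypoint COCO ordering, a row-normalized query-to-query attention matrix given as a nested list, and illustrative pair lists; the names pair_attention_mass and toy_symmetric_attention are hypothetical.

```python
# Assumed COCO keypoint order: 0 nose, 1/2 eyes, 3/4 ears, 5/6 shoulders,
# 7/8 elbows, 9/10 wrists, 11/12 hips, 13/14 knees, 15/16 ankles (left/right).
NUM_QUERIES = 17
SYMMETRIC_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8),
                   (9, 10), (11, 12), (13, 14), (15, 16)]
# Illustrative subset of limb connections (shoulder-elbow, elbow-wrist, ...).
ADJACENT_PAIRS = [(5, 7), (7, 9), (6, 8), (8, 10),
                  (11, 13), (13, 15), (12, 14), (14, 16)]


def pair_attention_mass(attn, pairs):
    """Mean attention weight between paired queries, over both directions."""
    total = 0.0
    for i, j in pairs:
        total += attn[i][j] + attn[j][i]
    return total / (2 * len(pairs))


def toy_symmetric_attention(partner_weight=0.6):
    """Hypothetical attention matrix in which every query with a symmetric
    partner puts `partner_weight` on it and spreads the rest uniformly."""
    partner = {}
    for i, j in SYMMETRIC_PAIRS:
        partner[i] = j
        partner[j] = i
    attn = []
    for q in range(NUM_QUERIES):
        if q in partner:
            rest = (1.0 - partner_weight) / (NUM_QUERIES - 1)
            row = [rest] * NUM_QUERIES
            row[partner[q]] = partner_weight
        else:
            # The nose has no symmetric partner: uniform attention.
            row = [1.0 / NUM_QUERIES] * NUM_QUERIES
        attn.append(row)
    return attn


attn = toy_symmetric_attention()
sym = pair_attention_mass(attn, SYMMETRIC_PAIRS)
adj = pair_attention_mass(attn, ADJACENT_PAIRS)
print(f"symmetric mass: {sym:.3f}, adjacent mass: {adj:.3f}, "
      f"uniform baseline: {1.0 / NUM_QUERIES:.3f}")
```

A pair mass well above the uniform baseline 1/17 for a given layer would indicate that the corresponding pattern is present in that layer. With a real model, attn would be replaced by the attention weights extracted from a decoder layer, averaged over heads if needed.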
B.2 Multi-scale Deformable Attention
We visualize the learned multi-scale deformable attention modules for better understanding. As shown in Figure 11 and Figure 12, TFPose attends to the context information surrounding the ground-truth joint. More concretely, the sampling points near the ground-truth joint have higher attention weights (shown in red), while the sampling points far from the ground-truth joint have lower attention weights (shown in blue).
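The observation that nearby sampling points receive higher weights can also be checked numerically. A minimal sketch follows, assuming the sampling locations and their attention weights for one joint query have already been extracted as plain lists; the function name mean_distances and the toy values are hypothetical and are not taken from TFPose.

```python
import math


def mean_distances(points, weights, joint):
    """Return (unweighted, attention-weighted) mean distance from the
    sampling points to the ground-truth joint location."""
    dists = [math.hypot(x - joint[0], y - joint[1]) for x, y in points]
    unweighted = sum(dists) / len(dists)
    weighted = sum(w * d for w, d in zip(weights, dists)) / sum(weights)
    return unweighted, weighted


# Toy example: two sampling points close to the joint carry most of the weight.
joint = (10.0, 10.0)
points = [(10.0, 11.0), (11.0, 10.0), (10.0, 20.0), (25.0, 10.0)]
weights = [0.4, 0.4, 0.1, 0.1]

unweighted, weighted = mean_distances(points, weights, joint)
print(f"unweighted mean distance: {unweighted:.2f}")
print(f"attention-weighted mean distance: {weighted:.2f}")
```

If the attention-weighted mean distance is clearly smaller than the unweighted one, the attention mass is concentrated around the joint, which is the behavior described above.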