Localizing the anatomical 2D keypoints of all humans in an image is a fundamental task in computer vision, with the ability to enable progress in applications such as virtual reality, human-computer interaction, and human behavior analysis. It is also a common key component of algorithms for tasks such as action recognition [64, 13], multi-object tracking [22, 30], and generative models [8, 52].
Current methods typically follow one of two paradigms: bottom-up and top-down. Top-down approaches [10, 17, 20, 26, 34, 46, 63, 55, 56] divide the problem into two subtasks: (i) bounding box detection for all persons in the image, and (ii) joint localization for each person individually. Despite their success in some benchmarks [2, 1, 36], these two-step approaches lack efficiency due to their need for a separate object detector, and their performance tends to degrade severely under heavy occlusions. Bottom-up methods [6, 48, 27, 41, 44, 12, 29] follow a different approach: they first detect identity-agnostic keypoints, and then group them into separate poses. Their lack of reliance on external object detectors and their ability to operate jointly over the entire set of keypoints in the image have allowed them to outperform top-down approaches in benchmarks where occlusions are common. While recent work has significantly advanced the ability of bottom-up methods to accurately predict identity-free keypoints, current grouping algorithms still face significant drawbacks: since they generally rely on optimization algorithms, they cannot be trained end-to-end, and are often slow.
The keypoint grouping task can be formulated as a graph optimization problem in which nodes represent keypoints, and edge weights, which can be learned, represent their likelihood of belonging to the same human pose. Approaches ranging from integer linear programming [48, 27, 49] to heuristic greedy parsing [41, 44, 6] or graph clustering are then used to find the correct assignment. A common problem of bottom-up methods is that their learning objectives are poorly aligned with the real inference procedure: they learn affinities between keypoints but, at test time, grouping is performed by a separate algorithm which is not differentiable.
One-shot methods are an efficient alternative [43, 69, 62] to optimization-based bottom-up methods. Their general formulation consists in regressing a root node location per person, and then predicting offsets to keypoint locations. Since they are able to avoid the optimization-based grouping stage, they are significantly faster than their counterparts. However, given the inherent difficulty of predicting offsets under occlusions and scale variation, they are also significantly less accurate, and therefore have to rely on additional postprocessing techniques to obtain competitive performance [43, 69, 62].
We propose to tackle the limitations of current bottom-up grouping and one-shot algorithms with a novel framework based on attention. Instead of regressing offsets from a set of center nodes, our proposed CenterGroup uses attention to search for the best match between person centers and keypoints over the entire image. Our method retains the ability of bottom-up approaches to precisely predict keypoints from heatmaps, while maintaining the efficiency of one-shot methods. Furthermore, unlike standard bottom-up methods, CenterGroup does not require any test-time optimization and is end-to-end trainable.
More specifically, we first obtain proposals for person centers and identity-agnostic keypoints via heatmap regression. We then feed centers and keypoints to a transformer to encode contextual information into their updated embeddings. Finally, the embeddings are used in a simple keypoint grouping scheme which maximizes the attention scores between person centers and keypoints belonging to the same pose. At test time, we extract poses by assigning to centers those keypoints with the corresponding highest attention score. Due to the simplicity of our grouping algorithm and the parallel nature of attention computation, CenterGroup is 2.5x faster than the current state-of-the-art bottom-up method, while having better performance.
Overall, we make the following contributions:
We propose to tackle the pose estimation problem by grouping keypoints and person center predictions with a multi-head attention formulation that allows the model to be trained in an end-to-end fashion.
We use a transformer to encode dependencies between bottom-up detected keypoints and centers to obtain context-enhanced embeddings, efficiently boosting the performance of our proposed grouping scheme.
We achieve state-of-the-art results within an end-to-end framework that yields a speedup of up to 2.5x with respect to the state of the art.
2 Related Work
Top-down methods. Top-down methods [10, 17, 20, 26, 34, 46, 63, 55, 56, 60, 67, 23, 5, 42, 40, 54, 50, 57, 14] split the task into two steps. They first apply a person detector on the image and then run a single-person pose estimator on each detected image region, as given by the bounding boxes. While being particularly strong at handling scale variation, these methods struggle in cases of occlusion. To address these limitations, previous work has explored refining poses by exploiting the graph structure of the human skeleton with additional modules such as graph networks [60, 4, 50] or probabilistic graphical models. While being more robust, these methods still rely on external detectors, and therefore cannot recover from missing boxes.
Bottom-up methods. Bottom-up methods [6, 48, 27, 41, 44, 12, 29, 33] start by detecting identity-free keypoints over the entire image. In a second step, a grouping algorithm assembles poses using pairwise similarity scores between keypoints. To predict these similarity scores, DeepCut and PersonLab [48, 27, 44] predict offset fields that link joints belonging to the same person. OpenPose and PifPaf [6, 33] predict part affinity fields which resemble human limbs and encode the position and orientation between pairs of keypoints. Associative embeddings are a popular approach currently used by state-of-the-art methods. They predict an embedding for every detected keypoint from convolutional features, and then use their pairwise Euclidean distances as similarity scores. For all these methods, grouping is done by either graph partitioning [48, 27, 1, 49, 29, 28] or heuristic greedy parsing [41, 6, 44]. HGG makes progress towards learning to group keypoints by using a graph neural network on top of associative embeddings and training an edge and node classifier to hierarchically predict which keypoints belong together. While its graph network is trainable, it still relies on an external non-differentiable clustering algorithm for grouping. CenterGroup does not need this clustering step and, instead, uses attention as a form of learnable keypoint grouping.
One-shot methods. One-shot methods [43, 69, 62] avoid the grouping task by directly regressing keypoint locations from a set of predicted centers [43, 69] or anchors. Both SPM and CenterNet regress offsets at center locations to obtain each person’s joints. In addition, CenterNet predicts keypoint heatmaps, as standard bottom-up methods do. It then combines both predictions by heuristically matching offsets to their closest predicted joints. While being more efficient than grouping-based approaches, this heuristic still does not perform on par with them, and suffers from the same problem of not being end-to-end learnable.
Transformers and attention. Transformers were initially introduced for machine translation, and have recently become popular for computer vision tasks including image classification [16, 9], object detection [7, 70], semantic segmentation [59, 61], video processing [66, 68], image generation, and hand pose estimation [24, 25]. They employ self-attention layers to model relations between entities in a global context. Their use for human pose estimation is still relatively unexplored: existing work employs transformers for human pose tracking, applies them to estimate 3D human poses [38, 21], or uses a transformer-based architecture for explainable single-person pose estimation.
3 Background: Multi-Head Attention
Our model uses multi-head attention and a transformer as main tools to perform grouping. Therefore, we start by providing a brief review of these techniques.
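Multi-Head Attention (MHA), the core component of the transformer model, aims at obtaining contextual representations from an unordered set of vectors by letting each vector attend over multiple representation subspaces of a (possibly different) set of vectors. More precisely, given a set of $N_q$ $d$-dimensional query feature vectors, $Q = \{q_i\}_{i=1}^{N_q}$, and a set of $N_k$ pairs of key and value vectors, $K = \{k_j\}_{j=1}^{N_k}$ and $V = \{v_j\}_{j=1}^{N_k}$ (for notational simplicity, we assume queries, keys and values have the same dimensionality), MHA updates the query embeddings by linearly projecting the concatenation of $H$ attention heads:
$$\mathrm{MHA}(Q, K, V)_i = W^O \left[\mathrm{head}^1_i; \dots; \mathrm{head}^H_i\right], \tag{1}$$
where $W^O \in \mathbb{R}^{d \times H d_h}$ is a learnable matrix, and $d_h$ is the dimensionality of each attention head. Each attention head $h$ computes, at every index $i$:
$$\mathrm{head}^h_i = \sum_{j=1}^{N_k} \alpha^h_{ij}\, W^V_h v_j, \tag{2}$$
where the attention scores $\alpha^h_{ij}$ are computed as softmax-normalized dot-products between keys and queries (transformers use scaled dot-product attention, which additionally divides the dot products by $\sqrt{d_h}$, where $d_h$ is the dimensionality of the projected embeddings; we omit this term as we will not use it in the remainder of the paper):
$$\alpha^h_{ij} = \frac{\exp\left((W^Q_h q_i)^\top W^K_h k_j\right)}{\sum_{j'=1}^{N_k} \exp\left((W^Q_h q_i)^\top W^K_h k_{j'}\right)}, \tag{3}$$
where $W^Q_h, W^K_h, W^V_h \in \mathbb{R}^{d_h \times d}$ are learnable projection matrices.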
Whenever the sets of keys, queries and values are the same, i.e., $Q = K = V$, one refers to self-attention, which is the core component of the transformer encoder architecture. Overall, transformer encoders are formed by stacking blocks of an initial layer of self-attention with skip-connection and layer normalization [3], followed by a feed-forward network and a second instance of layer normalization. For completeness, we provide a more detailed explanation of their architecture in the supplementary material.
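To make the computation described above more concrete, the following is a minimal pure-Python sketch of multi-head attention without the scaling term. The helper names (`matvec`, `softmax`, `attention_head`, `multi_head_attention`) and the toy identity projections are assumptions chosen for illustration; they are not taken from the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values, W_q, W_k, W_v):
    """One head: softmax-normalized dot products weight the projected values."""
    proj_k = [matvec(W_k, k) for k in keys]
    proj_v = [matvec(W_v, v) for v in values]
    outputs = []
    for q in queries:
        pq = matvec(W_q, q)
        scores = [sum(a * b for a, b in zip(pq, pk)) for pk in proj_k]
        alpha = softmax(scores)
        dim = len(proj_v[0])
        outputs.append([sum(a * v[d] for a, v in zip(alpha, proj_v)) for d in range(dim)])
    return outputs

def multi_head_attention(queries, keys, values, heads, W_o):
    """Concatenate the outputs of all heads per query and project them with W_o."""
    per_head = [attention_head(queries, keys, values, *h) for h in heads]
    result = []
    for i in range(len(queries)):
        concat = [x for head_out in per_head for x in head_out[i]]
        result.append(matvec(W_o, concat))
    return result

# Toy example (hypothetical values): 2-D vectors, two heads with identity projections.
I2 = [[1.0, 0.0], [0.0, 1.0]]
heads = [(I2, I2, I2), (I2, I2, I2)]
W_o = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]  # averages the two heads
queries = [[10.0, 0.0]]
keys = values = [[1.0, 0.0], [0.0, 1.0]]
updated = multi_head_attention(queries, keys, values, heads, W_o)
```

In this toy setting the single query is strongly aligned with the first key, so its updated embedding is almost entirely the first value vector.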
4 Problem Formulation
We first provide an overview of the general formulation of our method and introduce notation.
4.1 Problem Statement
Given an input image, we aim to obtain the set of poses $\mathcal{P}$ corresponding to all persons in the image. Let $J$ be the number of joints being considered. Each pose can be uniquely determined by the 2D location and visibility of its joints. Formally, for every pose $p \in \mathcal{P}$, we refer to its joint locations as $x_p^t \in \mathbb{R}^2$ for every $t \in \{1, \dots, J\}$. We denote the visibility of each joint of pose $p$ as $\mathrm{vis}_p^t \in \{0, 1\}$, and assign it 1 whenever the joint is visible, and 0 otherwise.
Our approach operates over a set of predicted identity-agnostic joint keypoints in the image, which we refer to as $\mathcal{K}$. Each keypoint $k \in \mathcal{K}$ can be identified by its 2D location $x_k \in \mathbb{R}^2$ and its predicted type $t_k \in \{1, \dots, J\}$. Our method also predicts an additional set of targets corresponding to the center locations of persons in the image. We denote as $\mathcal{C}$ the set of detected person centers.
4.2 Grouping Keypoints and Centers
Standard bottom-up methods learn a similarity score for every pair of detected keypoints, and use these scores to form poses by clustering the keypoints that are most similar. One-shot methods, instead, directly use predicted person centers and regress displacement offsets from centers to joint locations to avoid expensive grouping.
Inspired by these approaches, we propose to perform human pose estimation by learning a similarity score $s^t(c, k)$ between every detected person center $c$ and keypoint $k$, for every joint type $t$. By estimating the similarity between center nodes and keypoints, as opposed to the similarity between pairs of keypoints, we are able to reduce the complexity of the grouping task significantly. Instead of requiring a graph clustering algorithm, we formulate the grouping task as a simple nearest-neighbor search problem. Namely, for every predicted center $c$, we obtain its corresponding pose by retrieving, for each target type $t$, the location of its most similar detected keypoint according to $s^t$. Formally, the predicted location of joint type $t$ for center $c$ can be obtained as $\hat{x}_c^t = x_{k^*}$, the location of keypoint $k^*$, where
$$k^* = \arg\max_{k \in \mathcal{K}} s^t(c, k), \tag{4}$$
and $\mathcal{K}$ is the set of detected keypoints.
Since our approach operates directly over the set of detected keypoint locations, there is no need for additional postprocessing to obtain precise joint locations, unlike offset-based methods [69, 43].
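The nearest-neighbor grouping rule described above can be sketched in a few lines. The function name `group_keypoints`, the nested-list layout `scores[t][c][k]`, and the numeric values are assumptions made for this example only.

```python
def group_keypoints(scores, keypoint_xy):
    """Assign to every center the best-scoring keypoint of each joint type.

    scores[t][c][k] is the similarity between center c and keypoint k
    for target joint type t. Returns poses[c][t] = (x, y).
    """
    num_types = len(scores)
    num_centers = len(scores[0])
    poses = []
    for c in range(num_centers):
        pose = []
        for t in range(num_types):
            row = scores[t][c]
            best_k = max(range(len(row)), key=row.__getitem__)  # argmax over keypoints
            pose.append(keypoint_xy[best_k])
        poses.append(pose)
    return poses

# Hypothetical toy input: 3 detected keypoints, 2 centers, 2 joint types.
keypoint_xy = [(10.0, 20.0), (50.0, 60.0), (12.0, 40.0)]
scores = [
    [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]],  # joint type 0
    [[0.1, 0.2, 3.0], [0.4, 0.9, 0.3]],  # joint type 1
]
poses = group_keypoints(scores, keypoint_xy)
```

Because the returned coordinates are those of detected keypoints, no offset refinement is involved in this step.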
4.3 Attention as Differentiable Keypoint Selection
The main drawback of the aforementioned procedure is that it is not end-to-end trainable, as it involves an $\arg\max$ operation over the detected keypoints. We circumvent this issue by formulating the nearest-neighbor search task as a differentiable attention mechanism. To do so, we treat person centers as our set of queries and keypoints as our set of keys, and obtain their similarity scores by computing their dot-product in a learned embedding space for every joint type $t$. We then normalize the scores with a softmax operator to replace the non-differentiable $\arg\max$. The resulting coefficients $\alpha^t_{ck}$ are used during training to directly predict keypoint locations for every joint type $t$ and every person center $c$ as:
$$\hat{x}_c^t = \sum_{k \in \mathcal{K}} \alpha^t_{ck}\, x_k, \tag{5}$$
where $x_k$ are the coordinates of the detected keypoint $k$, and $\mathcal{K}$ is the set of detected keypoints.
Predictions resulting from Equation 5 can then be supervised by directly computing their loss with respect to the ground truth location. Note that since the detected keypoint coordinates $x_k$ are fixed in Equation 5, in order to minimize the loss our network needs to assign the highest attention score to the keypoints whose location is closest to the ground truth coordinates $x_c^t$ of joint $t$ for center $c$. Moreover, in the limit, when $\max_k \alpha^t_{ck} \to 1$, Equation 5 becomes equivalent to just computing a standard $\arg\max$. Therefore, the attention coefficients act as a differentiable mechanism for selecting keypoints from person centers based on their dot-product similarity. This procedure still allows us to use the simple $\arg\max$ operator at test time to efficiently retrieve keypoints from centers as in Equation 4.
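The following sketch illustrates how the softmax-weighted average of Equation 5 approaches a hard selection as the attention becomes peaked. The `sharpness` multiplier is an assumption introduced purely to show this limit; it is not a parameter of the method, and the scores and coordinates are made-up values.

```python
import math

def soft_select(scores, keypoint_xy, sharpness=1.0):
    """Softmax-weighted average of keypoint locations (differentiable selection).

    `sharpness` is a hypothetical knob used only to make the softmax more peaked.
    """
    m = max(scores)
    exps = [math.exp(sharpness * (s - m)) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    x = sum(a * p[0] for a, p in zip(alpha, keypoint_xy))
    y = sum(a * p[1] for a, p in zip(alpha, keypoint_xy))
    return (x, y), alpha

scores = [2.0, 1.0, 0.0]                        # center-to-keypoint similarities
keypoint_xy = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]

soft_loc, alpha = soft_select(scores, keypoint_xy)               # blended location
sharp_loc, _ = soft_select(scores, keypoint_xy, sharpness=50.0)  # close to keypoint 0
```

With the default setting the prediction is a blend of all three keypoints; with a very peaked softmax it coincides, up to numerical precision, with the location of the highest-scoring keypoint.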
We exploit the formulation described in the previous section within an end-to-end bottom-up pipeline for pose estimation. In this section, we first provide a general overview of it, and then explain each of its components in detail.
Our method, CenterGroup, consists of three main stages, which are summarized in Figure 2:
Keypoint and center detection. The location of identity-agnostic keypoints and person centers is obtained by heatmap regression following HigherHRNet [12]. The output is a variable number of high-scoring joint and person center detections.
Encoding keypoints and centers. For every detected keypoint and center, we extract features from a CNN backbone, and augment them with additional embeddings encoding their spatial position. These embeddings are fed to a transformer, yielding updated embeddings with enhanced contextual information.
Keypoint grouping. We use the embeddings obtained from the previous stage and compute dot-product attention scores between person centers and keypoints, and normalize them in order to obtain a soft-assignment between persons and keypoints. Additionally, we use the transformer embeddings to classify center nodes into true and false positives, and determine the visibility of each keypoint.
5.2 Keypoint and Center Detection
In the first stage of our pipeline, we start by detecting identity-agnostic keypoints and person centers.
Keypoints. Following HigherHRNet, we use an HRNet backbone, followed by two keypoint prediction heads that regress heatmaps at 1/4 and 1/2 of the original image scale for every joint type. Heatmaps are trained to follow a Gaussian distribution centered at ground truth keypoint locations. During training, both heatmaps are supervised independently with a mean squared error loss. At inference, heatmaps are upsampled and aggregated to obtain a single heatmap at full image resolution.
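As an illustration of the heatmap supervision described above, the sketch below builds a Gaussian target for one joint type and evaluates a squared-error loss against it. The value of `sigma` and the use of a pixel-wise maximum to combine overlapping Gaussians are assumptions of this example; the text does not specify them.

```python
import math

def gaussian_heatmap(height, width, keypoints, sigma=2.0):
    """Target heatmap with an unnormalized Gaussian at every (x, y) keypoint."""
    heatmap = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                val = math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], val)  # assumed rule for overlaps
    return heatmap

def mse(pred, target):
    """Mean squared error between two heatmaps of the same size."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)) / n

target = gaussian_heatmap(5, 5, [(2, 2)], sigma=1.0)
empty_prediction = [[0.0] * 5 for _ in range(5)]
loss = mse(empty_prediction, target)
```

The same construction would apply to the person-center heatmap, with center locations in place of joint locations.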
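Person centers. In addition to joints, we regress a new heatmap corresponding to person centers, also at resolutions 1/4 and 1/2. Following prior work, given a ground truth pose $p$ with joint locations $\{x_p^t\}_{t=1}^{J}$ and joint visibilities $\mathrm{vis}_p^t \in \{0, 1\}$, the location of its center is computed as the center of mass of the visible joints, i.e.,
$$x_p = \frac{1}{N_p} \sum_{t=1}^{J} \mathrm{vis}_p^t\, x_p^t, \tag{6}$$
where $N_p = \sum_{t=1}^{J} \mathrm{vis}_p^t$ is the number of visible joints in pose $p$. Note that we identify the pose location with that of its center, and hence write $x_p$.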
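The center-of-mass computation described above amounts to averaging the coordinates of the visible joints. A minimal sketch follows; the function name `pose_center` and the example coordinates are hypothetical.

```python
def pose_center(joints, visible):
    """Center of mass of the visible joints; None if no joint is visible."""
    pts = [p for p, v in zip(joints, visible) if v]
    if not pts:
        return None
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

joints = [(0.0, 0.0), (4.0, 0.0), (100.0, 100.0), (2.0, 6.0)]
visible = [1, 1, 0, 1]          # the third joint is not visible and is ignored
center = pose_center(joints, visible)
```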
5.3 Encoding Keypoints and Centers
Given the set of predicted keypoints and centers from the first stage, our goal is to obtain discriminative embeddings encoding contextual information. These embeddings will then be used in our grouping module in order to predict associations among keypoints and centers. Hence, it is desirable for them to encode global context. Towards this end, we use a transformer encoder that yields updated embeddings for every keypoint and center.
Initial features. We add one additional residual block to our backbone’s last feature map at 1/4 of the original resolution. For every detected keypoint and center, we obtain an initial embedding vector by extracting the vector at its corresponding location from the resulting feature map, and feed it to a two-layer Multi-Layer Perceptron (MLP) to project it onto a higher dimensionality.
Positional encodings. CNN features struggle to encode the position of different keypoints. However, spatial information offers an important cue for keypoint grouping. Hence, similarly to previous work [7, 70, 16, 61], we use fixed sinusoidal features encoding the absolute $x$- and $y$-axis locations at different frequencies. As a result, we obtain a new vector with the same dimensionality as the initial features, and sum it element-wise to the initial features of every detected keypoint and center.
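One common way to build such fixed sinusoidal features is sketched below. The exact layout (half of the channels for $x$, half for $y$, interleaved sine and cosine terms, and a base `temperature` of 10000) is an assumption borrowed from typical transformer implementations, not a detail confirmed by the text.

```python
import math

def sinusoidal_1d(pos, dim, temperature=10000.0):
    """Interleaved sin/cos features of a scalar position; `dim` must be even."""
    feats = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        feats.append(math.sin(pos / freq))
        feats.append(math.cos(pos / freq))
    return feats

def positional_encoding_2d(x, y, dim):
    """Concatenate the encodings of x and y; `dim` must be divisible by 4."""
    half = dim // 2
    return sinusoidal_1d(x, half) + sinusoidal_1d(y, half)

def add_position(feature, x, y):
    """Sum the positional encoding element-wise to an initial feature vector."""
    enc = positional_encoding_2d(x, y, len(feature))
    return [f + e for f, e in zip(feature, enc)]

encoding = positional_encoding_2d(3.0, 7.0, 8)
augmented = add_position([0.0] * 8, 0.0, 0.0)
```

Since the encoding has the same length as the feature vector, the element-wise sum leaves the embedding dimensionality unchanged.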
Transformer encoder. In order to encode global context among every detected person and keypoint, we take their initial features, augmented with positional encodings, and feed them to a transformer encoder. As a result, we obtain updated embeddings $h_k$ and $h_c$ for every detected keypoint $k$ and center $c$. Our transformer architecture follows the one described in Section 3, and is described in detail in the supplementary material.
5.4 Keypoint Grouping
In the last stage of our pipeline, we construct poses by using the embeddings produced by our transformer to determine which keypoints belong to which person centers via pairwise attention scores. As explained in Section 4.3, we use attention as a differentiable approximation of keypoint selection from centers. In addition, we predict two additional targets for every center: the visibility of each of their keypoints, and the probability that they represent a true pose. This module is summarized in Figure 3.
Classifying centers. We start by identifying which predicted center locations correspond to ground truth poses by matching them based on their locations (more details are provided in the supplementary material). As a result, each predicted center $c$ is labelled with a binary target $y_c \in \{0, 1\}$, set to 1 if the center is matched and 0 otherwise. We then use a small multi-layer perceptron to classify the center embeddings produced by our transformer, $h_c$, and supervise the resulting prediction with a focal loss.
Predicting joint locations. For every predicted center $c$ that is matched to a ground truth pose, i.e., such that $y_c = 1$, we aim to predict the 2D coordinates of each of its joints of every type $t \in \{1, \dots, J\}$. For each joint type $t$, we define a pair of learnable projection matrices $W^Q_t$ and $W^K_t$, and a learned type encoding vector $e_t$. The goal of the projection matrices is to map the center embeddings $h_c$ and keypoint embeddings $h_k$ into a discriminative representation in which their dot-product will encode their likelihood of being a good match for type $t$. We compute their similarity as:
$$s^t(c, k) = (W^Q_t h_c)^\top\, W^K_t \left(h_k + e_{t_k}\right), \tag{7}$$
where $t_k$ is the type predicted by the keypoint detector for keypoint $k$.
Note that the learned embedding $e_{t_k}$ is added to the keypoint embedding $h_k$ before the multiplication. Its goal is to encode the initial type predicted by our keypoint detector for keypoint $k$. Intuitively, when searching for joints of a target type $t$, it is desirable for our network to still be able to consider joints of all predicted types in order to recover from type errors made by the keypoint detector. For instance, for a target type $t$, such as left ankle, some predicted types by the detector (e.g., right ankle) are more likely to be better matching candidates than others (e.g., nose). By adding the learnable encoding $e_{t_k}$ to each keypoint embedding before computing the projected embedding with $W^K_t$, we allow our network to explicitly account for the relationships between the target type $t$ and the predicted type of $k$, $t_k$, in a learnable manner.
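With the similarity scores $s^t(c, k)$ from Equation 7, the final attention scores are computed by normalizing them with a softmax operation over the entire set of keypoints $\mathcal{K}$:
$$\alpha^t_{ck} = \frac{\exp\left(s^t(c, k)\right)}{\sum_{k' \in \mathcal{K}} \exp\left(s^t(c, k')\right)}. \tag{8}$$
Finally, we obtain the predicted locations as in Eq. 5:
$$\hat{x}_c^t = \sum_{k \in \mathcal{K}} \alpha^t_{ck}\, x_k, \tag{9}$$
where $x_k$ is the location of keypoint $k$, and supervise them with an $\ell_1$ loss as:
$$\mathcal{L}_{\mathrm{loc}} = \sum_{c} \sum_{t} \left\lVert \hat{x}_c^t - x_c^t \right\rVert_1, \tag{10}$$
where the sums run over matched centers $c$ and joint types $t$, and the ground truth locations $x_c^t$ of every joint for center $c$ are those of its matched ground truth pose.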
Overall, this procedure can be interpreted as an instance of an attention head in which center and keypoint embeddings are queries and keys, respectively, and keypoint locations act as values. Note that we use different matrices $W^Q_t$ and $W^K_t$ for every target type $t$, which is equivalent to having $J$ different heads, one per joint type.
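The per-type attention head described above can be sketched as follows for a single center and a single target joint type. All names (`predict_joint`, `type_enc`, the identity projections) and numeric values are assumptions made for illustration, and the embeddings are tiny hand-written vectors rather than transformer outputs.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def predict_joint(center_emb, kp_embs, kp_types, kp_xy, W_q, W_k, type_enc):
    """Attention-based location prediction for one center and one target type.

    W_q and W_k play the role of the projections of the target type;
    type_enc[p] is the encoding of predicted keypoint type p.
    """
    q = matvec(W_q, center_emb)
    scores = []
    for emb, t_pred in zip(kp_embs, kp_types):
        # Add the encoding of the detector's predicted type before projecting.
        typed = [e + te for e, te in zip(emb, type_enc[t_pred])]
        key = matvec(W_k, typed)
        scores.append(sum(a * b for a, b in zip(q, key)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]  # softmax over all detected keypoints
    loc = tuple(sum(a * p[d] for a, p in zip(alpha, kp_xy)) for d in range(2))
    return loc, alpha

I2 = [[1.0, 0.0], [0.0, 1.0]]
type_enc = [[1.0, 0.0], [0.0, 1.0]]       # one hypothetical encoding per predicted type
kp_embs = [[3.0, 0.0], [0.0, 3.0]]
kp_types = [0, 1]
kp_xy = [(10.0, 20.0), (30.0, 40.0)]
loc, alpha = predict_joint([4.0, 0.0], kp_embs, kp_types, kp_xy, I2, I2, type_enc)
```

At test time, the same scores could be used with a hard selection of the highest-scoring keypoint instead of the weighted average.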
Predicting keypoint visibility. One drawback of the attention mechanism we have described is that, due to the softmax normalization, it may still predict high attention scores between centers and keypoints for a target type even if a given center has no corresponding visible keypoint of that type in the image. We address this problem by exploiting the attention mechanism to explicitly classify whether the predicted keypoints are visible. To do so, we introduce an additional projection matrix $W^V_t$ for every head $t$, and reuse the type encodings $e_{t_k}$ and the attention scores $\alpha^t_{ck}$ already computed to predict locations to compute a weighted aggregation analogous to that in Equation 9:
$$z_c^t = \sum_{k \in \mathcal{K}} \alpha^t_{ck}\, W^V_t \left(h_k + e_{t_k}\right). \tag{11}$$
We then concatenate $z_c^t$ with the embedding of center $c$, and classify the resulting vector with an additional multi-layer perceptron as either visible or not visible. We supervise the result with a focal loss (this loss is only computed whenever the given joint of the predicted center is labelled as not visible in the ground truth, or the predicted keypoint has a small Euclidean distance to the ground truth keypoint). Intuitively, whenever keypoints are not visible, the original center and keypoint embeddings $h_c$ and $h_k$ will not be aligned, and therefore neither will the two concatenated vectors. Hence, their concatenation can be discriminatively used to identify when a joint has no good keypoint candidate for the target joint type.
| # | Method | Group. | Type Agnostic | Type Encoding | Transformer | Pos. Encoding | AP | AP50 | AP75 | APM | APL |
| 1 | Offsets + keypoint match. | | | | | | 65.3 | 86.4 | 71.4 | 59.1 | 75.0 |
| 2 | AE [12, 41] | | | | | | 67.1 | 86.2 | 73.0 | 61.5 | 76.1 |
| 3 | Ours w/o K&C transformer enc. | ✓ | | | | | 67.5 | 86.7 | 72.7 | 62.0 | 76.6 |
| 4 | Ours w/o K&C transformer enc. | ✓ | ✓ | | | | 67.5 | 86.8 | 72.9 | 60.8 | 77.3 |
| 5 | Ours w/o K&C transformer enc. | ✓ | ✓ | ✓ | | | 67.9 | 87.4 | 73.2 | 61.4 | 77.4 |
In this section, we detail the experimental evaluation of our method. We divide it into ablation studies and comparison to state-of-the-art on two large-scale public datasets. For implementation details, we refer the reader to the supplementary material.
6.1 Datasets and Evaluation Metrics
COCO Keypoint Detection. The COCO dataset is a large-scale benchmark containing a large variety of everyday situations. It contains over 200,000 images and 17-keypoint annotations for more than 250,000 human instances, which are split into approximately 150,000, 80,000 and 20,000 instances for training, testing and validation, respectively. We train our models on the train2017 split only, perform our ablation studies on val2017, and report our final results on the test-dev2017 split.
CrowdPose. The CrowdPose dataset is a challenging benchmark designed to evaluate the robustness of methods in crowded scenes. Unlike COCO, in which the majority of images contain few instances, the crowd index in CrowdPose follows a uniform distribution. The dataset contains a total of 20,000 images and a total of 80,000 instances annotated with 14 keypoints. Images are split in a ratio of 5:4:1 for training, validation, and testing. Following prior work, we train our model on the train and validation splits combined, and report the final performance on the test set.
Evaluation metrics. The aforementioned datasets use average precision (AP) as their main metric. AP computation is based on the Object Keypoint Similarity (OKS) score between detected and ground truth poses. AP is the average of the precision scores over the OKS thresholds $0.50, 0.55, \dots, 0.95$. We also report AP for thresholds $0.50$ and $0.75$, namely, $\mathrm{AP}^{50}$ and $\mathrm{AP}^{75}$. In addition, for COCO we report $\mathrm{AP}^{M}$ and $\mathrm{AP}^{L}$, which correspond to AP over medium and large-sized instances, respectively. For CrowdPose, we also report $\mathrm{AP}^{E}$, $\mathrm{AP}^{M}$ and $\mathrm{AP}^{H}$, which stand for AP scores over easy, medium and hard instances, according to dataset annotations.
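For reference, the OKS score underlying these metrics can be sketched as below, following the COCO keypoint evaluation definition: each labelled joint contributes a Gaussian of its distance to the ground truth, scaled by the object area and a per-keypoint constant. The function name `oks` and the `sigmas` and coordinates in the example are illustrative placeholders, not the official per-keypoint constants.

```python
import math

def oks(pred, gt, visible, area, sigmas):
    """COCO-style Object Keypoint Similarity between one predicted and one GT pose."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabelled joints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k * k))
        count += 1
    return total / count if count else 0.0

gt = [(10.0, 10.0), (20.0, 10.0), (15.0, 30.0)]
visible = [1, 1, 0]
sigmas = [0.05, 0.05, 0.1]       # placeholder values
area = 400.0                     # object area in pixels^2 (placeholder)

perfect = oks(gt, gt, visible, area, sigmas)
shifted = oks([(12.0, 10.0), (20.0, 13.0), (0.0, 0.0)], gt, visible, area, sigmas)

# OKS thresholds over which AP is averaged.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
```

A detection counts as a true positive at a given threshold when its OKS with a ground truth pose exceeds that threshold; AP is then averaged over the thresholds listed.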
6.2 Ablation Study
To determine the individual contribution of each of our model’s main components, we perform an ablation study on the COCO val2017 split, with HRNet32 backbone and input size 512x512. All results are reported with flip-testing, following [12, 41, 29], and without top-down refinement.
Baselines. CenterGroup can be naturally compared to two alternative frameworks. First, associative embeddings (AE) (Tab. 1, row #2), since they are the grouping method originally used by our keypoint detection network. Second, one-shot or offset-based methods [69, 43], which also use person center predictions, but use offset regression to obtain the final results. For a fair comparison, we reimplement CenterNet with our HigherHRNet backbone and report the performance of its strongest variant, which predicts keypoint heatmaps, center heatmaps and center offsets, and matches centers to their closest predicted keypoint (Tab. 1, row #1).
Grouping Module. We consider our model without the transformer encoder to isolate the effect of our Grouping Module. We compare three versions of it. In the first one, the attention head corresponding to the prediction of each keypoint type is only allowed to attend over keypoints of the same type that our keypoint detection network detects, and therefore is not able to overcome joint type mistakes made by the detector. This setting (Tab. 1, row #3) already outperforms our baselines, which supports the advantage of CenterGroup over AE-based grouping and offset-based methods. In rows #4 and #5, we allow each head to attend over keypoints from the entire set of predicted heatmaps, and refer to these variants as type-agnostic. In row #5, we further use type encoding in the attention computation, as explained in Section 5.4, and observe that it improves upon type-agnostic grouping.
Feature encoding. In rows #6 and #7 of Table 1, we further analyze the effect of using the keypoint and center transformer encoding before the grouping module. This yields a significant performance boost, which confirms the importance of encoding long-range interactions between keypoints. Further enhancing the initial embeddings with positional encodings allows the transformer to explicitly use spatial information and gives up to 0.4 AP points of improvement for large persons.
Loss terms. We also assess the importance of our additional center and visibility classification losses; the results can be found in Table 2. Without them, we score our predicted poses by directly assigning them the confidence of their predicted center from heatmaps. We observe that replacing the heatmap score with the classification score obtained from our transformer’s embeddings (row #2) already provides a significant boost. We then compare two ways of predicting the visibility of every keypoint: a simple MLP over those features (row #3), and our attention-based model shown in Figure 3 (row #4). We observe that both yield a significant improvement, but our attention-based model performs best.
Runtime analysis. In Table 3, we report the overall speed of our method when compared to our baselines. All models are run on the same machine with a single NVIDIA RTX5000 GPU, with batch size 1 and flip testing. We report the grouping runtime, i.e., all computations after keypoint detection, which in our case includes the transformer and grouping attention forward pass, and the overall runtime, which always adds 126 ms corresponding to HigherHRNet. The overall runtime of CenterGroup is similar to that of the offset-based baseline, while we get significantly better results. Compared to AE-based grouping, our keypoint attention grouping is over 6x faster.
|#||Class. Cent.||Vis. w/ MLP||Vis. w/ Attn.||AP||APM||APL|
6.3 Benchmark Evaluation
COCO Keypoint Detection.
| Method | AP | AP50 | AP75 | APM | APL |
| Integral Pose Regression | 67.8 | 88.2 | 74.8 | 63.9 | 74.0 |
| Ours w/ HrHRNet-W32 | 67.6 | 88.7 | 73.6 | 61.9 | 75.6 |
| Ours w/ HrHRNet-W48 | 69.6 | 89.7 | 76.0 | 64.9 | 76.3 |
| Ours w/ HrHRNet-W32+ | 70.3 | 90.0 | 76.9 | 65.4 | 77.5 |
| Ours w/ HrHRNet-W48+ | | | | | |
In Table 4, we compare CenterGroup against state-of-the-art methods on the COCO dataset. Our method achieves the best performance among all bottom-up methods, for both single and multi-scale testing, and outperforms HigherHRNet, which uses AE-based grouping, by approximately 1 AP. We observe that our improvements are most significant in APL. This can be explained by the ability of our attention module to capture long-range interactions between joints that are far apart. Overall, strong results in COCO, combined with our faster inference speed, show that CenterGroup is a more efficient alternative to current bottom-up methods. We provide additional analysis in the supplementary material.
CrowdPose. In Table 5, we show the test-set results for our model trained on CrowdPose. Unlike COCO, where top-down methods show superior performance, bottom-up methods outperform their top-down counterparts in CrowdPose, since this dataset is focused on much more challenging images with severe occlusions. In this setting, CenterGroup shows its full potential and obtains state-of-the-art performance among all methods by 1.8 AP points. Most importantly, our improvement is most significant in the hard regime (APH), where we improve upon state-of-the-art by 2.4 and 2.6 AP points for single and multi-scale testing, respectively.
This suggests that our end-to-end learnable formulation benefits from being trained on a dataset in which occlusions are common, and results in better generalization to new, challenging images. Overall, we show that our end-to-end trainable method can outperform top-down and bottom-up approaches in difficult scenarios with severe occlusions, where reasoning about keypoint detection and grouping jointly has a clear benefit.
|Top-down with refinement|
| Method | AP | AP50 | AP75 | APE | APM | APH |
| Ours w/ HrHRNet-W48 | 67.6 | 87.7 | 72.7 | 73.9 | 68.2 | 60.3 |
| Ours w/ HrHRNet-W48+ | 70.0 | 88.9 | 75.1 | 76.8 | 70.7 | 62.2 |
We have proposed an end-to-end attention-based framework for bottom-up human pose estimation. We have demonstrated that CenterGroup has better performance than existing state-of-the-art methods, particularly in crowded images, while being significantly more efficient. We hope that our approach will inspire future work to explore the potential of attention mechanisms, as well as general learning-based alternatives to optimization-based grouping for bottom-up human pose estimation.
Acknowledgments. This project was partially funded by the Sofja Kovalevskaja Award of the Humboldt Foundation and by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036B. The authors of this work take full responsibility for its content.
-  (2018) PoseTrack: A benchmark for human pose estimation and tracking. In CVPR, Cited by: §1, §2.
2D human pose estimation: new benchmark and state of the art analysis.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2016) Layer normalization. In arXiv preprint arXiv:1706.03762, External Links: Cited by: §3.
-  (2016) Human pose estimation via convolutional part heatmap regression. In European Conference on Computer Vision, pp. 717–732. Cited by: §2.
-  (2020) Learning delicate local representations for multi-person pose estimation. In European Conference on Computer Vision, pp. 455–472. Cited by: §2.
-  (2017) Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7291–7299. Cited by: Table 6, §C.2, §1, §1, §2, Table 4, Table 5.
-  (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: Appendix B, §2, §5.3.
-  (2019) Everybody dance now. In IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
-  (2020) Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691–1703. Cited by: §2.
-  (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103–7112. Cited by: §1, §2.
-  (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103–7112. Cited by: Table 4.
-  (2020-06) HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, Appendix A, Appendix A, §C.1, §C.2, §C.2, 4(b), Figure 5, §D.1, 3rd item, §1, §1, §2, Figure 2, item 1, §5.2, §6.1, §6.2, §6.2, Table 1, Table 4, Table 5.
-  (2020-06) Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2017) Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1831–1840. Cited by: §2.
-  (2007) Weighted graph cuts without eigenvectors a multilevel approach. IEEE transactions on pattern analysis and machine intelligence 29 (11), pp. 1944–1957. Cited by: Table 6, §2.
-  (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2, §5.3.
-  (2017) Rmpe: regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343. Cited by: §1, §2, Table 4, Table 5.
-  (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. Cited by: §C.1.
-  (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §5.3.
-  (2017-10) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, Table 4, Table 5.
-  (2020) Epipolar transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7779–7788. Cited by: §2.
-  (2019) Multiple people tracking using body and joint detections. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 770–779. Cited by: §1.
-  (2020) The devil is in the details: delving into unbiased data processing for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5700–5709. Cited by: §2.
-  (2020) Hand-transformer: non-autoregressive structured modeling for 3d hand pose estimation. In European Conference on Computer Vision, pp. 17–33. Cited by: §2.
-  (2020) HOT-net: non-autoregressive transformer for 3d hand-object pose estimation. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 3136–3145. Cited by: §2.
-  (2017) A coarse-fine network for keypoint localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3028–3037. Cited by: §1, §2, Table 4.
-  (2016) Deepercut: a deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pp. 34–50. Cited by: §1, §1, §2.
-  (2016) Multi-person pose estimation with local joint-to-person associations. In European Conference on Computer Vision, pp. 627–642. Cited by: §2.
-  (2020) Differentiable hierarchical graph grouping for multi-person pose estimation. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 718–734. Cited by: Table 6, Appendix A, §1, §1, §2, §6.2, Table 4.
-  (2018-10) Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, pp. 1–1. Cited by: §1.
-  (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §C.1.
-  (2018) Multiposenet: fast multi-person pose estimation using pose residual network. In Proceedings of the European conference on computer vision (ECCV), pp. 417–433. Cited by: §C.2.
-  (2019) Pifpaf: composite fields for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11977–11986. Cited by: Table 6, §2, Table 4.
-  (2019-06) CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix B, §1, §2, §6.1, Table 5.
-  (2017-10) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §5.4.
-  (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. Cited by: §1, §6.1, §6.1.
-  (2018) An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, Cited by: §5.3.
-  (2020) LiftFormer: 3d human pose estimation using attention models. arXiv preprint arXiv:2009.00348. Cited by: §2.
-  (2018) Mixed precision training. In International Conference on Learning Representations, Cited by: §C.1.
-  (2019) Posefix: model-agnostic general human pose refinement network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7773–7781. Cited by: §2.
-  (2017) Associative embedding: end-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 2277–2287. Cited by: Table 6, §C.1, §C.2, §1, §1, §2, §6.2, §6.2, §6.3, Table 1, Table 3, Table 4.
-  (2016) Stacked hourglass networks for human pose estimation. In European conference on computer vision, pp. 483–499. Cited by: §2.
-  (2019) Single-stage multi-person pose machines. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6951–6960. Cited by: Table 6, Appendix A, §C.2, §1, §2, §4.2, §5.2, §6.2, Table 4.
-  (2018) Personlab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–286. Cited by: Table 6, §C.2, §1, §1, §2, Table 4.
-  (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911. Cited by: Table 4.
-  (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911. Cited by: §1, §2.
-  (2018) Image transformer. In International Conference on Machine Learning, pp. 4055–4064. Cited by: §2.
-  (2016) Deepcut: joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4929–4937. Cited by: §1, §1, §2.
-  (2017) Articulated multi-person tracking in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1, §2.
-  (2020) Peeking into occluded joints: a novel framework for crowd pose estimation. In European Conference on Computer Vision, pp. 488–504. Cited by: §2.
-  (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.
-  (2019) Appearance and pose-conditioned human image generation using deformable gans. In IEEE Transactions on Pattern Analysis and Machine Intelligence, Cited by: §1.
-  (2020) 15 keypoints is all you need. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6738–6748. Cited by: §2.
-  (2019) Multi-person pose estimation with enhanced channel-wise and spatial information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5674–5682. Cited by: §2.
-  (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703. Cited by: §C.2, §D.1, §1, §2, §5.2, Table 4.
-  (2018) Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545. Cited by: §1, §2, Table 4.
-  (2018) Deeply learned compositional models for human pose estimation. In Proceedings of the European conference on computer vision (ECCV), pp. 190–206. Cited by: §2.
-  (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §C.3, §1, §3, item 2.
-  (2020) MaX-deeplab: end-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759. Cited by: §2.
-  (2020) Graph-pcnn: two stage human pose estimation with graph pose refinement. In European Conference on Computer Vision, pp. 492–508. Cited by: §2.
-  (2020) End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503. Cited by: §2, §5.3.
-  (2020) Point-set anchors for object detection, instance segmentation and pose estimation. In European Conference on Computer Vision, pp. 527–544. Cited by: §1, §2.
-  (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pp. 466–481. Cited by: §1, §2, Table 4, Table 5.
-  (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI Conference on Artificial Intelligence, Cited by: §1.
-  (2020) TransPose: towards explainable human pose estimation by transformer. arXiv preprint arXiv:2012.14214. Cited by: §2.
-  (2020) Learning joint spatial-temporal transformations for video inpainting. In European Conference on Computer Vision, pp. 528–543. Cited by: §2.
-  (2020) Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7093–7102. Cited by: §2.
-  (2018) End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8739–8748. Cited by: §2.
-  (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: 4(c), Figure 5, §D.1, §1, §2, §4.2, §6.2, §6.2, §6.3, Table 1, Table 3.
-  (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: §2, §5.3.
Appendix A Extended COCO Comparison
|Method||Backbone||Grouping||Input size||# Params||AP||AP50||AP75||APM||APL|
|w/o multi-scale test|
|OpenPose* [6]||–||Greedy decoding w/ optimization||–||–||61.8||84.9||67.5||57.1||68.2|
|AE* [41]||Hourglass||Greedy decoding w/ optimization||512||277.8M||62.8||84.6||69.2||57.5||70.4|
|PifPaf [33]||–||Greedy decoding w/ optimization||–||–||66.7||–||–||62.4||72.9|
|HigherHRNet [12, 41]||HRNet-W32||Greedy decoding w/ optimization||512||28.6M||66.4||87.5||72.8||61.2||74.2|
|HigherHRNet [12, 41]||HRNet-W48||Greedy decoding w/ optimization||640||63.8M||68.4||88.2||75.1||64.4||74.2|
|w/ multi-scale test|
|AE* [41]||Hourglass||Greedy decoding w/ optimization||512||277.8M||65.5||86.8||72.3||60.6||72.6|
|SPM* [43]||Hourglass||Offsets (One-shot)||512||277.8M||66.9||88.5||72.9||62.6||73.1|
|HGG [29]||Hourglass||Graph Network + Graclus clustering [15]||512||–||67.6||85.1||73.7||62.7||74.6|
|PersonLab [44]||ResNet152||Greedy decoding||1401||68.7M||68.7||89.0||75.4||66.6||75.8|
|HrHRNet-W48 [12]||HRNet-W48||Greedy decoding w/ optimization||640||63.8M||70.5||89.3||77.2||66.6||75.8|
In Table 6, we provide a detailed comparison of CenterGroup against published bottom-up approaches on the COCO test-dev dataset. For each method, we specify its backbone network, grouping procedure, input size, and parameter count. We observe that most top-performing methods rely on greedy decoding schemes, which often involve optimization in the form of solving a sequence of bipartite matching problems. Alternatively, SPM [43] uses offsets, but relies on top-down refinement to achieve competitive results (i.e., it applies a single-person pose estimation model over the predicted poses), and HGG [29] uses a hierarchical clustering algorithm that operates on the output of graph network predictions.
CenterGroup outperforms all previous methods with our proposed attention-based grouping module, which does not rely on optimization and is end-to-end trainable. Note that this module only introduces a slight increase in the number of parameters with respect to HigherHRNet, and combined with our keypoint detector, yields a model with significantly fewer parameters than other methods.
Regarding performance, we note that the increase in accuracy is most significant for large persons, where our improvement is 2.1 AP points for single-scale and 1.7 for multi-scale testing, which can be explained by the ability of our transformer to capture relationships among distant joints in the image. Overall, CenterGroup outperforms the current state-of-the-art method, HigherHRNet, by approximately 1.2 AP for single-scale and 0.9 AP for multi-scale testing, while having the exact same backbone and input size, and being 2.5x faster, which confirms CenterGroup’s increased efficiency.
Appendix B Matching Centers
In order to train our grouping module, we need to determine which detected centers in the image correspond to a ground truth pose. As explained in Section 5.4 in the main paper, this allows us to define a binary target for every detected center indicating whether it represents a ground truth pose (target equal to 1) or not (target equal to 0). These labels are used to train our center classification module. Moreover, for those detected centers that do correspond to a ground truth pose, we obtain the visibility of their corresponding keypoints, as well as the locations of those that are visible, by simply using the annotations of the ground truth center that the detected center is matched with.
In order to determine correspondences between detected centers and ground truth centers, we compute the Euclidean distance between every detected center and every ground truth center, and normalize it by the scale of the ground truth pose (Equation 12). The normalization uses a fixed constant set to 0.15 (this number is determined by increasing by 50% the constant that the COCO dataset uses for hip joints for OKS computation), and the scale is computed as a constant factor multiplied by the bounding box height and width of the ground truth pose, following [34]. This formula is adapted from the OKS metric, and simply normalizes distances between 0 and 1 by using a pre-defined standard deviation that depends on the object size.
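Since the distance is described as an OKS-style normalization, one plausible form can be sketched as follows. The function name, the `area_factor` parameter and the exact functional form (one minus a Gaussian of the scaled distance) are assumptions made for illustration; they are not taken from the paper's equation.

```python
import math


def normalized_center_distance(det, gt, gt_box_wh, k=0.15, area_factor=1.0):
    """OKS-style distance in [0, 1] between a detected and a ground truth center.

    det, gt:      (x, y) coordinates of the detected and ground truth centers.
    gt_box_wh:    (width, height) of the ground truth pose's bounding box.
    k:            fixed constant (0.15 in the text).
    area_factor:  hypothetical factor applied to the box area to obtain the scale.
    """
    dx, dy = det[0] - gt[0], det[1] - gt[1]
    squared_dist = dx * dx + dy * dy
    width, height = gt_box_wh
    scale_sq = area_factor * width * height  # squared object scale (assumed form)
    # OKS uses exp(-d^2 / (2 s^2 k^2)) as a similarity; we turn it into a distance.
    return 1.0 - math.exp(-squared_dist / (2.0 * scale_sq * k * k))
```

With this form, coinciding centers have distance 0, and the distance approaches 1 as the centers move apart relative to the size of the person.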
With the distances from Equation 12, we define an instance of a bipartite matching problem. For every pair of a detected center and a ground truth center, the corresponding matching cost is defined from their normalized distance, and takes a different value depending on whether a condition on that distance is met. We obtain matches between detected centers and ground truth centers by solving the problem with the Hungarian algorithm, similarly to [7]. Note that the running time of this algorithm is negligible, since the cost matrix is, at most, of size 20x30, and therefore it adds no significant computational burden. Additionally, note that this procedure is only necessary at training time in order to define ground truth assignments. At test time, as explained in the main paper, we do not require any form of optimization.
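The matching step can be illustrated with a small, self-contained sketch. It assumes that pairs whose distance exceeds a threshold are forbidden; the threshold value `max_dist`, the penalty `big` and the function name are hypothetical. For brevity, it enumerates assignments by brute force, which is only viable for tiny inputs; in practice a Hungarian solver such as SciPy's `linear_sum_assignment` would replace the enumeration.

```python
import itertools
import math


def match_centers(dist, max_dist=0.5, big=1e6):
    """Return (detected_idx, gt_idx) pairs of a minimum-cost one-to-one matching.

    dist[i][j] is the normalized distance between detected center i and
    ground truth center j. Pairs with dist > max_dist are never matched.
    """
    n_det = len(dist)
    n_gt = len(dist[0]) if n_det else 0
    size = max(n_det, n_gt)  # pad to a square problem with dummy rows/columns

    def cost(i, j):
        if i < n_det and j < n_gt and dist[i][j] <= max_dist:
            return dist[i][j]
        return big  # forbidden pair or padding

    best_perm, best_cost = (), math.inf
    for perm in itertools.permutations(range(size)):  # brute force, tiny inputs only
        total = sum(cost(i, perm[i]) for i in range(size))
        if total < best_cost:
            best_cost, best_perm = total, perm
    # Keep only real, allowed pairs; the rest of the centers stay unmatched.
    return [(i, j) for i, j in enumerate(best_perm) if cost(i, j) < big]


# Three detected centers, two ground truth centers.
distances = [[0.10, 0.90],
             [0.80, 0.20],
             [0.05, 0.70]]
print(match_centers(distances))  # [(1, 1), (2, 0)]
```

Detected centers that remain unmatched (here, index 0) would serve as negatives for the center classification module.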
Appendix C Implementation Details
We pretrain our backbone and keypoint detection module following HigherHRNet [12]. We then randomly initialize our encoding and grouping modules and train our entire model end-to-end with batch size 130 for a fixed number of iterations, which corresponds to approximately 50 epochs on COCO and 270 epochs on CrowdPose, and use linear learning rate warm-up during the first iterations. We use an Adam optimizer [31] with separate learning rates for pretrained layers and for the remaining parts of the network, which we drop by a factor of 10 at 10,000 and 20,000 iterations. In addition, we use automatic mixed precision for training [39], which reduces the memory requirements by approximately half, and allows training on 4 NVIDIA RTX6000 GPUs with 24GB of memory each in approximately 24 hours. We observe that our training loss shows high stability and allows training with mixed precision without any divergence problems, in contrast to Associative Embeddings. For data augmentation, we use the same techniques as [41], which include random flipping, rotation, scale variation, and generating a random crop of size 512x512 when using an HRNet32 backbone, or 640x640 when using an HRNet48 backbone.
We add one grouping module at the output of every transformer encoder block and compute the location, visibility and center losses, and then average them over the output of every transformer encoder block. Loss terms are balanced as follows: the heatmap loss is weighted by a factor of 10; the location loss is averaged over all visible keypoints in the image and weighted by 0.02; and the center and visibility losses are both weighted by a factor of 1. The overall set of weights is determined by ensuring that each loss term has a comparable magnitude.
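The loss balancing described above can be summarized in a few lines. This is a minimal sketch operating on plain floats; the function and argument names are hypothetical, and we assume that the heatmap loss is computed once while the grouping losses are computed per encoder block.

```python
def total_loss(heatmap_loss, block_losses,
               w_heatmap=10.0, w_location=0.02, w_center=1.0, w_visibility=1.0):
    """Combine loss terms with the weights given in the text.

    heatmap_loss: scalar heatmap loss.
    block_losses: one (location, center, visibility) tuple per transformer
                  encoder block; location is already averaged over visible keypoints.
    """
    per_block = [
        w_location * loc + w_center * cen + w_visibility * vis
        for loc, cen, vis in block_losses
    ]
    # Grouping losses are averaged over the outputs of all encoder blocks.
    grouping_loss = sum(per_block) / len(per_block)
    return w_heatmap * heatmap_loss + grouping_loss


# Example with two encoder blocks.
print(total_loss(0.1, [(50.0, 0.5, 0.5), (25.0, 0.3, 0.7)]))
```

The small weight on the location term compensates for the fact that it is measured in pixels, whereas the other terms are classification losses.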
At inference, we resize images, preserving their aspect ratio, so that their shorter side has size 512 when using an HRNet32 backbone, or 640 when using HRNet48. Following [12], predicted heatmaps are upsampled to full image resolution. We then extract peaks by applying heatmap Non-Maximum Suppression (NMS) with a max-pooling kernel of size 5x5 for keypoints and 17x17 for person centers, and select all peaks that either have a score over 0.01 or are within the top-5 scoring peaks in the heatmap.
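The peak extraction step can be sketched in pure Python as follows. This is an illustrative, unoptimized version (an actual implementation would use a max-pooling operation on the GPU); the function name and the returned tuple layout are assumptions.

```python
def extract_peaks(heatmap, kernel=5, min_score=0.01, top_k=5):
    """Heatmap NMS: keep pixels that equal the maximum of their kernel x kernel window.

    heatmap: 2D list of scores. Returns (score, y, x) tuples sorted by
    decreasing score, keeping peaks that score above min_score or that
    rank within the top_k peaks.
    """
    height, width = len(heatmap), len(heatmap[0])
    radius = kernel // 2
    peaks = []
    for y in range(height):
        for x in range(width):
            window_max = max(
                heatmap[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            if heatmap[y][x] == window_max:  # equivalent to heatmap == maxpool(heatmap)
                peaks.append((heatmap[y][x], y, x))
    peaks.sort(reverse=True)
    return [p for rank, p in enumerate(peaks) if p[0] > min_score or rank < top_k]
```

Under this sketch, keypoints would be extracted with `kernel=5` and person centers with `kernel=17`. Note that pixels in perfectly flat regions also pass the equality test, so the top-k rule can return low-scoring peaks when few real peaks exist.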
For every predicted center, we build its pose by assigning it, for every joint type, the keypoint of that type with the highest attention score, as explained in Section 4.2 in the main paper. Formally, given a center, the location of each of its joint types is set to the location of the keypoint of that type that receives the highest attention score from the center.
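A minimal sketch of this assignment rule for a single center is given below. The data layout (one list of attention scores and one list of candidate locations per joint type) and the function name are assumptions made for illustration.

```python
def build_pose(attention, keypoints):
    """Pick, for every joint type, the candidate with the highest attention score.

    attention[t][k]: attention score of the center over candidate k of type t.
    keypoints[t][k]: (x, y) location of candidate k of type t.
    Returns one location per joint type (None if a type has no candidates).
    """
    pose = []
    for scores, candidates in zip(attention, keypoints):
        if not candidates:
            pose.append(None)
            continue
        best = max(range(len(candidates)), key=lambda k: scores[k])
        pose.append(candidates[best])
    return pose


attention = [[0.1, 0.9], [0.7, 0.3]]
keypoints = [[(1, 1), (5, 5)], [(2, 2), (6, 6)]]
print(build_pose(attention, keypoints))  # [(5, 5), (2, 2)]
```

Since this is a per-type argmax, no optimization is involved at test time.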
In order to score the resulting poses, we use the predicted visibility scores for every keypoint, as well as the predicted probability that the center represents a true positive center. Intuitively, since visibility scores are only computed during training for those centers that are matched to a ground truth pose (i.e., matched centers), we only use them whenever our network predicts centers to represent true pose centers with probability over 0.5. In that case, the overall pose score is the average visibility confidence score of the keypoints that are predicted to be visible.
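The scoring rule can be sketched as follows. Two points are assumptions that the text does not state explicitly: that a keypoint counts as visible when its visibility score exceeds 0.5, and that the center probability itself is used as the pose score when it does not exceed 0.5. The function name is hypothetical.

```python
def pose_score(center_prob, visibilities, threshold=0.5):
    """Score a pose from its center probability and keypoint visibility scores."""
    if center_prob <= threshold:
        # Assumption: fall back to the center probability for unlikely centers.
        return center_prob
    visible = [v for v in visibilities if v > threshold]  # assumed visibility rule
    if not visible:
        return 0.0  # assumption: a pose with no visible keypoints gets score 0
    # Average visibility confidence over keypoints predicted to be visible.
    return sum(visible) / len(visible)


print(pose_score(0.3, [0.8, 0.6, 0.2]))  # 0.3
```

For a confident center, e.g. `pose_score(0.9, [0.8, 0.6, 0.2])`, the score is the mean of the two visibility scores above the threshold.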
Unlike [41, 43, 6], we do not perform top-down refinement or ensembling, and all results are reported with flip-testing, as is common practice [55, 41, 44]. For postprocessing, following [41, 12], keypoint coordinates are shifted by 0.25 towards the contiguous second maximal activation in each heatmap, to account for quantization errors.
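The quarter-pixel shift can be illustrated with the sketch below. It assumes the common implementation in which, along each axis, the two neighbors of the peak are compared and the coordinate is moved by 0.25 towards the larger one; the function name is hypothetical.

```python
def refine_peak(heatmap, y, x, shift=0.25):
    """Shift an integer peak location by `shift` towards its higher neighbor per axis."""
    height, width = len(heatmap), len(heatmap[0])
    ref_y, ref_x = float(y), float(x)
    if 0 < x < width - 1:
        diff = heatmap[y][x + 1] - heatmap[y][x - 1]
        if diff > 0:
            ref_x += shift
        elif diff < 0:
            ref_x -= shift
    if 0 < y < height - 1:
        diff = heatmap[y + 1][x] - heatmap[y - 1][x]
        if diff > 0:
            ref_y += shift
        elif diff < 0:
            ref_y -= shift
    return ref_y, ref_x


heat = [[0.0, 0.4, 0.0],
        [0.2, 1.0, 0.5],
        [0.0, 0.1, 0.0]]
print(refine_peak(heat, 1, 1))  # (0.75, 1.25)
```

Peaks on the image border are left unchanged along the axis where a neighbor is missing.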
C.3 Exact Architecture
Our keypoint detection network is minimally modified from HigherHRNet, as explained in Section 5.2 in the main paper. Our newly added modules include an additional residual block and a multi-layer perceptron (MLP) to generate initial keypoint and person features, a transformer encoder and the grouping module. Our transformer encoder has 3 blocks, each with input dimension 128, 4 self-attention heads and MLP hidden dimension set to 512. We found no significant performance benefits from further increasing the transformer’s size. The architecture of each transformer encoder block is unchanged from the original one [58], and is shown in Figure 4.
|Layer Name||# Parameters|
All of the MLPs in the grouping module, as well as the one generating the transformer’s input, contain two hidden layers. We detail the number of parameters of each component in Table 7. The overall parameter count of our proposed keypoint encoding and grouping module is below 2M, which is relatively small, and only accounts for 6% (resp. 3%) of the overall count when using an HRNet32 (resp. HRNet48) backbone.
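As a rough sanity check on these figures, the parameter count of the three-block transformer encoder can be estimated from the dimensions given above. The sketch assumes a standard encoder block (four attention projections with biases, a two-layer MLP and two layer normalizations); this layout is an assumption, and the counts in Table 7 remain authoritative.

```python
def encoder_block_params(d_model=128, d_mlp=512):
    """Parameter count of one standard transformer encoder block (assumed layout)."""
    # Query, key, value and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_mlp -> d_model, with biases.
    mlp = (d_model * d_mlp + d_mlp) + (d_mlp * d_model + d_model)
    # Two layer normalizations, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + mlp + layer_norms


per_block = encoder_block_params()
total = 3 * per_block
print(per_block, total)  # 198272 594816
```

Under these assumptions the encoder accounts for roughly 0.6M parameters, which is consistent with the overall count of under 2M reported above. The number of attention heads does not change the count, since the heads split the same projection matrices.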
Appendix D Qualitative Results
D.1 Qualitative Examples
In Figure 5, we visualize results produced by our method in comparison to those from our baselines: HigherHRNet and CenterNet. As explained in the main paper, we reimplement CenterNet to use an HRNet backbone and HigherHRNet’s scale-aware heatmaps for keypoint heatmap regression for a fair comparison.
We observe that our method’s performance is robust under severe occlusion and challenging conditions. In comparison, CenterNet often fails whenever there is significant overlap among different poses, as can be seen in rows 1, 4, 5, 6 and 7. Moreover, since it always predicts joint locations for a given pose regardless of whether they are visible or not, it often hallucinates joints and produces infeasible pose estimates (all rows).
HigherHRNet generally does a better job at grouping, as can be seen in rows 1, 4, 5, and 6, but this comes at a significantly increased computational cost of 2.5x inference time. Moreover, we observe that it tends to miss or assign very low confidence to large-sized poses (rows 2, 4, 5, 6).
Our method, instead, has an inference time comparable to CenterNet’s, due to its fast optimization-free test-time procedure, and has increased robustness where our baselines fail. Namely, it performs well in images with heavy occlusion, and, due to its ability to capture long-range connections with our attention mechanism, it does not struggle with large-sized poses.
D.2 Visualizing Attention Activations
In Figures 6 and 7, we visualize the attention output scores with which the results in Figure 5 were obtained. We observe that, despite the large number of keypoints over which each center attends, particularly in crowded scenes, attention scores are heavily concentrated over a small subset of keypoints for each center. Indeed, most attention scores for a given type have magnitude over 0.95, which can be seen from the dark color of most lines. This can be explained by our loss formulation: to achieve low training error, our model must concentrate attention weights on the most promising keypoint locations, as otherwise it would incur large L1 loss values. Overall, Figures 6 and 7 show how our model is able to consider a large number of center-keypoint association candidates but still focus on those keypoints belonging to each pose, even in highly challenging scenarios.