Human pose estimation is one of the vital tasks in computer vision and has received a great deal of attention from researchers for the past few decades. From the spatial aspect, this problem is divided into 2D and 3D human pose estimation. Geometrically, the 3D human pose might be predicted from the respective 2D human pose combined with 3D exemplar matching. This paper focuses on deep learning approaches for 2D human pose estimation, which aim to localize human anatomical keypoints on the torso, face, arms, and legs.
The pioneering deep learning method formulated human pose estimation as a CNN-based regression towards body joints. The model uses an AlexNet backbone (consisting of 7 layers) and an extra final layer that directly outputs joint coordinates. Later state-of-the-art methods reshaped this problem by estimating heatmaps for all human keypoints, where the k-th heatmap represents the location confidence of the k-th keypoint [28, 31, 20, 6, 32]. Heatmap-based approaches consist of two major parts as shown in Fig. 1: the first part (encoder) works as a feature extractor responsible for understanding the image, while the second part (decoder) generates the heatmaps corresponding to the human keypoints. Convolutional pose machines (CPM) used a multi-stage training scheme where the image features and the heatmaps produced by the previous stage are fed as the input; thus, the prediction is refined throughout the stages. Commonly, the output of the feature extractor is a set of low-resolution feature maps. Stacked Hourglass and Cascaded Pyramid Network (CPN) adopted a multi-resolution learning strategy to generate the heatmaps from feature maps at a variety of resolutions. Instead of independently processing each resolution as CPN does, Hourglass uses skip layers to preserve spatial information at each resolution. However, these two methods were surpassed when Xiao et al. proposed a simple yet effective baseline which utilizes ResNet as the backbone of its feature extractor, followed by a few deconvolutional layers as the heatmap generator (Fig. 2). SimpleBaseline for human pose estimation is the most effortless way to generate heatmaps from low-resolution feature maps, obtaining good performance on the MS-COCO 2017 benchmark (improving AP by 3.5 and 1.0 points over Hourglass and CPN respectively, with a similar backbone and input size).
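To make the heatmap formulation concrete, the regression target for each keypoint is commonly rendered as a 2D Gaussian centered on the annotated location. A minimal NumPy sketch (the kernel width `sigma` is a free illustrative choice, not a value taken from any of the cited papers):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Ground-truth heatmap for one keypoint: a 2D Gaussian centered at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

The confidence peaks at 1.0 on the keypoint and decays with distance; stacking one such map per keypoint gives the K-channel regression target that the decoder learns to reproduce.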
In the feature extractor, the deeper the layer is, the more specific the learned features are. For example, the first layer may learn general features by abstracting pixels and encoding edges; the second layer may learn how to arrange the edges; the third layer may encode a face; the fourth layer may encode the eyes. It is easy to see that the model needs to learn specialized features such as eyes and nose because they correspond to human keypoints. At the same time, there are many cases of occluded keypoints. For example, a wrist hidden behind the back may not be detected directly; however, we can still infer the wrist from other keypoints such as the elbow and shoulder, or even from the whole human skeleton. This means the model needs not only specific features but also overall patterns.
This paper is inspired by the idea that the simple architecture could be ameliorated if it can learn features from multiple resolutions, where the high resolution captures overall information and the low resolution extracts specific characteristics. We propose novel network architectures that combine the simple baseline with a multi-resolution learning strategy. Our first approach generates multi-resolution heatmaps after the lowest-resolution feature maps are obtained. To do so, we branch off at each resolution of the heatmap generator and add extra layers for heatmap generation. In our second approach, the networks directly learn heatmap generation at each resolution of the feature extractor. Our experiments were conducted on two common benchmarks for human pose estimation: MS-COCO and MPII. On the COCO val2017 dataset, our best model gains 0.6 AP points over SimpleBaseline with a similar backbone and input size. On the MPII dataset, our best model achieves a PCKh@0.5 of 89.8.
Contributions: Our main contributions are:
We introduce two novel approaches to achieve multi-resolution representation for both heatmap generation and feature map extraction.
Our architectures are simple yet effective, and experiments show the superiority of our approaches over numerous methods.
Our approaches could be applied to other tasks that have the architecture of encoder (feature extractor) - decoder (specific tasks) such as image captioning and image segmentation.
II Human pose estimation using deconvolutional layers as the heatmap generator
This section presents the simple baseline, whose heatmap generator is composed of deconvolutional layers. The network structure is illustrated in Fig. 2. From the input image, the model uses residual blocks to learn the features of the image. After each residual block, the resolution is decreased by half while the number of output channels is doubled. In Fig. 2, four residual blocks work together as a feature extractor; denoting the number of output channels of the first block by C, their numbers of output channels are C, 2C, 4C, and 8C respectively. We also use these notations for later architectures.
After reaching the lowest-resolution feature maps, the network begins a top-down sequence of upsampling to obtain high-resolution feature maps. Instead of using upsampling algorithms, SimpleBaseline leverages deconvolutional layers, each of which is built out of a transposed convolutional layer followed by batch normalization and ReLU activation. A final convolutional layer produces the high-resolution heatmaps representing the location confidence for all human keypoints. Mean Squared Error (MSE) is used as the loss function between the predicted and ground-truth heatmaps:

$$\mathcal{L} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H}\left(\hat{H}_k(i,j) - H_k(i,j)\right)^2$$
where $H_k$ and $\hat{H}_k$ are the ground-truth and predicted heatmaps of the k-th keypoint respectively, and $(W, H)$ is the size of each heatmap.
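The loss reduces to a mean of squared differences over all K heatmaps; a NumPy sketch:

```python
import numpy as np

def heatmap_mse(pred, gt):
    """MSE between predicted and ground-truth heatmaps of shape (K, H, W).
    A single mean over keypoints and pixels realizes the 1/K and 1/(WH) factors."""
    assert pred.shape == gt.shape
    return float(np.mean((pred - gt) ** 2))
```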
III Our method
To investigate the impact of multi-resolution representation, in this section, we propose learning the multi-resolution representation for both the heatmap generator and the feature extractor. These two approaches are referred to as multi-resolution heatmap learning and multi-resolution feature map learning, respectively. We use ResNet  as our feature extractor because it is the most common backbone network for image feature extraction.
III-A Multi-resolution heatmap learning
We started thinking about this kind of architecture by assuming that the ResNet backbone works very well for image feature extraction. The architectures of the multi-resolution heatmap learning are illustrated in Fig. 5. The lowest-resolution feature maps are fed into the sequence of deconvolutional layers to obtain the higher resolutions. The number of output channels of these deconvolutional layers is kept unchanged and is set to be equal to the number of output channels (denoted by C) of the first residual block.
In the baseline method, heatmaps are generated only after obtaining the highest resolution. In our method, we branch off at each deconvolutional layer (excluding the highest-resolution one) and add convolutional layers to generate low-resolution heatmaps. Higher-resolution heatmaps can then be obtained from the low-resolution heatmaps by using extra deconvolutional layers. The reason we do so is that the high-resolution feature maps help generate heatmaps with overall information while the low-resolution feature maps focus on specific characteristics. We propose two architectures with a slight difference, as shown in Fig. 5:
In Fig. 5(a), the lowest-resolution heatmaps are upsampled to the higher resolution (called the medium resolution) and then combined with the heatmaps generated at this medium resolution. The result of this combination is fed into a deconvolutional layer to obtain the highest-resolution heatmaps.
With a small change, in Fig. 5(b), the heatmaps at each resolution are independently upsampled to the highest resolution and then combined at the end.
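The element-wise combination in the second variant can be sketched as follows. Nearest-neighbor upsampling stands in for the learned deconvolutional layers here, which is an illustrative simplification, not the trained operator used in the actual networks:

```python
import numpy as np

def upsample_nn(hm, factor):
    """Nearest-neighbor upsampling as a stand-in for a learned deconvolution."""
    return np.repeat(np.repeat(hm, factor, axis=-2), factor, axis=-1)

def combine_resolutions(heatmaps):
    """Upsample each resolution's heatmaps (K, H_i, W_i), ordered low to high,
    to the highest resolution, then sum them element-wise."""
    target = heatmaps[-1].shape[-1]
    return sum(upsample_nn(hm, target // hm.shape[-1]) for hm in heatmaps)
```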
III-B Multi-resolution feature map learning
Instead of learning at each resolution of the heatmap generator as in the multi-resolution heatmap learning strategy, the multi-resolution feature map learning aims to directly learn how to generate the heatmaps at each resolution of the feature extractor (Fig. 8). At each residual block corresponding to each resolution of the feature extractor (excluding the lowest resolution), the network branches off and goes through respective deconvolutional layers to reach the highest resolution. Notably, the branch from the highest-resolution residual block does not go through any deconvolutional layers but goes directly to the element-wise sum component. At last, a convolutional layer is added to generate the predicted heatmaps for all keypoints.
Following this stream, we propose two architectures, as illustrated in Fig. 8(a) and Fig. 8(b). The main difference between these two architectures is the number of output channels of the deconvolutional layers. In the network shown in Fig. 8(a), the number of output channels of all deconvolutional layers is set to be equal to the number of output channels (C) of the highest-resolution residual block, which may lead to information loss.
The feature extractor consists of four residual blocks: the first residual block outputs C feature maps, the second residual block learns more features and outputs 2C feature maps at half the resolution, the third outputs 4C feature maps, and the fourth finally outputs the lowest-resolution feature maps with 8C channels. It is easy to see the principle of image feature extraction here: the number of feature maps is increased by a factor of 2 (more features are learned) while the resolution is halved. Therefore, in the top-down sequence of upsampling, whenever the resolution is doubled, the number of feature maps should be halved as well. In the network shown in Fig. 8(a), after the first deconvolutional layer in the main branch, the resolution of the feature maps is doubled, but the number of feature maps is decreased eight times (from 8C to C). Therefore, some previously learned information may be lost. To overcome this, the architecture in Fig. 8(b) uses deconvolutional layers whose numbers of output channels depend on the number of feature maps extracted by the previously adjacent layer. For instance, after the fourth residual block outputs the lowest-resolution feature maps with 8C channels, the numbers of output channels of the following deconvolutional layers are 4C, 2C, and C, respectively. The effectiveness of learning heatmap generation from multiple resolutions of the feature extractor will be clarified in Section IV.
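Under the doubling convention above, the channel counts follow a simple schedule. A small sketch, taking the first block's channel count C as a parameter (the mirrored deconvolution counts correspond to the second architecture's design; the concrete value of C is an assumption for illustration):

```python
def channel_schedule(c):
    """Residual blocks double channels per stage: C, 2C, 4C, 8C.
    The deconvolutional layers of the second architecture mirror them back down."""
    blocks = [c * 2 ** i for i in range(4)]   # C, 2C, 4C, 8C
    deconvs = blocks[-2::-1]                  # 4C, 2C, C
    return blocks, deconvs
```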
The COCO dataset contains more than 200k images and 250k person instances labeled with keypoints. Each person is annotated with 17 keypoints. We train our models on COCO train2017 dataset with 57k images and 150k person instances. Our models are evaluated on COCO val2017 and test-dev2017 dataset, with 5k and 20k images, respectively.
The MPII dataset contains around 25k images with over 40k person samples. Each person is annotated with 16 joints. MPII covers 410 human activities collected from YouTube videos whose contents are everyday human activities. Since the annotations of the MPII test set are not available, we train our models on a subset of 22k training samples and evaluate our models on a validation set of 3k samples.
We use different metrics for our evaluation on the MS-COCO and MPII datasets:
In the COCO dataset, each person object has ground-truth keypoints of the form $(x_i, y_i, v_i)$, where $(x_i, y_i)$ is the keypoint location and $v_i$ is a visibility flag ($v_i = 0$: not labeled, $v_i = 1$: labeled but not visible, and $v_i = 2$: labeled and visible). The standard evaluation metric is based on Object Keypoint Similarity (OKS):

$$\mathrm{OKS} = \frac{\sum_i \exp\left(-d_i^2 / 2s^2 k_i^2\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
In which, $d_i$ is the Euclidean distance between the detected keypoint and the corresponding ground truth, $v_i$ is the visibility flag of the ground-truth keypoint, $s$ is the object scale, and $k_i$ is a per-keypoint constant that controls falloff. Predicted keypoints that are not labeled ($v_i = 0$) do not affect the OKS. The OKS plays the same role as the IoU in object detection, so the average precision (AP) and average recall (AR) scores can be computed given the OKS.
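A NumPy sketch of the OKS computation; the per-keypoint constants $k_i$ come from the COCO evaluation setup and are passed in as arguments here rather than hard-coded:

```python
import numpy as np

def oks(pred, gt, vis, s, k):
    """Object Keypoint Similarity.
    pred, gt: (N, 2) keypoint coordinates; vis: (N,) visibility flags;
    s: object scale; k: (N,) per-keypoint falloff constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)   # squared Euclidean distances
    labeled = vis > 0                        # unlabeled keypoints are ignored
    if labeled.sum() == 0:
        return 0.0
    return float(np.mean(np.exp(-d2[labeled] / (2 * s ** 2 * k[labeled] ** 2))))
```

A perfect prediction yields an OKS of 1.0, and the score decays toward 0 as predictions drift from the ground truth, analogously to IoU in object detection.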
For the MPII dataset, we use the Percentage of Correct Keypoints with respect to head size (PCKh) metric. Firstly, we recall the Percentage of Correct Keypoints (PCK) metric. PCK is the percentage of correct detections that fall within a tolerance range defined as a fraction of the torso diameter. The equation can be expressed as:

$$\mathrm{PCK@}r = \frac{1}{K}\sum_{k=1}^{K}\delta\left(\lVert \hat{p}_k - p_k \rVert \le r \cdot \lVert p_{\mathrm{rhip}} - p_{\mathrm{lsho}} \rVert\right)$$
where $p_k$ and $\hat{p}_k$ are the ground-truth and predicted locations of the k-th keypoint respectively, $p_{\mathrm{rhip}}$ and $p_{\mathrm{lsho}}$ are the ground-truth locations of the right hip and left shoulder respectively, and $r$ is a threshold bounded between 0 and 1. $\lVert p_{\mathrm{rhip}} - p_{\mathrm{lsho}} \rVert$ represents the torso diameter. For example, PCK@0.2 ($r = 0.2$) means that the distance between the predicted and ground-truth keypoint must be at most 0.2 × the torso diameter. PCKh is almost the same as PCK except that the tolerance range is a fraction of the head size.
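A sketch of PCK under these definitions; the torso diameter is passed in, computed from the right-hip and left-shoulder annotations as described above:

```python
import numpy as np

def pck(pred, gt, torso_diameter, r=0.2):
    """Fraction of keypoints whose prediction falls within r * torso diameter.
    pred, gt: (K, 2) predicted and ground-truth keypoint coordinates."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= r * torso_diameter))
```

PCKh is obtained by substituting the head size for the torso diameter in the same computation.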
For all our experiments, we use ResNet as our backbone for image feature extraction, consisting of 4 residual blocks as shown in Fig. 5 and Fig. 8. Each deconvolutional layer and each convolutional layer uses a fixed kernel size. The numbers of output channels of the residual blocks, deconvolutional layers, and convolutional layers are denoted as shown in Fig. 5 and Fig. 8. C is set to 256. The number of predicted heatmaps is set to 17 for the COCO dataset and 16 for the MPII dataset.
| Method | Backbone | Pretrain | AP | AP50 | AP75 | APM | APL | AR | AR50 | AR75 | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8-stage Hourglass | 8-stage Hourglass | N | 66.9 | - | - | - | - | - | - | - | - | - |
| CPN + OHKM | ResNet-50 | Y | 69.4 | - | - | - | - | - | - | - | - | - |

Pretrain means the backbone is pre-trained on the ImageNet classification task.
| Method | Backbone | Input size | AP | AP50 | AP75 | APM | APL | AR | AR50 | AR75 | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Bottom-up approach: keypoint detection and grouping* | | | | | | | | | | | | |
| Associative Embedding | - | - | 65.5 | 86.8 | 72.3 | 60.6 | 72.6 | 70.2 | 89.5 | 76.0 | 64.6 | 78.1 |
| *Top-down approach: person detection and single-person keypoint detection* | | | | | | | | | | | | |
| G-RMI | ResNet-101 | 353×257 | 64.9 | 85.5 | 71.3 | 62.3 | 70.0 | 69.7 | 88.7 | 75.5 | 64.4 | 77.1 |
| Integral Pose Regression | ResNet-101 | 256×256 | 67.8 | 88.2 | 74.8 | 63.9 | 74.0 | - | - | - | - | - |
| G-RMI + extra data | ResNet-101 | 353×257 | 68.5 | 87.1 | 75.5 | 65.8 | 73.3 | 73.3 | 90.1 | 79.5 | 68.1 | 80.4 |
| SimpleBaseline | ResNet-50 | 256×192 | 70.0 | 90.9 | 77.9 | 66.8 | 75.8 | 75.6 | 94.5 | 83.0 | 71.5 | 81.3 |
| SimpleBaseline | ResNet-101 | 256×192 | 70.9 | 91.1 | 79.3 | 67.9 | 76.7 | 76.7 | 94.9 | 84.2 | 72.7 | 82.2 |
| SimpleBaseline | ResNet-152 | 256×192 | 71.6 | 91.2 | 80.1 | 68.7 | 77.2 | 77.2 | 94.9 | 85.0 | 73.4 | 82.6 |
| *Our multi-resolution representation learning models* | | | | | | | | | | | | |
IV-A Experimental results on COCO dataset
Training. The data pre-processing and augmentation follow the baseline setting. The ground-truth human bounding box is extended in height or width to a fixed aspect ratio (height : width = 4 : 3). The human box cropped from the image is resized to a fixed size of 256×192 for a fair comparison with [20, 6, 32]. The data augmentation includes random rotation, random scaling, and flipping. We use the Adam optimizer. The batch size is 64. The base learning rate is dropped twice at scheduled epochs, and the training process is terminated within 170 epochs.
Testing. We use the two-stage top-down paradigm, similar to [6, 32]. Keypoint locations are obtained from the highest heat value's location in the predicted heatmaps, plus a quarter offset in the direction from the highest response to the second-highest response.
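The decoding step can be sketched as follows. This uses the sign of the neighboring heat values along each axis to apply the quarter-pixel offset, a common implementation of the rule described above and a simplification, not necessarily the authors' exact code:

```python
import numpy as np

def decode_heatmap(hm):
    """Keypoint location: argmax plus a quarter-pixel offset toward
    the higher of the two neighbors along each axis."""
    h, w = hm.shape
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    loc = np.array([x, y], dtype=float)
    if 0 < x < w - 1:
        loc[0] += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
    if 0 < y < h - 1:
        loc[1] += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
    return loc
```

The sub-pixel shift compensates for the quantization introduced by predicting on a coarse heatmap grid.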
Comparisons on the COCO val2017 dataset. TABLE I reports our evaluation results compared to Hourglass, CPN, and SimpleBaseline. Note that the results of Hourglass are cited from prior work. For a fair comparison, we use the Faster R-CNN detector with a detection AP of 56.4 (the same as that of SimpleBaseline), while the person detection AP of Hourglass and CPN is 55.3.
As shown in TABLE I, both of our architectures outperform Hourglass and CPN. With the same ResNet-50 backbone, our MRFeaNet2 achieves an AP score of 70.9, improving the AP by 4.0 and 2.3 points over Hourglass and CPN respectively. Online Hard Keypoints Mining (OHKM) proved its efficiency by helping CPN gain 0.8 AP points (from 68.6 to 69.4), but this is still 1.5 points lower than the AP of MRFeaNet2.
Compared to SimpleBaseline, our multi-resolution heatmap learning architectures have slightly worse performance. When using the ResNet-50 backbone, SimpleBaseline has an AP score of 70.4 while the AP scores of MRHeatNet1 and MRHeatNet2 are 70.2 and 70.3 respectively. A possible explanation is that the deconvolutional layers cannot completely recover all the information which the feature extractor has already learned, so learning only from the outputs of the deconvolutional layers is not enough to generate the heatmaps.
On the other hand, our multi-resolution feature map learning architectures perform better than SimpleBaseline. With the ResNet-50 backbone, MRFeaNet1 gains 0.2 AP points while the AP of MRFeaNet2 increases by 0.5 points. MRFeaNet2 still obtains AP improvements of 0.4 and 0.6 points over SimpleBaseline when using the ResNet-101 and ResNet-152 backbones, respectively. This proves that learning heatmap generation from multiple resolutions of the feature extractor can improve the performance of keypoint prediction.
Comparisons on the COCO test-dev dataset. TABLE II shows the performance of our models and previous methods on the COCO test-dev dataset. Note that the results of SimpleBaseline are reproduced by us using the provided models. We use a human detector with a person detection AP of 60.9 on COCO test-dev for SimpleBaseline and our models. Our networks outperform the bottom-up approaches. Our MRFeaNet2 achieves an AP improvement of 2.2 points over MultiPoseNet. In comparison with top-down approaches, our models are better even with a smaller backbone and image size. Our MRFeaNet2, which uses the ResNet-50 backbone, obtains an AP of 70.4 while the AP score of G-RMI is 68.5 despite its larger backbone network, larger image size, and extra training data. Compared to SimpleBaseline, our MRFeaNet2 still improves the AP by 0.4, 0.3, and 0.2 points when using the ResNet-50, ResNet-101, and ResNet-152 backbones, respectively.
IV-B Experimental results on MPII dataset
Training. The data pre-processing and augmentation are similar to the settings in the experiment on the COCO dataset. The input size of the human bounding box is fixed for a fair comparison with other methods. The data augmentation includes random rotation, random scaling, and flipping. The Adam optimizer is also used. The batch size is 64. The learning rate is dropped twice at scheduled epochs before the training process terminates.
| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
|---|---|---|---|---|---|---|---|---|
| Pishchulin et al. | 74.3 | 49.0 | 40.8 | 34.1 | 36.5 | 34.4 | 35.2 | 44.1 |
| Tompson et al. | 95.8 | 90.3 | 80.5 | 74.3 | 77.6 | 69.7 | 62.8 | 79.6 |
| Carreira et al. | 95.7 | 91.7 | 81.7 | 72.4 | 82.8 | 73.2 | 66.4 | 81.3 |
| Tompson et al. | 96.1 | 91.9 | 83.9 | 77.8 | 80.9 | 72.3 | 64.8 | 82.0 |
| Hu et al. | 95.0 | 91.6 | 83.0 | 76.6 | 81.9 | 74.5 | 69.5 | 82.4 |
| Pishchulin et al. | 94.1 | 90.2 | 83.4 | 77.3 | 82.6 | 75.7 | 68.6 | 82.4 |
| Lifshitz et al. | 97.8 | 93.3 | 85.7 | 80.4 | 85.3 | 76.6 | 70.2 | 85.0 |
| Gkioxary et al. | 96.2 | 93.1 | 86.7 | 82.1 | 85.2 | 81.4 | 74.1 | 86.1 |
| Rafi et al. | 97.2 | 93.9 | 86.4 | 81.3 | 86.8 | 80.6 | 73.4 | 86.3 |
| Belagiannis et al. | 97.7 | 95.0 | 88.2 | 83.0 | 87.9 | 82.6 | 78.4 | 88.1 |
| Insafutdinov et al. | 96.8 | 95.2 | 89.3 | 84.4 | 88.4 | 83.4 | 78.0 | 88.5 |
| Wei et al. | 97.8 | 95.0 | 88.7 | 84.0 | 88.4 | 82.8 | 79.4 | 88.5 |
Testing. We use the human bounding boxes provided with the images. TABLE III shows the PCKh scores of our architectures and previous methods at the 0.5 threshold. The results of SimpleBaseline are reproduced by us using the provided models.
Similar to the experiments on the COCO dataset, our multi-resolution representation learning architectures outperform numerous previous methods. In comparison with SimpleBaseline, the multi-resolution feature map learning method achieves better performance. Our MRFeaNet1 improves the PCKh@0.5 score by 0.6, 0.3, and 0.2 points over SimpleBaseline when using the ResNet-50, ResNet-101, and ResNet-152 backbones, respectively.
On the other hand, the results also show that performance could be improved by using a larger backbone network. To make this statement clear, the PCKh@0.5 scores of SimpleBaseline and our models are presented in a chart in Fig. 9. MRFeaNet1 with the ResNet-152 backbone, which is the best model on the MPII dataset, obtains score improvements of 0.4 and 0.7 points over its ResNet-101 and ResNet-50 counterparts respectively. MRHeatNet1 achieves the highest improvement, 1.1 points, when the backbone network is scaled up from ResNet-50 to ResNet-152.
IV-C Qualitative results
Qualitative results on the COCO test2017 dataset. We use our models trained on the COCO train2017 dataset with the ResNet-50 backbone to visualize human keypoint prediction. Our qualitative results on the unseen images of the COCO test2017 dataset are shown in Fig. 10. Both of our model families work well on the simple cases (the 1st and 2nd rows).
The figures in the 3rd and 4th rows are harder, with some occluded keypoints, but the multi-resolution feature map learning models still predict the human keypoints relatively precisely. The multi-resolution heatmap learning models do not work as well: MRHeatNet1 omits the right elbow in the 3rd row, and the eye detection of MRHeatNet2 is not reasonable in either of these two cases.
In the 5th row, both legs of the woman are hidden under the table, but all of our models can still make a prediction. The prediction results differ among the models; looking carefully at the hip predictions, the locations proposed by MRFeaNet2 are the most reasonable.
Qualitative results on the MPII dataset. We use our MRFeaNet1 model trained on a subset of the MPII training set with the ResNet-152 backbone to visualize human keypoint prediction. Fig. 11 shows the keypoint predictions and corresponding heatmaps on the unseen images of the MPII test set. Each heatmap represents the location confidence of the respective keypoint. In the simple cases, as in the 1st and 2nd rows, all keypoints are predicted with high confidence.
The man in the 3rd row has his right leg and left ankle occluded, so the prediction of these keypoints has low confidence. However, all prediction results in this case are reasonable and acceptable.
Notably, the man in the 4th row has both ankles out of view, so the ankle predictions are unreasonable. However, the heatmaps corresponding to these two ankles are still meaningful: there is no location predicted with high confidence.
In this paper, we introduced two novel approaches for multi-resolution representation learning to solve human pose estimation. The first approach combines a multi-resolution representation learning strategy with the heatmap generator, where the heatmaps are generated at each resolution of the deconvolutional layers. The second approach achieves heatmap generation from each resolution of the feature extractor. While our multi-resolution feature map learning models outperform the baseline and many previous methods, the proposed architectures remain relatively straightforward and easy to integrate. Future work includes applications to other tasks that have an encoder-decoder architecture (feature extraction - specific tasks) such as image captioning and image segmentation.
- (2014) 2D human pose estimation: new benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3686–3693.
- (2017) Recurrent human pose estimation. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 468–475.
- (2017) Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299.
- (2016) Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4733–4742.
- (2017) 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7035–7043.
- (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112.
- COCO - Common Objects in Context. Note: http://cocodataset.org/#keypoints-eval
- (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
- (2016) Chained predictions using convolutional neural networks. In European Conference on Computer Vision, pp. 728–743.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2016) Bottom-up and top-down reasoning with hierarchical rectified Gaussians. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5600–5609.
- (2016) DeeperCut: a deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pp. 34–50.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2018) MultiPoseNet: fast multi-person pose estimation using pose residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 417–433.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2016) Human pose estimation using deep consensus voting. In European Conference on Computer Vision, pp. 246–260.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- (2017) Associative embedding: end-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pp. 2277–2287.
- (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483–499.
- (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–286.
- (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911.
- (2013) Strong appearance and expressive spatial models for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3487–3494.
- (2016) DeepCut: joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4929–4937.
- (2016) An efficient convolutional network for human pose estimation. In BMVC, Vol. 1, pp. 2.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2018) Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545.
- (2015) Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656.
- (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pp. 1799–1807.
- (2014) DeepPose: human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660.
- (2016) Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732.
- (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481.
- (2011) Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011, pp. 1385–1392.