The recent progress in machine learning techniques has allowed computer vision to move beyond bounding box estimation and solve tasks that necessitate a fine-grained understanding of scenes and people. In particular, estimating human poses in the wild is an area of research that has seen tremendous progress thanks to the development of large convolutional neural networks (CNNs) that can efficiently reason about human poses under large occlusions or poor image quality. However, most of the best-performing models have 30 to 60 million parameters, prohibiting their usage in systems where compute and power are constrained, e.g. mobile phones. Decreasing the size of these deep models most often results in a drastic loss of accuracy, making this a last-resort option for improving efficiency. One might then rightfully ask: could some form of structural prior be used to reduce the size of these models while maintaining good accuracy?
A key physiological property that is shared across humans is the kinematic structure of the body. Our understanding of this structure allows us to accurately estimate the location of all body parts of other people even under occlusions. This naturally brings up the question: could a deep neural network (DNN) make use of this kinematic structure to achieve highly accurate human body pose estimation while keeping the model complexity small?
In fact, we are not the first to consider utilizing the kinematic structure of the body in a machine learning model. Indeed, early computer vision techniques for human pose estimation used part-based graphical models . While the kinematic structure of the human body is well defined, the distribution of joint distances and angles is complex and hard to model explicitly when projected onto 2D. Therefore, these earlier approaches often simplified this distribution, for example to Gaussian potentials . In contrast, our approach only encodes the kinematic structure in the network architecture and lets the network learn the priors from data.
Some more recent deep learning approaches have also made use of a kinematic prior by approximating a maximum a posteriori (MAP) solver within a neural network [10, 27], typically by implementing the max-sum algorithm within the network. Running a MAP solver is computationally expensive and requires an explicit description of the distribution. Our approach avoids these issues by employing a kinematically structured network. This allows us to incorporate the structural prior without incurring the computational penalty. We encode this structure at a coarse resolution, where a small receptive field is large enough to capture and spatially correlate neighbouring pose keypoints. Moreover, employing the kinematic feature update module at a coarse resolution keeps our network lightweight. Finally, our method successfully refines the predicted pose hierarchically through a feature pyramid until the finest resolution is reached. Figure 1 illustrates how the predicted pose improves throughout the various updates.
To summarize, our main contributions are as follows:
A novel network architecture that encodes the kinematic structure via feature updates at coarse resolution, without the need for including any approximate MAP inference steps.
A lightweight kinematic feature update module that achieves a significant improvement in accuracy, while only adding a small number of learnable parameters compared to state-of-the-art approaches.
Extensive evaluation showing state-of-the-art results on the LSP dataset and competitive results on the MPII dataset using a lightweight network, without resorting to model compression techniques such as distillation.
2 Related Work
Human pose estimation is a fundamental problem in computer vision and an active field of research. Early attempts at solving this problem were based on kinematically inspired statistical graphical models, e.g. [11, 24, 5, 27], for modeling geometric and structural priors between keypoints, e.g. elbow and wrist, in 2D images.
These techniques either imposed limiting assumptions on the modeled distribution, or relied on sub-optimal methods for solving such graphical models within a deep learning framework [24, 5, 27]. For example,  assumed that the distance between a pair of keypoints could be modeled as a Gaussian distribution. Although efficient optimization methods exist for such a model, in practice the model is fairly simple and does not capture the complex global relations between keypoints, especially in 2D image space.
More recent approaches such as  applied loopy belief propagation, without any guarantees of optimality or convergence, in an effort to infer the MAP estimate of a pose within a deep learning framework. The loopy belief propagation used in  and the dynamic programming used in  are computationally expensive. Furthermore, such networks are harder to train in general [15, 17], and the inferred MAP estimate is not informative during the early stages of training, when networks are still learning to extract low-level features.
The top-performing pose estimation methods are based on DNNs [23, 19, 26, 8, 16, 6], which are capable of modeling complex geometric and appearance distributions between keypoints. In search of better performance on benchmarks, novel architectures and strategies have been devised, such as adversarial data augmentation , feature pyramids , pose GANs  and network stacking , the last of which is a commonly used strategy that other methods [19, 26, 8, 2, 6] build on due to its simplicity and effectiveness.
In general, better pose estimates can be reached by successively refining the estimated pose. Carreira et al.  refined their initially estimated keypoint heatmaps by using a single additional refinement network, repeatedly applying it to predict a new pose estimate given the current one. The stacking used in  can be seen as an unrolling of Carreira et al.'s  refinement approach, with seven consecutive refinement networks that do not share weights. Although refinement unrolling achieves significantly better results than a single repeated refinement step , it is very expensive: e.g. ,  and  require 18/38 , 28 and 60+ million parameters, respectively.
There are DNNs that aim to learn spatial priors between keypoints without resorting to MAP inference approximation. In , keypoints are clustered into sets of correlated keypoints, and each set has its own independent features; e.g., knee features do not directly affect hip features. The clustering was based on a mutual information measure, but the clustering threshold was heuristically chosen. In contrast, RePose allows neighbouring keypoints to directly influence each other's features. Furthermore,  relies heavily on network stacking, while stacking only slightly improves RePose's accuracy. Unlike RePose,  does not apply hierarchical pose refinement and relies on a handcrafted post-processing step to suppress false positives in heatmaps. Finally, [23, 7] are significantly larger networks than RePose.
In reality, those approaches sacrificed practicality, in terms of network size, for better benchmark performance. There have been a number of recent attempts to find lightweight pose estimation networks that achieve close to state-of-the-art performance [3, 30]. In , the authors explored weight binarization [21, 9], which enabled them to replace multiplications with bitwise XOR operations. Their approach, however, resulted in a significant drop in performance. Recently,  was successful in distilling the stacked-hourglass network  with a minimal drop in performance.
In Section 3 we describe our approach, RePose, for encoding geometric and structural priors via convolutional feature updates. We then compare our approach to various state-of-the-art methods in Section 4 and present extensive ablation studies of our model components. Finally, Section 5 summarizes our findings and contributions.
Let  denote an image. In our work, a human pose is represented by 2D keypoints, e.g. head, left ankle, etc., where  is the keypoint of example  in the dataset. Our approach predicts a set of heatmaps, one for each keypoint. The ground truth heatmap of keypoint  is an unnormalized Gaussian centered at  with standard deviation . (In our experiments, we set it to 5 for a  input image size.)
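The ground-truth target for a single keypoint can be sketched as follows. The standard deviation of 5 comes from the text above; the 64x64 grid is purely illustrative, as the exact heatmap resolution is not stated here.

```python
import numpy as np

def gaussian_heatmap(center, size, sigma=5.0):
    """Unnormalized Gaussian target heatmap centered at a keypoint.

    `size` is (height, width); sigma=5 follows the paper's footnote,
    while the 64x64 grid used below is only an illustrative choice.
    """
    ys, xs = np.mgrid[0:size[0], 0:size[1]].astype(np.float64)
    cy, cx = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

hm = gaussian_heatmap(center=(32, 48), size=(64, 64))
```

Because the Gaussian is unnormalized, the heatmap peaks at exactly 1 at the keypoint location, which keeps the regression targets on a fixed scale across keypoints.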
To simplify our network description, we define a convolutional block. Convolutional blocks are denoted by , where ,  and  are the kernel size, stride and number of output filters, respectively. In addition,  denotes a convolutional block without a batch normalization layer.
Figure 2 shows our network architecture. At the coarsest resolution, the features are decoupled into independent sets of features. To encourage each set of features to correspond to a unique keypoint, we predict a single heatmap from each of the sets. (We used  and  convolutional blocks per heatmap.) Afterwards, we concatenate all predicted heatmaps to form the pre-update heatmaps.
Next, we update the decoupled sets of features according to a predefined ordering and kinematic connectivity, which is covered in Section 3.2. We then use the updated features to compute post-update heatmaps, in the same manner as the pre-update heatmaps. At this point we concatenate all the features used to predict the post-update heatmaps; this step is shown as a white circle in Figure 2.
The concatenated features are then bilinearly upsampled, concatenated with the skip connection, and projected back to  channels. At each resolution, heatmaps are predicted and then bilinearly upsampled to full resolution. The refinement procedure continues as depicted in Figure 2 until full resolution is reached. Finally, loss (3) is applied to all predicted heatmaps.
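One refinement step can be sketched as below. Nearest-neighbour upsampling stands in for the bilinear upsampling used in the network, and the 1x1 projection is written as a per-pixel matrix product; all channel counts and spatial sizes are illustrative assumptions.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour stand-in for bilinear upsampling; feat is (C, H, W)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def refine_step(feat_coarse, skip, proj_weight):
    """One coarse-to-fine step: upsample the coarse features, fuse them with
    the encoder skip connection, and project back to the working channel
    width with a 1x1 convolution (a per-pixel matrix product)."""
    up = upsample2x(feat_coarse)                  # (C, 2H, 2W)
    fused = np.concatenate([up, skip], axis=0)    # (C + C_skip, 2H, 2W)
    c, h, w = fused.shape
    out = proj_weight @ fused.reshape(c, h * w)   # 1x1 conv as a matmul
    return out.reshape(-1, h, w)

rng = np.random.default_rng(0)
coarse = rng.standard_normal((16, 8, 8))   # illustrative channel/size choices
skip = rng.standard_normal((16, 16, 16))
proj = rng.standard_normal((16, 32))       # (C_out, C + C_skip)
fine = refine_step(coarse, skip, proj)
```

Repeating this step per pyramid level, with a heatmap head after each one, gives the hierarchical refinement depicted in Figure 2.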
Without the feature decoupling and kinematic updates discussed in Section 3.2, RePose reduces to a UNet/Hourglass-style architecture with intermediate supervision.
3.2 Kinematic Feature Updates
As shown in Figure 2, the kinematic feature update part of our network receives the decoupled sets of features . The basic idea at this stage is to update these sets of features in a way that enables the network to learn kinematic priors between keypoints and how to correlate them. As such, we update the decoupled keypoints' features according to a predefined ordering and kinematic connectivity.
Our predefined ordering starts with keypoints that are more likely to be predicted with high fidelity, e.g. hips or head, and ends with the usually poorly predicted ones, e.g. wrists or ankles; see Figure 3 for the predefined ordering used in our approach.
The connectivity defines which keypoints we expect the network to learn to correlate. In our method, connectivity is not restricted to trees. We use an undirected graph to define this connectivity, where each keypoint is represented by a unique node and the set of edges encodes the desired connectivity; see Figure 3. For a keypoint/node , let  be the ordered set of its neighbouring keypoints w.r.t. .
We update the keypoints one at a time, following the predefined ordering. The features of keypoint  are updated according to (1) and (2), where  is a trainable parameter, and  and  are  and  convolutional blocks, respectively. In (1) we simply concatenate  and all the features of its neighbouring keypoints. Then  projects the concatenated features to  channels, which then pass through four convolutional blocks. The features are updated via the residual connection (2) with a trainable weight. Finally, inspired by message passing techniques, we update the features one more time w.r.t. the reversed ordering. It should be noted that the two passes do not share any trainable parameters.
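The ordered update can be sketched as follows. The toy graph, feature dimensions, and the linear maps standing in for the convolutional blocks of (1) and (2) are all illustrative assumptions; the network operates on spatial feature maps and runs a second pass in reversed order with separate parameters.

```python
import numpy as np

def kinematic_update(feats, order, neighbours, update_fns, alphas):
    """Update per-keypoint features one at a time along a predefined ordering.

    Each keypoint sees the (possibly already updated) features of its graph
    neighbours; the result is blended with the old features through a
    trainable residual weight, mimicking the residual update of Eq. (2)."""
    feats = dict(feats)  # do not mutate the caller's dict
    for k in order:
        stacked = np.concatenate([feats[k]] + [feats[n] for n in neighbours[k]])
        feats[k] = alphas[k] * feats[k] + update_fns[k](stacked)
    return feats

# Illustrative toy chain: hips first, extremities last.
dim = 4
neighbours = {"hip": ["knee"], "knee": ["hip", "ankle"], "ankle": ["knee"]}
order = ["hip", "knee", "ankle"]
rng = np.random.default_rng(1)
feats = {k: rng.standard_normal(dim) for k in order}
alphas = {k: 1.0 for k in order}
update_fns = {  # a linear projection per keypoint, standing in for conv blocks
    k: (lambda s, W=rng.standard_normal((dim, dim * (1 + len(neighbours[k])))): W @ s)
    for k in order
}
updated = kinematic_update(feats, order, neighbours, update_fns, alphas)
# A second pass over order[::-1] would use fresh, unshared parameters.
```

Because later keypoints in the ordering read features that were already updated, confident keypoints (hips, head) can inform the harder ones (wrists, ankles) within a single pass.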
Our loss is a partial Mean Squared Error (MSE), where  is the batch size and  is the heatmap predicted by the network. Some of the images in the datasets are not fully annotated; as such, we define  to be the set of annotated keypoints of example . It should be noted that MSE is a fairly standard loss for pose estimation, but to the best of our knowledge its partial counterpart has not been used before. As shown in Figure 2, RePose produces multiple heatmaps/predictions for intermediate supervision. Our total loss is the sum of (3) over all the predicted heatmaps.
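A sketch of the partial MSE follows; averaging over the annotated heatmaps is an assumed normalization, since the exact form of (3) is not reproduced here.

```python
import numpy as np

def partial_mse(pred, target, annotated):
    """MSE restricted to annotated keypoints.

    pred, target: arrays of shape (B, K, H, W); annotated[b] is the set of
    keypoint indices that carry ground-truth labels for example b.
    Unannotated keypoints contribute no gradient at all."""
    total, count = 0.0, 0
    for b, keys in enumerate(annotated):
        for k in keys:
            total += float(np.mean((pred[b, k] - target[b, k]) ** 2))
            count += 1
    return total / max(count, 1)

target = np.zeros((2, 3, 4, 4))
pred = np.zeros((2, 3, 4, 4))
pred[0, 1] = 1.0  # error only on a keypoint we treat as unannotated
loss = partial_mse(pred, target, annotated=[{0, 2}, {0, 1, 2}])
```

Masking out unannotated keypoints, rather than regressing them to empty heatmaps, avoids penalizing the network for keypoints whose true locations are simply unknown.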
We evaluated our RePose network on two standard pose estimation datasets, namely Leeds Sports Pose (LSP) [13, 14] and MPII Human Pose . MPII is more challenging than LSP, as poses in MPII cover a large number of activities. Furthermore, MPII has a large number of spatially inseparable poses, which frequently occur in crowded scenes. MPII provides an estimate of the pose center and scale, while LSP does not. To allow for joint training on both datasets, we used an estimated pose center and scale for the LSP training set, as done in [26, 16, 25]. For the LSP test set, the scale and center were set to the image's size and center, respectively.
|Method||# Params||FLOPS|
|Tompson et al. NIPS 14||-||-|
|Rafi et al. BMVC 16||56M||28G|
|Yang et al. CVPR 16||-||-|
|Yu et al. ECCV 16||-||-|
|Carreira et al. CVPR 16||-||-|
|Yang et al. ICCV 17||28M||46G|
|Peng et al. CVPR 18||26M||55G|
|lightweight pose estimation approaches|
|Fast Pose CVPR 19||3M||9G|
|Method||# Params||FLOPS|
|Insafutdinov et al. ECCV 16||66M||286G|
|Rafi et al. BMVC 16||56M||28G|
|Wei et al. CVPR 16||31M||351G|
|Newell et al. ECCV 16||26M||55G|
|Chu et al. CVPR 17||58M||128G|
|Yang et al. ICCV 17||28M||46G|
|Nie et al. CVPR 18||26M||63G|
|Peng et al. CVPR 18||26M||55G|
|lightweight pose estimation approaches|
|Sekii ECCV 18||16M||6G|
|Fast Pose CVPR 19||3M||9G|
Similar to [26, 16, 25], we augmented the training data by cropping according to the provided pose scale and center, and resized the crop to . Furthermore, the training data was augmented by scaling , rotation between , horizontal flipping, and color noise (i.e. jitter, brightness and contrast). Our network described in Section 3.1 results in a model with M parameters and GFLOPS, which was trained jointly on LSP and MPII. We used the Adam optimizer to train our network with a batch size of 64 and a predefined stopping criterion at 2M steps. The initial learning rate was set to , and was dropped to  and  at 1M and 1.3M steps, respectively. Contrary to other approaches, we did not fine-tune our model on a specific dataset.
For evaluation, we used single pose estimation metrics commonly adopted in the literature. As per the LSP and MPII benchmarks, we used two variants of the Probability of Correct Keypoints [28, 1] metric: PCK@0.2 for LSP and PCKh@0.5 for MPII. The former uses 20% of the torso diameter to define the threshold for identifying correctly predicted keypoints, while the latter uses 50% of the head segment length. The validation set of [25, 26, 30] was used for evaluating our model on MPII.
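The metric can be sketched as follows; `ref_length` is the torso diameter for PCK@0.2 or the head segment length for PCKh@0.5, and the toy coordinates below are illustrative.

```python
import numpy as np

def pck(pred, gt, ref_length, alpha):
    """Fraction of keypoints predicted within alpha * ref_length of the
    ground truth. pred, gt: (K, 2) arrays of 2D keypoint locations."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= alpha * ref_length))

gt = np.zeros((3, 2))
pred = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 5.0]])
score = pck(pred, gt, ref_length=10.0, alpha=0.2)  # threshold = 2 pixels
```

Normalizing by a per-person reference length makes the metric invariant to the scale of the subject in the image.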
Quantitative results comparing our trained model to various state-of-the-art approaches on the LSP and MPII datasets are shown in Tables 1 and 2, respectively. As shown in Table 1, RePose surpasses Yang et al.  and Tompson et al. , which try to approximate a MAP solver of a statistical graphical model within a deep neural network, by a large margin. Furthermore, our approach performs better than Fast Pose  by  on average. As shown in Table 2, RePose reaches results comparable to Fast Pose  and the stacked-hourglass network . Our network reaches better performance on MPII at the expense of increasing the number of trainable parameters and FLOPS; see Table 6. However, the gain in performance does not seem to justify doubling the network size.
for a sample) are skewed towards scenes with a large number of occluded keypoints or spatially inseparable poses. Intuitively, kinematically updating features does not perform as well in those cases, since there are not enough accurately localized keypoints to enhance the prediction of the remaining ones.
4.1 Ablation Study
We conduct ablation studies to show the effectiveness of different configurations and components of RePose.
Coarsest Resolution for Kinematic Updates
One important question is: what is the coarsest resolution at which our kinematic feature updates are most effective? Table 3 shows the results of applying the updates at different resolutions. On the one hand, it is clear that applying kinematic updates at the  resolution degrades performance significantly, on average by . If we were to randomly place keypoints on an  pixel grid, there would be a more than even chance (assuming keypoints are i.i.d., this chance is ) that two or more keypoints land on the same pixel. For  as in LSP, this chance is  and  for the  and  resolutions, respectively. On the other hand, at the  resolution the number of FLOPS increases by  compared to the  resolution. Furthermore, applying the updates at higher resolutions could adversely affect performance, since the receptive field would not be large enough to capture all neighbouring keypoints and properly correlate their features.
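The collision chance under the i.i.d. assumption is the classic birthday-problem computation. The 14 keypoints (the LSP annotation count) and the 8x8 grid below are illustrative stand-ins for the figures elided above.

```python
def collision_prob(num_keypoints, grid_cells):
    """Probability that at least two of `num_keypoints` i.i.d. uniformly
    placed keypoints land on the same cell of a grid with `grid_cells`
    cells: 1 minus the probability that all land on distinct cells."""
    p_distinct = 1.0
    for i in range(num_keypoints):
        p_distinct *= (grid_cells - i) / grid_cells
    return 1.0 - p_distinct

# 14 keypoints on an 8x8 grid: a collision is more likely than not.
p = collision_prob(14, 8 * 8)
```

This is why very coarse grids are a poor place for per-keypoint feature decoupling: distinct keypoints frequently share the same pixel, making their heatmap targets ambiguous.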
Feature Update Step
We tried using different numbers of convolutional blocks in each kinematic update step (2). As shown in Table 4, increasing the number of blocks beyond four degrades performance. We also tested different strategies for applying the residual connection of the update. Table 5 shows the results of using trainable weights as in (2), adding the old features to the updated ones, or completely replacing the old ones. Using trainable weights leads to a significant performance gain, especially on MPII, where occlusions are more common.
Network stacking  is a popular technique for increasing network performance. For completeness, Table 6 shows results for stacking. RePose reaches results comparable to state-of-the-art methods [26, 19] on LSP, while only using  of the required trainable weights. Finally, stacked RePose networks train significantly faster, requiring less than half the number of steps compared to a single network.
Kinematically Ordered vs Sequential Updates
To show how ordering the convolutions helps performance, we replaced the feature update step with a series of sequential convolutional blocks, such that the resulting model has roughly the same number of parameters. The sequential model reached  and  on LSP and MPII, respectively, a significant reduction in performance compared to RePose with kinematically ordered updates. This indicates how crucial properly structuring the convolutional blocks is to obtaining better pose estimation models.
Instead of using the predefined ordering in Figure 3, where we start from the hips and propagate outwards, we tried a top-down approach that starts from the head and moves towards the ankles and wrists. The alternative ordering led to a decrease in performance of  and  on the LSP and MPII datasets, respectively.
|# Conv||Leeds||MPII||# Params||FLOPS|
|Feature Update Strategy||Leeds||MPII|
|# Stages||Leeds||MPII||# Params||FLOPS|
Post-feature Update Predictions
As described in Section 3, we independently predict one heatmap from each post-update feature set. This configuration results in a  RePose model. Alternatively, jointly predicting heatmaps from the projected concatenation of all the post-update features reduced the model to  but degraded performance by  and  on LSP and MPII, respectively.
Input Image Resolution & Ground Truth Heatmaps
We tried two different values for , namely , which is used in generating the ground truth heatmaps. We also tried two different input image resolutions,  and , but applied the kinematic feature updates at the  resolution for both configurations.
On the one hand, as shown in Table 7, increasing the resolution leads to an increase in performance of ; on the other hand, FLOPS increase by a factor of .
We presented a novel lightweight model for pose estimation from a single image. Our model combines two main components to achieve competitive results at its scale: 1) a learned deep geometric prior that intuitively encourages predictions to have consistent configurations, and 2) hierarchical refinement of predictions through a multi-scale representation of the input image; both trained jointly and in an end-to-end fashion. Compared with various state-of-the-art models, our approach has a fraction of the parameter count and the computational cost, and achieves state-of-the-art results on a standard benchmark for models of its size.
We carried out extensive ablation studies of our model components, evaluating across input resolutions, number of scales, and types of kinematic updates, among others, to provide a detailed report of the impact of the various design choices. Finally, recent state-of-the-art approaches to pose estimation incorporate adversarial loss or distillation, both of which are orthogonal to our contribution and will likely improve our model, which we leave to future work.
-  Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
-  Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In European Conference on Computer Vision, pages 717–732. Springer, 2016.
-  Adrian Bulat and Georgios Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In Proceedings of the IEEE International Conference on Computer Vision, pages 3706–3714, 2017.
-  Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4733–4742, 2016.
-  Xianjie Chen and Alan L Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in neural information processing systems, pages 1736–1744, 2014.
-  Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial posenet: A structure-aware convolutional network for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1212–1221, 2017.
-  Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4715–4723, 2016.
-  Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L Yuille, and Xiaogang Wang. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1831–1840, 2017.
-  Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or-1. arXiv preprint arXiv:1602.02830, 2016.
-  Rodrigo de Bem, Anurag Arnab, Stuart Golodetz, Michael Sapienza, and Philip Torr. Deep fully-connected part-based models for human pose estimation. In Asian Conference on Machine Learning, pages 327–342, 2018.
-  Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. International journal of computer vision, 61(1):55–79, 2005.
-  Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
-  Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In Frédéric Labrosse, Reyer Zwiggelaar, Yonghuai Liu, and Bernie Tiddeman, editors, BMVC, pages 1–11. British Machine Vision Association, 2010.
-  Sam Johnson and Mark Everingham. Learning effective human pose estimation from inaccurate annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1465–1472. IEEE, 2011.
-  Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. arXiv preprint arXiv:1802.03676, 2018.
-  Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
-  Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems, pages 3338–3348, 2017.
-  Xuecheng Nie, Jiashi Feng, Yiming Zuo, and Shuicheng Yan. Human pose estimation with parsing induced learner. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2100–2108, 2018.
-  Xi Peng, Zhiqiang Tang, Fei Yang, Rogerio S Feris, and Dimitris Metaxas. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2226–2234, 2018.
-  Umer Rafi, Bastian Leibe, Juergen Gall, and Ilya Kostrikov. An efficient convolutional network for human pose estimation. In British Machine Vision Conference, volume 1, page 2, 2016.
-  Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
-  Taiki Sekii. Pose proposal networks. In Proceedings of the European Conference on Computer Vision, pages 342–357, 2018.
-  Wei Tang and Ying Wu. Does learning specific features for related parts help human pose estimation? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1116, 2019.
-  Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pages 1799–1807, 2014.
-  Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
-  Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1281–1290, 2017.
-  Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3073–3082, 2016.
-  Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts. IEEE transactions on pattern analysis and machine intelligence, 35(12):2878–2890, 2012.
-  Xiang Yu, Feng Zhou, and Manmohan Chandraker. Deep deformation network for object landmark localization. In European Conference on Computer Vision, pages 52–70. Springer, 2016.
-  Feng Zhang, Xiatian Zhu, and Mao Ye. Fast human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2019.