RePose: Learning Deep Kinematic Priors for Fast Human Pose Estimation

02/10/2020 ∙ by Hossam Isack, et al.

We propose a novel efficient and lightweight model for human pose estimation from a single image. Our model is designed to achieve competitive results at a fraction of the number of parameters and computational cost of various state-of-the-art methods. To this end, we explicitly incorporate part-based structural and geometric priors in a hierarchical prediction framework. At the coarsest resolution, and in a manner similar to classical part-based approaches, we leverage the kinematic structure of the human body to propagate convolutional feature updates between the keypoints or body parts. Unlike classical approaches, we adopt end-to-end training to learn this geometric prior through feature updates from data. We then propagate the feature representation at the coarsest resolution up the hierarchy to refine the predicted pose in a coarse-to-fine fashion. The final network effectively models the geometric prior and intuition within a lightweight deep neural network, yielding state-of-the-art results for a model of this size on two standard datasets, Leeds Sports Pose and MPII Human Pose.

1 Introduction

The recent progress in machine learning techniques has allowed computer vision to work beyond bounding box estimation and solve tasks necessitating a fine-grained understanding of scenes and people. In particular, estimating human poses in the wild has been an area of research which has seen tremendous progress thanks to the development of large convolutional neural networks (CNN) that can efficiently reason about human poses under large occlusions or poor image quality. However, most of the best-performing models have 30 to 60 million parameters, prohibiting their usage in systems where compute and power are constrained, e.g. mobile phones. Decreasing the size of these deep models most often results in a drastic loss of accuracy, making this a last-resort option for improving efficiency. One might then rightfully ask: could some form of structural priors be used to reduce the size of these models while keeping a good accuracy?

Figure 1: Our approach employs learned structural and geometric priors at a very coarse resolution using a CNN that refines the prediction hierarchically. Top row shows the prediction before keypoint-specific features are updated. Second row shows prediction after features are updated in a predefined order, where high confidence keypoints influence poorly localized ones. Third row shows the final refined pose prediction at the finest resolution. Panels: pre-updates (coarsest resolution); post-updates (coarsest resolution); final prediction (finest resolution); ground truth.

A key physiological property that is shared across humans is the kinematic structure of the body. Our understanding of this structure allows us to accurately estimate the location of all body parts of other people even under occlusions. This naturally brings up the question: could a deep neural network (DNN) make use of this kinematic structure to achieve highly accurate human body pose estimation while keeping the model complexity small?

In fact, we are not the first to wonder about utilizing the kinematic structure of the body in a machine learning model. Indeed, early computer vision techniques for human pose estimation used part-based graphical models [11]. While the kinematic structure of the human body is well defined, the distribution of joint distances and angles is complex and hard to model explicitly when projected on 2D. Therefore, these earlier approaches often simplified this distribution, for example to Gaussian potentials [11]. In contrast, our approach only encodes the kinematic structure in a network architecture and lets the network learn the priors from data.

Some more recent deep learning approaches have also made use of a kinematic prior by approximating a maximum a posteriori (MAP) solver within a neural network [10, 27], typically through implementing the max-sum algorithm within the network. Running a MAP solver is computationally expensive and requires an explicit description of the distribution. Our approach avoids these issues by employing a kinematically structured network. This allows us to incorporate the structural prior without incurring the computational penalty. We encode this structure at a coarse resolution where a small receptive field is large enough to capture and spatially correlate neighbouring pose keypoints. Moreover, employing the kinematic feature update module at a coarse resolution keeps our network lightweight. Finally, our method successfully refines the predicted pose hierarchically through a feature pyramid until the finest resolution is reached. Figure 1 illustrates how the predicted pose improves throughout the various updates.

To summarize, our main contributions are as follows:

  1. A novel network architecture that encodes the kinematic structure via feature updates at coarse resolution, without the need for including any approximate MAP inference steps.

  2. A lightweight kinematic feature update module that achieves a significant improvement in accuracy, while only adding a small number of learnable parameters compared to state-of-the-art approaches.

  3. Extensive evaluation showing state-of-the-art results on the LSP dataset and competitive results on the MPII dataset using a lightweight network without using model compression techniques such as distillation.

2 Related Work

Human pose estimation is a fundamental problem in computer vision and an active field of research. Early attempts at solving this problem were based on kinematically inspired statistical graphical models, e.g. [11, 24, 5, 27], for modeling geometric and structural priors between keypoints, e.g. elbow and wrist, in 2D images.

These techniques either imposed limiting assumptions on the modeled distribution, or relied on sub-optimal methods for solving such graphical models within a deep learning framework [24, 5, 27]. For example, [11] assumed that the distance between a pair of keypoints could be modeled as a Gaussian distribution. Although efficient optimization methods exist for such a model, in practice the model is fairly simple and does not capture the complex global relations between keypoints, especially in 2D image space.

More recent approaches such as [27] applied loopy belief propagation, without any guarantees of optimality or convergence, in an effort to infer the MAP-estimate of a pose within a deep learning framework. The loopy belief propagation used in [27] and the dynamic programming used in [5] are computationally expensive. Furthermore, such networks are harder to train in general [15, 17], and the inferred MAP-estimate is not informative during the early stages of training when networks are learning to extract low level features.

The top performing pose estimation methods are based on DNNs [23, 19, 26, 8, 16, 6], which are capable of modeling complex geometric and appearance distributions between keypoints. In search for better performance on benchmarks, novel architectures and strategies were devised, such as adversarial data augmentation [19], feature-pyramids [26], pose GANs [6] and network stacking [16], which is a commonly used strategy that other methods [19, 26, 8, 2, 6] build on due to its simplicity and effectiveness.

In general, better pose estimates could be reached by successively refining the estimated pose. Carreira et al. [4] refined their initially estimated keypoints’ heatmaps by using a single additional refinement network and repeatedly using it to predict a new pose estimate given the current one. The stacking used in [16] could be seen as an unrolling of the refinement approach of Carreira et al. [4], where there are seven consecutive refinement networks that do not share weights. Although refinement unrolling achieves significantly better results than a single repeated refinement step [4], it is very expensive, e.g. [16], [26] and [12] require 18/38 [19], 28 and 60+ million parameters, respectively.

There are DNNs that aim to learn spatial priors between keypoints without resorting to MAP inference approximation. In [23] keypoints are clustered into sets of correlated keypoints and each set has its independent features, e.g. knee features do not directly affect hip features. The clustering was based on a mutual information measure, but the clustering threshold was heuristically chosen. In contrast, RePose allows neighbouring keypoints to directly influence each other’s features. Furthermore, [23] relies heavily on network stacking, while stacking only slightly improves RePose’s accuracy. Unlike RePose, [7] does not apply hierarchical pose refinement and relies on a handcrafted post-processing step to suppress false positives in heatmaps. Finally, [23, 7] are significantly larger networks than RePose.

Figure 2: The network architecture of RePose. The figure uses three symbols to denote three types of convolutional blocks; see text for definition. All convolution blocks are unique, i.e. no weight sharing, and we dropped their identifying indices for simplicity. Also, all predicted heatmaps are resized to the same size. As shown, after applying the kinematic features updates, our approach was able to correctly recover the ankles; see full predicted pose in Figure 1.

In reality those approaches sacrificed practicality, in terms of network size, for better benchmark performance metrics. There are a number of recent attempts to find lightweight pose estimation networks, while achieving close to state-of-the-art performance [3, 30]. In [3], the authors explored weight binarization [21, 9], which enabled them to replace multiplications with bitwise XOR operations. Their approach, however, resulted in a significant drop in performance. Recently, [30] was successful in distilling the stacked-hourglass [16] network with minimal drop in performance.

In Section 3 we describe our approach, RePose, for encoding geometric and structural priors via convolutional feature updates. Then we compare our approach to various state-of-the-art methods in Section 4 and run extensive ablation studies of our model components. Finally, Section 5 concludes our findings and contributions.

3 Method

Given an input image, a human pose is represented by a set of 2D keypoints, e.g. head, left ankle, etc., with one such set per example in the dataset. Our approach predicts a set of heatmaps, one for each keypoint. The ground truth heatmap of a keypoint is an unnormalized Gaussian centered at the keypoint location with a fixed standard deviation (footnote 1: in our experiments, we set the standard deviation to 5 at our input image size).
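A minimal sketch of the ground truth construction described above, in plain Python. Only the unnormalized Gaussian form and the standard deviation of 5 come from the text; the function name, the (x, y) pixel convention and the 16x16 size used in the example are illustrative assumptions.

```python
import math


def gaussian_heatmap(height, width, center, sigma):
    """Unnormalized 2D Gaussian: the value is exactly 1 at `center` = (x, y)."""
    cx, cy = center
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical usage: a 16x16 heatmap for a keypoint at pixel (x=4, y=9).
heatmap = gaussian_heatmap(16, 16, center=(4, 9), sigma=5.0)
peak_value = heatmap[9][4]  # row index is y, column index is x
```

Because the Gaussian is left unnormalized, the peak does not shrink as the standard deviation grows, which keeps the regression target on the same scale for every keypoint.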

3.1 Network

To simplify our network description, we define a convolutional block as a 2D convolution followed by ReLU activation and a batch normalization layer. Our notation for a series of convolutional blocks records the number of blocks in the series together with the kernel size, stride and number of output filters. In addition, a separate symbol denotes a convolutional block without a batch normalization layer.
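To make the cost of such blocks concrete, the following sketch counts learnable parameters for one block and for a series of blocks. It assumes a square kernel, a bias term in the convolution, and a batch normalization layer with one learnable scale and one shift per output filter; none of these details are stated in the text, and the helper names and channel sizes are hypothetical. The stride does not affect the count.

```python
def conv_block_params(kernel_size, in_channels, out_filters, batch_norm=True):
    """Learnable parameters of one conv -> ReLU -> (optional) batch-norm block."""
    weights = kernel_size * kernel_size * in_channels * out_filters
    bias = out_filters                         # assumption: the convolution has a bias
    bn = 2 * out_filters if batch_norm else 0  # assumption: scale + shift per filter
    return weights + bias + bn


def series_params(num_blocks, kernel_size, in_channels, out_filters):
    """A series of blocks: only the first block sees `in_channels` inputs."""
    total = conv_block_params(kernel_size, in_channels, out_filters)
    for _ in range(num_blocks - 1):
        total += conv_block_params(kernel_size, out_filters, out_filters)
    return total


# Hypothetical sizes, chosen only for illustration.
single = conv_block_params(3, 32, 32)        # 9*32*32 weights + 32 bias + 64 BN
stack_of_four = series_params(4, 3, 32, 32)
```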

Figure 2 shows our network architecture. At the coarsest resolution the features are decoupled into independent sets of features. To encourage that each set of features corresponds to a unique keypoint, we predict a single heatmap from each set of features (footnote 2: we used two kinds of convolutional blocks per heatmap). Afterwards we concatenate all predicted heatmaps to form the pre-update heatmaps.

Next, we update the decoupled sets of features according to a predefined ordering and kinematic connectivity, which is covered in Section 3.2. Then we use the updated features to compute post-update heatmaps, in the same manner as the pre-update heatmaps were computed. At this point we concatenate all the features used to predict the post-update heatmaps; this step is shown as a white circle in Figure 2.

The concatenated features are then bilinearly upsampled and concatenated with the skip connection, and projected back to channels. At each resolution heatmaps are predicted which are then bilinearly upsampled to full resolution. The refinement procedure continues as depicted in Figure 2 until full resolution is achieved. Finally, loss (3) is applied to all predicted heatmaps.

Without feature decoupling and kinematic updates, which are discussed in Section 3.2, RePose reduces to a UNet/Hourglass-style architecture with intermediate supervision.

3.2 Kinematic Features Updates

As shown in Figure 2, the kinematic features updates part of our network receives the decoupled sets of features. The basic idea at this stage is to update these sets of features in a way that enables the network to learn kinematic priors between keypoints and how to correlate them. As such we update the decoupled keypoints’ features according to a predefined ordering and kinematic connectivity.

Figure 3: The keypoint connectivity. At the coarsest resolution, features of each keypoint are updated based on the features of its neighbors. Our update ordering is hips, shoulders, knees, elbows, ankles, wrists, neck and then head, where the right keypoint comes just before its left counterpart.

Our predefined ordering starts with keypoints that are more likely to be predicted with high fidelity, e.g. hips or head, and ends with usually poorly predicted ones, e.g. wrists or ankles; see Figure 3 for the predefined ordering used in our approach.

The connectivity defines which keypoints we expect the network to learn to correlate. In our method connectivity is not restricted to trees. We used an undirected graph to define such connectivity, where each keypoint is represented by a unique node, and the set of edges encodes the desired connectivity; see Figure 3. For each keypoint/node, we define the ordered set of its neighbouring keypoints w.r.t. the predefined ordering.
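The ordering and connectivity can be sketched as plain data structures. The update order below follows the Figure 3 caption. The edge list is a hypothetical skeleton chosen for illustration, since the actual edges are only shown in the figure; it deliberately contains cycles, as the connectivity is not restricted to trees. All identifiers are ours.

```python
# Update ordering from the Figure 3 caption: right keypoint before its left counterpart.
ORDER = [
    "r_hip", "l_hip", "r_shoulder", "l_shoulder", "r_knee", "l_knee",
    "r_elbow", "l_elbow", "r_ankle", "l_ankle", "r_wrist", "l_wrist",
    "neck", "head",
]

# Hypothetical undirected edges (an assumption, not the edge set of Figure 3).
EDGES = [
    ("r_hip", "l_hip"), ("r_hip", "r_knee"), ("l_hip", "l_knee"),
    ("r_knee", "r_ankle"), ("l_knee", "l_ankle"),
    ("r_hip", "r_shoulder"), ("l_hip", "l_shoulder"),
    ("r_shoulder", "l_shoulder"),
    ("r_shoulder", "r_elbow"), ("l_shoulder", "l_elbow"),
    ("r_elbow", "r_wrist"), ("l_elbow", "l_wrist"),
    ("neck", "r_shoulder"), ("neck", "l_shoulder"), ("neck", "head"),
]


def ordered_neighbours(order, edges):
    """Map each keypoint to its neighbours, sorted by position in `order`."""
    rank = {name: i for i, name in enumerate(order)}
    neighbours = {name: set() for name in order}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {name: sorted(neighbours[name], key=rank.__getitem__) for name in order}


NEIGHBOURS = ordered_neighbours(ORDER, EDGES)
```

Sorting each neighbour set by rank gives every keypoint a fixed concatenation order, so the learned projection always sees the same neighbour in the same channel slot.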

We update the keypoints one at a time following the predefined ordering. The features of each keypoint are updated in two steps, (1) and (2). In (1) we simply concatenate the features of the keypoint and all the features of its neighbouring keypoints. A convolutional block then projects the concatenated features, which then pass through four convolutional blocks. In (2) the features are updated via a residual connection with a trainable weight. Finally, inspired by message passing techniques, we update the features one more time w.r.t. the reversed ordering. It should be noted that the two passes do not share any trainable parameters.
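The two-pass update can be illustrated with a toy sketch. Features here are short lists of floats rather than feature maps, and `mean_transform` is a stand-in for the learned convolutional blocks; both are simplifications of ours. The sketch only shows the control flow: concatenation with neighbours, a weighted residual update, and a second pass in reversed order with its own parameters.

```python
def mean_transform(concat):
    """Stand-in for the learned convolutional blocks: element-wise mean."""
    return [sum(values) / len(concat) for values in zip(*concat)]


def kinematic_pass(features, order, neighbours, weights, transform):
    """Update each keypoint in `order`, reusing features already updated in this pass."""
    for k in order:
        # Step (1): concatenate the keypoint's features with its neighbours' features.
        concat = [features[k]] + [features[j] for j in neighbours[k]]
        # Step (2): residual update scaled by a per-keypoint trainable weight.
        update = transform(concat)
        features[k] = [f + weights[k] * u for f, u in zip(features[k], update)]
    return features


# Hypothetical three-keypoint chain with one-dimensional features.
order = ["hip", "knee", "ankle"]
neighbours = {"hip": ["knee"], "knee": ["hip", "ankle"], "ankle": ["knee"]}
features = {"hip": [1.0], "knee": [0.0], "ankle": [0.0]}

# Separate weights per pass: the two passes share no trainable parameters.
forward_weights = {k: 1.0 for k in order}
backward_weights = {k: 0.5 for k in order}

kinematic_pass(features, order, neighbours, forward_weights, mean_transform)
kinematic_pass(features, order[::-1], neighbours, backward_weights, mean_transform)
```

In this toy chain the information that starts at the hip reaches the ankle during the forward pass, and the reversed pass lets the outer keypoints feed back towards the hip.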

3.3 Loss

Our loss (3) is a partial Mean Squared Error (MSE) between the heatmaps predicted by the network and the ground truth heatmaps, computed over a training batch. Some of the images in the datasets are not fully annotated; as such, the loss only includes the set of annotated keypoints of each example. It should be noted that MSE is a fairly standard loss for pose estimation but, to the best of our knowledge, its partial counterpart has not been used before. As shown in Figure 2, RePose produces multiple heatmaps/predictions for intermediate supervision. Our total loss is the sum of (3) for all the predicted heatmaps.
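A sketch of a partial MSE in plain Python. The text only states that the loss involves the batch size and is restricted to annotated keypoints, so dividing the summed squared error by the batch size is an assumption here, as are the function name and the flattened nested-list layout.

```python
def partial_mse(predicted, target, annotated):
    """Squared error over annotated keypoints only, divided by the batch size.

    predicted, target: [example][keypoint][pixel] nested lists (flattened heatmaps).
    annotated: one set of annotated keypoint indices per example.
    """
    total = 0.0
    for pred_maps, true_maps, keypoints in zip(predicted, target, annotated):
        for k in keypoints:
            total += sum((p - t) ** 2 for p, t in zip(pred_maps[k], true_maps[k]))
    return total / len(predicted)


# Hypothetical batch: 2 examples, 2 keypoints, 2 pixels per heatmap.
predicted = [[[1.0, 0.0], [0.0, 0.0]], [[0.5, 0.5], [1.0, 1.0]]]
target = [[[0.0, 0.0], [9.0, 9.0]], [[0.5, 0.5], [0.0, 1.0]]]
annotated = [{0}, {0, 1}]  # keypoint 1 of the first example is not annotated

loss = partial_mse(predicted, target, annotated)
```

The unannotated keypoint contributes nothing, so its placeholder target values cannot push the network towards a wrong location.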

4 Experiments

Datasets

We evaluated our RePose network on two standard pose estimation datasets, namely Leeds Sports Pose (LSP) [13, 14] and MPII Human Pose [1]. MPII is more challenging compared to LSP, as poses in MPII cover a large number of activities. Furthermore, MPII has a large number of spatially inseparable poses, which frequently occur in crowded scenes. MPII provides an estimate of pose center and scale, while LSP does not. To allow for joint training on both datasets we used an estimated pose center and scale for the LSP training set, as done in [26, 16, 25]. For the LSP testing set, the scale and center were set to the image’s size and center, respectively.

Method Head Shoulder Elbow Wrist Hip Knee Ankle Mean # Param FLOPS
Tompson et al. NIPS 14 [24] - -
Rafi et al. BMVC 16 [20] 56M 28G
Yang et al. CVPR 16 [27] - -
Yu et al. ECCV 16 [29] - -
Carreira et al. CVPR 16 [4] - -
Yang et al. ICCV 17 [26] 28M 46G
Peng et al. CVPR 18 [19] 26M 55G
lightweight pose estimation approaches
Fast Pose CVPR 19 [30] 3M 9G
RePose 4M 13.5G

Table 1: A comparison of various methods on the LSP dataset using the PCK@0.2 metric [28]. RePose achieved better results w.r.t. other state-of-the-art CNN methods that are based on statistical graphical methods (trained on LSP). M and G stand for Million and Giga.

Method Head Shoulder Elbow Wrist Hip Knee Ankle Mean # Param FLOPS
Insafutdinov et al. ECCV 16 [12] 66M 286G
Rafi et al. BMVC 16 [20] 56M 28G
Wei et al. CVPR 16 [25] 31M 351G
Newell et al. ECCV 16 [16] 26M 55G
Chu et al. CVPR 17 [8] 58M 128G
Yang et al. ICCV 17 [26] 28M 46G
Nie et al. CVPR 18 [18] 26M 63G
Peng et al. CVPR 18 [19] 26M 55G
lightweight pose estimation approaches
Sekii ECCV 18 [22] - - - - - - - 16M 6G
Fast Pose CVPR 19 [30] 3M 9G
RePose 4M 13.5G

Table 2: A comparison of various methods on the MPII dataset using the PCKh@0.5 metric [1]. RePose achieved comparable results to Stacked-hourglass [16] and its distilled version, i.e. Fast Pose [30]. M and G stand for Million and Giga.

Training

Similar to [26, 16, 25], we augmented the training data by cropping according to the provided pose scale and center, and resized the crop to the network input size. Furthermore, the training data was augmented by scaling, rotation, horizontal flipping, and color noise (i.e. jitter, brightness and contrast). Our network described in Section 3.1 results in a model with 4M parameters and 13.5 GFLOPS (see Tables 1 and 2), which was trained jointly on LSP and MPII. We used the Adam optimizer to train our network with a batch size of 64 and a predefined stopping criterion at 2M steps. The learning rate was dropped twice from its initial value, at 1M and 1.3M steps. Contrary to other approaches we did not fine-tune our model on a specific dataset.

Metrics

For evaluation, we used commonly adopted single pose estimation metrics in the literature. As per the LSP and MPII benchmarks, we used two variants of the Probability of Correct Keypoints [28, 1] metric, i.e. PCK@0.2 for LSP and PCKh@0.5 for MPII. The former uses a fraction of 0.2 of the torso diameter to define the threshold used in identifying correctly predicted keypoints, while the latter uses a fraction of 0.5 of the head length. The validation set of [25, 26, 30] was used for evaluating our model on MPII.
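Both metrics share the same thresholding logic and differ only in the reference length and the fraction applied to it. The sketch below is a simplified single-pose version under our own assumptions: the function name is hypothetical, unannotated keypoints are marked with `None` and skipped, and a prediction exactly at the threshold counts as correct. Benchmark toolkits may differ in such details.

```python
import math


def pck(predicted, ground_truth, reference_length, alpha):
    """Fraction of annotated keypoints within alpha * reference_length of the truth."""
    correct = 0
    counted = 0
    for pred, true in zip(predicted, ground_truth):
        if true is None:  # unannotated keypoint: not evaluated
            continue
        counted += 1
        distance = math.hypot(pred[0] - true[0], pred[1] - true[1])
        if distance <= alpha * reference_length:
            correct += 1
    return correct / counted if counted else 0.0


# Hypothetical pose with three keypoints, one of them unannotated.
predicted = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
ground_truth = [(10.0, 12.0), (26.0, 20.0), None]

pckh_05 = pck(predicted, ground_truth, reference_length=10.0, alpha=0.5)  # head length
pck_02 = pck(predicted, ground_truth, reference_length=40.0, alpha=0.2)   # torso diameter
```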

Quantitative Evaluations

Quantitative results are shown in Tables 1 and 2 comparing our trained model to various state-of-the-art approaches on the LSP and MPII datasets, respectively. As shown in Table 1, RePose was able to surpass Yang et al. [27] and Tompson et al. [24] by a large margin; both try to approximate a MAP-solver of a statistical graphical model within a deep neural network framework. Furthermore, our approach was able to perform better than Fast Pose [30] on average. As shown in Table 2, RePose reached comparable results to Fast Pose [30] and the Stacked-hourglass [16]. With network stacking, our network reaches better performance on MPII at the expense of increasing the number of trainable parameters and FLOPS; see Table 6. However, the gain in performance does not seem to justify doubling the network size.

Qualitative Evaluations

Figure 4 shows a sample of correctly predicted poses using examples not seen during training on both datasets. Our network’s failed predictions (see Figure 5 for a sample) are skewed towards scenes with a large number of occluded keypoints or spatially inseparable poses. Intuitively, kinematically updating features in those cases does not perform as well, since there are not enough accurately localized keypoints to enhance the prediction of the remaining ones.

Figure 4: A sample of correctly predicted poses using RePose on LSP and MPII examples not seen during training. The body parts are uniquely color coded, while line style encodes the person centric orientation of a part, i.e. solid and dashed for right and left, respectively. Panels: four LSP examples and two MPII examples.

Figure 5: A sample of incorrectly predicted poses on LSP and MPII, each shown as the ground truth next to the incorrect pose. The failure cases usually involve crowded scenes, highly occluded poses, or a left/right flip of the whole body pose.

4.1 Ablation Study

We conduct an ablation study to show the effectiveness of different configurations and components of RePose.

Coarsest Resolution for Kinematic Updates

One important question is: what is the coarsest resolution at which our kinematic features updates are the most effective? Table 3 shows the results of applying the updates at different resolutions. On the one hand, it is clear that applying the kinematic updates at too coarse a resolution degrades the performance significantly. If we were to randomly place the keypoints on a very coarse pixel grid, then there is more than an even chance that two or more keypoints will be placed on the same pixel (footnote 3: assuming keypoints are i.i.d.). This chance depends on the number of keypoints, e.g. that of LSP, and on the resolution. On the other hand, moving to a finer resolution increases the number of FLOPS. Furthermore, applying the updates at higher resolutions could adversely affect the performance, since the receptive field would not be large enough to capture all neighbouring keypoints to properly correlate their features.
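The collision argument is an instance of the birthday problem and can be checked numerically. The sketch below assumes keypoints are placed independently and uniformly over the pixels; uniformity is our added assumption, the text only says i.i.d. LSP annotates 14 keypoints, while the grid sizes are illustrative choices rather than values taken from Table 3.

```python
def collision_probability(num_keypoints, grid_side):
    """Chance that at least two i.i.d. uniform keypoints land on the same pixel."""
    cells = grid_side * grid_side
    all_distinct = 1.0
    for i in range(num_keypoints):
        # The (i+1)-th keypoint must avoid the i pixels that are already occupied.
        all_distinct *= max(0.0, 1.0 - i / cells)
    return 1.0 - all_distinct


# 14 keypoints as in LSP; grid sides are hypothetical examples.
chances = {side: collision_probability(14, side) for side in (4, 8, 16)}
```

Under these assumptions the chance is close to certainty on a 4x4 grid, roughly three in four on an 8x8 grid, and below one in three on a 16x16 grid, which illustrates how quickly very coarse grids force keypoints to share a pixel.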

Feature Update Step

We tried using different numbers of convolutional blocks in each kinematic update step (2). As shown in Table 4, increasing the number of blocks to more than four degrades performance. We also tested different strategies of applying the residual connection of the update. Table 5 shows the results of using trainable weights as in (2), adding the old features to the updated ones, or completely replacing the old ones. Using trainable weights leads to a significant performance gain, especially on MPII where occlusions are more common.

Network Stacking

Network stacking [16] is a popular technique to increase network performance. For completeness, Table 6 shows results for stacking. RePose reaches comparable results to state-of-the-art methods [26, 19] on LSP, while only using a fraction of the trainable weights they require. Finally, stacked RePose networks train significantly faster, requiring less than half the number of steps compared to a single network.

Kinematically Ordered vs Sequential Updates

To show how ordering the convolutions helps performance, we replaced the features update step by a series of sequential convolutional blocks, such that the resulting model would have roughly the same number of parameters. The sequential model showed a significant reduction in performance on both LSP and MPII compared to RePose with kinematically ordered updates. This indicates how crucial it is to properly structure the convolutional blocks to get better pose estimation models.

Updates Ordering

Instead of using the predefined ordering in Figure 3, where we started from the hips and propagated outwards, we tried a top-down approach where we started from the head and moved towards the ankles and wrists. The alternative ordering led to a decrease in performance on both the LSP and MPII datasets.

Coarsest Resolution Leeds MPII # Params FLOPS
M G
M G
M G

Table 3: The results for applying kinematic features update steps at different resolutions. At higher resolutions the receptive field is not large enough to fully capture all neighbouring keypoints.

# Conv Blocks Leeds MPII # Params FLOPS
M G
M G
M G
M G
M G

Table 4: The results for different numbers of convolutional blocks used in the kinematic feature update step.

Feature Update Strategy Leeds MPII
trainable
add
replace

Table 5: The results for different kinematic features update strategies. Using trainable mixing weights to learn how to weight old vs new features is the clear winner, while insignificantly increasing the number of trainable parameters.

# Stages Leeds MPII # Params FLOPS
M G
M G
M G

Table 6: The results for stacking multiple RePose architectures to create a multi-stage network, a la [16]. Our approach achieves comparable results to state-of-the-art methods [26, 19] on LSP, while using a fraction of the parameters required by [26, 19].

Leeds MPII # Params FLOPS
5 M G
10 M G
5 M G
10 M G

Table 7: The results for different input resolutions and ground truth heatmap standard deviations using RePose. Increasing the input image resolution leads to better results on MPII, but also increases the FLOPS.

Post-features Update Predictions

As described in Section 3, we independently predict one heatmap from each post-update feature set. This configuration results in the 4M-parameter RePose model of Tables 1 and 2. Alternatively, jointly predicting heatmaps from the projected concatenation of all the post-update features reduced the model size but degraded performance on both LSP and MPII.

Input Image Resolution & Ground Truth Heatmaps

We tried two different values, namely 5 and 10 (see Table 7), for the standard deviation used in generating ground truth heatmaps. We also tried two different input image resolutions, but applied the kinematic features updates at the same resolution for both configurations.

On the one hand, as shown in Table 7, increasing the input resolution leads to an increase in performance on MPII, but on the other hand the FLOPS increase as well.

5 Conclusion

We presented a novel lightweight model for pose estimation from a single image. Our model combines two main components to achieve competitive results at its scale: 1) a learned deep geometric prior that intuitively encourages predictions to have consistent configurations, and 2) hierarchical refinement of predictions through a multi-scale representation of the input image; both trained jointly and in an end-to-end fashion. Compared with various state-of-the-art models, our approach has a fraction of the parameter count and the computational cost, and achieves state-of-the-art results on a standard benchmark for models of its size.

We carried out extensive ablation studies of our model components, evaluating across input resolutions, number of scales, and types of kinematic updates, among others, to provide a detailed report of the impact of the various design choices. Finally, recent state-of-the-art approaches to pose estimation incorporate adversarial loss or distillation, both of which are orthogonal to our contribution and will likely improve our model, which we leave to future work.

References

  • [1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [2] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In European Conference on Computer Vision, pages 717–732. Springer, 2016.
  • [3] Adrian Bulat and Georgios Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In Proceedings of the IEEE International Conference on Computer Vision, pages 3706–3714, 2017.
  • [4] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4733–4742, 2016.
  • [5] Xianjie Chen and Alan L Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in neural information processing systems, pages 1736–1744, 2014.
  • [6] Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial posenet: A structure-aware convolutional network for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1212–1221, 2017.
  • [7] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4715–4723, 2016.
  • [8] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L Yuille, and Xiaogang Wang. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1831–1840, 2017.
  • [9] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
  • [10] Rodrigo de Bem, Anurag Arnab, Stuart Golodetz, Michael Sapienza, and Philip Torr. Deep fully-connected part-based models for human pose estimation. In Asian Conference on Machine Learning, pages 327–342, 2018.
  • [11] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. International journal of computer vision, 61(1):55–79, 2005.
  • [12] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
  • [13] Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In Frédéric Labrosse, Reyer Zwiggelaar, Yonghuai Liu, and Bernie Tiddeman, editors, BMVC, pages 1–11. British Machine Vision Association, 2010.
  • [14] Sam Johnson and Mark Everingham. Learning effective human pose estimation from inaccurate annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1465–1472. IEEE, 2011.
  • [15] Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. arXiv preprint arXiv:1802.03676, 2018.
  • [16] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
  • [17] Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems, pages 3338–3348, 2017.
  • [18] Xuecheng Nie, Jiashi Feng, Yiming Zuo, and Shuicheng Yan. Human pose estimation with parsing induced learner. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2100–2108, 2018.
  • [19] Xi Peng, Zhiqiang Tang, Fei Yang, Rogerio S Feris, and Dimitris Metaxas. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2226–2234, 2018.
  • [20] Umer Rafi, Bastian Leibe, Juergen Gall, and Ilya Kostrikov. An efficient convolutional network for human pose estimation. In British Machine Vision Conference, volume 1, page 2, 2016.
  • [21] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • [22] Taiki Sekii. Pose proposal networks. In Proceedings of the European Conference on Computer Vision, pages 342–357, 2018.
  • [23] Wei Tang and Ying Wu. Does learning specific features for related parts help human pose estimation? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1116, 2019.
  • [24] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pages 1799–1807, 2014.
  • [25] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
  • [26] Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1281–1290, 2017.
  • [27] Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3073–3082, 2016.
  • [28] Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts. IEEE transactions on pattern analysis and machine intelligence, 35(12):2878–2890, 2012.
  • [29] Xiang Yu, Feng Zhou, and Manmohan Chandraker. Deep deformation network for object landmark localization. In European Conference on Computer Vision, pages 52–70. Springer, 2016.
  • [30] Feng Zhang, Xiatian Zhu, and Mao Ye. Fast human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2019.