1 Introduction
Figure 1: (a) Generative partition. (b) Local inference.
Multi-person pose estimation aims to localize the body joints of multiple persons captured in a 2D monocular image [7, 22]. Despite extensive prior research, this problem remains very challenging due to highly complex joint configurations, partial or even complete joint occlusion, significant overlap between neighboring persons, an unknown number of persons and, more critically, the difficulty of allocating joints to multiple persons. These challenges distinguish multi-person pose estimation from the simpler single-person setting [19, 27]. To tackle them, existing multi-person pose estimation approaches usually perform joint detection and partition separately, mainly following two different strategies. The top-down strategy [7, 8, 13, 20, 23] first detects persons and then performs pose estimation for each person individually. The bottom-up strategy [3, 11, 12, 15, 16, 22], in contrast, generates all joint candidates first and then tries to partition them into corresponding person instances.
Top-down approaches directly leverage existing person detection models [17, 24] and single-person pose estimation methods [19, 27], thus effectively avoiding complex joint partition. However, their performance is critically limited by the quality of person detections: if the employed person detector fails to detect a person instance accurately (due to occlusion, overlap or other distracting factors), the introduced errors cannot be remedied and severely harm the subsequent pose estimation. Moreover, they suffer from high joint detection complexity, which increases linearly with the number of persons in the image, because they must run the single-person joint detector for each person detection sequentially.
In contrast, bottom-up approaches first detect all joint candidates by applying a joint detector globally only once, and then partition them into corresponding persons according to joint affinities. Hence, they enjoy lower joint detection complexity than top-down approaches and robustness to errors from early commitment. However, they suffer from the very high complexity of partitioning joints into corresponding persons, which usually involves solving NP-hard graph partition problems [12, 22] on densely connected graphs covering the whole image.
In this paper, we propose a novel solution, termed the Generative Partition Network (GPN), to overcome the essential limitations of the above two types of approaches while inheriting their strengths within a unified model for efficiently and effectively estimating the poses of multiple persons in a given image. As shown in Figure 1, GPN solves the multi-person pose estimation problem by simultaneously 1) modeling person detection for joint partition as a generative process inferred from all joint candidates, and 2) performing local inference to obtain joint categorizations and associations conditioned on the generated person detections.
In particular, GPN introduces a dense regression module that generates person detections with partitioned joints via votes from joint candidates in a carefully designed embedding space, which is efficiently parameterized by person centroids. This generative partition model produces joint candidates and partitions by running a joint detector in only one feed-forward pass, offering much higher efficiency than top-down approaches. In addition, the person detections produced by GPN are robust to various distracting factors, e.g., occlusion, overlap, deformation, and large pose variation, benefiting the subsequent pose estimation. GPN also introduces a local greedy inference algorithm that assumes independence among person detections to produce optimal multi-person joint configurations. This local optimization strategy reduces the search space of the graph partition problem for finding optimal poses, avoiding the high joint partition complexity that challenges the bottom-up strategy. Moreover, the local greedy inference algorithm exploits reliable global affinity cues from the embedding space for inferring joint configurations within robust person detections, leading to improved performance.
We implement GPN based on the Hourglass network [19] to learn the joint detector and dense regressor simultaneously. Extensive experiments on the MPII Human Pose Multi-Person [1], extended PASCAL-Person-Part [28] and WAF [7] benchmarks clearly show the efficiency and effectiveness of the proposed GPN. Moreover, GPN achieves new state-of-the-art performance on all these benchmarks.
We make the following contributions. 1) We propose a new one-pass solution to multi-person pose estimation, totally different from previous top-down and bottom-up ones. 2) We propose a novel dense regression module to efficiently and robustly partition body joints among multiple persons, which is the key to speeding up multi-person pose estimation. 3) In addition to high efficiency, GPN is also superior in terms of robustness and accuracy on multiple benchmarks.
2 Related Work
Top-Down Multi-Person Pose Estimation
Existing approaches following the top-down strategy sequentially perform person detection and single-person pose estimation. In [9], Gkioxari et al. proposed to adopt the Generalized Hough Transform framework to first generate person proposals and then classify joint candidates based on poselets. Sun et al. [25] presented a hierarchical part-based model for joint person detection and pose estimation. Recently, deep learning techniques have been exploited to improve both person detection and single-person pose estimation. In [13], Iqbal and Gall adopted a Faster R-CNN [24] based person detector and a convolutional pose machine [27] based joint detector for this task. Later, Fang et al. [8] utilized the spatial transformer network [14] and Hourglass network [19] to further improve the quality of joint detections and partitions. Despite remarkable success, these methods suffer from early commitment and high joint detection complexity. Differently, the proposed GPN adopts a one-pass generative process for efficiently producing person detections with partitioned joint candidates, offering robustness to early commitment as well as low joint detection complexity.
Bottom-Up Multi-Person Pose Estimation
The bottom-up strategy provides robustness to early commitment and low joint detection complexity. Previous bottom-up approaches [3, 12, 18, 22] mainly focus on improving either the joint detector or the joint affinity cues, benefiting the subsequent joint partition and configuration inference. For joint detection, fully convolutional neural networks, e.g., Residual networks [10] and Hourglass networks [19], have been widely exploited. As for joint affinity cues, Insafutdinov et al. [12] explored geometric and appearance constraints among joint candidates. Cao et al. [3] proposed part affinity fields to encode the location and orientation of limbs. Newell and Deng [18] presented associative embedding for grouping joint candidates. Nevertheless, all these approaches partition joints by partitioning a graph covering the whole image, resulting in high inference complexity. In contrast, GPN performs local inference with robust global affinity cues that are efficiently generated by dense regressions in the centroid embedding space, reducing the complexity of joint partition and improving pose estimation.
3 Approach
3.1 Generative Partition Model
The overall pipeline of the proposed Generative Partition Network (GPN) model is shown in Figure 2. Throughout the paper, we use the following notation. Let $I$ denote an image containing multiple persons, $\mathbf{x} = \{x_i\}_{i=1}^{N}$ denote the spatial coordinates of the $N$ joint candidates from all persons in $I$ with $x_i \in \mathbb{R}^2$, and $\mathbf{c} = \{c_i\}_{i=1}^{N}$ denote the labels of the corresponding joint candidates, in which $c_i \in \{1, \dots, K\}$ and $K$ is the number of joint categories. For allocating joints via local inference, we also consider the proximities between joints, denoted as $\mathbf{g} = \{g_{i,j}\}$. Here $g_{i,j}$ encodes the proximity between the $i$-th and the $j$-th joint candidates, giving the probability that they belong to the same person.
The proposed GPN with learnable parameters $\theta$ aims to solve the multi-person pose estimation task by learning to infer the conditional distribution $p(\mathbf{x}, \mathbf{c}, \mathbf{g} \mid I; \theta)$. Namely, given the image $I$, GPN infers the joint locations $\mathbf{x}$, labels $\mathbf{c}$ and proximities $\mathbf{g}$ providing the largest likelihood. To this end, GPN adopts a generative model that simultaneously produces person detections with implicit joint partitions and infers the joint configuration $\mathbf{c}$ and $\mathbf{g}$ for each person detection locally. In this way, GPN significantly reduces the difficulty and complexity of multi-person pose estimation. Formally, GPN introduces latent variables $\mathbf{z} = \{z_m\}_{m=1}^{M}$ to encode joint partitions, where each $z_m$ is a collection of joint candidates belonging to a specific person detection (their labels are not considered here) and $M$ is the number of joint partitions. With these latent variables $\mathbf{z}$, $p(\mathbf{x}, \mathbf{c}, \mathbf{g} \mid I; \theta)$ can be factorized into

$$p(\mathbf{x}, \mathbf{c}, \mathbf{g} \mid I; \theta) = p(\mathbf{x} \mid I; \theta) \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, I; \theta)\, p(\mathbf{c}, \mathbf{g} \mid \mathbf{z}, \mathbf{x}, I; \theta), \quad (1)$$

where $p(\mathbf{z} \mid \mathbf{x}, I; \theta)$ models the generative process of joint partitions within person detections based on the joint candidates. Maximizing this likelihood gives the optimal pose estimation for the multiple persons in $I$.
However, directly maximizing the above likelihood is computationally intractable. Instead of maximizing it w.r.t. all possible partitions $\mathbf{z}$, we propose to maximize the lower bound induced by a single "optimal" partition, inspired by the EM algorithm [6]. Since every summand in Eqn. (1) is non-negative,

$$p(\mathbf{x}, \mathbf{c}, \mathbf{g} \mid I; \theta) \geq p(\mathbf{x} \mid I; \theta)\, \max_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, I; \theta)\, p(\mathbf{c}, \mathbf{g} \mid \mathbf{z}, \mathbf{x}, I; \theta).$$

Such an approximation reduces the complexity significantly without harming performance. We therefore find the optimal solution by maximizing this induced lower bound with a single partition, instead of maximizing the summation. The joint partitions disentangle independent joints and reduce inference complexity: only joints falling in the same partition have nonzero proximities $g_{i,j}$. Then $p(\mathbf{c}, \mathbf{g} \mid \mathbf{z}, \mathbf{x}, I; \theta)$ is further factorized as
$$p(\mathbf{c}, \mathbf{g} \mid \mathbf{z}, \mathbf{x}, I; \theta) = \prod_{m=1}^{M} p(\mathbf{c}_{z_m}, \mathbf{g}_{z_m} \mid z_m, \mathbf{x}, I; \theta), \quad (2)$$

where $\mathbf{c}_{z_m}$ denotes the labels of the joints falling in the partition $z_m$ and $\mathbf{g}_{z_m}$ denotes their proximities. In the above probabilities, we define $p(\mathbf{c}, \mathbf{g}, \mathbf{z} \mid \mathbf{x}, I; \theta)$ as a Gibbs distribution:

$$p(\mathbf{c}, \mathbf{g}, \mathbf{z} \mid \mathbf{x}, I; \theta) = \frac{1}{Z} \exp\big(-E(\mathbf{c}, \mathbf{g}, \mathbf{z}, \mathbf{x}, I)\big), \quad (3)$$

where $E(\mathbf{c}, \mathbf{g}, \mathbf{z}, \mathbf{x}, I)$ is the energy function for the joint distribution and $Z$ is the partition function. The explicit form of the energy is derived from Eqn. (2) accordingly:

$$E(\mathbf{c}, \mathbf{g}, \mathbf{z}, \mathbf{x}, I) = -\phi(\mathbf{z}, \mathbf{x}, I) - \sum_{m=1}^{M} \Big( \sum_{i \in z_m} \psi(x_i, c_i) + \sum_{i, j \in z_m,\, i \neq j} \eta(x_i, c_i, x_j, c_j) \Big). \quad (4)$$

Here, $\phi(\mathbf{z}, \mathbf{x}, I)$ scores the quality of the joint partitions generated from the joint candidates for the input image $I$, $\psi(x_i, c_i)$ scores how compatible the position $x_i$ is with the label $c_i$, and $\eta(x_i, c_i, x_j, c_j)$ represents how likely the position $x_i$ with label $c_i$ and the position $x_j$ with label $c_j$ belong to the same person, i.e., it characterizes the proximity $g_{i,j}$. In the following subsections, we give details of detecting the joint candidates $\mathbf{x}$, generating the optimal joint partitions $\mathbf{z}$, and inferring the joint configurations $\mathbf{c}$ and $\mathbf{g}$, along with the proposed algorithm to optimize the energy function.
3.2 Joint Candidate Detection
To reliably detect human body joints, we use confidence maps to encode the probabilities of joints being present at each position in the image. The joint confidence maps are constructed by modeling the joint locations as Gaussian peaks, as shown in Figure 2 (a). We use $S_k$ to denote the confidence map for the $k$-th joint, with $S_{k,p}$ being the confidence map of the $k$-th joint for the $p$-th person. For a position $u$ in the given image, $S_{k,p}(u)$ is calculated by $S_{k,p}(u) = \exp(-\|u - x^{*}_{k,p}\|_2^2 / \sigma^2)$, where $x^{*}_{k,p}$ denotes the ground-truth position of the $k$-th joint of the $p$-th person, and $\sigma$ is an empirically chosen constant controlling the variance of the Gaussian, set as 7 in the experiments. The target confidence map, which the proposed GPN model learns to predict, aggregates the peaks of all persons in a single map. Here, we take the maximum of the confidence maps rather than the average in order to retain the distinction between close-by peaks [3], i.e., $S_k(u) = \max_p S_{k,p}(u)$. During testing, we first find peaks with confidence scores greater than a given threshold (set as 0.1) on the predicted confidence maps for all joint types. Then we perform non-maximum suppression to obtain the joint candidate set $\mathbf{x}$.
3.3 Joint Partition via Dense Regression
Our joint partition model performs dense regression over all the joint candidates to localize the centroids of multiple persons and thereby partition joints into different person instances, as shown in Figure 2 (b) and (c). It learns to transform all the pixels belonging to a specific person to a single identical point in a carefully designed embedding space, where they are easy to cluster into corresponding persons. Such a dense regression framework generates joint partitions in one single feed-forward pass, removing the high joint detection complexity troubling top-down solutions.
To this end, we build and parameterize the joint candidate embedding space by person centroids, as the centroid is stable and reliable for discriminating person instances even in the presence of extreme poses. We denote the constructed embedding space as $\mathcal{H}$. In $\mathcal{H}$, each person corresponds to a single point (i.e., the centroid), and each point represents a hypothesis about the centroid location of a specific person instance. An example is given in Figure 3 (a).
Joint candidates are densely transformed into $\mathcal{H}$ and can collectively determine the centroid hypotheses of their corresponding person instances, since joints and centroids are tightly related from the perspective of articulated kinematics, as shown in Figure 2 (c). For instance, a candidate of the head joint would cast votes for the presence of a person's centroid at the location just below it. A single candidate does not necessarily provide evidence for the exact centroid of a person instance, but the population of joint candidates can vote for the correct centroid with large probability and determine the joint partitions correctly. In particular, the probability of generating a joint partition at a location $u$ is calculated by summing the votes from different joint candidates together, i.e.,

$$P(u) = \sum_{k=1}^{K} w_k \sum_{i=1}^{N} \mathbb{1}[c_i = k]\, \mathbb{1}[T(x_i) = u],$$

where $\mathbb{1}[\cdot]$ is the indicator function and $w_k$ is the weight for votes from the $k$-th joint category. We set $w_k$ equal for all joints, assuming all kinds of joints contribute equally to the localization of person instances, in view of the unconstrained shapes of human bodies and the uncertain presence of different joints. The function $T(\cdot)$ learns to densely transform every pixel in the image to the embedding space $\mathcal{H}$. For learning $T$, we build the target regression map $R^{*}_{k,p}$ for the $k$-th joint of the $p$-th person as

$$R^{*}_{k,p}(u) = \begin{cases} (o_p - u)/Z, & \text{if } u \in \mathcal{N}_{\tau}(x^{*}_{k,p}), \\ 0, & \text{otherwise}, \end{cases}$$

where $o_p$ denotes the centroid position of the $p$-th person, $Z = \sqrt{H^2 + W^2}$ is the normalization factor, $H$ and $W$ are the height and width of the image $I$, $\mathcal{N}_{\tau}(x^{*}_{k,p})$ denotes the neighbor positions of the $k$-th joint of the $p$-th person, and $\tau$ is a constant defining the neighborhood size, set as 7 in our experiments. An example of constructing the regression target of a pixel is shown in Figure 3 (b). Then, we define the target regression map $R^{*}_{k}$ for the $k$-th joint as the average over all persons,

$$R^{*}_{k}(u) = \frac{1}{N_u} \sum_{p} R^{*}_{k,p}(u),$$

where $N_u$ is the number of nonzero vectors at position $u$ across all persons. During testing, after predicting the regression map $R_k$, we define the transformation function for a position $u$ as $T(u) = u + Z \cdot R_k(u)$. After generating $T(u)$ for each point, we calculate the score of each point in the embedding space by accumulating the votes as above. The problem of joint partition generation is then converted to finding peaks in the embedding space $\mathcal{H}$. As there is no prior on the number of persons in the image, we adopt Agglomerative Clustering [2] to find peaks by clustering the votes, which automatically determines the number of clusters. We denote the vote set as $\mathcal{V} = \{T(x_i)\}_{i=1}^{N}$, and use $\mathcal{C} = \{C_m\}_{m=1}^{M}$ to denote the clustering result on $\mathcal{V}$, where $C_m$ represents the $m$-th cluster and $M$ is the number of clusters. We assume the set of joint candidates casting votes into each cluster corresponds to a joint partition $z_m$, defined by

$$z_m = \{x_i : T(x_i) \in C_m\}. \quad (5)$$
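The test-time partition generation, peak finding on the confidence maps followed by clustering of the centroid votes, can be sketched as below. This is a minimal illustration, not the authors' implementation: `find_candidates` uses a 3x3 local-maximum check as a stand-in for the non-maximum suppression of Sec. 3.2, and `partition_by_votes` uses a simple greedy distance-threshold grouping in place of the Agglomerative Clustering [2] used in the paper; the `merge_dist` threshold is a hypothetical parameter.

```python
import math
import numpy as np

def find_candidates(conf, thresh=0.1):
    """Joint candidates for one joint category: peaks of the predicted
    confidence map above `thresh` that are local maxima of their 3x3
    neighborhood (a minimal stand-in for non-maximum suppression)."""
    cands = []
    h, w = conf.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = conf[y, x]
            if v > thresh and v >= conf[y - 1:y + 2, x - 1:x + 2].max():
                cands.append((x, y))
    return cands

def partition_by_votes(candidates, votes, merge_dist=20.0):
    """Group joint candidates into person partitions z_m by clustering
    their centroid votes. `votes[i]` is the predicted centroid location
    for `candidates[i]`; a vote joins the nearest cluster closer than
    `merge_dist`, whose centroid is kept as the running mean of its votes."""
    clusters = []  # list of [centroid, member_indices]
    for i, v in enumerate(votes):
        best, best_d = None, merge_dist
        for k, (c, _) in enumerate(clusters):
            d = math.dist(v, c)
            if d < best_d:
                best, best_d = k, d
        if best is None:
            clusters.append([v, [i]])
        else:
            c, members = clusters[best]
            members.append(i)
            n = len(members)
            clusters[best][0] = tuple((cc * (n - 1) + vv) / n
                                      for cc, vv in zip(c, v))
    return [[candidates[i] for i in members] for _, members in clusters]
```

Candidates whose votes land in the same cluster form one joint partition, mirroring Eqn. (5).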
3.4 Local Greedy Inference for Pose Estimation
According to Eqn. (3), we maximize the conditional probability by minimizing the energy function in Eqn. (4). We optimize it in two sequential steps: 1) generate the joint partition set based on the joint candidates; 2) conduct joint configuration inference within each joint partition locally, which reduces the joint configuration complexity and overcomes the drawback of bottom-up approaches.
After obtaining the joint partitions according to Eqn. (5), the partition score $\phi(\mathbf{z}, \mathbf{x}, I)$ becomes a constant. Let $\{z_m\}_{m=1}^{M}$ denote the generated partition set. The optimization then simplifies to

$$\min_{\mathbf{c}, \mathbf{g}}\; \sum_{m=1}^{M} \Big( -\sum_{i \in z_m} \psi(x_i, c_i) - \sum_{i, j \in z_m,\, i \neq j} \eta(x_i, c_i, x_j, c_j) \Big). \quad (6)$$

Pose estimation in each joint partition is independent, so inference over different joint partitions decouples. We propose the following local greedy inference algorithm to solve Eqn. (6) for multi-person pose estimation. Given a joint partition $z_m$, the unary term $\psi(x_i, c_i)$ is the confidence score at $x_i$ from the $c_i$-th joint detector: $\psi(x_i, c_i) = S_{c_i}(x_i)$. The binary term is the similarity score of the votes of two joint candidates based on the global affinity cues in the embedding space:

$$\eta(x_i, c_i, x_j, c_j) = \exp\big(-\|v_i - v_j\|_2\big),$$

where $v_i = T(x_i)$ and $v_j = T(x_j)$.
For efficient inference of Eqn. (6), we adopt a greedy strategy which guarantees the energy monotonically decreases and eventually converges to a lower bound. Specifically, we iterate through the joints one by one, first considering joints around the torso and then moving out to the limbs. We start the inference with the neck. For a neck candidate, we use its embedding point in $\mathcal{H}$ to initialize the centroid of its person instance. Then, we select the head-top candidate closest to the person centroid and associate it with the same person as the neck candidate. After that, we update the person centroid by averaging the derived hypotheses. We loop through all other joint candidates similarly. Finally, we obtain a person instance and its associated joints. After using the neck as the root for inferring the joint configurations of person instances, if some candidates remain unassigned, we use joints from the torso, and then from the limbs, as the root to infer the remaining person instances. The inference terminates once all candidates are associated with persons. See details in Algorithm 1.
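As a concrete illustration of this greedy procedure, the following sketch assembles person instances from per-category candidates and their centroid votes. It is a simplified reading of Algorithm 1 under stated assumptions: `JOINT_ORDER` is a hypothetical torso-to-limb ordering, each person takes at most one candidate per category, and the centroid is updated as the running mean of the votes gathered so far.

```python
import math

# Hypothetical joint ordering: torso joints first, then limbs (Sec. 3.4).
JOINT_ORDER = ['neck', 'head_top', 'shoulder', 'elbow', 'wrist']

def greedy_assemble(candidates, votes):
    """Assemble person instances greedily.

    `candidates[j]` lists the detected positions of joint category j and
    `votes[j]` their centroid votes in the embedding space. Starting from
    each unused root (neck first), we attach, for every other joint
    category, the unused candidate whose vote lies closest to the running
    person centroid, then update the centroid by averaging.
    """
    used = {j: [False] * len(candidates[j]) for j in candidates}
    persons = []
    for root in JOINT_ORDER:
        for r, taken in enumerate(used.get(root, [])):
            if taken:
                continue
            used[root][r] = True
            person = {root: candidates[root][r]}
            centroid = votes[root][r]
            n = 1
            for j in JOINT_ORDER:
                if j == root or j not in candidates:
                    continue
                free = [i for i, t in enumerate(used[j]) if not t]
                if not free:
                    continue
                i = min(free, key=lambda i: math.dist(votes[j][i], centroid))
                used[j][i] = True
                person[j] = candidates[j][i]
                # running mean of the centroid hypotheses derived so far
                n += 1
                centroid = tuple((c * (n - 1) + v) / n
                                 for c, v in zip(centroid, votes[j][i]))
            persons.append(person)
    return persons
```

Any candidate left unassigned after the neck pass is picked up when its own category is later tried as the root, mirroring the torso-then-limbs fallback described above.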
4 Learning Joint Detector and Dense Regressor with CNNs
GPN is a generic model compatible with various CNN architectures; extensive architecture engineering is beyond the scope of this work. We simply choose the state-of-the-art Hourglass network [19] as the backbone of GPN. The Hourglass network consists of a sequence of Hourglass modules. As shown in Figure 4, each Hourglass module first learns downsized feature maps from the input image, and then recovers full-resolution feature maps through upsampling for precise joint localization. In particular, each Hourglass module is implemented as a fully convolutional network. Skip connections are added symmetrically between feature maps of the same resolution to capture information at every scale. Multiple Hourglass modules are stacked sequentially to gradually refine the predictions by re-integrating previous estimation results. Intermediate supervision is applied at each Hourglass module.
The Hourglass network was proposed for single-person pose estimation; GPN extends it to the multi-person case. GPN introduces modules enabling simultaneous joint detection (Sec. 3.2) and dense joint-centroid regression (Sec. 3.3), as shown in Figure 4. In particular, GPN utilizes the Hourglass module to learn image representations and then splits into two branches: one produces the dense regression maps for detecting person centroids, via a convolution on the feature maps from the Hourglass module followed by another convolution for classification; the other produces the joint detection confidence maps. With this design, GPN obtains joint detections and partitions in one feed-forward pass. When using multiple Hourglass stages, GPN feeds the predicted dense regression maps at every stage into the next one through a convolution, and then combines these intermediate features with the features from the previous stage.
For training GPN, we use an $L_2$ loss to learn both the joint detection and dense regression branches, with supervision at each stage. The losses are defined as

$$L^{t}_{\mathrm{det}} = \|S^{t} - S^{*}\|_2^2, \qquad L^{t}_{\mathrm{reg}} = \|R^{t} - R^{*}\|_2^2,$$

where $S^{t}$ and $R^{t}$ represent the predicted joint confidence maps and dense regression maps at the $t$-th stage, respectively. The ground truths $S^{*}$ and $R^{*}$ are constructed as in Sec. 3.2 and 3.3. The total loss is given by $L = \sum_{t=1}^{T} (L^{t}_{\mathrm{det}} + \lambda L^{t}_{\mathrm{reg}})$, where $T$ is the number of Hourglass modules (stages) used in our implementation and the weighting factor $\lambda$ is set empirically.
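A minimal sketch of the training targets and the stage-wise loss: `joint_confidence_map` builds the Gaussian-peak target of Sec. 3.2 with max aggregation, and `gpn_loss` sums the L2 losses of both branches over all stages (intermediate supervision). The function names and the `reg_weight` placeholder standing in for the paper's weighting factor are assumptions.

```python
import numpy as np

def joint_confidence_map(h, w, joints, sigma=7.0):
    """Target confidence map for one joint category: one Gaussian peak per
    person, merged with a max (not a mean) so close-by peaks stay distinct."""
    ys, xs = np.mgrid[0:h, 0:w]
    conf = np.zeros((h, w))
    for jx, jy in joints:  # ground-truth (x, y) of this joint per person
        peak = np.exp(-((xs - jx) ** 2 + (ys - jy) ** 2) / sigma ** 2)
        conf = np.maximum(conf, peak)
    return conf

def gpn_loss(pred_conf, pred_reg, gt_conf, gt_reg, reg_weight=1.0):
    """Sum of per-stage L2 losses of the detection and regression branches;
    `pred_conf[t]` / `pred_reg[t]` are the stage-t predictions."""
    total = 0.0
    for s_pred, r_pred in zip(pred_conf, pred_reg):
        total += np.sum((s_pred - gt_conf) ** 2)
        total += reg_weight * np.sum((r_pred - gt_reg) ** 2)
    return total
```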
5 Experiments
5.1 Experimental Setup
Datasets
We evaluate the proposed GPN on three widely adopted benchmarks: the MPII Human Pose Multi-Person (MPII) dataset [1], the extended PASCAL-Person-Part dataset [28], and the "We Are Family" (WAF) dataset [7]. The MPII dataset consists of 3,844 and 1,758 groups of multiple interacting persons for training and testing, respectively. Each person in an image is annotated with 16 body joints. It also provides more than 28,000 training samples for single-person pose estimation. The extended PASCAL-Person-Part dataset contains 3,533 challenging images from the original PASCAL-Person-Part dataset [4], split into 1,716 for training and 1,817 for testing. Each person is annotated with 14 body joints, shared with the MPII dataset but without pelvis and thorax. The WAF dataset contains 525 web images (350 for training and 175 for testing). Each person is annotated with 6 line segments for the upper body.
Data Augmentation
We follow conventional practice to augment training samples by cropping the original images around the person center. In particular, we augment each training sample with random rotation, scaling and translation, as well as horizontal mirroring, and resize each sample to a fixed resolution with padding.
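The keypoint side of this augmentation can be sketched as follows; rotation, scaling and mirroring are applied about the person center as described, while the concrete sampling ranges (not reproduced here) are left to the caller. All names are illustrative.

```python
import math

def augment_keypoints(points, center, angle_deg, scale, mirror, img_w):
    """Apply one sampled augmentation to ground-truth joint coordinates.

    Rotation and scaling are performed about the person center, matching
    the person-centric cropping described above. `points` is a list of
    (x, y) joints; `mirror` flips horizontally across the image width.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    cx, cy = center
    out = []
    for x, y in points:
        # rotate and scale about the person center
        dx, dy = x - cx, y - cy
        x2 = cx + scale * (cos_a * dx - sin_a * dy)
        y2 = cy + scale * (sin_a * dx + cos_a * dy)
        if mirror:
            x2 = img_w - 1 - x2
        out.append((x2, y2))
    return out
```

The same rotation, scale, and flip must of course be applied to the image pixels so the coordinates stay aligned.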
Implementation
For the MPII dataset, we reserve 350 images randomly selected from the training set for validation. We use the remaining training images and all the provided single-person samples to train GPN for 250 epochs. For evaluation on the other two datasets, we follow common practice and fine-tune the GPN model pre-trained on MPII for 30 epochs. To deal with extreme cases where the centroids of persons overlap, we slightly perturb the centroids by adding small offsets to separate them. We implement our model with PyTorch [21] and adopt RMSProp [26] for optimization. The initial learning rate is 0.0025 and is decreased by a factor of 0.5 at the 150th, 170th, 200th and 230th epochs. In testing, we follow convention and crop image patches using the given position and average person scale of the test images, resizing and padding the cropped samples to the network input size. We search for suitable image scales over 5 different choices. Specially, when testing on MPII, following previous works [3, 18], we apply a single-person model [19] trained on MPII to refine the estimations. We use the standard Average Precision (AP) as the performance metric on all datasets, as suggested by [12, 28]. We will make the code and pre-trained models available.

Table 1: Results (AP) on the MPII Multi-Person test set.

Method  Head  Sho.  Elb.  Wri.  Hip  Knee  Ank.  Total  Time [s]

Iqbal and Gall [13]  58.4  53.9  44.5  35.0  42.2  36.7  31.1  43.1  10 
Insafutdinov et al. [12]  78.4  72.5  60.2  51.0  57.2  52.0  45.4  59.5  485 
Levinkov et al. [16]  89.8  85.2  71.8  59.6  71.1  63.0  53.5  70.6   
Insafutdinov et al. [11]  88.8  87.0  75.9  64.9  74.2  68.8  60.5  74.3   
Cao et al. [3]  91.2  87.6  77.7  66.8  75.4  68.9  61.7  75.6  1.24 
Fang et al. [8]  88.4  86.5  78.6  70.4  74.4  73.0  65.8  76.7  1.5 
Newell and Deng [18]  92.1  89.3  78.9  69.8  76.2  71.6  64.7  77.5   
GPN (Ours)  92.2  89.7  82.1  74.4  78.6  76.4  69.3  80.4  0.77 
Table 2: Results (AP) on the extended PASCAL-Person-Part test set.

Method  Head  Sho.  Elb.  Wri.  Hip  Knee  Ank.  Total

Chen and Yuille [5]  45.3  34.6  24.8  21.7  9.8  8.6  7.7  21.8 
Insafutdinov et al. [12]  41.5  39.3  34.0  27.5  16.3  21.3  20.6  28.6 
Xia et al. [28]  58.0  52.1  43.1  37.2  22.1  30.8  31.1  39.2
GPN (Ours)  66.9  60.0  51.4  48.9  29.2  36.4  33.5  46.6 
Table 3: Results (AP) on the WAF test set.

Method  Head  Shoulder  Elbow  Wrist  Total

Chen and Yuille [5]  83.3  56.1  46.3  35.5  55.3
Pishchulin et al. [22]  76.6  80.8  73.7  73.6  76.2 
Insafutdinov et al. [12]  92.6  81.1  75.7  78.8  82.0 
GPN (Ours)  93.1  82.9  83.5  79.9  84.8 
5.2 Results and Analysis
MPII
Table 1 shows the evaluation results on the full MPII test set. The proposed GPN achieves 80.4 overall AP, significantly outperforming the previous state of the art of 77.5 AP [18]. In addition, GPN improves the localization of all joints consistently. In particular, it brings remarkable improvements on the more difficult joints affected by occlusion and high degrees of freedom, including wrists (74.4 vs 69.8 AP), ankles (69.3 vs 64.7 AP), and knees (a 4.8 absolute AP increase over [18]), confirming the robustness of the proposed generative model and global affinity cues to these distracting factors. These results clearly show that GPN is highly effective for multi-person pose estimation. We also report the computational speed of GPN in Table 1 (runtime is measured on an Intel i7-5820K 3.3GHz CPU and a TITAN X (Pascal) GPU, counted with 5-scale testing and excluding the refinement by single-person pose estimation). GPN is about 1.6 times faster than the previously fastest bottom-up approach [3] (0.77s vs 1.24s). This demonstrates the efficiency of performing joint detection and partition simultaneously in our model.
PASCAL-Person-Part
Table 2 shows the evaluation results. GPN provides a 7.4 absolute AP improvement (46.6 vs 39.2 AP) over the state of the art [28]. Moreover, GPN brings significant improvements on difficult joints, such as the wrist (48.9 vs 37.2 AP). These results further demonstrate the effectiveness and robustness of our model for multi-person pose estimation.
WAF
As shown in Table 3, GPN achieves 84.8 overall AP, a 2.8 AP improvement over the best bottom-up approach [12]. GPN achieves the best performance on all upper-body joints. In particular, it gives the most significant improvement on the elbow, about 7.8 AP higher than the previous best result (83.5 vs 75.7 AP). These results verify the effectiveness of the proposed GPN for tackling the multi-person pose estimation problem.
Qualitative Results
Figure 5 visualizes some pose estimation results on the three datasets. The proposed GPN model estimates multi-person poses accurately and robustly even in challenging scenarios, e.g., joint occlusion caused by the person of interest and other overlapping persons in the first example from the MPII dataset, large pose variation in the second example from the extended PASCAL-Person-Part dataset, and appearance and illumination changes in the fourth example from the WAF dataset. These results also verify the effectiveness of GPN in producing reliable joint detections and partitions for multi-person pose estimation.
Table 4: Ablation analysis on the MPII validation set.

Method  Head  Sho.  Elb.  Wri.  Hip  Knee  Ank.  Total  Infer. Time [ms]

GPN-Full  94.4  90.0  81.3  72.1  77.8  72.7  64.7  79.0  1.9
GPN-w/o-Partition  93.2  89.3  79.9  70.1  78.8  73.1  65.7  78.6  3.4
GPN-w/o-LGI  93.1  89.1  79.5  68.5  79.0  71.4  64.4  77.8
GPN-w/o-Refinement  90.4  86.8  79.3  69.8  77.5  69.3  61.9  76.4
GPN  91.0  87.1  78.6  70.2  76.7  70.5  60.0  76.3
GPN-Vanilla  90.5  86.4  77.1  69.4  72.2  67.7  60.2  74.8
5.3 Ablation Analysis
We conduct an ablation analysis of the proposed GPN model on the MPII validation set. We evaluate multiple variants of GPN obtained by removing certain components from the full model ("GPN-Full"). "GPN-w/o-Partition" performs inference on the whole image without using the obtained joint partition information, similar to pure bottom-up approaches. "GPN-w/o-LGI" removes the local greedy inference phase and allocates joint candidates to persons by finding the most activated position for each joint in each joint partition; this is similar to top-down approaches. "GPN-w/o-Refinement" does not perform refinement with the single-person pose estimator. We use "GPN" to denote single-scale testing with refinement and "GPN-Vanilla" to denote single-scale testing without refinement.
From Table 4, "GPN-Full" achieves 79.0 AP, and the joint partition inference costs only 1.9 ms, which is very efficient. "GPN-w/o-Partition" achieves slightly lower AP (78.6) with slower inference (3.4 ms). These results confirm the effectiveness of generating joint partitions in GPN: inference within each joint partition individually reduces complexity and improves pose estimation over multiple persons. Removing the local greedy inference phase as in "GPN-w/o-LGI" decreases the performance to 77.8 AP, showing that local greedy inference benefits pose estimation by effectively handling false alarms among joint candidates based on the global affinity cues in the embedding space. Comparing "GPN-w/o-Refinement" (76.4 AP) with the full model demonstrates that single-person pose estimation can refine joint localization. "GPN-Vanilla" achieves 74.8 AP, demonstrating the stability of the proposed approach even without refinement and multi-scale testing.
Figure 6: (a) Ablation study on the multi-stage Hourglass network. (b) Confusion matrix of the person number inferred by the generative partition (Sec. 3.3) against the ground truth. Best viewed in color.

We also evaluate the pose estimation results from the 4 stages of the GPN model and plot them in Figure 6 (a). The performance increases monotonically as more stages are traversed, with the final (4th) stage clearly improving over the first stage. This is because the proposed GPN can recurrently correct errors in the dense regression maps and the joint confidence maps conditioned on previous estimations in the multi-stage design, yielding gradual improvement in the joint detections and partitions for multi-person pose estimation.
Finally, we evaluate the effectiveness of the generative partition model for partitioning person instances. In particular, we evaluate how well the produced partitions match the real number of persons. The confusion matrix is shown in Figure 6 (b). The proposed generative partition model predicts a number of persons very close to the ground truth, with small mean squared error.
6 Conclusion
We presented the Generative Partition Network (GPN) to efficiently and effectively address the challenging multi-person pose estimation problem. GPN solves the problem by simultaneously detecting and partitioning joints for multiple persons. It introduces a new approach to generating partitions by inferring over joint candidates in an embedding space parameterized by person centroids. Moreover, GPN introduces a local greedy inference approach to estimate the poses of person instances by utilizing the partition information. We demonstrated that GPN provides appealing efficiency for both joint detection and partition, and that it significantly overcomes the limitations of pure top-down and bottom-up solutions on three benchmark multi-person pose estimation datasets.
References
 [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
 [2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV, 2009.
 [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
 [4] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. L. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
 [5] X. Chen and A. L. Yuille. Parsing occluded people by flexible compositions. In CVPR, 2015.
 [6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B, 39(1):1–38, 1977.
 [7] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In ECCV, 2010.
 [8] H. Fang, S. Xie, Y. Tai, and C. Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
 [9] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [11] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, B. Andres, and B. Schiele. Articulated multi-person tracking in the wild. In CVPR, 2017.
 [12] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
 [13] U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In ECCV, 2016.
 [14] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
 [15] L. Ladicky, P. H. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, 2013.
 [16] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR, 2017.
 [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
 [18] A. Newell and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
 [19] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
 [20] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
 [21] A. Paszke, S. Gross, and S. Chintala. PyTorch, 2017.
 [22] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi-person pose estimation. In CVPR, 2016.
 [23] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR, 2012.
 [24] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [25] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In ICCV, 2011.
 [26] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 [27] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
 [28] F. Xia, P. Wang, X. Chen, and A. L. Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, 2017.