Generative Partition Networks for Multi-Person Pose Estimation

05/21/2017 ∙ by Xuecheng Nie, et al. ∙ National University of Singapore ∙ Qihoo 360 Technology Co. Ltd.

This paper proposes a new Generative Partition Network (GPN) to address the challenging multi-person pose estimation problem. Different from existing models that are either completely top-down or bottom-up, the proposed GPN introduces a novel strategy: it generates partitions for multiple persons from their global joint candidates and infers instance-specific joint configurations simultaneously. GPN features low complexity and high accuracy in joint detection and re-organization. In particular, GPN designs a generative model that performs one feed-forward pass to efficiently generate robust person detections with joint partitions, relying on dense regressions from global joint candidates in an embedding space parameterized by centroids of persons. In addition, GPN formulates the inference procedure for joint configurations of human poses as a graph partition problem, and conducts local optimization for each person detection with reliable global affinity cues, leading to complexity reduction and performance improvement. GPN is implemented with the Hourglass architecture as the backbone network to simultaneously learn the joint detector and the dense regressor. Extensive experiments on the MPII Human Pose Multi-Person, extended PASCAL-Person-Part, and WAF benchmarks show the efficiency of GPN, together with new state-of-the-art performance.




1 Introduction

(a) Generative Partition (b) Local Inference
Figure 1: Generative Partition Networks for multi-person pose estimation. (a) Generative partition. GPN models person detection and joint partition as a generative process inferred from joint candidates. (b) Local inference. GPN performs local inference for joint configurations conditioned on the generated person detections with joint partitions.

Multi-person pose estimation aims to localize body joints of multiple persons captured in a 2D monocular image [7, 22]. Despite extensive prior research, this problem remains very challenging due to the highly complex joint configuration, partial or even complete joint occlusion, significant overlap between neighboring persons, unknown number of persons and more critically the difficulties in allocating joints to multiple persons. These challenges feature the unique property of multi-person pose estimation compared with the simpler single-person setting [19, 27]. To tackle these challenges, existing multi-person pose estimation approaches usually perform joint detection and partition separately, mainly following two different strategies. The top-down strategy [7, 8, 13, 20, 23] first detects persons and then performs pose estimation for each single person individually. The bottom-up strategy [3, 11, 12, 15, 16, 22], in contrast, generates all joint candidates at first, and then tries to partition them to corresponding person instances.

The top-down approaches directly leverage existing person detection models [17, 24] and single-person pose estimation methods [19, 27]. Thus they effectively avoid complex joint partitions. However, their performance is critically limited by the quality of person detections. If the employed person detector fails to detect a person instance accurately (due to occlusion, overlapping or other distracting factors), the introduced errors cannot be remedied and would severely harm performance of the following pose estimation. Moreover, they suffer from high joint detection complexity, which linearly increases with the number of persons in the image, because they need to run the single-person joint detector for each person detection sequentially.

In contrast, the bottom-up approaches first detect all joint candidates by applying a joint detector globally only once, and then partition them into corresponding persons according to joint affinities. Hence, they enjoy lower joint detection complexity than the top-down ones and robustness to errors from early commitment. However, they suffer from very high complexity of partitioning joints into corresponding persons, which usually involves solving NP-hard graph partition problems [12, 22] on densely connected graphs covering the whole image.

In this paper, we propose a novel solution, termed the Generative Partition Network (GPN), to overcome essential limitations of the above two types of approaches and meanwhile inherit their strengths within a unified model for efficiently and effectively estimating poses of multiple persons in a given image. As shown in Figure 1, GPN solves multi-person pose estimation problem by simultaneously 1) modeling person detection for joint partition as a generative process inferred from all joint candidates and 2) performing local inference for obtaining joint categorizations and associations conditioned on the generated person detections.

In particular, GPN introduces a dense regression module to generate person detections with partitioned joints via votes from joint candidates in a carefully designed embedding space, which is efficiently parameterized by person centroids. This generative partition model produces joint candidates and partitions by running a joint detector for only one feed-forward pass, offering much higher efficiency than top-down approaches. In addition, the produced person detections from GPN are robust to various distracting factors, e.g., occlusion, overlapping, deformation, and large pose variation, benefiting the following pose estimation. GPN also introduces a local greedy inference algorithm by assuming independence among person detections for producing optimal multi-person joint configurations. This local optimization strategy reduces the search space of the graph partition problem for finding optimal poses, avoiding the high joint partition complexity challenging the bottom-up strategy. Moreover, the local greedy inference algorithm exploits reliable global affinity cues from the embedding space for inferring joint configurations within robust person detections, leading to performance improvement.

We implement GPN based on the Hourglass network [19] for simultaneously learning the joint detector and the dense regressor. Extensive experiments on the MPII Human Pose Multi-Person [1], extended PASCAL-Person-Part [28] and WAF [7] benchmarks clearly show the efficiency and effectiveness of the proposed GPN. Moreover, GPN achieves new state-of-the-art results on all these benchmarks.

We make the following contributions. 1) We propose a new one-pass solution to multi-person pose estimation, which differs from previous top-down and bottom-up ones. 2) We propose a novel dense regression module to efficiently and robustly partition body joints into multiple persons, which is the key to speeding up multi-person pose estimation. 3) In addition to high efficiency, GPN is also superior in terms of robustness and accuracy on multiple benchmarks.

2 Related Work

Top-Down Multi-Person Pose Estimation

Existing approaches following the top-down strategy sequentially perform person detection and single-person pose estimation. In [9], Gkioxari et al. proposed to adopt the Generalized Hough Transform framework to first generate person proposals and then classify joint candidates based on the poselets. Sun et al. [25] presented a hierarchical part-based model for joint person detection and pose estimation. Recently, deep learning techniques have been exploited to improve both person detection and single-person pose estimation. In [13], Iqbal and Gall adopted a Faster-RCNN [24] based person detector and a convolutional pose machine [27] based joint detector for this task. Later, Fang et al. [8] utilized the spatial transformer network [14] and the Hourglass network [19] to further improve the quality of joint detections and partitions. Despite remarkable success, these approaches suffer from early commitment and high joint detection complexity. In contrast, the proposed GPN adopts a one-pass generative process for efficiently producing person detections with partitioned joint candidates, offering robustness to early commitment as well as low joint detection complexity.

Bottom-Up Multi-Person Pose Estimation

The bottom-up strategy provides robustness to early commitment and low joint detection complexity. Previous bottom-up approaches [3, 12, 18, 22] mainly focus on improving either the joint detector or the joint affinity cues, benefiting the following joint partition and configuration inference. For the joint detector, fully convolutional neural networks, e.g., Residual networks [10] and Hourglass networks [19], have been widely exploited. As for joint affinity cues, Insafutdinov et al. [12] explored geometric and appearance constraints among joint candidates. Cao et al. [3] proposed part affinity fields to encode the location and orientation of limbs. Newell and Deng [18] presented the associative embedding for grouping joint candidates. Nevertheless, all these approaches partition joints by partitioning a graph covering the whole image, resulting in high inference complexity. In contrast, GPN performs local inference with robust global affinity cues, which are efficiently generated by dense regressions from the centroid embedding space, reducing the complexity of joint partition and improving pose estimation.

Figure 2: Overview of the proposed Generative Partition Network for multi-person pose estimation. Given an image, GPN first uses a CNN to predict (a) joint confidence maps and (b) dense joint-centroid regression maps. Then, GPN performs (c) centroid embedding for all joint candidates in the embedding space via dense regression, to produce (d) joint partitions within person detections. Finally, GPN conducts (e) local greedy inference to generate joint configurations for each joint partition locally, giving pose estimation results of multiple persons.

3 Approach

3.1 Generative Partition Model

The overall pipeline for the proposed Generative Partition Network (GPN) model is shown in Figure 2. Throughout the paper, we use the following notation. Let $I$ denote an image containing multiple persons, $\mathbf{p} = \{p_1, p_2, \ldots, p_N\}$ denote the spatial coordinates of $N$ joint candidates from all persons in $I$ with $p_v = (x_v, y_v)^\top$, and $\mathbf{u} = \{u_1, u_2, \ldots, u_N\}$ denote the labels of the corresponding joint candidates, in which $u_v \in \{1, 2, \ldots, K\}$ and $K$ is the number of joint categories. For allocating joints via local inference, we also consider the proximities between joints, denoted as $\mathbf{b} \in \mathbb{R}^{N \times N}$. Here $b_{(v,w)}$ encodes the proximity between the $v$th joint candidate $p_v$ and the $w$th joint candidate $p_w$, and gives the probability for them to be from the same person.

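The proposed GPN with learnable parameters $\Theta$ aims to solve the multi-person pose estimation task through learning to infer the conditional distribution $P(\mathbf{p}, \mathbf{u}, \mathbf{b} \mid I, \Theta)$. Namely, given the image $I$, GPN infers the joint locations $\mathbf{p}$, labels $\mathbf{u}$ and proximities $\mathbf{b}$ providing the largest likelihood. To this end, GPN adopts a generative model to simultaneously produce person detections with joint partitions implicitly and infer the joint configuration $\mathbf{u}$ and $\mathbf{b}$ for each person detection locally. In this way, GPN reduces the difficulty and complexity of multi-person pose estimation significantly. Formally, GPN introduces latent variables $\mathbf{g} = \{g_1, g_2, \ldots, g_M\}$ to encode joint partitions, where each $g_i$ is a collection of joint candidates belonging to a specific person detection (their labels are not considered) and $M$ is the number of joint partitions. With these latent variables $\mathbf{g}$, $P(\mathbf{p}, \mathbf{u}, \mathbf{b} \mid I, \Theta)$ can be factorized into

$$P(\mathbf{p}, \mathbf{u}, \mathbf{b} \mid I, \Theta) = \sum_{\mathbf{g}} P(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g} \mid I, \Theta) = \sum_{\mathbf{g}} P(\mathbf{p} \mid I, \Theta)\, P(\mathbf{g} \mid I, \Theta, \mathbf{p})\, P(\mathbf{u}, \mathbf{b} \mid I, \Theta, \mathbf{p}, \mathbf{g}), \tag{1}$$

where $P(\mathbf{g} \mid I, \Theta, \mathbf{p})$ models the generative process of joint partitions within person detections based on joint candidates. Maximizing the above likelihood gives the optimal pose estimation for multiple persons in $I$.

However, directly maximizing the above likelihood is computationally intractable. Instead of maximizing w.r.t. all possible partitions $\mathbf{g}$, we propose to maximize its lower bound induced by a single “optimal” partition, inspired by the EM algorithm [6]. Such an approximation reduces the complexity significantly without harming the performance. Concretely, based on Eqn. (1), we have

$$P(\mathbf{p}, \mathbf{u}, \mathbf{b} \mid I, \Theta) = P(\mathbf{p} \mid I, \Theta) \sum_{\mathbf{g}} P(\mathbf{g} \mid I, \Theta, \mathbf{p})\, P(\mathbf{u}, \mathbf{b} \mid I, \Theta, \mathbf{p}, \mathbf{g}) \;\ge\; P(\mathbf{p} \mid I, \Theta) \max_{\mathbf{g}} P(\mathbf{g} \mid I, \Theta, \mathbf{p})\, P(\mathbf{u}, \mathbf{b} \mid I, \Theta, \mathbf{p}, \mathbf{g}).$$

Here, we find the optimal solution by maximizing the above induced lower bound $P(\mathbf{p} \mid I, \Theta) \max_{\mathbf{g}} P(\mathbf{g} \mid I, \Theta, \mathbf{p})\, P(\mathbf{u}, \mathbf{b} \mid I, \Theta, \mathbf{p}, \mathbf{g})$, instead of maximizing the summation. The joint partitions disentangle independent joints and reduce inference complexity: only the joints falling in the same partition have non-zero proximities $\mathbf{b}$. Then $P(\mathbf{u}, \mathbf{b} \mid I, \Theta, \mathbf{p}, \mathbf{g})$ is further factorized as

$$P(\mathbf{u}, \mathbf{b} \mid I, \Theta, \mathbf{p}, \mathbf{g}) = \prod_{g_i \in \mathbf{g}} P(\mathbf{u}_{g_i}, \mathbf{b}_{g_i} \mid I, \Theta, \mathbf{p}, g_i), \tag{2}$$

where $\mathbf{u}_{g_i}$ denotes the labels of joints falling in the partition $g_i$ and $\mathbf{b}_{g_i}$ denotes their proximities. In the above probabilities, we define $P(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g} \mid I, \Theta)$ as a Gibbs distribution:

$$P(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g} \mid I, \Theta) \propto \exp\{-E(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g})\}, \tag{3}$$

where $E(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g})$ is the energy function for the joint distribution $P(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g} \mid I, \Theta)$. Its explicit form is derived from Eqn. (2) accordingly:

$$E(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g}) = -\varphi(\mathbf{g}, \mathbf{p}) - \sum_{g_i \in \mathbf{g}} \Big( \sum_{p_v \in g_i} \psi(p_v, u_v) + \sum_{p_v, p_w \in g_i} \phi(p_v, u_v, p_w, u_w) \Big). \tag{4}$$

Here, $\varphi(\mathbf{g}, \mathbf{p})$ scores the quality of joint partitions $\mathbf{g}$ generated from joint candidates $\mathbf{p}$ for the input image $I$, $\psi(p_v, u_v)$ scores how the position $p_v$ is compatible with label $u_v$, and $\phi(p_v, u_v, p_w, u_w)$ represents how likely the positions $p_v$ with label $u_v$ and $p_w$ with label $u_w$ belong to the same person, i.e., characterizing the proximity $b_{(v,w)}$. In the following subsections, we will give details for detecting joint candidates $\mathbf{p}$, generating optimal joint partitions $\mathbf{g}$, inferring joint configurations $\mathbf{u}$ and $\mathbf{b}$ along with the proposed algorithm to optimize the energy function.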

Figure 3: Centroid embedding via dense joint regression. (a) Centroid embedding results for persons in the image. (b) Construction of the regression target for a pixel in the image (Sec. 3.3).

3.2 Joint Candidate Detection

To reliably detect human body joints, we use confidence maps to encode the probabilities of joints being present at each position in the image. The joint confidence maps are constructed by modeling the joint locations as Gaussian peaks, as shown in Figure 2 (a). We use $C_j$ to denote the confidence map for the $j$th joint, with $C_j^i$ being the confidence map of the $j$th joint for the $i$th person. For a position $p_v$ in the given image, $C_j^i(p_v)$ is calculated by $C_j^i(p_v) = \exp(-\|p_v - p_j^i\|_2^2 / \sigma^2)$, where $p_j^i$ denotes the groundtruth position of the $j$th joint of the $i$th person, and $\sigma$ is an empirically chosen constant to control the variance of the Gaussian distribution, set as 7 in the experiments. The target confidence map, which the proposed GPN model learns to predict, is an aggregation of the peaks of all the persons in a single map. Here, we choose to take the maximum of the confidence maps rather than the average to preserve the distinction between close-by peaks [3], i.e. $C_j(p_v) = \max_i C_j^i(p_v)$. During testing, we first find peaks with confidence scores greater than a given threshold $\tau$ (set as 0.1) on the predicted confidence maps $\tilde{C}$ for all types of joints. Then we perform non-maximum suppression to find the joint candidate set $\tilde{\mathbf{p}} = \{p_1, p_2, \ldots, p_N\}$.
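The target construction and test-time candidate search described in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the Gaussian form `exp(-d^2 / sigma^2)`, the 3x3 local-maximum test used as non-maximum suppression, and all function names are assumptions made for the example; only `sigma = 7` and the threshold `0.1` come from the text.

```python
import math

def gaussian_confidence(pos, joint_pos, sigma=7.0):
    """Confidence contributed at pixel `pos` by one ground-truth joint (assumed Gaussian form)."""
    dx, dy = pos[0] - joint_pos[0], pos[1] - joint_pos[1]
    return math.exp(-(dx * dx + dy * dy) / (sigma * sigma))

def target_confidence_map(height, width, joint_positions, sigma=7.0):
    """Aggregate the per-person maps of one joint category with a pixel-wise maximum."""
    return [[max((gaussian_confidence((x, y), jp, sigma) for jp in joint_positions),
                 default=0.0)
             for x in range(width)]
            for y in range(height)]

def find_candidates(conf_map, threshold=0.1):
    """Return (x, y, score) for pixels above `threshold` that are strict 3x3 local maxima.

    The 3x3 test is a simple stand-in for non-maximum suppression (an assumption).
    """
    h, w = len(conf_map), len(conf_map[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            s = conf_map[y][x]
            if s <= threshold:
                continue
            neighbours = [conf_map[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))
                          if (ny, nx) != (y, x)]
            if all(s > n for n in neighbours):
                peaks.append((x, y, s))
    return peaks
```

Taking the maximum keeps two nearby joints as two separate peaks, whereas averaging would flatten both and could blur them into one.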

3.3 Joint Partition via Dense Regression

Our proposed joint partition model performs dense regression over all the joint candidates to localize the centroids of multiple persons and thereby partition joints into different person instances, as shown in Figure 2 (b) and (c). It learns to transform all the pixels belonging to a specific person to a single point in a carefully designed embedding space, where they are easy to cluster into corresponding persons. Such a dense regression framework enables generating joint partitions in one single feed-forward pass, avoiding the high joint detection complexity that troubles top-down solutions.

To this end, we build and parameterize the joint candidate embedding space by the person centroids, as the centroid is stable and reliable for discriminating person instances even in the presence of some extreme poses. We denote the constructed embedding space as $\mathcal{H}$. In $\mathcal{H}$, each person corresponds to a single point (i.e., the centroid), and each point $h_* \in \mathcal{H}$ represents a hypothesis about the centroid location of a specific person instance. An example is given in Figure 3 (a).

Joint candidates are densely transformed into $\mathcal{H}$ and can collectively determine the centroid hypotheses of their corresponding person instances, since they are tightly related in the view of articulated kinematics, as shown in Figure 2 (c). For instance, a candidate of the head joint would add votes for the presence of a person's centroid to the location just below it. A single candidate does not necessarily provide evidence for the exact centroid of a person instance, but the population of joint candidates can vote for the correct centroid with large probability and determine the joint partitions correctly. In particular, the probability of generating joint partition $g_*$ at location $h_*$ is calculated by summing the votes from different joint candidates together, i.e.

$$P(g_* \mid h_*) \propto \sum_{j} w_j \sum_{p_v \in \tilde{\mathbf{p}}} \mathbb{1}[\tilde{C}_j(p_v) \ge \tau]\, \exp\{-\|f_j(p_v) - h_*\|_2^2\},$$

where $\mathbb{1}[\cdot]$ is the indicator function and $w_j$ is the weight for the votes from the $j$th joint category. We set $w_j = 1$ for all joints, assuming all kinds of joints contribute equally to the localization of person instances in view of the unconstrained shapes of the human body and the uncertain presence of different joints. The function $f_j$ learns to densely transform every pixel in the image to the embedding space $\mathcal{H}$. For learning $f_j$, we build the target regression map $T_j^i$ for the $j$th joint of the $i$th person as follows:

$$T_j^i(p_v) = \begin{cases} (p_c^i - p_v) / Z & \text{if } p_v \in \mathcal{N}_j^i, \\ 0 & \text{otherwise,} \end{cases}$$

where $p_c^i$ denotes the centroid position of the $i$th person, $Z = \sqrt{H^2 + W^2}$ is the normalization factor, $H$ and $W$ are the height and width of image $I$, $\mathcal{N}_j^i = \{p_v : \|p_v - p_j^i\|_2 \le r\}$ denotes the neighbor positions of the $j$th joint of the $i$th person, and $r$ is a constant to define the neighborhood size, set as 7 in our experiments. An example is shown in Figure 3 (b) for the construction of the regression target of a pixel in a given image. Then, we define the target regression map $T_j$ for the $j$th joint as the average over all persons by

$$T_j(p_v) = \frac{1}{n_v} \sum_{i} T_j^i(p_v),$$

where $n_v$ is the number of non-zero vectors at position $p_v$ across all persons. During testing, after predicting the regression map $\tilde{T}_j$, we define the transformation function $f_j$ for position $p_v$ as $f_j(p_v) = p_v + Z\, \tilde{T}_j(p_v)$. After generating $P(g_* \mid h_*)$ for each point $h_*$ in the embedding space, we calculate the score $\varphi(\mathbf{g}, \mathbf{p})$ as $\varphi(\mathbf{g}, \mathbf{p}) = \sum_i \log P(g_i \mid h_i)$.
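To make the relation between the regression target and the test-time transformation concrete, the following sketch encodes a pixel's offset to its person centroid and decodes it back into a centroid hypothesis. It is an illustration under stated assumptions: the normalization factor is taken to be the image diagonal, the neighborhood is a disc of radius 7 around the joint, and the function names are hypothetical.

```python
import math

def regression_target(pos, joint_pos, centroid, height, width, radius=7.0):
    """Normalised offset from pixel `pos` to the person centroid.

    Pixels farther than `radius` from the joint get a zero vector.
    """
    if math.dist(pos, joint_pos) > radius:
        return (0.0, 0.0)
    z = math.hypot(height, width)  # assumed normalisation: image diagonal
    return ((centroid[0] - pos[0]) / z, (centroid[1] - pos[1]) / z)

def decode_centroid(pos, offset, height, width):
    """Map a pixel and its (predicted) offset to a centroid hypothesis in the embedding space."""
    z = math.hypot(height, width)
    return (pos[0] + z * offset[0], pos[1] + z * offset[1])

# Round trip: a pixel near the joint votes exactly for the centroid it was trained on.
offset = regression_target((12, 10), (10, 10), (50, 60), height=300, width=400)
vote = decode_centroid((12, 10), offset, height=300, width=400)
```

With a perfect regressor every pixel around every joint of a person decodes to the same point, which is what makes the later clustering step easy.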

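Then the problem of joint partition generation is converted to finding peaks in the embedding space $\mathcal{H}$. As there are no priors on the number of persons in the image, we adopt Agglomerative Clustering [2] to find peaks by clustering the votes, which can automatically determine the number of clusters. We denote the vote set as $\mathbf{h} = \{h_v \mid h_v = f_j(p_v),\ \tilde{C}_j(p_v) \ge \tau,\ p_v \in \tilde{\mathbf{p}}\}$, and use $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_M\}$ to denote the clustering result on $\mathbf{h}$, where $\mathcal{C}_k$ represents the $k$th cluster and $M$ is the number of clusters. We assume the set of joint candidates casting votes in each cluster corresponds to a joint partition $g_k$, defined by

$$g_k = \{p_v \mid p_v \in \tilde{\mathbf{p}},\ h_v \in \mathcal{C}_k\}. \tag{5}$$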
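The clustering step can be illustrated with a small, dependency-free agglomerative procedure. This is only a sketch: the text does not state the linkage criterion or the stopping rule, so centroid linkage and the `max_dist` threshold below are assumptions chosen for the example, and the function name is hypothetical.

```python
import math

def agglomerative_clusters(votes, max_dist=10.0):
    """Group 2-D votes by repeatedly merging the two closest clusters.

    Uses centroid linkage and stops when the closest pair is farther apart than
    `max_dist` (both are assumptions). Returns lists of vote indices, one per cluster.
    """
    clusters = [[i] for i in range(len(votes))]

    def centroid(cluster):
        return (sum(votes[i][0] for i in cluster) / len(cluster),
                sum(votes[i][1] for i in cluster) / len(cluster))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_dist:
            break  # remaining clusters are well separated: one per person
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Five votes from two persons; the number of clusters is not given in advance.
votes = [(50, 60), (51, 59), (49, 61), (120, 80), (121, 82)]
partitions = agglomerative_clusters(votes)
```

Each returned index list plays the role of one joint partition: the joint candidates whose votes landed in the same cluster.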

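input : joint candidates $\tilde{\mathbf{p}}$, person partitions $\tilde{\mathbf{g}}$, joint confidence maps $\tilde{C}$, dense regression maps $\tilde{T}$, threshold $\tau$.
output : multi-person pose estimation $R$
for each joint partition $g_i \in \tilde{\mathbf{g}}$ do
        while $g_i$ contains unassigned joint candidates do
               Initialize single-person pose estimation $P$
               for $j$th joint category, $j = 1$ to $K$ do
                      if $P$ has no root joint yet then
                             Find root joint candidate of category $j$ in $g_i$ for $P$ and initialize the centroid $c$ with its embedding point
                      else
                             Find joint candidate of category $j$ in $g_i$ closest to centroid $c$ in the embedding space
                      end if
                      if a candidate with confidence at least $\tau$ was found then
                             Update $P$ with the candidate, remove it from $g_i$, and update $c$ by averaging the person centroid hypotheses
                      end if
               end for
               Add $P$ to $R$
        end while
end for
Algorithm 1 Local greedy inference for multi-person pose estimation.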

3.4 Local Greedy Inference for Pose Estimation

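According to Eqn. (3), we maximize the conditional probability $P(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g} \mid I, \Theta)$ by minimizing the energy function $E(\mathbf{p}, \mathbf{u}, \mathbf{b}, \mathbf{g})$ in Eqn. (4). We optimize $E$ in two sequential steps: 1) generate the joint partition set based on joint candidates; 2) conduct joint configuration inference in each joint partition locally, which reduces the joint configuration complexity and overcomes the drawback of bottom-up approaches.

After getting the joint partitions according to Eqn. (5), the score $\varphi(\mathbf{g}, \mathbf{p})$ becomes a constant. Let $\tilde{\mathbf{g}}$ denote the generated partition set. The optimization is then simplified as

$$(\tilde{\mathbf{u}}, \tilde{\mathbf{b}}) = \arg\min_{\mathbf{u}, \mathbf{b}} \Big( -\sum_{g_i \in \tilde{\mathbf{g}}} \Big( \sum_{p_v \in g_i} \psi(p_v, u_v) + \sum_{p_v, p_w \in g_i} \phi(p_v, u_v, p_w, u_w) \Big) \Big). \tag{6}$$

Pose estimation in each joint partition is independent, thus inference over different joint partitions becomes separate. We propose the following local greedy inference algorithm to solve Eqn. (6) for multi-person pose estimation. Given a joint partition $g_i$, the unary term $\psi(p_v, u_v)$ is the confidence score at $p_v$ from the $u_v$th joint detector: $\psi(p_v, u_v) = \tilde{C}_{u_v}(p_v)$. The binary term $\phi(p_v, u_v, p_w, u_w)$ is the similarity score of the votes of two joint candidates based on the global affinity cues in the embedding space:

$$\phi(p_v, u_v, p_w, u_w) = \exp\{-\|h_v - h_w\|_2^2\},$$

where $h_v = f_{u_v}(p_v)$ and $h_w = f_{u_w}(p_w)$.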

For efficient inference in Eqn. (6), we adopt a greedy strategy which guarantees that the energy decreases monotonically and eventually converges to a lower bound. Specifically, we iterate through the joints one by one, first considering joints around the torso and then moving out to the limbs. We start the inference with the neck. For a neck candidate, we use its embedding point in $\mathcal{H}$ to initialize the centroid of its person instance. Then, we select the head top candidate closest to the person centroid and associate it with the same person as the neck candidate. After that, we update the person centroid by averaging the derived hypotheses. We loop through all other joint candidates similarly. Finally, we get a person instance and its associated joints. After utilizing the neck as the root for inferring joint configurations of person instances, if some candidates remain unassigned, we utilize joints from the torso, then from the limbs, as the root to infer the person instance. After all candidates find their associations to persons, the inference terminates. See details in Algorithm 1.
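The greedy procedure described above can be sketched as follows for a single joint partition. This is an illustration, not the authors' code: the `Candidate` record, the choice of the most confident candidate as the root, and the absence of any distance or confidence gating are assumptions made to keep the example short.

```python
import math
from typing import NamedTuple

class Candidate(NamedTuple):
    joint: str     # joint category, e.g. "neck"
    pos: tuple     # (x, y) position in the image
    score: float   # detector confidence
    vote: tuple    # (x, y) centroid hypothesis in the embedding space

def greedy_poses(partition, joint_order):
    """Group the candidates of one joint partition into single-person poses."""
    remaining = list(partition)
    poses = []
    while remaining:
        pose, votes, centroid = {}, [], None
        for joint in joint_order:  # torso joints first, limbs last
            options = [c for c in remaining if c.joint == joint]
            if not options:
                continue
            if centroid is None:
                # Root joint: take the most confident candidate (assumption).
                chosen = max(options, key=lambda c: c.score)
            else:
                # Otherwise take the candidate whose vote is closest to the centroid.
                chosen = min(options, key=lambda c: math.dist(c.vote, centroid))
            pose[joint] = chosen.pos
            remaining.remove(chosen)
            votes.append(chosen.vote)
            # Update the centroid as the mean of the hypotheses collected so far.
            centroid = (sum(v[0] for v in votes) / len(votes),
                        sum(v[1] for v in votes) / len(votes))
        if not pose:  # no remaining candidate matches joint_order
            break
        poses.append(pose)
    return poses
```

Because each person is resolved against a running centroid instead of a fully connected graph, the cost grows with the number of candidates in the partition, not with all candidate pairs in the image.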

Figure 4: Architecture of Generative Partition Network. Its backbone is an Hourglass module (in blue block), followed by two branches: joint detection (in green block) and dense regression for joint partition (in yellow block).

4 Learning Joint Detector and Dense Regressor with CNNs

GPN is a generic model and compatible with various CNN architectures. Extensive architecture engineering is out of the scope of this work. We simply choose the state-of-the-art Hourglass network [19] as the backbone of GPN. Hourglass network consists of a sequence of Hourglass modules. As shown in Figure 4, each Hourglass module first learns down-sized feature maps from the input image, and then recovers full-resolution feature maps through up-sampling for precise joint localization. In particular, each Hourglass module is implemented as a fully convolutional network. Skipping connections are added between feature maps with the same resolution symmetrically to capture information at every scale. Multiple Hourglass modules are stacked sequentially for gradually refining the predictions via reintegrating the previous estimation results. Intermediate supervision is applied at each Hourglass module.

Hourglass network was proposed for single-person pose estimation. GPN extends it to multi-person cases. GPN introduces modules enabling simultaneous joint detection (Sec. 3.2) and dense joint-centroid regression (Sec. 3.3), as shown in Figure 4. In particular, GPN utilizes the Hourglass module to learn image representations and then separates into two branches: one produces the dense regression maps for detecting person centroids, via one convolution on feature maps from the Hourglass module and another convolution for classification; the other branch produces joint detection confidence maps. With this design, GPN obtains joint detection and partition in one feed-forward pass. When using multi-stage Hourglass modules, GPN feeds the predicted dense regression maps at every stage into the next one through convolution, and then combines intermediate features with features from the previous stage.

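For training GPN, we use an $\ell_2$ loss to learn both the joint detection and dense regression branches with supervision at each stage. The losses are defined as

$$L_{\text{joint}}^{t} = \sum_{j} \sum_{p_v} \|\tilde{C}_j^{t}(p_v) - C_j(p_v)\|_2^2, \qquad L_{\text{regression}}^{t} = \sum_{j} \sum_{p_v} \|\tilde{T}_j^{t}(p_v) - T_j(p_v)\|_2^2,$$

where $\tilde{C}^{t}$ and $\tilde{T}^{t}$ represent the predicted joint confidence maps and dense regression maps at the $t$th stage, respectively. The groundtruth $C$ and $T$ are constructed as in Sec. 3.2 and 3.3 respectively. The total loss is given by $L = \sum_{t=1}^{S} (L_{\text{joint}}^{t} + \alpha L_{\text{regression}}^{t})$, where $S$ is the number of Hourglass modules (stages) used in our implementation and the weighting factor $\alpha$ is set empirically.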
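As a rough illustration of intermediate supervision, the sketch below sums a detection term and a weighted regression term over all stages. It assumes a squared-error loss on flattened maps; the function names and the default value of `alpha` are placeholders for the example, not values taken from the paper.

```python
def squared_error(pred, target):
    """Sum of squared differences between two flattened maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def total_loss(stage_conf_preds, stage_reg_preds, conf_target, reg_target, alpha=1.0):
    """Add up the per-stage losses; every stage is supervised with the same targets.

    `alpha` weights the regression term (placeholder value, not from the paper).
    """
    return sum(squared_error(conf, conf_target) + alpha * squared_error(reg, reg_target)
               for conf, reg in zip(stage_conf_preds, stage_reg_preds))

# Two stages: the second stage predicts both maps perfectly and adds no loss.
loss = total_loss([[1, 0], [1, 1]], [[0.5], [0.0]], [1, 1], [0.0], alpha=2.0)
```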

5 Experiments

5.1 Experimental Setup


We evaluate the proposed GPN on three widely adopted benchmarks: MPII Human Pose Multi-Person (MPII) dataset [1], extended PASCAL-Person-Part dataset [28], and “We Are Family” (WAF) dataset [7]. The MPII dataset consists of 3,844 and 1,758 groups of multiple interacting persons for training and testing respectively. Each person in the image is annotated for 16 body joints. It also provides more than 28,000 training samples for single-person pose estimation. The extended PASCAL-Person-Part dataset contains 3,533 challenging images from the original PASCAL-Person-Part dataset [4], which are split into 1,716 for training and 1,817 for testing. Each person is annotated with 14 body joints shared with MPII dataset, without pelvis and thorax. The WAF dataset contains 525 web images (350 for training and 175 for testing). Each person is annotated with 6 line segments for the upper-body.

Data Augmentation

We follow conventional ways to augment training samples by cropping original images based on the person center. In particular, we augment each training sample with random rotation, scaling, translational offset, and horizontal mirroring. We resize each training sample to a fixed input resolution with padding.


For the MPII dataset, we reserve 350 images randomly selected from the training set for validation. We use the remaining training images and all the provided single-person samples to train the GPN for 250 epochs. For evaluation on the other two datasets, we follow the common practice and finetune the GPN model pretrained on MPII for 30 epochs. To deal with some extreme cases where the centroids of persons overlap, we slightly perturb the centroids by adding a small offset to separate them. We implement our model with PyTorch and adopt RMSProp [26] for optimization. The initial learning rate is 0.0025 and is multiplied by 0.5 at the 150th, 170th, 200th, and 230th epochs. In testing, we follow conventions to crop image patches using the given position and average person scale of test images, and resize and pad the cropped samples as input to GPN. We search for suitable image scales over 5 different choices. In particular, when testing on MPII, following previous works [3, 18], we apply a single-person model [19] trained on MPII to refine the estimations. We use the standard Average Precision (AP) as the performance metric on all the datasets, as suggested by [12, 28]. We will make code and pre-trained models available.
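The step-wise learning rate schedule stated above (initial rate 0.0025, halved at epochs 150, 170, 200 and 230) can be written as a small helper. Whether the epochs are counted from 0 or 1 is not specified in the text; the sketch assumes 0-based epochs, and the function name is hypothetical. In PyTorch the same behaviour is usually obtained with `torch.optim.lr_scheduler.MultiStepLR`.

```python
def learning_rate(epoch, base_lr=0.0025, milestones=(150, 170, 200, 230), gamma=0.5):
    """Learning rate used during `epoch` (0-based, an assumption).

    The rate is multiplied by `gamma` once for every milestone already reached.
    """
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

# Rates at a few points of a 250-epoch run.
schedule = [learning_rate(e) for e in (0, 150, 170, 200, 230, 249)]
```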

| Method | Head | Sho. | Elb. | Wri. | Hip | Knee | Ank. | Total | Time [s] |
|---|---|---|---|---|---|---|---|---|---|
| Iqbal and Gall [13] | 58.4 | 53.9 | 44.5 | 35.0 | 42.2 | 36.7 | 31.1 | 43.1 | 10 |
| Insafutdinov et al. [12] | 78.4 | 72.5 | 60.2 | 51.0 | 57.2 | 52.0 | 45.4 | 59.5 | 485 |
| Levinkov et al. [16] | 89.8 | 85.2 | 71.8 | 59.6 | 71.1 | 63.0 | 53.5 | 70.6 | - |
| Insafutdinov et al. [11] | 88.8 | 87.0 | 75.9 | 64.9 | 74.2 | 68.8 | 60.5 | 74.3 | - |
| Cao et al. [3] | 91.2 | 87.6 | 77.7 | 66.8 | 75.4 | 68.9 | 61.7 | 75.6 | 1.24 |
| Fang et al. [8] | 88.4 | 86.5 | 78.6 | 70.4 | 74.4 | 73.0 | 65.8 | 76.7 | 1.5 |
| Newell and Deng [18] | 92.1 | 89.3 | 78.9 | 69.8 | 76.2 | 71.6 | 64.7 | 77.5 | - |
| GPN (Ours) | 92.2 | 89.7 | 82.1 | 74.4 | 78.6 | 76.4 | 69.3 | 80.4 | 0.77 |

Table 1: Comparison with state-of-the-arts on the full testing set of MPII Human Pose Multi-Person dataset (AP).

| Method | Head | Sho. | Elb. | Wri. | Hip | Knee | Ank. | Total |
|---|---|---|---|---|---|---|---|---|
| Chen and Yuille [5] | 45.3 | 34.6 | 24.8 | 21.7 | 9.8 | 8.6 | 7.7 | 21.8 |
| Insafutdinov et al. [12] | 41.5 | 39.3 | 34.0 | 27.5 | 16.3 | 21.3 | 20.6 | 28.6 |
| Xia et al. [28] | 58.0 | 52.1 | 43.1 | 37.2 | 22.1 | 30.8 | 31.1 | 39.2 |
| GPN (Ours) | 66.9 | 60.0 | 51.4 | 48.9 | 29.2 | 36.4 | 33.5 | 46.6 |

Table 2: Comparison with state-of-the-arts on the testing set of the extended PASCAL-Person-Part dataset (AP).

| Method | Head | Shoulder | Elbow | Wrist | Total |
|---|---|---|---|---|---|
| Chen and Yuille [5] | 83.3 | 56.1 | 46.3 | 35.5 | 55.3 |
| Pishchulin et al. [22] | 76.6 | 80.8 | 73.7 | 73.6 | 76.2 |
| Insafutdinov et al. [12] | 92.6 | 81.1 | 75.7 | 78.8 | 82.0 |
| GPN (Ours) | 93.1 | 82.9 | 83.5 | 79.9 | 84.8 |

Table 3: Comparison with state-of-the-arts on the testing set of the WAF dataset (AP).

5.2 Results and Analysis


Table 1 shows the evaluation results on the full testing set of MPII. We can see that the proposed GPN achieves 80.4% overall AP and significantly outperforms the previous state-of-the-art of 77.5% AP [18]. In addition, the proposed GPN consistently improves the localization of all the joints. In particular, it brings remarkable improvement on the more difficult joints, whose difficulty is mainly caused by occlusion and high degrees of freedom, including wrists (74.4% vs 69.8% AP), ankles (69.3% vs 64.7% AP), and knees (with a 4.8% absolute AP increase over [18]), confirming the robustness of the proposed generative model and global affinity cues to these distracting factors. These results clearly show that GPN is highly effective for multi-person pose estimation. We also report the computational speed of GPN in Table 1. At 0.77 s per image, GPN is about 1.6 times faster than the bottom-up approach [3] (1.24 s), which previously had the state-of-the-art speed for multi-person pose estimation. This demonstrates the efficiency of performing joint detection and partition simultaneously in our model. (The runtime is measured on an Intel i7-5820K 3.3GHz CPU and a TITAN X (Pascal) GPU. The time is counted with 5-scale testing, not including the refinement time of single-person pose estimation.)


Table 2 shows the evaluation results on the extended PASCAL-Person-Part dataset. GPN provides a 7.4% absolute AP improvement (46.6% vs 39.2% AP) over the state-of-the-art [28]. Moreover, the proposed GPN brings significant improvement on difficult joints, such as the wrist (48.9% vs 37.2% AP). These results further demonstrate the effectiveness and robustness of our model for multi-person pose estimation.


As shown in Table 3, GPN achieves 84.8% overall AP on the WAF dataset, a 2.8% improvement over the best bottom-up approach [12]. GPN achieves the best performance for all upper-body joints. In particular, it gives the most significant performance improvement on the elbow, about 7.8% higher than the previous best result. These results verify the effectiveness of the proposed GPN for tackling the multi-person pose estimation problem.

Qualitative Results

Figure 5 visualizes some pose estimation results on these three datasets. We can observe that the proposed GPN model estimates multi-person poses accurately and robustly even in challenging scenarios, e.g., joint occlusion caused by the person of interest and by other overlapping persons in the first example of the MPII dataset, large pose variation in the second example of the extended PASCAL-Person-Part dataset, and appearance and illumination changes in the fourth example of the WAF dataset. These results also verify the effectiveness of GPN for producing reliable joint detections and partitions in multi-person pose estimation.

| Method | Head | Sho. | Elb. | Wri. | Hip | Knee | Ank. | Total | InferTime [ms] |
|---|---|---|---|---|---|---|---|---|---|
| GPN-Full | 94.4 | 90.0 | 81.3 | 72.1 | 77.8 | 72.7 | 64.7 | 79.0 | 1.9 |
| GPN-w/o-Partition | 93.2 | 89.3 | 79.9 | 70.1 | 78.8 | 73.1 | 65.7 | 78.6 | 3.4 |
| GPN-w/o-LGI | 93.1 | 89.1 | 79.5 | 68.5 | 79.0 | 71.4 | 64.4 | 77.8 | - |
| GPN-w/o-Refinement | 90.4 | 86.8 | 79.3 | 69.8 | 77.5 | 69.3 | 61.9 | 76.4 | - |
| GPN- | 91.0 | 87.1 | 78.6 | 70.2 | 76.7 | 70.5 | 60.0 | 76.3 | - |
| GPN-Vanilla | 90.5 | 86.4 | 77.1 | 69.4 | 72.2 | 67.7 | 60.2 | 74.8 | - |

Table 4: Ablation experiments on MPII validation set (AP).

Figure 5: Qualitative results of the proposed GPN on MPII (1st row), extended PASCAL-Person-Part (2nd row), and WAF (3rd row). GPN performs well even in challenging scenarios, e.g., self-occlusion and other overlapped persons in the 1st example of 1st row, large pose variations in the 2nd example of 2nd row, appearance and illumination changes in the 4th example of 3rd row.

5.3 Ablation Analysis

We conduct ablation analysis for the proposed GPN model using the MPII validation set. We evaluate multiple variants of our proposed GPN model by removing certain components from the full model (“GPN-Full”). “GPN-w/o-Partition” performs inference on the whole image without using obtained joint partition information, which is similar to the pure bottom-up approaches. “GPN-w/o-LGI” removes the local greedy inference phase. It allocates joint candidates to persons through finding the most activated position for each joint in each joint partition. This is similar to the top-down approaches. “GPN-w/o-Refinement” does not perform refinement by using single-person pose estimator. We use “GPN-” to denote testing over images and “GPN-Vanilla” to denote single scale testing without refinement.

From Table 4, “GPN-Full” achieves 79.0% AP and the joint partition inference only costs 1.9ms, which is very efficient. “GPN-w/o-Partition” achieves slightly lower AP (78.6%) with slower inference speed (3.4ms). The results confirm the effectiveness of generating joint partitions by GPN: inference within each joint partition individually reduces complexity and improves multi-person pose estimation. Removing the local greedy inference phase as in “GPN-w/o-LGI” decreases the performance to 77.8% AP, showing that local greedy inference benefits pose estimation by effectively handling false alarms of joint candidate detection based on global affinity cues in the embedding space. Comparison of “GPN-w/o-Refinement” (76.4% AP) with the full model demonstrates that single-person pose estimation can refine joint localization. “GPN-Vanilla” achieves 74.8% AP, demonstrating the stability of the proposed approach for multi-person pose estimation even when refinement and multi-scale testing are removed.

Figure 6: (a) Ablation study on multi-stage Hourglass network. (b) Confusion matrix on person number inferred from generative partition (Sec. 3.3) with groundtruth. Mean square error is . Best viewed in color and zoom.

We also evaluate the pose estimation results from 4 different stages of the GPN model and plot them in Figure 6 (a). The performance increases monotonically as more stages are traversed. The final results achieved at the th stage give about improvement compared with the first stage ( vs AP). This is because, in the multi-stage design, the proposed GPN can recurrently correct errors on the dense regression maps and the joint confidence maps conditioned on previous estimations, yielding gradual improvement in joint detections and partitions for multi-person pose estimation.

Finally, we evaluate the effectiveness of the generative partition model for partitioning person instances. In particular, we evaluate how well the number of partitions it produces matches the real number of persons. The confusion matrix is shown in Figure 6 (b). We observe that the number of persons predicted by the proposed generative partition model is very close to the groundtruth, with mean square error as small as .
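The evaluation described above can be sketched as follows, assuming one groundtruth person count and one predicted partition count per image. The function names and the toy numbers are hypothetical and do not reproduce the paper's results.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def count_confusion(true_counts: Sequence[int],
                    pred_counts: Sequence[int]) -> Dict[Tuple[int, int], int]:
    """Map (groundtruth count, predicted count) to the number of images."""
    if len(true_counts) != len(pred_counts):
        raise ValueError("need one prediction per image")
    return dict(Counter(zip(true_counts, pred_counts)))


def mean_square_error(true_counts: Sequence[int],
                      pred_counts: Sequence[int]) -> float:
    """Average squared difference between predicted and true person counts."""
    if len(true_counts) != len(pred_counts) or not true_counts:
        raise ValueError("need equally sized, non-empty inputs")
    squared = [(t - p) ** 2 for t, p in zip(true_counts, pred_counts)]
    return sum(squared) / len(squared)


# Toy example with four images (made-up numbers).
truth = [1, 2, 3, 2]
pred = [1, 2, 2, 2]
print(count_confusion(truth, pred))
print(mean_square_error(truth, pred))
```

Entries on the diagonal of the confusion matrix, where the groundtruth and predicted counts agree, correspond to images in which the number of generated partitions matches the number of persons.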

6 Conclusion

We presented the Generative Partition Network (GPN) to efficiently and effectively address the challenging multi-person pose estimation problem. GPN solves the problem by simultaneously detecting and partitioning joints for multiple persons. It introduces a new approach that generates partitions by inferring over joint candidates in the embedding space parameterized by person centroids. Moreover, GPN introduces a local greedy inference approach that estimates poses for person instances by utilizing the partition information. We demonstrate that GPN provides appealing efficiency for both joint detection and partition, and that it significantly overcomes limitations of pure top-down and bottom-up solutions on three benchmark multi-person pose estimation datasets.


  • [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
  • [2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
  • [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • [4] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. L. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
  • [5] X. Chen and A. L. Yuille. Parsing occluded people by flexible compositions. In CVPR, 2015.
  • [6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B., 39(1):1–38, 1977.
  • [7] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In ECCV, 2010.
  • [8] H. Fang, S. Xie, Y. Tai, and C. Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
  • [9] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [11] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, B. Andres, and B. Schiele. Articulated multi-person tracking in the wild. In CVPR, 2017.
  • [12] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
  • [13] U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In ECCV, 2016.
  • [14] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
  • [15] L. Ladicky, P. H. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, 2013.
  • [16] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR, 2017.
  • [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [18] A. Newell and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
  • [19] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [20] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
  • [21] A. Paszke, S. Gross, and S. Chintala. PyTorch, 2017.
  • [22] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
  • [23] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR, 2012.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [25] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In ICCV, 2011.
  • [26] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • [27] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [28] F. Xia, P. Wang, X. Chen, and A. L. Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, 2017.