A TensorFlow implementation of the arXiv paper "Single-Stage Multi-Person Pose Machines" (SPM)
Multi-person pose estimation is a challenging problem. Existing methods are mostly two-stage: one stage for proposal generation and the other for allocating poses to corresponding persons. However, such two-stage methods generally suffer from low efficiency. In this work, we present the first single-stage model, Single-stage multi-person Pose Machine (SPM), to simplify the pipeline and lift the efficiency for multi-person pose estimation. To achieve this, we propose a novel Structured Pose Representation (SPR) that unifies person instance and body joint position representations. Based on SPR, we develop the SPM model that can directly predict structured poses for multiple persons in a single stage, and thus offers a more compact pipeline and an attractive efficiency advantage over two-stage methods. In particular, SPR introduces root joints to indicate different person instances, and human body joint positions are encoded as their displacements w.r.t. the roots. To better predict long-range displacements for some joints, SPR is further extended to hierarchical representations. Based on SPR, SPM can efficiently perform multi-person pose estimation by simultaneously predicting root joints (locations of instances) and body joint displacements via CNNs. Moreover, to demonstrate the generality of SPM, we also apply it to multi-person 3D pose estimation. Comprehensive experiments on the benchmarks MPII, extended PASCAL-Person-Part, MSCOCO and CMU Panoptic clearly demonstrate the state-of-the-art efficiency of SPM for multi-person 2D/3D pose estimation, together with outstanding accuracy.
Multi-person pose estimation from a single monocular RGB image aims to simultaneously isolate and locate body joints of multiple person instances. It is a fundamental yet challenging task with broad applications in action recognition, person Re-ID, pedestrian tracking, etc.
Existing methods typically adopt two-stage solutions. As shown in Figure 1 (b), they either follow the top-down strategy [12, 35, 17, 9, 8, 34] that employs off-the-shelf detectors to localize person instances at first and then locates their joints individually; or the bottom-up strategy [3, 15, 27, 31, 25] that locates all the body joints at first and then assigns them to the corresponding person. Though with high accuracy, these methods are not efficient as they require two-stage processing to predict human poses with computational redundancy. We observe that such a requirement mainly comes from the conventional pose representation they adopt. As shown in Figure 2 (b), absolute positions of allocated body joints separate the position information w.r.t. person instances and body joints, each of which requires a stage to process, leading to low efficiency.
To overcome such an intrinsic limitation, we propose a new Structured Pose Representation (SPR) to unify position information of person instances and body joints. SPR simplifies the pipeline for person separation and body joint localization and thus enables a much more efficient single-stage solution to multi-person pose estimation. In particular, SPR defines a unique identity joint, the root joint, for each person instance to indicate its position in the image. Then, the positions of body joints are encoded by their displacements w.r.t. the root joints. In this way, the pose of a person instance is represented together with its location, as shown in Figure 2 (c), making a single-stage solution feasible. To tackle long-range displacements (e.g., for limb joints), we further extend SPR to a hierarchical one by dividing body joints into hierarchies induced from articulated kinematics. Such a Hierarchical Structured Pose Representation is shown in Figure 2 (d).
Based on SPR, we propose a Single-stage multi-person Pose Machine (SPM) model to solve multi-person pose estimation with a compact pipeline and high efficiency. As mentioned above, existing two-stage models isolate different instances and estimate their poses separately. In contrast, SPM maps a given image to multiple human poses represented by SPR in a single-stage manner. As shown in Figure 1 (a), it simultaneously regresses the root joint positions and body joint displacements, predicting multi-person poses within one stage. We implement SPM with Convolutional Neural Networks (CNNs) based on the state-of-the-art Hourglass architecture for learning and inferring root joint positions and body joint displacements simultaneously and end-to-end.
Comprehensive experiments on the MPII, extended PASCAL-Person-Part, MSCOCO and CMU Panoptic benchmarks clearly demonstrate the high efficiency of the proposed SPM model. In addition, it achieves new state-of-the-art results on the MPII and extended PASCAL-Person-Part datasets, and competitive performance on the MSCOCO dataset. Moreover, it also achieves promising results on the CMU Panoptic dataset for multi-person 3D pose estimation. Our contributions are summarized as follows: 1) We propose the first single-stage solution to multi-person 2D/3D pose estimation. 2) We propose novel structured pose representations to unify position information of person instances and body joints. 3) Our model achieves superior efficiency with competitive accuracy on multiple benchmarks.
In this section, we review the state-of-the-art multi-person pose estimation methods based on the conventional pose representation. Given an image $I$, multi-person pose estimation aims to estimate the human poses $\mathcal{P}$ of all the person instances in $I$ by inferring the coordinates of their body joints. Conventionally, poses are represented as
$$\mathcal{P} = \{P_i^1, P_i^2, \ldots, P_i^K\}_{i=1}^{N}, \tag{1}$$
where $N$ is the number of persons in $I$, $K$ is the number of joint categories, and $P_i^j$ denotes the coordinates of the $j$th body joint of person $i$, where $P_i^j = (x_i^j, y_i^j)$ for the 2D case and $P_i^j = (x_i^j, y_i^j, z_i^j)$ for the 3D case. To obtain $\mathcal{P}$, existing methods typically exploit two-stage solutions, i.e., separately predicting the positions of person instances and their body joints. Based on the processing order, they can be divided into two categories: the top-down methods and the bottom-up ones.
A top-down method generates multiple human poses as follows. It first uses a person detector $f$ to localize and separate person instances, and then conducts single-person pose estimation using a single-person model $g$ to individually locate body joints for each person instance. Formally, the process can be summarized as
$$f: I \rightarrow \mathcal{B}, \qquad g: (\mathcal{B}, I) \rightarrow \mathcal{P}. \tag{2}$$
Here $I$ is the input image, $\mathcal{P}$ the resulting poses, and $\mathcal{B}$ denotes the person instance localization results, which are usually represented by a set of bounding boxes. Following this strategy, for the 2D case, Gkioxari et al. exploited a Generalized Hough Transform framework to detect person instances and then localized body joints via classifying poselets, i.e., tightly clustered body parts with similar appearances and configurations. Iqbal and Gall improved the person detector and single-person model by exploiting deep learning based techniques, including Faster-RCNN and the convolutional pose machine, to acquire more accurate human poses. Similarly, Fang et al. proposed to incorporate a spatial transformer network and the Hourglass network to further improve person instance and body joint detections. Papandreou et al. further improved the top-down strategy via location refinement with predictions of a 2D offset vector from a pixel to the corresponding joint. For the 3D case, Rogez et al. first utilized a region proposal network to detect persons of interest and found a 3D anchor pose for each detection, then exploited iterative regression for refinement. Dong et al. performed top-down multi-person 2D pose estimation for images from multiple views and reconstructed the 3D pose of each person from the multi-view 2D poses.
In contrast, to obtain the poses $\mathcal{P}$, a bottom-up method first utilizes a body joint estimator $g'$ to localize the body joints $\mathcal{J}$ of all instances, and then estimates the position of each instance and the joint allocation by solving a graph partition problem with the model $f'$, formulated as
$$g': I \rightarrow \mathcal{J}, \qquad f': \mathcal{J} \rightarrow \mathcal{P}. \tag{3}$$
as the joint detector and defined geometric correlations for allocating body joints, and then performed Integer Linear Programming to partition joint candidates. Cao et al. proposed a real-time model with improved joint correlations by introducing part affinity fields to encode the location and orientation of limbs, and allocated joint candidates by solving a maximum weight bipartite graph matching problem. Later, Mehta et al. extended this approach to multi-person 3D pose estimation. Newell and Deng introduced the associative embedding model followed by a greedy algorithm for allocating body joints. Papandreou et al. presented the bottom-up PersonLab model, which defines different levels of offsets to calculate association scores and adjust joint positions, for grouping joint candidates into person instances and refining pose estimations.
Different from all the previous methods relying on a two-stage pipeline, we present a new pose representation method that unifies positions of person instances and body joints, enabling a compact and efficient single-stage solution to multi-person 2D/3D pose estimation, as explained below.
In this section, we elaborate on the proposed Structured Pose Representations (SPR) for multi-person pose estimation. Different from the conventional pose representation in Eqn. (1), SPR aims to unify the position information of person instances and body joints to deliver a single-stage solution for multi-person pose estimation. In particular, SPR introduces an auxiliary joint, the root joint, to denote the person instance position. It is a unique identity joint for a specific person instance. In the following, we illustrate the formulations of SPR in the 2D case for simplicity; they can be directly extended to the 3D case by replacing 2D coordinates with 3D ones. Specifically, we use $(x_i^r, y_i^r)$ to denote the root joint position of the $i$th person. Then the position of the $j$th joint of person $i$ can be defined as
$$(x_i^j, y_i^j) = (x_i^r, y_i^r) + (\delta x_i^j, \delta y_i^j), \tag{4}$$
where $(\delta x_i^j, \delta y_i^j)$ represents the displacement of the $j$th body joint position w.r.t. the root joint. Eqn. (4) directly establishes the structured relationship between the person instance position and the body joint position. Thus, we use the Structured Pose Representations to represent human poses with the root joint position and body joint displacements, formulated as
$$\mathcal{P} = \{(x_i^r, y_i^r), (\delta x_i^1, \delta y_i^1), \ldots, (\delta x_i^K, \delta y_i^K)\}_{i=1}^{N}. \tag{5}$$
By the definition in Eqn. (5), SPR unifies position information of the person instance and the body joint and can be obtained in an efficient single-stage prediction. In addition, SPR can be effortlessly converted back to the conventional pose representation based on Eqn. (4). Here, we exploit the person centroid as the root joint of the person instance, due to its stability and robustness in discriminating person instances even with extreme poses. An example of SPR representing multiple human poses is shown in Figure 2 (c).
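The conversion between the conventional representation and SPR described above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the function names `encode_spr` and `decode_spr` are hypothetical, and the person centroid is approximated here by the mean of the annotated joints, which is an assumption.

```python
import numpy as np

def encode_spr(joints):
    """Convert absolute joints of shape (N, K, 2) into SPR.

    Returns root positions (N, 2) and displacements (N, K, 2).
    Assumption: the root (person centroid) is the mean of the joints.
    """
    joints = np.asarray(joints, dtype=np.float64)
    roots = joints.mean(axis=1)
    disps = joints - roots[:, None, :]  # displacement of each joint w.r.t. its root
    return roots, disps

def decode_spr(roots, disps):
    """Recover absolute joint positions: root position plus displacement."""
    return np.asarray(roots)[:, None, :] + np.asarray(disps)
```

Decoding is a single broadcasted addition, which reflects why converting SPR back to the conventional representation adds essentially no cost.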
SPR in Eqn. (5) may involve long-range displacements between body joints and the root joint due to possibly large pose deformation, e.g., wrists and ankles relative to the person centroid, which makes displacement estimation by mapping from the image representation to the vector domain difficult. We therefore propose to factorize long-range displacements into accumulative shorter ones to further improve SPR. Specifically, we divide the root joint and body joints into four hierarchies based on articulated kinematics, by their degrees of freedom and extent of deformation. Here, the root joint is placed in the first hierarchy; torso joints including neck, shoulders and hips are in the second one; head, elbows and knees are put in the third; wrists and ankles are put in the fourth. Then we can identify joint positions via shorter-range displacements between joints in adjacent hierarchies. For example, the wrist position can be encoded by its displacement relative to the elbow. Modeling short-range displacements can alleviate the learning difficulty of mapping from the image representation to the vector domain and better utilize appearance cues along limbs. Formally, for the $j$th joint in the $l$th layer (e.g., wrist in the 4th layer) and its corresponding $j'$th joint in the $(l-1)$th layer (e.g., elbow in the 3rd layer), the relation between their positions $(x_i^j, y_i^j)$ and $(x_i^{j'}, y_i^{j'})$ can be formulated as
$$(x_i^j, y_i^j) = (x_i^{j'}, y_i^{j'}) + (\tilde{\delta} x_i^j, \tilde{\delta} y_i^j), \tag{6}$$
where $(\tilde{\delta} x_i^j, \tilde{\delta} y_i^j)$ denotes the displacement between joints in adjacent hierarchies. According to the articulated kinematics, we can define an articulated path (a set of ordered joints) connecting the root joint to any body joint. Then, the body joint can be identified via the root joint position and the accumulation of short-range displacements along the articulated path. Namely,
$$(x_i^j, y_i^j) = (x_i^r, y_i^r) + \sum_{h \in \mathcal{A}^j} (\tilde{\delta} x_i^h, \tilde{\delta} y_i^h), \tag{7}$$
where $(x_i^r, y_i^r)$ is the root joint position, $\mathcal{A}^j$ represents the articulated path between the root joint and the $j$th body joint, and $h$ denotes the $h$th articulated joint on the path. In this way, we propose the Hierarchical Structured Pose Representations to denote a human pose with the root joint position, the short-range body joint displacements between neighboring hierarchies, and the articulated path set as
$$\mathcal{P} = \{(x_i^r, y_i^r), (\tilde{\delta} x_i^1, \tilde{\delta} y_i^1), \ldots, (\tilde{\delta} x_i^K, \tilde{\delta} y_i^K), \mathcal{A}^1, \ldots, \mathcal{A}^K\}_{i=1}^{N}. \tag{8}$$
Similar to SPR, hierarchical SPR defined in Eqn. (8) also unifies representations of person instance position and body joint position, leading to a single-stage solution to multi-person pose estimation as well. Moreover, hierarchical SPR factorizes displacements between the root joint and long-range body joints, benefiting estimation results for the cases with large body joint displacements. Hierarchical SPR can also be easily converted to SPR and conventional pose representation via Eqn. (7). Figure 2 (d) gives an example of Hierarchical SPR for multi-person pose representation.
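As a rough illustration of how a hierarchical SPR could be converted back to absolute positions, the sketch below accumulates short-range displacements along the articulated path of each joint. The parent table `PARENT` is a hypothetical, partial skeleton (right side only) that follows the four hierarchies described above; the joint names and the choice of the neck as the parent of the head are assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical parent table: "root" is hierarchy 1, its children hierarchy 2, etc.
PARENT = {
    "neck": "root", "r_shoulder": "root", "r_hip": "root",   # hierarchy 2
    "head": "neck", "r_elbow": "r_shoulder", "r_knee": "r_hip",  # hierarchy 3
    "r_wrist": "r_elbow", "r_ankle": "r_knee",                # hierarchy 4
}

def articulated_path(joint):
    """Ordered joints from the first joint below the root down to `joint`."""
    path = []
    while joint != "root":
        path.append(joint)
        joint = PARENT[joint]
    return path[::-1]

def decode_hierarchical(root_xy, short_disps):
    """Root position plus accumulated short-range displacements along each path."""
    pose = {}
    for joint in PARENT:
        xy = np.array(root_xy, dtype=np.float64)
        for h in articulated_path(joint):
            xy += np.asarray(short_disps[h], dtype=np.float64)
        pose[joint] = xy
    return pose
```

With this layout, the wrist is reached through shoulder, elbow and wrist displacements, so each regressed vector stays short even when the wrist is far from the centroid.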
With SPR, we propose to construct a regression model, termed as Single-stage multi-person Pose Machine (SPM), to map an input image to the poses of multiple persons :
), SPM only needs to learn a single mapping function. Motivated by the recent success of Convolutional Neural Networks (CNNs) in computer vision tasks [14, 22, 24], we implement SPM with a CNN model. Below we describe the regression targets, network architecture, and training and inference details of SPM in the 2D case for simplicity. For the 3D case, the same scheme can be applied with 3D coordinates, where we set the camera position as the origin of the 3D coordinate system.
Since the root joint and body joint displacements are respectively in the coordinate and vector domains, we construct different regression targets for the proposed SPM to learn to predict these two kinds of information.
It is difficult to directly regress the absolute joint coordinates in an image. To reliably detect root joint positions, we exploit a confidence map to encode the probability of the root joint of a person instance being at each location in the image. The root joint confidence map is constructed by modeling the root joint positions as Gaussian peaks. We use $C^r$ to denote the root joint confidence map and $C_i^r$ the root joint map of the $i$th person. For a position $(x, y)$ in the given image $I$, $C_i^r(x, y)$ is calculated by
$$C_i^r(x, y) = \exp\left(-\frac{\|(x, y) - (x_i^r, y_i^r)\|_2^2}{\sigma^2}\right),$$
where $(x_i^r, y_i^r)$ is the groundtruth root joint position of the $i$th person instance and $\sigma$ is a constant controlling the peak width, fixed in our experiments. The root joint confidence map is an aggregation of the peaks of all persons in a single map, $C^r(x, y) = \max_i C_i^r(x, y)$. Here, we choose to take the maximum of the confidence maps rather than their average to maintain distinctions between close-by peaks. An example of the root joint confidence map is shown in Figure 3 (a).
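The construction of the root joint confidence map can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name `root_confidence_map` is hypothetical, the exact Gaussian normalization (division by `sigma ** 2`) is assumed, and `sigma` is left as a parameter because its value is not recoverable from the text.

```python
import numpy as np

def root_confidence_map(roots, height, width, sigma):
    """Aggregate one Gaussian peak per person with a pixel-wise maximum.

    roots: iterable of (x, y) groundtruth root positions.
    sigma: peak width (assumption: the value is a free parameter here).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    conf = np.zeros((height, width))
    for rx, ry in roots:
        d2 = (xs - rx) ** 2 + (ys - ry) ** 2
        # Maximum rather than average keeps nearby peaks distinct.
        conf = np.maximum(conf, np.exp(-d2 / sigma ** 2))
    return conf
```

Taking the maximum means that every root location keeps a peak value of exactly 1 no matter how close another person is, whereas averaging would flatten neighbouring peaks.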
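We construct a dense displacement map for each joint. We use $D^j$ to denote it for joint $j$ and $D_i^j$ to denote the one for joint $j$ of person $i$. For a location $(x, y)$ in image $I$, $D_i^j(x, y)$ is calculated by
$$D_i^j(x, y) = \begin{cases} \dfrac{(\delta x, \delta y)}{Z} & \text{if } (x, y) \in \mathcal{N}_i^r, \\ 0 & \text{otherwise}, \end{cases}$$
where $(\delta x, \delta y) = (x_i^j, y_i^j) - (x, y)$, $\mathcal{N}_i^r = \{(x, y) : \|(x, y) - (x_i^r, y_i^r)\|_2 \le \tau\}$ denotes the neighboring positions of the root joint of person $i$, $Z = \sqrt{H^2 + W^2}$ is the normalization factor, with $H$ and $W$ denoting the height and width of $I$, and $\tau$ is a constant controlling the neighborhood size, set as 7 in our experiments. Then, we define the dense displacement map for the $j$th joint to be the average over all persons:
$$D^j(x, y) = \frac{1}{M^j} \sum_{i} D_i^j(x, y),$$
where $M^j$ is the number of non-zero vectors at position $(x, y)$ across all persons. Figure 3 (b) shows examples of the constructed dense displacement maps. For hierarchical SPR, $D^j$ is constructed in a similar way, just replacing the root joint with the one in the neighboring hierarchy.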
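The dense displacement target for one joint type can be sketched as below. This is an illustrative NumPy version, not the authors' implementation: the function name `displacement_map` is hypothetical, the neighborhood is assumed to be a disc of radius `tau` around each root, the normalization is assumed to be the image diagonal, and the averaging counts every person whose neighborhood covers a pixel (a simplification of counting non-zero vectors).

```python
import numpy as np

def displacement_map(roots, joints, height, width, tau=7):
    """Dense displacement map of shape (H, W, 2) for one joint type.

    roots, joints: sequences of (x, y), one entry per person.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    z = np.sqrt(height ** 2 + width ** 2)  # assumed normalization factor
    total = np.zeros((height, width, 2))
    count = np.zeros((height, width))
    for (rx, ry), (jx, jy) in zip(roots, joints):
        # Pixels within distance tau of this person's root joint.
        mask = (xs - rx) ** 2 + (ys - ry) ** 2 <= tau ** 2
        disp = np.stack([jx - xs, jy - ys], axis=-1) / z
        total[mask] += disp[mask]
        count[mask] += 1
    # Average over the persons covering each pixel; uncovered pixels stay zero.
    return total / np.maximum(count, 1)[..., None]
```

Every pixel near a root stores the vector pointing from that pixel to the joint, so a slightly misplaced root detection at inference time still reads a displacement that lands on the joint.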
We use the Hourglass network , the state-of-the-art architecture for human pose estimation, as the backbone of SPM. It is a fully convolutional network composed of multiple stacked Hourglass modules. Each Hourglass module, as shown in Figure 4, adopts a U-Shape structure that first decreases feature map resolution to learn abstract semantic representations and then upsamples the feature maps for body joint localization. Additionally, skip connections are added between feature maps with the same resolution for reusing low-level spatial information to refine high-level semantic information. In the original design, the Hourglass network utilizes a single branch to predict body joint confidence maps for single-person pose estimation. In this paper, SPM exploits the confidence regression branch of the Hourglass network to regress confidence maps for the root joint. In addition, SPM extends the Hourglass network via adding a displacement regression branch, to estimate body joint displacement maps. In this way, SPM can produce (Hierarchical) SPR in a single forward pass.
For training SPM, we adopt the $\ell_2$ loss $L^C$ and the smooth $\ell_1$ loss $L^D$ for root joint confidence and dense displacement map regression, respectively. Intermediate supervision is applied at all Hourglass modules to avoid vanishing gradients. The total loss is the accumulation of the weighted sum of $L^C$ and $L^D$ across all Hourglass modules:
$$L = \sum_{t=1}^{T} \left( L^C(\hat{C}_t^r, C^r) + \beta L^D(\hat{D}_t, D) \right),$$
where $T$ is the number of Hourglass modules, $\hat{C}_t^r$ and $\hat{D}_t$ denote the predicted root joint confidence map and dense displacement maps at the $t$th stage, and $\beta$ is a constant weight factor to balance the two kinds of losses. The overall framework of SPM is end-to-end trainable via gradient backpropagation.
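The training objective with intermediate supervision can be sketched as follows. This is a framework-free NumPy illustration: the function names are hypothetical, the losses use a plain sum reduction (an assumption), the smooth loss follows the common smooth L1 definition with a transition at 1, and `beta` is left as a parameter because its value is not recoverable from the text.

```python
import numpy as np

def l2_loss(pred, target):
    """Sum of squared differences (reduction by sum is an assumption)."""
    return float(((pred - target) ** 2).sum())

def smooth_l1_loss(pred, target):
    """Quadratic for small errors, linear for large ones."""
    diff = np.abs(pred - target)
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum())

def spm_loss(stage_outputs, conf_gt, disp_gt, beta):
    """Accumulate the weighted loss over the outputs of all stages.

    stage_outputs: list of (predicted confidence map, predicted displacement maps),
    one pair per Hourglass module.
    """
    return sum(
        l2_loss(conf, conf_gt) + beta * smooth_l1_loss(disp, disp_gt)
        for conf, disp in stage_outputs
    )
```

Each stage is compared against the same groundtruth targets, which is what lets gradients reach the early modules directly.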
The overall inference procedure for SPM to predict SPR is illustrated in Figure 1 (a). Given an image, SPM first produces the root joint confidence map $\hat{C}^r$ and the displacement maps $\hat{D}$ via a CNN. Then, it performs NMS on $\hat{C}^r$ to generate the root joint positions $\{(\hat{x}_i^r, \hat{y}_i^r)\}_{i=1}^{\hat{N}}$, with $\hat{N}$ denoting the estimated number of persons. After that, SPM gets the displacement of the $j$th body joint of person $i$ by reading $\hat{D}^j(\hat{x}_i^r, \hat{y}_i^r)$ and multiplying it by the normalization factor $Z$. Finally, SPM outputs human poses represented by SPRs by combining root joint positions and body joint displacements. For predicting hierarchical SPRs, SPM follows the above procedure to sequentially get joint displacements according to the joint hierarchies in Eqn. (7).
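The inference steps above can be sketched as follows. This is a simplified NumPy illustration, not the authors' code: NMS is approximated by keeping 3x3 local maxima above a score threshold, the function names and the `threshold` parameter are hypothetical, and the displacement maps are assumed to be normalized by the image diagonal.

```python
import numpy as np

def find_roots(conf, threshold=0.5):
    """Simple NMS: keep pixels that are 3x3 local maxima above a threshold."""
    h, w = conf.shape
    padded = np.pad(conf, 1, mode="constant", constant_values=-np.inf)
    neigh = np.stack([padded[dy:dy + h, dx:dx + w]
                      for dy in range(3) for dx in range(3)])
    peaks = (conf >= neigh.max(axis=0)) & (conf > threshold)
    ys, xs = np.nonzero(peaks)
    return list(zip(xs.tolist(), ys.tolist()))

def decode_poses(conf, disp_maps, threshold=0.5):
    """disp_maps: (K, H, W, 2) normalized displacement maps, one per joint.

    Returns one (K, 2) array of absolute joint positions per detected person.
    """
    h, w = conf.shape
    z = np.sqrt(h ** 2 + w ** 2)  # undo the assumed target normalization
    poses = []
    for x, y in find_roots(conf, threshold):
        poses.append(np.array([x, y]) + z * disp_maps[:, y, x, :])
    return poses
```

The number of returned poses is the estimated number of persons, and the per-person work is a lookup in each displacement map, so the decoding cost does not depend on a separate detection or grouping stage.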
We evaluate the proposed SPM model for multi-person pose estimation on three widely adopted 2D benchmarks, the MPII dataset, the extended PASCAL-Person-Part dataset and the MSCOCO dataset, and on one 3D benchmark, the CMU Panoptic dataset.
The MPII dataset contains 5,602 groups of images of multiple persons, which are split into 3,844 for training and 1,758 for testing. It also provides over 28,000 annotated single-person pose samples. Each person is annotated with 16 body joints. We use the official mean Average Precision (mAP) for evaluation on this dataset. The extended PASCAL-Person-Part dataset consists of 1,716 training and 1,817 testing images collected from the original PASCAL-Person-Part dataset, and provides 14 body joint annotations for each person. Similar to MPII, this dataset also adopts mAP as the evaluation metric. The MSCOCO dataset contains about 60,000 training images with 17 annotated body joints per person. Evaluations are conducted on the test-dev subset, which includes roughly 20,000 images, with the official Average Precision (AP) as the metric.
CMU Panoptic is a large-scale dataset providing 3D pose annotations for multiple people engaged in social activities. In total, it includes 65 videos with multi-view annotations, but only 17 of them show multi-person scenarios and come with camera parameters. We use the front-view captures of these 17 videos in our experiments, which contain 75,552 images in total and are randomly split into 65,552 for training and 10,000 for testing. Following conventions [25, 34], we use 3D-PCK@150mm as the metric.
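For reference, a 3D-PCK score at a 150 mm threshold can be computed as the fraction of predicted joints that lie within 150 mm of their groundtruth positions. The sketch below assumes coordinates are given in millimetres and that predictions are already matched to groundtruth persons; the function name `pck3d` is hypothetical, and any alignment step used in the actual evaluation protocol is not modelled.

```python
import numpy as np

def pck3d(pred, gt, threshold_mm=150.0):
    """Fraction of joints whose 3D error is at most `threshold_mm`.

    pred, gt: arrays of shape (..., 3) in millimetres, already matched.
    """
    dist = np.linalg.norm(np.asarray(pred, dtype=np.float64)
                          - np.asarray(gt, dtype=np.float64), axis=-1)
    return float((dist <= threshold_mm).mean())
```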
We follow the conventional data augmentation strategies for multi-person pose estimation via cropping original images centered at person centroid to input samples to SPM. For MPII and extended PASCAL-Person-Part datasets, we augment training samples with rotation degrees in , scaling factors in , translation offset in and horizontally flipping. For MSCOCO dataset, scaling factors are sampled in and other augmentation parameters are set the same as MPII and extended PASCAL-Person-Part datasets. For CMU Panoptic dataset, we conduct data augmentation with scale factors in and set the other augmentation parameters the same as 2D case.
For MPII dataset, we randomly select 350 groups of multi-person training samples as the validation dataset and use the remaining training samples and all single-person pose images to learn SPM. For MSCOCO dataset, we use the standard training split for training the model. Following conventions [3, 37]30]
and utilize RMSprop as the optimizer with an initial learning rate of 0.003. For the MPII dataset, we train SPM for 250 epochs and decrease the learning rate by a factor of 2 at the 150th, 170th, 200th and 230th epochs. For the extended PASCAL-Person-Part dataset, we fine-tune the model pre-trained on MPII for 30 epochs. For the MSCOCO dataset, SPM is trained for 100 epochs and the learning rate is decreased by a factor of 2 at the 30th, 60th and 80th epochs. For the CMU Panoptic dataset, we adopt the same training strategy as for MPII. Testing is performed on six-scale image pyramids with flipping for both datasets. Specifically, following previous works [3, 27], we refine the estimation results on MPII and MSCOCO with a single-person model trained on the same dataset.
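The step schedule described above can be written as a small helper. This is only a sketch: the function name `learning_rate` is hypothetical, and the convention that the drop takes effect from the listed epoch onward is an assumption.

```python
def learning_rate(epoch, base_lr=0.003, milestones=(150, 170, 200, 230), factor=2.0):
    """Learning rate at `epoch`, divided by `factor` at each milestone reached.

    Defaults follow the MPII setting; for MSCOCO the text gives milestones (30, 60, 80).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops
```

For example, `learning_rate(40, milestones=(30, 60, 80))` gives the rate after the first drop of the MSCOCO schedule.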
| Method | Head | Sho | Elb | Wri | Hip | Knee | Ank | Total | Time [s] |
|---|---|---|---|---|---|---|---|---|---|
| Iqbal and Gall | 58.4 | 53.9 | 44.5 | 35.0 | 42.2 | 36.7 | 31.1 | 43.1 | 10 |
| Insafutdinov et al. | 78.4 | 72.5 | 60.2 | 51.0 | 57.2 | 52.0 | 45.4 | 59.5 | 485 |
| Levinkov et al. | 89.8 | 85.2 | 71.8 | 59.6 | 71.1 | 63.0 | 53.5 | 70.6 | - |
| Insafutdinov et al. | 88.8 | 87.0 | 75.9 | 64.9 | 74.2 | 68.8 | 60.5 | 74.3 | - |
| Cao et al. | 91.2 | 87.6 | 77.7 | 66.8 | 75.4 | 68.9 | 61.7 | 75.6 | 0.6 |
| Fang et al. | 88.4 | 86.5 | 78.6 | 70.4 | 74.4 | 73.0 | 65.8 | 76.7 | 0.4 |
| Newell and Deng | 92.1 | 89.3 | 78.9 | 69.8 | 76.2 | 71.6 | 64.7 | 77.5 | 0.25 |
| Fieraru et al. | 91.8 | 89.5 | 80.4 | 69.6 | 77.3 | 71.7 | 65.5 | 78.0 | - |
In Table 1, we compare our SPM model with hierarchical SPR to state-of-the-art methods on the full test split of the MPII dataset. We can see that our SPM model only requires 0.058s to process an image, which is faster than the bottom-up model with state-of-the-art speed, verifying the efficiency advantage of the proposed single-stage solution over existing two-stage ones for multi-person pose estimation. In addition, our SPM model achieves new state-of-the-art mAP on the MPII dataset and improves accuracies for most kinds of body joints, which demonstrates its superior performance for estimating human poses of multiple persons in a single stage.
(Note on timing: for our SPM model, the time is counted with single-scale testing on a TITAN X GPU and an Intel I7-5820K 3.3GHz CPU, excluding the refinement time by single-person pose estimation. For the compared method whose code is provided by its authors at https://github.com/umich-vl/pose-ae-train, we report the runtime measured with that code. For another compared method, we refer to its speed in the single-scale inference setting on the MPII testing set, which can be found in Table 1 of the first version of its paper.)
We conduct ablation analysis on MPII validation dataset. We first evaluate the impact of the hierarchical division to SPR on the proposed SPM model. Results are shown in Table 2. We use SPM-Vanilla and SPM-Hierar to denote the models for predicting SPR and Hierarchical SPR, respectively.
We can see SPM-Vanilla achieves mAP with 0.058s per image. By introducing joint hierarchies, SPM-Hierar improves the performance to mAP without increasing time cost as SPR and hierarchical SPR have the same complexity and both of them are generated by SPM in a single-stage manner. In addition, we can see SPR-Hierar improves the accuracy of all joints. Moreover, we can also see that improvements by SPM-Hierar on long-range body joints wrists and ankles are significant, from to mAP and to mAP, respectively, verifying the effectiveness of shortening long-range displacements with Hierarchical SPR that divides body joints to different hierarchies. These results clearly show the efficacy of incorporating hierarchical SPR to improve performance and efficiency of multi-person pose estimation.
We then conduct experiments to analyze the impact of an important hyper-parameter $\tau$, the neighborhood size used in constructing the regression targets for body joint displacements in Section 4.1, on the proposed SPM model. We vary $\tau$ from 1 to 20 and the results are given in Figure 5. We can see that increasing $\tau$ from 1 to 7 gradually improves the performance, mainly because with more positive samples, more variations of body joints can be covered for displacement regression in training. Further increasing $\tau$ from 7 to 10 brings no further improvement. However, when $\tau$ is increased beyond 10, we observe a performance drop. This is because noise from the background is taken as positive samples and the overlap of displacement fields among multiple persons degrades the performance. Hence, we set $\tau = 7$ in our experiments as a trade-off between efficiency and accuracy.
Qualitative results on the MPII dataset are shown in the top row of Figure 6. We can see that the proposed SPM is effective and robust for estimating human poses represented by Hierarchical SPRs even in challenging scenarios, e.g., large pose deformation (1st example), blurred and cluttered background (2nd example), occlusion and person overlapping (3rd example), and illumination variations (4th example). These results further validate the efficacy of SPM.
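| Method | Head | Sho | Elb | Wri | Hip | Knee | Ank | Total |
|---|---|---|---|---|---|---|---|---|
| Chen and Yuille | 45.3 | 34.6 | 24.8 | 21.7 | 9.8 | 8.6 | 7.7 | 21.8 |
| Insafutdinov et al. | 41.5 | 39.3 | 34.0 | 27.5 | 16.3 | 21.3 | 20.6 | 28.6 |
| Xia et al. | 58.0 | 52.1 | 43.1 | 37.2 | 22.1 | 30.8 | 31.1 | 39.2 |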
Table 3 shows the comparison results with state-of-the-arts on the extended PASCAL-Person-Part dataset. We can see that the proposed SPM model achieves mAP and provides new state-of-the-art. Besides, SPM outperforms previous models for all body joints, demonstrating the effectiveness of the proposed single-stage model for tackling the multi-person pose estimation problem.
Qualitative results are shown in the middle row of Figure 6. We observe SPM can deal with person scale variations (1st example), occlusion (2nd to 4th examples) and person overlapping (the last example), showing the efficacy of SPM on producing robust pose estimation in various challenging scenes.
Table 4 shows experimental results on MSCOCO test-dev. We can see that the proposed SPM model achieves overall 0.669 AP, which is slightly lower than the state-of-the-art . However, our SPM achieves superior speed, faster than . These results further confirm the superior efficiency of our single-stage solution over existing two-stage top-down or bottom-up strategies, while achieving very competitive performance, for addressing the multi-person pose estimation tasks.
Qualitative results on the MSCOCO dataset are shown in the bottom row of Figure 6. We can see that our SPM model is effective in challenging scenes, e.g., appearance variations (1st example) and occlusion (2nd to 4th examples).
We evaluate the proposed SPM model for multi-person 3D pose estimation on the CMU Panoptic dataset, which provides large-scale data with accurate 3D pose annotations and thus is suitable to be an evaluation benchmark. Since previous works [19, 8] only conduct qualitative evaluation on this dataset, there are no reported quantitative results for comparison. For better understanding the model performance, we present the first quantitative evaluation here. We separate 10,000 images from the dataset to form the testing split and use the remaining for training as mentioned in Section 5.1. In particular, our SPM model achieves 3D-PCK, a promising result for multi-person 3D pose estimation. The effectiveness of our SPM model can be also verified through the qualitative results in Figure 7. We can see our SPM model is robust for pose variations (1st and 2nd examples), self occlusions (3rd example), scale and depth changes (4th and 5th examples).
In addition, the proposed SPM model achieves attractive efficiency with speed of about 20 FPS. Moreover, its single-stage design also significantly simplifies the pipeline for multi-person 3D pose estimation from a single monocular RGB image, alleviating the requirements of intermediate 2D pose estimations  or 3D pose reconstructions from multiple views .
In this paper, we present the first single-stage model, Single-stage multi-person Pose Machine (SPM), for multi-person pose estimation. The SPM model offers a more compact pipeline and attractive efficiency advantage over existing two-stage based solutions. The superiority of SPM mainly comes from a novel Structured Pose Representation (SPR) that unifies the person instance and body joint position information and overcomes the intrinsic limitations of conventional pose representations. In addition, we present a hierarchical extension of SPR to effectively factorize long-range displacements into accumulative short-range ones between adjacent articulated joints, without introducing extra complexity to SPR. With SPR, SPM can estimate poses of multiple persons in a single-stage feed-forward manner. We implement SPM with CNNs, which can perform end-to-end learning and inference. Moreover, SPM can be flexibly adopted in both 2D and 3D scenarios. Extensive experiments on 2D benchmarks demonstrate the state-of-the-art speed of the proposed SPM model also with superior performance for predicting poses of multiple persons. Results on 3D benchmark also show the promising performance of our SPM model with attractive efficiency.
Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.