3D human pose estimation consists of inferring 3D joint locations from an image or a sequence of images. It is key to a large number of applications in AR/VR, human-computer interaction (HCI), gaming, activity recognition, surveillance, etc. Although there is a vast literature on single-person 3D pose estimation [31, 12, 41, 39, 23, 20, 5, 36, 3, 15, 1, 40, 6, 11], multi-person 3D pose estimation remains mostly unexplored, with only a handful of prior works [27, 19, 37, 28, 38]. Ironically, real-life applications of human pose estimation most often require multi-person pose estimation. For example, surveillance systems require real-time capture of the pose of every person in the scene. Similarly, sports analytics demands that all players be analyzed simultaneously to capture inter-player interactions. Consequently, there exists a gap between existing research and real-world requirements.
A simple extension of single-person pose estimation systems to the multi-person setting involves detecting every person separately, followed by single-person pose estimation on each person crop.
Unfortunately, the run-time of this approach is likely to increase linearly with the number of people in the scene, making it inefficient for analysis in crowded scenes. Additionally, most existing multi-person pose estimation methods [27, 19, 28], with the exception of  estimate 3D pose configuration only relative to the root joint. However, relative spatial ordering of different people in the scene is also needed to facilitate reasoning about human interactions and provide a better understanding of the scene. Relative spatial estimation has the potential to unlock accurate tracking of multiple persons in a scene video.
Moreover, most prior work on multi-person pose estimation [27, 19, 38] relies on creating or simulating a multi-person 3D human pose dataset as a necessity for training. This prerequisite stems from their end-to-end integrated person detection and pose estimation pipelines. It limits the variability presented to the system during training, because obtaining real-world in-the-wild 3D annotations in a multi-person setting is challenging, expensive and a research problem in itself.
In light of the above, we propose a quasi top-down architecture that decouples the 2D key-point detection and 2D-to-3D lifting tasks. The proposed architecture, HG-RCNN, combines the strengths of Mask-RCNN  and the Hourglass  network for heatmap regression. The regressed heatmaps are then fed to an independently trained lifting module to regress the root-relative 3D poses. Consequently, we completely avoid using any multi-person 3D pose dataset, since the pipeline leverages existing multi-person 2D pose datasets and single-person 3D pose datasets. Owing to its modular architecture, the first step of obtaining 2D poses can be trained with publicly available large-scale in-the-wild multi-person datasets, such as COCO , LIP  and the MPII 2D dataset . This allows HG-RCNN to cope with challenging variations in view-point, lighting, apparel, occlusion and extreme poses without the need for costly 3D annotations in in-the-wild settings. The keypoint heatmaps from HG-RCNN are passed through a soft-argmax module and fed to the 2D-to-3D lifting module. Finally, our pipeline approximates pose configurations in camera coordinates without the need for costly geometric optimization. The resulting system outperforms all previous approaches on the challenging MuPoTS-3D  test set, which consists mostly of in-the-wild test scenarios. The method generalizes well to in-the-wild images, even without exploiting any structural priors, while running at 12-15 fps on images of size on a single Nvidia 1080Ti graphics card.
In summary, we contribute a state-of-the-art model for performing in-the-wild multi-person 3D pose estimation. The model can be trained without using any multi-person 3D dataset and the system also estimates the relative ordering of the persons in the 3D space.
2 Related Works
Human Pose Estimation has been a widely studied problem. Here, we describe prior art relevant to this work from three broad viewpoints: (a) 2D Pose estimation, (b) Single-person 3D Pose estimation and (c) Multi-Person 3D Pose estimation. A detailed survey of the area can be found in .
2D Human Pose Estimation: Most 2D human pose estimation methods represent their joint outputs as heatmaps, wherein a heatmap’s value at a point represents the likelihood of the corresponding joint being at that position.  proposed Convolutional Pose Machines, which iteratively refine the heatmap predictions at every stage. The Stacked Hourglass network  is an encoder-decoder architecture with skip connections that facilitates joint reasoning over high-level structural and low-level textural features of human pose. Mask-RCNN  extended Faster-RCNN  to simultaneously predict object detections together with 2D keypoints and/or instance segmentation masks. In a similar line of work,  predicted the u-v maps of the persons, which can then be used for dense reconstruction.  proposed a variant of Mask-RCNN by defining joints as regions instead of persons. In a similar spirit, our proposed pipeline combines the Mask-RCNN and Hourglass networks for the multi-person 3D pose estimation task.
Single-person 3D Pose Estimation: Single-person 3D pose estimation works can be broadly divided based on whether they directly regress 3D joints [31, 12, 41, 39] or use a pipelined approach of inferring 3D pose from 2D pose [33, 40, 20, 21, 13]. VNect  proposed the first real-time approach and parameterized a 3D joint by a heatmap and 3 location maps. Using a 2D-to-3D pipeline enables the use of rich 2D pose datasets which, in turn, improves in-the-wild generalizability. Many approaches perform a direct 2D-to-3D lifting of poses [39, 16, 21, 5, 36], either by learning the transformation or by a nearest-neighbour lookup in a pose library. Furthermore, many pipelined approaches [20, 27, 40, 31, 39, 23] have reported significant improvements in in-the-wild performance by using the more diverse 2D pose datasets to pre-train or jointly train their 2D prediction modules.
Several methods in the past have also reported significant improvements by using temporal cues [24, 20, 39, 6, 35, 37] by either learning a motion/refinement model or by using temporal constraints in a constrained optimization framework.
Multi-Person 3D Pose Estimation:
Broadly, multi-person pose estimation approaches, 2D and 3D alike, can be classified into top-down and bottom-up approaches. Bottom-up approaches predict all the key-points simultaneously and then assemble them into full poses for all persons. Top-down approaches, on the other hand, first detect the human candidates and subsequently perform pose estimation for each of them. While bottom-up methods are attractive in terms of efficiency, they tend to be less accurate. For example, the top 5 entries in the MS COCO key-points challenge employ top-down approaches. Intuitively, it makes sense to solve for pose estimation on a person’s crop, instead of solving the much more challenging problem of grouping detected key-points into full persons. In recent years, however, a middle ground has been found in the form of quasi top-down architectures based on Mask-RCNN [8, 26, 9, 30], which have been successful in simultaneously detecting the object RoIs and performing downstream tasks on the corresponding RoI feature-maps, without having to crop the image again.
which first proposes Regions of Interest (RoIs) that are fed to a classifier and a regressor. The classifier estimates the most probable anchor pose out of the pre-defined anchor poses obtained from a MoCap dataset. The regressor then refines the anchor poses towards an accurate pose prediction. Alternatively,  propose a bottom-up approach wherein they regress the heatmaps along with X, Y, and Z location maps for every image. The location maps provide the corresponding 3D positions of joints in metric space. The estimated 3D joints are then associated using Part Affinity Fields  based on the heatmaps. Both approaches depend on the explicit creation or simulation of multi-person 3D pose datasets for training. Our method, on the other hand, avoids the use of such datasets and relies on 3D data only for the single-person case. Further, Zanfir  proposed a large-scale human sensing system for multiple people that estimates pose and shape using the top-down approach of person detection followed by pose estimation for each person. Recently, Zanfir  proposed MubyNet, a bottom-up approach that performs joint association by formulating it as a binary integer programming problem. In contrast, Mask-RCNN  based quasi top-down methods [26, 9, 27] have proven to be effective for simultaneously locating objects at a coarse level and detecting finer spatial layouts like segmentation masks, key-point heatmaps, u-v maps, etc. Our proposed HG-RCNN exploits this setting and also regresses 3D key-points. However, unlike LCRNet  and LCRNet++ , our method does not require anchor poses and is relatively simpler.
3 Problem Formulation
Given an image containing people, we estimate the poses , wherein and is the number of joints. Every pose is a set of joints in 3D Euclidean space with the origin set to a root joint, pelvis in this case. As an intermediate step, our method first estimates the 2D key-points with in the image coordinate space. Finally, we approximate the global poses in camera coordinate space.
3.1 Multi-Person 3D Pose Estimation
We follow a generic, two-step pipeline for root-relative 3D pose estimation. First, we estimate per-frame 2D key-points of all the people in an image; second, we lift them to 3D poses using a simple residual network. We use a Mask-RCNN based architecture to estimate 2D key-points. However, the vanilla keypoint head of Mask-RCNN is not the most conducive architecture for reasoning about structured, articulated objects such as the human body. Fortunately, the Hourglass  family of networks has been found to be extremely effective in reasoning about human pose in a structure-aware way. Therefore, we propose to employ a tiny Hourglass head as a replacement for the key-point head. This simple change alone leads to noticeable improvements in the results and is discussed further in Section 5.2.
In the second step, the obtained keypoint heatmaps are lifted to 3D joints using a network with two residual modules of size 2048. For deployment in in-the-wild settings, it is trained with the heatmaps regressed on the MPI-INF-3DHP training dataset , which provides a wide variety of viewpoints, poses and activities, thereby adding to the generalization capability of the network. It is this modular structure of the pipeline that allows us to train the network without any multi-person 3D dataset. The in-the-wild performance rests on two aspects: a) the heatmaps are learnt on in-the-wild multi-person 2D keypoint datasets, and b) the lifting module is agnostic to the image features and trained on a dataset consisting of a wide variety of 2D-3D paired annotations.
Further details on the architecture are discussed in Section 4. At this stage, all the outputs (3D keypoints) are in their individual root relative space. For placing the detected poses in camera-relative space, we estimate the common focal length of the camera and the translation vectors from the individuals’ roots to the camera center.
3.2 Global Pose Approximation
Our approach for camera-relative pose approximation is based on jointly optimizing the root joints’ global positions and the camera’s focal length for the projection error. We initialize the root joint positions using a weak-perspective projection assumption, thus, requiring us to estimate the shrinking parameter for every pose in the scene. To this end, we compute the sum of bone lengths of the 2D keypoints, , followed by computing the sum of bone lengths, , of the 3D pose’s orthographic projection.
The ratio acts as a surrogate to the shrinking factor . This finally leads to the following formulation for estimating the global (horizontal) and (depth) coordinates of a joint:
where, corresponds to the 2D keypoint and is the co-ordinate of the image center. The focal length, , is initialized by assuming a field-of-view of . The same formulation holds for the (vertical) coordinate as well.
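The weak-perspective initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `bones` index pairs, and the field-of-view value passed by the caller are all hypothetical, and the relations used (depth = focal length times the 3D-to-2D bone-length ratio, horizontal and vertical offsets scaled by the same ratio) are the standard pinhole relations that match the prose description.

```python
import math

def bone_length_sum(joints_xy, bones):
    """Sum of Euclidean lengths of the given bones; joints_xy holds (x, y) pairs."""
    return sum(math.dist(joints_xy[a], joints_xy[b]) for a, b in bones)

def focal_from_fov(image_width, fov_degrees):
    """Pinhole focal length in pixels for an assumed horizontal field of view."""
    return 0.5 * image_width / math.tan(math.radians(fov_degrees) / 2.0)

def init_root_position(root_2d, pose_2d, pose_3d, bones, focal, center):
    """Initial camera-space root position under a weak-perspective assumption.

    pose_2d: detected 2D keypoints in pixels.
    pose_3d: root-relative 3D joints in metric units.
    bones:   (parent, child) index pairs used for the bone-length sums
             (hypothetical; the caller chooses which bones to include).
    """
    # Orthographic projection of the 3D pose: simply drop the depth coordinate.
    orthographic = [(x, y) for x, y, _ in pose_3d]
    # Metric units per pixel; stands in for the inverse of the shrinking factor.
    ratio = bone_length_sum(orthographic, bones) / bone_length_sum(pose_2d, bones)
    depth = focal * ratio
    # Back-project the root pixel: X = Z * (x - c_x) / f = ratio * (x - c_x).
    x = ratio * (root_2d[0] - center[0])
    y = ratio * (root_2d[1] - center[1])
    return (x, y, depth)
```

With a single 500-unit bone that spans 100 pixels and a focal length of 1000 pixels, the root is placed at a depth of 5000 units; a root pixel 100 pixels right of the image center maps to a horizontal offset of 500 units.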
Once the root translations are initialized and the full 3D poses are placed in the respective root positions, we iteratively optimize the translation and focal length. The global rotations are assumed to be identity. Thus, the objective function can be written as:
where with being the translation vector of subject’s root joint and being the projection operator. This, finally, leads to the global pose, .
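For orientation, one plausible way to write the refinement objective described above is as a reprojection error. The notation is an assumption introduced here for illustration ($N$ people, $J$ joints, root-relative 3D joints $P_{i,j}$, detected 2D keypoints $p_{i,j}$, root translations $T_i$, focal length $f$, image center $(c_x, c_y)$), as is the squared $\ell_2$ penalty; the original formulation may differ in its details.

```latex
\min_{f,\,\{T_i\}_{i=1}^{N}} \;
  \sum_{i=1}^{N} \sum_{j=1}^{J}
  \left\lVert \Pi_f\!\left(P_{i,j} + T_i\right) - p_{i,j} \right\rVert_2^2,
\qquad
\Pi_f\!\left([X,\, Y,\, Z]^\top\right)
  = \left[\frac{fX}{Z} + c_x,\; \frac{fY}{Z} + c_y\right]^\top
```

No rotation appears in this sketch because the global rotations are assumed to be identity.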
It is worth noting that the proposed global pose estimate is only an approximation that can be implemented quickly and run in real time. It is not intended to be highly accurate, only to make the spatial ordering apparent to systems that need it, e.g., action recognition. The approximation is not expected to work when the person is aligned with the optical axis. We discuss further limitations in Section 6.
shows results when only the detected poses are considered. The evaluation metric is 3D PCK and higher is better. *Note that the average PCK provided in LCRNet++ is not weighted by the number of persons in each test sequence, unlike [27, 19] and ours.
4 Network and Training Details
HG-RCNN: The HG-RCNN is constructed by appending an hourglass to the keypoint head of Mask-RCNN, as shown in Figure 2. Instead of upsampling once while deconvolving and once at the final layer, we upsample (with ) the feature maps all at once before passing the feature values on to the hourglass. The number of feature-maps is brought down from to using a convolution layer. The original hourglass is modified to have three nested residuals (instead of ) and has a feature-map of size at the bottleneck layer. The hourglass output is then fed to a final classification layer which predicts the heatmaps for every joint.
We train the network described above with the cross-entropy loss. While fine-tuning, we train on the top 500 RoIs and use a batch size of 16. The network is trained with a base learning rate of on a single Nvidia Quadro P6000 graphics card.
3D Pose Module: Our 2D-to-3D pose module converts the heatmap activations to 3D pose using a residual architecture and is in line with the 2D-3D lifting pipelines proposed in [16, 32, 21]. We input the 2D poses in heatmap space after passing the heatmaps through a soft-argmax layer. This has two benefits: a) it makes learning possible from images of any given size and scale, and b) it facilitates end-to-end training of the network architecture. The network is trained using the RMSProp optimizer and a learning rate of which is reduced by 10 times after 40 epochs.
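To make the soft-argmax step concrete, here is a minimal pure-Python sketch. The function name, the `beta` sharpness parameter, and the normalization of the output to the range [0, 1] are assumptions made for illustration; the paper does not specify them. The idea is that a softmax over the heatmap yields a probability map whose expected coordinate is a differentiable stand-in for the arg-max location.

```python
import math

def soft_argmax_2d(heatmap, beta=1.0):
    """Differentiable surrogate for arg-max on one joint heatmap.

    heatmap: list of rows of raw scores (at least 2 x 2).
    beta:    hypothetical sharpness factor; larger values approach a hard arg-max.
    Returns the expected (x, y) location, normalised to [0, 1] so that the
    output does not depend on the heatmap resolution (an assumption here).
    """
    height, width = len(heatmap), len(heatmap[0])
    peak = max(max(row) for row in heatmap)  # subtracted for numerical stability
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    # Expected column and row index under the softmax distribution.
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x / (width - 1), y / (height - 1)
```

A heatmap with a single symmetric peak at its center yields approximately (0.5, 0.5), and increasing `beta` pulls the estimate towards the location of the maximum score.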
While testing on the MuPoTS (multi-person) dataset, we use the 3D pose module trained only on the MPI-INF-3DHP dataset, because its training set and the MuPoTS test set were captured with the same motion capture system. Human3.6M was captured by a different mocap system, which leads to the same joint name pointing to different physical locations on the body.
5.1 Evaluation Datasets
MuPoTS-3D Test Set: Multi-Person Test Set 3D  is a recently released multi-person 3D human pose test dataset. It consists of 20 test sequences shot with a marker-less mocap system - 5 indoor and 15 outdoor. Every sequence contains 2-3 persons in a variety of activities. The evaluation metric used is 3D PCK - percentage of correct keypoints within a radius of 15cm - on all the annotated persons. In case of a missed detection, all the joints of the missed person are considered erroneous. An alternative evaluation mode is the one in which the evaluations are performed only on the detected joints.
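The 3D PCK metric and its two evaluation modes can be sketched as below. This is an illustrative sketch rather than the official evaluation code: the function name and the convention of passing `None` for a missed person are hypothetical, and coordinates are assumed to be in millimetres so that the 15 cm radius becomes 150.

```python
import math

def pck_3d(pred_poses, gt_poses, threshold=150.0, detected_only=False):
    """Percentage of joints whose 3D error is below the threshold.

    pred_poses[i] is the pose matched to gt_poses[i], or None if that person
    was missed. With detected_only=False every joint of a missed person counts
    as wrong; with detected_only=True missed persons are skipped entirely.
    """
    correct = 0
    total = 0
    for pred, gt in zip(pred_poses, gt_poses):
        if pred is None:
            if not detected_only:
                total += len(gt)  # all joints of the missed person are errors
            continue
        total += len(gt)
        correct += sum(1 for p, g in zip(pred, gt) if math.dist(p, g) < threshold)
    return 100.0 * correct / total if total else 0.0
```

For two ground-truth persons with two joints each, where one person is missed and the other has one joint within the radius, the score is 25% over all annotated persons and 50% over detected persons only.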
The official evaluation code performs a greedy matching of detections and ground truth based on the number of 2D keypoints within a proximity of . We call this method Setting 1 for MuPoTS.
We also evaluate our model in a setting wherein the greedy matching is done based on 3D distances instead of 2D distances. We call this Setting 2. This matching strategy is, arguably, less sensitive to cases of heavy occlusion, which would otherwise confuse a 2D keypoint based matcher. Such confusion, as discussed in Section 5.2, leads to missed detections even when the model actually detects the appropriate person. Note that the two settings differ only in the way the predicted poses are matched with the ground-truth poses. All the other details of evaluation, like the 3D PCK threshold, the joints used for matching, etc., remain the same.
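A greedy matching on 3D distances, in the spirit of Setting 2, might look like the sketch below. The exact cost and tie-breaking rule are not given in the text, so the choices here are assumptions: the cost is the mean per-joint 3D distance between a prediction and a ground-truth pose, and pairs are assigned in order of increasing cost.

```python
import math

def mean_joint_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding 3D joints."""
    return sum(math.dist(p, q) for p, q in zip(pose_a, pose_b)) / len(pose_a)

def greedy_match_3d(pred_poses, gt_poses):
    """Greedily pair ground-truth poses with predictions by 3D distance.

    Returns a dict mapping ground-truth index -> prediction index. Ground-truth
    persons left without a prediction are absent from the dict (missed).
    """
    # All candidate pairs, cheapest first.
    pairs = sorted(
        (mean_joint_distance(pred, gt), gt_idx, pred_idx)
        for gt_idx, gt in enumerate(gt_poses)
        for pred_idx, pred in enumerate(pred_poses)
    )
    matches = {}
    used_predictions = set()
    for _, gt_idx, pred_idx in pairs:
        if gt_idx in matches or pred_idx in used_predictions:
            continue  # each pose takes part in at most one match
        matches[gt_idx] = pred_idx
        used_predictions.add(pred_idx)
    return matches
```

Because the cost is computed on 3D joints, two people who overlap in the image but stand at different depths or in different poses remain distinguishable, which is the motivation given for this setting.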
Human3.6M: Human3.6M  is a single-person 3D human pose dataset captured with a marker-based motion capture system. It consists of subjects performing actions. We evaluate our model on the commonly followed protocol [20, 31, 40, 27, 17, 6, 21] that uses subjects and for training. The evaluations are done on subjects and . All the videos are downsampled from to . The evaluation metric used is Mean Per Joint Position Error (MPJPE), which is calculated after aligning only the roots of the predicted and ground-truth 3D poses.
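As a concrete reading of the metric, the following sketch computes MPJPE after root alignment for a single pose. The function name and the default root index are assumptions for illustration; the result is in whatever unit the input coordinates use.

```python
import math

def mpjpe_root_aligned(pred, gt, root=0):
    """Mean per-joint position error after translating both poses so that
    their root joints coincide. No rotation or scale alignment is applied."""
    pred_root, gt_root = pred[root], gt[root]
    errors = []
    for p, g in zip(pred, gt):
        p_rel = tuple(a - b for a, b in zip(p, pred_root))
        g_rel = tuple(a - b for a, b in zip(g, gt_root))
        errors.append(math.dist(p_rel, g_rel))
    return sum(errors) / len(errors)
```

A prediction that differs from the ground truth only by a global translation therefore scores zero, which is why this protocol measures root-relative accuracy and says nothing about absolute placement in the camera frame.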
MSCOCO Keypoints: MSCOCO Keypoints is a large-scale dataset for the 2D multi-person keypoint detection task with roughly 110k training images. It also provides person bounding boxes and segmentation masks. The 2D keypoint detection task is evaluated using the commonly used Average Precision (AP) metric at different threshold levels. Similarly, the quality of bounding box detections is evaluated using AP.
5.2 Quantitative Evaluation
We now discuss the numerical results achieved on the datasets mentioned above.
MuPoTS-3D Test Set: Table 1 compares the performance of our simple yet effective method with existing multi-person 3D pose results. In Setting 1, we improve the state of the art significantly with a 3DPCK of 71.25% as against 65% in  and in . For LCRNet , the reported results are evaluated by . We report improved performance on several test sequences. We also significantly improve the performance on occluded joints ( vs ) as well as non-occluded joints ( vs ) when compared with . Our method also performs well when only detected persons are compared. In this setting, we observe 3DPCK while the state of the art is . We also compare our performance with the recently released XNect  and demonstrate competitive results on all annotated poses (71.3% vs 70.4%) as well as on the detected poses (74.2% vs 75.8%).
We also evaluate our method on the proposed Setting 2. We observe an improved 3DPCK of when compared with Setting 1. This improvement is facilitated by a simple tweak in the greedy matching of ground-truth and predicted persons. On deeper inspection, we see sharp improvements in sequences with heavy occlusions, like TS18 and TS19. Further, the overall improvement is significant when comparing the performance on occluded joints ( vs of ). This observation can be attributed to the fact that matching predictions with ground truths based on 2D keypoints leads to matching errors and missed detections when two or more persons occlude each other. Indeed, we observe that the algorithm’s detection percentage rose from 93% to 96%, thus improving the overall 3DPCK. Interestingly, we observe that TS10 suffers under this protocol because all three subjects bear similar poses for many frames. Thus, we believe the two settings are complementary. Another evaluation metric used in the MuPoTS test set is the Area Under Curve (AUC) of PCK values. We report an AUC of which is better than reported by  and in  using ground-truth detections. Our detection rate is which is comparable to the detection rate of  under Setting 1.
The above-mentioned results show a significant improvement over the state of the art. It is worth noting that all the results are comparable to the performance of single-person pose estimation methods.
| 2D-3D Training | HG-RCNN Training | MPJPE (mm) |
|---|---|---|
| H36M GT + noise | MS-COCO | 119.7 |
| H36M pred | MS-COCO + H36M | 65.2 |
| MPI-INF GT + noise | MS-COCO | 118.16 |
Human3.6M: The results on the Human3.6M dataset are detailed in Table 3. We achieve an MPJPE of after fine-tuning HG-RCNN on Human3.6M and without fine-tuning. It may be noted that Zanfir et al.  report their results on the official Human3.6M test set and achieve MPJPE. Since the test conditions are different, the comparison may not be fair. The combined results on MuPoTS-3D and Human3.6M also corroborate the claims in  that good performance on Human3.6M does not necessarily indicate better generalization in in-the-wild settings. We also evaluate our method under various train-test settings in Table 4 and observe that MPI-INF-3DHP  offers a wider range of poses to train from, thus leading to better results with ground-truth detections.
| all annotated joints | 70.1 | 72.4 |
| all occluded joints | 61.0 | 64.1 |
Mask-RCNN vs. HG-RCNN: Table 6 details the performance of HG-RCNN on the MSCOCO Keypoints dataset. Our results are comparable to Mask-RCNN’s reported results. We observe a slightly reduced mAP, which can be attributed to the fact that the Hourglass architecture is better suited to cases in which the larger body structure is visible, whereas the MS-COCO keypoints validation set contains many cases of isolated or truncated body parts. While evaluating on MuPoTS-3D, we observe improved 3DPCK using HG-RCNN on all annotated ( vs ) and all occluded ( vs ) joints alike. We also achieve results comparable to Mask-RCNN on person bounding box detection, as shown in Table 7.
While our method attempts to account for structural information during inter-personal occlusions, we believe it can be explicitly taken care of with better structural constraints and bounding box consistencies.
Sources of Error: Figure 5 shows interesting examples of failure cases and exposes three sources of error in our pipeline. The first source is poor 2D keypoint estimation, which is apparent in the occluding persons of Figure 5 (b). The second source of error is an unseen activity/pose which leads to erroneous prediction. This can be seen in squatting players of both the figures, wherein the data-induced model bias leads to incorrectly predicting a person sitting on a chair instead.
Finally, our camera-coordinate 3D pose prediction is sensitive to 2D keypoint detections and can wrongly reason about the person depth. This effect is observable in Figure 5 (a), in which the sitting people have been pushed back, in addition to the two outliers standing behind the player. The approximation also assumes the individuals to be of roughly the same size, and we observe incorrect relative positioning when the height difference is large. Lastly, while we compute the sums of bone lengths only on the torso joints to avoid the adverse effects of foreshortening, these effects cannot be completely eliminated.
This paper presents a simple extension of the Faster-RCNN framework that yields a near-real-time multi-person 3D human pose estimation network, HG-RCNN, which can be trained without a multi-person 3D pose dataset. Our proposed framework is simple to implement and outperforms previous state-of-the-art results by convincing margins. We also show that we can approximate the spatial layout of the scene. These claims are substantiated both quantitatively, through experimental evaluation, and qualitatively, through assessments on the COCO and MuPoTS-3D datasets. The paper also proposes an improvement to the greedy-matching strategy for multi-person 3D pose estimation evaluation and shows results on it. In the future, we plan to integrate this pipeline into a broader human-parsing pipeline, while also pursuing real-life applications such as activity detection and constructing a better human-centric scene understanding system.
This work is supported by Mercedes-Benz Research & Development India (RD/0117-MBRDI00-001).
References
[1] (2015) Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR.
[2] (2014) 2D human pose estimation: new benchmark and state of the art analysis. In CVPR.
[3] (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In ECCV.
[4] (2017) Realtime multi-person 2D pose estimation using part affinity fields. In CVPR.
[5] (2017) 3D human pose estimation = 2D pose estimation + matching. In CVPR.
[6] (2018) Learning 3D human pose from structure and motion. In ECCV.
[7] (2017) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR.
[8] (2017) Mask R-CNN. In ICCV.
[9] (2018) Learning to segment every thing. In CVPR.
[10] (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI.
[11] (2019) Learning 3D human dynamics from video. In CVPR.
[12] 3D human pose estimation from monocular images with deep convolutional neural network. In ACCV.
[13] (2017) Recurrent 3D pose sequence machines. In CVPR.
[14] (2014) Microsoft COCO: common objects in context. arXiv preprint arXiv:1405.0312.
[15] (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graph.
[16] (2017) A simple yet effective baseline for 3D human pose estimation. In ICCV.
[17] (2017) Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV.
[18] (2019) XNect: real-time multi-person 3D human pose estimation with a single RGB camera. arXiv preprint arXiv:1907.00837.
[19] (2018) Single-shot multi-person 3D pose estimation from monocular RGB. In 3DV.
[20] (2017) VNect: real-time 3D human pose estimation with a single RGB camera. In ACM ToG.
[21] (2017) 3D human pose estimation from a single image via distance matrix regression. In CVPR.
[22] (2016) Stacked hourglass networks for human pose estimation. In ECCV.
[23] (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR.
[24] (2018) Exploiting temporal information for 3D human pose estimation. In ECCV.
[25] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
[26] DensePose: dense human pose estimation in the wild.
[27] (2017) LCR-Net: localization-classification-regression for human pose. In CVPR.
[28] (2019) LCR-Net++: multi-person 2D and 3D pose detection in natural images. TPAMI.
[29] (2016) 3D human pose estimation: a review of the literature and analysis of covariates. Computer Vision and Image Understanding.
[30] (2018) Pose proposal networks. In ECCV.
[31] (2017) Compositional human pose regression. In ICCV.
[32] (2018) Integral human pose regression. In ECCV.
[33] (2017) Lifting from the deep: convolutional 3D pose estimation from a single image. In CVPR.
[34] (2016) Convolutional pose machines. In CVPR.
[35] (2017) Monocular 3D human pose estimation by predicting depth on joints. In ICCV.
[36] (2016) A dual-source approach for 3D pose estimation from a single image. In CVPR.
[37] (2018) Monocular 3D pose and shape estimation of multiple people in natural scenes - the importance of multiple scene constraints. In CVPR.
[38] (2018) Deep network for the integrated 3D sensing of multiple people in natural images. In NeurIPS.
[39] (2016) Sparseness meets deepness: 3D human pose estimation from monocular video. In CVPR.
[40] (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In ICCV.
[41] (2016) Deep kinematic pose regression. In ECCV Workshops.