Body Meshes as Points

We consider the challenging multi-person 3D body mesh estimation task in this work. Existing methods are mostly two-stage: one stage for person localization and the other for individual body mesh estimation, leading to redundant pipelines with high computation cost and degraded performance in complex scenes (e.g., occluded person instances). In this work, we present a single-stage model, Body Meshes as Points (BMP), to simplify the pipeline and improve both efficiency and performance. In particular, BMP adopts a new method that represents multiple person instances as points in the spatial-depth space, where each point is associated with one body mesh. Building on this representation, BMP can directly predict body meshes for multiple persons in a single stage by concurrently localizing person instance points and estimating the corresponding body meshes. To better reason about the depth ordering of all the persons within the same scene, BMP designs a simple yet effective inter-instance ordinal depth loss to obtain depth-coherent body mesh estimation. BMP also introduces a novel keypoint-aware augmentation to enhance model robustness to occluded person instances. Comprehensive experiments on the Panoptic, MuPoTS-3D and 3DPW benchmarks demonstrate the state-of-the-art efficiency of BMP for multi-person body mesh estimation, together with outstanding accuracy. Code can be found at:




1 Introduction

3D human body mesh recovery aims to reconstruct the 3D full-body mesh of the person instance from images or videos. As a fundamental yet challenging task, it has been widely applied for action recognition [varol2019synthetic], virtual try-on [mir2020learning], motion retargeting [liu2019liquid], etc. With the recent notable progress in single-person full-body mesh recovery [hmrKanazawa17, kolotouros2019convolutional, arnab2019exploiting, kocabas2020vibe], a more realistic and challenging setting has attracted increasing attention: estimating body meshes for multiple persons from a single image.

Figure 1: Our single-stage solution. The proposed model represents each person instance as the center point of its body. Instance localization and body mesh recovery are then directly predicted from the center point features, enabling simultaneous reconstruction of multiple persons in a single stage.

Existing methods for multi-person mesh recovery are mainly two-stage solutions, including top-down [jiang2020coherent] and bottom-up [zanfir2018monocular] approaches. The former first localizes person instances via a person detector, based on which it then recovers the 3D meshes individually; the bottom-up approach estimates person keypoints at first, and then jointly reconstructs multiple 3D human bodies in the image via constrained optimization [zanfir2018monocular]. Though with notable accuracy, the above paradigms are inefficient with computational redundancy. For instance, the former one estimates body mesh for each person separately, and consequently the total computation cost linearly grows with the number of persons in the image, while the latter requires grouping the keypoints into corresponding persons and inferring the body meshes iteratively, leading to high computational cost.

Aiming at a more efficient and compact pipeline, we explore a single-stage solution. Despite the recent popularity and promising performance of single-stage methods on 2D keypoint estimation [nie2019spm] and object detection tasks [zhou2019objects, tian2019fcos], a single-stage pipeline for multi-person mesh recovery is barely explored, as it remains unclear how to effectively integrate both person localization and mesh recovery steps within a single stage. In this work, we propose a new instance representation for multi-person body mesh recovery that represents multiple person instances as points in the spatial-depth space, where each point is associated with one body mesh. Such a representation allows effective parallelism of person localization and body mesh recovery. Based on it, we develop a new model architecture that exploits shareable features for both localization and mesh recovery and thus achieves a single-stage solution.

In particular, the model has two parallel branches, one for instance localization and the other for body mesh recovery. In the localization branch, we model each person instance as a single point in a 3-dimensional space, spatial (2D) and depth (1D), where each localized point (detected person) is associated with a body mesh in the body mesh branch, represented by the SMPL parametric model [loper2015smpl]. This in turn converts multi-person mesh recovery into a single-shot regression problem (Fig. 1). Specifically, the spatial location is represented by the discrete coordinates of regular grids over the input image. Similarly, we discretize depth into several levels to obtain the depth representation. To learn a better feature representation that differentiates instances at different depths, and motivated by the observation that a person closer to the camera tends to appear larger in the image, we adopt the feature pyramid network (FPN) [lin2017feature] to extract multi-scale features and use features from the lower scales to represent the closer (and larger) instances. In this way, each instance is represented as one point, whose associated features (extracted from its corresponding spatial location and FPN scale) are used to effectively estimate its body mesh. We name this approach Body Meshes as Points (BMP).

Applying the BMP model to estimate multi-person body meshes simultaneously faces two challenges in realistic scenarios: how to coherently reconstruct instances with correct depth ordering, and how to handle the common occlusion issue (e.g., overlapping instances and partial observations). For the first challenge, we consider explicitly using the ordinal relations among all the persons in the scene to supervise the model to output body meshes with correct depth order. However, obtaining such ordinal relations is non-trivial for scenes captured in the wild, since no 3D annotation is available. Inspired by the recent success of depth estimation for human body joints [moon2019camera, zhen2020smap], we propose to take the depth of each person (center point) predicted by a model pre-trained on 3D datasets with depth annotations as the pseudo ordinal relation for model training on in-the-wild data, which is shown experimentally to benefit depth-coherent body mesh reconstruction.

Also, to tackle the common occlusion issue, we propose a novel keypoint-aware occlusion augmentation strategy to improve the model robustness to occluded person instances. Different from the previous method [sarandi2018robust] that randomly simulates occlusion in images, we generate synthetic occlusion based on the position of skeleton keypoints. Such keypoint-aware occlusion explicitly forces the model to focus on body structure, making it more robust to occlusion.

Comprehensive experiments on the 3D pose benchmarks Panoptic [joo2015panoptic], MuPoTS-3D [singleshotmultiperson2018] and 3DPW [vonMarcard2018] demonstrate the high efficiency of the proposed model. Moreover, it achieves new state-of-the-art results on the Panoptic and MuPoTS-3D datasets, and competitive performance on the 3DPW dataset. Our contributions are summarized as follows: 1) To the best of our knowledge, we are among the first to explore a single-stage solution to multi-person mesh recovery. We introduce a new person instance representation that enables simultaneous person localization and body mesh recovery for all person instances in an image within a single stage, and design a novel model architecture accordingly. 2) We propose a simple yet effective inter-instance ordinal relation supervision to encourage depth-coherent reconstruction. 3) We propose a keypoint-aware occlusion augmentation strategy that takes body structure into consideration, to improve model robustness to occlusion.

Figure 2: Illustration of our BMP framework. An input image is uniformly divided into an $S \times S$ grid. The model adopts an FPN with $K$ levels. Each person instance is thus represented by its residing grid cell and its associated FPN level (according to its depth). BMP uses the features from the grid cell and FPN level to localize the contained person (top) and estimate the body mesh (bottom) simultaneously.

2 Related Work

Single-person 3D pose and shape Previous works estimate 3D poses in the form of a body skeleton [martinez2017simple, mehta2017vnect, tome2017lifting, zhou2017towards, popa2017deep, pavlakos2018ordinal, sun2018integral, zhang2020inference, gong2021poseaug] or a non-parametric 3D shape [gabeur2019moulding, smith2019facsimile, varol2018bodynet]. In this work, we use the 3D mesh to represent the full-body pose and shape, and adopt the SMPL parametric model [loper2015smpl] for body mesh recovery. In the literature, Bogo et al. [bogo2016keep] proposed SMPLify, the first optimization-based method to fit SMPL to the detected 2D joints iteratively. Later works extend SMPLify by either replacing sparse keypoints with denser reference points, such as silhouettes and voxel occupancy grids, for SMPL fitting [lassner2017unite, varol2018bodynet], or fitting a more expressive model (e.g., SMPL-X) than SMPL [pavlakos2019expressive].

Some recent works directly regress the SMPL parameters from images via deep neural networks in a two-stage manner. They first estimate an intermediate representation (e.g., keypoints, silhouettes, etc.) from images and then map it to SMPL parameters [pavlakos2018learning, omran2018neural, tung2017self, kolotouros2019convolutional]. Some others directly estimate SMPL parameters from images, either using complex model training strategies [hmrKanazawa17, guler2019holopose] or leveraging temporal information [arnab2019exploiting, kocabas2020vibe]. Although high accuracy is achieved in single-person cases, it remains unclear how to extend them to the more general multi-person cases.

Multi-person 3D pose and shape For multi-person 3D pose estimation, most existing methods adopt a top-down paradigm [rogez2017lcr, dabral2018learning, rogez2019lcr]. They first detect each person instance and then regress the locations of the body joints. Follow-up improvements are made by estimating additional absolute depth [moon2019camera], considering multi-person interaction [guo2020pi, li2020hmor] or extending to whole-body pose estimation [weinzaepfel2020dope]. Alternatively, some approaches also explore the bottom-up paradigm. SSMP3D [mehta2018single] and SMAP [zhen2020smap] estimate 3D poses from occlusion-aware pose maps and use Part Affinity Fields [cao2017realtime] to infer their association. LoCO [fabbri2020compressed] maps the image to the volumetric heatmaps and then estimates multi-person 3D poses from them by an encoder-decoder framework. PandaNet [benzine2020pandanet] is an anchor-based model where 3D poses are regressed for each anchor position.

In contrast to the prosperity of multi-person 3D pose estimation, only a limited number of works are devoted to body mesh recovery for multiple people. Zanfir et al. [zanfir2018monocular] first estimate 3D joints of persons in the image and then optimize their 3D shapes jointly with multiple constraints. They also propose a two-stage regression-based scheme that first estimates 3D joints for all the persons and then regresses their 3D shapes based on these 3D joints [zanfir2018deep]. Instead of regressing SMPL parameters from an intermediate representation (e.g., 3D joints), Jiang et al. [jiang2020coherent] attach an SMPL head to the Faster R-CNN framework [ren2015faster] for estimating SMPL parameters directly from the input image in a top-down manner. Despite the encouraging results, these methods are based on an indirect multi-stage framework and suffer from low efficiency. Different from all previous methods that rely on a multi-stage pipeline with computation redundancy, our method unifies person localization and body mesh recovery, and enables a box-free and (ad hoc) optimization-free single-stage solution to multi-person body mesh recovery.

Point-based representation The point-based methods [duan2019centernet, zhou2019objects, tian2019fcos] represent instances by a single point at their center. This approach is regarded as a simple replacement of the anchor-based representation, which has been widely used in many tasks, including object detection [duan2019centernet, zhou2019objects, tian2019fcos], 2D keypoints estimation [nie2019spm] and instance segmentation [wang2019solo]. However, these methods cannot be directly applied to body mesh recovery. In this work, we extend the point-based representation to multi-person body mesh recovery. A concurrent work [CenterHMR] adopts a similar solution to body mesh recovery. Our model differs from it in two significant aspects: 1) BMP aims at more coherent reconstruction of persons in the scenes. It handles challenging spatial arrangement and occlusion problems by exploiting the ordinal depth loss and the keypoint-aware augmentation strategy, which are not considered in [CenterHMR]. 2) BMP adopts a novel 3D point-based representation to differentiate instances at different depths, thus is more robust to overlapped instances; whereas [CenterHMR] uses only 2D representation, and would fail in such cases.

3 Body Meshes as Points

3.1 Proposed single-stage solution

Given an image $I$, multi-person body mesh recovery aims at recovering the body meshes of all the person instances in $I$. Existing approaches [zanfir2018deep, zanfir2018monocular, jiang2020coherent] solve this task by sequentially localizing instances and estimating body meshes in a multi-stage manner, leading to computation redundancy. In contrast, this work aims to unify instance localization and body mesh recovery into a single-stage solution to enable a more efficient and concise framework.

We represent each person instance as a single point in a 3-dimensional space (spanned by 2D spatial and 1D depth dimensions). By dividing the input image uniformly into $S \times S$ grids, its spatial dimension can be easily represented within such a grid coordinate. If the body center of a person falls into grid cell $(i, j)$, it is assigned the spatial coordinate $(i, j)$. Similarly, for the depth dimension, we discretize the depth value into $K$ levels and obtain the level $k$ for each instance according to its depth. Such a discretized depth value is beneficial for handling occluding instances, especially when the body centers of multiple instances fall into the same spatial grid coordinate.

Given this representation, we reformulate multi-person mesh recovery as two simultaneous prediction tasks: 1) instance localization and 2) body mesh recovery.

Instance localization For the first task, we employ the instance map $\mathcal{C} \in \mathbb{R}^{S \times S \times K}$ to locate each person instance in the image, where $S$ denotes the number of grid cells along one side, while $K$ refers to the total number of depth levels. For each depth level, the network is trained to regress a scalar indicating the probability of every grid cell containing a person.

To construct ground truth (GT) for training, we first determine the depth value for each instance. We observe that a person tends to appear larger (smaller) in the image when standing closer to (away from) the camera. In other words, the depth of an instance is roughly inversely proportional to its scale. Inspired by this, we employ a Feature Pyramid Network (FPN) [lin2017feature] with $K$ pyramid levels to capture different scales, each of which is used to represent the instances with the corresponding depth. More specifically, for each instance, we compute its scale $s$ from the GT body size $(h, w)$, and associate it with the corresponding pyramid level $P_k$ according to Table 1.


Pyramid          P2     P3       P4       P5        P6
Stride           8      8        16       32        32
Grid number      40     36       24       16        12
Instance scale   <64    32~128   64~256   128~512   ≥256


Table 1: We employ an FPN with five pyramid levels. Each pyramid level $P_k$ is used to predict the instance and body mesh maps for the corresponding depth level $k$.
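The level assignment in Table 1 can be sketched as follows. This is a minimal illustration, not the released implementation: the level names `P2` to `P6`, the half-open interval boundaries, and the use of the geometric mean of body height and width as the instance scale are all assumptions made for the sketch. Because the scale ranges overlap, an instance may be associated with more than one level.

```python
import math

# Scale ranges copied from Table 1. The level names P2..P6 and the choice of
# half-open intervals [lo, hi) are assumptions made for this sketch.
LEVELS = [
    ("P2", 0, 64),
    ("P3", 32, 128),
    ("P4", 64, 256),
    ("P5", 128, 512),
    ("P6", 256, math.inf),
]


def instance_scale(height, width):
    """Assumed scale measure: geometric mean of the GT body height and width."""
    return math.sqrt(height * width)


def assign_levels(height, width):
    """Return every pyramid level whose scale range contains the instance."""
    s = instance_scale(height, width)
    return [name for name, lo, hi in LEVELS if lo <= s < hi]
```

For example, `assign_levels(100, 100)` yields two levels because a scale of 100 lies in both the 32~128 and 64~256 ranges.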

Next, we locate the grid cell in $\mathcal{C}$ where the central region of that person lies. Inspired by [zhou2019objects, duan2019centernet], the central region is defined as follows: given the GT body center $(c_x, c_y)$, the body size $(h, w)$ of each person and a controllable scale factor $\epsilon$, the position and size of the central region are defined as $(c_x, c_y, \epsilon w, \epsilon h)$. In this work, we set the position of the pelvis as the body center and use a fixed $\epsilon$. Once identified, the grid cell $(i, j)$ of the $k$-th pyramid level, $\mathcal{C}_{i,j,k}$, is labeled as positive (label 1). The above steps are repeated for all the instances in the image.
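To make the labeling step concrete, the sketch below maps a central region to the grid cells it overlaps. It is only an illustration under stated assumptions: the function name, the `(row, col)` cell convention, the default `eps=0.2`, and the choice to mark every overlapped cell as positive are hypothetical and not taken from the paper.

```python
def center_region_cells(cx, cy, w, h, img_w, img_h, S, eps=0.2):
    """Grid cells (row, col) of an S x S grid overlapped by the central region
    (cx, cy, eps * w, eps * h). eps=0.2 is an assumed value for illustration."""
    half_w, half_h = eps * w / 2.0, eps * h / 2.0
    # Convert the pixel extent of the region to grid indices, clamped to the grid.
    col_lo = max(0, int((cx - half_w) / img_w * S))
    col_hi = min(S - 1, int((cx + half_w) / img_w * S))
    row_lo = max(0, int((cy - half_h) / img_h * S))
    row_hi = min(S - 1, int((cy + half_h) / img_h * S))
    return [(r, c)
            for r in range(row_lo, row_hi + 1)
            for c in range(col_lo, col_hi + 1)]
```

Each returned cell would receive label 1 on the pyramid level chosen for that instance; all other cells stay negative.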

Body mesh representation In parallel with the instance localization, we use the body mesh map $\mathcal{P} \in \mathbb{R}^{S \times S \times K \times D}$ for body mesh recovery, where $D$ is the dimension of the body mesh representation. Concretely, given a positive response in $\mathcal{C}$ that indicates the presence of a person, we regress the body mesh representation using the features from the corresponding grid cell, as shown in Fig. 2. In this work, we use the SMPL parametric model [loper2015smpl] for body mesh representation, which renders a body mesh using the pose parameters $\theta$ and shape parameters $\beta \in \mathbb{R}^{10}$. To improve the training stability, we adopt the 6D rotation representation [zhou2019continuity] for the pose parameters, with $\theta \in \mathbb{R}^{144}$ (24 joints with 6 values each). The body mesh map also predicts a camera parameter $\pi \in \mathbb{R}^{3}$ for projecting body joints from 3D back to 2D space, which enables training on in-the-wild 2D pose datasets [johnson2010clustered, lin2014microsoft, andriluka20142d] to improve model generalization [hmrKanazawa17]. We further introduce a scalar confidence score $c$, defined as the OKS [girdhar2018detect] between the projected and GT 2D keypoints, to reflect the confidence level of the SMPL prediction; and we also propose an absolute depth variable $d$ for the corresponding person instance that will be used for penalizing body mesh estimations with incoherent depth ordering (see Sec. 3.2 for details). Therefore, the total channel number of the body mesh map is $144 + 10 + 3 + 1 + 1 = 159$.
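The channel count of 159 can be checked by splitting one grid cell's prediction vector into its parts. In the sketch below, the 3-dimensional camera is inferred from the stated total (159 minus 144 pose, 10 shape, 1 confidence and 1 depth channels); the field names and the ordering of the fields are assumptions for illustration only.

```python
# Channel layout implied by the text: 24 joints x 6D rotation, 10 shape
# coefficients, a 3-d camera, one confidence score and one depth value.
# The field names and their ordering are assumptions of this sketch.
FIELDS = [
    ("pose_6d", 24 * 6),
    ("shape", 10),
    ("camera", 3),
    ("confidence", 1),
    ("depth", 1),
]


def split_mesh_vector(vec):
    """Split one grid cell's prediction vector into named fields."""
    total = sum(n for _, n in FIELDS)
    if len(vec) != total:
        raise ValueError(f"expected {total} channels, got {len(vec)}")
    out, start = {}, 0
    for name, n in FIELDS:
        out[name] = vec[start:start + n]
        start += n
    return out
```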

Network architecture We employ ResNet-50 [He_2016_CVPR] as our backbone. An FPN is built on top of the backbone to extract a pyramid of feature maps (256-d). To perform body mesh recovery, we attach two task-specific heads to each level of the feature pyramid, one for instance localization and the other for the corresponding body mesh recovery, responsible for obtaining the instance map $\mathcal{C}$ and the body mesh map $\mathcal{P}$, respectively. As shown in Fig. 2, each head consists of 7 stacked convolutions and one task-specific prediction layer. However, directly estimating the camera parameter from the whole image is non-trivial since it is sensitive to instance position. Inspired by CoordConv [liu2018intriguing], we concatenate normalized pixel coordinates to the input feature map at the beginning of the mesh recovery head to encode position information into the network for better estimation of the camera parameter. Additionally, Group Normalization [wu2018group] is used in both prediction heads to facilitate model training. In order to match the features of size $H \times W$ to $S \times S$, we apply bilinear interpolation before the instance and body-mesh recovery branches.
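The CoordConv-style input described above amounts to appending two extra channels that hold each pixel's normalized position. A minimal sketch follows; the function name and the normalization range of [-1, 1] are assumptions, since the text only says the coordinates are normalized.

```python
def normalized_coord_channels(height, width):
    """Return two H x W channels holding x and y coordinates scaled to [-1, 1].

    The [-1, 1] range is an assumption of this sketch.
    """
    def norm(i, n):
        # A single-element axis has no extent, so place it at the centre.
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    xs = [[norm(c, width) for c in range(width)] for _ in range(height)]
    ys = [[norm(r, height) for _ in range(width)] for r in range(height)]
    return xs, ys
```

These two channels would be concatenated to the 256-d feature map before the first convolution of the mesh recovery head, so the head sees 258 input channels under this assumption.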

3.2 Inter-instance ordinal depth supervision

Multi-person body mesh recovery is inherently ill-posed as multiple 3D predictions can correspond to the same 2D projection. Therefore, the trained model would produce ambiguous body mesh estimations with incorrect depth order due to lack of priors. To alleviate such a problem, we use ordinal depth relations among all the persons in the input as supervision to guide reasoning about the depth ordering during the training process.

More concretely, given any two persons ($i$, $j$) in the image, we define the ordinal depth relation between them as $r_{ij}$, whose value is determined by comparing their depths, where $d_i$ denotes the depth of person $i$ and $\tau$ is a pre-defined threshold used to determine the ordinal relation. The ordinal relation $r_{ij} = 0$ means both instances are at roughly the same depth; otherwise one of them is closer to the camera than the other. With the ordinal relation of ($i$, $j$), we define the ordinal depth loss for this pair on the predicted depths, where $z_i$ denotes the $i$-th person's body mesh depth calculated from the predicted camera parameter with focal length $f$, scale $s$ and the image's long edge width $w$. The ordinal depth loss enforces a large margin between $z_i$ and $z_j$ if $r_{ij} \neq 0$, i.e., one of them is measured as closer than the other, and otherwise enforces them to be equal.

However, in practice such ordinal depth relations are rarely available for scenes captured in the wild due to the lack of 3D annotations. To solve this issue, we propose to use pseudo ordinal relations for model training on in-the-wild data. Specifically, we first train the model on 3D datasets [ionescu2014human3, singleshotmultiperson2018] with depth annotations to learn to estimate the depth of each person in the images. We define the depth of each person as the depth of the body center (i.e., the pelvis joint). The model is trained by minimizing a depth loss $L_{depth}$, which is defined as the mean squared error (MSE) between the predicted and GT depths. After that, given unlabeled data, we first leverage the pre-trained model to estimate the depth, which is then used to obtain the pseudo ordinal relations for all the people in the image. Finally, given the pseudo ordinal relations, we adopt an OKS score-weighted ordinal depth loss to supervise the model training for images in the wild. The total loss for image $I$ is computed as the average loss over all instance pairs, where $N$ denotes the number of paired instances in the image and $c_i$ denotes the OKS score of the $i$-th person. Intuitively, training the model with such inter-instance ordinal depth supervision can help the model build a global understanding of the depth layout in the input scene and thus ensure more coherent reconstructions.
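One way to realize the supervision described in this section is sketched below. It should be read as an illustration under several assumptions rather than the paper's exact equations: the sign convention of the relation (+1 when the first person is closer), the logistic ranking form for ordered pairs, the squared difference for equal-depth pairs, the product of the two OKS scores as the pair weight, and the weak-perspective conversion `2 * f / (s * w)` for the mesh depth are all choices made here because they match the verbal description and common practice.

```python
import math
from itertools import combinations


def ordinal_relation(d_i, d_j, tau):
    """+1 if person i is closer than j by more than tau, -1 if farther, else 0."""
    if d_j - d_i > tau:
        return 1
    if d_i - d_j > tau:
        return -1
    return 0


def mesh_depth(focal, scale, long_edge):
    """Assumed weak-perspective conversion from camera scale to depth."""
    return 2.0 * focal / (scale * long_edge)


def pair_loss(r, z_i, z_j):
    """Ranking loss for ordered pairs, squared difference for equal-depth pairs."""
    if r == 0:
        return (z_i - z_j) ** 2
    return math.log(1.0 + math.exp(r * (z_i - z_j)))


def ordinal_depth_loss(ref_depths, pred_depths, oks, tau):
    """Average OKS-weighted pairwise loss over all instance pairs in one image."""
    pairs = list(combinations(range(len(ref_depths)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        r = ordinal_relation(ref_depths[i], ref_depths[j], tau)
        total += oks[i] * oks[j] * pair_loss(r, pred_depths[i], pred_depths[j])
    return total / len(pairs)
```

Here `ref_depths` stands for either annotated depths or the pseudo depths produced by the pre-trained model, and `pred_depths` for the mesh depths derived from the predicted cameras. Weighting by OKS reduces the influence of pairs whose mesh predictions are unreliable.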

3.3 Keypoint-aware occlusion augmentation

SMPL-based body mesh recovery is highly sensitive to (partial) occlusion (e.g., overlapping persons, truncation) [zhang2020object, rockwell2020full]. To improve model robustness to occlusion without requiring extra training data and annotations, we propose a keypoint-aware occlusion augmentation strategy for use during training. The proposed augmentation strategy generates synthetic occlusion that imitates challenging real cases, such as partial observation. Previous work [sarandi2018robust] randomly simulates occlusion on the images, which may produce easy training samples that are less helpful for boosting model performance. In contrast, our method generates synthetic occlusion based on the positions of skeleton keypoints, which forces the model to pay more attention to the body structure and leads to a notable improvement. More concretely, given the set of keypoints of a person in the image, we first randomly choose a keypoint $k$. Then we randomly sample a non-human object from the PASCAL VOC [everingham2011pascal] dataset and composite it at the location of the selected keypoint $k$. Before compositing, we randomly resize the sampled object to a size within a range defined relative to $A$, where $A$ denotes the area of that person. Additionally, we randomly shift the keypoint position by an offset to avoid over-fitting. During training, we set the probability of the occlusion augmentation to 0.5.
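The placement logic of the augmentation can be sketched without any image library, as below. The sketch only computes where an occluder would be pasted; the actual compositing of a PASCAL VOC object is omitted. The function names, the square occluder shape, the `area_frac` range and the `max_shift` offset are hypothetical values chosen for illustration, since the corresponding numbers are not available in the text.

```python
import random


def sample_occluder_box(keypoints, person_area, rng,
                        area_frac=(0.1, 0.3), max_shift=10.0):
    """Pick a keypoint and return (x0, y0, x1, y1) of a square occluder.

    area_frac and max_shift are hypothetical values, not taken from the paper.
    """
    kx, ky = rng.choice(keypoints)
    # Jitter the paste location so the occluder is not always centred on a joint.
    kx += rng.uniform(-max_shift, max_shift)
    ky += rng.uniform(-max_shift, max_shift)
    # The occluder area is a random fraction of the person's area.
    side = (rng.uniform(*area_frac) * person_area) ** 0.5
    return (kx - side / 2, ky - side / 2, kx + side / 2, ky + side / 2)


def maybe_occlude(keypoints, person_area, rng, prob=0.5):
    """Apply the augmentation with probability prob; None means no occlusion."""
    if rng.random() >= prob:
        return None
    return sample_occluder_box(keypoints, person_area, rng)
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible across runs.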

3.4 Training and inference


For training our proposed BMP model, we define the loss function $L$ as a combination of three terms: $L_{ins}$, a modified two-class Focal Loss [lin2017focal] for instance localization; $L_{depth}$, the depth loss (Sec. 3.2); and $L_{mesh}$, the loss for body mesh estimation. The training details of the body-mesh branch are similar to those in HMR [hmrKanazawa17]. Specifically, we formulate $L_{mesh}$ as a weighted sum of the following terms. $L_{\theta}$, $L_{\beta}$, $L_{3D}$ and $L_{v}$ denote the MSE between the predicted and GT pose and shape parameters as well as 3D keypoints and vertices, respectively. $L_{2D}$ is the 2D keypoint loss that minimizes the distance between the 2D projection of the 3D keypoints and the GT 2D keypoints. $L_{conf}$ is the MSE between the predicted and GT confidences, where the GT confidence is computed as the OKS [girdhar2018detect] between the projected and GT 2D keypoints. Moreover, we use a discriminator and apply an adversarial loss $L_{adv}$ on the regressed pose and shape parameters, to encourage the outputs to lie on the distribution of real human bodies. Each term is scaled by its own weight $\lambda$. The mesh loss is applied independently to each positive grid cell. The ordinal depth loss illustrated in Eqn. (3) is adopted when the image contains more than one instance.

Inference The overall inference procedure for BMP is illustrated in Fig. 2. Given an image, BMP first obtains the instance map $\mathcal{C}$ and the body mesh map $\mathcal{P}$ from the prediction heads. Then it performs a max pooling operation to find the local maxima on $\mathcal{C}$ and obtain the center point positions $\{(k_i, x_i, y_i)\}_{i=1}^{n}$, where $k_i$ and ($x_i$, $y_i$) denote the pyramid level and body center location for the $i$-th person, respectively, and $n$ is the number of estimated persons. After that, BMP extracts the body mesh parameters of each person by indexing $\mathcal{P}$ at $(k_i, x_i, y_i)$. Finally, BMP outputs body mesh estimations by deforming the SMPL model using the predicted parameters. A keypoint-based NMS [girdhar2018detect] is applied to remove redundant predictions if they exist. We take the product of the predicted OKS score and the probability score from the instance map as the confidence score for NMS.
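The peak extraction and scoring steps of inference can be sketched in plain Python as follows. This is an illustrative stand-in for the max pooling operation: a cell is kept if it equals the maximum of its 3 x 3 neighbourhood on the same level. The function names, the window size, the `(level, row, col)` ordering and the `threshold` value are assumptions, and the keypoint-based NMS itself is not shown.

```python
def find_peaks(prob_map, threshold=0.3):
    """Return (level, row, col, prob) for cells that are the maximum of their
    3x3 neighbourhood. prob_map is a list of levels, each a 2D grid of
    probabilities. The threshold value is an assumption of this sketch."""
    peaks = []
    for k, grid in enumerate(prob_map):
        rows, cols = len(grid), len(grid[0])
        for r in range(rows):
            for c in range(cols):
                p = grid[r][c]
                if p < threshold:
                    continue
                neighbours = [grid[rr][cc]
                              for rr in range(max(0, r - 1), min(rows, r + 2))
                              for cc in range(max(0, c - 1), min(cols, c + 2))]
                if p >= max(neighbours):
                    peaks.append((k, r, c, p))
    return peaks


def nms_scores(peaks, oks_map):
    """Confidence used for NMS: instance probability times predicted OKS."""
    return [p * oks_map[k][r][c] for k, r, c, p in peaks]
```

Each peak's `(level, row, col)` index would then be used to read the 159-channel mesh vector from the body mesh map at the same location.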

3.5 Implementation details

We implement BMP with PyTorch [paszke2017automatic] and the mmdetection library [mmdetection], and utilize Rectified Adam [liu2019variance] as the optimizer with an initial learning rate of . We resize all images to 832 × 512 while keeping the same aspect ratio, following the original COCO training scheme [su2019multi, wang2019solo, jiang2020coherent]. During training, we augment the samples with horizontal flip and keypoint-aware occlusion (Sec. 3.3). Flip augmentation is conducted during testing. Moreover, since the BMP model directly extracts image-level features for estimation instead of features from cropped bounding boxes, it can take images with smaller resolution (512 × 512) as inputs. We denote such a setting as BMP-Lite. Other training and testing settings are the same between BMP-Lite and BMP. Please refer to the supplementary material for more details.

4 Experiments

In this section, we aim to answer the following questions. 1) Can BMP provide both efficient and accurate multi-person mesh recovery? 2) Is BMP able to give coherent meshes for multiple persons with correct depth ordering? 3) Is BMP robust to cases where person instances are occluded or partially observed? To this end, we conduct extensive experiments on several large-scale benchmarks.

4.1 Datasets

Human3.6M [ionescu2014human3] is the most widely used single-person 3D pose benchmark collected in an indoor environment. It contains 3.6 million 3D poses and corresponding videos for 11 subjects. Due to its high-quality annotations, we use it following [jiang2020coherent] for both training and testing.

Panoptic [joo2015panoptic] is a large-scale dataset captured in the Panoptic studio, offering 3D pose annotations for multiple people engaging in diverse social activities. We use this dataset for evaluation with the same protocol as [zanfir2018monocular].

MuPoTS-3D [singleshotmultiperson2018] is a multi-person dataset with 3D pose annotations for both indoor and in-the-wild scenes. We follow [singleshotmultiperson2018] and use it for evaluation.

3DPW [vonMarcard2018] is a multi-person in-the-wild dataset, which features diverse motions and scenes. It contains 60 video sequences (24 train, 24 test, 12 validation) with full-body mesh annotations. To verify generalizability of the proposed model to challenging in-the-wild scenarios, we use its test set for evaluation, following the same protocol as [kocabas2020vibe].

MPI-INF-3DHP [mehta2017vnect] is a single-person multi-view 3D pose dataset. It contains 8 actors performing 8 activities, captured from 14 cameras. Mehta et al. [singleshotmultiperson2018] generate a multi-person dataset called MuCo-3DHP from MPI-INF-3DHP by mixing up segmented foreground human appearance. We use both datasets for training.

COCO [lin2014microsoft], LSP [johnson2010clustered], LSP Extended [Johnson11], PoseTrack [Andriluka_2018_CVPR], MPII [andriluka20142d] are in-the-wild datasets with annotations for 2D joints. We use them for training with the weakly-supervised training strategy [hmrKanazawa17] (Eqn. (5)).

4.2 Comparison with state-of-the-arts

Single-person setting We first evaluate our proposed BMP model in the single-person setting to validate that the BMP strategy of factorizing instance localization and mesh recovery does not sacrifice performance. Concretely, we evaluate BMP on the large-scale Human3.6M dataset and compare it with the most competitive approaches [hmrKanazawa17, jiang2020coherent] that share a similar regression target and learning strategy. The results are shown in Table 2. We observe that BMP outperforms both of these methods.


Method HMR [hmrKanazawa17] CRMH [jiang2020coherent] BMP
PA-MPJPE 56.8 52.7 51.3


Table 2: Results on Human3.6M. We use mean per joint position errors in mm after Procrustes alignment (PA-MPJPE) as metric.

Multi-person settings Then we evaluate our BMP model for multi-person body mesh recovery. We first evaluate it on the multi-person dataset captured in the indoor Panoptic Studio [joo2015panoptic] and compare with the most competitive approaches [zanfir2018monocular, zanfir2018deep, jiang2020coherent]. As shown in Table 3, our BMP model achieves the best performance in all scenarios. Overall, it improves upon the state-of-the-art top-down model CRMH [jiang2020coherent] by 5.4% (135.4 mm vs. 143.2 mm in MPJPE), while offering a faster inference speed. (We count per-image inference time in seconds. For all methods, the time is measured on a Tesla P100 GPU and an Intel E5-2650 v2 @ 2.60GHz CPU, without using test-time augmentation.) Moreover, it significantly outperforms CRMH in the Ultimatum and Pizza scenarios, which contain crowded scenes and severe occlusion, verifying its robustness to occlusion. In addition, its lite version, BMP-Lite, is even faster: it requires only 0.038s to process an image, about 2× faster than CRMH (0.077s), while achieving comparable performance. These results demonstrate both the effectiveness and the efficiency of BMP for estimating body meshes of multiple persons in a single stage.


Method Haggl. Mafia Ultim. Pizza Mean Time[s]
Zanfir et al. [zanfir2018monocular] 140.0 165.9 150.7 156.0 153.4 -
MubyNet [zanfir2018deep] 141.4 152.3 145.0 162.5 150.3 -
CRMH [jiang2020coherent] 129.6 133.5 153.0 156.7 143.2 0.077
BMP-Lite 124.2 138.1 155.2 157.3 143.7 0.038
BMP 120.4 132.7 140.9 147.5 135.4 0.056


Table 3: Results on the Panoptic dataset. We use MPJPE as the evaluation metric. The lower the better. Best in bold.

Another popular 3D pose estimation benchmark is the MuPoTS-3D dataset [singleshotmultiperson2018]. We compare our method against two strong baselines: 1) the combination of OpenPose [cao2018openpose] with single-person mesh recovery methods (SMPLify-X [pavlakos2019expressive] and HMR [hmrKanazawa17]), and 2) the state-of-the-art top-down approach CRMH [jiang2020coherent]. We report the results in Table 4. As we can see, BMP significantly outperforms previous methods on both evaluation protocols.


Method All Matched Time[s]
SMPLify-X [pavlakos2019expressive] 62.84 68.04 6.4
HMR [hmrKanazawa17] 66.09 70.90 0.26
CRMH [jiang2020coherent] 69.12 72.22 0.083
BMP-Lite 68.63 71.92 0.038
BMP 73.83 75.34 0.056


Table 4: Results on MuPoTS-3D. The numbers are 3DPCK. We report the overall accuracy (All), and the accuracy only for person annotations matched to a prediction (Matched). Best in bold.
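
The two protocols in Table 4 can be illustrated with a small sketch. It rests on assumptions that are not stated in this section: a joint counts as correct when its error is at most 150 mm (a commonly used 3DPCK threshold), and under "All" an annotated person with no matched prediction contributes zero correct joints. The helper name `pck_all_and_matched` and the data layout are hypothetical.

```python
import math

def pck_all_and_matched(persons, threshold_mm=150.0):
    """persons: list of (gt_joints, pred_joints) pairs, where pred_joints is
    None for an annotated person without a matched prediction.
    Returns (pck_all, pck_matched) as percentages."""
    correct = total_all = total_matched = 0
    for gt, pred in persons:
        total_all += len(gt)
        if pred is None:
            continue  # assumption: unmatched persons add no correct joints
        total_matched += len(gt)
        correct += sum(1 for p, g in zip(pred, gt)
                       if math.dist(p, g) <= threshold_mm)
    pck_all = 100.0 * correct / total_all
    pck_matched = 100.0 * correct / total_matched if total_matched else 0.0
    return pck_all, pck_matched

# Toy scene: person A is matched (one joint exact, one off by 200 mm),
# person B has no matched prediction.
gt_a = [(0.0, 0.0, 0.0), (0.0, 0.0, 1000.0)]
pred_a = [(0.0, 0.0, 0.0), (0.0, 0.0, 1200.0)]
gt_b = [(500.0, 0.0, 0.0), (500.0, 0.0, 1000.0)]
print(pck_all_and_matched([(gt_a, pred_a), (gt_b, None)]))  # (25.0, 50.0)
```

Under these assumptions "Matched" is never lower than "All", which is consistent with the ordering of the two columns for every method in Table 4.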

Lastly, we compare our BMP model with state-of-the-art approaches on the challenging in-the-wild 3DPW dataset. Some approaches use a self-training strategy (e.g., SPIN [kolotouros2019spin]) or temporal information (e.g., VIBE [kocabas2020vibe]), and they rely on off-the-shelf person detectors [cao2018openpose, redmon2018yolov3]. As shown in Table 5, BMP outperforms CRMH [jiang2020coherent] and SPIN [kolotouros2019spin] in terms of 3DPCK while maintaining attractive efficiency, and achieves results comparable to VIBE [kocabas2020vibe] without relying on any temporal information. Additionally, BMP-Lite obtains roughly the same performance as the state-of-the-art CRMH model while running faster. These results further confirm the effectiveness of our single-stage solution over existing multi-stage strategies, with very competitive efficiency.


Method 3DPCK AUC MPJPE PA-MPJPE PVE Time [s]
SPIN [kolotouros2019spin] 30.8 53.4 99.4 68.1 - 0.31
VIBE [kocabas2020vibe] 33.9 56.6 94.7 66.1 112.7 -
CRMH [jiang2020coherent] 25.8 51.6 105.3 62.3 122.2 0.09
BMP-Lite 26.2 51.3 108.5 64.0 126.2 0.038
BMP 32.1 54.5 104.1 63.8 119.3 0.056


Table 5: Results on 3DPW. We use 3DPCK, AUC, MPJPE, PA-MPJPE and per-vertex error (PVE) as evaluation metrics.

Qualitative results We visualize some body mesh reconstructions of BMP on the challenging PoseTrack, MPII and COCO datasets, as shown in Fig. 3. It can be observed that BMP is robust to severe occlusion and crowded scenes and can reconstruct human bodies with correct depth ordering.

Figure 3: Qualitative results. We visualize the reconstructions of our approach on PoseTrack (1st row), MPII (2nd row) and COCO (3rd row) from different viewpoints: front (green background), top (blue background) and side (red background), respectively. Best viewed in color. Please refer to supplementary for more qualitative results.

4.3 Ablative studies

We conduct ablation analysis on Panoptic, 3DPW and MuPoTS-3D datasets both qualitatively and quantitatively to justify our design choices. The qualitative analysis of the proposed method is illustrated in Fig. 4.

Person instance representation We first evaluate the proposed 3D point-based representation for person instances. The main difference between the proposed representation and previous 2D spatial representations [nie2019spm, zhou2019objects, CenterHMR] is that we use an additional depth dimension to differentiate person instances in the discretized depth space through FPN. We compare BMP with a baseline model (i.e., BMP using the 2D spatial representation). For a fair comparison, we aggregate features from all levels of the FPN pyramid in the baseline model to obtain a single output for both instance localization and body mesh recovery. Specifically, we study three aggregation methods: we resize all feature pyramid levels to 1/8 scale and then aggregate them by 1) element-wise addition (Baseline-Add), 2) concatenation (Baseline-Concat), or 3) applying a convolutional layer after concatenating them (Baseline-Conv). Results are shown in Table 6. BMP improves upon all baseline models by a large margin on all datasets, supporting its efficacy for body mesh recovery. Additionally, from Fig. 4 (1st row), we observe that BMP with the proposed representation is more robust in handling occluding instances, especially when the body centers of multiple instances fall at the same spatial grid coordinate, a case in which the 2D representation usually fails.


Method Panoptic (↓) 3DPW (↓) MuPoTS-3D (↑)
Baseline-Add 159.1 120.4 68.03
Baseline-Concat 150.3 114.6 68.52
Baseline-Conv 145.6 110.8 69.34
BMP 135.4 104.1 73.83


Table 6: Ablation for person instance representation. We report MPJPE for Panoptic and 3DPW, and 3DPCK for MuPoTS-3D.
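
The three baseline aggregation schemes can be sketched on toy feature maps as follows. This is only an illustration under stated assumptions: feature maps are nested lists indexed as [channel][row][column], resizing uses nearest-neighbour sampling, and the convolution is taken to be 1x1 with hand-picked weights; the section does not specify the interpolation method or the kernel size. All function names are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a [C][H][W] nested-list feature map."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [[[ch[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
             for y in range(out_h)] for ch in fmap]

def aggregate_add(maps):
    """Baseline-Add style: element-wise sum of same-shape feature maps."""
    c, h, w = len(maps[0]), len(maps[0][0]), len(maps[0][0][0])
    return [[[sum(m[k][y][x] for m in maps) for x in range(w)]
             for y in range(h)] for k in range(c)]

def aggregate_concat(maps):
    """Baseline-Concat style: stack all maps along the channel axis."""
    return [ch for m in maps for ch in m]

def aggregate_conv1x1(maps, weights):
    """Baseline-Conv style: concatenate, then mix channels with a 1x1
    convolution. weights has shape [C_out][C_in] (assumed, no bias)."""
    cat = aggregate_concat(maps)
    h, w = len(cat[0]), len(cat[0][0])
    return [[[sum(wt * cat[i][y][x] for i, wt in enumerate(row))
              for x in range(w)] for y in range(h)] for row in weights]

# Two toy pyramid levels with one channel each; the coarse level is resized
# to the resolution of the fine level before aggregation.
fine = [[[1, 2], [3, 4]]]
coarse = resize_nearest([[[10]]], 2, 2)
print(aggregate_add([fine, coarse]))       # [[[11, 12], [13, 14]]]
print(len(aggregate_concat([fine, coarse])))  # 2 channels
print(aggregate_conv1x1([fine, coarse], [[0.5, 0.5]]))
```

In all three variants the pyramid levels are collapsed into one map, so two people whose centers fall on the same grid cell compete for a single location, which is the failure case the text attributes to the 2D representation.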

Ordinal depth loss To investigate whether the ordinal depth loss helps produce more coherent results with correct depth ordering, we conduct experiments on the MuPoTS-3D dataset. Specifically, we evaluate the ordinal depth relations of all instance pairs in the scene and report the percentage of correctly estimated ordinal depth relations in Table 7. The model trained with the ordinal depth loss significantly improves upon the baseline (BMP trained without it), from 91.42% to 94.50%. Such improvements can also be observed in Fig. 4 (2nd row). Additionally, comparing our method with Moon et al. [moon2019camera] and CRMH [jiang2020coherent], we observe that BMP achieves higher relative depth ordering accuracy than CRMH, which only considers the ordinal loss for overlapped pairs (94.50% vs. 93.68%). This suggests that our full pair-wise ordinal loss provides more comprehensive supervision on the depth layout of the scene and thus trains the model to give more coherent results.


Method Moon et al. [moon2019camera] CRMH [jiang2020coherent] BMP (w/o ordinal depth loss) BMP
Accuracy 90.85% 93.68% 91.42% 94.50%


Table 7: Ablation for ordinal depth loss. Relative depth ordering results on MuPoTS-3D are shown. We evaluate the ordinal depth relations of all instance pairs in the scene and report the percentage of correctly estimated ordinal depth relations.
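
The evaluation described in the caption can be sketched as below: for every pair of instances in a scene, the predicted depth relation is compared with the ground-truth one. This is a minimal sketch under assumptions; the function name, the per-scene scope, and the optional tolerance `tol` for treating two depths as equal are illustrative and not taken from the paper.

```python
from itertools import combinations

def depth_relation(a, b, tol=0.0):
    """Return -1 if a is closer than b, 1 if farther, 0 if within tol."""
    if a - b > tol:
        return 1
    if b - a > tol:
        return -1
    return 0

def ordinal_depth_accuracy(pred_depths, gt_depths, tol=0.0):
    """Percentage of instance pairs whose predicted ordinal depth relation
    matches the ground-truth relation."""
    pairs = list(combinations(range(len(gt_depths)), 2))
    if not pairs:
        raise ValueError("need at least two instances to form a pair")
    correct = sum(
        1 for i, j in pairs
        if depth_relation(pred_depths[i], pred_depths[j], tol)
        == depth_relation(gt_depths[i], gt_depths[j], tol)
    )
    return 100.0 * correct / len(pairs)

# Toy scene with three people (depths in metres, made-up values): the
# prediction swaps the order of the second and third person.
gt = [2.0, 3.5, 5.0]
pred = [2.2, 4.9, 4.1]
print(round(ordinal_depth_accuracy(pred, gt), 2))  # 66.67: 2 of 3 pairs correct
```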

Keypoint-aware occlusion augmentation Finally, we study the impact of the proposed keypoint-aware occlusion augmentation strategy. In Table 8, we compare our BMP model with models trained without occlusion augmentation (BMP-NoAug) and with random synthetic occlusion [sarandi2018robust] (BMP-RandOcc). BMP outperforms both of them by a large margin on all datasets. Notably, it brings 9.1% and 17.3% improvements over BMP-NoAug on the Panoptic and 3DPW datasets respectively, which feature crowded scenes with severe overlap and partial observation. In contrast, the random augmentation hurts model performance on MuPoTS-3D (70.78 vs. 71.71 3DPCK for BMP-NoAug). This supports the view that our proposed occlusion augmentation forces the model to focus on body structure and thus improves its robustness to occlusion.


Method Panoptic (↓) 3DPW (↓) MuPoTS-3D (↑)
BMP-NoAug 148.9 125.9 71.71
BMP-RandOcc 144.6 110.3 70.78
BMP 135.4 104.1 73.83


Table 8: Ablation for occlusion augmentation. We report MPJPE for Panoptic and 3DPW, and 3DPCK for MuPoTS-3D.
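
The relative improvements quoted for this ablation follow directly from the Table 8 entries. The short check below assumes the percentages are relative error reductions with respect to BMP-NoAug; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, value):
    """Relative reduction of an error metric in percent (lower is better)."""
    return 100.0 * (baseline - value) / baseline

# MPJPE values from Table 8: BMP-NoAug vs. BMP.
print(round(relative_reduction(148.9, 135.4), 1))  # 9.1 (Panoptic)
print(round(relative_reduction(125.9, 104.1), 1))  # 17.3 (3DPW)

# 3DPCK on MuPoTS-3D is higher-is-better, so the change is reported in points:
# BMP-RandOcc minus BMP-NoAug is negative, i.e. random occlusion hurts here.
print(round(70.78 - 71.71, 2))  # -0.93
```
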
Figure 4: Qualitative effect of the proposed method. Results of baseline 1 (BMP using the 2D representation) (middle, 1st row), baseline 2 (BMP trained w/o the ordinal depth loss) (middle, 2nd row) and BMP (right). Errors are highlighted by black arrows. As expected, the proposed methods produce better results (e.g., robustness to overlapping instances and more consistent depth ordering of the estimated body meshes).

5 Conclusions

In this work, we present the first single-stage model, Body Meshes as Points (BMP), for multi-person body mesh recovery. BMP introduces a new representation to enable such a compact pipeline: each person instance is represented as a point in the spatial-depth space, which is associated with a parameterized body mesh. With such a representation, BMP can fully exploit shared features and perform person localization and body mesh recovery simultaneously. BMP significantly improves upon conventional two-stage paradigms and offers outstanding efficiency and accuracy, as validated by extensive experiments on multiple benchmarks. In addition, BMP develops several new techniques to further improve the coherence and robustness of recovered body meshes, which are of broad interest for other applications such as human pose estimation and detection. In the future, we will explore how to make the model more compact and further improve its efficiency, as well as extend it to model inter-person interactions.

Acknowledgement This research was partially supported by AISG-100E-2019-035, MOE2017-T2-2-151, NUS_ECRA_FY17_P08 and CRP20-2017-0006.