1 Introduction
3D human body mesh recovery aims to reconstruct the 3D full-body mesh of each person instance from images or videos. As a fundamental yet challenging task, it has been widely applied to action recognition [varol2019synthetic], virtual try-on [mir2020learning], motion retargeting [liu2019liquid], etc. With the recent notable progress in single-person full-body mesh recovery [hmrKanazawa17, kolotouros2019convolutional, arnab2019exploiting, kocabas2020vibe], a more realistic and challenging setting has attracted increasing attention: estimating body meshes for multiple persons from a single image.
Existing methods for multi-person mesh recovery are mainly two-stage solutions, including top-down [jiang2020coherent] and bottom-up [zanfir2018monocular] approaches. The former first localizes person instances via a person detector and then recovers the 3D meshes individually; the bottom-up approach first estimates person keypoints, and then jointly reconstructs multiple 3D human bodies in the image via constrained optimization [zanfir2018monocular]. Though notably accurate, both paradigms are inefficient due to computational redundancy. The former estimates the body mesh for each person separately, so the total computation cost grows linearly with the number of persons in the image, while the latter requires grouping keypoints into corresponding persons and inferring body meshes iteratively, leading to high computational cost.
Targeting a more efficient and compact pipeline, we explore a single-stage solution. Despite the recent popularity and promising performance of single-stage methods on 2D keypoint estimation [nie2019spm] and object detection [zhou2019objects, tian2019fcos], a single-stage pipeline for multi-person mesh recovery is barely explored, as it remains unclear how to effectively integrate person localization and mesh recovery within a single stage. In this work, we propose a new instance representation for multi-person body mesh recovery that represents person instances as points in the spatial-depth space, where each point is associated with one body mesh. Such a representation allows effective parallelism of person localization and body mesh recovery. Based on it, we develop a new model architecture that exploits shareable features for both localization and mesh recovery and thus achieves a single-stage solution.
In particular, the model has two parallel branches, one for instance localization and the other for body mesh recovery. In the localization branch, we model each person instance as a single point in a 3-dimensional space, spatial (2D) plus depth (1D), where each localized point (detected person) is associated with a body mesh in the mesh branch, represented by the SMPL parametric model [loper2015smpl]. This in turn converts multi-person mesh recovery into a single-shot regression problem (Fig. 1). Specifically, the spatial location is represented by the discrete coordinates of regular grids over the input image. Similarly, we discretize depth into several levels to obtain the depth representation. To learn feature representations that better differentiate instances at different depths, motivated by the observation that a person closer to the camera tends to appear larger in the image, we adopt the feature pyramid network (FPN) [lin2017feature] to extract multi-scale features and use features from the lower pyramid scales to represent the closer (and larger) instances. In this way, each instance is represented as one point, whose associated features (extracted from its corresponding spatial location and FPN scale) are used to effectively estimate its body mesh. We name this method Body Meshes as Points (BMP).
Applying the BMP model to estimate multi-person body meshes simultaneously faces two challenges in realistic scenarios: how to coherently reconstruct instances with correct depth ordering, and how to handle the common occlusion issue (e.g., overlapping instances and partial observations). For the first challenge, we consider explicitly using the ordinal depth relations among all the persons in the scene to supervise the model to output body meshes with correct depth order. However, obtaining such ordinal relations is non-trivial for scenes captured in the wild, since no 3D annotation is available. Inspired by the recent success of depth estimation for human body joints [moon2019camera, zhen2020smap], we propose to take the depth of each person (center point), predicted by a model pre-trained on 3D datasets with depth annotations, as the pseudo ordinal relation for model training on in-the-wild data, which is experimentally proved beneficial to depth-coherent body mesh reconstruction.
Also, to tackle the common occlusion issue, we propose a novel keypoint-aware occlusion augmentation strategy to improve model robustness to occluded person instances. Different from the previous method [sarandi2018robust] that randomly simulates occlusion in images, we generate synthetic occlusion based on the positions of skeleton keypoints. Such keypoint-aware occlusion explicitly forces the model to focus on body structure, making it more robust to occlusion.
Comprehensive experiments on the 3D pose benchmarks Panoptic [joo2015panoptic], MuPoTS-3D [singleshotmultiperson2018] and 3DPW [vonMarcard2018] clearly demonstrate the high efficiency of the proposed model. Moreover, it achieves new state-of-the-art results on the Panoptic and MuPoTS-3D datasets, and competitive performance on the 3DPW dataset. Our contributions are summarized as follows: 1) To the best of our knowledge, we are among the first to explore a single-stage solution to multi-person mesh recovery. We introduce a new person instance representation that enables simultaneous person localization and body mesh recovery for all person instances in an image within a single stage, and design a novel model architecture accordingly. 2) We propose a simple yet effective inter-instance ordinal relation supervision to encourage depth-coherent reconstruction. 3) We propose a keypoint-aware occlusion augmentation strategy that takes the body structure into consideration, to improve model robustness to occlusion.
2 Related Work
Single-person 3D pose and shape Previous works estimate 3D poses in the form of a body skeleton [martinez2017simple, mehta2017vnect, tome2017lifting, zhou2017towards, popa2017deep, pavlakos2018ordinal, sun2018integral, zhang2020inference, gong2021poseaug] or a non-parametric 3D shape [gabeur2019moulding, smith2019facsimile, varol2018bodynet]. In this work, we use a 3D mesh to represent the full-body pose and shape, and adopt the SMPL parametric model [loper2015smpl] for body mesh recovery. In the literature, Bogo et al. [bogo2016keep] proposed SMPLify, the first optimization-based method, which iteratively fits SMPL to detected 2D joints. Later works extend SMPLify by either using denser reference points than sparse keypoints, such as silhouettes and voxel occupancy grids, for SMPL fitting [lassner2017unite, varol2018bodynet], or by fitting a more expressive model than SMPL (e.g., SMPL-X) [pavlakos2019expressive]. Some recent works directly regress the SMPL parameters from images via deep neural networks in a two-stage manner: they first estimate an intermediate representation (e.g., keypoints, silhouettes) from images and then map it to SMPL parameters [pavlakos2018learning, omran2018neural, tung2017self, kolotouros2019convolutional]. Others directly estimate SMPL parameters from images, either using complex model training strategies [hmrKanazawa17, guler2019holopose] or leveraging temporal information [arnab2019exploiting, kocabas2020vibe]. Although high accuracy is achieved in single-person cases, it remains unclear how to extend these methods to the more general multi-person setting.
Multi-person 3D pose and shape For multi-person 3D pose estimation, most existing methods adopt a top-down paradigm [rogez2017lcr, dabral2018learning, rogez2019lcr]: they first detect each person instance and then regress the locations of the body joints. Follow-up improvements are made by estimating additional absolute depth [moon2019camera], considering multi-person interaction [guo2020pi, li2020hmor] or extending to whole-body pose estimation [weinzaepfel2020dope]. Alternatively, some approaches explore the bottom-up paradigm. SSMP3D [mehta2018single] and SMAP [zhen2020smap] estimate 3D poses from occlusion-aware pose maps and use Part Affinity Fields [cao2017realtime] to infer their association. LoCO [fabbri2020compressed] maps the image to volumetric heatmaps and then estimates multi-person 3D poses from them with an encoder-decoder framework. PandaNet [benzine2020pandanet] is an anchor-based model where 3D poses are regressed at each anchor position.
In contrast to the prosperity of multi-person 3D pose estimation, only a limited number of works are devoted to body mesh recovery for multiple people. Zanfir et al. [zanfir2018monocular] first estimate 3D joints of persons in the image and then optimize their 3D shapes jointly with multiple constraints. They also propose a two-stage regression-based scheme that first estimates 3D joints for all persons and then regresses their 3D shapes based on these joints [zanfir2018deep]. Instead of regressing SMPL parameters from an intermediate representation (e.g., 3D joints), Jiang et al. [jiang2020coherent] attach an SMPL head to the Faster R-CNN framework [ren2015faster] to estimate SMPL parameters directly from the input image in a top-down manner. Despite the encouraging results, these methods are based on indirect multi-stage frameworks and suffer from low efficiency. Different from all previous methods that rely on a multi-stage pipeline with computational redundancy, our method unifies person localization and body mesh recovery, and enables a box-free and (ad hoc) optimization-free single-stage solution to multi-person body mesh recovery.
Point-based representation Point-based methods [duan2019centernet, zhou2019objects, tian2019fcos] represent instances by a single point at their center. This approach is regarded as a simple replacement of the anchor-based representation, and has been widely used in many tasks, including object detection [duan2019centernet, zhou2019objects, tian2019fcos], 2D keypoint estimation [nie2019spm] and instance segmentation [wang2019solo]. However, these methods cannot be directly applied to body mesh recovery. In this work, we extend the point-based representation to multi-person body mesh recovery. A concurrent work [CenterHMR] adopts a similar solution to body mesh recovery. Our model differs from it in two significant aspects: 1) BMP aims at more coherent reconstruction of persons in the scene. It handles challenging spatial arrangements and occlusion by exploiting the ordinal depth loss and the keypoint-aware augmentation strategy, which are not considered in [CenterHMR]. 2) BMP adopts a novel 3D point-based representation to differentiate instances at different depths, and thus is more robust to overlapping instances; whereas [CenterHMR] uses only a 2D representation and would fail in such cases.
3 Body Meshes as Points
3.1 Proposed single-stage solution
Given an image $I$, multi-person body mesh recovery targets recovering the body meshes of all person instances in $I$. Existing approaches [zanfir2018deep, zanfir2018monocular, jiang2020coherent] solve this task by sequentially localizing persons and estimating their body meshes in a multi-stage manner, leading to computational redundancy. Differently, this work aims to unify instance localization and body mesh recovery into a single-stage solution to enable a more efficient and concise framework.
We represent each person instance as a single point in a 3-dimensional space (spanned by 2D spatial and 1D depth dimensions). By dividing the input image uniformly into $S \times S$ grids, the spatial dimension can be easily represented within such a grid coordinate: if the body center of a person falls into grid cell $(i, j)$, it is assigned the spatial coordinate $(i, j)$. Similarly, for the depth dimension, we discretize the depth value into $D$ levels and obtain a level index for each instance according to its depth. Such a discretized depth value is beneficial for handling occluding instances, especially when the body centers of multiple instances fall into the same spatial grid cell.
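As a minimal sketch of this discretization: the spatial assignment below follows the uniform grid described above, while the uniform depth binning is purely illustrative (the paper realizes depth levels through FPN scales rather than metric bins):

```python
def assign_cell(center_xy, img_wh, S):
    """Map a body center (pixels) to its (i, j) coordinate on an S x S grid."""
    x, y = center_xy
    w, h = img_wh
    i = min(int(x / w * S), S - 1)  # column index
    j = min(int(y / h * S), S - 1)  # row index
    return i, j

def assign_depth_level(depth, d_min, d_max, D):
    """Illustrative uniform binning of a metric depth into one of D levels."""
    t = (depth - d_min) / (d_max - d_min)
    return min(int(t * D), D - 1)
```

For example, a person centered in an 832x512 image on a 16x16 grid lands in the middle cell (8, 8).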
Given this representation, we reformulate multiperson mesh recovery as two simultaneous prediction tasks: 1) instance localization and 2) body mesh recovery.
Instance localization For the first task, we employ the instance map $\mathbf{C} \in \mathbb{R}^{S \times S \times D}$ to locate each person instance in the image, where $S$ denotes the number of grid cells along one side and $D$ refers to the total number of depth levels. For each depth level, the network is trained to regress a scalar indicating the probability of every grid cell containing a person.
To construct the ground truth (GT) for training, we first determine the depth level for each instance. We observe that a person tends to appear larger (smaller) in the image when standing closer to (farther from) the camera; in other words, the depth of an instance is roughly inversely proportional to its scale. Inspired by this, we employ a Feature Pyramid Network (FPN) [lin2017feature] with $D$ pyramid levels to capture different scales, each of which is used to represent instances at the corresponding depth. More specifically, for each instance, we compute its scale from the GT body size and associate it with the corresponding pyramid level $P_l$ according to Table 1.


Pyramid         P2    P3      P4      P5       P6
Stride          8     8       16      32       32
Grid number     40    36      24      16       12
Instance scale  <64   32-128  64-256  128-512  >256

Next, we locate the grid cells in $\mathbf{C}$ where the central region of that person lies. Inspired by [zhou2019objects, duan2019centernet], the central region is defined as follows: given the GT body center $(c_x, c_y)$, the body size $(w, h)$ of each person, and a controllable scale factor $\lambda$, the central region is the box centered at $(c_x, c_y)$ with size $(\lambda w, \lambda h)$. In this work, we set the position of the pelvis as the body center. Once identified, the corresponding grid cell at the $l$-th pyramid level is labeled as positive (label 1). The above steps are repeated for all instances in the image.
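The GT construction above (scale-to-level assignment per Table 1, then central-region labeling) can be sketched as follows. The pyramid-level names, the multi-level assignment for overlapping scale ranges, and the value of the scale factor are our assumptions:

```python
import numpy as np

# Scale ranges per pyramid level, following Table 1 (values in pixels).
# Overlapping ranges mean one instance may be assigned to adjacent levels.
SCALE_RANGES = {"P2": (0, 64), "P3": (32, 128), "P4": (64, 256),
                "P5": (128, 512), "P6": (256, float("inf"))}

def assign_pyramid_levels(scale):
    """Return every pyramid level whose scale range contains the instance scale."""
    return [lvl for lvl, (lo, hi) in SCALE_RANGES.items() if lo <= scale <= hi]

def label_central_region(inst_map, center, body_size, img_wh, lam=0.2):
    """Mark as positive (1) every cell of an S x S instance map whose center
    falls inside the central region: a box centred at the body center (the
    pelvis) with size (lam*w, lam*h). lam's value here is an assumption."""
    S = inst_map.shape[0]
    (cx, cy), (w, h), (W, H) = center, body_size, img_wh
    x0, x1 = cx - lam * w / 2, cx + lam * w / 2
    y0, y1 = cy - lam * h / 2, cy + lam * h / 2
    for j in range(S):
        for i in range(S):
            gx, gy = (i + 0.5) * W / S, (j + 0.5) * H / S  # cell center in pixels
            if x0 <= gx <= x1 and y0 <= gy <= y1:
                inst_map[j, i] = 1.0
    return inst_map
```

A person of scale 48, for instance, falls into the overlap of the P2 and P3 ranges and would be assigned to both levels.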
Body mesh representation In parallel with instance localization, we use the body mesh map $\mathbf{M} \in \mathbb{R}^{S \times S \times E}$ for body mesh recovery, where $E$ is the dimension of the body mesh representation. Concretely, given a positive response in $\mathbf{C}$ that indicates the presence of a person, we regress the body mesh representation using the features from the corresponding grid cell, as shown in Fig. 2. In this work, we use the SMPL parametric model [loper2015smpl] for body mesh representation, which renders a body mesh from pose parameters $\theta$ and shape parameters $\beta \in \mathbb{R}^{10}$. To improve training stability, we adopt the 6D rotation representation [zhou2019continuity] for the pose parameters, giving $\theta \in \mathbb{R}^{24 \times 6}$. The body mesh map also predicts camera parameters $\pi \in \mathbb{R}^{3}$ for projecting body joints from 3D back to 2D space, which enables training on in-the-wild 2D pose datasets [johnson2010clustered, lin2014microsoft, andriluka20142d] to improve model generalization [hmrKanazawa17]. We further introduce a scalar confidence score, defined as the OKS [girdhar2018detect] between the projected and GT 2D keypoints, to reflect the confidence level of the SMPL prediction, as well as an absolute depth variable for the corresponding person instance that will be used for penalizing body mesh estimations with incoherent depth ordering (see Sec. 3.2 for details). Therefore, the total channel number $E$ of the body mesh map is 159.
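Given the channel counts stated above (24 joint rotations in 6D form = 144, plus 10 shape, 3 camera, 1 confidence, 1 depth = 159), a per-cell prediction vector can be unpacked as below; the channel ordering is our assumption:

```python
import numpy as np

def split_mesh_vector(v):
    """Split a 159-D per-cell prediction into its SMPL and auxiliary parts.
    The ordering (pose, shape, camera, score, depth) is an assumption."""
    assert v.shape[-1] == 159
    pose_6d = v[..., :144].reshape(*v.shape[:-1], 24, 6)  # per-joint 6D rotation
    shape = v[..., 144:154]    # SMPL betas
    camera = v[..., 154:157]   # weak-perspective (scale, tx, ty)
    score = v[..., 157]        # OKS confidence
    depth = v[..., 158]        # absolute depth of the body center
    return pose_6d, shape, camera, score, depth
```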
Network architecture We employ ResNet-50 [He_2016_CVPR] as our backbone. An FPN is built on top of the backbone to extract a pyramid of feature maps (256-d). To perform body mesh recovery, we attach two task-specific heads to each level of the feature pyramid, one for instance localization and the other for the corresponding body mesh recovery, responsible for producing the instance map $\mathbf{C}$ and the body mesh map $\mathbf{M}$, respectively. As shown in Fig. 2, each head consists of 7 stacked convolutions and one task-specific prediction layer. However, directly estimating the camera parameters from the whole image is non-trivial since they are sensitive to instance position. Inspired by CoordConv [liu2018intriguing], we concatenate normalized pixel coordinates to the input feature map at the beginning of the mesh recovery head to encode position information into the network for better camera parameter estimation. Additionally, Group Normalization [wu2018group] is used in both prediction heads to facilitate model training. To match the feature maps to the grid resolution, we apply bilinear interpolation before the instance localization and body mesh recovery branches.
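The CoordConv-style concatenation can be sketched as follows, using NumPy as a stand-in for the actual convolutional-head input; the [-1, 1] normalization range is an assumption:

```python
import numpy as np

def add_coord_channels(feat):
    """Concatenate normalized x/y coordinate channels to a (C, H, W)
    feature map, in the spirit of CoordConv."""
    C, H, W = feat.shape
    ys = np.linspace(-1.0, 1.0, H)[:, None].repeat(W, axis=1)  # (H, W)
    xs = np.linspace(-1.0, 1.0, W)[None, :].repeat(H, axis=0)  # (H, W)
    return np.concatenate([feat, xs[None], ys[None]], axis=0)  # (C+2, H, W)
```

The two extra channels give the head explicit access to where each cell sits in the image, which is what makes a position-sensitive quantity like the camera translation learnable from shared features.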
3.2 Inter-instance ordinal depth supervision
Multi-person body mesh recovery is inherently ill-posed, as multiple 3D predictions can correspond to the same 2D projection. Without priors, the trained model may therefore produce ambiguous body mesh estimations with incorrect depth order. To alleviate this problem, we use the ordinal depth relations among all persons in the input as supervision to guide reasoning about depth ordering during training.
More concretely, given any two persons $(i, j)$ in the image, we define the ordinal depth relation between them as $R(i, j)$, taking the value:

$R(i, j) = \begin{cases} 1, & D_i - D_j > T \\ -1, & D_j - D_i > T \\ 0, & \text{otherwise} \end{cases}$    (1)
where $D_i$ denotes the depth of person $i$ and $T$ is a predefined threshold to determine the ordinal relation. The relation $R(i, j) = 0$ means both instances are at roughly the same depth; otherwise one of them is closer to the camera than the other. With the ordinal relation of $(i, j)$, we define the ordinal depth loss for this pair as
$\ell_{ord}(i, j) = \begin{cases} \log\left(1 + \exp(-R(i, j)\,(d_i - d_j))\right), & R(i, j) \neq 0 \\ (d_i - d_j)^2, & R(i, j) = 0 \end{cases}$    (2)
where $d_i$ denotes the $i$-th person's body mesh depth, calculated from the predicted camera parameters with focal length $f$, scale $s_i$ and the image's long-edge width $W$. The ordinal depth loss enforces a large margin between $d_i$ and $d_j$ if $R(i, j) \neq 0$, i.e., one of them is measured as closer than the other, and otherwise enforces them to be equal.
In practice, however, such ordinal depth relations are rarely available for scenes captured in the wild due to the lack of 3D annotations. To solve this issue, we propose to use pseudo ordinal relations for model training on in-the-wild data. Specifically, we first train the model on 3D datasets [ionescu2014human3, singleshotmultiperson2018] with depth annotations to learn to estimate the depth of each person in the image. We define the depth of each person as the depth of the body center (i.e., the pelvis joint). The model is trained by minimizing a depth loss $L_{depth}$, defined as the mean squared error (MSE) between the predicted and GT depths. After that, given unlabeled data, we leverage the pre-trained model to estimate the depths, which are then used to obtain the pseudo ordinal relations for all people in the image. Finally, given the pseudo ordinal relations, we adopt an OKS-score-weighted ordinal depth loss to supervise model training on in-the-wild images. The total loss for an image is computed as the average loss over all instance pairs:
$L_{ord} = \frac{1}{N_p} \sum_{i \neq j} s_i\, s_j\, \ell_{ord}(i, j)$    (3)
where $N_p$ denotes the number of instance pairs in the image and $s_i$ denotes the OKS score of the $i$-th person. Intuitively, training the model with such inter-instance ordinal depth supervision helps it build a global understanding of the depth layout of the input scene and thus ensures more coherent reconstructions.
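A small NumPy sketch of this supervision: relations come from pseudo depths with a threshold, ordered pairs get a margin-style penalty, same-depth pairs a squared difference, and each pair is OKS-weighted. The exact loss form, the multiplicative weighting, and the threshold value are our assumptions:

```python
import numpy as np

def ordinal_relation(Di, Dj, T=0.3):
    """Pseudo GT relation from estimated depths: 1 if person i is farther,
    -1 if person j is farther, 0 if within threshold T (value assumed)."""
    if Di - Dj > T:
        return 1
    if Dj - Di > T:
        return -1
    return 0

def pair_loss(di, dj, r):
    """Margin-style penalty when an ordering is given; squared difference
    when the pair is at roughly the same depth."""
    if r == 0:
        return (di - dj) ** 2
    return np.log1p(np.exp(-r * (di - dj)))

def ordinal_depth_loss(depths, pseudo_depths, oks, T=0.3):
    """OKS-weighted average over all instance pairs in one image."""
    n = len(depths)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    total = sum(oks[i] * oks[j] *
                pair_loss(depths[i], depths[j],
                          ordinal_relation(pseudo_depths[i], pseudo_depths[j], T))
                for i, j in pairs)
    return total / len(pairs)
```

Predictions that preserve the pseudo ordering incur a small loss, while a swapped ordering is penalized heavily, which is what pushes the model toward depth-coherent reconstructions.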
3.3 Keypoint-aware occlusion augmentation
SMPL-based body mesh recovery is highly sensitive to (partial) occlusion (e.g., overlapping persons, truncation) [zhang2020object, rockwell2020full]. To improve model robustness to occlusion without requiring extra training data or annotations, we propose a keypoint-aware occlusion augmentation strategy for the training process. The proposed strategy generates synthetic occlusion to simulate real challenging cases, such as partial observation, during training. Previous work [sarandi2018robust] randomly places occluders on the images, which may produce easy training samples that are less helpful for boosting model performance. In contrast, our method generates synthetic occlusion based on the positions of skeleton keypoints, which forces the model to pay more attention to the body structure and leads to a notable improvement. More concretely, given a set of keypoints of a person in the image, we first randomly choose a keypoint. Then we randomly sample a non-human object from the PASCAL VOC [everingham2011pascal] dataset and composite it at the location of the selected keypoint. Before compositing, we randomly resize the sampled object in proportion to the area of that person. Additionally, we randomly shift the keypoint position by a small offset to avoid overfitting. During training, we apply the occlusion augmentation with probability 0.5.
3.4 Training and inference
Training For training the proposed BMP model, we define the loss function as follows:

$L = L_{loc} + L_{mesh} + L_{depth}$    (4)
where $L_{loc}$ is a modified two-class Focal Loss [lin2017focal] for instance localization, $L_{depth}$ is the depth loss (Sec. 3.2), and $L_{mesh}$ is the loss for body mesh estimation. The training details of the body mesh branch are similar to those in HMR [hmrKanazawa17]. Specifically, we formulate $L_{mesh}$ as
$L_{mesh} = \lambda_{\theta} L_{\theta} + \lambda_{\beta} L_{\beta} + \lambda_{3D} L_{3D} + \lambda_{v} L_{v} + L_{2D} + L_{conf} + L_{adv}$    (5)
Here $L_{\theta}$, $L_{\beta}$, $L_{3D}$, $L_{v}$ denote the MSE between the predicted and GT pose and shape parameters, as well as 3D keypoints and vertices, respectively. $L_{2D}$ is the 2D keypoint loss that minimizes the distance between the 2D projection of the 3D keypoints and the GT 2D keypoints. $L_{conf}$ is the MSE between the predicted and GT confidences, where the GT confidence is computed as the OKS [girdhar2018detect] between the projected and GT 2D keypoints. Moreover, we use a discriminator and apply an adversarial loss $L_{adv}$ on the regressed pose and shape parameters, to encourage the outputs to lie on the distribution of real human bodies. $\lambda_{\theta}$, $\lambda_{\beta}$, $\lambda_{3D}$, and $\lambda_{v}$ are the weights of the corresponding loss terms. The loss is applied independently to each positive grid cell. The ordinal depth loss in Eqn. (3) is adopted when the image contains more than one instance.
Inference The overall inference procedure of BMP is illustrated in Fig. 2. Given an image, BMP first obtains the instance map $\mathbf{C}$ and the body mesh map $\mathbf{M}$ from the prediction heads. Then it performs a max pooling operation to find the local maxima of $\mathbf{C}$, yielding center point positions $\{(l_n, x_n, y_n)\}_{n=1}^{N}$, where $l_n$ and $(x_n, y_n)$ denote the pyramid level and body center location of the $n$-th person, respectively, and $N$ is the number of estimated persons. After that, BMP extracts the body mesh parameters of each person from $\mathbf{M}$ at the corresponding location. Finally, BMP outputs body mesh estimations by deforming the SMPL model with the predicted parameters. A keypoint-based NMS [girdhar2018detect] is applied to remove redundant predictions if they exist; we take the product of the predicted OKS score and the probability score from the instance map as the confidence score for NMS.
3.5 Implementation details
We implement BMP with PyTorch [paszke2017automatic] and the mmdetection library [mmdetection], and use Rectified Adam [liu2019variance] as the optimizer. We resize all images to 832x512 while keeping the aspect ratio, following the original COCO training scheme [su2019multi, wang2019solo, jiang2020coherent]. During training, we augment the samples with horizontal flipping and keypoint-aware occlusion (Sec. 3.3). Flip augmentation is also used during testing. Moreover, since BMP directly extracts image-level features for estimation instead of features from cropped bounding boxes, it can take images of smaller resolution (512x512) as input. We denote this setting as BMP-Lite; other training and testing settings are identical between BMP-Lite and BMP. Please refer to the supplementary material for more details.
4 Experiments
In this section, we aim to answer the following questions. 1) Can BMP provide both efficient and accurate multi-person mesh recovery? 2) Can BMP give coherent meshes for multiple persons with correct depth ordering? 3) Is BMP robust to cases where person instances are occluded or partially observed? To this end, we conduct extensive experiments on several large-scale benchmarks.
4.1 Datasets
Human3.6M [ionescu2014human3] is the most widely used single-person 3D pose benchmark, collected in an indoor environment. It contains 3.6 million 3D poses and corresponding videos of 15 subjects. Due to its high-quality annotations, we follow [jiang2020coherent] and use it for both training and testing.
Panoptic [joo2015panoptic] is a large-scale dataset captured in the Panoptic Studio, offering 3D pose annotations for multiple people engaged in diverse social activities. We use it for evaluation with the same protocol as [zanfir2018monocular].
MuPoTS-3D [singleshotmultiperson2018] is a multi-person dataset with 3D pose annotations for both indoor and in-the-wild scenes. We follow [singleshotmultiperson2018] and use it for evaluation.
3DPW [vonMarcard2018] is a multi-person in-the-wild dataset featuring diverse motions and scenes. It contains 60 video sequences (24 train, 24 test, 12 validation) with full-body mesh annotations. To verify the generalizability of the proposed model to challenging in-the-wild scenarios, we use its test set for evaluation, following the same protocol as [kocabas2020vibe].
MPI-INF-3DHP [mehta2017vnect] is a single-person multi-view 3D pose dataset. It contains 8 actors performing 8 activities, captured by 14 cameras. Mehta et al. [singleshotmultiperson2018] generate a multi-person dataset called MuCo-3DHP from MPI-INF-3DHP by compositing segmented foreground human appearances. We use both datasets for training.
COCO [lin2014microsoft], LSP [johnson2010clustered], LSP Extended [Johnson11], PoseTrack [Andriluka_2018_CVPR] and MPII [andriluka20142d] are in-the-wild datasets with 2D joint annotations. We use them for training with the weakly-supervised training strategy [hmrKanazawa17] (Eqn. (5)).
4.2 Comparison with the state of the art
Single-person setting We first evaluate the proposed BMP model in the single-person setting, to validate that BMP's factorization of instance localization and mesh recovery does not sacrifice performance. Concretely, we evaluate BMP on the large-scale Human3.6M dataset and compare against the most competitive approaches [hmrKanazawa17, jiang2020coherent] that share a similar regression target and learning strategy. The results are shown in Table 2: BMP outperforms all these methods.


Method  HMR [hmrKanazawa17]  CRMH [jiang2020coherent]  BMP 
PA-MPJPE  56.8  52.7  51.3 

Multi-person settings We then evaluate BMP on multi-person body mesh recovery. We first evaluate it on the multi-person dataset captured in the indoor Panoptic Studio [joo2015panoptic] and compare with the most competitive approaches [zanfir2018monocular, zanfir2018deep, jiang2020coherent]. As shown in Table 3, BMP achieves the best performance in all scenarios. Overall, it improves upon the state-of-the-art top-down model CRMH [jiang2020coherent] by 5.4% (135.4 mm vs. 143.2 mm MPJPE), while offering a faster inference speed (we count per-image inference time in seconds; for all methods, time is measured on a Tesla P100 GPU and an Intel E5-2650 v2 @ 2.60GHz CPU, without test-time augmentation). Moreover, it significantly outperforms CRMH on the Ultimatum and Pizza scenarios, which feature crowded scenes and severe occlusion, verifying its robustness to occlusion. In addition, its lite version, BMP-Lite, is even faster, requiring only 0.038s per image, about 2x faster than CRMH, while achieving comparable performance. These results demonstrate both the effectiveness and the efficiency of BMP for estimating body meshes of multiple persons in a single stage.


Method  Haggl.  Mafia  Ultim.  Pizza  Mean  Time[s] 
Zanfir et al. [zanfir2018monocular]  140.0  165.9  150.7  156.0  153.4   
MubyNet [zanfir2018deep]  141.4  152.3  145.0  162.5  150.3   
CRMH [jiang2020coherent]  129.6  133.5  153.0  156.7  143.2  0.077 
BMP-Lite  124.2  138.1  155.2  157.3  143.7  0.038 
BMP  120.4  132.7  140.9  147.5  135.4  0.056 

We use MPJPE as the evaluation metric; lower is better. Best in bold.
Another popular 3D pose estimation benchmark is the MuPoTS-3D dataset [singleshotmultiperson2018]. We compare our method against two strong baselines: 1) the combination of OpenPose [cao2018openpose] with single-person mesh recovery methods (SMPLify-X [pavlakos2019expressive] and HMR [hmrKanazawa17]), and 2) the state-of-the-art top-down approach CRMH [jiang2020coherent]. We report the results in Table 4. As we can see, BMP significantly outperforms previous methods on both evaluation protocols.


Method  All  Matched  Time[s] 
SMPLify-X [pavlakos2019expressive]  62.84  68.04  6.4 
HMR [hmrKanazawa17]  66.09  70.90  0.26 
CRMH [jiang2020coherent]  69.12  72.22  0.083 
BMP-Lite  68.63  71.92  0.038 
BMP  73.83  75.34  0.056 

Lastly, we compare BMP with state-of-the-art approaches on the challenging in-the-wild 3DPW dataset. Some of these approaches use a self-training strategy (e.g., SPIN [kolotouros2019spin]) or temporal information (e.g., VIBE [kocabas2020vibe]), and rely on off-the-shelf person detectors [cao2018openpose, redmon2018yolov3]. As shown in Table 5, BMP outperforms CRMH [jiang2020coherent] and SPIN [kolotouros2019spin] in terms of 3DPCK while maintaining attractive efficiency, and achieves results comparable to VIBE [kocabas2020vibe] without relying on any temporal information. Additionally, BMP-Lite obtains roughly the same performance as the state-of-the-art CRMH model with faster inference. These results further confirm the effectiveness of our single-stage solution over existing multi-stage strategies, with very competitive efficiency.


Method  PCK  AUC  MPJPE  PA-MPJPE  PVE  Time[s] 
SPIN [kolotouros2019spin]  30.8  53.4  99.4  68.1    0.31 
VIBE [kocabas2020vibe]  33.9  56.6  94.7  66.1  112.7   
CRMH [jiang2020coherent]  25.8  51.6  105.3  62.3  122.2  0.09 
BMP-Lite  26.2  51.3  108.5  64.0  126.2  0.038 
BMP  32.1  54.5  104.1  63.8  119.3  0.056 

Qualitative results We visualize body mesh reconstructions by BMP on the challenging PoseTrack, MPII and COCO datasets in Fig. 3. BMP is robust to severe occlusion and crowded scenes, and can reconstruct human bodies with correct depth ordering.
4.3 Ablative studies
We conduct ablation analysis on the Panoptic, 3DPW and MuPoTS-3D datasets, both qualitatively and quantitatively, to justify our design choices. Qualitative analysis of the proposed method is illustrated in Fig. 4.
Person instance representation We first evaluate the proposed 3D point-based representation for person instances. The main difference from previous 2D spatial representations [nie2019spm, zhou2019objects, CenterHMR] is the additional depth dimension, which differentiates person instances in the discretized depth space through the FPN. We compare BMP with a baseline model (i.e., BMP using the 2D spatial representation). For a fair comparison, we aggregate features from all levels of the FPN pyramid in the baseline model to obtain a single output for both instance localization and body mesh recovery. Specifically, we study three aggregation methods: we resize all pyramid features to 1/8 scale and then aggregate them by 1) element-wise addition (Baseline-Add), 2) concatenation (Baseline-Concat), or 3) a convolutional layer after concatenation (Baseline-Conv). Results are shown in Table 6. BMP improves upon the baseline models by a large margin on all datasets, proving the efficacy of the proposed representation. Additionally, from Fig. 4 (1st row), we observe that BMP with the proposed representation is more robust in handling occluding instances, especially when the body centers of multiple instances fall into the same spatial grid cell, while the 2D representation usually fails.


Method  Panoptic (MPJPE, lower better)  3DPW (MPJPE, lower better)  MuPoTS-3D (3DPCK, higher better) 
Baseline-Add  159.1  120.4  68.03 
Baseline-Concat  150.3  114.6  68.52 
Baseline-Conv  145.6  110.8  69.34 
BMP  135.4  104.1  73.83 

Ordinal depth loss To investigate whether the ordinal depth loss $L_{ord}$ helps produce more coherent results with correct depth ordering, we conduct experiments on the MuPoTS-3D dataset. Specifically, we evaluate the ordinal depth relations of all instance pairs in the scene and report the percentage of correctly estimated relations in Table 7. The model trained with $L_{ord}$ significantly improves upon the baseline (BMP trained w/o $L_{ord}$), from 91.42% to 94.50%. Such improvements can also be observed in Fig. 4 (2nd row). Additionally, comparing our method with Moon et al. [moon2019camera] and CRMH [jiang2020coherent], BMP achieves higher accuracy w.r.t. relative depth ordering than CRMH, which only considers an ordinal loss for overlapping pairs (94.50% vs. 93.68%). This demonstrates that our full pairwise ordinal loss provides more comprehensive supervision on the depth layout of the scene and thus trains the model to give more coherent results.


Method  Moon et al. [moon2019camera]  CRMH [jiang2020coherent]  BMP w/o $L_{ord}$  BMP 
Accuracy  90.85%  93.68%  91.42%  94.50% 

Keypoint-aware occlusion augmentation Finally, we study the impact of the proposed keypoint-aware occlusion augmentation strategy. In Table 8, we compare BMP with models trained without occlusion augmentation (BMP-NoAug) and with random synthetic occlusion [sarandi2018robust] (BMP-RandOcc). BMP outperforms both by a large margin on all datasets. Notably, it brings 9.1% and 17.3% improvements over BMP-NoAug on the Panoptic and 3DPW datasets, respectively, which feature crowded scenes with severe overlap and partial observation. In contrast, the random augmentation hurts model performance on MuPoTS-3D (71.71 to 70.78). This verifies that our occlusion augmentation forces the model to focus on body structure and thus improves its robustness to occlusion.


Method  Panoptic (MPJPE, lower better)  3DPW (MPJPE, lower better)  MuPoTS-3D (3DPCK, higher better) 
BMP-NoAug  148.9  125.9  71.71 
BMP-RandOcc  144.6  110.3  70.78 
BMP  135.4  104.1  73.83 

5 Conclusions
In this work, we present the first single-stage model, Body Meshes as Points (BMP), for multi-person body mesh recovery. BMP introduces a new representation to enable such a compact pipeline: each person instance is represented as a point in the spatial-depth space, associated with a parameterized body mesh. With such a representation, BMP can fully exploit shared features and perform person localization and body mesh recovery simultaneously. BMP significantly improves upon conventional two-stage paradigms, offering outstanding efficiency and accuracy, as validated by extensive experiments on multiple benchmarks. Besides, BMP develops several new techniques to further improve the coherence and robustness of recovered body meshes, which are of broad interest for other applications such as human pose estimation and detection. In the future, we will explore how to make the model more compact and further improve its efficiency, as well as extend it to modeling inter-person interactions.
Acknowledgement This research was partially supported by AISG100E2019035, MOE2017T22151, NUS_ECRA_FY17_P08 and CRP2020170006.