3D human body mesh recovery aims to reconstruct the 3D full-body mesh of each person instance from images or videos. As a fundamental yet challenging task, it has been widely applied to action recognition [varol2019synthetic], virtual try-on [mir2020learning], motion retargeting [liu2019liquid], etc. With the recent notable progress in single-person full-body mesh recovery [hmrKanazawa17, kolotouros2019convolutional, arnab2019exploiting, kocabas2020vibe], a more realistic and challenging setting has attracted increasing attention: estimating body meshes for multiple persons from a single image.
Existing methods for multi-person mesh recovery are mainly two-stage solutions, including top-down [jiang2020coherent] and bottom-up [zanfir2018monocular] approaches. The former first localizes person instances via a person detector and then recovers the 3D meshes individually; the latter estimates person keypoints first and then jointly reconstructs multiple 3D human bodies via constrained optimization [zanfir2018monocular]. Though achieving notable accuracy, both paradigms suffer from computational redundancy. The top-down approach estimates the body mesh for each person separately, so the total computation cost grows linearly with the number of persons in the image, while the bottom-up approach requires grouping the keypoints into corresponding persons and inferring the body meshes iteratively, leading to high computational cost.
Targeting a more efficient and compact pipeline, we explore a single-stage solution. Despite the recent popularity and promising performance of single-stage methods on 2D keypoint estimation [nie2019spm] and object detection [zhou2019objects, tian2019fcos], a single-stage pipeline for multi-person mesh recovery is barely explored, as it remains unclear how to effectively integrate person localization and mesh recovery within a single stage. In this work, we propose a new instance representation for multi-person body mesh recovery that represents multiple person instances as points in the spatial-depth space, where each point is associated with one body mesh. Such a representation allows effective parallelism of person localization and body mesh recovery. Based on it, we develop a new model architecture that exploits shareable features for both localization and mesh recovery and thus achieves a single-stage solution.
In particular, the model has two parallel branches, one for instance localization and the other for body mesh recovery. In the localization branch, we model each person instance as a single point in a 3-dimensional space, spatial (2D) plus depth (1D), where each localized point (detected person) is associated with a body mesh in the mesh branch, represented by the SMPL parametric model [loper2015smpl]. This in turn converts multi-person mesh recovery into a single-shot regression problem (Fig. 1). Specifically, the spatial location is represented by discrete coordinates of regular grids over the input image. Similarly, we discretize depth into several levels to obtain the depth representation. To learn feature representations that better differentiate instances at different depths, motivated by the observation that a person closer to the camera tends to appear larger in the image, we adopt the feature pyramid network (FPN) [lin2017feature] to extract multi-scale features and use features from the lower (finer) scales to represent the closer (and larger) instances. In this way, each instance is represented as one point, whose associated features (extracted from its corresponding spatial location and FPN scale) are used to effectively estimate its body mesh. We name this method Body Meshes as Points (BMP).
Applying the BMP model to estimate multi-person body meshes simultaneously faces two challenges in realistic scenarios: how to coherently reconstruct instances with correct depth ordering, and how to handle the common occlusion issue (e.g., overlapping instances and partial observations). For the first challenge, we explicitly use the ordinal relations among all the persons in the scene to supervise the model to output body meshes with correct depth order. However, obtaining such ordinal relations is non-trivial for scenes captured in the wild, since no 3D annotation is available. Inspired by the recent success of depth estimation for human body joints [moon2019camera, zhen2020smap], we propose to take the depth of each person (center point), predicted by a model pre-trained on 3D datasets with depth annotations, as the pseudo ordinal relation for model training on in-the-wild data, which is experimentally shown to benefit depth-coherent body mesh reconstruction.
Also, to tackle the common occlusion issue, we propose a novel keypoint-aware occlusion augmentation strategy to improve the model robustness to occluded person instances. Different from the previous method [sarandi2018robust] that randomly simulates occlusion in images, we generate synthetic occlusion based on the position of skeleton keypoints. Such keypoint-aware occlusion explicitly forces the model to focus on body structure, making it more robust to occlusion.
Comprehensive experiments on the 3D pose benchmarks Panoptic [joo2015panoptic], MuPoTS-3D [singleshotmultiperson2018] and 3DPW [vonMarcard2018] clearly demonstrate the high efficiency of the proposed model. Moreover, it achieves new state-of-the-art results on the Panoptic and MuPoTS-3D datasets, and competitive performance on the 3DPW dataset. Our contributions are summarized as follows: 1) To the best of our knowledge, we are among the first to explore a single-stage solution to multi-person mesh recovery. We introduce a new person instance representation that enables simultaneous person localization and body mesh recovery for all person instances in an image within a single stage, and design a novel model architecture accordingly. 2) We propose a simple yet effective inter-instance ordinal relation supervision to encourage depth-coherent reconstruction. 3) We propose a keypoint-aware occlusion augmentation strategy that takes body structure into consideration, to improve model robustness to occlusion.
2 Related Work
Single-person 3D pose and shape Previous works estimate 3D poses in the form of body skeletons [martinez2017simple, mehta2017vnect, tome2017lifting, zhou2017towards, popa2017deep, pavlakos2018ordinal, sun2018integral, zhang2020inference, gong2021poseaug] or non-parametric 3D shapes [gabeur2019moulding, smith2019facsimile, varol2018bodynet]. In this work, we use a 3D mesh to represent the full-body pose and shape, and adopt the SMPL parametric model [loper2015smpl] for body mesh recovery. In the literature, Bogo et al. [bogo2016keep] proposed SMPLify, the first optimization-based method that iteratively fits SMPL to detected 2D joints. Later works extend SMPLify by either using denser reference points (e.g., silhouettes and voxel occupancy grids) in place of sparse keypoints for SMPL fitting [lassner2017unite, varol2018bodynet], or fitting a model more expressive than SMPL, i.e., SMPL-X [pavlakos2019expressive].
Some recent works directly regress the SMPL parameters from images via deep neural networks in a two-stage manner. They first estimate an intermediate representation (e.g., keypoints, silhouettes, etc.) from images and then map it to SMPL parameters [pavlakos2018learning, omran2018neural, tung2017self, kolotouros2019convolutional]. Others directly estimate SMPL parameters from images, either using complex model training strategies [hmrKanazawa17, guler2019holopose] or leveraging temporal information [arnab2019exploiting, kocabas2020vibe]. Although high accuracy is achieved in single-person cases, it remains unclear how to extend these methods to the more general multi-person cases.
Multi-person 3D pose and shape For multi-person 3D pose estimation, most existing methods adopt a top-down paradigm [rogez2017lcr, dabral2018learning, rogez2019lcr]. They first detect each person instance and then regress the locations of the body joints. Follow-up improvements are made by estimating additional absolute depth [moon2019camera], considering multi-person interaction [guo2020pi, li2020hmor] or extending to whole-body pose estimation [weinzaepfel2020dope]. Alternatively, some approaches also explore the bottom-up paradigm. SSMP3D [mehta2018single] and SMAP [zhen2020smap] estimate 3D poses from occlusion-aware pose maps and use Part Affinity Fields [cao2017realtime] to infer their association. LoCO [fabbri2020compressed] maps the image to the volumetric heatmaps and then estimates multi-person 3D poses from them by an encoder-decoder framework. PandaNet [benzine2020pandanet] is an anchor-based model where 3D poses are regressed for each anchor position.
In contrast to the prosperity of multi-person 3D pose estimation, only a limited number of works are devoted to body mesh recovery for multiple people. Zanfir et al. [zanfir2018monocular] first estimate 3D joints of persons in the image and then optimize their 3D shapes jointly under multiple constraints. They also propose a two-stage regression-based scheme that first estimates 3D joints for all the persons and then regresses their 3D shapes based on these 3D joints [zanfir2018deep]. Instead of regressing SMPL parameters from an intermediate representation (e.g., 3D joints), Jiang et al. [jiang2020coherent] attach an SMPL head to the Faster R-CNN framework [ren2015faster] to estimate SMPL parameters directly from the input image in a top-down manner. Despite the encouraging results, these methods are based on indirect multi-stage frameworks and suffer from low efficiency. Different from all previous methods that rely on a multi-stage pipeline with computational redundancy, our method unifies person localization and body mesh recovery, enabling a box-free and (ad hoc) optimization-free single-stage solution to multi-person body mesh recovery.
Point-based representation Point-based methods [duan2019centernet, zhou2019objects, tian2019fcos] represent each instance by a single point at its center. This representation is regarded as a simpler alternative to anchor-based representations and has been widely used in many tasks, including object detection [duan2019centernet, zhou2019objects, tian2019fcos], 2D keypoint estimation [nie2019spm] and instance segmentation [wang2019solo]. However, these methods cannot be directly applied to body mesh recovery. In this work, we extend the point-based representation to multi-person body mesh recovery. A concurrent work [CenterHMR] adopts a similar solution to body mesh recovery. Our model differs from it in two significant aspects: 1) BMP aims at more coherent reconstruction of persons in the scene. It handles challenging spatial arrangements and occlusion by exploiting the ordinal depth loss and the keypoint-aware augmentation strategy, neither of which is considered in [CenterHMR]. 2) BMP adopts a novel 3D point-based representation to differentiate instances at different depths and is thus more robust to overlapping instances, whereas [CenterHMR] uses only a 2D representation and would fail in such cases.
3 Body Meshes as Points
3.1 Proposed single-stage solution
Given an image, multi-person body mesh recovery aims to recover the body meshes of all the person instances in it. Existing approaches [zanfir2018deep, zanfir2018monocular, jiang2020coherent] solve this task by sequentially localizing person instances and estimating their body meshes in a multi-stage manner, leading to computational redundancy. Differently, this work unifies instance localization and body mesh recovery into a single-stage solution, enabling a more efficient and concise framework.
We represent each person instance as a single point in a 3-dimensional space (spanned by 2D spatial and 1D depth dimensions). By dividing the input image uniformly into grids, the spatial dimensions can be easily represented with grid coordinates: if the body center of a person falls into a grid cell, the instance is assigned the spatial coordinate of that cell. Similarly, for the depth dimension, we discretize the depth value into several levels and assign each instance a level according to its depth. Such a discretized depth value is beneficial for handling occluding instances, especially when the body centers of multiple instances fall into the same spatial grid cell.
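The representation above can be sketched as follows. The grid size S and the metric depth bins are illustrative assumptions for this sketch; in the paper, the depth discretization is realized implicitly by assigning instances to FPN levels according to their scale (Table 1), not by thresholding metric depth.

```python
import numpy as np

def assign_point(center_xy, depth, img_size, S=64, depth_bins=(2.0, 4.0, 8.0, 16.0)):
    """Map a person's body center and depth to a (grid_x, grid_y, level) point.

    center_xy: (x, y) pixel coordinates of the body center (pelvis).
    depth: distance of the person from the camera, in meters.
    img_size: (width, height) of the input image.
    S and depth_bins are illustrative values, not taken from the paper.
    """
    w, h = img_size
    gx = min(int(center_xy[0] / w * S), S - 1)  # spatial grid column
    gy = min(int(center_xy[1] / h * S), S - 1)  # spatial grid row
    # closer persons -> lower depth level (finer FPN scale in the paper)
    level = int(np.searchsorted(depth_bins, depth))
    return gx, gy, level
```

Two persons whose centers fall into the same (gx, gy) cell can still be told apart as long as they land in different depth levels, which is exactly the occlusion case the 3D representation targets.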
Given this representation, we reformulate multi-person mesh recovery as two simultaneous prediction tasks: 1) instance localization and 2) body mesh recovery.
Instance localization For the first task, we employ an instance map to locate each person instance in the image, with one spatial grid of cells per depth level. For each depth level, the network is trained to regress a scalar indicating the probability that each grid cell contains a person.
To construct the ground truth (GT) for training, we first determine the depth level of each instance. We observe that a person tends to appear larger (smaller) in the image when standing closer to (farther from) the camera; in other words, the depth of an instance is roughly inversely proportional to its scale. Motivated by this, we employ a Feature Pyramid Network (FPN) [lin2017feature] whose pyramid levels capture different scales, each of which is used to represent instances with the corresponding depth. More specifically, for each instance, we compute its scale from the GT body size and associate it with the corresponding pyramid level according to Table 1.
Next, we locate the grid cells where the central region of that person lies. Inspired by [zhou2019objects, duan2019centernet], the central region is defined as follows: given the GT body center and body size of each person and a controllable scale factor, the central region is the box centered at the body center with the body size scaled down by that factor. In this work, we set the pelvis position as the body center. Once identified, the grid cells covered by the central region at the assigned pyramid level are labeled as positive (label 1). The above steps are repeated for all the instances in the image.
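The GT labeling step can be sketched as below. The scale factor value (0.2) is an assumption for illustration only, since the extracted text does not preserve the paper's value; the box-shrinking scheme itself follows [zhou2019objects, duan2019centernet].

```python
import numpy as np

def label_instance_map(inst_map, center_xy, body_wh, img_size, eps=0.2):
    """Mark grid cells covered by the central region of one person as positive.

    inst_map: (S, S) array for one depth level, modified in place.
    center_xy: GT body center (pelvis) in pixels; body_wh: GT body size (w, h).
    eps: controllable scale factor shrinking the body box to its central region
         (0.2 is an illustrative value, not taken from the paper).
    """
    S = inst_map.shape[0]
    W, H = img_size
    cx, cy = center_xy
    w, h = body_wh
    # central region: box of size (eps*w, eps*h) centered at (cx, cy),
    # converted to grid-cell indices
    x0 = max(int((cx - eps * w / 2) / W * S), 0)
    x1 = min(int((cx + eps * w / 2) / W * S), S - 1)
    y0 = max(int((cy - eps * h / 2) / H * S), 0)
    y1 = min(int((cy + eps * h / 2) / H * S), S - 1)
    inst_map[y0:y1 + 1, x0:x1 + 1] = 1.0
    return inst_map
```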
Body mesh representation In parallel with instance localization, we use a body mesh map for body mesh recovery, which regresses a body mesh representation at each grid cell. Concretely, given a positive response in the instance map indicating the presence of a person, we regress the body mesh representation from the features of the corresponding grid cell, as shown in Fig. 2. We use the SMPL parametric model [loper2015smpl], which renders a body mesh from the pose parameters (rotations of 24 joints) and the 10-d shape parameters. To improve training stability, we adopt the 6D rotation representation [zhou2019continuity] for the pose parameters, yielding a 144-d pose vector. The body mesh map also predicts a 3-d camera parameter for projecting body joints from 3D back to 2D space, which enables training on in-the-wild 2D pose datasets [johnson2010clustered, lin2014microsoft, andriluka20142d] to improve model generalization [hmrKanazawa17]. We further introduce a scalar confidence score, defined as the OKS [girdhar2018detect] between the projected and GT 2D keypoints, to reflect the confidence level of the SMPL prediction, and an absolute depth variable for the corresponding person instance that is used to penalize body mesh estimations with incoherent depth ordering (see Sec. 3.2 for details). Therefore, the total channel number of the body mesh map is 159 (144 + 10 + 3 + 1 + 1).
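Before the SMPL model can be evaluated, each joint's 6D rotation output must be mapped back to a 3x3 rotation matrix via the Gram-Schmidt construction of [zhou2019continuity]. A minimal sketch of that conversion:

```python
import numpy as np

def rot6d_to_matrix(x):
    """Recover a 3x3 rotation matrix from the 6D representation [zhou2019continuity].

    x: 6 values, interpreted as the first two (unnormalized) columns of the
    rotation matrix; the third column is recovered by a cross product.
    """
    a1, a2 = np.asarray(x[:3], float), np.asarray(x[3:6], float)
    b1 = a1 / np.linalg.norm(a1)       # normalize first column
    a2 = a2 - np.dot(b1, a2) * b1      # Gram-Schmidt: remove component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)              # third column completes the right-handed frame
    return np.stack([b1, b2, b3], axis=1)
```

This mapping is continuous in the 6D input, which is why it trains more stably than axis-angle or quaternion regression; any (non-degenerate) 6-vector decodes to a valid rotation.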
Network architecture We employ ResNet-50 [He_2016_CVPR] as the backbone, with FPN built on top to extract a pyramid of feature maps (256-d). We attach two task-specific heads to each level of the feature pyramid, one for instance localization and the other for body mesh recovery, producing the instance map and the body mesh map, respectively. As shown in Fig. 2, each head consists of 7 stacked convolutions and one task-specific prediction layer. However, directly estimating the camera parameter from the whole image is non-trivial since it is sensitive to instance position. Inspired by CoordConv [liu2018intriguing], we concatenate normalized pixel coordinates to the input feature map at the beginning of the mesh recovery head to encode position information for better camera parameter estimation. Additionally, Group Normalization [wu2018group] is used in both prediction heads to facilitate model training. To match the feature map size to the grid resolution, we apply bilinear interpolation before the instance localization and body mesh recovery branches.
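The CoordConv-style coordinate concatenation can be sketched as follows; normalizing to [-1, 1] is a common convention assumed here, and the sketch operates on a single (C, H, W) feature map rather than a batched tensor.

```python
import numpy as np

def add_coord_channels(feat):
    """Concatenate normalized x/y pixel coordinates to a feature map (CoordConv).

    feat: (C, H, W) feature map; returns a (C + 2, H, W) map whose last two
    channels hold x and y coordinates normalized to [-1, 1] (assumed range).
    """
    C, H, W = feat.shape
    ys = np.linspace(-1.0, 1.0, H).reshape(H, 1).repeat(W, axis=1)  # row coords
    xs = np.linspace(-1.0, 1.0, W).reshape(1, W).repeat(H, axis=0)  # column coords
    return np.concatenate([feat, xs[None], ys[None]], axis=0)
```

The subsequent convolutions can then condition their camera-parameter prediction on where in the image a grid cell sits, which plain convolutions (being translation-equivariant) cannot do.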
3.2 Inter-instance ordinal depth supervision
Multi-person body mesh recovery is inherently ill-posed, as multiple 3D predictions can correspond to the same 2D projection. A trained model may therefore produce ambiguous body mesh estimations with incorrect depth order due to the lack of priors. To alleviate this problem, we use ordinal depth relations among all the persons in the input as supervision to guide reasoning about depth ordering during training.
More concretely, given any two persons (i, j) in the image with depths d_i and d_j, we define the ordinal depth relation r(i, j) between them as

r(i, j) = 0 if |d_i - d_j| <= tau, r(i, j) = 1 if d_i - d_j > tau, r(i, j) = -1 if d_j - d_i > tau,

where d_i denotes the depth of person i and tau is a pre-defined threshold to determine the ordinal relation. The relation r(i, j) = 0 means both instances are at roughly the same depth; otherwise one of them is closer to the camera than the other. With the ordinal relation of (i, j), we define the ordinal depth loss for this pair as

l_ord(i, j) = (d_hat_i - d_hat_j)^2 if r(i, j) = 0, and l_ord(i, j) = log(1 + exp(-r(i, j)(d_hat_i - d_hat_j))) otherwise,
where d_hat_i denotes the i-th person's body mesh depth, recovered from the predicted camera parameter as d_hat_i = 2f / (s_i W), with focal length f, predicted scale s_i and image long-edge width W. The ordinal depth loss enforces a large margin between d_hat_i and d_hat_j if r(i, j) != 0, i.e., one of them is measured as closer than the other, and otherwise enforces them to be equal.
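The pairwise term can be sketched as below. The threshold value (0.3 m) is an assumed constant, and the squared-difference / log-exp form is a reconstruction consistent with the description above (equality when at the same depth, a margin otherwise), in the spirit of the ordinal supervision of [pavlakos2018ordinal].

```python
import numpy as np

def ordinal_depth_loss(d_pred_i, d_pred_j, d_gt_i, d_gt_j, tau=0.3):
    """Pairwise ordinal depth loss for persons i and j.

    d_pred_*: predicted depths recovered from the camera parameters;
    d_gt_*: (pseudo) GT depths; tau: ordinal threshold (0.3 m assumed here).
    """
    diff = d_gt_i - d_gt_j
    if abs(diff) <= tau:                 # r = 0: same depth -> pull predictions together
        return (d_pred_i - d_pred_j) ** 2
    r = 1.0 if diff > 0 else -1.0        # r = 1 means person i is farther from the camera
    # log-loss pushes d_pred_i - d_pred_j toward a large margin with the sign of r
    return np.log1p(np.exp(-r * (d_pred_i - d_pred_j)))
```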
However, in practice such ordinal depth relations are rarely available for scenes captured in the wild due to the lack of 3D annotations. To solve this issue, we propose to use pseudo ordinal relations for model training on in-the-wild data. Specifically, we first train the model on 3D datasets [ionescu2014human3, singleshotmultiperson2018] with depth annotations to learn to estimate the depth of each person in the images. We define the depth of each person as the depth of the body center (i.e., the pelvis joint). The model is trained by minimizing a depth loss defined as the mean squared error (MSE) between the predicted and GT depths. After that, given unlabeled data, we leverage the pre-trained model to estimate the depths, which are then used to obtain the pseudo ordinal relations among all the people in the image. Finally, given the pseudo ordinal relations, we adopt an OKS score-weighted ordinal depth loss to supervise model training on in-the-wild images. The total loss for an image is computed as the average loss over all instance pairs:
L_ord = (1 / N_pair) * sum over pairs (i, j) of c_i c_j l_ord(i, j), where N_pair denotes the number of instance pairs in the image and c_i denotes the OKS score of the i-th person. Intuitively, training the model with such inter-instance ordinal depth supervision helps it build a global understanding of the depth layout of the input scene and thus ensures more coherent reconstructions.
3.3 Keypoint-aware occlusion augmentation
SMPL-based body mesh recovery is highly sensitive to (partial) occlusion (e.g., overlapping persons, truncation) [zhang2020object, rockwell2020full]. To improve model robustness to occlusion without requiring extra training data or annotations, we propose a keypoint-aware occlusion augmentation strategy applied during training. It generates synthetic occlusion to simulate realistic challenging cases, such as partial observation, for model training. Previous work [sarandi2018robust] randomly places occluders on the images, which may produce easy training samples that are less helpful for boosting model performance; in contrast, our method generates synthetic occlusion based on the positions of skeleton keypoints, which forces the model to pay more attention to the body structure and leads to notable improvement. More concretely, given the set of keypoints of a person in the image, we first randomly choose a keypoint. Then we randomly sample a non-human object from the PASCAL VOC [everingham2011pascal] dataset and composite it at the location of the selected keypoint, after randomly resizing the sampled object to a size proportional to the person's area. Additionally, we randomly shift the keypoint position by an offset to avoid over-fitting. During training, we apply the occlusion augmentation with probability 0.5.
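The augmentation can be sketched as follows. The resize range relative to the person's area, the jitter magnitude, and the crop-as-resize shortcut are illustrative assumptions; a real implementation would properly resize a sampled VOC object crop.

```python
import numpy as np

def keypoint_aware_occlusion(img, keypoints, patch, area, rng, max_shift=10):
    """Paste an occluder patch at a randomly chosen keypoint of one person.

    img: (H, W, 3) uint8 image, modified in place; keypoints: (K, 2) pixel
    coordinates of the person's joints; patch: (h, w, 3) non-human object crop
    (e.g., sampled from PASCAL VOC); area: person area in pixels.
    """
    H, W = img.shape[:2]
    kx, ky = keypoints[rng.integers(len(keypoints))]
    # jitter the paste location to avoid overfitting to exact joint positions
    kx += rng.integers(-max_shift, max_shift + 1)
    ky += rng.integers(-max_shift, max_shift + 1)
    # occluder size proportional to the person's area (range is an assumption)
    side = max(int(np.sqrt(area) * rng.uniform(0.2, 0.5)), 1)
    patch = patch[:side, :side]  # crop stands in for a proper resize here
    h, w = patch.shape[:2]
    y0 = int(np.clip(ky - h // 2, 0, H - h))
    x0 = int(np.clip(kx - w // 2, 0, W - w))
    img[y0:y0 + h, x0:x0 + w] = patch
    return img
```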
3.4 Training and inference
For training our proposed BMP model, we define the loss function as the sum of an instance localization loss, the depth loss, and a body mesh loss. The localization loss is a modified two-class Focal Loss [lin2017focal]; the depth loss is as defined in Sec. 3.2; the body mesh loss is detailed below. The training of the body mesh branch largely follows HMR [hmrKanazawa17]. Specifically, the body mesh loss comprises: MSE losses between the predicted and GT pose parameters, shape parameters, 3D keypoints, and vertices, respectively; a 2D keypoint loss that minimizes the distance between the 2D projection of the 3D keypoints and the GT 2D keypoints; and an MSE loss between the predicted and GT confidences, where the GT confidence is computed as the OKS [girdhar2018detect] between the projected and GT 2D keypoints. Moreover, we use a discriminator and apply an adversarial loss on the regressed pose and shape parameters, to encourage the outputs to lie on the distribution of real human bodies. Each loss term is weighted by a corresponding factor. The body mesh loss is applied independently to each positive grid cell. The ordinal depth loss illustrated in Eqn. (3) is adopted when the image contains more than one instance.
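The GT confidence target is the OKS between projected and GT 2D keypoints. A sketch of the OKS computation is given below; the uniform per-keypoint sigma of 0.05 is an assumed simplification (the COCO evaluation [girdhar2018detect] defines joint-specific values), while the normalization follows the COCO-style formula.

```python
import numpy as np

def oks(pred, gt, area, sigmas=None):
    """Object keypoint similarity between predicted and GT 2D keypoints.

    pred, gt: (K, 2) keypoint coordinate arrays; area: person scale in pixels^2.
    sigmas: per-keypoint falloff constants; a uniform 0.05 is an assumed default.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    s = np.full(len(gt), 0.05) if sigmas is None else np.asarray(sigmas)
    d2 = ((pred - gt) ** 2).sum(axis=1)               # squared keypoint distances
    var = (2 * s) ** 2 * (area + np.spacing(1)) * 2   # COCO-style normalization
    return float(np.exp(-d2 / var).mean())
```

A perfect projection yields OKS = 1, and the score decays smoothly with keypoint error scaled by person size, making it a natural soft confidence target.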
Inference The overall inference procedure of BMP is illustrated in Fig. 2. Given an image, BMP first obtains the instance map and the body mesh map from the prediction heads. It then performs a max pooling operation on the instance map to find the local maxima, yielding the center point positions, each encoding the pyramid level and body center location of a detected person. After that, BMP extracts the body mesh parameters of each person from the corresponding position of the body mesh map. Finally, BMP outputs body mesh estimations by deforming the SMPL model with the predicted parameters. A keypoint-based NMS [girdhar2018detect] is applied to remove redundant predictions if they exist, using the product of the predicted OKS score and the probability score from the instance map as the confidence score.
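The local-maximum search can be sketched with a naive loop equivalent to the max-pooling trick; the score threshold and 3x3 window size are illustrative assumptions.

```python
import numpy as np

def find_peaks(inst_map, thresh=0.3, pool=3):
    """Keep grid cells that are local maxima of the instance map.

    inst_map: (S, S) probability map for one pyramid level. A cell survives if
    it is the maximum of its pool x pool neighborhood and exceeds thresh
    (both values are illustrative; in practice this is one max-pool op on GPU).
    Returns a list of (row, col, score) center candidates.
    """
    S = inst_map.shape[0]
    r = pool // 2
    padded = np.pad(inst_map, r, constant_values=-np.inf)
    peaks = []
    for y in range(S):
        for x in range(S):
            v = inst_map[y, x]
            # window in padded coords covers rows y-r..y+r of the original map
            if v >= thresh and v == padded[y:y + pool, x:x + pool].max():
                peaks.append((y, x, float(v)))
    return peaks
```

Each surviving (row, col) at a given pyramid level indexes directly into the body mesh map to read out that person's 159-d parameter vector, so detection and mesh readout share one forward pass.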
3.5 Implementation details
We implement BMP with PyTorch [paszke2017automatic] and the mmdetection library [mmdetection], and use Rectified Adam [liu2019variance] as the optimizer. We resize all images to 832x512 while keeping the aspect ratio, following the original COCO training scheme [su2019multi, wang2019solo, jiang2020coherent]. During training, we augment the samples with horizontal flip and keypoint-aware occlusion (Sec. 3.3). Flip augmentation is also conducted during testing. Moreover, since BMP directly extracts image-level features for estimation instead of features from cropped bounding boxes, it can take images with smaller resolution (512x512) as input. We denote this setting as BMP-Lite; all other training and testing settings are the same as for BMP. Please refer to the supplementary material for more details.
4 Experiments
In this section, we aim to answer the following questions: 1) Can BMP provide both efficient and accurate multi-person mesh recovery? 2) Can BMP produce coherent meshes for multiple persons with correct depth ordering? 3) Is BMP robust to cases where person instances are occluded or partially observed? To this end, we conduct extensive experiments on several large-scale benchmarks.
Human3.6M [ionescu2014human3] is the most widely used single-person 3D pose benchmark collected in an indoor environment. It contains 3.6 million 3D poses and corresponding videos for 15 subjects. Due to its high-quality annotations, we use it following [jiang2020coherent] for both training and testing.
Panoptic [joo2015panoptic] is a large-scale dataset captured in the Panoptic studio, offering 3D pose annotations for multiple people engaging in diverse social activities. We use this dataset for evaluation with the same protocol as [zanfir2018monocular].
MuPoTS-3D [singleshotmultiperson2018] is a multi-person dataset with 3D pose annotations for both indoor and in-the-wild scenes. We follow [singleshotmultiperson2018] and use it for evaluation.
3DPW [vonMarcard2018] is a multi-person in-the-wild dataset, which features diverse motions and scenes. It contains 60 video sequences (24 train, 24 test, 12 validation) with full-body mesh annotations. To verify generalizability of the proposed model to challenging in-the-wild scenarios, we use its test set for evaluation, following the same protocol as [kocabas2020vibe].
MPI-INF-3DHP [mehta2017vnect] is a single-person multi-view 3D pose dataset containing 8 actors performing 8 activities, captured by 14 cameras. Mehta et al. [singleshotmultiperson2018] generate a multi-person dataset, MuCo-3DHP, from MPI-INF-3DHP by compositing segmented foreground human appearances. We use both datasets for training.
COCO [lin2014microsoft], LSP [johnson2010clustered], LSP Extended [Johnson11], PoseTrack [Andriluka_2018_CVPR], MPII [andriluka20142d] are in-the-wild datasets with annotations for 2D joints. We use them for training with the weakly-supervised training strategy [hmrKanazawa17] (Eqn. (5)).
4.2 Comparison with state-of-the-arts
Single-person setting We first evaluate our proposed BMP model in the single-person setting to validate that factorizing instance localization and mesh recovery does not sacrifice performance. Concretely, we compare BMP on the large-scale Human3.6M dataset with the most competitive approaches [hmrKanazawa17, jiang2020coherent] that share a similar regression target and learning strategy. The results are shown in Table 2: BMP outperforms both methods.
Method | HMR [hmrKanazawa17] | CRMH [jiang2020coherent] | BMP
Multi-person settings We then evaluate BMP on multi-person body mesh recovery. We first evaluate on the multi-person dataset captured in the indoor Panoptic studio [joo2015panoptic] and compare with the most competitive approaches [zanfir2018monocular, zanfir2018deep, jiang2020coherent]. As shown in Table 3, BMP achieves the best performance in all scenarios. Overall, it improves upon the state-of-the-art top-down model CRMH [jiang2020coherent] by 5.4% (135.4 mm vs. 143.2 mm in MPJPE), while offering a faster inference speed (we count per-image inference time in seconds; for all methods, time is measured on a Tesla P100 GPU and an Intel E5-2650 v2 @ 2.60GHz CPU, without test-time augmentation). Moreover, it significantly outperforms CRMH in the Ultimatum and Pizza scenarios featuring crowded scenes and severe occlusion, verifying its robustness to occlusion. In addition, its lite version, BMP-Lite, is even faster, requiring only 0.038s to process an image, considerably faster than CRMH while achieving comparable performance. These results demonstrate both the effectiveness and efficiency of BMP for estimating body meshes of multiple persons in a single stage.
Zanfir et al. [zanfir2018monocular] | 140.0 | 165.9 | 150.7 | 156.0 | 153.4 | -
We use MPJPE as the evaluation metric (the lower the better). Best in bold.
Another popular 3D pose estimation benchmark is the MuPoTS-3D dataset [singleshotmultiperson2018]. We compare our method against two strong baselines: 1) the combination of OpenPose [cao2018openpose] with single-person mesh recovery methods (SMPLify-X [pavlakos2019expressive] and HMR [hmrKanazawa17]), and 2) the state-of-the-art top-down approach CRMH [jiang2020coherent]. We report the results in Table 4. BMP significantly outperforms previous methods on both evaluation protocols.
Lastly, we compare our BMP model with state-of-the-art approaches on the challenging in-the-wild 3DPW dataset. Some approaches use a self-training strategy (e.g., SPIN [kolotouros2019spin]) or temporal information (e.g., VIBE [kocabas2020vibe]), and rely on off-the-shelf person detectors [cao2018openpose, redmon2018yolov3]. As shown in Table 5, BMP outperforms CRMH [jiang2020coherent] and SPIN [kolotouros2019spin] in terms of 3DPCK while maintaining attractive efficiency, and achieves results comparable to VIBE [kocabas2020vibe] without relying on any temporal information. Additionally, BMP-Lite obtains roughly the same performance as the state-of-the-art CRMH model with a faster inference speed. These results further confirm the effectiveness of our single-stage solution over existing multi-stage strategies, with very competitive efficiency.
Qualitative results We visualize some body mesh reconstructions of BMP on the challenging PoseTrack, MPII and COCO datasets, as shown in Fig. 3. It can be observed that BMP is robust to severe occlusion and crowded scenes and can reconstruct human bodies with correct depth ordering.
4.3 Ablative studies
We conduct ablation analysis on Panoptic, 3DPW and MuPoTS-3D datasets both qualitatively and quantitatively to justify our design choices. The qualitative analysis of the proposed method is illustrated in Fig. 4.
Person instance representation We first evaluate the proposed 3D point-based representation for person instances. The main difference between the proposed representation and the previous 2D spatial representation [nie2019spm, zhou2019objects, CenterHMR] is the additional depth dimension that differentiates person instances in the discretized depth space through FPN. We compare BMP with a baseline model (i.e., BMP using the 2D spatial representation). For a fair comparison, we aggregate features from all levels of the FPN pyramid in the baseline model to obtain a single output for both instance localization and body mesh recovery. Specifically, we study three aggregation methods: we resize all feature pyramids to 1/8 scale and then aggregate them by 1) element-wise addition (Baseline-Add), 2) concatenation (Baseline-Concat), or 3) a convolutional layer after concatenation (Baseline-Conv). Results are shown in Table 6. BMP improves upon the baseline models by a large margin on all datasets, demonstrating the efficacy of the proposed representation for body mesh recovery. Additionally, from Fig. 4 (1st row), we observe that BMP with the proposed representation is more robust in handling occluding instances, especially when the body centers of multiple instances fall into the same spatial grid cell, where the 2D representation usually fails.
Method | Panoptic | 3DPW | MuPoTS-3D
Ordinal depth loss To investigate whether the ordinal depth loss helps produce more coherent results with correct depth ordering, we conduct experiments on the MuPoTS-3D dataset. Specifically, we evaluate the ordinal depth relations of all instance pairs in the scene and report the percentage of correctly estimated ordinal depth relations in Table 7. The model trained with the ordinal depth loss significantly improves upon the baseline trained without it (from 91.42% to 94.50%). Such improvements can also be observed in Fig. 4 (2nd row). Additionally, comparing our method with Moon et al. [moon2019camera] and CRMH [jiang2020coherent], BMP achieves higher accuracy w.r.t. relative depth ordering than CRMH, which only considers an ordinal loss for overlapping pairs (94.50% vs. 93.68%). This demonstrates that our full pairwise ordinal loss provides more comprehensive supervision on the depth layout of the scene and thus trains the model to give more coherent results.
Method | Moon [moon2019camera] | CRMH [jiang2020coherent] | BMP w/o ordinal loss | BMP
Keypoint-aware occlusion augmentation Finally, we study the impact of the proposed keypoint-aware occlusion augmentation strategy. We compare BMP with models trained without occlusion augmentation (BMP-NoAug) and trained with random synthetic occlusion [sarandi2018robust] (BMP-RandOcc) in Table 8. BMP outperforms both by a large margin on all datasets. Notably, it brings 9.1% and 17.3% improvements over BMP-NoAug on the Panoptic and 3DPW datasets respectively, which feature crowded scenes with severe overlap and partial observation. In contrast, the random augmentation hurts model performance on MuPoTS-3D (71.71 to 70.78). This verifies that our occlusion augmentation forces the model to focus on body structure and thus improves its robustness to occlusion.
Method | Panoptic | 3DPW | MuPoTS-3D
In this work, we present the first single-stage model, Body Meshes as Points (BMP), for multi-person body mesh recovery. BMP introduces a new representation to enable such a compact pipeline: each person instance is represented as a point in the spatial-depth space, associated with a parameterized body mesh. With this representation, BMP can fully exploit shared features and perform person localization and body mesh recovery simultaneously. BMP significantly improves upon conventional two-stage paradigms and offers outstanding efficiency and accuracy, as validated by extensive experiments on multiple benchmarks. BMP also introduces several new techniques to further improve the coherence and robustness of recovered body meshes, which may be of broad interest for other applications such as human pose estimation and detection. In the future, we will explore how to make the model more compact to further improve its efficiency, as well as extend it to modeling inter-person interactions.
Acknowledgement This research was partially supported by AISG-100E-2019-035, MOE2017-T2-2-151, NUS_ECRA_FY17_P08 and CRP20-2017-0006.