Recent years have witnessed great progress in vision-based human performance capture, which promises to enable various applications (e.g., tele-presence, sportscasting, gaming and mixed reality) with enhanced interactive and immersive experiences. To achieve highly detailed geometry and texture reconstruction, dense camera rigs, sometimes equipped with sophisticated lighting systems, have been introduced [vlasic2009dynamic, collet2015high, joo2018total, joo2019panoptic, guo2019relightables]. However, such extremely expensive and professional setups limit their popularity. Although other light-weight multi-view human performance capture systems have achieved impressive results, even in real time, they still rely on pre-scanned templates [liu2011markerless, liu2013markerless] or custom-designed RGBD sensors [dou2016fusion4d, dou2017motion2fusion], or are limited to single-person reconstruction [Starck07, gall2009motion, huang2018deep, Minimal18].
Benefiting from the fast improvement of deep implicit functions for 3D representations, recent methods [saito2019pifu, saito2020pifuhd, Monoport2020] are able to recover the 3D body shape from only a single RGB image. Compared with voxel-based [varol2018bodynet, huang2018deep, zheng2019deephuman] or mesh-based [natsume2019siclope, alldieck2019tex2shape] representations, an implicit function guides deep learning models to notice geometric details in a more efficient way. Specifically, PIFu [saito2019pifu, Monoport2020] achieves plausible single human reconstruction using only RGB images, and PIFuHD [saito2020pifuhd] further utilizes normal maps and high resolution images to generate more detailed results.
Despite their prominent performance in digitizing the 3D human body, both PIFu [saito2019pifu] and PIFuHD [saito2020pifuhd] suffer from several drawbacks when the frameworks are extended to multi-person scenarios and multi-view setups. Firstly, the average-pooling-based multi-view feature fusion strategy in PIFu leads to over-smoothed outputs when high-frequency details are included in the multi-view features, e.g., the normal maps used in PIFuHD. More importantly, in both approaches good reconstruction results are only guaranteed for ideal input images without severe occlusions. In multi-person performance capture scenarios, the reconstruction performance of [saito2019pifu, saito2020pifuhd] deteriorates significantly due to the lack of observations caused by severe occlusions.
To address the aforementioned problems, we propose a novel framework to perform multi-person reconstruction from multi-view images. First of all, inspired by [vaswani2017attention], we design a spatial attention-aware module to adaptively aggregate information from multi-view inputs. The module effectively captures and merges the geometric details from different viewpoints, and contributes to a significant improvement of the results under multi-view setups. Moreover, for multi-person reconstruction, we further combine the attention module with parametric models, i.e., SMPL, to enhance robustness while maintaining fine-grained details. The SMPL model serves as a 3D geometry proxy which compensates for the missing information where occlusions take place. With the semantic information provided by SMPL, the network is capable of reconstructing complete human bodies even in closely interactive scenarios. Finally, when dealing with moving characters from video, we propose a temporal fusion method that weights the signed distance field (SDF) across the time domain, which further enhances the temporal consistency of the reconstructed dynamic 3D sequences.
Another urgent problem is that the lack of high-quality scans of multi-person interactive scenarios in the community makes it difficult to accurately evaluate multi-person performance capture systems like ours. To fill this gap and better evaluate the performance of our system, we contribute a novel dataset, MultiHuman, which consists of 150 high-quality scans, each containing 1 to 3 persons in interactive actions (including both natural and close interactions). The dataset is further divided into several categories according to the level of occlusion and the number of persons in the scene, so that a detailed evaluation can be conducted. Experimental results demonstrate the state-of-the-art performance and good generalization capacity of our approach. In general, the main contributions of this work can be summarized as follows:
We propose a novel framework for high-fidelity multi-view reconstruction of multi-person interactive scenarios. By leveraging the human shape and pose prior to resolve the ambiguities introduced by severe occlusions, we achieve state-of-the-art performance even with partial observations in each view.
We design an efficient spatial attention-aware module to obtain fine-grained details for multi-view setups, and introduce a novel temporal fusion method to reduce the reconstruction inconsistencies for moving characters from video inputs.
We contribute an extremely high quality 3D model dataset containing 150 multi-person interacting scenes. The dataset can be used for training and evaluation of related topics in future research.
2 Related Work
Single-view performance capture
Many methods have been proposed to reconstruct detailed geometry from single-view inputs. Typical techniques include silhouette estimation [natsume2019siclope], depth estimation [gabeur2019moulding, smith2019facsimile] and template-based deformation [alldieck2019tex2shape, zhu2019detailed, habermann2020deepcap]. Moreover, SMPL [loper2015smpl] regression or optimization can be incorporated to generate more reliable and robust outputs, as shown in [zheng2019deephuman, zheng2020pamir]. Real-time methods can be implemented with the aid of a single depth sensor [yu2017bodyfusion, yu2018doublefusion] or by innovating computation and rendering algorithms [Monoport2020]. Regarding the 3D representations used in these methods, we can split them into two categories: explicit [varol2018bodynet, natsume2019siclope, zheng2019deephuman] and implicit [saito2019pifu, saito2020pifuhd, huang2018deep, huang2020arch] reconstruction methods. Compared with traditional explicit 3D representations, implicit representations show certain advantages in domain-specific shape learning and detail preservation. For example, PIFu [saito2019pifu] defines the surface as a level set of an implicit function. Similarly, [huang2018deep] defines a probability field of surface points, and ARCH [huang2020arch] predicts a 3D occupancy map. However, all of the methods above focus mainly on single-person reconstruction, and it remains difficult for them to achieve accurate reconstruction in multi-person scenarios.
Multi-view performance capture
Motion capture has been developed to make accurate motion predictions in multi-person interaction scenes [belagiannis20143d, liu2011markerless, liu2013markerless, joo2019panoptic]. Some methods even achieve real-time performance [bridgeman2019multi, dong2019fast, zhang20204d]. However, these works only capture skeleton motions instead of detailed geometry. Regarding multi-view geometry reconstruction, previous studies use template-based deformation methods [de2008performance, vlasic2008articulated, gall2009motion], skeleton tracking [vlasic2008articulated, gall2009motion] or fusion-based techniques [dou2013scanning]. Aside from the long computation time, these methods often show deficiencies in mapping textures, handling topology changes or dealing with drastic frame-to-frame motion. Moreover, they show limited adaptability to multi-person capture, as they cannot effectively deal with occlusions. Robust, high-quality reconstruction methods often come with prohibitive dependencies and constraints. Some methods depend on dense viewpoints [collet2015high, joo2018total] and even controlled lighting [vlasic2009dynamic, guo2019relightables] to reconstruct detailed geometry. Another branch, multi-view RGBD systems [dou2016fusion4d, dou2017motion2fusion], achieves impressive real-time performance capture results even for multi-person scenarios, benefiting from strong depth observations. Note that Huang et al. [huang2018deep] also present a volumetric capture approach that achieves quality results using very sparse-view RGB inputs, but they only focus on single-person reconstruction without considering how to resolve the challenges introduced by multi-person occlusions.
Apart from the huge success of the attention mechanism in natural language processing [vaswani2017attention, devlin2018bert], attention-based networks have achieved prominent performance in visual tasks, including image classification [wang2017residual], image segmentation [zhang2018context, yu2018learning, li2019attention] [dai2019second], multi-view stereo [luo2020attention] and hand pose estimation [huang2020hot]. In these works, the attention mechanism is applied to capture the correlation of embedding features or the context relationship of a hierarchical structure. In particular, Luo et al. [luo2020attention] propose an attention-aware network, AttMVS, to synthesize contextual information from multi-view scenes. An attention-guided regularization module is used for more robust prediction. In [huang2020hot], Huang et al. design a non-autoregressive transformer to learn the structural correlations among hand joints, which achieves real-time speed and state-of-the-art performance for 3D hand-object pose estimation.
An overview of our approach is illustrated in Fig 2. The inputs of DeepMultiCap are the segmented multi-view images of a single person together with the corresponding SMPL model, and the system outputs the reconstructed 3D human. The results for different persons are combined directly, with no need to modify their relative positions, since the multi-view setting ensures the 3D spatial relationship between different individuals.
To obtain the inputs, we first fit SMPL-X [pavlakos2019expressive] models to 3D keypoints estimated from the multi-view images by a 4D association algorithm [zhang20204d]. For multi-person segmentation, we adopt a self-correction method [li2019self] and use SMPL projection maps to track different characters in multi-view scenes. Finally, the 3D human is generated by the spatial attention-aware network based on the pixel-aligned implicit function, and further polished by the temporal fusion method when time information is available in video inputs, as described in detail in Section 4.
Our method is built on the implicit function. An implicit function represents the surface of a 3D model as a level set of an occupancy field function F, e.g., F(X) = 0.5. Specifically, PIFu [saito2019pifu] combines 3D points with conditional variables to formulate a pixel-aligned implicit function:
$$F(\Phi(x, I), z(X)) = s, \quad s \in [0, 1]$$
where, for an image $I$ and a given 3D point $X$, $x = \pi(X)$ is the 2D projection coordinate on the image plane, $z(X)$ is the depth value in the camera coordinate space, and $\Phi(x, I)$ is the image feature at location $x$. In PIFu, a multi-layer perceptron (MLP) is trained to fit the implicit function F.
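The pixel-aligned query described above (project a 3D point, sample the image feature at the projected location, then feed the feature and the depth to the implicit function) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the pinhole projection, the pixel mapping, and `toy_implicit_fn` (a stand-in for the trained MLP) are assumptions made for the example.

```python
import math

def project(point, focal=1.0):
    """Pinhole projection (assumed camera model): returns (x, y) and depth z."""
    X, Y, Z = point
    return (focal * X / Z, focal * Y / Z), Z

def bilinear_sample(feature_map, px, py):
    """Sample a grid of feature vectors at continuous pixel coordinates."""
    h, w = len(feature_map), len(feature_map[0])
    px = min(max(px, 0.0), w - 1.0)
    py = min(max(py, 0.0), h - 1.0)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = px - x0, py - y0

    def lerp(a, b, t):
        return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]

    top = lerp(feature_map[y0][x0], feature_map[y0][x1], tx)
    bottom = lerp(feature_map[y1][x0], feature_map[y1][x1], tx)
    return lerp(top, bottom, ty)

def toy_implicit_fn(feature, depth):
    """Hypothetical stand-in for the trained MLP; maps (feature, depth) to [0, 1]."""
    logit = sum(feature) - depth
    return 1.0 / (1.0 + math.exp(-logit))

def pixel_aligned_query(point, feature_map, implicit_fn):
    """Occupancy of a 3D point conditioned on its pixel-aligned image feature."""
    (x, y), z = project(point)
    h, w = len(feature_map), len(feature_map[0])
    # Map normalized image coordinates in [-1, 1] to pixel coordinates.
    px = (x + 1.0) * 0.5 * (w - 1)
    py = (y + 1.0) * 0.5 * (h - 1)
    return implicit_fn(bilinear_sample(feature_map, px, py), z)

feature_map = [[[0.0], [1.0]], [[2.0], [3.0]]]  # 2x2 grid, 1-channel features
occupancy = pixel_aligned_query((0.0, 0.0, 2.0), feature_map, toy_implicit_fn)
print(occupancy > 0.5)  # True would mean "inside" under the 0.5 level set
```

In practice the feature map comes from an image encoder and the query is evaluated on a dense grid of points, after which the 0.5 level set is extracted as a mesh.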
In order to improve the quality of reconstruction results, PIFuHD [saito2020pifuhd] keeps the original PIFu framework as a coarse level prediction while adding high resolution images to a fine level network:
$$F_H(\Phi(x, I_H, N_F, N_B), \Omega(X)) = s, \quad s \in [0, 1]$$
where $I_H$, $N_F$, $N_B$ are the high resolution image and the predicted frontal and back normal maps, and $\Omega(X)$ is the 3D embedding extracted from the intermediate features of the coarse level. More detailed human models can be reconstructed with the additional information brought by the increased resolution and the high-frequency details in the normal maps.
For multi-view images, a naive strategy is proposed in PIFu to synthesize multi-view features, i.e., performing mean pooling on the features from the intermediate layer of the MLP. However, this simple method may lead to loss of details and even collapse in real world cases, especially when the multi-view features are inconsistent due to varying depths in different views and occlusions.
4 Single Person Reconstruction
Reconstruction of a single person from multiple views is a challenging problem. The main concern is to extract the meta information of the observations from different views. To this end, we propose a novel feature fusion module based on the self-attention mechanism, which effectively makes the network aware of the geometry details shown in multi-view scenes. To tackle the inconsistencies and loss of information brought by occlusions, we combine the attention module with parametric models to enhance the robustness of reconstruction while preserving fine-grained details. The architecture of our network is illustrated in Figure 3. Following PIFuHD [saito2020pifuhd], our method builds on a two-level coarse-to-fine framework. The coarse level, conditioned on images and SMPL models, ensures a confident result, and the fine level refines the reconstruction by utilizing high resolution image feature maps. The results can be further polished by a temporal fusion method when time information is available for video inputs. With the proposed spatial attention and temporal fusion framework, the reconstruction remains robust and of high quality.
4.1 Attention-aware Multi-view Feature Fusion
In PIFu [saito2019pifu], the simple strategy for multi-view reconstruction is to average the multi-view feature embeddings from the intermediate layer of the MLP. We argue that this method is not effective enough at synthesizing the geometry details from multi-view scenes, and can lose information. As shown in Figure 4, when the strategy is applied to PIFuHD [saito2020pifuhd], we obtain a smoother output. In particular, the geometry features may not remain consistent, since the visible regions change between views. The mean pooling method cannot handle these cases effectively.
To capture correlations between different views, inspired by [vaswani2017attention], we propose a multi-view feature fusion method based on the self-attention mechanism. The detailed architecture of the module is illustrated in Figure 3. First, the input multi-view features are embedded with three different linear layers, and the self-attention mechanism is applied:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where $Q$, $K$, $V$ are the query feature, source feature and target feature embedded by linear weights, and $d_k$ is the embedding size. The dot-product result is divided by $\sqrt{d_k}$ to prevent the gradient vanishing problem.
Multi-head attention is used in our method, i.e., the multi-view features are encoded into different embedding subspaces, which allows the model to jointly attend to different geometry patterns across views. The weights of the views in the target feature V are obtained through a softmax function by calculating the similarity between views in the query feature Q and the source feature K. Confident observations in each view tend to have large weights and are maintained, while invisible regions, which lead to small weights, have little influence on the outputs.
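The view-weighting behaviour described above can be sketched with a single-head scaled dot-product self-attention over per-view feature vectors. This is a minimal illustration in plain Python: the projection matrices `w_q`, `w_k`, `w_v` are hypothetical placeholders for the learned linear layers, and the multi-head variant would run several such heads on separate embedding subspaces and concatenate the results.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features, w_q, w_k, w_v):
    """features: n_views x d_in; w_q, w_k, w_v: d_in x d_k projection matrices."""
    q = matmul(features, w_q)   # query feature
    k = matmul(features, w_k)   # source feature
    v = matmul(features, w_v)   # target feature
    d_k = len(q[0])
    scores = matmul(q, transpose(k))
    # Scale by sqrt(d_k) before the softmax so the scores do not saturate it.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]            # placeholder "learned" weights
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one feature vector per view
fused, weights = self_attention(views, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one and tells how strongly one view attends to every other view, in contrast to mean pooling, which fixes every weight at 1/n regardless of content.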
We stack the linear and attention layers to form the self-attention encoder, as proposed in [vaswani2017attention]. Finally, the meta-view prediction is generated by passing the multi-view features through the self-attention encoder and feeding the encoder's output feature to the implicit function, which predicts the occupancy field. The output meta-view feature is expected to contain the global spatial information. As demonstrated in Figure 4, when combining the attention module with PIFuHD [saito2020pifuhd], we are able to capture and preserve details as the number of observations increases.
4.2 Embedded with Parametric Body Model
Although the attention-aware feature fusion module effectively synthesizes details from multiple views, without auxiliary 3D information the network struggles to make a reasonable prediction when information is lost due to occlusions. To address this limitation, we combine the strengths of the attention mechanism and parametric models.
A parametric body model, e.g., SMPL, contains the pose and shape information of human bodies. The semantic feature of SMPL is extracted by a 3D convolution network for geometry inference. To improve the efficiency of the attention module, inspired by the position encoding introduced in [vaswani2017attention], we further design an informative view representation by rendering SMPL global normal maps. The global maps offer guidance for the network to identify the particular visible body parts in multi-view observations, so that geometry features can be synthesized for the corresponding parts. Specifically, to render the global normal maps, SMPL is transformed to the canonical model coordinate system, where an RGB color is obtained from the normal vector and a standard rendering procedure can be applied. In multi-person scenes, although the images of a single person can be fragmentary due to occlusions, the extra information provided by SMPL compensates for the missing parts and remains consistent under different views, which significantly improves the quality and robustness of the reconstruction results.
Both the coarse level and the fine level are conditioned on SMPL. Their inputs include the volumetric representation of SMPL, the SMPL semantic features, the predicted frontal normal map and the rendered SMPL global normal map under the camera view, and, for the fine level, the 3D embeddings from the coarse level.
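To make the global normal map idea concrete, the sketch below encodes a normal vector expressed in the canonical model coordinate system as an RGB color. The mapping from [-1, 1] to [0, 255] is the common convention for normal maps and is an assumption here; the text above does not specify the exact color encoding.

```python
import math

def normal_to_rgb(normal):
    """Encode a normal vector (canonical model coordinates) as an 8-bit RGB color.

    Assumed convention: each unit-normal component in [-1, 1] maps to [0, 255].
    """
    nx, ny, nz = normal
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("normal vector must be non-zero")
    return tuple(int(round((c / length * 0.5 + 0.5) * 255)) for c in (nx, ny, nz))

# Because the normals are taken in the canonical frame rather than the camera
# frame, a given body part keeps the same color in every rendered view.
front_facing = normal_to_rgb((0.0, 0.0, 1.0))
left_facing = normal_to_rgb((-1.0, 0.0, 0.0))
print(front_facing, left_facing)
```

Rendering the SMPL mesh with these per-vertex colors from each camera then yields one global normal map per view, which acts as a cue for which body part a pixel belongs to.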
4.3 Temporal Fusion
For moving characters in video inputs, inconsistencies can arise between consecutive frames because the visible parts change. To address this limitation, we propose a simple temporal fusion method. For each vertex of the reconstructed mesh at a given time step, we first calculate its blending weight from the set of its nearest SMPL vertices, combining the blending weights of those SMPL vertices according to a per-vertex weight. Given the estimated SMPL models at two time steps, the reconstructed vertices can then be warped from one time step to the other through the standard blend skinning procedure driven by the SMPL parameters. With the warped meshes, we calculate the signed distance field (SDF) and perform mean pooling over a sliding time window to generate temporally continuous reconstructions. In our approach, the window size is set to 3, which gives consistent results while maintaining details.
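The final averaging step can be sketched as follows, assuming the neighbouring frames have already been warped to the target frame and their SDFs sampled at the same query points. The centered window that is truncated at the sequence boundaries is an assumption of this sketch; the text only states that the window size is 3.

```python
def fuse_sdf(sdf_frames, center, window=3):
    """Mean-pool SDF samples over a sliding time window around frame `center`.

    sdf_frames[t][i] is the SDF value of query point i in frame t, after the
    mesh of frame t has been warped to the target frame.
    """
    half = window // 2
    first = max(0, center - half)
    last = min(len(sdf_frames) - 1, center + half)
    selected = sdf_frames[first:last + 1]
    return [sum(values) / len(selected) for values in zip(*selected)]

# Three frames, two query points. The second point flickers across the
# surface (sign change) in the middle frame.
frames = [
    [0.30, -0.10],
    [0.10, 0.02],
    [0.20, -0.12],
]
fused = fuse_sdf(frames, center=1)
print(fused)
```

In this toy example the fused value of the second point is negative again, so the single-frame flicker no longer moves that point to the other side of the zero level set.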
5 Extension to Multi-person Reconstruction
Multi-person reconstruction is implemented by reconstructing each individual separately. The key challenge is to train the network to remain robust against occlusions in interactive scenes. To this end, we use several strategies during training. We first collect 1700 single human models from Twindom (https://web.twindom.com/) to construct a large scale dataset. To simulate multi-person cases, we render images via taichi [hu2019taichi] and randomly project other persons onto the masks, so that various situations can be generated, from non-occluded to heavily occluded scenes. Besides, to make the network aware of visible details and leverage SMPL information for robust reconstruction, we use a sampling method based on the visibility of points. The input points during training are sampled from a Gaussian distribution centered at surface points, with the standard deviation as introduced in [saito2019pifu]. We further choose a smaller standard deviation for visible points to guide the network to learn fine-grained geometry details, and a larger one for invisible points to avoid unreasonable predictions, which we find improves performance in occluded scenes.
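The visibility-based sampling described above can be sketched as follows. The two standard deviations used in the example (0.01 and 0.05) are hypothetical values chosen only for illustration, and the visibility flags are assumed to be given (e.g., from rendering); neither comes from the text.

```python
import math
import random

def sample_training_points(surface_points, visible, sigma_visible, sigma_hidden, rng):
    """Perturb each surface point with Gaussian noise whose scale depends on visibility."""
    samples = []
    for point, is_visible in zip(surface_points, visible):
        sigma = sigma_visible if is_visible else sigma_hidden
        samples.append(tuple(c + rng.gauss(0.0, sigma) for c in point))
    return samples

def mean_offset(surface_points, samples):
    """Average distance between each surface point and its perturbed sample."""
    return sum(math.dist(p, s) for p, s in zip(surface_points, samples)) / len(samples)

rng = random.Random(0)                      # fixed seed for reproducibility
surface = [(0.0, 0.0, 0.0)] * 1000
near = sample_training_points(surface, [True] * 1000, 0.01, 0.05, rng)
far = sample_training_points(surface, [False] * 1000, 0.01, 0.05, rng)
print(mean_offset(surface, near), mean_offset(surface, far))
```

Samples drawn for visible points stay close to the surface, which concentrates supervision on fine detail, while samples for invisible points spread further out and mostly constrain the coarse shape.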
6 Dataset and Experiment
6.1 MultiHuman Dataset
Since no multi-person dataset is available to evaluate our method, we propose a high quality 3D model dataset, MultiHuman, which is collected using a dense camera rig equipped with 128 DSLRs and commercial photogrammetry software. The dataset contains 150 multi-person static scenes with 278 characters in total, mostly university students wearing casual clothes, dresses, etc. In each scene, the number of persons ranges from 1 to 3, and each model consists of about 300,000 triangles.
To evaluate the proposed approach, we divide the dataset into different categories by the level of occlusion and the number of persons. In particular, we split the dataset into 30 single human scenes, 18 occluded single human scenes (occluded by different objects), 46 naturally interactive two-person scenes, 30 closely interactive two-person scenes, and 26 scenes with three persons. For more examples from the MultiHuman dataset, please refer to the supplementary materials.
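As a quick consistency check, the category sizes listed above add up to the totals stated for the dataset (150 scenes and 278 characters). The short category names are labels chosen for this example.

```python
# (number of scenes, persons per scene) for each MultiHuman category
categories = {
    "single": (30, 1),
    "occluded single": (18, 1),
    "two persons, natural interaction": (46, 2),
    "two persons, close interaction": (30, 2),
    "three persons": (26, 3),
}

total_scenes = sum(scenes for scenes, _ in categories.values())
total_characters = sum(scenes * persons for scenes, persons in categories.values())
print(total_scenes, total_characters)  # 150 278
```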
Performance on MultiHuman. We compare our method with current state-of-the-art approaches, i.e., PIFu [saito2019pifu], PIFuHD [saito2020pifuhd] and PaMIR [zheng2020pamir] (PIFu + SMPL). All methods are trained with the same strategies on the dataset described in Sec. 5. For PIFuHD, the backside normal maps are not used in our implementation, and multi-view features are fused by mean pooling as introduced in [saito2019pifu, zheng2020pamir]. At test time, the ground truth models are normalized to a fixed height (in centimeters), and we then render images from 6 views as the inputs.
The point-to-surface distance and the Chamfer distance between the reconstruction and the ground truth geometry are used as evaluation metrics. Quantitative results are shown in Table 1. As occlusions intensify with an increasing number of persons and interacting elements, the errors of prior methods grow while ours remains competitive. Qualitative results are illustrated in Figure 6, which shows the advantage of our method and the large gap between prior works and ours in handling occlusions in multi-person scenes. Our method is able to robustly reconstruct highly detailed 3D humans even in closely interactive scenes.
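The two metrics can be sketched on point sets as follows. This is a simplified illustration: surfaces are approximated by sampled points, nearest neighbours are found by brute force, and the particular conventions (unsquared distances, mean instead of sum, averaging the two directions) are assumptions, since the text does not spell them out.

```python
import math

def nearest_distance(point, points):
    """Distance from `point` to its nearest neighbour in `points`."""
    return min(math.dist(point, q) for q in points)

def point_to_surface(reconstruction, ground_truth):
    """Mean distance from reconstructed points to the ground-truth samples."""
    return sum(nearest_distance(p, ground_truth) for p in reconstruction) / len(reconstruction)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: average of both one-sided mean distances."""
    return 0.5 * (point_to_surface(a, b) + point_to_surface(b, a))

reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(point_to_surface(reconstruction, ground_truth))
print(chamfer_distance(reconstruction, ground_truth))
```

The example shows why both metrics are reported: every reconstructed point lies on the ground truth, so the point-to-surface distance is zero, but the Chamfer distance is non-zero because one ground-truth point is not covered by the reconstruction.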
| PIFu (Mview + Mean) [saito2019pifu] | 1.131 | 1.220 | 1.402 | 1.522 | 1.578 | 1.620 | 1.745 | 1.831 | 1.780 | 1.564 |
Performance on Real World Data. We evaluate our method on the ZJU-MoCap dataset [peng2020neural], a multi-view real world dataset, in comparison with DeepVisualHull [huang2018deep], a volumetric performance capture method for sparse multi-view inputs, PIFuHD [saito2020pifuhd], and Neural Body [peng2020neural], a differentiable rendering method trained directly on the test image sequence. For inference, we re-implement DeepVisualHull and use the released code and pretrained models of PIFuHD and Neural Body. Figure 5 shows the state-of-the-art performance of our method on the benchmark. Reconstruction on real world images (6 views for our data and 8 views for the CMU dataset [joo2018total]) is demonstrated in Figure 1. For more results, please refer to our supplementary video.
6.3 Ablation Study
This section aims to find the factors that contribute to the prominence of our method. We achieve the state-of-the-art performance mainly by leveraging a self-attention network combined with SMPL and a temporal fusion method for consistent results. We then demonstrate how the approaches improve the reconstruction under different situations.
Variant 1: Self-attention Module. We design a self-attention module to better capture the details from different observations. To assess the strength of our multi-view feature fusion method, we combine the attention module with PIFu [saito2019pifu] (PIFu + Att) and PIFuHD [saito2020pifuhd] (PIFuHD + Att), and further evaluate our method's performance without the module (replaced by mean pooling). Quantitative results in Table 1 show that the module benefits the baseline models in both non-occluded and occluded scenes. PIFuHD with the attention module even outperforms ours on single human reconstruction, since the limitations brought by SMPL (Section 6.4) can lead to a lower accuracy for our method. For PIFu the improvement is marginal, indicating that the module is more effective at merging multi-view features when the detailed geometry information offered by image normal maps is available. Without the module, our method loses its competitive performance. Qualitative examples in Figure 4 further demonstrate how the module helps the baseline model maintain geometry details with increasing views.
Variant 2: Use of SMPL. SMPL is used in our method as a 3D proxy that helps the network generate a reasonable output, and we further design an SMPL global normal map (described in Section 4.2) to improve the robustness of reconstruction against occlusions and to preserve details. The large gap between PaMIR [zheng2020pamir] and ours indicates that SMPL is not the only factor contributing to our advantage. Table 1 shows the performance of our method without the designed global maps (Ours w/o SN). The results show lower reconstruction accuracy, which implies that the global maps are effective as a visual reference that guides the attention network to merge multi-view information.
Variant 3: Temporal Fusion. Figure 8 illustrates the results of our method with and without temporal fusion on real world image sequences. The temporal fusion method further enhances the reconstruction consistency, which can be seen more clearly in our supplementary video.
Since we use SMPL as a 3D reference, our method cannot reconstruct objects other than humans. For challenging clothes, Figure 9 demonstrates that we are able to reconstruct tight dresses, while for loose clothing such as a wind coat, the reconstruction can be unstable.
Besides, our method relies on an accurately fitted SMPL, i.e., the SMPL body must lie within the correct corresponding region. An inaccurate SMPL can lead to artifacts and failure cases (Figure 10).
7 Discussion and Future Works
Though our method is capable of reconstructing multiple persons from real world images, we rely on SMPL as a 3D reference. The camera parameters are required to estimate 3D keypoints and fit SMPL models. Besides, the attention network and image encoders are extremely memory-consuming, which restricts the inference efficiency. Future work can focus on network pruning to achieve real-time inference, and on designing more sophisticated approaches free from SMPL, which would make the system more widely applicable.