SparseFusion: Dynamic Human Avatar Modeling from Sparse RGBD Images

06/05/2020 ∙ by Xinxin Zuo, et al. ∙ University of Guelph, University of Alberta, University of Kentucky

In this paper, we propose a novel approach to reconstruct 3D human body shapes based on a sparse set of RGBD frames using a single RGBD camera. We specifically focus on realistic settings where human subjects move freely during the capture. The main challenge is how to robustly fuse these sparse frames into a canonical 3D model under pose changes and surface occlusions. This is addressed by our new framework, which consists of the following steps. First, based on a generative human template, an initial pairwise alignment is performed for every two frames having sufficient overlap. It is followed by a global non-rigid registration procedure, in which partial results from RGBD frames are collected into a unified 3D shape under the guidance of correspondences from the pairwise alignment. Finally, the texture map of the reconstructed human model is optimized to deliver a clear and spatially consistent texture. Empirical evaluations on synthetic and real datasets demonstrate, both quantitatively and qualitatively, the superior performance of our framework in reconstructing complete 3D human models with high fidelity. It is worth noting that our framework is flexible, with potential applications going beyond shape reconstruction. As an example, we showcase its use in reshaping and reposing to a new avatar.




I Introduction

3D modeling or reconstruction of human bodies is an important topic that has a wide range of applications in areas such as virtual reality, gaming, virtual try-on, and teleconferencing. Many scanning systems under a multi-view setup [Tong12Scanning, Alexiadis13Real, Dou13Scanning, dou2016fusion4d, Vlasic09Multiview] have been developed over the years, and they have achieved impressive results. Such a system, on the other hand, is usually not portable and can be rather expensive. Rather than building on these sophisticated setups, in this paper we propose to reconstruct complete 3D human body shapes from a sparse set of frames taken by a single commodity-level RGBD camera. It is a challenging task, especially in the presence of non-rigid articulated motions and surface occlusions.

The problem of recovering 3D models of deformable objects from a single depth camera has recently been studied. As an extension to the celebrated KinectFusion [Newcombe11KinectFusion] system, a dynamic fusion [Newcombe15DynamicFusion] approach has been developed which takes non-rigid motion into account by solving a non-rigid warp field for every frame. However, it cannot handle fast motion, and the tracking error accumulates as the sequence proceeds. To address these issues, several follow-up systems have been proposed that exploit either sparse feature correspondences [Innmann16voluedeform], dense color information along the sequence [Guo17Real], or articulated motion constraints [Yu17bodyfusion, Yu18doublefusion] for more robust tracking, and that enforce loop closure [Dou15scanning, Wang18Dynamic] to recover a complete shape. The improved performance comes at a cost: these systems rely on both a continuous image sequence and reliable, continuous dense tracking over the entire sequence, which is computationally expensive and involves much redundant information. To address this issue, we propose to instead consider only a sparse set of RGBD frames as input. The most related works are those of Li et al. [Hao13selfportrait] and Shapiro et al. [shapiro2014rapid], which take several frames from an RGBD camera as input. However, in these previous works the user has to maintain a certain static pose while rotating in front of the camera, which is difficult to hold in practical settings. In contrast, our proposed approach is capable of handling situations where human subjects are allowed to have significant pose changes.

To achieve this goal, we exploit the Skinned Multi-Person Linear model (SMPL) [Loper15SMPL] as a generative human template to register sparse frames of the human subject into a canonical model. First, the SMPL parameters are optimized to closely fit the partial scans generated from the input depth images. Then, every two partial scans that have sufficient overlap are aligned using correspondences transferred via the SMPL template model. Starting from this pairwise alignment, a global non-rigid registration procedure deforms all partial pieces into canonical coordinates, guided by the correspondences acquired from the pairwise registration. After obtaining the 3D body shape, a texture optimization approach is proposed to attach clear and consistent texture maps to the 3D model. During the texturing process, we take the non-rigid deformation into account and deal with possible misalignment by computing a warping field for each image successively.

The proposed approach is examined on both synthetic and several real datasets captured with a single depth sensor. As demonstrated by the experiments, our approach is capable of generating complete and high quality human avatars from a very sparse set of RGBD frames.

The main contribution of this paper is that, instead of taking a continuous depth sequence as input and fusing the sequence into a canonical model, we propose to use sparse RGBD frames to reconstruct a complete human avatar free from accumulation error. Unlike previous 3D self-portrait methods, which usually assume static poses during the capture, we allow large pose variations by exploiting a statistical human template for the registration.

As an interesting application, we can synthesize the reconstructed avatar by changing its shape and pose. A personalized SMPL model is built from the reconstructed human avatar. To achieve this goal, we propose a hierarchical representation of the reconstructed model with sparse control vertices mapped to the SMPL template, and the deformation of the reconstructed surface mesh is driven by those vertices. In this way, we could take advantage of the SMPL model in expressing human poses and shapes while still maintaining the surface details of the reconstructed model.

II Related Literature

In this section, we review the related efforts on human body modeling. They could be roughly partitioned by the input modality and whether any human template is involved in the reconstruction.

II-A Human modeling with color images

The problem of 3D human body reconstruction has been studied for decades under the multi-view stereo setup [Wu11fusing, Aguiar08, zhu2016video], where multiple color images are taken as input. Typically, these methods exploit both the correspondence cues between images of neighboring views and the temporal consistency along the sequence to reconstruct the surface. The multiple cameras involved are assumed to be synchronized and calibrated. Although very impressive results have been achieved, this controlled setup is mostly suitable only for a laboratory setting.

On the other hand, recent monocular human modeling approaches [NIPS2017Tung, varol2018bodynet, omran2018neural, alldieck2019learning, alldieck2018detailed, huang2018deep, natsume2019siclope, saito2019pifu, kolotouros2019convolutional] have shown compelling reconstruction results of human bodies from images in the wild. For example, Kanazawa et al. [kanazawa2018end] proposed an end-to-end framework to directly regress the parameters of a statistical body template from a single color image. A number of follow-up efforts incorporate additional information, including body silhouettes, shading information [natsume2019siclope, zhu2019detailed, alldieck2019learning], or mutual constraints across multiple images [liang2019shape, huang2018deep], to train a neural network. Another branch of investigation employs volumetric representations [varol2018bodynet, zheng2019deephuman], depth maps [tang2019neural], or UV maps [alldieck2019tex2shape] for the deep neural network. For instance, BodyNet [varol2018bodynet] learned to directly generate a voxel representation of the person using a deep neural network. However, due to the high memory requirements of voxel representations, fine-scale details are often missing in the output. Instead, PIFu [saito2019pifu] regressed an implicit surface representation that locally aligned pixels with the global context of the corresponding 3D object. Unlike voxel-based representations, this implicit per-pixel representation is more memory efficient. Despite the widespread usage of learning-based methods, the reconstructed human body usually lacks sufficient surface details. More importantly, the inherent depth ambiguity of the color image prevents the reconstructed human body from fitting closely to the real surface.

II-B Human modeling with depth images

The advent of affordable consumer-grade RGB-D cameras has brought about a profound advancement of human modeling approaches. Several methods [Newcombe15DynamicFusion, Innmann16voluedeform, Guo17Real, Yu17bodyfusion, Yu18doublefusion] use only a single depth sensor for non-rigid object reconstruction. In these fusion-based approaches, the surface is reconstructed incrementally by tracking each frame along the RGBD sequence and updating the canonical model. First, as an extension to the KinectFusion system [Newcombe11KinectFusion], a dynamic fusion approach [Newcombe15DynamicFusion] was proposed to handle non-rigid motion by solving a non-rigid warp field for every frame. Later on, sparse feature information [Innmann16voluedeform] and dense color correspondences [Guo17Real] in the sequence were incorporated to improve the robustness of surface tracking. Besides, Yu et al. [Yu17bodyfusion] enforced skeleton constraints in the typical fusion pipeline to get better performance on both surface fusion and skeleton tracking. A more robust fusion approach [Yu18doublefusion] was subsequently proposed that tracks both the inner and outer surfaces, but it assumes an A-pose as the starting pose. These methods allow the user to move more freely. However, as the sequence proceeds, the almost inevitable drift makes it difficult to recover a complete model without loop closure.

To tackle the above-mentioned problem and build 3D self-portraits, there are efforts [Dou15scanning, Hao13selfportrait, Tong12Scanning, shapiro2014rapid, cui2012kinectavatar, mao2017easy, lin2016fast] that first generate partial pieces and then handle the error accumulation problem with a global registration. For instance, Shapiro et al. [shapiro2014rapid] aligned depth images from four static poses taken at 90 degree angles relative to each other with their proposed piecewise rigid registration method. Similarly, Li et al. [Hao13selfportrait] took eight partial scans as input and registered them globally with a non-rigid deformation approach. Mao et al. [mao2017easy] took 18 depth frames as input for human modeling. However, these methods always assume that the subject holds the same static pose during capture. To keep the pose as constant as possible during the capture, a turntable was used in [lin2016fast]. On the other hand, Dou et al. [Dou15scanning] allowed freer movement and proposed a non-rigid bundle adjustment method to align the partial pieces. Although impressive results were obtained, the bundle adjustment can be quite computationally expensive and time-consuming due to the large number of unknowns and the large search space.

Using a single depth sensor for human modeling is challenging as we need to handle the occlusion problem and the non-rigid motion. To meet this challenge, multiple depth sensors were exploited for dynamic surface modeling [Dou13Scanning, dou2016fusion4d]. For example, as the current state-of-the-art approach, Fusion4D [dou2016fusion4d] proposed a system for live multi-view performance capture, generating temporally coherent high-quality reconstructions in real-time. Although surfaces with great details have been reconstructed, the system is rather expensive and again takes extra effort to calibrate and synchronize the sensors.

II-C Template-based human body modeling

For human body modeling, the idea of incorporating a human template has also attracted much attention. Early models were based on simple primitives [metaxas1993shape, gavrila19963]. Recent statistical human body models, such as SCAPE [anguelov2005scape] and the SMPL model [Loper15SMPL], were learned from thousands of scans of human bodies. The pose and shape deformations are encoded in the parametric model. Therefore, instead of recovering the 3D vertices on the surface, researchers [Bogo15Detailed, Zhang14Quality] set out to obtain the pose and shape coefficients of the statistical model. For instance, a SCAPE-based parametric human model was used in [Bogo15Detailed] with a displacement map to represent the skin details. However, they did not take the surface deformation caused by clothing into account but assumed that the captured human subject is almost naked. In [fechteler2019markerless, achenbach2017fast], a kinematic skinning model was used for human pose and shape reconstruction from the 3D point cloud acquired by multi-view stereo methods. Alldieck et al. [alldieck2018detailed, alldieck2018video] took a monocular video sequence as input and exploited the SMPL model for coarse shape and pose estimation, together with human silhouettes and image shading information for more detailed reconstruction. As reviewed in Section II-A, the parametric human template also plays an important role in recent learning-based approaches, as only a small number of parameters are needed for regression.

Instead of employing a general human template, there are endeavors [habermann2019livecap, xu2018monoperfcap, Zollhofer14real, Guo15Robust] that take pre-scanned human models as template for human performance capture. They are more related to surface tracking and the problem becomes easier to handle as the overall shape is already available. Furthermore, Yu et al. [yu2019simulcap] also incorporated cloth simulation during the tracking procedure to model the deformation of inner body and outer cloth separately.

In general, template-based approaches are more reliable in handling occlusions and complex motion, and they work well when the input is limited, such as a single image or a few images. In this paper, we utilize a probabilistic human template model to achieve more robust fusion under large pose changes, but still retain the surface details in the reconstructed model by using free-form deformation similar to the template-free approaches.

III Approach

We are given sparse frames captured with a human subject under different poses with different body orientation. Therefore, for each frame we have a partial scan of the human body and our goal is to build up a complete model by fusing all those partial scans. In the following equations, denotes the partial scans obtained from the depth images and are the corresponding color images. In this paper, the SMPL model [Loper15SMPL] is used to register sparse frames into a canonical model.

The SMPL model is a skinned vertex-based model which parametrizes a triangulated mesh by pose parameters $\theta$ and shape parameters $\beta$ (notation as in [Loper15SMPL]). The shape parameters are coefficients of a low-dimensional shape space, learned from a training set of thousands of registered 3D human body scans. The pose parameters represent the joint angles in an axis-angle representation of the relative rotation between body parts. Given the shape and pose parameters, the posed body model is formulated as

$$M(\beta, \theta) = W\big(T(\beta, \theta),\, J(\beta),\, \theta,\, \mathcal{W}\big), \qquad T(\beta, \theta) = \bar{T} + B_s(\beta) + B_p(\theta),$$

where $\bar{T}$ is the base template mesh, and $B_s(\beta)$ and $B_p(\theta)$ are vectors of vertices representing offsets from the base template as controlled by the shape and pose parameters respectively. Therefore, $T(\beta, \theta)$ is the mesh of the base template with the addition of both shape and pose blend shapes. $J(\beta)$ gives the joint positions under the rest pose as controlled by the shape parameters. $W$ is a blend skinning function which transforms the mesh from the rest pose to the current pose as controlled by the blending weights $\mathcal{W}$. More details about the SMPL model can be found in [Loper15SMPL].
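The two stages described above (adding shape and pose offsets to a base template, then skinning the result) can be illustrated with a minimal NumPy sketch. It assumes standard linear blend skinning and assumes that the per-joint 4x4 transforms (rest pose to current pose) have already been computed from the pose parameters; the function names `shaped_template` and `linear_blend_skinning` are illustrative and are not taken from the paper or from any SMPL library.

```python
import numpy as np

def shaped_template(t_bar, shape_offsets, pose_offsets):
    """Base template plus shape and pose blend-shape offsets, each (N, 3)."""
    return t_bar + shape_offsets + pose_offsets

def linear_blend_skinning(vertices, weights, joint_transforms):
    """Pose a mesh with linear blend skinning.

    vertices:         (N, 3) rest-pose vertex positions
    weights:          (N, K) blending weights, each row summing to 1
    joint_transforms: (K, 4, 4) rest-to-current transforms (assumed precomputed)
    """
    n = vertices.shape[0]
    homogeneous = np.hstack([vertices, np.ones((n, 1))])            # (N, 4)
    # Blend the joint transforms per vertex, then apply them.
    blended = np.einsum("nk,kij->nij", weights, joint_transforms)   # (N, 4, 4)
    posed = np.einsum("nij,nj->ni", blended, homogeneous)           # (N, 4)
    return posed[:, :3]
```

A vertex bound half to a static joint and half to a joint translated by one unit moves by half a unit, which is the expected behavior of linear blending.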

An overview of our method is shown in Figure 1. First, we optimize the SMPL model so that it fits each partial scan. Afterwards, we align every two partial pieces that have a large overlap region by using the correspondences conveyed by the SMPL model. Finally, we register all pieces together with a global non-rigid registration approach. The model is further textured with our texture mapping procedure as described in Section III-D.

Figure 1: System Pipeline.

III-A Initial fitting

For every frame of the RGBD images, we solve the pose and shape parameters of the SMPL model so that the generated 3D human model fits as closely as possible to the captured RGBD image. For each frame and , we achieve this by minimizing the following objective:


The data term is defined as:


First, we have the surface fitting term so that for each vertex in the surface , we minimize its distance to the closest vertex on the generated SMPL model :
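As an illustration of this kind of closest-vertex data term, the following sketch sums squared distances from each scan vertex to its nearest model vertex. The squared-distance form and the brute-force search are assumptions made for brevity; the function names are hypothetical, and a spatial index such as a k-d tree would normally replace the dense distance matrix for real meshes.

```python
import numpy as np

def nearest_indices(src, dst):
    """For each point in src (N, 3), return the index of and squared
    distance to its nearest point in dst (M, 3), by brute force."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)   # (N, M)
    return d2.argmin(axis=1), d2.min(axis=1)

def surface_fitting_energy(scan_vertices, model_vertices):
    """Sum of squared distances from scan vertices to the closest
    model vertices (an assumed form of the surface fitting term)."""
    _, d2 = nearest_indices(scan_vertices, model_vertices)
    return float(d2.sum())
```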


The joints fitting term is formulated to match the model joints to the joints of the partial scans (denoted as ).

The joint term relies on a function that transforms each joint from its rest pose to its current position, as controlled by the pose parameters through the kinematic chain defined by the human skeleton. We compute the 2D joint locations in the color image using OpenPose [Cao18Openpose], after which the 3D human joints are estimated by back-projecting the 2D joints into 3D space with the depth information. The residuals are passed through a robust Geman-McClure penalty function [Geman87penalty]. This term is important for handling large pose changes.
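The back-projection and robust penalty mentioned here can be sketched as follows. The sketch assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a depth map registered to the color image, and one common form of the Geman-McClure function, $\rho(r) = r^2 / (r^2 + \sigma^2)$; some implementations scale it by $\sigma^2$. None of these names or constants come from the paper.

```python
import numpy as np

def back_project(joints_2d, depth, fx, fy, cx, cy):
    """Lift 2D joints (J, 2), given as pixel (u, v), to 3D camera
    coordinates using a depth map (H, W) and pinhole intrinsics."""
    u = joints_2d[:, 0]
    v = joints_2d[:, 1]
    # Sample the depth at the nearest pixel (row = v, column = u).
    z = depth[np.round(v).astype(int), np.round(u).astype(int)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def geman_mcclure(residual, sigma):
    """Robust penalty that saturates at 1 for large residuals."""
    r2 = np.asarray(residual, dtype=float) ** 2
    return r2 / (r2 + sigma ** 2)
```

In practice, joints that land on missing depth (for example a zero reading) would need to be discarded before being used as constraints; the sketch does not handle this.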


The other term is a pose regularization term, formulated as below, which penalizes unusual poses. It is defined by a Gaussian mixture model trained on the CMU dataset [CMUmocap19], in which each component is a Gaussian distribution with its own mean and variance.
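A pose prior of this kind is commonly evaluated as the negative log-likelihood of the pose vector under the mixture. The sketch below shows that computation with NumPy; the exact form used in the paper is not recoverable from this text, so the function and its arguments (`weights`, `means`, `covs`) are illustrative assumptions.

```python
import numpy as np

def gmm_neg_log_likelihood(theta, weights, means, covs):
    """Negative log-likelihood of a pose vector theta (D,) under a
    Gaussian mixture with the given component weights, means (D,)
    and covariance matrices (D, D)."""
    d = theta.shape[0]
    log_terms = []
    for w, mu, cov in zip(weights, means, covs):
        diff = theta - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)   # Mahalanobis distance
        log_terms.append(np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    log_terms = np.array(log_terms)
    # Log-sum-exp over the components for numerical stability.
    m = log_terms.max()
    return float(-(m + np.log(np.exp(log_terms - m).sum())))
```

Poses far from every component mean receive a larger value, which is what lets the term discourage implausible joint configurations during fitting.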


We get the shape and pose parameters for each piece by minimizing the above objective function so that the optimized SMPL model will fit to the partial scans.

Furthermore, since all partial scans come from the same human subject, they should share a consistent body shape. Therefore, we propose a bundle adjustment approach that refines the shape and pose parameters by minimizing the total misalignment error of all partial pieces to the SMPL model, with respect to a single consistent body shape and the individual pose of each piece. Mathematically, the objective function is formulated as below,


We initialize the pose parameters with those computed separately from each piece. The shape parameters are initialized with those computed from a frontal piece. Figure 2 shows the fitting results, in which the optimized SMPL model fits the input partial scans.

Figure 2: Initial Fitting results. (a) is the input RGBD frame and we show the detected joints on the color image. (b) shows the optimized SMPL aligned with the input scan. (c) shows the deformed input scan that fits even better to the SMPL model.

III-B Template guided pairwise alignment

After we get the optimized SMPL model that fits the input RGBD images, we take it as guidance for the initial alignment of the partial scans. Before that, since no SMPL model will fit the input mesh perfectly because of the casual clothes, we further deform the input mesh onto the optimized SMPL model to get better alignment, as shown in Figure 2(c). After that, we establish correspondences from every input scan to the optimized SMPL model via nearest-neighbor search. The correspondences between every two input scans are then established via the SMPL model.

Similar to the registration approach proposed in [li2008global], we register partial scans by exploiting the Embedded Deformation Model (EDM) [Sumner07EDM] to parametrize the mesh. Unlike the previous registration method, which requires the partial scans to be close to each other so as to have a proper initialization, we get the correspondences between the partial scans via the SMPL model. We describe the proposed method in detail below.

For the deformation model, a set of graph nodes () are uniformly sampled throughout the mesh, and for each node , it has an affine transformation specified by a matrix and a translation vector . For each vertex on the mesh it is controlled and deformed by its nearest graph nodes with a set of weights:
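For reference, the embedded deformation of a single vertex can be sketched as below, following the standard formulation of the Embedded Deformation Model in which each node applies its affine transformation about its own position and the results are blended by the weights. The argument names are illustrative, and the weights are taken as given rather than computed from node distances.

```python
import numpy as np

def deform_vertex(v, nodes, A, t, node_ids, w):
    """Deform vertex v (3,) by its controlling graph nodes.

    nodes:    (G, 3) graph node positions
    A:        (G, 3, 3) per-node affine matrices
    t:        (G, 3) per-node translation vectors
    node_ids: indices of the nodes that control v
    w:        blending weights for those nodes, summing to 1
    """
    out = np.zeros(3)
    for j, wj in zip(node_ids, w):
        # Transform v in the local frame of node j, then blend.
        out += wj * (A[j] @ (v - nodes[j]) + nodes[j] + t[j])
    return out
```

With identity matrices and zero translations the vertex stays in place, and a common translation on all controlling nodes moves the vertex by exactly that translation.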


We compute the deformation from to by building a graph for the mesh and estimate the deformation parameters (denoted as ) and (denoted as ) by minimizing the following objective function:


The term serves as the as-rigid-as-possible term preventing arbitrary surface deformation.


The smoothness term ensures smooth deformation of neighboring graph nodes.
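The two regularizers can be sketched in the form used by the original Embedded Deformation Model; this is an assumption, since the exact equations are not recoverable from this text. The rigidity energy penalizes deviation of each node matrix from a rotation (orthogonal, unit-length columns), and the smoothness energy penalizes disagreement between a node's prediction of its neighbor's position and the neighbor's own deformed position.

```python
import numpy as np

def rigidity_energy(A):
    """Deviation of each 3x3 node matrix in A (G, 3, 3) from a rotation."""
    total = 0.0
    for Aj in A:
        c1, c2, c3 = Aj[:, 0], Aj[:, 1], Aj[:, 2]
        # Columns should be mutually orthogonal ...
        total += (c1 @ c2) ** 2 + (c1 @ c3) ** 2 + (c2 @ c3) ** 2
        # ... and of unit length.
        total += (c1 @ c1 - 1) ** 2 + (c2 @ c2 - 1) ** 2 + (c3 @ c3 - 1) ** 2
    return float(total)

def smoothness_energy(nodes, A, t, edges):
    """Sum over directed edges (j, k) of the squared difference between
    where node j sends node k and where node k sends itself."""
    total = 0.0
    for j, k in edges:
        predicted = A[j] @ (nodes[k] - nodes[j]) + nodes[j] + t[j]
        actual = nodes[k] + t[k]
        total += np.sum((predicted - actual) ** 2)
    return float(total)
```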


The data term penalizes the distances between correspondences on the two pieces, which are extracted through the optimized SMPL models fitted to the source piece and to the target piece. Specifically, for a vertex on the source piece, we find its nearest vertex on the SMPL model fitted to that piece, within a certain threshold. We then take the vertex with the same vertex index on the SMPL model fitted to the target piece, and find its nearest vertex on the target mesh. The distance between the source vertex and this target vertex is minimized.
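The correspondence transfer through the template can be sketched as follows, under the assumption that both fitted SMPL meshes share the same vertex indexing (which holds for SMPL by construction). The function names and the `max_dist` parameter are illustrative; only the first nearest-neighbor query is thresholded here, as in the description, although a second threshold on the final query may also be sensible.

```python
import numpy as np

def nearest(point, cloud):
    """Index of and distance to the point in cloud (M, 3) nearest to point (3,)."""
    d = np.linalg.norm(cloud - point, axis=1)
    i = int(d.argmin())
    return i, float(d[i])

def transfer_correspondences(scan_i, smpl_i, smpl_j, scan_j, max_dist):
    """Pair vertices of scan_i with vertices of scan_j via the two
    fitted SMPL meshes, which share vertex indices."""
    pairs = []
    for vi, v in enumerate(scan_i):
        k, dist = nearest(v, smpl_i)          # scan i -> SMPL fitted to i
        if dist > max_dist:
            continue                          # too far from the template
        vj, _ = nearest(smpl_j[k], scan_j)    # same index on SMPL j -> scan j
        pairs.append((vi, vj))
    return pairs
```

Because the hop between the two SMPL meshes is by index rather than by distance, the two scans do not need to be close to each other in space for correspondences to be found.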


To get better alignment, we use the color information to refine the initial registration. In detail, every partial scan is first textured with its corresponding color image. After the above registration, we have a deformed source mesh that is aligned to the target mesh. We render a color image from the deformed mesh into the same image space as the color image of the target piece. We then compute a flow field between the rendered image and the target color image, and map the flow correspondences to the meshes. Finally, the deformation from the source mesh to the target mesh is further optimized using the EDM by enforcing the color correspondences. We show a pairwise registration result in Figure 3, where we are able to align pieces that have large pose variation. In Figure 3(c), the overlaid meshes already appear well aligned without color information. However, misalignment still exists, which can be seen clearly when we attach color onto the meshes. Therefore, we enforce the color correspondences to resolve this issue, and Figure 3(d) shows the resulting overlaid meshes.

Figure 3: Pairwise registration results. (a) and (b) are two sampled pieces. We also demonstrate the overlay of the optimized SMPL model and the input scans below. The mesh of (a) is deformed onto the mesh of (b). (c) shows our registration result of mesh (a) and mesh (b) without color information. (d) shows our registration result of mesh (a) and mesh (b) with color information. We also display the overlaid meshes with color attached to demonstrate the effectiveness of the color information for the registration.

Topology Change. Another important property of our pairwise registration method is that we can deal with topology changes quite conveniently by exploiting the information provided by the human template. That is, we extract body part information from the optimized template model and assign a body part to each vertex of the input mesh. First, we delete the faces whose vertices neither belong to the same body part nor belong to body parts that have a parent or child relationship. Next, while building up the embedded graph, we only connect graph nodes that belong to the same body part or to neighboring parts. Meanwhile, we add the constraint that a vertex is controlled only by graph nodes belonging to either the same body part or the neighboring parts defined by its parent or child nodes. We show an example of pairwise registration of two partial pieces that have topology changes in Figure 4. When we deform the mesh of Figure 4(a) to the mesh of Figure 4(b), where the topology has changed, the deformation cannot be carried out correctly without explicitly handling the topology change (Figure 4(c)). However, the problem can be resolved with our method by taking advantage of the semantic information contained in the template. The deformed mesh with our approach is shown in Figure 4(d), which aligns well with the target mesh as shown in Figure 4(e).
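The face-deletion rule can be sketched in a few lines of Python. The sketch interprets the rule pairwise: a face is kept only if every pair of its vertices lies on the same body part or on parts in a direct parent/child relation. The part labels and the `parent` dictionary are hypothetical; in practice they would come from the skeleton hierarchy of the template.

```python
def parts_compatible(a, b, parent):
    """True if parts a and b are identical or directly parent/child."""
    return a == b or parent.get(a) == b or parent.get(b) == a

def filter_faces(faces, vertex_part, parent):
    """Keep only the faces whose three vertices lie on compatible parts.

    faces:       list of (i, j, k) vertex index triples
    vertex_part: body-part label for each vertex index
    parent:      dict mapping a part label to its parent part label
    """
    kept = []
    for f in faces:
        pa, pb, pc = (vertex_part[v] for v in f)
        if (parts_compatible(pa, pb, parent)
                and parts_compatible(pb, pc, parent)
                and parts_compatible(pa, pc, parent)):
            kept.append(f)
    return kept
```

The same `parts_compatible` check could be reused when deciding which graph nodes to connect and which nodes may control a given vertex.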

Figure 4: Pairwise registration results with topology changes. (a) and (b) are two sampled pieces. We try to deform mesh (a) onto mesh (b) where the topology has changed. (c) shows the deformed mesh of (a) without taking the topology change into account. (d) shows the deformed mesh using our approach. (e) is the overlay of deformed mesh and the target mesh.
Figure 5: Texture mapping results.

III-C Global alignment

After the initial alignment, we are able to establish correspondences between those partial pieces, with which we can align them globally into a canonical model. Similar to the registration of two partial pieces, we exploit the Embedded Deformation Model here to extrapolate the deformation field. It means for every partial piece() we have a deformation graph embedded with it and our goal will be to solve those graph parameters(, ) altogether. The objective function is formulated as,


The first two terms are the as-rigid-as-possible and smoothness terms respectively, as defined in Equations 12 and 13. The third term, defined below, is the data term enforcing the correspondences between partial scans obtained from the above pairwise initial alignment.


The data term is defined over every two pieces that have sufficient overlap, using the correspondence set obtained for them from the pairwise alignment. The deformed mesh of one piece is supposed to fit onto the target mesh of the other, as controlled by the correspondences. In addition, the vertices of the reference frame are held fixed as constraints.

Finally, with all those input partial pieces deformed to a canonical space, we apply Poisson surface reconstruction and get the final fused human model.

III-D Texture optimization

In some applications such as free-viewpoint video generation and teleconference, a 3D geometric human body is not enough and we want the model to be textured. Previous human model scanning systems that use a single RGBD camera usually output models with per-vertex color since it is rather difficult to maintain and update the texture atlas during the fusion process. However, the per-vertex color could be very blurry (as shown in Figure  5(a)) when the resolution of the mesh is not high enough. Therefore, instead of computing per-vertex color we attach texture maps onto the model. The input is the reconstructed human model together with those partial pieces aligned to the canonical model as well as their corresponding color images. Our goal is to generate a consistent and clear texture map for the 3D human model given the input.

Some texture mapping methods project the meshes onto multiple image planes and then adopt a weighted average blending strategy to synthesize model textures. However, the generated texture is still blurry in our case (as shown in Figure 5(b)) because misalignment between the partial pieces still exists, which means the textures from different images are not perfectly matched. Previous approaches [gal10seamless] tackled the misalignment problem by selecting the textures from multiple views while minimizing the seams. This also fails in our case (as shown in Figure 5(c)), as we only have sparse input frames. Therefore, instead of directly synthesizing from multiple images, we eliminate possible misalignment by optimizing a warping field for every image consecutively before attaching the images to the mesh model. We describe our texture optimization approach below.

Figure 6: Comparison of reposing of a human avatar.

Starting from the reference frame, we attach the corresponding image onto the reconstructed mesh model by projecting the mesh onto the image plane and computing the texture coordinates for every face that is visible in the reference frame. For the next neighboring frame, we deform the reconstructed human model onto that frame's mesh using the correspondences acquired from the above global registration. Then, we render a color image from the current textured human mesh model with respect to the view direction of that frame. We also have the captured color image for the same frame. Any misalignment between the rendered image and the captured image will cause visual seams if we attach the captured image directly onto the current human mesh. To address this problem, instead of adjusting the texture coordinates for each face in the 3D mesh, which is difficult to optimize, we find a warping field for the captured image in the image plane so that the warped image is well aligned with the rendered image. In detail, we first detect the overlap region of the texture map between the two images. A flow field is computed between the captured and rendered images over the overlap region. Next, we propagate the flow field to the non-overlap part by minimizing the following objective function, from which the overall warping field is estimated,


where the first term keeps the warping field close to the estimated flow field in the overlap region, and the second term keeps the warping field as smooth as possible so that the flow propagates to the non-overlap region. Finally, the last term is a boundary term that sets constraints for pixels that are not connected to the overlap regions.
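The propagation of a flow field from the overlap region to the rest of the image can be illustrated as a small linear least-squares problem with a data term on overlap pixels and a smoothness term between neighboring pixels. The sketch below is a dense toy version suitable only for tiny grids; it omits the boundary term, and the weights `lambda_data` and `lambda_smooth` are hypothetical defaults rather than the values used in the paper.

```python
import numpy as np

def propagate_flow(flow, overlap, lambda_data=1.0, lambda_smooth=1.0):
    """Extend a flow field known on `overlap` to the whole grid.

    flow:    (H, W, 2) flow vectors, only read where overlap is True
    overlap: (H, W) boolean mask of the overlap region
    Returns the (H, W, 2) warping field minimizing
    lambda_data * |u - flow|^2 on overlap pixels plus
    lambda_smooth * |u_p - u_q|^2 over 4-neighbor pairs.
    """
    h, w = overlap.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    sd, ss = np.sqrt(lambda_data), np.sqrt(lambda_smooth)
    rows, rhs = [], []
    for y in range(h):
        for x in range(w):
            if overlap[y, x]:                       # data term
                r = np.zeros(n)
                r[idx[y, x]] = sd
                rows.append(r)
                rhs.append(sd * flow[y, x])
            for dy, dx in ((0, 1), (1, 0)):         # smoothness term
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    r = np.zeros(n)
                    r[idx[y, x]] = ss
                    r[idx[yy, xx]] = -ss
                    rows.append(r)
                    rhs.append(np.zeros(2))
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol.reshape(h, w, 2)
```

For real image sizes the same system would be assembled as a sparse matrix and solved with a sparse solver, since the dense matrix here grows quadratically with the number of pixels.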

Afterwards, we select the optimal texture image for each face of the human model to generate the final texture maps. In Figure 5, we show the texture mapping results with and without our texture optimization procedure.

IV Implementation details

To capture the real dataset, we used the Kinect V2, and the human subject was asked to rotate in front of the camera. We do not assume any specific pose or slow motion during capture. We captured twelve frames for each human subject, although we used only two or four frames in some cases, as demonstrated in Section VI. The captured depth maps are quite noisy, so they are smoothed as a preprocessing step before fusion.

The parameters , in the initial fitting objective function are set to be 7.5, 2.0 respectively. For the deformation model, is set to be 0.2, is 0.5 and is 1.0. For each input scan, we evenly sample 500 nodes over the mesh to build up the deformation graph. During the warping field computation in texture optimization process, is set to be 0.8 and is 1.0. Those parameters are manually tuned and kept fixed in all the experiments shown in the paper.

We implement most parts of our framework in Matlab. We run the algorithm on a desktop with an 8-core 3.2GHz Intel CPU and 32 GB of memory. The overall framework takes approximately 490s. In detail, the initial fitting takes about 14s for every piece, the pairwise registration takes 116s, and the global alignment takes 107s. The texture mapping procedure takes about 104s.

V Applications

Figure 7: Illustration of our personalized avatar generation. We optimize the SMPL model to have a close fit to the reconstructed model before building up our personalized SMPL model.

In this section we present a useful application that generates human models under various shapes and poses by building a personalized SMPL model from the reconstructed human avatar estimated with our proposed sparse fusion approach.

Figure 8: Results on a synthetic dataset.

Previous approaches drive the human avatar via manual or automatic rigging and by setting up the skinning weights. However, setting proper skinning weights is not a trivial task, and poorly chosen weights produce unrealistic deformations at the joints, as shown in Figure 6. In addition to reposing the reconstructed model, we want to be able to adjust the shape and synthesize an avatar that is fatter or thinner. This is not easy to achieve via simple skinning weight transfer. Therefore, instead of transferring the skinning weights from a general template, in this paper we embed the SMPL model into the reconstructed human avatar and propose a hierarchical representation for deformation. That is, we want to take advantage of the SMPL model for body reshaping and reposing while also preserving the surface details beyond the SMPL model.

Starting from the SMPL model obtained in the initial fitting procedure, which fits the partial pieces as described in Section III-A, we further optimize it to fit the complete 3D model produced by the fusion more closely. This is done in a similar fashion to the initial model fitting. The only difference is that we do not need to enforce any prior here, as we already have a good initial model. Moreover, the complete human model obtained through our SparseFusion method provides sufficient constraints for estimating the SMPL parameters. Therefore, we only need to penalize the distance between the SMPL model and the reconstructed human model by solving the objective function defined in Equation 13. The optimized SMPL model overlaid on the reconstructed model is shown in Figure 7.

In the next step, for each vertex in the SMPL model we find its correspondence in the reconstructed model via nearest-neighbor search. We construct a displacement map from the SMPL model to these correspondences on the reconstructed mesh. The SMPL model can then be reposed or reshaped by setting its pose or shape parameters, and we apply the displacement map to the resulting reposed mesh.
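The displacement-map step can be illustrated with a minimal sketch. It assumes meshes are given as plain lists of 3D vertex tuples, uses a brute-force nearest-vertex search, and adds the stored offsets directly in world coordinates; the function names are hypothetical and this is not the implementation used in the paper.

```python
import math

def nearest_vertex(p, vertices):
    """Return the vertex in `vertices` closest to point `p` (brute force)."""
    return min(vertices, key=lambda q: math.dist(p, q))

def displacement_map(smpl_vertices, recon_vertices):
    """Offset from each SMPL vertex to its nearest reconstructed vertex."""
    disp = []
    for v in smpl_vertices:
        c = nearest_vertex(v, recon_vertices)
        disp.append(tuple(ci - vi for ci, vi in zip(c, v)))
    return disp

def apply_displacements(reposed_vertices, disp):
    """Add the stored offsets to the reposed (or reshaped) SMPL vertices.

    Assumption: offsets are added in world coordinates, without rotating
    them along with the body parts.
    """
    return [tuple(vi + di for vi, di in zip(v, d))
            for v, d in zip(reposed_vertices, disp)]

# Toy data: two SMPL vertices and three reconstructed vertices.
smpl = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
recon = [(0.0, 0.1, 0.0), (1.0, 0.2, 0.0), (5.0, 5.0, 5.0)]
disp = displacement_map(smpl, recon)

# The same SMPL vertices after a (toy) repose that shifts them along z.
reposed = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
detailed = apply_displacements(reposed, disp)
```

For meshes with tens of thousands of vertices, a spatial index such as a k-d tree would replace the brute-force search.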


However, the reposed SMPL mesh still lacks surface details. We take it as an intermediate mesh and use its vertices as control points to deform the reconstructed avatar under as-rigid-as-possible deformation. The animation results are shown in Figure 6(c).

Figure 9: Models of synthetic datasets.

VI Experiments

Figure 10: Results on real datasets. The four leftmost columns are sampled input scans. The three middle columns are the fused model and models deformed to some of the input scans. The two rightmost columns show the textured models. The numbers of vertices of the three reconstructed models, from top to bottom, are 60385, 54281 and 57826, respectively.

We demonstrate the effectiveness of our approach in the experimental part with both quantitative and qualitative results.

VI-A Quantitative evaluation on synthetic datasets

We tested our system on synthetic datasets that we created using Poser [Poser19]. We selected four human subjects (shown in Figure 9), and for each subject we generated eight models under different poses. We synthesize one depth map and one color image for each model with a virtual camera rotating around the subject, so the input consists of eight depth maps and color images, with each frame corresponding to a model in a specific pose. An example is shown in Figure 8(a). Our reconstruction system produces a shape (shown in Figure 8(b)(c)) with respect to the first selected frame, which is taken as the canonical frame. We plot an error map to show the geometric error of our reconstructed model with respect to the ground-truth model. The error for each vertex is computed via a nearest-neighbor search on the ground-truth mesh. We also evaluated our method with only six input frames. As shown in Figure 8(g)(h), we are able to reconstruct the human model from quite sparse frames.

3D self-portrait [Hao13selfportrait], which also takes eight partial pieces as input, is closely related to our work. We implemented 3D self-portrait and tested it on our synthetic dataset. As can be seen in Figure 11(b), it is quite difficult to align the partial pieces without handling the large pose changes. As a result, misalignment appears, especially around the arms and legs.

We also compare our method with the current state-of-the-art deep-learning-based human body reconstruction method [saito2019pifu]. Using a single color image as input is quite convenient. However, the reconstructed model is over-smoothed and lacks surface details, as shown in Figure 11(c). In addition, the inherent depth ambiguity results in inaccurate 3D poses and body shapes.

In addition, we compare our method with the current state-of-the-art fusion-based approach [Yu18doublefusion], which fuses a depth sequence into a canonical model by continuously tracking the surface evolution. For this comparison, we rendered a depth sequence with 90 frames for each human subject. To maintain continuous motion along the sequence, we interpolate among the selected sparse models. As shown in Figure 11(d), there are artifacts along the legs and arms in the fused canonical models. These are caused by accumulated error and by imperfect initialization, since the method requires an A-pose as the starting pose.
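How a continuous sequence might be generated from a handful of sparse key poses can be sketched as follows. The text does not specify the scheme used, so this example assumes each key pose is a plain vector of parameters and blends neighboring key poses linearly; the function name and the toy 3-dimensional poses are hypothetical, and joint rotations would normally call for spherical rather than linear interpolation.

```python
def interpolate_sequence(key_poses, num_frames):
    """Resample a list of key pose vectors into `num_frames` frames.

    Each output frame is a linear blend of the two neighboring key poses.
    """
    if len(key_poses) < 2 or num_frames < 2:
        raise ValueError("need at least two key poses and two frames")
    last = len(key_poses) - 1
    frames = []
    for i in range(num_frames):
        t = i * last / (num_frames - 1)   # position on the key-pose timeline
        k = min(int(t), last - 1)         # index of the left key pose
        a = t - k                         # blend weight toward the right key pose
        p0, p1 = key_poses[k], key_poses[k + 1]
        frames.append([(1 - a) * x0 + a * x1 for x0, x1 in zip(p0, p1)])
    return frames

# Eight toy key poses resampled to a 90-frame sequence.
key_poses = [[float(k), 0.0, -float(k)] for k in range(8)]
sequence = interpolate_sequence(key_poses, 90)
```

The first and last frames coincide with the first and last key poses, and every intermediate frame lies between two consecutive key poses.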

Figure 11: Comparison with state-of-the-art human body modeling methods on a synthetic dataset.
Figure 12: Demonstration of the reconstructed models with quite sparse frames. (a) shows the sampled color images. (b) shows the fused model and the two pieces used to reconstruct the model. The number of the vertices for the fused model is 44673. (c) shows the fused model and the four pieces used to reconstruct the model. The reconstructed model has 47540 vertices.

Table I shows the reconstruction error. We evaluate the reconstruction error of the fused models produced by our method with 1, 6 and 8 frames as input. For the reconstruction using only one frame, we take the optimized SMPL model as the reconstructed model. We also compute the reconstruction error for the models obtained from DoubleFusion [Yu18doublefusion] and PIFu [saito2019pifu]. As demonstrated in Table I, with 6 or 8 input frames our method achieves the lowest error on every subject, with a mean reconstruction error below 10 mm.

| | PIFu [saito2019pifu] | DoubleFusion [Yu18doublefusion] | Ours (1 frame) | Ours (6 frames) | Ours (8 frames) |
|---|---|---|---|---|---|
| subject 1 | 16.8 | 16.4 | 17.1 | 9.2 | 7.4 |
| subject 2 | 68.1 | 18.9 | 19.2 | 10.3 | 8.7 |
| subject 3 | 62.1 | 15.4 | 16.9 | 10.4 | 8.2 |
| subject 4 | 58.1 | 51.7 | 18.7 | 9.6 | 6.8 |
| mean error | 51.28 | 25.63 | 17.98 | 9.88 | 7.78 |
Table I: Reconstruction error. For each human subject, we compute the distance of every vertex on the reconstructed model to its nearest vertex on the ground-truth model. The reconstruction error (in mm) indicates the average distance for all the vertices.
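The error metric described in the caption can be written down as a short sketch. It assumes both meshes are given as lists of 3D vertex tuples in millimeters and uses a brute-force nearest-vertex search; the function name is hypothetical and the toy vertices are not data from the paper.

```python
import math

def mean_nearest_vertex_error(recon_vertices, gt_vertices):
    """Average distance from each reconstructed vertex to its nearest
    ground-truth vertex (same units as the input coordinates)."""
    total = 0.0
    for p in recon_vertices:
        total += min(math.dist(p, q) for q in gt_vertices)
    return total / len(recon_vertices)

# Toy example with coordinates in millimeters.
recon = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
gt = [(0.0, 3.0, 4.0), (10.0, 0.0, 1.0)]
error_mm = mean_nearest_vertex_error(recon, gt)
```

Note that the metric is one-directional: it measures how far the reconstruction lies from the ground truth, but does not penalize ground-truth regions that the reconstruction fails to cover.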

VI-B Qualitative evaluation on real datasets

For the qualitative evaluation, we have captured RGBD sequences of several human subjects with a Microsoft Kinect V2. The results of our method are displayed in Figure 10. For each reconstruction, we use twelve RGBD frames as input. We take a frontal piece as the canonical frame and deform all other pieces onto it. As demonstrated in Figure 10, complete human models with sufficient surface details are recovered. Besides, we can also deform the reconstructed human model onto any input scan.

We have also conducted a visual comparison with DoubleFusion [Yu18doublefusion] on a real dataset, with the results shown in Figure 13. The human subject was asked to maintain an A-pose while rotating in front of the camera. Although DoubleFusion also exploits a human template to track the human poses, there are still artifacts in the fused model caused by accumulated error, as shown in Figure 13(b). The reason is that it relies on accurate tracking along the whole sequence. Compared with this fusion method, our proposed method is able to reconstruct complete models without any seams (Figure 13(c)).

Figure 13: Comparison with dynamic fusion approach.

In Figure 12, we demonstrate the ability of our method to fuse models from a very limited number of frames. In this case, the overlapping regions between every two pieces are very small and are insufficient for pairwise registration. Therefore, we first deform every partial scan into the canonical space as guided by the SMPL template. We show the reconstructed models obtained with only 2 and 4 pieces. The reconstruction improves when more frames are used. As shown in the red box, there are irregular bumps in the reconstructed model after Poisson surface reconstruction when we take 2 pieces as input. The reconstructed surface improves when 2 more pieces are added.

We further demonstrate the effectiveness of our method in dealing with topology changes in Figure 14. Because we handle this problem explicitly while performing the deformation, we are able to generate visually pleasing results in this case.

Figure 14: Results on changing topology. The reconstructed model has 58417 vertices.

VI-C Applications on animation

In this section, we show some results on animated human avatars obtained by building a personalized SMPL model. We can adjust the shape parameters of the model to synthesize human models that are shorter or taller, or fatter or thinner, as shown in Figure 15(a). We can also generate human avatars under various poses (as displayed in Figure 15(b)) by manipulating the pose parameters of our personalized SMPL model.

Figure 15: Reshaping and reposing of a human avatar.

VI-D Limitations

A failure case is shown in Figure 16, where the captured human subject is wearing a dress. During the pairwise registration we exploit the SMPL-based human template to find initial correspondences between partial scans. Since this human template is built from naked human models, it fails to find reliable matches around the folds of the dress. As a result, the shape of the dress is not fully recovered in the reconstructed model. The overall result is still reasonable, with the upper body and the legs well reconstructed. Note that there are also some artifacts around the hair, because the captured depth map is quite noisy in the hair region due to the reflection characteristics of hair.

Figure 16: Reconstruction results on a human subject with loose clothes. (a) shows two sampled input RGBD scans. (b) shows the reconstructed model from our approach. The red box highlights the artifacts on the reconstructed model where the shape of the dress was not successfully recovered.

VII Conclusion and Future work

In this paper, we have proposed a novel approach to build a complete human avatar from only sparse RGBD images. A SMPL-based human template is used to align the partial pieces of a human body, captured under different poses and viewpoints, into a canonical model. After constructing the complete human model, we presented a texture mapping method to construct spatially consistent texture maps for the reconstructed human model. Experiments on both synthetic and real datasets demonstrate the strong performance of our framework in reconstructing complete human bodies, with a reconstruction error of a few millimeters. As a potential application, animations are carried out with our reconstructed human avatar across various shapes and poses.

At the moment, the human modeling method is designed for a single person. For future work, we will look at the more challenging problem of reconstructing multiple interacting human subjects, a setting that often involves significant occlusions and convoluted topological structures.