Code and data for our paper "High-Fidelity 3D Digital Human Creation from RGB-D Selfies".
We present a fully automatic system that can produce high-fidelity, photo-realistic 3D digital human characters with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head, and can produce a high quality reconstruction in less than 30 seconds. Our main contribution is a new facial geometry modeling and reflectance synthesis procedure that significantly improves the state-of-the-art. Specifically, given the input video a two-stage frame selection algorithm is first employed to select a few high-quality frames for reconstruction. A novel, differentiable renderer based 3D Morphable Model (3DMM) fitting method is then applied to recover facial geometries from multiview RGB-D data, which takes advantages of extensive data generation and perturbation. Our 3DMM has much larger expressive capacities than conventional 3DMM, allowing us to recover more accurate facial geometry using merely linear bases. For reflectance synthesis, we present a hybrid approach that combines parametric fitting and CNNs to synthesize high-resolution albedo/normal maps with realistic hair/pore/wrinkle details. Results show that our system can produce faithful 3D characters with extremely realistic details. Code and the constructed 3DMM is publicly available.READ FULL TEXT VIEW PDF
Code and data for our paper "High-Fidelity 3D Digital Human Creation from RGB-D Selfies".
Project page for our paper "High-Fidelity 3D Digital Human Creation from RGB-D Selfies".
Real-time rendering of realistic digital humans is an increasingly important task in various immersive applications like augmented and virtual reality (AR/VR). To render a realistic human face, high-quality geometry and reflectance data are essential. There exist specialized hardware like Light Stage (Alexander et al., 2009) for high-fidelity 3D faces capturing and reconstruction in the movie industry, but they are cumbersome to use for consumers. Research efforts have been dedicated to consumer-friendly solutions, trying to create 3D faces with consumer cameras, e.g., RGB-D data (Zollhöfer et al., 2011; Thies et al., 2015), multiview images (Ichim et al., 2015), or even a single image (Hu et al., 2017; Yamaguchi et al., 2018; Lattas et al., 2020). While good results have been shown, the reconstructed 3D faces still contain artifacts and are far from satisfactory.
Indeed, faithful 3D facial reconstruction is a challenging problem due to the extreme sensitivity that human perception has towards faces. First, the recovered facial geometry needs to preserve all important facial features like cheek silhouettes and mouth shapes. Single-image based approaches (Hu et al., 2017; Yamaguchi et al., 2018; Lattas et al., 2020) can hardly achieve this due to the lack of reliable geometric constraints. With multiview RGB/RGB-D inputs, existing approaches (Zollhöfer et al., 2011; Thies et al., 2015; Ichim et al., 2015)
do not fully leverage most recent advances in deep learning and differentiable rendering(Genova et al., 2018; Gecer et al., 2019), leading to inaccurate recovery that does not fully resemble the user’s facial shape. Second, the synthesized facial reflectance maps need to be high-resolution with fine details like eyebrow hair, lip wrinkles, and pore details on facial skin. Several recent work (Saito et al., 2017; Yamaguchi et al., 2018; Lattas et al., 2020) have tried to address these issues, but their results still lack natural facial details that are critical for realistic rendering.
In this paper, we present new facial geometry modeling and reflectance synthesis approaches that can produce faithful geometry shapes and high-quality, realistic reflectance maps, from multiview RGB-D data. Our geometry modeling algorithm extends differentiable renderer based 3DMM fitting, such as GANFIT (Gecer et al., 2019), from single image to multiview RGB-D data. Different from GANFIT, we employ conventional PCA-based texture bases instead of GAN to reduce the texture space, so that more data constraints can be exerted on geometric shaping. Additionally, we present an effective frame selection scheme, as well as an initial model fitting procedure, which can avoid enforcing conflicting constraints and increase system robustness. Moreover, we propose an effective approach that takes advantages of extensive data generation and perturbation to construct the 3DMM, which has much larger expressive capacity compared with previous methods. We show that even with the linear bases of the new 3DMM, our method can consistently recover accurate, personalized facial geometry.
For facial reflectance modeling, we use high-resolution 2K (2048 2048) UV-maps consisting of an albedo map and a normal map. We propose a hybrid approach that consists of a regional parametric fitting and CNN-based refinement networks. The regional parametric fitting is based on a set of novel pyramid bases constructed by considering variations in multi-resolution albedo maps, as well as high-resolution normal maps. Faithful but over-smoothed high-resolution albedo/normal maps can be obtained in this step. GAN-based networks are then employed to refine the albedo/normal maps to yield the final high-quality results. Our experiments show that even with the resolution inputs, our method can produce high-resolution albedo/normal maps, where eyebrow hair, lip wrinkles and facial skin pores are all clearly visible. The high-quality reflectance maps significantly improve the realism of the final renderings in real-time physically based rendering engines.
With the recovered facial geometry and reflectance, we further present a fully automatic pipeline to create a full head rig, by completing a head model, matching a hair model, estimating the position/scale of eyeballs/teeth models, generating the expression blendshapes, etc. We conduct extensive experiments and demonstrate potential applications of our system.
A fully automatic system for producing high-fidelity, realistic 3D digital human characters with consumer-level RGB-D selfie cameras. Compared with previous avatar approaches, our system can generate higher quality assets for physically based rendering of photo-realistic 3D characters. The total acquisition and production time for a character is less than 30 seconds. The core code and the constructed 3DMM will be made publicly available111See our project page at: https://tencent-ailab.github.io/hifi3dface_projpage/.
A robust procedure consisting of frame selection, initial model fitting, and differentiable renderer based optimization to recover faithful facial geometries from multiview RGB-D data, which can tolerate data inconsistency introduced during user data acquisition.
A novel morphable model construction approach that takes advantages of extensive data generation and perturbation. The constructed linear 3DMM by our approach has much larger expressive capacity than conventional 3DMM.
A novel hybrid approach to synthesize high-resolution facial albedo/normal maps. Our method can produce high-quality results with fine-scale realistic facial details.
Creating high-fidelity realistic digital human characters commonly relies on specialized hardware (Alexander et al., 2009; Beeler et al., 2010; Debevec et al., 2000) and tedious artist labors like model editing and rigging (von der Pahlen et al., 2014). Several recent work seek to create realistic 3D avatars with consumer devices like a smartphone using domain specific reconstruction approaches (i.e., with face shape/appearance priors) (Ichim et al., 2015; Yamaguchi et al., 2018; Lattas et al., 2020). We mainly focus on prior arts along this line and briefly summarize the most related work in this section. Please refer to the recent surveys (Egger et al., 2020; Zollhöfer et al., 2018) for more detailed reviews.
The 3D morphable model (3DMM) is introduced in (Blanz and Vetter, 1999) to represent a 3D face model by a linear combination of shape and texture bases. These bases are extracted with PCA algorithm on topological aligned 3D face meshes. To recover a 3D face model from observations, the 3DMM parameters can be estimated instead. Since the 3DMM bases are linear combinations of source 3D models, the expressive capacity of a 3DMM is rather limited. Researchers tried to increase the capacity either by automatically generating large amounts of topological aligned face meshes (Booth et al., 2016) or turn the linear procedure into a nonlinear one (Lüthi et al., 2017; Tran and Liu, 2018). However, the generated face models with these 3DMM models are usually flawed and not suitable for realistic digital human rendering. Another line to increase the expressive capacity of 3DMM is to segment the face into regions and then employ spatially localized bases to model each region (Blanz and Vetter, 1999; Tena et al., 2011; Neumann et al., 2013). We present a novel data augmentation approach that can effectively increase the capacity of either global or localized 3DMM with the same amount of source 3D face meshes as existing approaches.
Given a single face image, the 3D face model can be recovered by estimating the 3DMM parameters with analysis-by-synthesis optimization approaches (Blanz and Vetter, 2003; Romdhani and Vetter, 2005; Garrido et al., 2013; Thies et al., 2016; Hu et al., 2017; Garrido et al., 2016; Yamaguchi et al., 2018; Gecer et al., 2019). A widely adopted approach among them is described in the Face2Face work (Thies et al., 2016), where the optimization objective consists of photo consistency, facial landmark alignment, and statistical regularization. Although there is a recent surge of deep learning based approaches to use CNNs to regress 3DMM parameters (Zhu et al., 2016; Tran et al., 2017; Tewari et al., 2017; Genova et al., 2018), the results are commonly not in high fidelity due to lack of reliable geometric constraints. Some work go beyond the 3DMM parametric estimation to use additional geometric representations to model facial details (Richardson et al., 2017; Tran et al., 2018; Tewari et al., 2018; Jackson et al., 2017; Sela et al., 2017; Chen et al., 2020; Guo et al., 2019; Kemelmacher-Shlizerman and Basri, 2011; Shi et al., 2014), but the results are generally not satisfactory for realistic rendering.
Ichim et al. (2015) present a complete system to produce face rigs by taking hand-held videos with a smartphone. The system relies on a multiview stereo reconstruction of the captured head followed by non-rigid registrations, which is slow and error-prone, especially when motion occurs or no reliable feature points can be detected in face regions. Besides, the two separated steps are very brittle: the reconstruction step cannot utilize the strong human facial prior from 3DMM and hence its results are usually rather noisy, which further leads to erroneous registration results. Recent research on multiview face reconstruction with deep learning approaches (Dou and Kakadiaris, 2018; Wu et al., 2019) do not explicitly model geometric constraints and thus are not accurate enough for high-fidelity rendering.
Modeling facial geometries from RGB-D data commonly consists of several separated steps (Zollhöfer et al., 2011; Weise et al., 2011; Bouaziz et al., 2013; Li et al., 2013; Zollhöfer et al., 2014). First, accumulated point clouds are obtained with rigid registration (Newcombe et al., 2011). Then a non-rigid registration procedure is employed to obtain a deformed mesh from the target mesh model (Bouaziz et al., 2016; Chen et al., 2013). Finally, in order to obtain a 3DMM parametric representation, a morphable model fitting is applied using the deformed mesh as geometric constraints (Zollhöfer et al., 2011; Bouaziz et al., 2016). Although the approach is widely adopted as standard practices, it suffers from accumulated errors due to the long pipeline. Thies et al. (2015) propose to use an unified parametric fitting procedure to directly optimize camera poses together with 3DMM parameters, taking into account both RGB and depth constraints. Their method achieves high-quality results in facial expression tracking, but is not specially designed for recovering personalized geometric characteristics.
Saito et al. (2017) propose to synthesize high-resolution facial albedo maps using CNN style features based optimization like the style transfer algorithm (Gatys et al., 2016). However, the approach requires iterative optimization and needs several minutes of computation. Yamaguchi et al. (2018)
further propose to inference albedo maps, as well as specular maps and displacement maps, using texture completion CNNs and super-resolution CNNs. GANFIT(Gecer et al., 2019)
employs the latent vector of a Generative Adversarial Network (GAN) as the parametric representation of texture maps and then use an differentiable renderer based optimization to estimate the texture parameters. The most recent work AvatarMe(Lattas et al., 2020) propose to infer separated diffuse albedo maps, diffuse normal maps, specular albedo maps, and specular normal maps using a series of CNNs. We present a novel hybrid approach that can achieve high-quality results while at the same time is more robust than the above pure CNN-based approaches.
To complete a full head avatar model, accessories beyond face region need to be attached to the recovered face model, e.g., hair, eyeballs, teeth, etc. Ichim et al. (2015) describe a simple solution to transfer these accessories (except hair) from a template model and adapt the scales/positions to the reconstructed face model. Cao et al. (2016) use image-based billboards to deal with eyes and teeth, and coarse geometric proxy to deal with hair model. Nagano et al. (2018) employ a GAN-based network to synthesis mouth interiors. Hu et al. (2017) propose to perform hair digitization by parsing hair attributes from the input image and then retrieving a hair model for further refinement. There are also some approaches working on modeling hairs in strand level (Wei et al., 2005; Hu et al., 2015; Chai et al., 2015; Luo et al., 2013; Saito et al., 2018). For expression blendshape generation, Ichim et al. (2015) present a dynamic modeling process to produce personalized blendshapes, while Hu et al. (2017) adopt a simplified solution to transfer generic FACS-based blendshapes to the target model. The expression blendshapes can also be generated with a bilinear 3DMM model like FaceWarehouse (Cao et al., 2014), where the face identity and expression parameters are in independent dimensions.
We first introduce the 3D face dataset used in our system. Then we describe the goal of our system, followed by the user data acquisition process and a summary of the main processing steps.
We use a specialized camera array system (Beeler et al., 2010) to scan 200 East Asians, including 100 males and 100 females, aged from 20 to 50 years old (with their permissions to use their face data). The scanned face models are manually cleaned and aligned to a triangle mesh template with 20,481 vertices and 40,832 faces. Each face model is associated with a 2K-resolution (2048 2048) albedo map and a 2K normal map, where pore-level details are preserved. A linear PCA-based 3DMM (Blanz and Vetter, 1999) can be constructed from the dataset, which consists of shape bases, albedo map bases, and normal map bases. Note that we propose a novel approach to construct an augmented version of the 3DMM shape bases in Sec. 5.3. Besides, a novel pyramid version of the 3DMM albedo/normal maps is presented in Sec. 6.1.
The goal of our system is to capture high-fidelity users facial geometry and reflectance with RGB-D selfie data, which is further used to create and render full-head, realistic digital humans. For geometry modeling, we use 3DMM parameters to represent a face, since it is more robust to degraded input data and with more controllable mesh quality than deformation-based representations. For reflectance modeling, we synthesize 2K-resolution albedo and normal maps regardless of the input RGB-D resolution.
We use an iPhone X to capture user selfie RGB-D data. Note that it is common nowadays for a smartphone to be equipped with a front-facing depth sensor and any such phone can be used. While a user is taking selfie RGB-D video, our capturing interface will guide the user to consecutively rotate his/her head to left, right, upward, and back to middle. The entire acquisition process takes less than 10 seconds, and a total of 200-300 frames of RGB-D images are collected, with resolution . The face region for computation is cropped (and resized) to . The camera intrinsic parameters are directly read from the device.
We first employ an automatic frame selection algorithm to select a few high-quality frames that cover all sides of the user (Sec. 4). Then an initial 3DMM model fitting is computed with the detected facial landmarks in the selected frames (Sec. 5.1). Starting from the initial fitting, a differentiable renderer based optimization with multiview RGB-D constraints (Sec. 5.2) is applied to solve the 3DMM parameters as well as lighting parameters and poses. Based on the estimated parameters, high-resolution albedo/normal maps are then synthesized (Sec. 6). Finally, high-quality, realistic full head avatars can be created and rendered (Sec. 7).
There are typically 200-300 frames acquired from a user. For efficiency and robustness, we developed a robust frame selection procedure to select a few high-quality frames for further processing, which considers both view coverage and data quality. As shown in Fig. 2, the procedure consists of two stages as described below.
We first apply a real-time facial landmark detector (a MobileNet (Howard et al., 2017) model trained on 300W-LP dataset (Zhu et al., 2016)) on RGB images to detect 2D landmarks for each frame. Then a rough head pose for each frame can be efficiently computed with the correspondences between the 2D landmarks and the 3D keypoints on a template 3D face model using PnP algorithm (Lepetit et al., 2009). Frames with extreme/invalid poses or closed-eye/opened-mouth expressions can be easily identified and screened out with the 2D landmarks and rough head poses. We categorize the rest frames by poses into groups: front, left, right, and up. Each group only keeps 10-30 frames near the center pose of the group. Note that more groups can be obtained by categorizing the frames with finer-level angle partitioning. We experimented with different number of groups and found four is a good balance between accuracy and efficiency. The remaining depth images are preprocessed to remove depth values beyond the range between 40cm and 1m (the typical selfie distances). Bilateral filtering (Paris and Durand, 2009) with a small spatial and range kernel is then applied to the depth images to attenuate noises.
For each group, we further select one frame based on two criteria: image quality and rigidness. To measure the image quality of a frame, we compute the Laplacian of Gaussian (LoG) filter response and use the variance as a motion blur score (images with a larger score are sharper). A front face frame is first selected based on the motion blur score in the front group. We then compute the rigidness between each frame in the other groups and the front face with the help of depth data. Specifically, the detected 2D landmarks for each frame are lifted from 2D to 3D using depth data. Note that occluded landmarks are automatically removed according to the group that a frame belongs to, e.g., for a frame in the left group, the landmarks on the right side of the face are removed. We use RANSAC method to compute the relative pose between each frame in the other groups and the front face using the 3D-3D landmark correspondences(Arun et al., 1987)
. Frames with too many outliers are considered as low rigidness and thus are excluded. Then a best frame in each group can be found based on the motion blur score. The output of this step is four frames with the 3D landmarks.
We use PCA-based linear 3DMM (Blanz and Vetter, 1999)
for parametric modeling. The shape and albedo texture of a face model is represented as
where is the vector format of the mean 3D face shape model, is the shape identity basis, is the corresponding identity parameter vector to be estimated, is the vector format of the mean albedo map, is the albedo map basis, is the corresponding albedo parameter vector to be estimated. The details of the bases are presented in Secs. 5.3 (shape) and 6.1 (albedo).
We fit an initial shape model with the detected 3D landmarks using a ridge regression(Zhu et al., 2015). A partial texture map can be extracted by projecting the shape model onto each input image. With a predefined mask derived from landmarks for each view (see Fig. 3), the partial texture maps are then blended into a complete texture map using Laplacian pyramid blending (Burt and Adelson, 1983). The initial albedo parameters can be obtained with another ridge regression to fit the blended texture map.
Fig. 4 shows our optimization framework. The parameters to be optimized are
and is the shape parameter, is the albedo parameter, is the second-order spherical harmonics lighting parameter, include the rotation and translation parameters for rigid transformation. Note that we have only one and one for an user, while the number of and equals to the number of views. With a set of estimated parameters and the 3DMM bases, a set of rendered RGB-D frames can be computed via a differentiable renderer (Genova et al., 2018; Gecer et al., 2019)
. The distances between the rendered RGB-D frames and the input RGB-D frames can be minimized by backpropagating the errors to update parameters
. The loss function to be minimized is defined as:
where denotes pixel-wise RGB photometric loss, indicates pixel-wise depth loss, is identity perceptual loss, represents landmark loss, and means regularization terms. Note that the landmark loss, RGB photometric loss, and regularization term are similar to conventional analysis-by-synthesis optimization approaches (Thies et al., 2016). The identity perceptual loss is also employed in recent differentiable renderer based approaches (Genova et al., 2018; Gecer et al., 2019). We extend these losses into multiview setting and incorporate depth data for geometric constraints. The details of each term are as follows.
The pixelwise RGB photometric loss is:
where is the input RGB image, is the rendered RGB image from the differentiable renderer. We adopt -norm because it is more robust against outliers than -norm.
The depth loss is defined as:
where defines a truncated -norm that clips the per-pixel mean squared error, is the input depth image, is the rendered depth image from the differentiable renderer. The truncated function makes the optimization more robust to depth outliers.
We define the landmark loss as the average distances between the detected 2D landmarks and projected landmarks from the predicted 3D model:
where are the detected landmarks, denotes that the vertex is rigidly transformed by and projected by camera . The weighting is to control the importance of each keypoint, where we set 50 for those located in eye, nose and mouth, while others are 1.
To ensure the plausibility of the reconstructed faces, we apply regularization to shape and texture parameters:
where we set and .
For efficiency, we use albedo maps of a resolution during the optimization. We render RGB-D images and compute the pixel losses in the same resolution as input depth images, which is . The weightings in Eq. (1) is set to , , , , . We use Adam optimizer (Kingma and Ba, 2014)
in Tensorflow to update parameters for 150 iterations to get the results, with a learning ratedecaying exponentially in every 10 iterations.
The differences between our approach and state-of-the-art 3DMM fitting approaches are listed in Table 1. Result comparisons are presented in Sec. 8.2. Note that our implementation will be publicly available and can be easily configured into equivalent settings to other approaches by changing the combinations of input data and loss terms.
|(Thies et al., 2015)||RGB-D||Gauss-Newton|
As the constraints incorporated in the optimization are rich, we found the expressive capacity of the linear 3DMM constructed using conventional approaches are very limited. We here present an augmentation approach to effectively boost the 3DMM capacity. Our approach is motivated by the observation that human faces are mostly not symmetrical. This will cause ambiguities when aligning face models. The reason is that during the alignment of two models, the relative rotation and translation between them is determined by minimizing the errors at some reference points on the models. Different reference points may lead to different alignment results. There are no perfect reference points due to the asymmetrical structures of human faces. This reminds us that we can perturb the relative pose between two aligned models to get an “alternative” alignment. In this way, we can actually get additional samples for PCA, since the new alignments introduces new morphing targets. Furthermore, we can use a set of perturbation operations including pose perturbation, mirroring, region replacement, etc., to augment the aligned models. Based on the large amount of generated data, we propose a stochastic iterative algorithm to construct a 3DMM that compresses more capacities into lower dimensions of the bases.
Starting from the 200 aligned face shape models, our data generation and perturbation process consists of the following steps:
Region Replacement with Perturbation. We first replace the nose region of each model with other models, with a rotation perturbation along the pitch angle (uniformly sampled within degree). Mouth region is also processed in the same way. For eye region, we apply replacement without perturbation. The different perturbations are empirically designed by minimizing the introduced visual defects during processing. The facial regions used in this step are shown in Fig. 5.
Rigid Transformation Perturbation. We then apply rigid transformation perturbations to each face model, where the uniformly sampled range is set to: degree along yaw/pitch/roll angles for rotation, along each of the three axes for translation, for scale.
Mirroring. Finally, we apply a mirroring for all the generated face models along model local coordinate system. In this way, we get over 100,000 face models in total.
Our iterative 3DMM construction algorithm is presented in Alg. 1. There are two levels of loops in our algorithm. We maintain a model set for 3DMM construction and update it inside the loops. In each iteration of the outer loop, we sample a test set with models from the whole generated dataset. In each iteration of the inner loop, we use the constructed 3DMM from to fit models in , and add models with largest fitting errors in into . The convergence threshold is empirically set such that the inner loop is usually converged in less than 5 iterations. Note that in the inner loop, a model sample in could be repeatedly added into for several times. In this case, constructing a 3DMM from the final is different from directly performing PCA on the whole dataset as the data population is changed. Our algorithm encourages more data variance to be captured using fewer principal components (note that in each iteration we construct 3DMM using only the principal components with cumulative explained variance).
In order to validate the effectiveness of our technique, we design a numerical evaluation experiments with the help of BFM 2009 model (Paysan et al., 2009)
. Note that the source face models in our dataset are all East Asians, while those in BFM are mostly not Asians. The domain gap between the two datasets provides us a good benchmark for cross validation (our goal is not to model cross-ethnicity fitting, but use the relative fitting errors between different versions of 3DMM to evaluate their expressive power). For each BFM basis, we compute two 3D face models using the positive and negative standard deviation values. A total of 398 BFM face models are obtained in this way. We register the BFM face models to our mesh topology using Wrap3 software(R3ds, 2020). We use the extracted PCA bases from our dataset to fit the obtained BFM face models and measure the fitting errors. Fig. 6 shows the comparison of fitting errors between the augmented bases and the original bases. If only 100 bases are used, the augmented version has no advantage against the original version. As the number of bases grows, the augmented version clearly outperforms the original version. Note that the maximum number of bases of the original version is 200 since there are only 200 source models for 3DMM construction. For the augmented version, thousands of bases could be obtained since there are over 100,000 models after augmentation. Since our iterative algorithm emphasizes the expressive power of the principal components with cumulative explained variance, most of the expressive capacities are compressed into these components. In our experiments, we found the final number of the components with cumulative explained variance in different runs is generally around 500. Thus we use 500 bases through this paper. Fig. 7 shows a comparison of the facial geometries obtained using our optimization algorithm in Sec. 5.2 with different versions of PCA bases.
There are some approaches constructing separate 3DMM for each facial region (Blanz and Vetter, 1999; Tena et al., 2011; Neumann et al., 2013). The localized 3DMMs obtain more capacities compared with global models by separating the deformation correlations between different facial regions. The region replacement augmentation in our approach is in the same spirit as localized 3DMM by explicitly generating samples with possible combinations of facial regions from different subjects. Compared with localized models, our 3DMM avoids online fusion of facial regions and thus is more efficient. Besides, our perturbation scheme and iterative 3DMM construction algorithm can be applied to localized models to improve their capacities as well. In this paper, we employ global model for efficiency consideration.
In this section, we present a our hybrid approach to synthesis high-resolution albedo and normal maps. We notice that super-resolution based approaches (Yamaguchi et al., 2018; Lattas et al., 2020) cannot yield high-quality, hair-level details of the eyebrows. On the other hand, directly synthesizing high-resolution texture maps (Saito et al., 2017) may lead to overwhelming details, which also makes the rendering not realistic. Our approach addresses the problems with the help of a pyramid-based parametric representation. Fig. 8 shows the pipeline of our approach, which we explain as follows.
Fig. 9 illustrates the process to construct our regional pyramid bases. We first compute image pyramids consisting of two resolutions ( and ) for the 200 albedo maps in our
dataset. We divide facial regions into 8 sub-regions indicated as the different colors in the left UV-map. The region partitioning is based on the fact that different regions have different types of skin/hair details. Denote the set of all regions as . For each region , we construct a linear PCA-based blending model. We define each sample in our dataset as a triplet , where stands for albedo map of region , and and are the albedo map and normal map in resolution. Then the triplet is vectorized into a vector format by fetching and concatenating all pixel values together from the three maps in the region. Note that during the process, the pixel indices in the three maps are recorded such that the vectorized sample can be “scatter back222We use the “scatter_nd” function in Tensorflow as the “scatter back” operation.” into UV-map format. For each region , we apply a PCA on the the 200 vectorized samples to get the bases. Finally, the vectorized bases can be scattered back into UV-map format to obtain the blending bases , where is the low-resolution albedo basis, is the high-resolution albedo basis, is the high-resolution normal basis. and is the number of pixels within region in the -resolution.
The constructed regional pyramid bases have several advantages compared with conventional bases. First, the expressive capacity is larger than global linear bases, while each region can be processed individually to accelerate the runtime. Second, the bases capture variations in both albedo and normal map. Third, the incorporated multiple resolutions can emphasize more structural information in the extracted bases. With the pyramid basis, we can perform parametric fitting on the low resolution, and directly apply the same parameters on high-resolution bases to obtain high-resolution albedo and normal maps. This not only reduces computation, but also generally yields higher-quality results than directly fitting on high resolution. Note that the albedo bases employed in our geometric fitting procedure (Sec. 5.2) is the conventional global model in -resolution, while the bases used in this section dedicated for reflectance synthesis are the regional pyramid model. The reason is that we found the less powerful albedo bases would make the optimization constraints more imposed on the geometric parameter estimation rather than texture parameter estimation. Otherwise more powerful albedo bases tends to result in overfitted textures but underfitted geometries.
Since the albedo parameters obtained in Sec. 5.2 are based on conventional global bases, the resulting albedo maps are not satisfactory due to limited expressive power. Here we directly extract textures from the source images using the estimated shape and poses from Sec. 5.2. Then a model-based delighting using the estimated lighting parameters from Sec. 5.2 is applied on the extracted textures, followed by an unwrapping and blending to yield an initial albedo map . We use the -resolution regional bases to fit the initial albedo map:
where , denotes the total variation function, and . Note that the total variation term is essential to eliminate the artifacts in the resulting albedo maps near boundaries between regions. After obtaining , we can directly compute a high-resolution albedo map and a normal map as
With the help of regional pyramid bases, different types of skin/hair details in different regions can be separately preserved via the high-resolution bases, while the fitting process on low resolution makes the algorithm focus on major facial structures, e.g., the shape of the eyebrows and lips. Since the parametric representation is based on linear blending model, the results are usually over-smoothed (see Fig. 17). We next present our detail synthesis step to refine the albedo and normal maps.
We adopt two refinement networks to synthesize details for albedo and normal map respectively. The refinement networks employ the architecture of a GAN-based image translation model, pix2pix (Isola et al., 2017). As shown in Fig. 8, for albedo refinement, the network takes the fitted -resolution albedo map as input and outputs a refined albedo map in the same resolution. For normal refinement, the refined albedo map and the fitted normal map are concatenated along channel dimension. The refinement network takes the concatenation as inputs and outputs a refined normal map.
During training, we first use facial region replacement and skin color transfer (Reinhard et al., 2001) to augment the 200 high-quality albedo/normal maps (from the dataset for constructing the 3DMM) into 4000 maps, which serve as ground-truth supervision for training the two networks. Then we perform regional fitting (Sec. 6.2) on the 4000 maps to get the fitted albedo/normal maps, which serve as inputs of the networks during training. We only use the facial regions out of the whole UV maps for computing training losses. Similar to pix2pix (Isola et al., 2017), we keep loss and GAN loss in both networks. For albedo refinement, we additionally apply total variation loss to reduce artifacts and improve skin smoothness. The weights for , GAN and total variation losses are , , . For normal refinement, we additionally employ a pixel-wise cosine distances between the predictions and ground-truth maps to increase the accuracy of normal directions. The weights for , GAN and cosine distance losses are , , . The networks are trained with Adam optimizer for iterations.
There are three recent CNN-based approaches that can be adopted to synthesize high-resolution facial UV-maps, which are Yamaguchi et al. (2018), GANFIT (Gecer et al., 2019), and AvatarMe (Lattas et al., 2020). However, these approaches cannot produce satisfactory results in our case. GANFIT (Gecer et al., 2019) needs about times more training data than ours to train a GAN as the nonlinear parametric representation of texture maps. In their work, the texture maps are obtained using unwrapped photos, where shadings and specular highlights are not removed. In our system, the albedo maps and normal maps are created with very high-quality artistic efforts, where shadings and specular highlights are completely removed and hair-level details are preserved. It is rather difficult to extend our data amount to theirs while keeping such high data quality. Regarding the other two approaches, we also tried super-resolution based network as in Yamaguchi et al. (2018) and pure CNN-based synthesis in AvatarMe (Lattas et al., 2020), and found the results obtained with their approaches are generally inferior to ours. We present some comparison in Sec. 8.3.
Although the scanned 200 models in out dataset are full head models, there are usually no reliable geometric constraints beyond facial regions for a RGB-D selfie user. We employ an algorithm to automatically complete a full head model given the recovered facial model. The regions involved in our algorithm are:
A: facial region;
B: back head region;
C: intermediate region;
D: overlapped region between A and C.
Our goal is to compute a full head shape such that region A matches the facial shape and region B matches a reference back head shape. The reason of using a reference shape for back head region B is it can further ease the difficulties to attach accessories like hair models. To this end, we construct a head morphable model using only regions B C of the 200 source models. Note that this model does not need strong expressive power as face models, thus we do not employ the technique in Sec. 5.3 but directly use PCA to extract bases. Given the recovered facial shape of a user, we apply a ridge regression similar to Sec. 5.1 to get the head 3DMM parameters, using the constraints of region B D. Then the full head model is obtained by combining the resulting shape (B C) with facial region A.
We perform a hairstyle classification on the user’s front photo (using a MobileNet (Howard et al., 2017) image classification model trained on labeled front photos) and attach the corresponding hair model (created by artists in advance) to the head model according to the predicted hairstyle label. There are in total 30 hairs models in different hairstyles in our system (see supplementary materials). For eyeballs, we use template models and calculate the positions and scales based on reference points on the head model. For teeth, we employ an upper teeth model and a lower teeth model. The upper teeth model is placed according to reference points near nose, and it remains still when facial expression changes. The lower teeth model is placed according to reference points on the chin. When mouth opens or closes, the lower teeth model moves with the chin. Note that the accessory models are not the focus of this work, their modeling and animation can be found in dedicated research work (Zoss et al., 2018; Wu et al., 2016; Bérard et al., 2019, 2016; Zoss et al., 2019; Velinov et al., 2018).
The recovered full head mesh model, as well as the high-quality albedo map and normal map, can be rendered with any physically based renderer. In this work, we show rendered results using Unreal Engine 4 (UE4), with the material composition templates provided by the engine (2020). Since we do not model specular map in our approach, we use a same specular map from the material template for rendering all the results in this paper.
|Initial Model Fitting||0.1s|
|Regional Parametric Fitting||1.5s|
|(Ichim et al., 2015)||10 minutes||1 hour||15 minutes|
|(Cao et al., 2016)||10 minutes||1 hour||needed|
|(Hu et al., 2017)||1 second||6 minutes||–|
|Ours||10 seconds||15 seconds||–|
The selfie data acquisition typically takes less than 10 seconds (200-300 frames). The total processing time after data acquisition is about 15 seconds. Note that some of the processing steps like real-time landmark detection, coarse screening, and bilateral filtering can be computed on the smartphone client while the user is taking selfie. The preprocessed data are streamed to a server via WiFi during acquisition. The rest steps of the processing are computed on the server. Table 2 shows the runtime on our server with an Nvidia Tesla P40 GPU and an Intel Xeon E5-2699 CPU (22 cores). Note that the frame selection and expression blendshape generation are implemented with multi-thread acceleration and the total runtime is largely reduced thanks to parallel processing. Table 3 shows a time comparison with other avatar creation systems. In terms of total acquisition and processing time, our system provides a convenient and efficient solution for users to create high-quality digital humans.
We evaluate the quality of our recovered facial shapes in extensive experimental settings. The experiments include quantitative and qualitative comparisons of different variants of our approach and existing methods such as Face2Face (Thies et al., 2016; Hu et al., 2017; Yamaguchi et al., 2018), GANFIT (Gecer et al., 2019; Lattas et al., 2020), N-ICP (Weise et al., 2011; Bouaziz et al., 2016), etc. Note that all the results are obtained with our 3DMM bases for fair comparisons.
|Method||Average ranking percentage|
|Face2Face (Thies et al., 2016)||30.3%|
|GANFIT (Gecer et al., 2019)||27.3%|
|N-ICP (Bouaziz et al., 2016)||16.6%|
We use the same workflow as the production of our dataset to manually create the ground-truth models of two users. Since the ground-truth obtained in this way is very expensive, we only perform numerical evaluations on the two models to get quantitative observations. The corresponding geometries recovered from RGB-D selfie data with different approaches are evaluated. The results are in Fig. 11. It can be seen from the results that our method yields the lowest mean errors, closely followed by N-ICP and the single-view variants of our method (single rgbd+id). Compared with N-ICP, our method performs better on detailed facial geometries near eyes, nose, and mouth. It conforms to our motivation that appearance constraints (photo loss and identity perceptual loss) help capture more accurate facial features.
To demonstrate the ability of our method to capture high-level perceptual identity features, we design a novel face verification experiment for further numerical evaluation. We collect selfie data from 30 subjects and put their selfie photos into a large face image dataset with over photos of Asian people. Then the geometry models of the 30 subjects are obtained using different approaches. We use UE4 to render the recovered models into realistic images. For fair comparison of geometry models, we use the same albedo and normal maps for models obtained using different approaches for a user. The rendered realistic images are compared with all the images in the large dataset using a face recognition network (Deng et al., 2019). We then calculate the ranking of each user’s rendered image to his/her real selfie photo among all the face images. The average ranking percentage are shown in Table 4. Our method generally yields more recognizable shapes than other approaches.
Fig. 12 shows two examples of the shape models obtained using different variants of our method. The version with multiview RGB-D data generally outperforms other variants. Besides, as shown in the figure, random initialization of our optimization can lead to flawed models. The initial fitting in Sec. 5.1 improves the system robustness. We further show results comparisons between our method with Face2Face (Thies et al., 2016; Hu et al., 2017; Yamaguchi et al., 2018), GANFIT (Gecer et al., 2019; Lattas et al., 2020), and N-ICP (Weise et al., 2011; Bouaziz et al., 2016) in Fig. 10. In general, both our method and N-ICP can reconstruct more accurate facial shapes than the methods using only RGB data (Face2face and GANFIT). This can be clearly observed from the silhouettes near the cheek region in the results. Watching more closely, our method can recover more faithful and personalized facial shapes than N-ICP, especially near mouth region. The mouth shapes obtained with N-ICP are similar among all the faces, while our results preserve personalized mouth features and are more faithful to the photos. This can be explained by the additionally incorporated photometric loss and identity loss in our method.
Our method can produce albedo and normal maps with high-quality, realistic details while preserving major facial features of the users. Fig. 13 shows several examples of our obtained albedo and normal maps. Hair-level details in the albedo and normal maps are clearly visible. Fig. 15 shows the overlay results with synthesized albedo maps. The major facial features are consistent between the synthesized albedo maps and the input photos, especially in the eyebrow region. More importantly, the hair-level details near eyebrow region and the pore-level skin details are clearly visible. Note that the input RGB-D images in our method are in resolution, where actual skin micro/meso-structures and hair-level details are hardly visible (see Fig. 14 left column). The synthesized skin/hair details by our approach are actually plausible hallucination, which is critical for realistic rendering of digital humans. We further show the intermediate results of our albedo synthesis algorithm in Fig. 14. The unwrapped result (left) is extracted from the input photos. It can be seen from the close-up display that it is blurry and low-quality. Moreover, the overall map seems dirty due to the imperfect lighting removal step. After the regional fitting step, the result is much more clean but without hair/pore details. The final refined albedo map preserves the major facial features (e.g., the shape of eyebrows), while containing more high-quality, realistic details.
We first compare two variants of our detail enhancement models, i.e., with pix2pix (Isola et al., 2017) and pix2pix-HD (Wang et al., 2018). An example is shown below. The detail enhancement with pix2pix generally produces superior results than pix2pix-HD, especially around the eyebrow and cheek regions. Besides, we found pix2pix-HD is more difficult to train and the results are generally worse than pix2pix, possibly due to the small amounts of training data. Note that pix2pix-HD is adopted in AvatarMe (Lattas et al., 2020).
Fig. 17 shows a comparison between our method and a state-of-the-art super-resolution network (Zhang et al., 2018), which is trained on the same dataset as our detail enhancement network. The super-resolution network generally cannot yield hair-level details around eyebrows. Our regional fitting algorithm produces over-smoothed eyebrow strands, which can be further refined into clear hair-level details while preserving major eyebrow shape. Note that Yamaguchi et al. (2018) employed a super-resolution model for high-resolution texture synthesis.
We compare our result to two state-of-the-art high-resolution reflectance synthesis approaches (Yamaguchi et al., 2018), (Lattas et al., 2020) in Fig. 18. The hair-level details around eyebrows in our result are clearly with higher quality. Besides, the pore-level details on the cheek are more realistic in our result, while those in the other approaches seem noisy and unnatural.
We notice some commercial systems (e.g., Bellus3D (2020)) utilize RGB-D selfies to reconstruct static 3D face models and directly extract texture maps from input photos. Their systems commonly employ a model-free reconstruction approach like KinectFusion (Newcombe et al., 2011) and the results usually seem very faithful to input photos. However, there are several drawbacks in their results. First, as shown in Fig. 20, their reconstructed meshes are not topologically consistent and are prone to flaws. It would be difficult to attach accessories and animate them. Moreover, the extracted texture maps contain shadows and highlights, which are undesired since they cause severe unnatural issues when the rendered lighting is different from the captured lighting. In comparison, our results are high-quality and ready for realistic rendering and animation (see Fig. 20).
We conduct experiments for a user taking selfies in different lighting conditions (Fig. 19). The recovered shape and reflectance remain consistent regardless of different lighting conditions and poses. Note that there is an inherent decoupling ambiguity between skin color and illumination. The resulting skin color in the right of the figure is actually a little bit more yellowish due to the yellower input photo. However, the facial structures (like the eyebrow shape) in the resulting reflectance map are consistent.
Fig. 24 shows some of our rendered results with UE4. Thanks to the high-fidelity geometry and reflectance maps, the rendered results are realistic and faithful to input faces. Note the wrinkle details on the lips, the pore-level details on the cheek, the hair-level details of the eyebrows. Fig. 21 shows two examples of our results with attached hair models, which are retrieved from our hair model database by performing hairstyle classification on the selfie photos. More results are in the supplementary video.
Our approach does not take into account cross-ethnicity generalization. Since our 3DMM is constructed from East-Asian subjects, our evaluations are designed on the same ethnical population. We tried to directly apply our approach on some people from other ethnicities and Fig. 23 shows two examples. Although the results roughly resemble the subjects being captured, some ethnicity-specific facial features are not recovered. Besides, our approach cannot model facial hairs like moustache and beard. Children or aged people beyond the age scope of our dataset are also not considered in our approach.
The avatars created with our method are animation-ready. Animations can be retargeted from existing characters (Bouaziz and Pauly, 2014), or interactively keyframe-posed (Ichim et al., 2015), or even transferred from facial tracking applications (Weise et al., 2011). We demonstrate an application of lip-sync animation in the supplementary video, where a real-time multimodal synthesis system (Yu et al., 2019) is adopted to simultaneously synthesize speech and expression blendshape weights given input texts. Fig. 22 shows several snapshots of the animation. The application enables users to conveniently create high-fidelity, realistic digital humans that can be interacted with in real time. We also include another lip-sync animation result driven by speech inputs (Huang et al., 2020) in the supplementary video.
We have introduced a fully automatic system that can produce high-fidelity 3D facial avatars with a commercial RGB-D selfie camera. The system is robust, efficient, and consumer-friendly. The total acquisition and processing for a user can be finished in less than 30 seconds. The generated geometry models and reflectance maps are in very high fidelity and quality. With a physically based renderer, the assets can be used to render highly realistic digital humans. Our system provides an excellent consumer-level solution for users to create high-fidelity digital humans.
Future Work The animation with generic expression blendshapes are not satisfactory. We intend to extend our system to capture personalized expression blendshapes like Ichim et al. (2015). Besides, the current system employs very simple approaches to handle accessories like hair, eyeballs, and teeth. We intend to incorporate more advanced methods to model accessories.
Self-supervised learning of detailed 3d face reconstruction. IEEE Transactions on Image Processing. Cited by: §2.2.
Multi-view 3d face reconstruction with deep recurrent neural networks. Image and Vision Computing 80, pp. 80–91. Cited by: §2.2.
Image style transfer using convolutional neural networks. In Proc. CVPR, pp. 2414–2423. Cited by: §2.3.
Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §4, §7.
International journal of computer vision81 (2), pp. 155. Cited by: §4.
3D hair synthesis using volumetric variational autoencoders. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37 (6), pp. 1–12. Cited by: §2.4.