I Introduction
Recovery of 3D human facial geometry from a single color image is an ill-posed problem. Existing methods typically employ a parametric face modeling framework called the 3D morphable model (3DMM) [2]. A 3DMM contains a set of facial shape and texture bases built from real-world 3D face scans, and a linear combination of these bases synthesizes a 3D face model. During the training process, a loss function is constructed to measure the difference between the input face image and the 3D face model. The linear coefficients (i.e., 3DMM parameters) can be generated by minimizing the computed loss. While conventional methods learn these coefficients via analysis-by-synthesis optimization
[3, 19], recent studies have shown the effectiveness of regressing 3DMM parameters using CNN-based approaches [26, 24, 22, 13, 9].

Learning to regress 3DMM parameters via a CNN requires a large amount of data. For methods based on supervised learning, the ground-truth 3DMM parameters are generated by optimization-based fitting [26, 24] or synthetic data generation [17, 6]. The limitations are that the generated ground-truth labels are not accurate and the synthetic data lacks realism. In comparison, methods based on self-supervised learning [22, 9] do not employ this process but directly learn from unlabeled face images. (We do not distinguish between the terms "self-supervised" and "unsupervised", as both refer to learning without ground-truth annotations in our case; we prefer "self-supervised".) For example, MoFA [22] learns to regress 3DMM parameters by forcing the rendered images to have similar pixel colors to the input images in facial regions. However, enforcing pixel-level similarity does not imply similar facial identities. Genova et al. [9] render several face images from multiple views and use a face recognition network to measure the perceptual similarity between the input faces and the rendered faces. Although this method is capable of producing 3D models resembling the faces in the input images, it ignores local facial characteristics such as skin color and facial expression, which leads to unfaithful reconstructions.
To model facial details beyond 3DMM, several deep learning methods have recently been proposed. While some methods [12, 20] represent 3D faces entirely without 3DMM, they usually produce severely degraded results. More robust approaches typically represent 3D faces with detail layers in addition to 3DMM [18, 21, 25]. For example, learned parametric correctives are employed in [21], and 3D detail maps are employed in [18, 25]. Since the learned parametric correctives [21] have very limited expressive capability (see Fig. 1), we advocate 3D detail maps for detail modeling. However, existing approaches employing detail maps [18, 25] rely on surrogate ground-truth detail maps computed from traditional approaches, which are error-prone and limit the fidelity of the reconstruction.

In this paper, we propose a two-stage framework to regress 3DMM parameters and reconstruct facial details via self-supervised learning. In the first stage, we use a combination of multi-level loss terms to train the 3DMM regression network. These loss terms consist of a low-level photometric loss, a mid-level facial landmark loss, and a high-level facial perceptual loss, which enable the network to preserve both facial appearances and identities. In the second stage, we employ an image-to-image translation network to capture the missing details. We unwrap both the input image and the regressed 3DMM texture maps into UV space. The corresponding UV maps are sent together into the translation network to obtain a detailed displacement map in UV space. The displacement map and the coarse 3DMM model are rendered together into a final face image, which is enforced to be photometrically consistent with the input face image during training. The whole network can thus be trained end-to-end without any annotated labels. The advantage of detail modeling in UV space is that all training face images with different poses are aligned in UV space, which helps the network capture invariant details in spatial regions around facial components from a large amount of data.
The main contribution of our work is that we use a self-supervised approach to solve the challenging task of detailed 3D face reconstruction from a single RGB image and achieve high-quality results. We conduct extensive experiments and analysis to show the effectiveness of the proposed method. Compared with state-of-the-art approaches, the 3D face models produced by our method are generally more faithful to the input face images.
II Related Work
In this section, we briefly survey single-view 3D face reconstruction methods. These methods can be categorized into optimization-based, supervised learning, and self-supervised learning methods. A more complete review can be found in [27].
3DMM by Optimization.
The 3D morphable model (3DMM) is proposed in [2] to reconstruct a 3D face as a linear combination of shape and texture blendshapes. These blendshapes (i.e., bases) are extracted by PCA on aligned 3D face scans. Later, Cao et al. [5] bring facial expressions into 3DMM and introduce a bilinear face model named FaceWarehouse. Since then, reconstructing a 3D face from an input image can be formulated as generating the optimal 3DMM parameters, including shape, expression, and texture coefficients, such that the model-induced image is similar to the input image in predefined feature spaces. Under this formulation, the analysis-by-synthesis optimization framework [3, 19, 8, 23] is commonly adopted. However, these optimization-based approaches are parameter-sensitive, and unrealistic appearances often exist on the generated 3D models.
3DMM by Supervised Learning.
Methods based on supervised learning require ground-truth 3DMM labels obtained by either optimization-based fitting or synthetic data rendering. Zhu et al. [26] and Tran et al. [24] use 3DMM parameters generated by optimization-based approaches as ground truth to learn their CNN models. The performance of these methods is limited by the unreliable labels. On the other hand, other approaches [6, 17, 13] utilize synthetic data rendered with random 3DMM parameters for supervised learning. Dou et al. [6] propose to use synthetic face images together with corresponding 3D scans for network learning. Richardson et al. [17] train a 3DMM regression network with only synthetic rendered face images. Kim et al. [13] show that training with synthetic data can be adapted to real data with a bootstrapping algorithm. However, the performance of these methods is limited by the unrealistic input, and the resulting 3D face models do not resemble the input images.
3DMM by Self-supervised Learning.
Self-supervised methods derive supervision from input images without labels. MoFA [22] uses a pixel-wise photometric loss to ensure that the rendered image induced by the estimated 3DMM parameters is similar to the input image. However, the photometric loss makes the network attend to pixel-wise similarity between the rendered image and the input image, while the personal identity of the input face is ignored. Recently, Genova et al. [9] propose to enforce feature similarity between the rendered image and the input image via a fixed-weight face recognition network. The facial identity is preserved, but low-level feature similarity is missing (e.g., illumination, skin color, and facial expressions). This is because the method of Genova et al. [9] predicts 3DMM parameters from illumination- and expression-invariant facial feature embeddings instead of the original face images.

Detail Modeling beyond 3DMM.
Due to the limited expressive power of 3DMM, some approaches model facial details with additional layers built upon 3DMM. Examples following this line include depth maps [18], bump maps [25], and trainable corrective models [21]. Besides, some other work employs non-parametric 3D representations [12, 20] to gain more degrees of freedom, but these are usually less robust than methods built upon 3DMM. In this paper, we focus on 3D representations consisting of a coarse 3DMM model with additional detail layers. For the detail-layer representation, we prefer detail maps over trainable corrective models [21] due to their greater expressive power. Note that existing detail-map-based approaches [18, 25] are all based on supervised learning with surrogate ground-truth detail maps computed by traditional methods, while our approach is completely self-supervised.

In this paper, we propose a self-supervised model to learn 3D shape and texture in a coarse-to-fine procedure. In the coarse model, different from MoFA [22] and Genova et al. [9], we combine a low-level photometric loss and a high-level perceptual loss to provide multi-level supervision. In the fine model, we use the unwrapped input image and 3D model in UV space as inputs to the neural network, and make the network learn the detailed information from the difference between the real and rendered images in an aligned space. Different from Guo et al. [10], who use RGB-D data as model inputs and learn per-vertex normal displacements with UV maps, we use RGB images as inputs and build a UV render layer that establishes dense correspondence between the UV space and the rendered image space. These differences are vital for better face reconstruction.

III Proposed Algorithm
In this section, we describe the proposed method. We first give an overview of the proposed framework, and then provide a detailed explanation of the loss functions used to train the model.
III-A Framework Overview
Figure 2 shows the pipeline of the proposed framework. It consists of two modules: the 3DMM encoder and the detail modeling encoder-decoder network. The 3DMM encoder regresses the 3DMM parameters of an input image. In the detail modeling module, we unwrap both the input image and the corresponding 3D model into UV space. We send these two UV maps into the detail modeling encoder-decoder to measure their difference, where the missing details of the 3D face model reside. The output of the decoder is a displacement depth map that contains the missing details. The displacement depth map is then added back to the coarse model to enhance fine details on the surface. We achieve self-supervision through two technical novelties. First, we combine a low-level photometric loss and a high-level perceptual loss for the coarse model regression. Second, we unwrap both the input image and the regressed 3DMM texture maps into UV space, and the corresponding UV maps are sent together into the translation network to obtain the detailed displacement map in UV space.
We exploit the potential of the 3DMM encoder during the training process. We propose multiple loss terms to minimize the difference between the input image and the output image produced by a differentiable render layer from the coarse 3D face model. The difference is measured at hierarchical levels of facial perception, ranging from low-level photometric difference to high-level perceptual difference. The learned 3DMM encoder is capable of representing the input image in this parametric model. In the fine detail modeling stage, we mainly use a photometric loss to compare the input image and the output image rendered by a differentiable UV render layer. However, due to the limited image resolution, the input image often contains noise that might confuse the fine network with facial details. Thus, additional smoothness losses and regularization terms are included to encourage the proposed model to reconstruct smooth faces with fine details.
III-B 3DMM Encoder
We first revisit the 3DMM and explain the 3DMM encoder. Then, we introduce hierarchical loss terms to exploit the network's potential. The 3DMM shape parameters include the identity and expression coefficients, and the reconstructed 3D face model can be written as:
$S = \bar{S} + A_{id}\,\alpha_{id} + A_{exp}\,\alpha_{exp}$  (1)

where $\bar{S}$ is the vector format of the mean 3D face model, and $A_{id}$ and $A_{exp}$ are the identity bases and the expression bases from [2] and [5], respectively. The 3D reconstruction is formulated as regressing the 3DMM parameters $\alpha_{id}$ and $\alpha_{exp}$ as shown in Eq. 1.
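To make Eq. 1 concrete, the sketch below assembles a face shape from the bases; the array shapes and the function name `reconstruct_shape` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def reconstruct_shape(mean_shape, id_bases, exp_bases, alpha_id, alpha_exp):
    """Linear 3DMM: S = mean + A_id @ alpha_id + A_exp @ alpha_exp.

    mean_shape : (3N,)        stacked x/y/z coordinates of the mean face
    id_bases   : (3N, K_id)   identity basis (e.g., from BFM [2])
    exp_bases  : (3N, K_exp)  expression basis (e.g., from FaceWarehouse [5])
    """
    return mean_shape + id_bases @ alpha_id + exp_bases @ alpha_exp
```

In practice the bases come pre-orthogonalized from PCA, so regressing the coefficients is a low-dimensional problem compared with predicting all vertices directly.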
Our 3DMM encoder shares similarities with existing methods [22, 9]. It takes a color face image as input and transforms it progressively into a latent code vector using multiple convolutional layers and nonlinear activations. Specifically, we adopt the VGG-Face [15] structure in the 3DMM encoder. As we notice feature representation discrepancies between 2D face images and 3D face models, we randomly initialize the network parameters and train them from scratch. During the training process, we project the output 3D face model into a 2D face image. The loss functions are mainly designed to measure the difference between the projected face and the input face. The total loss function for training the 3DMM-based coarse model is denoted as:
$\mathcal{L}_{coarse} = w_{p}\mathcal{L}_{photo} + w_{l}\mathcal{L}_{lan} + w_{id}\mathcal{L}_{id} + w_{r}\mathcal{L}_{reg}$  (2)

where $\mathcal{L}_{photo}$ is the photometric loss, $\mathcal{L}_{lan}$ is the landmark consistency loss, $\mathcal{L}_{id}$ is the perceptual identity loss, and $\mathcal{L}_{reg}$ is the 3DMM parameter regularization term. The weights $\{w_{p}, w_{l}, w_{id}, w_{r}\}$ control the influence of each term and are set to constant values. We detail each loss term in the following.
III-B1 Photometric Loss
The photometric loss measures the pixel-wise difference between the input face image and the rendered face image. We denote the input face image as $I$ and the rendered face image as $I_r$. The loss function is defined as follows:
$\mathcal{L}_{photo} = \frac{1}{|\mathcal{M}|}\sum_{p\in\mathcal{M}} \left\| I(p) - I_r(p) \right\|_2$  (3)

where $\mathcal{M}$ is the set of visible pixels on $I_r$ and $p$ is the location of each visible pixel. We compute the photometric loss by averaging the distances over all visible pixels.
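A minimal sketch of this masked photometric loss, assuming the images are float arrays and visibility is given as a boolean mask (both assumptions about the data layout):

```python
import numpy as np

def photometric_loss(input_img, rendered_img, visibility_mask):
    """Mean per-pixel L2 color distance over visible (rendered) pixels.

    input_img, rendered_img : (H, W, 3) float arrays
    visibility_mask         : (H, W) boolean mask of pixels covered by the mesh
    """
    diff = input_img - rendered_img               # (H, W, 3)
    dist = np.linalg.norm(diff, axis=-1)          # per-pixel color distance
    return dist[visibility_mask].mean()           # average over visible pixels
```

Restricting the average to visible pixels matters: background pixels carry no gradient for the face model and would otherwise dilute the loss.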
III-B2 Landmark Consistency Loss
The landmark consistency loss measures the distance between the 68 detected landmarks in the input face image and the rendered locations of the 68 key points in the 3D mesh. The 2D landmarks on the images are detected by [4]. The loss function is defined as:
$\mathcal{L}_{lan} = \frac{1}{N}\sum_{i=1}^{N} \left\| q_i - q'_i \right\|_2^2$  (4)

where $q_i$ is the $i$-th landmark position in the input face image, $q'_i$ is the corresponding $i$-th landmark position in the rendered face image, and $N$ is the number of landmarks. The landmark consistency loss effectively controls the pose and expression of the 3D face model based on the guidance of the input face image.
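The landmark term can be sketched as a mean squared distance over the detected and projected landmark pairs (the helper name is hypothetical; the paper does not publish code):

```python
import numpy as np

def landmark_loss(detected_2d, projected_2d):
    """Mean squared distance between N detected and N projected landmarks.

    detected_2d, projected_2d : (N, 2) arrays of pixel coordinates,
    with matching row order (landmark i in one corresponds to row i in the other).
    """
    return np.mean(np.sum((detected_2d - projected_2d) ** 2, axis=-1))
```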
III-B3 Perceptual Identity Loss
The perceptual identity loss reflects the perceptual similarity between two images. We send both the input face image and the rendered face image into the VGG-Face recognition network [15] for feature extraction. We denote the extracted CNN features of the input face image as $f(I)$ and the features of the rendered face image as $f(I_r)$. The perceptual consistency loss is defined as:

$\mathcal{L}_{id} = 1 - \frac{\langle f(I), f(I_r) \rangle}{\left\| f(I) \right\|_2 \left\| f(I_r) \right\|_2}$  (5)

where $f(\cdot)$ denotes the VGG-Face network [15], whose parameters are kept fixed during the training process.
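Assuming the cosine form of Eq. 5, the loss on two fixed feature embeddings can be sketched as follows (the embeddings would come from the frozen VGG-Face network; here they are plain vectors):

```python
import numpy as np

def perceptual_identity_loss(feat_input, feat_rendered):
    """Cosine distance between two face-recognition embeddings.

    Returns 0 for identical directions and approaches 2 for opposite ones.
    """
    a = feat_input / np.linalg.norm(feat_input)
    b = feat_rendered / np.linalg.norm(feat_rendered)
    return 1.0 - float(a @ b)
```

Because the recognition network is illumination- and expression-invariant, this term preserves identity while the photometric and landmark terms handle appearance and geometry.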
III-B4 Regularization Term
We propose a regularization term for the 3DMM parameters. Since the 3DMM parameters are subject to a normal distribution, we must prevent their values from deviating too far from zero; otherwise, the 3D faces reconstructed from the parameters are distorted. The regularization term is:

$\mathcal{L}_{reg} = w_{i}\left\|\alpha_{id}\right\|_2^2 + w_{e}\left\|\alpha_{exp}\right\|_2^2 + w_{t}\left\|\alpha_{tex}\right\|_2^2$  (6)

where $w_{i}$, $w_{e}$, and $w_{t}$ are constant values.
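A sketch of this weighted squared-norm penalty on the three coefficient groups; the default weight values here are placeholders, not the paper's settings:

```python
import numpy as np

def param_regularizer(alpha_id, alpha_exp, alpha_tex,
                      w_id=1.0, w_exp=1.0, w_tex=1.0):
    """Weighted squared-L2 penalty keeping 3DMM coefficients near zero,
    which keeps the reconstruction close to the PCA prior."""
    return (w_id * np.sum(alpha_id ** 2)
            + w_exp * np.sum(alpha_exp ** 2)
            + w_tex * np.sum(alpha_tex ** 2))
```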
III-C Detail Modeling
The detail modeling framework is an image-to-image translation network in UV space. The framework consists of an encoder-decoder network with skip connections. The input image and the reconstructed coarse 3D face model are unwrapped into two UV texture maps of the same resolution. These two UV maps are concatenated, and the invisible regions are masked out based on the estimated 3D pose. The concatenated UV maps are then fed into the encoder-decoder network, which produces a displacement depth map. This map is added to the UV position map of the coarse 3DMM to generate a refined UV map, which is warped back to a 2D face image by the UV render layer. We compare the output 2D image of the detail modeling network with the input image using a photometric loss. The refined UV position map is used to reconstruct a 3D face with fine details. During training, we add a smoothness loss and regularization terms together with the photometric loss to facilitate the network learning process. The smoothness loss and regularization terms are imposed on the displacement depth maps to reduce both artifacts and distortions in the reconstruction process.
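The input preparation described above (concatenating the two UV maps and masking invisible regions) can be sketched as follows; the names and shapes are illustrative assumptions:

```python
import numpy as np

def prepare_detail_input(uv_input_img, uv_coarse_tex, uv_visibility):
    """Build the 6-channel input of the detail encoder-decoder.

    uv_input_img  : (H, W, 3) input image unwrapped into UV space
    uv_coarse_tex : (H, W, 3) coarse 3DMM texture in UV space
    uv_visibility : (H, W) boolean, False where the pose makes UV texels invisible
    """
    x = np.concatenate([uv_input_img, uv_coarse_tex], axis=-1)  # (H, W, 6)
    return x * uv_visibility[..., None]                         # zero out hidden regions
```

Masking prevents the network from hallucinating details in regions the camera never observed.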
The total loss function of the detail modeling network is:
$\mathcal{L}_{fine} = w_{p}\mathcal{L}_{photo} + w_{s}\mathcal{L}_{sm} + w_{r}\mathcal{L}_{reg}$  (7)

where $\mathcal{L}_{photo}$ is the photometric loss measuring the pixel-wise difference between the input image and the 2D face image warped from the detail-enhanced UV map. The term $\mathcal{L}_{sm}$ is the smoothness loss, and $\mathcal{L}_{reg}$ is the regularization term on the displacement map. The weights $\{w_{p}, w_{s}, w_{r}\}$ are constant values that balance the influence of each loss term. We introduce the smoothness loss and the regularization terms in the following.
III-C1 Smoothness Loss
We impose the smoothness loss on both the UV displacement normal map and the displacement depth map to encourage neighboring pixels on these maps to have similar values. Another advantage of the smoothness loss is that it ensures robustness to mild occlusions. The smoothness loss can be written as:
$\mathcal{L}_{sm} = w_{n}\sum_{v}\sum_{u\in\mathcal{N}(v)} \left\| \Delta_n(v) - \Delta_n(u) \right\|_2^2 + w_{d}\sum_{v}\sum_{u\in\mathcal{N}(v)} \left\| \Delta_d(v) - \Delta_d(u) \right\|_2^2$  (8)

where $\Delta_n(v)$ is the difference measurement at pixel $v$ in the UV map: it computes the pixel distance between the original UV normal map and the UV normal map integrated with the displacement depth map, i.e., the difference of the UV normal map before and after adding the displacement map. Similarly, $\Delta_d(v)$ computes the pixel distance between the original displacement depth map and the updated displacement depth map. The $v$ are vertices in UV space and $\mathcal{N}(v)$ is the neighborhood of vertex $v$ with a radius of 1. The weights $w_n$ and $w_d$ combine the two smoothing terms and are empirically set to 20 and 10, respectively.
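One plausible reading of Eq. 8 penalizes neighborhood differences of the per-pixel change maps; the sketch below uses 4-neighborhoods and the empirical weights 20 and 10 from the text, while the exact neighborhood definition and map layout are assumptions:

```python
import numpy as np

def neighbor_smoothness(delta):
    """Sum of squared differences between each UV pixel of a change map
    and its horizontal/vertical neighbors (radius 1).

    delta : (H, W) per-pixel change of the normal or depth map after
            adding the predicted displacement
    """
    dx = np.sum((delta[:, 1:] - delta[:, :-1]) ** 2)  # horizontal neighbors
    dy = np.sum((delta[1:, :] - delta[:-1, :]) ** 2)  # vertical neighbors
    return dx + dy

def smoothness_loss(delta_n, delta_d, w_n=20.0, w_d=10.0):
    """Combine the normal-map and depth-map terms with the empirical
    weights from the paper (20 and 10)."""
    return w_n * neighbor_smoothness(delta_n) + w_d * neighbor_smoothness(delta_d)
```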
III-C2 Regularization Term
We impose regularization terms on both the displacement depth map and the displacement normal map to reduce severe depth changes, which may distort the face on the 3D mesh. The regularization term can be written as:

$\mathcal{L}_{reg} = w_{rd}\sum_{v}\left\| \Delta_d(v) \right\|_2^2 + w_{rn}\sum_{v}\left\| \Delta_n(v) \right\|_2^2$  (9)

where $w_{rd}$ and $w_{rn}$ are constant weights.
III-D Camera View
The pose parameters in the proposed model include a scale $s$, rotation angles (in radians) $\theta$, and a translation $t$. We apply orthographic projection to project the 3D vertices into 2D. We denote a vertex in 3D as $V$, the projection operation as $\Pi$, and the projected 2D point as $v$. Then we have:

$v = s \, \Pi \, R \, V + t$  (10)

where $R$ is the rotation matrix computed from $\theta$, and $t$ is the translation vector.
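A sketch of the scaled orthographic projection in Eq. 10; the Euler-angle convention is an assumption, since the paper does not specify the rotation order:

```python
import numpy as np

def euler_to_rotation(pitch, yaw, roll):
    """Rotation matrix from Euler angles in radians (Z·Y·X order here,
    one common convention)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def orthographic_project(vertices, scale, R, t2d):
    """Scaled orthographic projection: rotate, drop z, scale, translate.

    vertices : (N, 3), R : (3, 3), t2d : (2,)
    """
    rotated = vertices @ R.T          # (N, 3), row vectors
    return scale * rotated[:, :2] + t2d
```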
III-E Rendering Layer
The rendering layer is a modification of [9]. We use spherical harmonics as our lighting model instead of the Phong reflection model, and we use orthographic projection instead of full perspective projection.
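Second-order spherical harmonics shading, the usual 9-coefficient Lambertian model, can be sketched as follows (per color channel; the constant SH normalization factors are assumed folded into the learned coefficients):

```python
import numpy as np

def sh_irradiance(normals, coeffs):
    """Per-vertex irradiance from 9 spherical-harmonics coefficients.

    normals : (N, 3) unit surface normals
    coeffs  : (9,) lighting coefficients for one color channel
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    basis = np.stack([
        np.ones_like(x),       # band 0 (constant)
        y, z, x,               # band 1 (linear)
        x * y, y * z,          # band 2 (quadratic)
        3 * z ** 2 - 1,
        x * z, x ** 2 - y ** 2,
    ], axis=1)                 # (N, 9)
    return basis @ coeffs
```

Nine coefficients per channel suffice for smooth Lambertian lighting, which is why this model is popular for face shading.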
III-F UV Render Layer
The UV render layer takes two inputs. One is a coarse UV position map built by unwrapping the coarse face mesh. The other is the predicted displacement depth map in UV space from the detail model. Using these two inputs, the UV render layer first computes the detailed UV map by adding the displacement to the coarse UV map. Then, for each pixel coordinate, it finds the best correspondence in the UV map and fills in the depth value. It finally renders an output image based on other parameters such as pose, lighting, and texture from the 3DMM regression network. The triangles of the output mesh are redefined by neighboring pixels in the output UV maps. Thus, the UV render layer performs dense rendering compared with the per-vertex render layer used in the coarse model.
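The first step of this layer, adding the predicted displacement to the coarse UV position map, might look like the sketch below; displacing along per-pixel normals is an assumption (displacing along the camera z-axis is the other common choice):

```python
import numpy as np

def refine_uv_position(uv_position, displacement, uv_normals):
    """Offset each UV-space surface point along its normal by the
    predicted displacement depth.

    uv_position  : (H, W, 3) coarse 3D position per UV texel
    displacement : (H, W)    predicted displacement depth map
    uv_normals   : (H, W, 3) unit normals per UV texel
    """
    return uv_position + displacement[..., None] * uv_normals
```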
IV Experiments
In this section, we present the implementation details of the proposed method. Then, we evaluate the proposed method and make comparisons with state-of-the-art methods, including MoFA [22] and Genova18 [9] for the coarse model, and Tran18 [25] and Sela17 [20] for the fine model. More results are provided in the supplementary files. Our implementation will be made available to the public.
IV-A Implementation Details
Our network training consists of two stages. In the first stage, we train the 3DMM encoder to obtain the 3DMM parameters; the reconstructed 3D face resembles the input image in general facial perception. In the second stage, we fix the parameters of the 3DMM encoder and apply an image-to-image translation network in UV space to estimate a displacement depth map.
Our training data is from the CelebA dataset [14]. Before training, we use the landmark detector [4] to exclude failure samples. Then, we separate the remaining images into two parts: a training set containing 162,129 images and a testing set containing 19,899 images. The network structure of the 3DMM encoder is the same as that of the VGG-Face model [15]. We randomly initialize all the weights in our network and train them from scratch.
When projecting the 3D face models into 2D images, we follow [9] to construct a differentiable layer. The weights controlling the total loss for the coarse model in Eq. 2, the weights in the regularization term in Eq. 6, and the weights of the total loss for the fine model in Eq. 7 are all set to constant values. Both the coarse model and the fine model are trained with fixed initial learning rates that decay at a fixed rate every fixed number of steps. We adopted the Adam optimizer to train the network on an NVIDIA Tesla M40 GPU.
IV-B Evaluation on 3DMM Regression
IV-B1 Shape Analysis
TABLE II: Shape reconstruction error under three conditions for different training-loss configurations.

Method | photo | id | Indoor Cooperative | PTZ Indoor | PTZ Outdoor
MoFA [22] | ✓ | | | |
Genova18 [9] | | ✓ | | |
Ours-3DMM | ✓ | ✓ | | |
We evaluate the shape reconstruction precision for different configurations on the MICC Florence dataset [1]. In this dataset, videos of 53 subjects are taken under three different conditions: Indoor Cooperative, PTZ Indoor, and PTZ Outdoor. Ground-truth 3D scans are provided for 52 of the 53 subjects. We use each video frame as the network input. Before evaluation, we remove the frames where the faces are not detected by the landmark detector [4]. The 3D shape for each video sequence is reconstructed using the average of the shape parameters over the remaining frames. We follow the procedure in [9] to compute the point-to-plane error between the predicted 3D face models and the ground-truth scans. Table II lists all the models trained with different configurations for comparison. Existing methods [22, 9] use either the photometric loss or the identity loss to train the encoder, and their performance is similar under all three conditions. When we involve both the photometric loss and the identity loss during training, the performance is improved compared with [22, 9], which adopt only one of the two losses.
We perform the qualitative evaluation in three parts: shape and texture, expression, and lighting and albedo. Figure 3 shows test images and their reconstructed shape and texture results on the CelebA [14] and LFW [11] datasets. The proposed method is compared with MoFA [22] and Genova18 [9] for performance evaluation. The predicted expression parameters of MoFA [22] and the proposed method are excluded in this comparison.
Compared with MoFA [22] and Genova18 [9], the proposed method generates faces with varying global shapes according to the identity differences in the input images. Although changes in overall shape can also be observed for Genova18 [9], they are sometimes inconsistent with the input images. For example, the face generated by Genova18 [9] in Row 7 is too short along the vertical axis compared with the corresponding input image.

To show the results more clearly, we manually adjust the generated faces to almost the same size. We notice that some reconstructions by Genova18 [9] are extremely large or small, while MoFA [22] and the proposed approach produce faces of more appropriate size. This phenomenon may indicate that the landmark and photometric losses in MoFA [22] and the proposed approach provide constraints on face size.

Regarding texture, we notice that MoFA [22] tends to generate textures with smooth and shallow colors. The results from Genova18 [9] are more realistic but do not show sufficient skin color distinction between people of different races. In contrast, the proposed method shows more color diversity across individuals.
TABLE III: Mean and standard deviation of the point-to-point RMSE on FaceWarehouse [5] for MoFA and the coarse model of the proposed method.

Method | MoFA [22] | Ours
Error | |

IV-B2 Expression Analysis
We evaluate the expression reconstruction results on the FaceWarehouse dataset [5], which contains 150 subjects in 20 different expressions. We compare the proposed method with MoFA [22]. We find the correspondence between the BFM 2009 model [2] and the FaceWarehouse model [5]. Then, we apply rigid transforms to align the predicted meshes with the ground-truth scans provided in FaceWarehouse [5]. We compute the point-to-point error as the root-mean-square error (RMSE) of the distances between corresponding vertices of the two meshes.
Table III shows the mean and standard deviation of the point-to-point RMSE on the FaceWarehouse [5] dataset. The proposed method outperforms MoFA [22] significantly, indicating that it can reconstruct facial expressions with higher precision.
Since Genova18 [9] does not estimate expression parameters, we only compare MoFA [22] and the proposed model for expressions on the FaceWarehouse dataset [5]. Figure 4 shows the expression reconstruction results of the two methods. When a commonly seen expression (i.e., smiling with visible teeth) appears on the subject of the input image, as shown in the first row, both MoFA and the proposed method reconstruct the 3D model effectively; meanwhile, the 3D model generated by the proposed method contains more identity-specific details. When an uncommon expression (i.e., a pouted mouth) appears, as in the second row, MoFA does not reconstruct the 3D model effectively while the proposed method does. When the expression is extreme (i.e., a largely open mouth and closed eyes), as shown in the last row, neither method performs well; however, the proposed method still performs favorably against MoFA.
IV-B3 Lighting and Albedo
Figure 5 visualizes the albedo and lighting reconstruction results of MoFA and the proposed method. We set the meshes to white for clear display. Though the overall colors look similar between the two results, the lighting and albedo are different. The colors in the overlay of MoFA [22] come mostly from lighting, which leads to smooth and fair albedo. In comparison, the albedo of the proposed method is more consistent with the input faces, and the lighting color is close to white, which is more common in real-world scenarios. The favorable performance of the proposed approach is owing to the use of the lighting-insensitive identity loss, which assists the network in decoupling lighting and albedo.
IV-C Evaluation on Final Reconstruction
We evaluate the visual quality of the 3D models reconstructed by different methods. The input testing images are from the CelebA [14] dataset. Figure 6 shows the evaluation results. We observe that the 3D face models generated by Tran18 are often noisy, as high-frequency details are spread all over the meshes regardless of the input images. For example, in the third row of Fig. 6, the mouth region of the input face is smooth without salient texture, but the corresponding 3D face model from Tran18 is noisy around the mouth. Meanwhile, the wrinkles on the cheeks are not reconstructed well according to the input image. Similar phenomena appear on other 3D face models, which do not faithfully represent the input faces. On the other hand, the 3D models generated by Sela17 are erroneous and do not accurately convey the facial components of the input face images; sometimes Sela17 even fails to generate face-like meshes. In comparison, the proposed method effectively generates 3D models that preserve the global shape and structure, and the facial details are enriched on the generated models.
TABLE IV: Depth estimation errors on the FRGC2 dataset.

Method | Tran18 [25] | Sela17 [20] | Ours
Error (mm) | | |
(Figure 8: close-up comparisons among Ours, Tran18, and Sela17.)
Figure 7 compares the results of the 3DMM regression and the fine model. The fine model retains more details and expresses more vivid facial characteristics than the 3DMM model alone. In addition, we compare close-ups of the reconstruction results of the fine model with Tran18 [25] and Sela17 [20] in Figure 8. Compared with Tran18 and Sela17, our fine model strengthens the details with less noise and distortion, showing more robustness than the other two approaches. Our model also closely resembles the input images while preserving the detailed wrinkles. As shown in Row 4, distortion in the results of Tran18 and Sela17 can be easily observed, making the reconstructed mouth unnatural. Moreover, the positions of the wrinkles shift in Tran18 and Sela17, while our model successfully reconstructs the details.
We perform quantitative evaluation on the MICC Florence dataset [1] and the Face Recognition Grand Challenge V2 (FRGC2) dataset [16]. On MICC we evaluate the shape reconstruction precision, and on FRGC2 we evaluate the depth estimation errors. We first evaluate the shape reconstruction precision on the MICC dataset. As we aim to model facial details, we select the frontal video frames of each subject for evaluation; frontal frames only exist in the Indoor Cooperative condition. We compute the point-to-point error between the reconstructed 3D faces and the ground-truth scans. We first crop each ground-truth scan to 95 mm around the tip of the nose. Then, we run ICP (iterative closest point) with isotropic scale to find an alignment between the ground truth and the reconstruction. The point-to-point distances are then computed for each subject.
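After cropping and ICP alignment, the point-to-point error can be sketched as a nearest-neighbor distance; the brute-force search below is illustrative (a KD-tree would be used for real meshes), and "point-to-point" is read here as nearest-point distance:

```python
import numpy as np

def point_to_point_error(pred_vertices, gt_vertices):
    """Mean nearest-neighbor distance from predicted to ground-truth points.

    pred_vertices : (N, 3), gt_vertices : (M, 3), both already aligned.
    """
    # (N, M) matrix of squared distances between every pair of points
    d2 = ((pred_vertices[:, None, :] - gt_vertices[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1)).mean()
```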
TABLE V: Point-to-point errors on the MICC dataset.

Method | Tran18 [25] | Sela17 [20] | Ours
Error (mm) | | |
Table V shows the comparison results. The proposed method achieves lower errors in both mean and standard deviation on the MICC dataset, demonstrating that it can reconstruct fine details with higher accuracy and stability than the other methods. During the evaluation process, Sela17 [20] fails to reconstruct 5 subjects, while Tran18 [25] and the proposed method successfully reconstruct all subjects. The failures of Sela17 [20] appear as faces that are distorted or not reconstructed at all; thus, ICP is unstable for Sela17 and the resulting error may be large. Figure 9 shows some error maps and Figure 10 shows the per-subject errors. Sela17 is not robust compared with Tran18 and the proposed method, with higher error and standard deviation. In contrast, the proposed method stably produces state-of-the-art results.
Besides MICC, we also evaluate on the FRGC2 dataset, where we estimate the depth of the input face images. The dataset provides around 5000 images and the corresponding ground-truth depth maps. We evaluate the depth estimation results of all the methods. To calculate the depth error, we first scale the depth estimation of each method to fit the range of the ground-truth depths. The mean distance between the two depth maps at the valid pixel positions provided by a fixed binary mask is then computed as the depth error. Table IV shows the depth estimation results of these methods. The proposed method achieves the lowest mean and standard deviation of depth errors compared with the other approaches, indicating that the proposed model generates 3D faces with higher accuracy.
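The depth-error protocol (scale-fit, then masked mean distance) can be sketched as below; the closed-form least-squares scale fit is an assumption about how "scale to fit the ground truth" is implemented:

```python
import numpy as np

def depth_error(pred_depth, gt_depth, valid_mask):
    """Scale-fit predicted depth to the ground truth over valid pixels,
    then report the mean absolute distance at those pixels.

    pred_depth, gt_depth : (H, W) float maps, valid_mask : (H, W) boolean
    """
    p = pred_depth[valid_mask]
    g = gt_depth[valid_mask]
    s = (p @ g) / (p @ p)          # least-squares scale minimizing ||s*p - g||
    return np.abs(s * p - g).mean()
```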
IV-D Application
An additional advantage of employing UV position maps [7] for reconstruction is that they enable easier integration of 3D reconstructions from different views of the same face. The UV maps from different views can be combined by simple blending in UV space. The combined full UV map then represents a complete 3D face model covering regions visible from the different views. The induced detailed 3D reconstruction is more complete compared with depth-map-based representations. Figure 11 shows several examples of blending two partial UV maps from two views of the same face.
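The UV-space blending of two partial maps might be implemented as in the following sketch; averaging where both views are visible is an assumption (the paper only says "simple blending"):

```python
import numpy as np

def blend_uv_maps(uv_a, uv_b, vis_a, vis_b):
    """Merge two partial UV maps from two views of the same face.

    uv_a, uv_b   : (H, W, C) UV maps, vis_a, vis_b : (H, W) boolean visibility
    Average where both views see the texel, otherwise take the visible one.
    """
    out = np.zeros_like(uv_a)
    only_a = vis_a & ~vis_b
    only_b = vis_b & ~vis_a
    both = vis_a & vis_b
    out[only_a] = uv_a[only_a]
    out[only_b] = uv_b[only_b]
    out[both] = 0.5 * (uv_a[both] + uv_b[both])
    return out
```

Because all faces share one UV parameterization, no explicit correspondence search is needed; the blend is a per-texel operation.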
[Figure 12: failure cases under extreme expression, occlusion, and large pose; each pair shows the input image and the reconstructed shape.]
IV-E Limitation Analysis
Figure 12 shows some failure cases of the proposed model under difficult scenarios, including extreme expressions, occlusions, and large poses. Since the proposed model is not trained with a specific policy for handling these difficult conditions, whenever the input images show extreme expressions or miss information due to large poses or occlusions, the reconstruction results are often unsatisfactory.
V Conclusion
We propose a detailed 3D face reconstruction framework based on self-supervised learning. We use a coarse 3DMM encoder to reconstruct the general 3D face model and capture facial details in UV space. When learning the 3DMM encoder, we incorporate multiple loss terms ranging from pixel-wise similarity to global facial perception. After learning the 3DMM parameters, we unwrap both the input face image and the 3D model into UV space, where all faces are precisely aligned. The details from the input image are effectively transferred to the 3D model in the detail modeling step, as the aligned facial details facilitate the learning process. Experiments on benchmark datasets indicate that the proposed method performs favorably against state-of-the-art 3D face modeling approaches.
References
 [1] (2011) The Florence 2D/3D hybrid face dataset. In Joint ACM Workshop on Human Gesture and Behavior Understanding.
 [2] (1999) A morphable model for the synthesis of 3D faces. In SIGGRAPH.
 [3] (2003) Face recognition based on fitting a 3D morphable model. TPAMI.
 [4] (2017) How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In ICCV.
 [5] (2014) FaceWarehouse: a 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics.
 [6] (2017) End-to-end 3D face reconstruction with deep neural networks. In CVPR.
 [7] (2018) Joint 3D face reconstruction and dense alignment with position map regression network. In ECCV.
 [8] (2013) Reconstructing detailed dynamic face geometry from monocular video. In SIGGRAPH.
 [9] (2018) Unsupervised training for 3D morphable model regression. In CVPR.
 [10] (2019) CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (6), pp. 1294–1307.
 [11] (2007) Labeled Faces in the Wild: a database for studying face recognition in unconstrained environments. Technical report.
 [12] (2017) Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In ICCV.
 [13] (2018) InverseFaceNet: deep monocular inverse face rendering. In CVPR.
 [14] (2015) Deep learning face attributes in the wild. In ICCV.
 [15] (2015) Deep face recognition. In BMVC.
 [16] (2005) Overview of the face recognition grand challenge. In CVPR.
 [17] (2016) 3D face reconstruction by learning from synthetic data. In 3DV.
 [18] (2017) Learning detailed face reconstruction from a single image. In CVPR.
 [19] (2005) Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In CVPR.
 [20] (2017) Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV.
 [21] (2018) Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In CVPR.
 [22] (2017) MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV.
 [23] (2016) Face2Face: real-time face capture and reenactment of RGB videos. In CVPR.
 [24] (2017) Regressing robust and discriminative 3D morphable models with a very deep neural network. In CVPR.
 [25] (2018) Extreme 3D face reconstruction: seeing through occlusions. In CVPR.
 [26] (2016) Face alignment across large poses: a 3D solution. In CVPR.
 [27] (2018) State of the art on monocular 3D face reconstruction, tracking, and applications. In Computer Graphics Forum.