Self-supervised Learning of Detailed 3D Face Reconstruction

10/25/2019 ∙ by Yajing Chen, et al.

In this paper, we present an end-to-end learning framework for detailed 3D face reconstruction from a single image. Our approach uses a 3DMM-based coarse model and a displacement map in UV-space to represent a 3D face. Unlike previous work addressing the problem, our learning framework does not require supervision from surrogate ground-truth 3D models computed with traditional approaches. Instead, we utilize the input image itself as supervision during learning. In the first stage, we combine a photometric loss and a facial perceptual loss between the input face and the rendered face to regress a 3DMM-based coarse model. In the second stage, both the input image and the regressed texture of the coarse model are unwrapped into UV-space and then sent through an image-to-image translation network to predict a displacement map in UV-space. The displacement map and the coarse model are used to render a final detailed face, which is again compared with the original input image to serve as a photometric loss for the second stage. The advantage of learning the displacement map in UV-space is that face alignment is done explicitly during the unwrapping, so facial details are easier to learn from a large amount of data. Extensive experiments demonstrate the superiority of the proposed method over previous work.




I Introduction

Recovering 3D human facial geometry from a single color image is an ill-posed problem. Existing methods typically employ a parametric face modeling framework called the 3D morphable model (3DMM) [2]. A 3DMM contains a set of facial shape and texture bases, which are built from real-world 3D face scans. A linear combination of these bases synthesizes a 3D face model. During the training process, a loss function is constructed to measure the difference between the input face image and the 3D face model. The linear coefficients (i.e., 3DMM parameters) are obtained by minimizing this loss. While conventional methods estimate these coefficients via analysis-by-synthesis optimization [3, 19], recent studies have shown the effectiveness of regressing 3DMM parameters using CNN-based approaches [26, 24, 22, 13, 9].

Fig. 1: Reconstruction of 3D faces of Tewari18[21] and the proposed model.

Learning to regress 3DMM parameters via a CNN requires a large amount of data. For methods based on supervised learning, the ground-truth 3DMM parameters are generated by optimization-based fitting [26, 24] or synthetic data generation [17, 6]. The limitations are that the generated ground-truth labels are not accurate and the synthetic data lacks realism. In comparison, methods based on self-supervised learning¹ [22, 9] skip this process and learn directly from unlabeled face images. For example, MoFA [22] learns to regress 3DMM parameters by forcing the rendered images to have pixel colors similar to those of the input images in facial regions. However, enforcing pixel-level similarity does not imply similar facial identities. Genova et al. [9] render several face images from multiple views and use a face recognition network to measure the perceptual similarity between the input faces and the rendered faces. Although the method is capable of producing 3D models resembling the faces in the input images, it ignores local facial characteristics such as skin color and facial expression, which leads to unfaithful reconstructions.

¹ In this paper we do not distinguish between the terms “self-supervised” and “unsupervised”, as both refer to learning without ground-truth annotations in our case. We prefer the term “self-supervised”.

To model facial details beyond 3DMM, a few deep learning methods have recently been proposed. While some methods [12, 20] represent 3D faces entirely without 3DMM, they usually produce severely degraded results. More robust approaches typically represent 3D faces with detail layers in addition to 3DMM [18, 21, 25]. For example, learned parametric correctives are employed in [21], and 3D detail maps are employed in [18, 25]. Since the learned parametric correctives [21] have very limited expressive capability (see Fig. 1), we advocate 3D detail maps for detail modeling. However, existing approaches employing detail maps [18, 25] rely on surrogate ground-truth detail maps computed with traditional approaches, which are error-prone and limit the fidelity of the reconstruction.

In this paper, we propose a two-stage framework to regress 3DMM parameters and reconstruct facial details via self-supervised learning. In the first stage, we use a combination of multi-level loss terms to train the 3DMM regression network. These loss terms consist of a low-level photometric loss, a mid-level facial landmark loss, and a high-level facial perceptual loss, which enable the network to preserve both facial appearance and identity. In the second stage, we employ an image-to-image translation network to capture the missing details. We unwrap both the input image and the regressed 3DMM texture map into UV-space. The corresponding UV maps are sent together into the translation network to obtain the detailed displacement map in UV-space. The displacement map and the 3DMM coarse model are rendered together into a final face image, which is enforced to be photometrically consistent with the input face image during training. As a result, the whole network can be trained end-to-end without any annotated labels. The advantage of detail modeling in UV-space is that all training face images with different poses are aligned in UV-space, which helps the network capture invariant details in spatial regions around facial components from a large amount of data. The main contribution of our work is a self-supervised approach to the challenging task of detailed 3D face reconstruction from a single RGB image, which achieves high-quality results. We conduct extensive experiments and analysis to show the effectiveness of the proposed method. Compared with state-of-the-art approaches, the 3D face models produced by our method are generally more faithful to the input face images.

II Related Work

In this section, we briefly survey single-view 3D face reconstruction methods. These methods can be categorized as optimization-based, supervised-learning-based, and self-supervised-learning-based methods. A more complete review can be found in [27].

3DMM by Optimization.

The 3D morphable model (3DMM) is proposed in [2] to reconstruct a 3D face by a linear combination of shape and texture blendshapes. These blendshapes (i.e., bases) are extracted by PCA on aligned 3D face scans. Later, Cao et al. [5] bring facial expressions into 3DMM and introduce a bilinear face model named FaceWarehouse. Since then, reconstructing a 3D face from an input image has been formulated as finding the optimal 3DMM parameters, including shape, expression, and texture coefficients, such that the model-induced image is similar to the input image in predefined feature spaces. Under this formulation, the analysis-by-synthesis optimization framework [3, 19, 8, 23] is commonly adopted. However, these optimization-based approaches are sensitive to their parameters, and unrealistic appearances occur on the generated 3D models.

3DMM by Supervised Learning.

Methods based on supervised learning require ground-truth 3DMM labels obtained by either optimization-based fitting or synthetic data rendering. Zhu et al. [26] and Tran et al. [24] use 3DMM parameters generated by optimization-based approaches as ground truth to train their CNN models. The performance of these methods is limited by the unreliable labels. Other approaches [6, 17, 13] instead use synthetic data rendered with random 3DMM parameters for supervised learning. Dou et al. [6] propose to use synthetic face images and the corresponding 3D scans together for network learning. Richardson et al. [17] train a 3DMM regression network with only synthetically rendered face images. Kim et al. [13] show that training with synthetic data can be adapted to real data with a bootstrapping algorithm. However, the performance of these methods is limited by the unrealistic input, and the resulting 3D face models do not resemble the input images.

3DMM by Self-supervised Learning.

Self-supervised methods derive supervision from the input images themselves, without labels. MoFA [22] uses a pixel-wise photometric loss to make the image rendered from the estimated 3DMM parameters similar to the input image. However, the photometric loss makes the network attend to pixel-wise similarity between the rendered image and the input image, while the personal identity of the input face is ignored. Recently, Genova et al. [9] propose to enforce feature similarity between the rendered image and the input image via a fixed-weight face recognition network. The facial identity is preserved, but the low-level feature similarity (e.g., illumination, skin color, and facial expression) is missing. This is because the method of Genova et al. [9] is designed to predict 3DMM parameters from illumination- and expression-invariant facial feature embeddings instead of the original face images.

Detail Modeling beyond 3DMM.

Due to the limited expressive power of 3DMM, some approaches try to model facial details with additional layers built upon 3DMM. Examples following this line include depth maps [18], bump maps [25], and trainable corrective models [21]. Other work employs non-parametric 3D representations [12, 20] to gain more degrees of freedom, but these are usually less robust than the methods built upon 3DMM. In this paper, we focus on 3D representations that combine a coarse 3DMM model with additional detail layers. For the detail layer representation, we prefer detail maps to trainable corrective models [21] because of their greater expressive power. Note that existing detail-map-based approaches [18, 25] are all based on supervised learning with surrogate ground-truth detail maps computed with traditional methods, while our approach is completely unsupervised.

In this paper, we propose a self-supervised model to learn 3D shape and texture in a coarse-to-fine procedure. In the coarse model, different from MoFA [22] and Genova et al. [9], we combine a low-level photometric loss and a high-level perceptual loss to provide multi-level supervision. In the fine model, we use the unwrapped input image and the 3D model in UV space as inputs to the neural network, and make the network learn the detailed information from the difference between the real and rendered images in an aligned space. Different from Guo et al. [10], who use RGBD data as model input and learn the per-vertex normal displacement with UV maps, we use RGB images as input and build a UV render layer that establishes dense correspondence between UV space and the rendered image space. These differences are vital to better face reconstruction.

Fig. 2: Proposed pipeline. We use a 3DMM encoder to transform an input face image into a latent code vector to regress the 3DMM parameters. We unwrap both the input image and the reconstructed 3D model into UV space and estimate a displacement depth map. Then, the 3DMM-based coarse model and the displacement depth map are used to generate a 3D face with fine details.

III Proposed Algorithm

In this section, we illustrate the proposed method. We will give an overview of the proposed framework, and provide a detailed explanation of the loss functions to train the model.

III-A Framework Overview

Figure 2 shows the pipeline of the proposed framework. It consists of two modules: the 3DMM encoder and the detail modeling encoder-decoder network. The 3DMM encoder regresses the 3DMM parameters of an input image. In the detail modeling module, we unwrap both the input image and the corresponding 3D model into UV space. We send these two UV maps into the detail modeling encoder-decoder to measure their difference, where the details missing from the 3D face model reside. The output of the decoder is a displacement depth map that contains the missing details. The displacement depth map is then added back to the coarse model to enhance fine details on the surface. We achieve self-supervision through two technical novelties. First, we propose to combine a low-level photometric loss and a high-level perceptual loss for the coarse model regression. Second, we unwrap both the input image and the regressed 3DMM texture map into UV-space, and the corresponding UV maps are sent together into the translation network to obtain the detailed displacement map in UV-space.

We exploit the potential of the 3DMM encoder during the training process. We propose multiple loss terms to minimize the difference between the input image and the output image produced by a differentiable render layer from the coarse 3D face model. The difference is measured at hierarchical levels of facial perception, ranging from low-level photometric difference to high-level perceptual difference. The learned 3DMM encoder is capable of representing the input image in this parametric model. In the fine detail modeling stage, we mainly use a photometric loss to compare the input image with the output image rendered by a differentiable UV render layer. However, due to limited image resolution, the input image often contains noise that the fine network might confuse with facial details. Thus, additional smoothing losses and regularization terms are included to encourage the proposed model to reconstruct smooth faces with fine details.

III-B 3DMM Encoder

We first revisit the 3DMM and explain the 3DMM encoder. Then, we introduce hierarchical loss terms to exploit the network's potential. The 3DMM parameters for shape include the identity and expression parameters, and the reconstructed 3D face model can be written as:

$$S = \bar{S} + A_{id}\,\alpha_{id} + A_{exp}\,\alpha_{exp}, \qquad (1)$$

where $\bar{S}$ is the vector format of the mean 3D face model, and $A_{id}$ and $A_{exp}$ are the identity bases and the expression bases from [2] and [5], respectively. The 3D reconstruction is formulated as regressing the 3DMM parameters $\alpha_{id}$ and $\alpha_{exp}$ as shown in Eq. 1.
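The linear model above can be illustrated with a minimal NumPy sketch. The array names, the toy dimensions, and the random bases are assumptions made for illustration only; a real 3DMM would load the mean shape and bases from the models cited in [2] and [5].

```python
import numpy as np

def reconstruct_shape(mean_shape, id_basis, exp_basis, alpha_id, alpha_exp):
    """Linear 3DMM: mean face plus identity and expression offsets.

    mean_shape: (3N,) flattened mean face
    id_basis:   (3N, K_id) identity basis (hypothetical layout)
    exp_basis:  (3N, K_exp) expression basis (hypothetical layout)
    Returns an (N, 3) array of vertex positions.
    """
    flat = mean_shape + id_basis @ alpha_id + exp_basis @ alpha_exp
    return flat.reshape(-1, 3)

rng = np.random.default_rng(0)
n_vertices, k_id, k_exp = 5, 4, 3  # toy sizes, not those of a real model
mean_shape = rng.normal(size=3 * n_vertices)
id_basis = rng.normal(size=(3 * n_vertices, k_id))
exp_basis = rng.normal(size=(3 * n_vertices, k_exp))

# Zero coefficients reproduce the mean face.
neutral = reconstruct_shape(mean_shape, id_basis, exp_basis,
                            np.zeros(k_id), np.zeros(k_exp))
# Non-zero coefficients move the vertices along the basis directions.
person = reconstruct_shape(mean_shape, id_basis, exp_basis,
                           rng.normal(size=k_id), rng.normal(size=k_exp))
```

Regressing the 3DMM parameters then amounts to predicting the two coefficient vectors from the image, with the bases held fixed.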

Our 3DMM encoder is similar to those of existing methods [22, 9]. It takes a color face image as input and transforms it progressively into the latent code vector using multiple convolutional layers and nonlinear activations. Specifically, we adopt the VGG-Face [15] structure in the 3DMM encoder. Because of the feature representation discrepancies between 2D face images and 3D face models, we randomly initialize the network parameters and train them from scratch. During the training process, we project the output 3D face model into a 2D face image. The loss functions are mainly designed to measure the difference between the projected face and the input face. The total loss function for training the 3DMM-based coarse model is:

$$L_{coarse} = \lambda_{p} L_{p} + \lambda_{l} L_{l} + \lambda_{id} L_{id} + \lambda_{reg} L_{reg}, \qquad (2)$$

where $L_{p}$ is the photometric loss, $L_{l}$ is the landmark consistency loss, $L_{id}$ is the perceptual identity loss, and $L_{reg}$ is the 3DMM parameter regularization term. The weights {$\lambda_{p}$, $\lambda_{l}$, $\lambda_{id}$, $\lambda_{reg}$} control the influence of each term and are set to constant values. We describe each loss term in detail below.

III-B1 Photometric Loss

The photometric loss measures the pixel-wise difference between the input face image and the rendered face image. We denote the input face image as $I_{in}$ and the rendered face image as $I_{re}$. The loss function is defined as follows:

$$L_{p} = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \left\| I_{in}(i) - I_{re}(i) \right\|, \qquad (3)$$

where $\mathcal{V}$ is the set of visible pixels on the rendered face and $i$ is the location of each visible pixel. We compute the photometric loss by averaging the per-pixel distances over all visible pixels.
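A minimal sketch of a masked photometric loss follows. The function name, the boolean visibility mask, and the choice of a Euclidean distance in RGB are assumptions for illustration; the text above does not fix the norm.

```python
import numpy as np

def photometric_loss(image_in, image_re, visible_mask):
    """Average per-pixel color distance over visible pixels only.

    image_in, image_re: (H, W, 3) float arrays
    visible_mask:       (H, W) boolean array, True where the face is rendered
    """
    # Assumed norm: Euclidean distance between RGB vectors.
    per_pixel = np.linalg.norm(image_in - image_re, axis=-1)
    n_visible = int(visible_mask.sum())
    if n_visible == 0:
        return 0.0
    return float(per_pixel[visible_mask].sum() / n_visible)

image_in = np.zeros((2, 2, 3))
image_re = np.zeros((2, 2, 3))
image_re[0, 0] = [3.0, 4.0, 0.0]   # distance 5 at a visible pixel
image_re[1, 1] = [9.0, 9.0, 9.0]   # large error, but masked out below
visible = np.array([[True, True], [False, False]])

loss = photometric_loss(image_in, image_re, visible)
```

Masking matters because pixels outside the rendered face (background, hair, occluders) carry no information about the 3D model.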

III-B2 Landmark Consistency Loss

The landmark consistency loss measures the distance between the 68 landmarks detected in the input face image and the rendered locations of the 68 corresponding key points on the 3D mesh. The 2D landmarks on the images are detected by [4]. The loss function is defined as:

$$L_{l} = \frac{1}{N} \sum_{i=1}^{N} \left\| q_{in}^{i} - q_{re}^{i} \right\|, \qquad (4)$$

where $q_{in}^{i}$ is the $i$-th landmark position in the input face image, $q_{re}^{i}$ is the corresponding $i$-th landmark position in the rendered face image, and $N$ is the number of landmarks. The landmark consistency loss effectively controls the pose and expression of the 3D face model under the guidance of the input face image.
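The landmark term can be sketched in a few lines of NumPy. The Euclidean distance and the synthetic landmark coordinates are assumptions for illustration; in practice the first array would come from a landmark detector and the second from projecting the mesh key points.

```python
import numpy as np

def landmark_loss(landmarks_in, landmarks_re):
    """Mean distance between detected and rendered 2D landmarks.

    landmarks_in, landmarks_re: (N, 2) arrays of pixel coordinates
    """
    # Assumed norm: Euclidean distance per landmark pair.
    return float(np.linalg.norm(landmarks_in - landmarks_re, axis=1).mean())

rng = np.random.default_rng(0)
detected = rng.uniform(0.0, 224.0, size=(68, 2))   # synthetic stand-in
rendered = detected + np.array([3.0, 4.0])         # every point off by (3, 4)

loss = landmark_loss(detected, rendered)
```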

III-B3 Perceptual Identity Loss

The perceptual identity loss reflects the perceptual similarity between two images. We send both the input face image and the rendered face image into the VGG-Face recognition network [15] for feature extraction, and define the loss on the CNN features extracted from the two images. The parameters of the VGG-Face network [15] are kept fixed during the training process.
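As an illustration of a feature-space identity loss, the sketch below compares two feature vectors with a cosine distance. Both the cosine form and the plain vectors standing in for VGG-Face features are assumptions; the exact distance used by the method is not recoverable from the text here.

```python
import numpy as np

def identity_loss(feat_in, feat_re, eps=1e-8):
    """Cosine distance between two feature vectors (assumed loss form).

    feat_in, feat_re: 1D arrays, stand-ins for features produced by a
    fixed-weight face recognition network.
    """
    denom = np.linalg.norm(feat_in) * np.linalg.norm(feat_re) + eps
    return float(1.0 - (feat_in @ feat_re) / denom)

same_direction = identity_loss(np.array([1.0, 2.0, 3.0]),
                               np.array([2.0, 4.0, 6.0]))
orthogonal = identity_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

A cosine distance ignores feature magnitude, so two images of the same person under different lighting can still score as close if the recognition features point in the same direction.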

III-B4 Regularization Term

We add a regularization term on the 3DMM parameters. Since the 3DMM parameters are assumed to follow a normal distribution, their values must be prevented from deviating too far from zero; otherwise, the 3D faces reconstructed from the parameters are distorted. The regularization term penalizes the magnitude of the parameters, weighted by constant values.

III-C Detail Modeling

The detail modeling framework is an image-to-image translation network in UV space. The framework consists of an encoder-decoder network with skip connections. The input image and the reconstructed coarse 3D face model are unwrapped into two UV texture maps of the same resolution. These two UV maps are concatenated, and the invisible regions are masked out based on the estimated 3D pose. Then, the concatenated UV maps are fed into the encoder-decoder network. The network produces a displacement depth map. This map is added to the UV position map of the coarse 3DMM to generate a refined UV position map, which is mapped back to a 2D face image by the UV render layer. We compare the output 2D image of the detail modeling network with the input image using a photometric loss. The refined UV position map is used to reconstruct a 3D face with fine details. During training, we add a smoothness loss and regularization terms to the photometric loss to facilitate network learning. The smoothness loss and regularization terms are imposed on the displacement depth maps to reduce both artifacts and distortions in the reconstruction process.
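The input assembly described above (concatenate the two UV maps, then mask out invisible regions) can be sketched as follows. The function name, the channel layout, and the toy 4x4 resolution are assumptions for illustration.

```python
import numpy as np

def build_uv_input(uv_image, uv_coarse_texture, visibility_mask):
    """Stack two UV maps along channels and zero out invisible texels.

    uv_image:          (H, W, 3) input image unwrapped into UV space
    uv_coarse_texture: (H, W, 3) coarse-model texture in UV space
    visibility_mask:   (H, W) array, 1 where the texel is visible, else 0
    Returns an (H, W, 6) array for the encoder-decoder network.
    """
    stacked = np.concatenate([uv_image, uv_coarse_texture], axis=-1)
    return stacked * visibility_mask[..., None]

uv_image = np.ones((4, 4, 3))
uv_texture = np.full((4, 4, 3), 2.0)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0   # only the central texels are visible in this toy case

net_input = build_uv_input(uv_image, uv_texture, mask)
```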

The total loss function of the detail modeling network is:

$$L_{fine} = \mu_{p} L_{p} + \mu_{s} L_{s} + \mu_{r} L_{r}, \qquad (7)$$

where $L_{p}$ is the photometric loss that measures the pixel-wise difference between the input image and the 2D face image rendered from the detail-enhanced UV map. The term $L_{s}$ is the smoothness loss, and $L_{r}$ is the regularization term on the displacement map. The weights {$\mu_{p}$, $\mu_{s}$, $\mu_{r}$} are constant values that balance the influence of each loss term. We introduce the smoothness loss and the regularization term below.

III-C1 Smoothness Loss

We impose the smoothness loss on both the UV displacement normal map and the displacement depth map to ensure that neighboring pixels on these maps have similar representations. Another advantage of the smoothness loss is that it ensures robustness to mild occlusions. The loss consists of two terms, each accumulated over every vertex in UV space and its neighborhood of radius 1. The first term uses a difference measurement on the UV normal map: the pixel distance between the original UV normal map and the UV normal map integrated with the displacement depth map. Similarly, the second term uses the pixel distance between the original displacement depth map and the updated displacement depth map. The two weights that combine these smoothing losses are empirically set to 20 and 10, respectively.
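A rough sketch of a neighborhood smoothness penalty is given below. The absolute-difference form, the restriction to right and down neighbors, and the assignment of the weight 20 to the normal term and 10 to the depth term are all assumptions for illustration, not the exact formulation of the method.

```python
import numpy as np

def neighbor_smoothness(delta_map):
    """Mean absolute difference between each texel and its right/down neighbor.

    delta_map: (H, W) or (H, W, C) array of per-texel differences.
    """
    dx = np.abs(delta_map[:, 1:] - delta_map[:, :-1])
    dy = np.abs(delta_map[1:, :] - delta_map[:-1, :])
    return float(dx.mean() + dy.mean())

def smoothness_loss(delta_normal, delta_depth, w_normal=20.0, w_depth=10.0):
    """Weighted sum of the normal-map and depth-map smoothness terms.

    The mapping of 20 to normals and 10 to depth is an assumed ordering.
    """
    return (w_normal * neighbor_smoothness(delta_normal)
            + w_depth * neighbor_smoothness(delta_depth))

flat_normals = np.zeros((4, 4, 3))              # no change in normals
ramp_depth = np.tile(np.arange(4.0), (4, 1))    # each row is 0, 1, 2, 3

loss = smoothness_loss(flat_normals, ramp_depth)
```

In this toy case the normal term is zero and the depth ramp has a horizontal neighbor difference of 1 everywhere, so only the depth weight contributes.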

III-C2 Regularization Term

We impose regularization terms on both the displacement depth map and the displacement normal map to reduce severe depth changes, which may distort the face on the 3D mesh. Each of the two terms has its own constant weight.

III-D Camera View

The pose parameters in the proposed model include the scale $s$, the rotation angles (in radians), and the translation $t$. We apply orthogonal projection to project the 3D vertices into 2D. We denote a vertex in 3D as $v_{3d}$, the projection operation as $\Pi$, and the projected 2D point as $v_{2d}$. Then we have:

$$v_{2d} = s \, \Pi \, R \, v_{3d} + t,$$

where $R$ is the rotation matrix computed from the rotation angles, and $t$ is the translation vector.
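The scaled orthogonal projection can be sketched as below. The Euler-angle convention (rotations about x, y, z composed as Rz·Ry·Rx) and the function names are assumptions; the text only states that the rotation matrix is computed from the angles.

```python
import numpy as np

def euler_to_rotation(rx, ry, rz):
    """Rotation matrix from three angles in radians (assumed order Rz @ Ry @ Rx)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rot_z @ rot_y @ rot_x

def project_orthographic(vertices, scale, angles, translation):
    """Rotate (N, 3) vertices, drop depth, then scale and translate in 2D."""
    rotated = vertices @ euler_to_rotation(*angles).T
    return scale * rotated[:, :2] + np.asarray(translation)

verts = np.array([[1.0, 2.0, 3.0]])
# No rotation: the depth coordinate is simply discarded.
plain = project_orthographic(verts, 2.0, (0.0, 0.0, 0.0), (1.0, -1.0))
# Quarter turn about z maps the x-axis onto the y-axis.
turned = project_orthographic(np.array([[1.0, 0.0, 0.0]]), 1.0,
                              (0.0, 0.0, np.pi / 2), (0.0, 0.0))
```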

III-E Rendering Layer

The rendering layer is a modification of the one in [9]. We use spherical harmonics as our lighting model instead of the Phong reflection model, and orthogonal projection instead of full perspective projection.
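To make the lighting model concrete, here is a minimal sketch of spherical harmonics shading. The use of the first nine basis functions (bands 0 to 2), one coefficient set per color channel, and the function names are assumptions; the text does not state the SH order. The constants follow the real SH convention commonly used in graphics.

```python
import numpy as np

def sh_basis(normals):
    """First nine real SH basis functions evaluated at unit normals (N, 3)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x ** 2 - y ** 2),
    ], axis=1)

def shade(albedo, normals, sh_coeffs):
    """Per-vertex color = albedo * SH lighting.

    albedo: (N, 3), normals: (N, 3) unit vectors, sh_coeffs: (9, 3)
    """
    return albedo * (sh_basis(normals) @ sh_coeffs)

normals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
albedo = np.full((3, 3), 0.5)
# Pure ambient white light: only the constant band is non-zero.
ambient = np.zeros((9, 3))
ambient[0] = 1.0 / 0.282095
shaded = shade(albedo, normals, ambient)
```

Because the shading is a product of albedo and lighting, the two can trade off against each other, which is why a lighting-insensitive loss helps separate them.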

III-F UV Render Layer

The UV render layer takes two inputs. One is a coarse UV position map that is built by unwrapping the coarse face mesh. The other is a displacement depth map in UV space predicted by the detail model. Using these two inputs, the UV render layer first computes the detailed UV position map by adding the displacement to the coarse UV position map. Then, for each pixel coordinate, it finds the best correspondence in the UV map and fills in the depth value. Finally, it renders an output image using the other parameters, such as pose, lighting, and texture, from the 3DMM regression network. The triangles of the output mesh are redefined over neighboring pixels in the output UV map. Thus, the UV render layer performs dense rendering compared with the vertex render layer used in the coarse model.
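Two steps of the UV render layer lend themselves to a short sketch: adding the displacement to the coarse position map, and redefining triangles over neighboring texels. Applying the displacement to the depth (z) channel only and splitting each texel quad into two triangles are assumptions made for illustration.

```python
import numpy as np

def apply_displacement(coarse_uv_pos, disp_depth):
    """Add a per-texel depth displacement to a UV position map.

    coarse_uv_pos: (H, W, 3) xyz position stored at each texel
    disp_depth:    (H, W) displacement, applied to the z channel (assumed)
    """
    detail = coarse_uv_pos.copy()
    detail[..., 2] += disp_depth
    return detail

def grid_triangles(height, width):
    """Triangulate an H x W texel grid: two triangles per 2 x 2 texel quad."""
    idx = np.arange(height * width).reshape(height, width)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    upper = np.stack([tl, bl, tr], axis=1)
    lower = np.stack([tr, bl, br], axis=1)
    return np.concatenate([upper, lower], axis=0)

coarse = np.zeros((3, 4, 3))
detail = apply_displacement(coarse, np.full((3, 4), 0.1))
vertices = detail.reshape(-1, 3)     # one mesh vertex per texel
triangles = grid_triangles(3, 4)     # indices into `vertices`
```

With one vertex per texel, mesh density follows the UV resolution rather than the vertex count of the 3DMM, which is what makes the rendering dense.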

IV Experiments

In this section, we illustrate the implementation details of the proposed method. Then, we evaluate the proposed method and make comparisons with state-of-the-art methods including MoFA [22], Genova18[9] for coarse model, and Tran18 [25], Sela17 [20] for the fine model. More results are provided in the supplementary files. Our implementation will be made available to the public.

Fig. 3: Shape and texture reconstructions of 3D faces by three approaches on the CelebA [14] and LFW [11] datasets. Columns: input, MoFA [22], Genova18 [9], Ours-3DMM, each method shown as shape and texture.

IV-A Implementation Details

Our network training consists of two stages. In the first stage, we train the 3DMM encoder to obtain 3DMM parameters. The reconstructed 3D face is similar to the input images in the general facial perception. In the second stage, we fix the parameters in the 3DMM encoder and apply an image-to-image translation network in the UV space to estimate a displacement depth map.

Our training dataset is from the CelebA dataset [14]. Before training, we use the landmark detector [4] to exclude the failure samples. Then, we separate the remaining images into two parts. The first part is the training dataset which contains 162,129 images. The second part is the testing set which contains 19,899 images. The network structure of the 3DMM encoder is the same as that of the VGG-face model [15]. We randomly initialize all the weights in our network and train them from scratch.

When projecting the 3D face models into 2D images, we follow [9] to construct a differentiable layer. The weights of the total coarse-model loss (Eq. 2), of the regularization term (Eq. 6), and of the total fine-model loss (Eq. 7) are fixed constants. Both the coarse model and the fine model are trained with a step-wise decaying learning rate and a fixed batch size. We adopt the Adam optimizer and train the network on an NVIDIA Tesla M40.

IV-B Evaluation on 3DMM Regression

IV-B1 Shape Analysis

TABLE II: Point-to-plane error on the MICC Florence dataset [1] for coarse 3D face reconstruction methods. Columns: method; losses used (photometric, identity); error under the Indoor Cooperative, PTZ Indoor, and PTZ Outdoor conditions. Rows include MoFA [22] and Genova18 [9].

We evaluate the shape reconstruction precision of different configurations on the MICC Florence dataset [1]. In this dataset, videos of 53 subjects are captured under three different conditions: Indoor Cooperative, PTZ Indoor, and PTZ Outdoor. Ground-truth 3D scans are provided for 52 of the 53 subjects. We use each video frame as network input. Before evaluation, we remove the frames in which the face is not detected by the landmark detector [4]. The 3D shape for each video sequence is reconstructed using the average of the shape parameters over the remaining frames. We follow the procedure in [9] to compute the point-to-plane error between the predicted 3D face models and the ground-truth scans. Table II lists the models trained with the different configurations for comparison. Existing methods [22, 9] use either the photometric loss or the identity loss to train the encoder, and their performance is similar under all three conditions. When both the photometric loss and the identity loss are used during training, the performance improves over [22, 9], which adopt only one of the two losses.
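For readers unfamiliar with the metric, a generic point-to-plane error can be sketched as follows. This is not the evaluation protocol of [9]; the brute-force nearest-neighbor search, the function name, and the synthetic planar scan are assumptions for illustration.

```python
import numpy as np

def point_to_plane_error(pred_points, gt_points, gt_normals):
    """Mean distance from each predicted point to the tangent plane
    of its nearest ground-truth point.

    pred_points: (M, 3), gt_points: (K, 3), gt_normals: (K, 3) unit normals
    """
    # Brute-force nearest neighbor; fine for small toy inputs.
    sq_dist = ((pred_points[:, None, :] - gt_points[None, :, :]) ** 2).sum(-1)
    nearest = sq_dist.argmin(axis=1)
    offsets = pred_points - gt_points[nearest]
    # Project the offset onto the ground-truth normal.
    return float(np.abs((offsets * gt_normals[nearest]).sum(axis=1)).mean())

# Synthetic "scan": a 5 x 5 grid on the plane z = 0 with normals along +z.
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0), indexing="ij")
scan = np.stack([gx.ravel(), gy.ravel(), np.zeros(25)], axis=1)
scan_normals = np.tile(np.array([0.0, 0.0, 1.0]), (25, 1))

# Prediction hovering 0.5 above the plane, with a small in-plane shift.
prediction = scan[:10] + np.array([0.2, 0.1, 0.5])
error = point_to_plane_error(prediction, scan, scan_normals)
```

Unlike a point-to-point distance, the in-plane shift does not contribute to the error, so the metric tolerates sliding along the surface.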

We perform the qualitative evaluation in three parts: shape and texture, expression, and lighting and albedo. Figure 3 shows the test images and their reconstructed shape and texture on the CelebA [14] and LFW [11] datasets. The proposed method is compared with MoFA [22] and Genova18 [9]. The predicted expression parameters of MoFA [22] and the proposed method are excluded from this comparison.

Compared with MoFA [22] and Genova18 [9], the proposed method is able to generate faces with varying global shapes according to the identity difference in the input images. Though changes in overall shapes can also be observed for Genova18[9], they are inconsistent with the input images sometimes. For example, the generated face by Genova18[9] at Row 7 is too short along the vertical axis compared with the corresponding input image.

To show the results more clearly, we manually adjust the generated faces to almost the same size. We notice that some reconstructions by Genova18[9] are extremely large or small, while MoFA[22] and the proposed approach produce faces of more appropriate size. This phenomenon may indicate that the landmark and photometric loss in MoFA[22] and the proposed approach can provide constraints on face size.

Regarding texture, we notice that MoFA [22] tends to generate textures with smooth and shallow colors. The results from Genova18 [9] are more realistic but do not show sufficient skin color distinction between people of different races. In contrast, the proposed method shows more color diversity across individuals.

TABLE III: Mean and standard deviation of the point-to-point RMSE on FaceWarehouse [5] for MoFA [22] and the coarse model of the proposed method.

IV-B2 Expression Analysis

We evaluate the expression reconstruction results on the FaceWarehouse dataset[5]. The dataset contains 150 subjects in 20 different expressions. We compare the proposed method with MoFA[22]. We find the correspondence between BFM 2009 Model[2] and the FaceWarehouse model [5]. Then, we apply rigid transforms to align the predicted meshes and the ground truth scans provided in FaceWarehouse[5]. We compute the point-to-point errors as the root-mean-square-error of the distances for the corresponding vertices of the two meshes.
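The alignment and error computation described above can be sketched with a standard least-squares rigid fit (the Kabsch algorithm). The use of Kabsch and the function names are assumptions; the text only says that rigid transforms are applied to meshes with known vertex correspondence.

```python
import numpy as np

def rigid_align(source, target):
    """Rotate and translate `source` (N, 3) onto `target` (N, 3) in the
    least-squares sense, assuming row i of both arrays corresponds."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    cross_cov = (source - mu_s).T @ (target - mu_t)
    u, _, vt = np.linalg.svd(cross_cov)
    # Guard against a reflection in the SVD solution.
    sign = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    translation = mu_t - rotation @ mu_s
    return source @ rotation.T + translation

def point_to_point_rmse(a, b):
    """Root-mean-square of the distances between corresponding vertices."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

rng = np.random.default_rng(0)
predicted = rng.normal(size=(10, 3))
angle = 0.3
rot_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle), np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
# Toy "ground truth": the same mesh, rotated and shifted.
scan = predicted @ rot_true.T + np.array([1.0, 2.0, 3.0])

rmse_before = point_to_point_rmse(predicted, scan)
rmse_after = point_to_point_rmse(rigid_align(predicted, scan), scan)
```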

Table III shows the mean and standard deviation of the point-to-point RMSE on the FaceWarehouse [5] dataset. The proposed method outperforms MoFA [22] significantly, indicating that it reconstructs facial expressions with higher precision.

Since Genova18 [9] does not estimate expression parameters, we only compare MoFA [22] and the proposed model for expression on the FaceWarehouse dataset [5]. Figure 4 shows the expression reconstruction results of the two methods. When a commonly seen expression (e.g., smiling with visible teeth) appears in the input image, as in the first row, both MoFA and the proposed method reconstruct the 3D model effectively, while the 3D model generated by the proposed method contains more identity-specific details. When an uncommon expression (e.g., a pouted mouth) appears, as in the second row, MoFA does not reconstruct the 3D model effectively while the proposed method does. When the expression is extreme (e.g., a wide-open mouth and closed eyes), as in the last row, neither method performs well, but the proposed method still performs favorably against MoFA.

Fig. 4: Expression reconstruction on the FaceWarehouse dataset [5] for MoFA [22] and the coarse model of the proposed method. Columns: input, MoFA [22], ours.
Fig. 5: Lighting and albedo reconstruction for MoFA [22] and the coarse model of the proposed method on CelebA [14]. Columns: input, then overlay, light, and albedo for each of MoFA [22] and ours.

IV-B3 Lighting and Albedo

Figure 5 visualizes the albedo and lighting reconstruction results of MoFA and the proposed method. We set the meshes to white for a clear display. Though the overall color looks similar between the two results, the lighting and albedo are different. The colors in the overlay of MoFA [22] come mostly from lighting, which leads to smooth and fair albedo. In comparison, the albedo of the proposed method is more consistent with the input faces, and the lighting color is close to white, which is more common in real-world scenarios. We attribute this performance to the use of the lighting-insensitive identity loss, which helps the network decouple lighting and albedo.

IV-C Evaluation on Final Reconstruction

Fig. 6: Visual comparison of the generated 3D face models.

We evaluate the visual quality of the 3D models reconstructed by different methods. The input test images are from the CelebA [14] dataset. Figure 6 shows the results. We observe that the 3D face models generated by Tran18 are often noisy, as high-frequency information is spread all over the meshes regardless of the input images. For example, in the third row of Fig. 6, the mouth region of the input face is smooth with salient texture, but the corresponding 3D face model from Tran18 is noisy around the mouth. Meanwhile, the wrinkles on the cheeks are not reconstructed in accordance with the input image. Similar phenomena appear on other 3D face models, which do not faithfully represent the input faces. The 3D models generated by Sela17, on the other hand, are erroneous and do not accurately convey the facial components of the input face images. Sela17 also sometimes fails to generate face-like meshes. In comparison, the proposed method generates 3D models that preserve the global shape and structure while enriching the facial details.

TABLE IV: Depth error on FRGC2 [16] for three different fine-detail reconstruction approaches (Tran18 [25], Sela17 [20], and ours).
Fig. 7: Visual comparison between reconstruction by coarse and fine model.
Fig. 8: Comparison on close-ups for different detail reconstruction approaches. Columns: close-up, ours, Tran18, Sela17.

Figure 7 compares the results of 3DMM regression and the fine model. We can see that the fine model can retain more details and express more vivid facial characteristics than using merely the 3DMM model. In addition, we compared the close-ups of the reconstruction results by the fine model with Tran18[25] and Sela17[20]. The result is shown in Figure 8. Compared with Tran18 and Sela17, our fine model strengthens the details with less noise or distortion, showing more robustness over the other two approaches. Also, our model closely resembles the input images as well as preserves the detailed wrinkles. As in Row 4, distortion in the results of Tran18 and Sela17 can be easily observed, making the reconstructed mouth unnatural. What’s more, the positions of these wrinkles shifts in Tran18 and Sela17, while our model successfully reconstruct the details.

We perform quantitative evaluation on the MICC Florence dataset [1] and the Face Recognition Grand Challenge V2 (FRGC2) dataset [16]. On MICC, we evaluate the shape reconstruction precision, and on FRGC2 we evaluate the depth estimation error. We start with MICC. As we aim to model facial details, we select frontal video frames of each subject for evaluation; frontal frames exist only in the Indoor Cooperative condition. We compute the point-to-point error between the reconstructed 3D faces and the ground truth scans. We first crop the ground truth scan to 95mm around the tip of the nose. Then, we run ICP (i.e., iterative closest point) with isotropic scale to find an alignment between the ground truth and the reconstruction. The point-to-point distances are then computed for each subject.
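As an illustration only, the MICC protocol described above (crop, scaled ICP, point-to-point distance) could be sketched in NumPy as follows. This is not the authors' evaluation code. The function names, the brute-force nearest-neighbor search, the closed-form similarity fit (Umeyama's method), the iteration count, and the direction of the final distance (from each scan vertex to the nearest reconstructed vertex) are all assumptions made for the sketch.

```python
import numpy as np

def crop_around_nose(scan, nose_tip, radius_mm=95.0):
    """Keep only scan vertices within radius_mm of the nose tip."""
    keep = np.linalg.norm(scan - nose_tip, axis=1) <= radius_mm
    return scan[keep]

def nearest_neighbors(src, dst):
    """Brute-force nearest neighbor in dst for every point of src."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    return idx, np.sqrt(d2[np.arange(len(src)), idx])

def similarity_fit(src, dst):
    """Least-squares s, R, t such that dst ~ s * R @ src + t (Umeyama)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0  # avoid returning a reflection
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s  # isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t

def icp_with_scale(recon, scan, iters=30):
    """Iteratively align recon to scan with rotation, translation and scale."""
    aligned = recon.copy()
    for _ in range(iters):
        idx, _ = nearest_neighbors(aligned, scan)
        s, R, t = similarity_fit(aligned, scan[idx])
        aligned = s * aligned @ R.T + t
    return aligned

def point_to_point_error(recon, scan, nose_tip):
    """Mean distance from cropped scan vertices to the aligned reconstruction."""
    cropped = crop_around_nose(scan, nose_tip)
    aligned = icp_with_scale(recon, cropped)
    _, dist = nearest_neighbors(cropped, aligned)
    return float(dist.mean())
```

Scaled ICP of this kind depends on a reasonable initial pose, which is consistent with the observation that the alignment becomes unstable when a reconstruction is badly distorted.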

TABLE V: Point-to-point error on MICC Florence [1] for different fine-detail reconstruction approaches. Columns: Tran18 [25], Sela17 [20], Ours.
Fig. 9: Error maps on MICC for three different fine-detail reconstructing approaches.
Fig. 10: Barplots on MICC for different fine-detail reconstructing approaches.

Table V shows the comparison results. The proposed method achieves both a lower mean error and a lower standard deviation on the MICC dataset, which indicates that it reconstructs fine details with higher accuracy and stability than the other methods. During evaluation, Sela17 [20] fails to reconstruct 5 subjects, while Tran18 [25] and the proposed method successfully reconstruct all subjects. In the failure cases of Sela17 [20], the faces are distorted or not reconstructed at all. Thus, ICP is unstable for Sela17 and the resulting error may be huge. Figure 9 shows some error maps and Figure 10 shows the individual errors. Sela17 is less robust than Tran18 and the proposed method, with a higher error and standard deviation. In contrast, the proposed method stably produces state-of-the-art results.

Besides MICC, we also evaluate on the FRGC2 dataset, where we estimate the depth of the input face images. The dataset contains around 5000 images with corresponding ground-truth depth maps. We evaluate the depth estimation results generated by all the methods. To calculate the depth error, we first scale the depth estimate of each method to fit the range of the ground truth depths. Then the mean distance between the two depth maps, taken over the valid pixel positions given by a fixed binary mask, is computed as the depth error. Table IV shows the depth estimation results of these methods. The proposed method achieves the lowest mean and standard deviation of depth error among the compared approaches. The low depth errors indicate that the proposed model generates 3D faces with higher accuracy.
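The depth error described above can be illustrated with a short NumPy sketch. The text does not specify the exact rescaling or distance, so the min-max linear mapping onto the ground-truth range and the absolute difference used here are assumptions, as is the function name `depth_error`.

```python
import numpy as np

def depth_error(pred, gt, mask):
    """Mean absolute depth difference over valid pixels.

    pred, gt: (H, W) depth maps; mask: (H, W) binary mask of valid pixels.
    The prediction is linearly rescaled so that its range over the valid
    pixels matches the ground-truth range (an assumed reading of
    "scale ... to fit the ground truth depths in ranges").
    """
    valid = mask.astype(bool)
    p, g = pred[valid], gt[valid]
    span = p.max() - p.min()
    if span == 0:
        raise ValueError("predicted depth is constant over the mask")
    p_scaled = (p - p.min()) / span * (g.max() - g.min()) + g.min()
    return float(np.abs(p_scaled - g).mean())
```

With this reading, a prediction that differs from the ground truth only by a positive scale and an offset has zero error, so the metric measures the shape of the depth profile and not its absolute units.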

Fig. 11: Though the proposed method is trained with single-view images, it can be adopted for multi-view reconstruction by merging two partial UV position maps into a full UV position map.

IV-D Application

An additional advantage of employing UV position maps [7] for reconstruction is that it eases the integration of 3D reconstructions from different views of the same face. The UV maps from different views can be combined by a simple blending in UV-space. The combined full UV map can thus represent a complete 3D face model covering the regions visible in the different views. The resulting detailed 3D reconstruction is more complete than depth-map-based representations. Figure 11 shows several examples of blending two partial UV maps from two views of the same face.
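One possible form of such a blending is a per-texel weighted average of the two partial UV position maps. The sketch below assumes that each view comes with a nonnegative visibility weight per texel. The weights, the function name `blend_uv_maps`, and the normalization are illustrative assumptions, not details given in the text.

```python
import numpy as np

def blend_uv_maps(uv_a, w_a, uv_b, w_b, eps=1e-8):
    """Blend two partial UV position maps.

    uv_a, uv_b: (H, W, 3) position maps of the same face from two views.
    w_a, w_b:   (H, W) nonnegative visibility weights (0 = not visible).
    Returns the blended map and a boolean map of texels covered by
    at least one view. Uncovered texels are set to zero.
    """
    total = w_a + w_b
    weighted = uv_a * w_a[..., None] + uv_b * w_b[..., None]
    blended = weighted / np.maximum(total, eps)[..., None]
    covered = total > eps
    return blended, covered
```

Texels seen in only one view keep that view's value, and texels seen in both views receive the weighted mean, so soft weights near the visibility boundary would give a smooth transition between the two views.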

Fig. 12: Limitations of the proposed method. The proposed model has trouble dealing with partial occlusion, extreme poses and expressions. Panels: extreme expression, occlusion, and large pose, each shown as an input image and the reconstructed shape.

IV-E Limitation Analysis

Figure 12 shows some failure cases of the proposed model under difficult scenarios, including extreme expression, occlusion and large pose. Since the proposed model is not trained with a specific strategy for these conditions, the reconstruction results are often unsatisfactory whenever the input images show extreme expressions or lack information due to large pose or occlusion.

V Conclusion

We propose a detailed 3D face reconstruction framework with self-supervised learning. We use a coarse 3DMM encoder to reconstruct the general 3D face model and capture facial details in the UV space. When learning the 3DMM encoder, we incorporate multiple loss terms ranging from pixel-wise similarity to global facial perception. After learning the 3DMM parameters, we unwrap both the input face image and the 3D model into UV space, where all the faces are precisely aligned. The details of the input image are effectively transferred to the 3D model in the detail modeling step, as the aligned facial details facilitate the learning process. Experiments on the benchmark datasets indicate that the proposed method performs favorably against state-of-the-art 3D face modeling approaches.


  • [1] A. D. Bagdanov, A. Del Bimbo, and I. Masi (2011) The Florence 2D/3D hybrid face dataset. In Joint ACM Workshop on Human Gesture and Behavior Understanding.
  • [2] V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3D faces. In SIGGRAPH.
  • [3] V. Blanz and T. Vetter (2003) Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [4] A. Bulat and G. Tzimiropoulos (2017) How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In ICCV.
  • [5] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou (2014) FaceWarehouse: a 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics.
  • [6] P. Dou, S. K. Shah, and I. A. Kakadiaris (2017) End-to-end 3D face reconstruction with deep neural networks. In CVPR.
  • [7] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3D face reconstruction and dense alignment with position map regression network. In ECCV.
  • [8] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt (2013) Reconstructing detailed dynamic face geometry from monocular video. In SIGGRAPH.
  • [9] K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. T. Freeman (2018) Unsupervised training for 3D morphable model regression. In CVPR.
  • [10] Y. Guo, J. Zhang, J. Cai, B. Jiang, and J. Zheng (2019) CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (6), pp. 1294–1307.
  • [11] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report.
  • [12] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos (2017) Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In ICCV.
  • [13] H. Kim, M. Zollhöfer, A. Tewari, J. Thies, C. Richardt, and C. Theobalt (2018) InverseFaceNet: deep monocular inverse face rendering. In CVPR.
  • [14] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV.
  • [15] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In BMVC.
  • [16] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek (2005) Overview of the face recognition grand challenge. In CVPR.
  • [17] E. Richardson, M. Sela, and R. Kimmel (2016) 3D face reconstruction by learning from synthetic data. In 3DV.
  • [18] E. Richardson, M. Sela, R. Or-El, and R. Kimmel (2017) Learning detailed face reconstruction from a single image. In CVPR.
  • [19] S. Romdhani and T. Vetter (2005) Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In CVPR.
  • [20] M. Sela, E. Richardson, and R. Kimmel (2017) Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV.
  • [21] A. Tewari, M. Zollhöfer, P. Garrido, F. Bernard, H. Kim, P. Pérez, and C. Theobalt (2018) Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In CVPR.
  • [22] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt (2017) MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV.
  • [23] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2Face: real-time face capture and reenactment of RGB videos. In CVPR.
  • [24] A. T. Tran, T. Hassner, I. Masi, and G. Medioni (2017) Regressing robust and discriminative 3D morphable models with a very deep neural network. In CVPR.
  • [25] A. T. Tran, T. Hassner, I. Masi, E. Paz, Y. Nirkin, and G. Medioni (2018) Extreme 3D face reconstruction: seeing through occlusions. In CVPR.
  • [26] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li (2016) Face alignment across large poses: a 3D solution. In CVPR.
  • [27] M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt (2018) State of the art on monocular 3D face reconstruction, tracking, and applications. In Computer Graphics Forum.