Digital Twin: Acquiring High-Fidelity 3D Avatar from a Single Image

by Ruizhe Wang, et al.
West Virginia University

We present an approach to generate a high-fidelity 3D face avatar with a high-resolution UV texture map from a single image. To estimate the face geometry, we use a deep neural network to directly predict vertex coordinates of the 3D face model from the given image. The 3D face geometry is further refined by a non-rigid deformation process to more accurately capture facial landmarks before texture projection. A key novelty of our approach is to train the shape regression network on facial images synthetically generated using a high-quality rendering engine. Moreover, our shape estimator fully leverages the discriminative power of deep facial identity features learned from millions of facial images. We have conducted extensive experiments to demonstrate the superiority of our optimized 2D-to-3D rendering approach, especially its excellent generalization property on real-world selfie images. Our proposed system of rendering 3D avatars from 2D images has a wide range of applications, from virtual/augmented reality (VR/AR) and telepsychiatry to human-computer interaction and social networks.



1 Introduction

Figure 1: Sample outputs of our proposed avatar generation approach. From left to right: input image, inferred shape model with low polygon count, composite model with UV diffuse map

Acquiring high-quality 3D avatars is an essential task in many vision applications, including VR/AR, teleconferencing, virtual try-on, computer games, and special effects. A common practice, adopted by most professional production studios, is to have skilled artists manually create avatars from 3D scans or photo references. This process is often time-consuming and labor-intensive because each model requires days of manual processing and touching up. It is desirable to automate the process of 3D avatar generation by leveraging rapid advances in computer vision/graphics and image/geometry processing. There has been a flurry of works on generating 3D avatars from handheld video [19], Kinect [1] and mesh models [45] in the open literature.

Developing a fully automatic system for generating a 3D avatar from a single image is challenging because the estimation of both facial shape and texture map involves an intrinsically ambiguous composition of light, shape and surface material. Conventional wisdom attempts to address this issue by inverse rendering, which formulates image decomposition as an optimization problem and estimates the parameters best fitting the observed image [4, 2, 31]. More recently, several deep learning based approaches have been proposed, either in a supervised setup to directly regress the parameterized 3D face model [50, 11, 43] or in an unsupervised fashion [39, 16, 32] with the help of a differentiable rendering process. However, these existing methods usually assume over-simplified lighting, shading and skin surface models, which do not take real-world complexities (e.g., sub-surface scattering, shadows caused by self-occlusion and complicated skin reflectance field [9]) into account. Consequently, the recovered 3D avatar often does not faithfully reflect the actual face presented in the image.

To meet those challenges, we propose a novel semi-supervised approach to utilize synthetically-rendered, photo-realistic facial images augmented from a prioritized 3D facial scan dataset. Upon collecting and processing 482 neutral facial scans with a medical grade 3D facial scanner, we perform shape augmentation and utilize a high-fidelity rendering engine to create a large collection of photo-realistic facial images. To the best of our knowledge, this work is the first attempt to leverage photo-realistic facial image synthesis for accurate face shape inference.

For facial geometry estimation, we propose to first extract deep facial identity features [37, 28], trained on millions of images, which encode each face into a unique latent representation, and then regress the vertex coordinates of a generic 3D head model. To better capture facial landmarks for texture projection, the vertex coordinates are further refined in a non-rigid manner by jointly optimizing over camera intrinsics, head pose, facial expression and a per-vertex corrective field. Our final generated model consists of a shape model with a low polygon count but a high-resolution texture map with sharp details, which allows efficient rendering even on mobile devices (as shown in Fig. 1).

At the system level, 3D avatars created by our approach are similar to those of Pinscreen [18, 48] and [14]. For the Pinscreen avatar, the shape model is reconstructed via an analysis-by-synthesis method [4], whereas our shape model is directly regressed from deep facial identity features, hence reaching higher shape similarity. For the avatar of [14], both shape and texture models utilize a collection of 10,000 facial scans [6], whereas our semi-supervised method only uses 482 scans and is still capable of achieving similar shape reconstruction accuracy and a higher-resolution UV texture map from an input selfie.

Our key contributions can be summarized as follows:

A system for generating a high-fidelity UV-textured 3D avatar from a single image which can be efficiently rendered in real time even on mobile devices.

Training a shape estimator on the synthetic photo-realistic images by using pre-trained deep facial identity features. The trained networks demonstrate excellent generalization properties on real-world images.

Extensive qualitative and quantitative evaluation of the proposed method against other state-of-the-art face modeling techniques demonstrates its superiority (i.e., higher shape similarity and texture resolution).

2 Related Works

3D Face Representation. The 3D Morphable Model (3DMM) [4] uses Principal Component Analysis (PCA) on aligned 3D neutral faces to reduce the dimension of the 3D face representation, making the face fitting problem more tractable. The FaceWarehouse technique [7] enhances the original PCA-based neutral face model with expressions by applying multi-linear analysis [44] to a large collection of 4D facial scans captured with RGB-D sensors. The quality of the multi-linear model was further improved in [5] by jointly optimizing the model and the group-wise registration of 3D scans. In [6], a Large Scale Facial Model with 10,000 faces was generated to maximize the coverage of gender and ethnicity. The training data was further enlarged in [27], which created a linear shape space trained from 4D scans of 3800 human heads. More recently, a non-linear model was proposed in [41] from a large set of unconstrained face images without the necessity of collecting 3D face scans.

Fitting via Inverse Rendering. Inverse rendering [2, 4] formulates 3D face modeling as an optimization problem over the entire parameter space, seeking the best fit for the observed image. In addition to pixel intensity values, other constraints, such as facial landmarks and edge contours, are exploited for more accurate fitting [31]. More recently, GanFit [14] used a generative neural network for facial texture modeling and utilized an additional facial identity loss function in the optimization formulation. The inverse rendering based modeling approach has been widely used in many applications [48, 17, 40, 15].

Supervised Shape Regression. Convolutional Neural Network (CNN) based approaches have been proposed to directly map an input image to the parameters of a 3D face model such as 3DMM [10, 50, 22, 42, 49]. In [20], a volumetric representation was learned from an input image. In [34], an input color image was mapped to a depth image using an image translation network. In [11], a network was proposed to jointly reconstruct the 3D facial structure and provide dense alignment in the UV space. The work of [43] took a layered approach toward decoupling low-frequency geometry from its mid-level details estimated by a shape-from-shading approach. It is worth mentioning that many CNN-based approaches use facial shape estimated by inverse rendering as the ground truth during training.

Unsupervised Learning. Most recently, face modeling from images via unsupervised learning has become popular because it affords an almost unlimited amount of data for training. An image formation layer was introduced in [39] as the decoder jointly working with an auto-encoder architecture for end-to-end unsupervised training. SfSNet [35] explicitly decomposes an input image into albedo, normal and lighting components, which are then composed back to approximate the original input image. 3DMM parameters were first directly learned in [16] from facial identity encoding, and then the problem of parameter optimization was formulated in an unsupervised fashion by introducing a differentiable renderer and a facial identity loss on the rendered facial image. A multi-level face model (i.e., 3DMM with corrective field) was developed in [38] following an inverse rendering setup that explicitly models geometry, reflectance and illumination per vertex. RingNet [32] employed an idea similar to the triplet loss for encoding all images of the same subject to the same latent shape representation.

Deep Facial Identity Feature. Recent advances in face recognition [37, 28, 33] attempt to encode all facial images of the same subject under different conditions into identical feature representations, namely deep facial identity features. Several attempts have been made to utilize this robust feature representation for face modeling. GanFit [14] added a deep facial identity loss to the commonly used landmark and pixel intensity losses. In [16], 3DMM parameters were directly learned from deep facial features. Although our shape regression network is similar to theirs, the choice of training data is different. Unlike their unsupervised setting, we opt to work with supervision by synthetically rendered facial images.

3 Proposed Method

3.1 Overview

An overview of the proposed method is shown in Fig. 2. To facilitate facial image synthesis (Sec. 3.2) for training a shape regression neural network (Sec. 3.3), we have collected and processed a prioritized 3D face dataset, from which we can sample augmented 3D face shapes with UV texture to render a large collection of photo-realistic facial images. During testing, the input image is first used to directly regress the 3D vertex coordinates of a 3D face model with the given topology, which are further refined to fit the input image with a per-vertex non-rigid deformation approach (Sec. 3.4). Upon accurate fitting, the selfie texture is projected to the UV space to infer a complete texture map (Sec. 3.5).

Figure 2: Overview of the proposed approach. During training, we learn a shape regression neural network on photo-realistic synthetic facial images. During testing, we infer a low polygon count shape model with a UV diffuse map generated from the projected texture.

3.2 Photo-Realistic Facial Synthesis

3D Scan Database. The most widely used Basel Face Model (BFM) [29] has two major drawbacks. First, it consists of 200 subjects who are mainly Caucasian, which might lead to biased face shape estimation. Second, each face is represented by a dense model with a high polygon count, per-vertex texture appearance and frontal face only, which limits its use for production-level real-time rendering. To overcome these limitations, we have collected scans of a total of 512 subjects using a professional-grade multi-camera stereo scanner (3dMD LLC, Atlanta) across different genders and ethnicities, as shown in Table 1.

A face representation containing a head model of 2925 vertices and a diffuse map sized by is used. We take a non-rigid alignment approach [7] of deforming a generic head model to match the captured facial scan. Then we transfer the texture onto the generic model’s UV space. With further manual artistic touch up, we obtain the final high-fidelity diffuse map.

Gender / Ethnicity   White      Asian      Black     Total
Male                 82 / 5     178 / 5    8 / 5     268 / 15
Female               45 / 5     164 / 5    5 / 5     214 / 15
Total                127 / 10   342 / 10   13 / 10   482 / 30

Table 1: The distribution of gender and ethnicity in our database. Each cell lists the number of training/validation subjects followed by the number of test subjects. Note that we randomly select 5 subjects from each group for testing, and the remaining subjects are used for training and validation.

Shape Augmentation. 482 subjects are far from enough to cover all possible facial shape variations. Since it is expensive to collect thousands of high-quality facial scans, we adopt an alternative shape augmentation approach to improve the generalization ability of the trained neural network. First, we adopt a recent deformation representation (DR) [46, 13] to model a 3D facial mesh. The DR feature encodes the $i$-th vertex as a $9 \times 1$ vector. Hence the DR feature of the entire mesh is the concatenation of these per-vertex vectors. Please see the supplementary material on how to compute a DR feature from a mesh and vice versa.

Upon obtaining a set of DR features, one for each of the N subjects, we follow [21] to sample new DR features. More specifically, we sample a vector in polar coordinates whose radius and angles each follow a uniform distribution, calculate its corresponding Cartesian coordinates, and interpolate the sampled DR features using those coordinates as weights, from which we further calculate the corresponding facial mesh. In our experiments, we only select samples from the same gender and ethnicity. We generate 10,000 new 3D faces with fixed ratios across Asian/Caucasian/Black and across Male/Female. For each newly sampled face, we assign its UV texture by choosing the closest 3D face of the same ethnicity and gender among the existing 482 subjects.
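The sampling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the ranges `R_RANGE` and `THETA_RANGE` are placeholders (the actual ranges are not reproduced here), the function names are hypothetical, and the interpolation is assumed to be a plain weighted sum of the selected DR features.

```python
import math
import random

# Placeholder ranges: the actual sampling ranges are not reproduced here.
R_RANGE = (0.5, 1.0)
THETA_RANGE = (0.0, math.pi / 2)

def sample_polar_weights(m, rng=random):
    """Sample (r, theta_1..theta_{m-1}) uniformly and convert to m Cartesian weights."""
    r = rng.uniform(*R_RANGE)
    thetas = [rng.uniform(*THETA_RANGE) for _ in range(m - 1)]
    weights, sin_prod = [], 1.0
    for theta in thetas:
        # Hyperspherical to Cartesian: r * sin(t1)...sin(t_{k-1}) * cos(t_k)
        weights.append(r * sin_prod * math.cos(theta))
        sin_prod *= math.sin(theta)
    weights.append(r * sin_prod)  # the last coordinate uses only sines
    return r, weights

def interpolate_features(features, weights):
    """Weighted sum of equally sized feature vectors (assumed interpolation rule)."""
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
```

With angles restricted to the first quadrant, every weight is non-negative and the squared weights sum to the squared radius, so the radius controls how far the new sample departs from a convex-like blend of its parents.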

Synthetic Rendering. We use an off-the-shelf high-quality rendering engine, V-Ray. With artistic assistance, we set up a shader graph to render photo-realistic facial images given a custom diffuse map and a generic specular map. We manually set up 30 different lighting conditions and further randomize head rotation in roll, yaw and pitch. The background of the rendered images is randomized with a large collection of indoor and outdoor images. We opt not to render eye models and mask out the eye areas when testing by using detected local eye landmarks. Please see the supplementary material for more details.

3.3 Regressing Vertex Coordinates

Our shape regression network consists of a feature encoder and a shape decoder. Deep facial identity feature is known for its robustness under varying conditions such as lighting, head pose and facial expression, providing a naturally ideal option for the encoded feature. Although any off-the-shelf facial recognition network would be sufficient for our task, we propose to adopt Light CNN-29V2 [47] due to its good balance between network size and encoding efficiency. A pre-trained Light CNN-29V2 model is used to encode an input image into a 256-dimensional feature vector. We have used a weighted per-vertex L1 loss: weight of 5 for vertices on the facial area (within a radius of 95mm from the nose tip) and weight of 1 for other vertices.
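The weighted per-vertex L1 loss can be illustrated as follows. This is a sketch under assumptions: the text does not say whether the 95 mm radius is measured on the ground-truth shape or on the template, nor how the sum is normalized, so the code measures distance on the target vertices and averages over vertices. The function name is hypothetical.

```python
import math

def weighted_l1_loss(pred, target, nose_tip,
                     radius_mm=95.0, face_weight=5.0, other_weight=1.0):
    """Per-vertex L1 loss with a larger weight for vertices near the nose tip.

    pred, target: sequences of (x, y, z) tuples in millimetres.
    The facial-area test uses the target vertex positions (an assumption).
    """
    total = 0.0
    for p, t in zip(pred, target):
        weight = face_weight if math.dist(t, nose_tip) <= radius_mm else other_weight
        total += weight * sum(abs(a - b) for a, b in zip(p, t))
    return total / len(target)  # mean over vertices (assumed normalization)
```

For example, a 1 mm error on a facial vertex contributes five times as much as the same error on a vertex at the back of the head.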

For the shape decoder, we have used three fully connected (FC) layers, with output sizes of 128, 200 and 8,775 respectively. The last FC layer directly predicts the concatenated vertex coordinates of a generic head model consisting of 2,925 points (2,925 × 3 = 8,775 values), and it is initialized with 200 pre-computed PCA components explaining more than 99% of the variance observed in the 10,000 augmented 3D facial shapes. When compared with unsupervised learning [16], our access to a high-quality prioritized 3D face scan dataset makes it possible to achieve higher accuracy by supervision.
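As a quick check of the decoder dimensions, the sketch below counts the parameters of a 256 → 128 → 200 → 8,775 stack of fully connected layers. It assumes every layer has a bias term, which the text does not state; the helper name is hypothetical.

```python
NUM_VERTICES = 2925
# Encoder output (256) followed by the three FC output sizes from the text.
LAYER_SIZES = [256, 128, 200, 3 * NUM_VERTICES]

def fc_parameter_count(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
```

Under this assumption the last layer alone holds 200 × 8,775 weights, which matches the shape of a matrix of 200 PCA components over the concatenated vertex coordinates and explains why it can be initialized from them.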

3.4 Non-rigid Deformation

The 3D vertex coordinates generated by the shape regression neural network are not directly applicable to texture projection because facial images usually contain unknown factors such as camera intrinsics, head pose and facial expression. Meanwhile, since shape regression predicts the overall facial shape, local parts such as the eyes, nose and mouth are not accurately reconstructed, although they are equally important to perceived quality when compared against the original face image. We propose to utilize facial landmarks detected in a coarse-to-fine fashion and formulate non-rigid deformation as an optimization problem that jointly optimizes over camera intrinsics, camera extrinsics, facial expression and a per-vertex corrective field.

Problem Formulation. To handle facial expressions, we transfer the expression blendshape model in FaceWarehouse [7] to the same head topology with artist’s assistance as . In addition, we introduce a per-vertex correction field to cover out of space non-rigid deformation. Finally, a 3D face is reconstructed as . Camera extrinsic transforms the face from its canonical reference coordinate system to the camera coordinate system. It has a 3-DoF vector for translation and a 3-DoF quaternion representation for rotation. Camera intrinsic projects the 3D model to the image plane. During the optimization, we have found that using a scale factor to update the intrinsic matrix by leads to the best numerical stability. Here are all initialized from the size of the input image as , , and . Putting things together, we can represent the overall parameterized vector by .
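To make the parameterization concrete, here is a minimal sketch of the two ingredients named above: a face built from a regressed shape, expression blendshapes and a per-vertex corrective field, and a pinhole projection whose focal length is updated through a scale factor. All names are hypothetical. The sketch assumes blendshapes are stored as per-vertex displacements from the neutral shape, assumes points are already in camera coordinates (the rigid transform is omitted), and uses a common image-size initialization for the intrinsics (focal length equal to the larger image side, principal point at the image center) that may differ from the paper's.

```python
def reconstruct_face(base, blendshapes, expr_weights, corrective):
    """Per vertex: base + sum_k w_k * blendshape_k + corrective offset."""
    face = []
    for i, v in enumerate(base):
        face.append(tuple(
            v[c]
            + sum(w * b[i][c] for w, b in zip(expr_weights, blendshapes))
            + corrective[i][c]
            for c in range(3)))
    return face

def initial_intrinsics(width, height):
    """Assumed initialization from the image size: (fx, fy, cx, cy)."""
    f = float(max(width, height))
    return f, f, width / 2.0, height / 2.0

def project(point_cam, intrinsics, focal_scale=1.0):
    """Pinhole projection; only the scalar focal_scale is optimized."""
    fx, fy, cx, cy = intrinsics
    x, y, z = point_cam
    return (focal_scale * fx * x / z + cx, focal_scale * fy * y / z + cy)
```

Optimizing a single dimensionless scale instead of the raw focal length keeps that parameter on a similar numeric range to the other unknowns, which is one plausible reason for the improved stability reported in the text.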

Landmark Term. We employ a global-to-local method for facial landmark localization. For global inference, we first detect the standard 68 facial landmarks, and use this initial estimation to crop local areas including eyes, nose, and mouth - i.e., a total of 4 cropped images. Then we perform fine-scale local inference on the cropped images (Please see the supplementary material for more details). The landmark localization approach produces a set of facial landmarks where . We propose to minimize the distance between the predicted landmarks on the 3D model and the detected landmarks,


where samples a 3D vertex from given a production-ready and sparse triangulation on barycentric coordinates , and are perspective projection and rigid transformation operators respectively, is the distance between two outermost eye landmarks and is used to normalize the eye distance to 100. We pre-select on and follow the sliding scheme [8] to update the barycentric coordinates of the 17 facial contour landmarks at each iteration.
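The two operations in this term, sampling a landmark position on the mesh from barycentric coordinates and normalizing the 2D error by the eye distance, can be sketched as follows. The function names are hypothetical, and the projection of the sampled 3D point is assumed to have been done elsewhere; the exact norm used in the paper's landmark loss is not reproduced here.

```python
import math

def barycentric_point(triangle, bary):
    """Point inside a triangle given barycentric coordinates (b0, b1, b2)."""
    dims = len(triangle[0])
    return tuple(sum(b * v[c] for b, v in zip(bary, triangle)) for c in range(dims))

def landmark_residuals(projected, detected, eye_outer_a, eye_outer_b):
    """2D residuals rescaled so the outer-eye distance equals 100."""
    scale = 100.0 / math.dist(eye_outer_a, eye_outer_b)
    return [(scale * (p[0] - d[0]), scale * (p[1] - d[1]))
            for p, d in zip(projected, detected)]
```

Because the residuals are rescaled by the eye distance, the same loss weights can be used for a face that fills the frame and for one that occupies only a small part of the image.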

Corrective Field Regularization. To enforce a smooth and small per-vertex corrective field, we combine the following two losses,


The first loss is used to regularize a smooth deformation by maintaining the Laplacian operator on the deformed mesh (please refer to [36] for more details). indicates the estimated facial expression blendshape weights from the last iteration and is a fixed value. The second loss is used to enforce a small corrective field and is used to balance the two terms.

Other Regularization Terms. We further regularize on facial expression, focal length scale factor, and rotation component of camera extrinsic as follows,



is the vector of eigenvalues of the facial expression covariance matrix obtained via PCA.

and are regularization parameters.

Summary. Our total loss function is given by


where and are used to balance relative importance of the three terms. is optimized by Gauss-Newton approach over parameters for a total of iterations. For the initial parameter vector , and are initialized as all- vectors, and are estimated from the EPnP approach [26], and is initialized to be .

3.5 Texture Processing

Upon non-rigid deformation, we project the selfie texture to the UV space of the generic model using the estimated camera intrinsics, head pose, facial expression and per-vertex correction. Since usually only the frontal area is visible in a selfie, we recover textures on other areas, e.g., the back of the head and the neck, by using the UV texture of the one of the 482 subjects that is closest to the query subject. We define closeness as the L1 distance between Light CNN-29V2 embeddings, i.e., through face recognition. Finally, given a foreground projected texture and a background default texture, we blend them using Poisson image editing [30].
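The nearest-subject lookup used for the background texture amounts to an L1 nearest-neighbour search over identity embeddings. A minimal sketch, with hypothetical names and with the gallery represented as a dictionary from subject id to embedding:

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_subject(query_embedding, gallery):
    """Return the id of the gallery subject with the smallest L1 distance."""
    return min(gallery, key=lambda sid: l1_distance(query_embedding, gallery[sid]))
```

In the setting described above, the gallery would hold one 256-dimensional embedding per scanned subject, and the returned id selects the default UV texture to blend with the projected selfie texture.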

4 Experimental Results

4.1 Implementation Details

For shape regression, we use Adam optimizer with a learning rate of 0.0001 and the momentum ,

for 500 epochs. We train on a total of 10,000 synthetically rendered facial images with a batch size of 64. For non-rigid deformation, we use a total of

iterations. When minimizing Eq. (4), we use and . In Eq. (2), we set , and in Eq. (3) we set and .

4.2 Database and Evaluation Setup

Stirling/ESRC 3D Faces Database. The ESRC [12] is the latest public 3D face database captured by a Di3D camera system. The database also provides several images captured from different viewpoints under various lighting conditions. We select those subjects who have both a 3D scan and a frontal neutral face image for evaluation. There are 129 subjects in total (62 male and 67 female) for testing. Note that in this dataset, around of people are Caucasian.

JNU-Validation Database. The JNU-Validation Database is a part of the JNU 3D Face Database collected by Jiangnan University [25]. It has 2D images of Asians and their 3D face scans captured by 3dMD. Since the validation database was not used during training, we consider it as a test database for Asians. The 2D images of each subject are in range of . To minimize the impact of imbalanced data, we select three frontal images of each subject for quantitative comparison.

Our Test Data. Since no public database is available for testing that covers all genders and races, we randomly pick five subjects from each of the six groups in Table 1 to form an evaluation database of 30 subjects. The other 482 scans are used for data augmentation and for the training/validation stage for both geometry and texture. Each subject has two testing images: a selfie captured by a Samsung Galaxy S7 and an image captured with a Sony a7R DSLR camera by a photographer.

Evaluation Setup. We compared our method with several state-of-the-art methods including 3DMM-CNN [42], Extreme 3D Face (E3D) [43], PRNet [11], RingNet [32], and GanFit [14]. The details of the reconstructed model of each method are shown in Table 2. Note that for our method and RingNet, the eyes, teeth, tongue and their model holders are removed before comparison. Because the evaluation metric uses the point-to-plane error, unrelated geometry would increase the overall error. Although removing those parts also slightly increases the error (e.g., no data in the eye area to compare), the introduced error is much smaller than the error of directly using the original models.

            Ours          RingNet [32]   GanFit [14]   PRNet [11]   E3D [43]   3DMM-CNN [42]
Full Head   Yes           Yes            No            No           No         No
Vertex      2.9K (2.7K)   5.0K (3.8K)    53.2K         43.7K        155K       47.0K
Face        5.8K (5.3K)   10.0K (7.4K)   105.8K        86.9K        150K       93.3K

Table 2: The geometric complexity of our method and the other methods. Note that except for E3D, each of the other methods uses a fixed topology for its reconstructed models. The numbers inside the parentheses for our method and RingNet are the counts of the head models after unrelated mesh removal.

4.3 Quantitative Comparison

Evaluation Metric: To align the reconstructed model with the ground truth, we followed the procedure of [42, 16, 14] and the challenge [12]. Since the topology of each method is fixed, seven pre-selected vertex indices are first used to roughly align the reconstructed model to the ground truth, and the alignment is then refined by iterative closest point (ICP) [3]. The position of the nose-tip vertex is chosen as the center of both the ground truth and the reconstructed models. Given a distance threshold in mm, we discard the vertices whose distance from this center exceeds the threshold. To evaluate the reconstructed model against the ground truth, we used the Average Root Mean Square Error (ARMSE) as suggested by the 3DFAW Challenge, which computes the closest point-to-mesh distance from the ground truth to the predicted model and vice versa.
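The cropping step and the two-directional error can be illustrated with a simplified sketch. This is not the challenge's reference implementation: it replaces the point-to-mesh distance with a brute-force nearest-point distance between vertex sets, and it averages the two one-way RMSE values, both of which are assumptions made for brevity. All names are hypothetical.

```python
import math

def crop(points, center, threshold_mm):
    """Keep only the points within threshold_mm of the center (nose tip)."""
    return [p for p in points if math.dist(p, center) <= threshold_mm]

def one_way_rmse(src, dst):
    """RMS of nearest-point distances from src to dst (point-to-point stand-in)."""
    squared = [min(math.dist(p, q) for q in dst) ** 2 for p in src]
    return math.sqrt(sum(squared) / len(squared))

def symmetric_rmse(a, b):
    """Average of the two one-way errors, so neither surface is favoured."""
    return 0.5 * (one_way_rmse(a, b) + one_way_rmse(b, a))
```

Measuring in both directions matters because a reconstruction that covers only part of the ground truth can score well one way and poorly the other.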

ESRC and JNU-validation Dataset: In Figure 3, we have chosen and computed the ARMSE for each reconstructed model and ground truth. Note that the annotation provided by ESRC database only has the seven landmark for alignment, thus instead of using the tip of nose, we use the average of the 7 landmark as the center of face. In ESRC, our result is better than other methods when and our performance is more resilient as increases. This indicates that our method can better replicate the shape of the entire head than other methods. In JNU-validation database, since other methods are trained from a Caucasian-dominated 3DMM model, while the other races are also considered during our augmented stage, we can achieve much smaller reconstructed error at every value.

(a) ESRC
(b) JNU-Validation
Figure 3: The quantitative results of our method compared with 3DMM-CNN [42], E3D [43], PRNet [11] and RingNet [32] on both the ESRC and JNU-Validation databases.

Our Test Dataset: In Figure 4 (a), the center of each error bar is the average ARMSE over the 60 reconstructed meshes. The range of the error bar is , where is the standard error. The results show that our reconstructed models are slightly better than those of GanFit and significantly better than those of the other methods. It is worth mentioning that our vertex number is only of RingNet and less than of other methods. In Figure 4 (b), the cropped meshes of the ground truth and each method are shown under different thresholds of . To utilize the reconstructed models for real-world applications, we believe that is the best value because it captures the entire head instead of only the frontal face. We further investigate the performance across races, and the results are shown in Figure 4 (c). Our method can correctly replicate the model to under mm of error in all ethnicities, while other methods such as RingNet and PRNet are very sensitive to ethnicity differences. Although GanFit performed slightly better than our method on the White and Black groups, its overall performance is not as good as ours because it is not able to recover the Asian geometries well. It is worth noting that we used 10,000 synthetic images augmented from fewer than 500 scans, which is only 5% of the data used in GanFit. To fairly visualize the error between methods without the effect of different topologies, we find the closest point-to-plane distance from the ground truth to the reconstructed model and generate a heat map for each method in Figure 5.
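For readers who want to reproduce error bars of the kind shown in Figure 4 (a), the sketch below computes the mean and the standard error of a list of per-mesh errors. The multiplier `k` applied to the standard error is a hypothetical parameter, since the exact range used in the figure is not reproduced here, and the sample standard deviation (n − 1 denominator) is an assumption.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return (mean, standard error) using the sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))

def error_bar(values, k=1.0):
    """Interval mean +/- k * standard error; k is an assumed multiplier."""
    mean, se = mean_and_standard_error(values)
    return mean - k * se, mean + k * se
```

With 60 meshes per method, the standard error is roughly an eighth of the per-mesh standard deviation, which is why the bars can separate methods whose individual errors overlap.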

Figure 4: The quantitative results of our method compare to E3D [43], 3DMM-CNN [42], PRnet [11], RingNet [32] and GanFit [14]. (a) The overall performance of each method. (b) The qualitative comparison of cropped meshes with ground truth in . (c) The evaluation results on different ethnicity.
Figure 5: The heatmap visualization of reconstructed models in . Vertices colored red fall above the 5mm error tolerance, while blue vertices are those which lie within the tolerance.

4.4 Ablation Study

To demonstrate the effectiveness of the individual modules in the proposed approach, we modify one variable at a time and compare with the following alternatives:

No Augmentation (No-Aug): Without any augmentation, we simply repetitively sample 10,000 faces from 482 subjects.

Categorized-PCA Sampling (C-PCA): Instead of DR feature based sampling, we use a PCA based sampling method. We train a shape PCA model from the 482 subjects, and for each group in Table 1, a Gaussian random vector, parameterized by the mean vector and covariance matrix of the PCA coefficients in that group, is used to create weights of the principal shape components. We sample 10,000 faces with this augmentation approach.

Game Engine Rendering (Unity): Instead of using a high-quality photo-realistic renderer, we use Unity, a standard game rendering engine, to synthesize facial images. The quality of the rendered images is comparatively lower than that of V-Ray. We keep the DR feature based augmentation approach and render exactly the same 10,000 synthetic faces mentioned in Section 3.2.
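The C-PCA baseline above draws PCA coefficients from a per-group Gaussian. One standard way to sample from a Gaussian with a given mean and covariance is through a Cholesky factor, sketched here in plain Python. The function names are hypothetical, the covariance is assumed to be symmetric positive definite, and this is an illustration of the sampling idea rather than the code used in the ablation.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular L with L * L^T = cov (cov must be positive definite)."""
    n = len(cov)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(cov[i][i] - s)
            else:
                lower[i][j] = (cov[i][j] - s) / lower[j][j]
    return lower

def sample_gaussian(mean, cov, rng=random):
    """Draw one sample x = mean + L z with z ~ N(0, I)."""
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(lower[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]
```

Each sampled coefficient vector would then be multiplied by the principal shape components and added to the mean shape to obtain a new face.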

In Fig. 6, our proposed approach outperforms all other alternatives. As expected, without data augmentation (i.e., No-Aug), the reconstruction error is the worst among all methods. The difference between C-PCA and our method suggests that DR sampling augmentation creates more natural synthetic faces for training. The difference between Unity and our method shows that the quality of the rendered images plays an important role in bridging the gap between real and synthetic images.

Figure 6: The quantitative results of No-Aug, C-PCA, Unity and our method. The proposed method achieves the best performance throughout.

4.4.1 Qualitative Comparison

Figure 7 shows our shape estimation method on frontal face images side by side with state-of-the-art methods on the MoFA test database. We picked the same images shown in GanFit [14]. Our method creates accurate face geometry while also capturing discriminative features that make the identity of each face easily distinguishable from the others. Meanwhile, as shown in Table 2, our result maintains a low geometric complexity. This allows our avatars to be production-ready even in demanding cases such as on mobile platforms. In Figure 8, we choose a few celebrities to verify the geometric accuracy of our method compared with others. In Figure 9, we demonstrate our final results with the blended diffuse maps described in Section 3.5.

Figure 7: The qualitative comparison of our method with PRNet [11], MoFA [39], RingNet [32], GanFit [14] and E3D [43]. Our method accurately reconstructs the geometry while maintaining a much lower vertex count, which is more suitable for production.
Figure 8: Our reconstruction results for several celebrities compared with RingNet [32], PRNet [11] and E3D [43].
Figure 9: Our final results with blended diffuse maps.

5 Conclusions and Future Works

In this paper, we demonstrated a supervised learning approach for estimating a high-quality 3D face shape with a photo-realistic high-resolution diffuse map. To facilitate facial image synthesis, we have collected and processed a prioritized 3D face database, from which we can sample augmented 3D face shapes with UV texture to render a large collection of photo-realistic facial images. Unlike previous approaches, our method combines the discriminative power of an off-the-shelf face recognition neural network, trained on millions of facial images, with shape supervision from synthesized photo-realistic facial images.

We have demonstrated that features learned for accurate face recognition transfer to the full reconstruction of facial geometry from a single selfie. While training on synthetically generated facial imagery, we have observed strong generalization power when testing on real-world images. This opens up opportunities in many interesting applications, including VR/AR, teleconferencing, virtual try-on, computer games, and special effects.

Section 3.2. Scan Pre-processing

As shown in Fig. 10, we process raw textured 3D facial scan data to generate our 3D face representation, which consists of a shape model with a low polygon count and a high-resolution diffuse map for preserving details.

Figure 10: Top row: left is the raw facial scan with dense topology, and right is the model with UV texture; Bottom row: left is the processed face model with sparse topology, and right is the model with UV texture.

Section 3.2. Deformation Representation

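Here we give a detailed formulation of the Deformation Representation (DR) feature. The DR feature encodes the local deformation around each vertex of a deformed mesh $P'$ with respect to a reference mesh $P$ into a vector in $\mathbb{R}^9$. We use the mean face of all 482 processed facial models as the reference mesh.

Encode $f$ from $P'$. We denote the $i$-th vertex of $P$ and $P'$ as $p_i$ and $p'_i$ respectively. The deformation gradient in the closest neighborhood $N_i$ of the $i$-th vertex from the reference model to the deformed model is defined by the affine transformation matrix $T_i$ that minimizes the following energy

$$E(T_i) = \sum_{j \in N_i} c_{ij} \left\| (p'_i - p'_j) - T_i (p_i - p_j) \right\|^2,$$

where $c_{ij}$ is the cotangent weight depending on the reference model to handle irregular tessellation. With polar decomposition, $T_i$ is decomposed into a rotation component $R_i$ and a scaling/shear component $S_i$ such that $T_i = R_i S_i$. The rotation matrix $R_i$ can be represented with a rotation axis $\omega_i$ and rotation angle $\theta_i$ pair, and we further convert them to the matrix logarithm representation:

$$\log R_i = \theta_i \begin{bmatrix} 0 & -\omega_{i,z} & \omega_{i,y} \\ \omega_{i,z} & 0 & -\omega_{i,x} \\ -\omega_{i,y} & \omega_{i,x} & 0 \end{bmatrix}.$$

Finally the DR feature for $P'$ is represented by $f = \{ f_i \}$ where $f_i = \{ \log R_i;\ S_i - I \}$ and $I$ is the identity matrix. Since $\log R_i$ is antisymmetric (3 DoF) and $S_i$ is symmetric (6 DoF), $f_i$ has 9 DoF.

Recover $P'$ from $f$. Given the DR feature $f$ and the reference mesh $P$, we first recover the affine transformation $T_i$ for each vertex. Then we try to recover the optimal $P'$ that minimizes:

$$E(P') = \sum_{i} \sum_{j \in N_i} c_{ij} \left\| (p'_i - p'_j) - T_i (p_i - p_j) \right\|^2.$$

For each $p'_i$, we obtain it by solving $\partial E(P') / \partial p'_i = 0$, which gives

$$2 \sum_{j \in N_i} c_{ij}\, e'_{ij} = \sum_{j \in N_i} c_{ij} (T_i + T_j)\, e_{ij},$$

where $e'_{ij} = p'_i - p'_j$ and $e_{ij} = p_i - p_j$. The resulting equations for all $p'_i$ lead to a linear system which can be written as $A p' = b$. By specifying the position of one vertex, we obtain the unique solution of this system and fully recover $P'$.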
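
The encode and recover steps described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy tetrahedron mesh, the uniform weights (standing in for the cotangent weights), and the layout of the 9-vector are all assumptions made for illustration. Each vertex's affine transform is fitted by weighted least squares over its one-ring edges, split by polar decomposition (via SVD), and stored as a rotation vector plus the upper triangle of the scaling/shear part minus the identity. Recovery solves the linear system with one vertex pinned.

```python
import numpy as np

def encode_vertex(p_ref, p_def, i, nbrs, w):
    """9-DoF feature of vertex i: rotation vector (3) + upper triangle of S - I (6)."""
    E = p_ref[i] - p_ref[nbrs]             # reference edges e_ij, shape (k, 3)
    Ed = p_def[i] - p_def[nbrs]            # deformed edges e'_ij, shape (k, 3)
    # Weighted least squares: T = (sum_j c e' e^T) (sum_j c e e^T)^-1
    T = (Ed.T * w) @ E @ np.linalg.pinv((E.T * w) @ E)
    # Polar decomposition T = R S through the SVD T = U diag(s) V^T
    U, s, Vt = np.linalg.svd(T)
    if np.linalg.det(U @ Vt) < 0:          # keep R a proper rotation
        U[:, -1] *= -1
        s[-1] *= -1
    R = U @ Vt
    S = Vt.T @ np.diag(s) @ Vt
    # Matrix logarithm of R, stored as the rotation vector theta * axis
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    skew = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    r = skew / 2 if theta < 1e-8 else theta / (2 * np.sin(theta)) * skew
    return np.concatenate([r, (S - np.eye(3))[np.triu_indices(3)]])

def decode_vertex(f):
    """Rebuild the affine transform T = R S from a 9-DoF feature."""
    r, s6 = f[:3], f[3:]
    theta = np.linalg.norm(r)
    K = np.array([[0, -r[2], r[1]], [r[2], 0, -r[0]], [-r[1], r[0], 0]])  # log R
    if theta < 1e-8:
        R = np.eye(3) + K
    else:                                   # Rodrigues' formula
        R = (np.eye(3) + np.sin(theta) / theta * K
             + (1 - np.cos(theta)) / theta**2 * (K @ K))
    S = np.eye(3)
    S[np.triu_indices(3)] += s6
    S = S + S.T - np.diag(np.diag(S))       # mirror the upper triangle
    return R @ S

def recover_mesh(p_ref, T, nbrs, w, anchor, anchor_pos):
    """Solve 2 sum_j c e'_ij = sum_j c (T_i + T_j) e_ij with one vertex fixed."""
    n = len(p_ref)
    A, b = np.zeros((n, n)), np.zeros((n, 3))
    for i in range(n):
        for j, c in zip(nbrs[i], w[i]):
            A[i, i] += 2 * c
            A[i, j] -= 2 * c
            b[i] += c * (T[i] + T[j]) @ (p_ref[i] - p_ref[j])
    A[anchor] = 0                            # pin one vertex to remove the
    A[anchor, anchor] = 1                    # translation ambiguity
    b[anchor] = anchor_pos
    return np.linalg.solve(A, b)

# Toy example: a tetrahedron deformed by a known rotation, stretch and shift.
p_ref = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
nbrs = [[j for j in range(4) if j != i] for i in range(4)]
w = [np.ones(3) for _ in range(4)]           # uniform weights (assumption)
a = 0.3
R0 = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
T0 = R0 @ np.diag([1.1, 0.9, 1.0])
p_def = p_ref @ T0.T + np.array([0.5, -0.2, 0.1])

feats = np.array([encode_vertex(p_ref, p_def, i, nbrs[i], w[i]) for i in range(4)])
T_rec = [decode_vertex(f) for f in feats]
p_rec = recover_mesh(p_ref, T_rec, nbrs, w, anchor=0, anchor_pos=p_def[0])
```

In this toy case every vertex shares the same transform, so the round trip reproduces the deformed vertices up to floating-point error. The sketch assumes symmetric neighborhoods and weights, and the logarithm formula used here becomes ill-conditioned for rotation angles close to 180 degrees.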

Section 3.4. Landmark Localization

To achieve higher landmark localization accuracy, we have developed a coarse-to-fine approach. First, we predict all facial landmarks from the detected facial bounding box. Then, given the initial landmarks, we crop the eye, nose, and mouth areas for the second-stage fine-scale landmark localization. Fig. 11 shows our landmark mark-up as well as the bounding boxes used for the fine-scale landmark localization stage. We use a regression-forest-based approach [24] as the base landmark predictor and train four landmark predictors in total, i.e., for the overall face, eye, nose, and mouth.

(a) Input Image
(b) Landmark Detection of each part
Figure 11: Our landmark mark-up consists of 104 points, i.e., face contour (1-17), eyebrows (18-27), left eye (28-47), right eye (48-67), nose (68-84) and mouth (85-104). (a) Coarse detection of all landmarks and corresponding bounding boxes for fine-scale detection. (b) Separate fine-scale detection results for local areas.
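
The coarse-to-fine flow can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' code: the index ranges come from the mark-up in Fig. 11, while the `margin` padding, the predictor call signature, and the choice to crop each eye separately with a shared eye predictor are assumptions.

```python
# Landmark index ranges from the 104-point mark-up (1-based, inclusive).
# Each entry: (name of the fine predictor to use, first index, last index).
REGIONS = [
    ("eye", 28, 47),     # left eye
    ("eye", 48, 67),     # right eye
    ("nose", 68, 84),
    ("mouth", 85, 104),
]

def region_box(landmarks, first, last, margin=0.25):
    """Axis-aligned box around landmarks first..last, padded by a
    fraction `margin` of the box size (margin value is an assumption)."""
    pts = landmarks[first - 1:last]
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    return (min(xs) - pad_x, min(ys) - pad_y, max(xs) + pad_x, max(ys) + pad_y)

def refine_landmarks(image, coarse, fine_predictors, margin=0.25):
    """Second stage: re-predict eye, nose and mouth points inside crops
    derived from the coarse landmarks. `fine_predictors` maps a region
    name to a hypothetical callable (image, box, initial_points) -> points."""
    refined = list(coarse)
    for name, first, last in REGIONS:
        box = region_box(coarse, first, last, margin)
        initial = coarse[first - 1:last]
        refined[first - 1:last] = fine_predictors[name](image, box, initial)
    return refined

# Usage with stand-in data: fake coarse landmarks and dummy predictors
# that shift every point by one pixel in x.
coarse = [(float(k), float(2 * k)) for k in range(1, 105)]
shift = lambda image, box, pts: [(x + 1.0, y) for x, y in pts]
predictors = {"eye": shift, "nose": shift, "mouth": shift}
refined = refine_landmarks(None, coarse, predictors)
tight_box = region_box(coarse, 28, 47, margin=0.0)
```

Face-contour and eyebrow points (1-27) are left at their coarse positions in this sketch because the text only names eye, nose, and mouth crops for the second stage.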

Section 4.4. Different Rendering Quality

In this section, we first illustrate the 30 manually created lighting conditions used for high-quality V-Ray rendering, as shown in Fig. 12. Then we provide several synthetic face images rendered with V-Ray and Unity, as shown in Fig. 13. Note that, for both rendering methods, we randomized the head pose, environment map, lighting condition, and field of view (FOV) to mimic real-world selfies. We do not render eye models; as a result, we mask out the eye area using the detected facial landmarks at test time, as mentioned in Section 3.2.
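
As a rough illustration of this randomization, the sketch below samples one set of rendering parameters per synthetic image. Only the count of 30 lighting conditions comes from the text; the class name, the angle and FOV ranges, and the number of environment maps are hypothetical placeholders, not values used by the authors.

```python
import random
from dataclasses import dataclass

@dataclass
class RenderParams:
    yaw_deg: float
    pitch_deg: float
    roll_deg: float
    fov_deg: float
    lighting_id: int     # index into the 30 manually created lighting setups
    env_map_id: int      # index into a pool of environment maps

def sample_render_params(rng, n_lightings=30, n_env_maps=8,
                         yaw=(-30.0, 30.0), pitch=(-20.0, 20.0),
                         roll=(-10.0, 10.0), fov=(40.0, 80.0)):
    """Draw one random rendering configuration.
    All ranges and n_env_maps are placeholder assumptions."""
    return RenderParams(
        yaw_deg=rng.uniform(*yaw),
        pitch_deg=rng.uniform(*pitch),
        roll_deg=rng.uniform(*roll),
        fov_deg=rng.uniform(*fov),
        lighting_id=rng.randrange(n_lightings),
        env_map_id=rng.randrange(n_env_maps),
    )

# Usage: a reproducible batch of configurations to hand to a renderer.
rng = random.Random(0)
batch = [sample_render_params(rng) for _ in range(1000)]
```

Passing an explicit `random.Random` instance keeps the sampled dataset reproducible across runs.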

Figure 12: Different lighting conditions for photo-realistic rendering augmentation
(a) V-Ray rendering samples
(b) Unity rendering samples
Figure 13: The synthetic facial images from (a) Maya V-Ray and (b) Unity

Section 4.5. More Qualitative Results

In this section, we provide additional comparison results that could not be included in the paper due to page limits. For GanFit [14], we asked the authors to run their method on our test data; we are therefore only able to show the qualitative comparison with GanFit on our test database. For images and selfies from the other databases, we compare our results with methods whose code is available online, including RingNet [32], PRNet [11], Extreme3D [43] and 3DMM-CNN [42].

More Qualitative Results of Our Data

In Fig. 14, we provide qualitative results for each category. The first and second columns are the input image and the ground truth. Instead of showing the cropped mesh, we show the whole model for each method in Fig. 14. It is worth noting that our reconstructed full head model is ready to be deployed in different applications.

Figure 14: Qualitative results on our test dataset. From left to right: input image, ground truth, our method, GanFit [14], RingNet [32], E3D [43], 3DMM-CNN [42], and PRNet [11].

Qualitative Results on ESRC and JNU-Validation

Due to the page limit of the paper, we were not able to show qualitative results on the ESRC and JNU-Validation datasets there. As shown in Fig. 15, the results are consistent with what we claimed in the paper: the proposed method can correctly replicate the 3D model from a single selfie with a much lower polygon count.

(a) ESRC Male
(b) ESRC Female
(c) JNU-Validation Database
Figure 15: Qualitative results on the ESRC and JNU-Validation datasets. From left to right: input image, ground truth, our method, 3DMM-CNN [42], E3D [43], PRNet [11], RingNet [32].

More Qualitative Results of MoFA

For Fig. 16, we requested results from the authors of MoFA [39] for side-by-side comparisons. Although the quality of the reconstructed models is not as good as on the other databases because of the image resolution, large head pose variation, and occlusions such as hair and glasses, our results are still considerably better than those of the other methods.

(a) MoFA Male
(b) MoFA Female
Figure 16: Qualitative results on the MoFA dataset. From left to right: input image, our method, RingNet [32], PRNet [11], 3DMM-CNN [42], and MoFA [39].

More Celebrity-In-the-Wild Results

In Figs. 17-20, we present the results for several celebrities and compare our method not only in geometry but also in appearance. Note that by projecting the selfie onto a high-resolution UV texture, our reconstructed models have a photo-realistic appearance, whereas 3DMM-CNN [42] and PRNet [11] use vertex colors, which limits the texture detail they can reproduce.

Figure 17: Qualitative results of our method compared with RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].
Figure 18: Qualitative results of our method compared with RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].
Figure 19: Qualitative results of our method compared with RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].
Figure 20: Qualitative results of our method compared with RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].

Application - Audio-driven Avatar Animation

Our automatically generated head model is ready for different applications. Here we demonstrate automatic lip syncing driven by raw waveform audio input, as shown in Fig. 21. For data collection and the deep neural network structure, we adopt a pipeline similar to that of [23] to drive the reconstructed model. All the animation blendshapes are transferred to our generic topology. Please refer to our video for more details.

Figure 21: Audio-driven lip syncing on our production-ready head model


  • [1] K. Aitpayev and J. Gaber (2012) Creation of 3d human avatar using kinect. Asian Transactions on Fundamentals of Electronics, Communication & Multimedia 1 (5), pp. 12–24. Cited by: §1.
  • [2] O. Aldrian and W. A. Smith (2012) Inverse rendering of faces with a 3d morphable model. IEEE transactions on pattern analysis and machine intelligence 35 (5), pp. 1080–1093. Cited by: §1, §2.
  • [3] B. Amberg, S. Romdhani, and T. Vetter (2007) Optimal step nonrigid icp algorithms for surface registration. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8. Cited by: §4.3.
  • [4] V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’99, New York, NY, USA, pp. 187–194. External Links: ISBN 0-201-48560-5, Link, Document Cited by: §1, §1, §2, §2.
  • [5] T. Bolkart and S. Wuhrer (2015-12) A groupwise multilinear correspondence optimization for 3d faces. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3604–3612. External Links: Document, ISSN 2380-7504 Cited by: §2.
  • [6] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniahy, and D. Dunaway (2016-06) A 3d morphable model learnt from 10,000 faces. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5543–5552. External Links: Document, ISSN 1063-6919 Cited by: §1, §2.
  • [7] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou (2014-03) FaceWarehouse: a 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20 (3), pp. 413–425. External Links: Document, ISSN 1077-2626 Cited by: §2, §3.2, §3.4.
  • [8] C. Cao, Q. Hou, and K. Zhou (2014) Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on graphics (TOG) 33 (4), pp. 43. Cited by: §3.4.
  • [9] P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar (2000) Acquiring the reflectance field of a human face. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 145–156. Cited by: §1.
  • [10] P. Dou, S. K. Shah, and I. A. Kakadiaris (2017) End-to-end 3d face reconstruction with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5908–5917. Cited by: §2.
  • [11] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, Cited by: §1, §2, Figure 3, Figure 4, Figure 7, Figure 8, §4.2, Table 2, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20, Section 4.5. More Qualitative Results, Section 4.5. More Qualitative Results.
  • [12] Z. Feng, P. Huber, J. Kittler, P. Hancock, X. Wu, Q. Zhao, P. Koppen, and M. Raetsch (2018-05) Evaluation of dense 3d reconstruction from 2d face images in the wild. In 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018), pp. 780–786. External Links: Document Cited by: §4.2, §4.3.
  • [13] L. Gao, Y. Lai, J. Yang, Z. Ling-Xiao, S. Xia, and L. Kobbelt (2019) Sparse data driven mesh deformation. IEEE transactions on visualization and computer graphics. Cited by: §3.2.
  • [14] B. Gecer, S. Ploumpis, I. Kotsia, and S. Zafeiriou (2019) GANFIT: generative adversarial network fitting for high fidelity 3d face reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1164. Cited by: §1, §2, §2, Figure 4, Figure 7, §4.2, §4.3, §4.4.1, Table 2, Figure 14, Section 4.5. More Qualitative Results.
  • [15] Z. Geng, C. Cao, and S. Tulyakov (2019) 3d guided fine-grained face manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9821–9830. Cited by: §2.
  • [16] K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. T. Freeman (2018) Unsupervised training for 3d morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8377–8386. Cited by: §1, §2, §2, §3.3, §4.3.
  • [17] L. Hu, S. Saito, L. Wei, K. Nagano, J. Seo, J. Fursund, I. Sadeghi, C. Sun, Y. Chen, and H. Li (2017-11) Avatar digitization from a single image for real-time rendering. Vol. 36, ACM, New York, NY, USA. External Links: ISSN 0730-0301, Document Cited by: §2.
  • [18] L. Hu, S. Saito, L. Wei, K. Nagano, J. Seo, J. Fursund, I. Sadeghi, C. Sun, Y. Chen, and H. Li (2017) Avatar digitization from a single image for real-time rendering. ACM Transactions on Graphics (TOG) 36 (6), pp. 195. Cited by: §1.
  • [19] A. E. Ichim, S. Bouaziz, and M. Pauly (2015) Dynamic 3d avatar creation from hand-held video input. ACM Transactions on Graphics (ToG) 34 (4), pp. 45. Cited by: §1.
  • [20] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos (2017) Large pose 3d face reconstruction from a single image via direct volumetric cnn regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1031–1039. Cited by: §2.
  • [21] Z. Jiang, Q. Wu, K. Chen, and J. Zhang (2018) Disentangled representation learning for 3d face shape. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §3.2.
  • [22] A. Jourabloo and X. Liu (2016) Large-pose face alignment via cnn-based dense 3d model fitting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4188–4196. Cited by: §2.
  • [23] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen (2017-07) Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. 36 (4), pp. 94:1–94:12. External Links: ISSN 0730-0301, Link, Document Cited by: Application - Audio-driven Avatar Animation.
  • [24] V. Kazemi and J. Sullivan (2014) One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1867–1874. Cited by: Section 3.4. Landmark Localization.
  • [25] P. Koppen, Z. Feng, J. Kittler, M. Awais, W. Christmas, X. Wu, and H. Yin (2018-02) Gaussian mixture 3d morphable face model. Pattern Recogn. 74 (C), pp. 617–628. External Links: ISSN 0031-3203, Link, Document Cited by: §4.2.
  • [26] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) Epnp: an accurate o (n) solution to the pnp problem. International journal of computer vision 81 (2), pp. 155. Cited by: §3.4.
  • [27] T. Li, T. Bolkart, Michael. J. Black, H. Li, and J. Romero (2017) Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) 36 (6). External Links: Link Cited by: §2.
  • [28] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015-09) Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), G. K. L. Tam (Ed.), pp. 41.1–41.12. External Links: Document, ISBN 1-901725-53-7, Link Cited by: §1, §2.
  • [29] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009-09) A 3d face model for pose and illumination invariant face recognition. In 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301. External Links: Document Cited by: §3.2.
  • [30] P. Pérez, M. Gangnet, and A. Blake (2003) Poisson image editing. ACM Transactions on graphics (TOG) 22 (3), pp. 313–318. Cited by: §3.5.
  • [31] S. Romdhani and T. Vetter (2005) Estimating 3d shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, pp. 986–993. Cited by: §1, §2.
  • [32] S. Sanyal, T. Bolkart, H. Feng, and M. J. Black (2019) Learning to regress 3d face shape and expression from an image without 3d supervision. External Links: 1905.06817 Cited by: §1, §2, Figure 3, Figure 4, Figure 7, Figure 8, §4.2, Table 2, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20, Section 4.5. More Qualitative Results.
  • [33] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §2.
  • [34] M. Sela, E. Richardson, and R. Kimmel (2017) Unrestricted facial geometry reconstruction using image-to-image translation. arXiv. Cited by: §2.
  • [35] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs (2018) SfSNet: learning shape, reflectance and illuminance of faces ‘in the wild’. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6296–6305. Cited by: §2.
  • [36] O. Sorkine, D. Cohen-Or, Y. Lipman, M. Alexa, C. Rössl, and H. Seidel (2004) Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pp. 175–184. Cited by: §3.4.
  • [37] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708. Cited by: §1, §2.
  • [38] A. Tewari, M. Zollhöfer, P. Garrido, F. Bernard, H. Kim, P. Pérez, and C. Theobalt (2018) Self-supervised multi-level face model learning for monocular reconstruction at over 250 hz. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2549–2559. Cited by: §2.
  • [39] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt (2017) MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3735–3744. Cited by: §1, §2, Figure 7, Figure 16, Section 4.5. More Qualitative Results.
  • [40] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2face: real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395. Cited by: §2.
  • [41] L. Tran and X. Liu (2018-06) Nonlinear 3d face morphable model. In IEEE Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT. Cited by: §2.
  • [42] A. Tuan Tran, T. Hassner, I. Masi, and G. Medioni (2017) Regressing robust and discriminative 3d morphable models with a very deep neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5163–5172. Cited by: §2, Figure 3, Figure 4, §4.2, §4.3, Table 2, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20, Section 4.5. More Qualitative Results, Section 4.5. More Qualitative Results.
  • [43] A. Tuan Tran, T. Hassner, I. Masi, E. Paz, Y. Nirkin, and G. Medioni (2018) Extreme 3d face reconstruction: seeing through occlusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3935–3944. Cited by: §1, §2, Figure 3, Figure 4, Figure 7, Figure 8, §4.2, Table 2, Figure 14, Figure 15, Figure 17, Figure 18, Figure 19, Figure 20, Section 4.5. More Qualitative Results.
  • [44] D. Vlasic, M. Brand, H. Pfister, and J. Popović (2005-07) Face transfer with multilinear models. ACM Trans. Graph. 24 (3), pp. 426–433. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [45] L. Wang, W. Han, F. K. Soong, and Q. Huo (2011) Text driven 3d photo-realistic talking head. In Twelfth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [46] Q. Wu, J. Zhang, Y. Lai, J. Zheng, and J. Cai (2018) Alive caricature from 2d to 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7336–7345. Cited by: §3.2.
  • [47] X. Wu, R. He, Z. Sun, and T. Tan (2018) A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security 13 (11), pp. 2884–2896. Cited by: §3.3.
  • [48] S. Yamaguchi, S. Saito, K. Nagano, Y. Zhao, W. Chen, K. Olszewski, S. Morishima, and H. Li (2018) High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Transactions on Graphics (TOG) 37 (4), pp. 162. Cited by: §1, §2.
  • [49] H. Yi, C. Li, Q. Cao, X. Shen, S. Li, G. Wang, and Y. Tai (2019) MMFace: a multi-metric regression network for unconstrained face reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7663–7672. Cited by: §2.
  • [50] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li (2016) Face alignment across large poses: a 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 146–155. Cited by: §1, §2.