Parsing and recovering 3D face models from images has been a hot research topic in both computer vision and computer graphics due to its many applications. As learning-based methods have become the mainstream in face tracking, recognition, reconstruction and synthesis, 3D face datasets have become increasingly important. While there are numerous 2D face datasets, the few 3D datasets lack 3D detail and scale. As a result, learning-based methods that rely on 3D information suffer.
Existing 3D face datasets capture the face geometry using a sparse camera array [8, 19, 34] or an active depth sensor such as Kinect and coded light. These setups limit the quality of the recovered faces. We capture 3D face models using a dense 68-camera array under controlled illumination, which recovers shapes with wrinkle- and pore-level detail, as shown in Figure 1. In addition to shape quality, our dataset provides a considerable number of scans for study. We invited 938 people between the ages of 16 and 70 as subjects, and each subject was guided to perform 20 specified expressions, generating 18,760 high-quality 3D face models. The corresponding color images and the subjects’ basic information (such as age and gender) are also recorded.
Based on the high-fidelity raw data, we build a powerful parametric model to represent the detailed face shape. All the raw scans are first transformed into a topologically uniformed base model representing the rough shape and a displacement map representing the detailed shape. The transformed models are then used to build bilinear models in the identity and expression dimensions. Experiments show that our bilinear model exceeds previous methods in representation power.
Using the FaceScape dataset, we study how to predict a detailed riggable face model from a single image. Prior methods are able to estimate rough blendshapes, but recover no wrinkles or subtle features. The main problem is how to predict the variation of small-scale geometry caused by expression changes, such as wrinkles. We propose dynamic details, which can be predicted from a single image by training a deep neural network on the FaceScape dataset. Combined with a bilinear model fitting method, we present a full system to predict a detailed riggable model. Our system consists of three stages: base model fitting, displacement map prediction and dynamic details synthesis. As shown in Figure 1, our method predicts a detailed 3D face model that contains subtle geometry, and achieves high accuracy thanks to the powerful bilinear model generated from the FaceScape dataset. The predicted model can be rigged to various expressions with plausible detailed geometry.
Our contributions are summarized as follows:
We present a large-scale 3D face dataset, FaceScape, consisting of 18,760 extremely detailed 3D face models. All the models are processed into topologically uniformed base models for rough shape and displacement maps for detailed shape. The data are released free for non-commercial research.
We model the variation of detailed geometry across expressions as dynamic details, and propose to learn the dynamic details from FaceScape using a deep neural network.
We present a full pipeline to predict a detailed riggable face model from a single image. The resulting model can be rigged to various expressions with plausible geometric details.
2 Related Work
3D Face Dataset. 3D face datasets are of great value in face-related research areas. Existing 3D face datasets can be categorized according to how the 3D face models are acquired. Model-fitting datasets [33, 60, 23, 5, 7] fit a 3D morphable model to collected images, which makes it convenient to build a large-scale dataset from in-the-wild faces. The major problems of fitted 3D models are their uncertain accuracy and lack of detailed shape. To obtain accurate 3D face shapes, a number of works reconstruct the 3D face using active methods, including depth sensors or scanners [53, 52, 3, 38, 37, 12, 17], while other works build sparse multi-view camera systems [54, 18]. Traditional depth sensors and 3D scanners suffer from limited spatial resolution, so they cannot recover detailed facial geometry. Sparse multi-view camera systems suffer from unstable and inaccurate reconstruction [39, 56, 55]. These drawbacks limit the quality of the 3D face models in previous datasets. In contrast, FaceScape obtains its 3D face models from a dense multi-view system with 68 DSLR cameras, which provides extremely high-quality face models. The parameters measuring 3D model quality are listed in Table 1. Our dataset outperforms previous works in both model quality and data amount. Note that Table 1 does not list datasets that provide only a parametric model but no source 3D models [8, 6, 33, 30].
| Dataset | Sub. Num | Exp. Num | Vert. Num | Image / Texture Resolution | Source |
|---|---|---|---|---|---|
| BU-3DFE | 100 | 25 | 10k-20k | / - | structured light |
| BU-4DFE | 101 | 6 (video) | 10k-20k | / - | structured light |
| BJUT-3D | 500 | 1-3 | 200k | / - | laser scanner |
| Bosphorus | 105 | 35 | 35k | / - | structured light |
| D3DFACS | 10 | 38 AU (video) | 30k | - / | multi-view system (6) |
| BP4D-Spontaneous | 41 | 27 AU (video) | 37k | / - | multi-view system (3) |
| FaceScape (Ours) | 938 | 20 | 2M | 4k-8k / 4096×4096 | multi-view system (68) |
3D Morphable Model. 3DMM is a statistical model which transforms the shape and texture of faces into a vector space representation. As 3DMM inherently contains explicit correspondences from model to model, it is widely used in model fitting, face synthesis, image manipulation, etc. Recent research on 3DMM can be generally divided into two directions. The first direction is to separate the parametric space into multiple dimensions like identity, expression and visemes, so that the model can be controlled by these attributes separately [49, 12, 29, 26]. The models in the expression dimension can be further transformed into a set of blendshapes, which can be rigged to generate individual-specific animation. The other direction is to enhance the representation power of 3DMM by using deep neural networks to represent the 3DMM bases [2, 42, 45, 47, 46, 16].
Single-view Shape Prediction. Predicting 3D shape from a single image is a key problem for many applications like view synthesis [22, 57, 58] and stereoscopic video generation [13, 24]. The emergence of 3DMM has simplified single-view face reconstruction to a model fitting problem, which can be solved well by fitting facial landmarks and other features [36, 43] or by regressing the parameters of a 3DMM with a deep neural network [20, 60]. However, fitting a 3DMM struggles to recover small details from the input image due to its limited representation power. To solve this problem, several recent works adopt multi-layer refinement structures. Richardson et al. [35] and Sela et al. [40] both proposed to first predict a rough facial shape and render it to a depth map, then refine the depth map to enhance the details from the registered source image. Sengupta et al. [41] proposed to train SfSNet on a combination of labeled synthetic data and unlabeled in-the-wild data to estimate plausible detailed shape in unconstrained images. Tran et al. [44] proposed to predict a bump map to represent wrinkle-level geometry based on a rough base model. Huynh et al. [25] utilized an image-to-image network and a super-resolution network to recover mesoscopic facial geometry in the form of a displacement map. Chen et al. [15] also predicted the displacement map with a conditional GAN based on the 3DMM model, which makes it possible to recover detailed shape from an in-the-wild image.
Our work advances the state of the art in multiple aspects. In terms of dataset, FaceScape is by far the largest with the highest quality. A detailed quantitative comparison with previous datasets is made in Table 1. In 3D face prediction, previous works focus on enhancing the static detailed facial shape, while we study the problem of recovering an animatable model from a single image. We demonstrate for the first time that a detailed and rigged 3D face model can be recovered from a single image. The rigged model exhibits expression-dependent geometric details such as wrinkles.
3.1 3D Face Capture
We use a multi-view 3D reconstruction system to capture the raw mesh models for the dataset. The multi-view system consists of 68 DSLR cameras, 30 of which capture 8K images focusing on the front of the face, while the other cameras capture 4K images of the sides. The camera shutters are synced to be triggered within ms. We spent six months inviting 938 people to be our capture subjects. The subjects are between 16 and 70 years old, and are mostly from Asia. We follow FaceWarehouse [12], which asks each subject to perform 20 specific expressions, including the neutral expression, for capture. The total number of reconstructed models reaches roughly 18,760, which is the largest among expression-controlled 3D face datasets. Each reconstructed model is a triangle mesh with roughly 2 million vertices and 4 million triangle faces. The meta information for each subject is recorded, including age, gender, and job (provided voluntarily). We show the statistical information about the subjects in our dataset in Figure 2, and a comparison with prior 3D face datasets in Table 1.
3.2 Topologically Uniformed Model
We down-sample the raw recovered mesh into a rough mesh with fewer triangle faces, namely the base shape, and then build a 3DMM for these simplified meshes. First, we roughly register all the meshes to the template face model by aligning 3D facial landmarks; then NICP [1] is used to deform the template to fit the scanned meshes. The deformed meshes can be used to represent the original scanned faces with minor accuracy loss, and more importantly, all of the deformed models share a uniform topology. The detailed steps to register all the raw meshes are described in the supplementary material.
After obtaining the topologically uniformed base shape, we use displacement maps in UV space to represent the middle- and fine-scale details that are not captured by the base model due to its small number of vertices and faces. We find the surface points of the base mesh corresponding to the pixels in the displacement map, then inverse-project the points to the raw mesh along the normal direction to find their corresponding points. The pixel value of the displacement map is set to the signed distance from the point on the base mesh to its corresponding point.
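The rough landmark-based registration described above can be illustrated with a least-squares similarity alignment. This is a minimal sketch, assuming the closed-form Umeyama/Kabsch solution is an acceptable stand-in; the actual registration steps are deferred to the supplementary material, and the function name `align_landmarks` and the `(N, 3)` array layout are hypothetical. The subsequent non-rigid NICP deformation is not shown.

```python
import numpy as np

def align_landmarks(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    that maps src landmarks onto dst landmarks; both arrays have shape (N, 3).
    Closed-form Umeyama solution, used here as an assumed stand-in."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Usage: map scan landmarks into the template frame.
rng = np.random.default_rng(0)
scan_lm = rng.normal(size=(10, 3))
c, sn = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -sn, 0.0], [sn, c, 0.0], [0.0, 0.0, 1.0]])
template_lm = 2.0 * scan_lm @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = align_landmarks(scan_lm, template_lm)
aligned = s * scan_lm @ R.T + t
```

A similarity transform (rather than a purely rigid one) is chosen here so that scans at different scales can share one template; whether the original pipeline estimates scale is not stated.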
We use base shapes to represent rough geometry and displacement maps to represent detailed geometry, which is a two-layer representation for our extremely detailed face shape. The new representation takes roughly of the original mesh data size, while maintaining the mean absolute error to be less than mm.
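The two-layer representation can be sketched as follows, assuming the correspondence between each base-mesh point and its raw-mesh point has already been found by the search along the normal. The function names and array layout are hypothetical, and the ray-mesh intersection itself is omitted.

```python
import numpy as np

def signed_displacement(base_pts, normals, raw_pts):
    """Signed distance from each base-mesh point to its corresponding
    raw-mesh point, measured along the unit normal of the base mesh.
    All inputs have shape (N, 3); the result has shape (N,)."""
    return np.einsum("ij,ij->i", raw_pts - base_pts, normals)

def apply_displacement(base_pts, normals, disp):
    """Recover detailed surface points by offsetting base points along normals."""
    return base_pts + disp[:, None] * normals

base = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
nrm = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
raw = np.array([[0.0, 0.0, 0.2], [1.0, 0.0, -0.1]])
disp = signed_displacement(base, nrm, raw)      # one value per UV pixel
restored = apply_displacement(base, nrm, disp)  # base shape + detail layer
```

Because the corresponding point lies on the normal ray, a single signed scalar per pixel is enough to restore it, which is what makes the displacement map compact.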
3.3 Bilinear Model
The bilinear model was first proposed by Vlasic et al. [49]; it is a special form of 3D morphable model that parameterizes face models in both identity and expression dimensions. The bilinear model can be linked to a face-fitting algorithm to extract identity, and the fitted individual-specific model can be further transformed into riggable blendshapes. Here we describe how to generate the bilinear model from our topologically uniformed models. Given 20 registered meshes in different expressions, we use the example-based facial rigging algorithm [27] to generate 52 blendshapes based on FACS [21] for each person. Then we follow previous methods [49, 12] to build the bilinear model from the generated blendshapes in the space of 26317 vertices × 52 expressions × 938 identities. Specifically, we use Tucker decomposition to decompose the large rank-3 tensor into a small core tensor $C_r$ and two low-dimensional components for identity and expression. A new face shape can be generated given the identity parameter $\mathbf{w}_{id}$ and expression parameter $\mathbf{w}_{exp}$ as:
$$V = C_r \times \mathbf{w}_{exp} \times \mathbf{w}_{id},$$
where $V$ is the vertex position of the generated mesh.
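To make the tensor contraction concrete, the following sketch builds a small bilinear model with a truncated higher-order SVD over the expression and identity modes and then generates a shape from a pair of parameter vectors. It assumes an HOSVD-style Tucker decomposition without mean subtraction; the function names, tensor layout `(coordinates, expressions, identities)` and toy sizes are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def build_bilinear(T, n_exp, n_id):
    """Truncated HOSVD over the expression (axis 1) and identity (axis 2) modes.
    Returns the core tensor and the per-expression / per-identity parameters."""
    U_exp = np.linalg.svd(unfold(T, 1), full_matrices=False)[0][:, :n_exp]
    U_id = np.linalg.svd(unfold(T, 2), full_matrices=False)[0][:, :n_id]
    core = np.einsum("vei,ea,ib->vab", T, U_exp, U_id)
    return core, U_exp, U_id

def generate(core, w_exp, w_id):
    """Contract the core tensor with expression and identity parameters."""
    return np.einsum("vab,a,b->v", core, w_exp, w_id)

# Toy data tensor: 12 coordinates x 4 expressions x 5 identities.
rng = np.random.default_rng(0)
T = rng.normal(size=(12, 4, 5))
core, U_exp, U_id = build_bilinear(T, n_exp=4, n_id=5)
# Without truncation, the rows of U_exp and U_id reproduce the training meshes.
recon = generate(core, U_exp[2], U_id[3])
# With truncation, the core shrinks to the requested parameter counts.
small_core, _, _ = build_bilinear(T, n_exp=2, n_id=3)
```

Leaving the vertex mode uncompressed keeps the output directly in vertex coordinates, so fitting only has to solve for the two low-dimensional parameter vectors.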
The superior quality and quantity of FaceScape give the generated bilinear model higher representation power. We evaluate the representation power of our model by fitting it to scanned 3D meshes that are not part of the training data. We compare our model to FaceWarehouse (FW) [12] and FLAME [29] by fitting them to our self-captured test set, which consists of 1000 high-quality meshes from 50 subjects performing 20 different expressions each. FW has 50 identity parameters and 47 expression parameters, so we use the same number of parameters for a fair comparison. To compare with FLAME, which has 300 identity parameters and 100 expression parameters, we use 300 identity parameters and all 52 expression parameters. Figure 3 shows the cumulative reconstruction error. Our bilinear face model achieves much lower fitting error than FW using the same number of parameters, and also outperforms FLAME while using fewer expression parameters. The visual comparison in Figure 3 shows that our model produces more mid-scale details than FW and FLAME, leading to more realistic fitting results.
4 Detailed Riggable Model Prediction
As reviewed in Section 2, existing methods have succeeded in recovering extremely detailed 3D facial models from a single image. However, these recovered models are not riggable in expression space, since the recovered detail is static and tied to the specific expression. Another group of works fits a parametric model to the source image, which yields an expression-riggable model, but the recovered geometry remains coarse.
The emergence of the FaceScape dataset makes it possible to estimate a detailed and riggable 3D face model from a single image, as we can learn the dynamic details from the large number of detailed facial models. We show our pipeline for predicting a detailed and riggable 3D face model from a single image in Figure 5. The pipeline consists of three stages: base model fitting, displacement map prediction and dynamic details synthesis. We explain each stage in detail in the following sections.
4.1 Base Model Fitting
The bilinear model for the base shape is inherently riggable, as the parametric space is separated into an identity dimension and an expression dimension, so the rough riggable model can be generated by regressing the identity parameters of the bilinear model. Following [43], we estimate the parameters corresponding to a given image by optimizing an objective function consisting of three parts. The first part is the landmark alignment term. Assuming the camera is weak perspective, the landmark alignment term is defined as the distance between each detected 2D landmark and its corresponding vertex projected onto the image space. The second part is the pixel-level consistency term, measuring how well the input image is explained by a synthesized image. The last part is the regularization term, which formulates the identity, expression, and albedo parameters as multivariate Gaussians. The final objective function is given by:
$$E = E_{lan} + \lambda_1 E_{pixel} + \lambda_2 E_{id} + \lambda_3 E_{exp} + \lambda_4 E_{alb},$$
where $E_{lan}$ is the landmark alignment term, $E_{pixel}$ is the pixel-level consistency term, and $E_{exp}$, $E_{id}$ and $E_{alb}$ are the regularization terms of expression, identity and albedo, respectively. $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the weights of the different terms.
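The structure of this objective can be sketched as below. This is an illustration only: the weak-perspective parameterization (scale, rotation, 2D translation), the diagonal Gaussian prior, the placement of the four weights, and all function names are assumptions, and the pixel-level term is passed in as a precomputed number because it requires a renderer.

```python
import numpy as np

def project_weak_perspective(verts, scale, R, t2d):
    """Rotate the vertices, drop depth, then scale and shift in the image plane."""
    return scale * (verts @ R.T)[:, :2] + t2d

def landmark_energy(verts, lm_idx, lm_2d, scale, R, t2d):
    """Sum of squared distances between detected 2D landmarks and the
    projections of their corresponding mesh vertices."""
    proj = project_weak_perspective(verts[lm_idx], scale, R, t2d)
    return float(np.sum((proj - lm_2d) ** 2))

def gaussian_prior(params, sigma):
    """Regularizer for parameters modelled as independent zero-mean Gaussians."""
    return float(np.sum((params / sigma) ** 2))

def total_energy(e_lan, e_pixel, e_id, e_exp, e_alb, weights):
    """Weighted sum of the landmark, pixel and regularization terms."""
    l1, l2, l3, l4 = weights
    return e_lan + l1 * e_pixel + l2 * e_id + l3 * e_exp + l4 * e_alb

verts = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
detected = np.array([[3.0, 5.0], [1.0, 2.0]])
e_lan = landmark_energy(verts, [0, 1], detected, 2.0, np.eye(3), np.array([1.0, 1.0]))
e_id = gaussian_prior(np.array([2.0, 2.0]), np.array([1.0, 2.0]))
energy = total_energy(e_lan, 2.0, e_id, 4.0, 5.0, (0.5, 0.1, 0.1, 0.2))
```

In practice such an energy would be minimized over pose, identity, expression and albedo parameters with a nonlinear least-squares solver; the solver is outside the scope of this sketch.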
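After obtaining the identity parameter $\mathbf{w}_{id}$, individual-specific blendshapes $B_i$ can be generated as:
$$B_i = C_r \times \hat{\mathbf{w}}_{exp}^{(i)} \times \mathbf{w}_{id}, \quad 0 \le i \le 51,$$
where $C_r$ is the core tensor of the bilinear model and $\hat{\mathbf{w}}_{exp}^{(i)}$ is the expression parameter corresponding to blendshape $B_i$ from the Tucker decomposition.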
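A compact way to picture this step is to contract the core tensor with the fitted identity parameter once for every blendshape expression parameter, and then drive the resulting blendshapes with weights. The sketch below assumes index 0 is the neutral blendshape and uses the common delta-blendshape rule for rigging, which the text does not spell out; all names and sizes are hypothetical.

```python
import numpy as np

def blendshapes(core, W_exp_hat, w_id):
    """core: (V, A, B) core tensor; W_exp_hat: (N, A), one expression
    parameter per blendshape; w_id: (B,). Returns (N, V) blendshapes."""
    return np.einsum("vab,ia,b->iv", core, W_exp_hat, w_id)

def rig(B, alpha):
    """Delta-blendshape rigging (assumed): neutral shape B[0] plus weighted
    offsets of the remaining blendshapes; alpha has shape (N - 1,)."""
    return B[0] + np.tensordot(alpha, B[1:] - B[0], axes=1)

core = np.arange(12, dtype=float).reshape(2, 3, 2)
B = blendshapes(core, np.eye(3), np.array([1.0, 0.0]))
neutral = rig(B, np.array([0.0, 0.0]))
full_first = rig(B, np.array([1.0, 0.0]))
mixed = rig(B, np.array([0.5, 0.5]))
```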
4.2 Displacement Map Prediction
Detailed geometry is expressed by displacement maps in our predicted model. In contrast to static detail, which is only related to a specific expression at a certain moment, dynamic detail expresses the geometric details across varying expressions. Since a single displacement map cannot represent the dynamic details, we predict multiple displacement maps for the 20 basic expressions in FaceScape using a deep neural network.
We observe that the displacement map for a given expression can be decoupled into a static part and a dynamic part. The static part tends to stay the same across expressions, and is mostly related to intrinsic features like pores, nevi, and organs. The dynamic part varies across expressions, and is related to surface shrinking and stretching. We use a deforming map to model the surface motion, which is defined as the difference of the vertices’ 3D positions from the source expression to the target expression in UV space. As shown in Figure 4, the variance between displacement maps is strongly related to the deforming map, and the static features in displacement maps are related to the texture. We therefore feed deforming maps and textures to a CNN to predict the displacement maps for multiple expressions.
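The network input described here can be sketched as a per-pixel difference of UV-space position maps stacked with the texture. The sketch assumes the base meshes have already been rasterized into `(H, W, 3)` position maps; the channel order and function name are hypothetical.

```python
import numpy as np

def network_input(pos_src, pos_tgt, texture):
    """pos_src, pos_tgt: (H, W, 3) UV-space maps of 3D vertex positions for the
    source and target expressions; texture: (H, W, 3) UV texture.
    Returns an (H, W, 6) stack of deforming map and texture."""
    deform = pos_tgt - pos_src  # motion from source to target expression
    return np.concatenate([deform, texture], axis=-1)

rng = np.random.default_rng(0)
pos_src = rng.normal(size=(4, 4, 3))
pos_tgt = pos_src + 0.1
texture = rng.uniform(size=(4, 4, 3))
x = network_input(pos_src, pos_tgt, texture)
```

One such input would be formed per target expression, so that the same texture is paired with a different deforming map each time.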
We use pix2pixHD [50] as the backbone of our neural network to synthesize high-resolution displacement maps. The input of the network is the stack of the deforming map and the texture in UV space, which can be computed from the recovered base model. Similar to [50], a combination of adversarial loss and feature matching loss is used to train our network, with the loss function formulated as:
$$\min_G \left( \left( \max_{D_1, D_2, D_3} \sum_{k=1,2,3} \mathcal{L}_{GAN}(G, D_k) \right) + \lambda \sum_{k=1,2,3} \mathcal{L}_{FM}(G, D_k) \right),$$
where $G$ is the generator, $D_1$, $D_2$ and $D_3$ are discriminators that have the same LSGAN [31] architecture but operate at different scales, and $\lambda$ is the weight of the feature matching loss.
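The two loss components can be sketched numerically as follows. This is a framework-free illustration, assuming the standard least-squares GAN targets (1 for real, 0 for fake) and an L1 feature matching term averaged per layer; discriminator outputs and intermediate features are passed in as plain arrays, and all function names are hypothetical.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake scores to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push scores of generated maps to 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def feature_matching_loss(feats_real, feats_fake):
    """Mean L1 distance between discriminator features, summed over layers."""
    return sum(np.mean(np.abs(fr - ff)) for fr, ff in zip(feats_real, feats_fake))

def generator_objective(d_fakes, feats_real, feats_fake, lam):
    """Adversarial plus weighted feature matching loss, summed over the
    discriminators that operate at different scales."""
    adv = sum(lsgan_g_loss(d) for d in d_fakes)
    fm = sum(feature_matching_loss(fr, ff) for fr, ff in zip(feats_real, feats_fake))
    return adv + lam * fm

# Two scales: the first discriminator is not fooled, the second is.
d_fakes = [np.zeros(2), np.ones(2)]
feats_real = [[np.array([1.0, 2.0])], [np.array([0.0])]]
feats_fake = [[np.array([0.0, 0.0])], [np.array([0.0])]]
g_total = generator_objective(d_fakes, feats_real, feats_fake, lam=10.0)
d_perfect = lsgan_d_loss(np.ones(3), np.zeros(3))
```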
4.3 Dynamic Detail Synthesis
Inspired by [32], we synthesize the displacement map $F$ for an arbitrary expression corresponding to a specific blendshape weight $\boldsymbol{\alpha}$, using a weighted linear combination of the generated displacement maps $\hat{F}_0$ in the neutral expression and $\hat{F}_i$ in the other 19 key expressions:
$$F = M_0 \odot \hat{F}_0 + \sum_{i=1}^{19} M_i \odot \hat{F}_i,$$
where $M_i$ is the weight mask with pixel values between $0$ and $1$, and $\odot$ is the element-wise multiplication operation. To calculate the weight masks, considering that the blendshape expressions change locally, we first compute an activation mask $A_j$ in UV space for each blendshape mesh $e_j$ as:
$$A_j(p) = \left\| e_j(p) - e_0(p) \right\|_2,$$
where $A_j(p)$ is the pixel value at position $p$ of the $j$th activation mask, and $e_j(p)$ and $e_0(p)$ are the corresponding vertex positions on the blendshape mesh $e_j$ and the neutral blendshape mesh $e_0$, respectively. The activation masks are further normalized between 0 and 1. Given the activation mask $A_j$ for each of the 51 blendshape meshes, the $i$th weight mask $M_i$ is formulated as a linear combination of the activation masks weighted by the current blendshape weight $\boldsymbol{\alpha}$ and the fixed blendshape weight $\hat{\boldsymbol{\alpha}}_i$ corresponding to the $i$th key expression:
$$M_i = \sum_{j=1}^{51} \alpha^j \hat{\alpha}_i^j A_j,$$
where $\alpha^j$ is the $j$th element of $\boldsymbol{\alpha}$. $M_0$ is given by $M_0 = \max\left(0, 1 - \sum_{i=1}^{19} M_i\right)$.
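The synthesis step can be sketched end to end on toy data. The sketch assumes per-mask normalization by the maximum value and takes the neutral mask as the clipped remainder of the other weight masks; both choices, along with the function names and array shapes, are assumptions of this illustration rather than a statement of the exact implementation.

```python
import numpy as np

def activation_masks(bs_pos, neutral_pos):
    """bs_pos: (J, H, W, 3) UV position maps of J blendshape meshes;
    neutral_pos: (H, W, 3). Returns (J, H, W) masks scaled to [0, 1]."""
    A = np.linalg.norm(bs_pos - neutral_pos, axis=-1)
    peak = A.reshape(len(A), -1).max(axis=1)
    return A / np.maximum(peak, 1e-12)[:, None, None]

def weight_masks(A, alpha, alpha_key):
    """alpha: (J,) current blendshape weights; alpha_key: (K, J) fixed weights
    of the K key expressions. Returns the neutral mask and (K, H, W) masks."""
    M = np.einsum("j,kj,jhw->khw", alpha, alpha_key, A)
    M0 = np.maximum(0.0, 1.0 - M.sum(axis=0))  # assumed clipped remainder
    return M0, M

def synthesize(F0, F_key, M0, M):
    """Blend the neutral and key-expression displacement maps per pixel."""
    return M0 * F0 + np.sum(M * F_key, axis=0)

# Toy setup: 2 blendshapes, 2 key expressions, a 1x2 UV map.
bs_pos = np.zeros((2, 1, 2, 3))
bs_pos[0, 0, 0] = [3.0, 4.0, 0.0]   # blendshape 0 moves only pixel 0
bs_pos[1, 0, 1] = [0.0, 0.0, 2.0]   # blendshape 1 moves only pixel 1
A = activation_masks(bs_pos, np.zeros((1, 2, 3)))
M0, M = weight_masks(A, np.array([0.5, 0.0]), np.eye(2))
F0 = np.full((1, 2), 10.0)
F_key = np.stack([np.full((1, 2), 20.0), np.full((1, 2), 30.0)])
F = synthesize(F0, F_key, M0, M)
```

In the toy run, only the pixel that the active blendshape moves receives detail from the matching key expression, while the untouched pixel keeps its neutral displacement, which is the locality the activation masks are meant to provide.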
5.1 Implementation Details
We use 888 people in our dataset as training data with a total of 17,760 displacement maps, leaving 50 people for testing. We use the Adam optimizer to train the network with learning rate as . The input textures and output displacement maps’ resolution of our network is both . We use 50 identity parameters, 52 expression parameters and 100 albedo parameters for our parametric model in all experiments.
5.2 Evaluation of 3D Model Prediction
The predicted riggable 3D faces are shown in Figure 6. To show that the recovered facial model is riggable, we rig the model to 5 specific expressions. The rigged models contain photo-realistic detailed wrinkles, which cannot be recovered by previous methods. The point-to-plane reconstruction error is computed between our model and the ground-truth shape. The mean error is reported in Table 2. More results and the generated animations are shown in the supplementary material.
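A point-to-plane error of this kind can be sketched as below. The sketch assumes brute-force nearest-neighbour correspondences from predicted points to ground-truth points and unit ground-truth normals; how correspondences are actually established in the evaluation is not stated, and the function name is hypothetical.

```python
import numpy as np

def point_to_plane_error(pred_pts, gt_pts, gt_normals):
    """Mean absolute distance from each predicted point to the tangent plane
    at its nearest ground-truth point. Shapes: (N, 3), (M, 3), (M, 3)."""
    d2 = ((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2).sum(axis=-1)
    nn = d2.argmin(axis=1)  # brute-force nearest neighbour (assumed)
    diff = pred_pts - gt_pts[nn]
    return float(np.abs(np.einsum("ij,ij->i", diff, gt_normals[nn])).mean())

gt_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
gt_normals = np.tile([0.0, 0.0, 1.0], (4, 1))
pred_pts = np.array([[0.1, 0.1, 0.1], [0.9, 0.9, -0.3]])
err = point_to_plane_error(pred_pts, gt_pts, gt_normals)
```

Measuring along the normal ignores tangential sliding between the two surfaces, so the metric reflects surface deviation rather than vertex correspondence. For meshes with millions of vertices, a spatial index such as a k-d tree would replace the brute-force search.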
5.3 Ablation Study
W/O dynamic detail. We use only one displacement map from the source image for all rigged expressions, keeping the other parts the same. As shown in Figure 9, the rigged model with dynamic detail shows the wrinkles caused by various expressions, which are absent from the W/O dynamic method.
5.4 Comparisons to Prior Works
We show the predicted results of our method and other works in Figure 7. The comparison of detail prediction is shown in Figure 8. As most of the detailed faces predicted by other works cannot be directly rigged to other expressions, we only show the face shape in the source expression. Our results are visually better than previous methods, and also quantitatively better in the heat map of error. We consider the major reason our method performs best in accuracy to be the strong representation power of our bilinear model, while the predicted details contribute to the visually plausible detailed geometry.
We present a large-scale detailed 3D facial dataset, FaceScape. Compared to previous public large-scale 3D face datasets, FaceScape provides the highest geometry quality and the largest number of models. We explore predicting a detailed riggable 3D face model from a single image, and achieve high fidelity in dynamic detail synthesis. We believe the release of FaceScape will spur future research in 3D facial modeling and parsing.
This work was supported by the grants – NSFC 61627804 / U1936202, USDA 2018-67021-27416, JSNSF BK20192003, and a grant from Baidu Research.
- [1] (2007) Optimal step nonrigid ICP algorithms for surface registration. In CVPR, pp. 1–8.
- [2] (2018) Modeling facial geometry using compositional VAEs. In CVPR, pp. 3877–3886.
- [3] (2009) BJUT-3D large scale 3D face database and information processing. Journal of Computer Research and Development 6, pp. 020.
- [4] (1999) A morphable model for the synthesis of 3D faces. In SIGGRAPH, Vol. 99, pp. 187–194.
- [5] (2017) 3D face morphable models “in-the-wild”. In CVPR, pp. 5464–5473.
- [6] (2018) Large scale 3D morphable models. IJCV 126 (2-4), pp. 233–254.
- [7] (2018) 3D reconstruction of “in-the-wild” faces in images and videos. PAMI 40 (11), pp. 2638–2652.
- [8] (2016) A 3D morphable model learnt from 10,000 faces. In CVPR, pp. 5543–5552.
- [9] (2013) Online modeling for realtime facial animation. ToG 32 (4), pp. 40.
- [10] (2014) Displaced dynamic expression regression for real-time facial tracking and animation. ToG 33 (4), pp. 43.
- [11] (2013) 3D shape regression for real-time facial animation. ToG 32 (4), pp. 41.
- [12] (2013) FaceWarehouse: a 3D facial expression database for visual computing. TVCG 20 (3), pp. 413–425.
- [13] (2011) Semi-automatic 2D-to-3D conversion using disparity propagation. ToB 57 (2), pp. 491–499.
- [14] Joint face detection and facial motion retargeting for multiple faces. In CVPR, pp. 9719–9728.
- [15] (2019) Photo-realistic facial details synthesis from single immage. arXiv preprint arXiv:1903.10873.
- [16] (2019) MeshGAN: non-linear 3D morphable models of faces. arXiv preprint arXiv:1903.10384.
- [17] (2018) 4DFAB: a large scale 4D database for facial expression analysis and biometric applications. In CVPR, pp. 5117–5126.
- [18] (2011) A FACS valid 3D dynamic action unit database with applications to 3D dynamic morphable facial modeling. In ICCV, pp. 2296–2303.
- [19] (2017) A 3D morphable model of craniofacial shape and texture variation. In ICCV, pp. 3085–3093.
- [20] (2017) End-to-end 3D face reconstruction with deep neural networks. In CVPR, pp. 5908–5917.
- [21] (1978) Facial action coding system: a technique for the measurement of facial movement.
- [22] (2016) DeepStereo: learning to predict new views from the world’s imagery. In CVPR, pp. 5515–5524.
- [23] (2018) CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. PAMI 41 (6), pp. 1294–1307.
- [24] (2014) Toward naturalistic 2D-to-3D conversion. TIP 24 (2), pp. 724–733.
- [25] (2018) Mesoscopic facial geometry inference using deep neural networks. In CVPR, pp. 8407–8416.
- [26] (2019) Disentangled representation learning for 3D face shape. In CVPR, pp. 11957–11966.
- [27] (2010) Example-based facial rigging. In ToG, Vol. 29, pp. 32.
- [28] (2013) Realtime facial animation with on-the-fly correctives. ToG 32 (4), pp. 42–1.
- [29] (2017) Learning a model of facial shape and expression from 4D scans. ToG 36 (6), pp. 194.
- [30] (2017) Gaussian process morphable models. PAMI 40 (8), pp. 1860–1873.
- [31] (2017) Least squares generative adversarial networks. In ICCV, pp. 2794–2802.
- [32] (2018) PaGAN: real-time avatars using dynamic textures. ToG 37 (6), pp. 258–1.
- [33] (2009) A 3D face model for pose and illumination invariant face recognition. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301.
- [34] (2005) Overview of the face recognition grand challenge. In CVPR, pp. 947–954.
- [35] (2017) Learning detailed face reconstruction from a single image. In CVPR, pp. 5553–5562.
- [36] (2005) Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In CVPR, Vol. 2, pp. 986–993.
- [37] (2015) Multimodal biometric database DMCSv1 of 3D face and hand scans. In MIXDES, pp. 93–97.
- [38] (2008) Bosphorus database for 3D face analysis. In European Workshop on Biometrics and Identity Management, pp. 47–56.
- [39] (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, Vol. 1, pp. 519–528.
- [40] Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV, pp. 1576–1585.
- [41] (2018) SfSNet: learning shape, reflectance and illuminance of faces in the wild. In CVPR, pp. 6296–6305.
- [42] (2018) Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In CVPR, pp. 2549–2559.
- [43] (2016) Face2Face: real-time face capture and reenactment of RGB videos. In CVPR, pp. 2387–2395.
- [44] (2018) Extreme 3D face reconstruction: seeing through occlusions. In CVPR, pp. 3935–3944.
- [45] (2019) Towards high-fidelity nonlinear 3D face morphable model. In CVPR, pp. 1126–1135.
- [46] (2018) Nonlinear 3D face morphable model. In CVPR, pp. 7346–7355.
- [47] (2019) On learning 3D face morphable model from in-the-wild images. PAMI.
- [48] (2012) Lightweight binocular facial performance capture under uncontrolled lighting. ToG 31 (6), pp. 187–1.
- [49] (2005) Face transfer with multilinear models. In ToG, Vol. 24, pp. 426–433.
- [50] (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, pp. 8798–8807.
- [51] (2011) Realtime performance-based facial animation. In ToG, Vol. 30, pp. 77.
- [52] (2008) A high-resolution 3D dynamic facial expression database, 2008. In FG, Vol. 126.
- [53] (2006) A 3D facial expression database for facial behavior research. In FG, pp. 211–216.
- [54] (2014) BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing 32 (10), pp. 692–706.
- [55] (2016) Video-based outdoor human reconstruction. TCSVT 27 (4), pp. 760–770.
- [56] (2017) The role of prior in image based 3D modeling: a survey. Frontiers of Computer Science 11 (2), pp. 175–191.
- [57] (2018) View extrapolation of human body from a single image. In CVPR, pp. 4450–4459.
- [58] (2019) Detailed human shape estimation from a single image by hierarchical mesh deformation. In CVPR, pp. 4491–4500.
- [59] (2017) Face alignment in full pose range: a 3D total solution. PAMI.
- [60] (2016) Face alignment across large poses: a 3D solution. In CVPR, pp. 146–155.