This repository contains a dataset of paired 2D face images and their corresponding 3D face geometry models.
Leveraging the power of convolutional neural networks (CNNs), CNN-based face reconstruction has recently shown promising performance in reconstructing detailed face shape from 2D face images. The success of CNN-based methods relies on a large amount of labeled data. State-of-the-art methods synthesize such data using a coarse morphable face model, which has difficulty generating detailed photo-realistic images of faces (with wrinkles). This paper presents a novel face data generation method. Specifically, we render a large number of photo-realistic face images with different attributes based on inverse rendering. Furthermore, we construct a fine-detailed face image dataset by transferring different scales of details from one image to another. We also construct a large number of video-type adjacent frame pairs by simulating the distribution of real video data. With these constructed datasets, we propose a coarse-to-fine learning framework consisting of three convolutional networks. The networks are trained for real-time detailed 3D face reconstruction from monocular video as well as from a single image. Extensive experimental results demonstrate that our framework can produce high-quality reconstruction with much less computation time than the state-of-the-art. Moreover, our method is robust to pose, expression and lighting due to the diversity of data.
This paper considers the problem of dense 3D face reconstruction from monocular video as well as from a single face image. Single-image based 3D face reconstruction can be considered a special case of video-based reconstruction, and it plays an essential role in its own right: image-based 3D face reconstruction is a fundamental problem in computer vision and graphics, with many applications such as face recognition [5, 54] and face animation [23, 53]. Video-based dense face reconstruction and tracking, or facial performance capture, has a long history and also has many applications, such as facial expression transfer [52, 53] and face replacement [30, 12, 16]. Traditional facial performance capture methods usually require complex hardware and significant user intervention [57, 21] to achieve sufficient realism and are therefore not suitable for consumer-level applications. Commodity RGB-D camera based methods [56, 33, 6, 52] have demonstrated real-time reconstruction and animation results. However, RGB-D devices, such as Microsoft’s Kinect, are still less common and of lower resolution than RGB devices.
Recently, several approaches have been proposed for RGB video based facial performance capture [8, 7, 53, 18, 45, 22]. Image-based 3D face reconstruction is already an ill-posed and challenging task because of the ambiguities caused by the insufficient information conveyed in 2D images. Video-based 3D reconstruction and tracking is even more challenging, especially when the reconstruction is required to be real-time, fine-detailed and robust to pose, facial expression, lighting, etc. The proposed approaches only partially meet these requirements. For example, some learn facial geometry without recovering facial appearance properties such as albedo. Another can reconstruct a high-quality personalized face rig, but its optimization-based method is time-consuming and needs about 3 minutes per frame. Yet another achieves real-time face reconstruction and facial reenactment through a data-parallel optimization strategy, but it cannot recover fine-scale details such as wrinkles and also requires facial landmark inputs.
In this paper, we present a solution to tackle all these problems by utilizing the power of convolutional neural networks (CNNs). CNN-based approaches have been proposed for face reconstruction from a single image [41, 42, 54, 51, 24], but CNNs have rarely been explored for video-based dense face reconstruction and tracking, especially for real-time reconstruction. Inspired by a state-of-the-art single-image based face reconstruction method that employs two cascaded CNNs (a coarse-layer CNN and a fine-layer CNN) to reconstruct a detailed 3D facial surface from a single image, we develop a dense face reconstruction and tracking framework. The framework includes a new network architecture called 3DFaceNet for online real-time dense face reconstruction from monocular video (supporting single-image input as well), and optimization-based inverse rendering for offline generation of large-scale training datasets.
In particular, our proposed 3DFaceNet consists of three convolutional networks: a coarse-scale single-image network (named Single-image CoarseNet for the first frame or the single image case), a coarse-scale tracking network (Tracking CoarseNet) and a fine-scale network (FineNet). For single-image based reconstruction, compared with , the key uniqueness of our framework lies in the photo-realistic datasets we generate for training CoarseNet and FineNet.
It is known that one major challenge for CNN-based methods lies in the difficulty of obtaining a large amount of labelled training data. In our case, there is no publicly available dataset that provides large-scale face images with their corresponding high-quality 3D face models. For training CoarseNet, prior work resolves the training data problem by directly synthesizing face images with randomized parametric face model parameters. Nevertheless, because of the low dimensionality of the parametric face model, the albedo, and the random backgrounds that are synthesized, the rendered images in [41, 42] are not photo-realistic. In contrast, we propose to create realistic face images by starting from real photographs and manipulating them after an inverse rendering procedure. For training FineNet, because no dataset with detailed face geometry exists, prior work uses unsupervised training that adopts the shading energy as the loss function. However, to make back-propagation tractable, it employs first-order spherical harmonics to model the lighting, which makes the final detailed reconstruction less accurate. In contrast, we propose a novel approach to transfer different scales of details from one image to another. With the constructed fine-detailed face image dataset, we can train FineNet in a fully supervised manner rather than in an unsupervised way, and thus can produce more accurate reconstruction results. Moreover, for training our coarse-scale tracking network for the video input case, we consider the coherence between adjacent frames and simulate adjacent frames according to the statistics learned from real facial videos for training data generation.
Contributions. In summary, the main contributions of this paper lie in the following five aspects:
the optimization-based face inverse rendering that recovers accurate geometry, albedo, lighting from a single image, with which we can generate a large number of photo-realistic face images with different attributes to train our networks.
a large photo-realistic face image dataset with the labels of the parametric face model parameters and the pose parameters, which are generated based on our proposed inverse rendering. This dataset facilitates the training of our Single-image CoarseNet and makes our method robust to expressions and poses.
a large photo-realistic fine-scale face image dataset with detailed geometry labels, which are generated by our proposed face detail transfer approach. This fine-scale dataset facilitates the training of our FineNet.
a large dataset for training Tracking CoarseNet, where we extend the Single-image CoarseNet training data by simulating their previous frames according to the statistics learned from real facial videos.
the proposed 3DFaceNet, which is trained with our large-scale, diverse synthetic data and is thus able to reconstruct the fine-scale geometry, albedo and lighting well in real time from monocular RGB video as well as from a single image. Our system is robust to large poses, extreme expressions and fast moving faces.
To the best of our knowledge, the proposed framework is the first work that achieves real-time dense 3D face reconstruction and tracking from monocular video. It might open up a new avenue of research in the field of 3D assisted face video analysis. Moreover, the optimization-based face inverse rendering approach provides a novel, efficient way to generate various large-scale synthetic datasets with appropriate adaptation. Our carefully generated datasets will also benefit face analysis research that usually requires large amounts of training data.
3D face reconstruction and facial performance capturing have been studied extensively in computer vision and computer graphics communities. For conciseness, we only review the most relevant works here.
Low-dimensional Face Models. Model-based approaches for face shape reconstruction have grown in popularity over the last decade. Blanz and Vetter proposed to represent a textured 3D face with principal component analysis (PCA), which provides an effective low-dimensional representation in terms of latent variables and corresponding basis vectors. The model has been widely used in various computer vision tasks, such as face recognition [5, 54], face alignment [60, 34, 27], and face reenactment. Although such a model is able to capture the global structure of a 3D face from a single image or multiple images, facial details like wrinkles and folds cannot be captured. In addition, the reconstructed face models rely heavily on the training samples. For example, a face shape is difficult to reconstruct if it is far from the span of the training samples. Thus, similar to prior work, we only use the low-dimensional model in our coarse layer to reconstruct a rough geometry and we refine the geometry in our fine layer.
Shape-from-shading (SFS). SFS  makes use of the rendering principle to recover the underlying shape from shading observations. The performance of SFS largely depends on constraints or priors. For 3D face reconstruction, in order to achieve plausible results, the prior knowledge about the geometry must be applied. For instance, in order to reduce the ambiguity and the complexity of SFS, the symmetry of the human face has often been employed [49, 58, 59]. Kemelmacher et al.  used a reference model prior to align with the face image and then applied SFS to refine the reference model to better match the image. Despite the improved performance of this technique, its capability to capture global face structure is limited.
The generation of a face image depends on several factors: face geometry, albedo, lighting, pose and camera parameters. Face inverse rendering refers to the process of estimating all these factors from a real face image, which can then be manipulated to render new images. Inverse rendering is similar to SFS, with the difference that inverse rendering aims to estimate all the rendering parameters while SFS mainly cares about reconstructing the geometry. Aldrian et al. performed face inverse rendering with a parametric face model using a multilinear approach, where the face geometry and the albedo are encoded in the parametric face model. In their method, the geometry is first estimated based on the detected landmarks, and then the albedo and the lighting are iteratively estimated by solving the rendering equation. However, since the landmark constraint is sparse, the reconstructed geometry may not fit the face image well. Other methods fit a 3D face in a multi-layer approach and extract a high-fidelity parameterized 3D rig that contains a generative wrinkle formation model capturing person-specific idiosyncrasies, fit a 3D Morphable Model to a single image fully automatically using landmarks and edge features, fit a parametric face model with Bayesian inference, or estimate an occlusion map and fit a statistical model to a face image with an EM-like probabilistic estimation process. A similar approach has been adopted to recover the 3D face model with geometry details. While these methods provide impressive results, they are usually time-consuming due to complex optimization.
Face Capture from RGB Videos. Recently, a variety of methods have been proposed for 3D face reconstruction from monocular RGB video. Most of them use a 3D Morphable Model [53, 18, 22] or a multi-linear face model [9, 8, 48, 7, 45] as a prior. One method reconstructs the dense 3D face from a monocular video sequence by a variational approach, formulated as estimating dense low-rank smooth 3D shapes for each frame of the video sequence. Another adapts a generic template to a static 3D scan of an actor’s face, then fits the blendshape model to monocular video offline, and finally extracts surface detail by shading-based shape refinement under general lighting. A third uses a similar tracking approach and achieves impressive results based on global energy optimization of a set of selected keyframes. A fourth fits a 3D face in a multi-layer approach and extracts a high-fidelity parameterized 3D rig that contains a generative wrinkle formation model capturing person-specific idiosyncrasies. Although all these methods provide impressive results, they are time-consuming and not suitable for real-time face video reconstruction and editing. The methods in [9, 8] adopt a learning-based regression model to fit a generic identity and expression model to an RGB face video in real time, and a later extension also regresses fine-scale face wrinkles. Another method performs unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input, and yet another tracks the face by fitting a 3D Morphable Model to the detected landmarks. Although they are able to reconstruct and track 3D faces in real time, they do not estimate facial appearance. A more recent approach achieves real-time face tracking and facial reenactment, but it is not able to recover fine-scale details and requires external landmark inputs.
In contrast, our method is the first work that can do real-time reconstruction of face geometry at fine details as well as real-time recovery of albedo, lighting and pose parameters.
Learning-based Single-image 3D Face Reconstruction. With the power of convolutional neural networks, deep learning based methods have been proposed for 3D face reconstruction from a single image. The methods in [27, 60, 31] use a 3D Morphable Model (3DMM) to represent 3D faces and use a CNN to learn the 3DMM and pose parameters. A follow-up uses synthetic face images, generated by rendering textured 3D faces encoded on the 3DMM with random lighting and pose, as training data. However, the reconstruction results of these methods do not contain geometry details. Besides learning the 3DMM and pose parameters, a further extension also learns detailed geometry in an unsupervised manner. Other work regresses a robust and discriminative 3DMM with a very deep neural network and uses it for face recognition, uses an analysis-by-synthesis energy function [4, 53] as the loss function during network training, or directly regresses volumes with a CNN from a single face image. Although these methods utilize the power of CNNs, they all concentrate on images and do not account for videos. In comparison, we focus on monocular face video input and reconstruct face video in real time by using CNNs.
This section describes some background information, particularly on the face representations and the face rendering process considered in our work.
The rendering process of a face image depends on several factors: face geometry, albedo, lighting, pose and camera parameters. We encode 3D face geometry into two layers: a coarse-scale shape and fine-scale details. While the coarse-scale shape and albedo are represented by a parametric textured 3D face model, the fine-scale details are represented by a per-pixel depth displacement map. The face shape is represented via a mesh of $n$ vertices with fixed connectivity as a vector $\mathbf{p} = [\mathbf{p}_1^T, \mathbf{p}_2^T, \ldots, \mathbf{p}_n^T]^T \in \mathbb{R}^{3n}$, where $\mathbf{p}_i$ denotes the position of vertex $v_i$ ($i = 1, 2, \ldots, n$).
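Parametric face model. We use a 3D Morphable Model (3DMM) as the parametric face model to encode 3D face geometry and albedo in a lower-dimensional subspace, and extend the shape model to also cover facial expressions by adding delta blendshapes. Specifically, the parametric face model describes 3D face geometry and albedo with PCA (principal component analysis):
$$\mathbf{p} = \bar{\mathbf{p}} + \mathbf{A}_{id}\boldsymbol{\alpha}_{id} + \mathbf{A}_{exp}\boldsymbol{\alpha}_{exp}, \tag{1}$$
$$\mathbf{b} = \bar{\mathbf{b}} + \mathbf{A}_{alb}\boldsymbol{\alpha}_{alb}, \tag{2}$$
where $\bar{\mathbf{p}}$ and $\bar{\mathbf{b}}$ denote respectively the shape and the albedo of the average 3D face, $\mathbf{A}_{id}$ and $\mathbf{A}_{alb}$ are the principal axes extracted from a set of textured 3D meshes with a neutral expression, $\mathbf{A}_{exp}$ represents the principal axes trained on the offsets between the expression meshes and the neutral meshes of individual persons, and $\boldsymbol{\alpha}_{id}$, $\boldsymbol{\alpha}_{exp}$ and $\boldsymbol{\alpha}_{alb}$ are the corresponding coefficient vectors that characterize a specific 3D face model. For diversity and mutual complement, we use the Basel Face Model (BFM) for $\mathbf{A}_{id}$ and $\mathbf{A}_{alb}$ and FaceWarehouse for $\mathbf{A}_{exp}$.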
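As an illustration of the linear model described above, the following minimal sketch evaluates a shape as the mean plus identity and expression offsets. The function names and the toy one-vertex data are hypothetical and chosen only for this example; real models such as BFM or FaceWarehouse have far larger bases.

```python
def mat_vec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def linear_face_model(mean, basis_id, alpha_id, basis_exp, alpha_exp):
    """Return mean + basis_id * alpha_id + basis_exp * alpha_exp."""
    identity_offset = mat_vec(basis_id, alpha_id)
    expression_offset = mat_vec(basis_exp, alpha_exp)
    return [m + i + e for m, i, e in zip(mean, identity_offset, expression_offset)]


# Toy data (hypothetical): one vertex with 3 coordinates,
# 2 identity components and 1 expression component.
mean_shape = [0.0, 0.0, 1.0]
A_id = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
A_exp = [[0.0], [0.5], [0.0]]
shape = linear_face_model(mean_shape, A_id, [0.2, -0.1], A_exp, [1.0])
```

The albedo model has the same structure, with its own mean and basis.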
As 3DMM is a low-dimensional model, some face details such as wrinkles and dimples cannot be expressed by 3DMM. Thus, we encode the geometry details in a displacement along the depth direction for each pixel.
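Rendering process. For camera parametrization, we use the weak perspective model to project the 3D face onto the image plane:
$$\mathbf{q}_i = s \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \mathbf{R}\,\mathbf{p}_i + \mathbf{t}, \tag{3}$$
where $\mathbf{p}_i$ and $\mathbf{q}_i$ are the locations of vertex $v_i$ in the world coordinate system and in the image plane, respectively, $s$ is the scale factor, $\mathbf{R}$ is the rotation matrix constructed from Euler angles and $\mathbf{t}$ is the translation vector.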
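A minimal sketch of the weak perspective projection follows. The source only states that the rotation is built from Euler angles, so the composition order used here, Rz(roll) Ry(yaw) Rx(pitch), is an assumption, and the function names are hypothetical.

```python
import math


def euler_to_rotation(pitch, yaw, roll):
    """3x3 rotation R = Rz(roll) * Ry(yaw) * Rx(pitch).

    The composition order is an assumption made for this sketch.
    """
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def weak_perspective_project(vertex, scale, rotation, translation):
    """Rotate a 3D vertex, drop the depth axis, then scale and translate in 2D."""
    x, y, z = vertex
    rotated = [row[0] * x + row[1] * y + row[2] * z for row in rotation[:2]]
    return [scale * rotated[0] + translation[0],
            scale * rotated[1] + translation[1]]
```

Because depth is dropped after rotation, moving the face along the viewing axis has no effect on the projection. Apparent size is controlled only by the scale factor.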
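To model the scene lighting, we assume the face to be a Lambertian surface. The global illumination is approximated using the spherical harmonics (SH) basis functions. Then, the irradiance of a vertex with surface normal $\mathbf{n}$ and scalar albedo $\rho$ is expressed as:
$$L(\mathbf{n}, \rho \mid \boldsymbol{\gamma}) = \rho \cdot \sum_{j=1}^{B^2} \gamma_j \phi_j(\mathbf{n}), \tag{4}$$
where $\phi_j(\mathbf{n})$ are the SH basis functions computed with normal $\mathbf{n}$, and $\gamma_j$ are the SH coefficients. We use the first $B = 3$ bands of SHs for the illumination model, i.e., nine coefficients per color channel. Thus, the rendering process depends on the parameter set $\chi = \{\boldsymbol{\alpha}_{id}, \boldsymbol{\alpha}_{exp}, \boldsymbol{\alpha}_{alb}, s, \mathit{pitch}, \mathit{yaw}, \mathit{roll}, \mathbf{t}, \mathbf{r}\}$, where $\mathbf{r}$ denotes the RGB channels’ SH illumination coefficients.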
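The Lambertian SH shading can be sketched as follows for one color channel. The nine real SH basis functions below use the standard normalization constants; whether the original work uses this exact scaling and ordering is an assumption, and the function names are hypothetical.

```python
def sh_basis(normal):
    """First three SH bands (9 functions) evaluated at a unit normal.

    Standard real-SH constants; ordering and scaling are assumptions here.
    """
    x, y, z = normal
    return [
        0.282095,
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ]


def irradiance(normal, albedo, gamma):
    """Albedo times the SH-weighted shading for one color channel."""
    shading = sum(g * phi for g, phi in zip(gamma, sh_basis(normal)))
    return albedo * shading
```

For RGB rendering, the same function would be called once per channel with that channel's nine coefficients, giving 27 lighting coefficients in total.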
Given the parametric face model and the parameter set, a face image can be rendered as follows. First, a textured 3D mesh is constructed using Eq. (1) and Eq. (2). Then we do a rasterization via Eq. (3). Particularly, in the rasterization, for every pixel in the face region of the 2D image, we obtain the underlying triangle index on the 3D mesh and its barycentric coordinates. In this way, for every pixel in the face region, we obtain its normal by using the underlying triangle’s normal, and its albedo value by barycentrically interpolating the albedos of the vertices of the underlying triangle. Finally, with the normal, the albedo and the lighting, the color of a pixel can be rendered using Eq. (4).
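The per-pixel albedo lookup in the rasterization step amounts to a barycentric interpolation over the three vertices of the underlying triangle. A minimal sketch, with a hypothetical function name and toy values:

```python
def barycentric_interpolate(vertex_values, bary):
    """Interpolate per-vertex attributes (e.g. RGB albedo) inside a triangle.

    vertex_values: three equally long attribute tuples, one per vertex.
    bary: the pixel's three barycentric coordinates (they sum to 1).
    """
    channels = len(vertex_values[0])
    return [sum(b * v[c] for b, v in zip(bary, vertex_values))
            for c in range(channels)]


# Toy triangle whose vertices are pure red, green and blue.
pixel_albedo = barycentric_interpolate(
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    (0.2, 0.3, 0.5),
)
```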
To achieve real-time face video reconstruction and tracking, we need real-time face inverse rendering. However, reconstructing detailed 3D face using traditional optimization-based methods  is far from real-time. To address this problem, we develop a novel CNN based framework to achieve real-time detailed face inverse rendering. Specifically, we use two CNNs for each frame, namely CoarseNet and FineNet. The first one estimates coarse-scale geometry, albedo, lighting and pose parameters altogether, and the second one reconstructs the fine-scale geometry encoded on pixel level.
Fig. 1 shows the entire system pipeline. It can be seen that there are two types of CoarseNet: Single-image CoarseNet and Tracking CoarseNet. Tracking CoarseNet makes use of the predicted parameters of the previous frame, while Single-image CoarseNet is for the first frame case where there is no previous frame available. Such Single-image CoarseNet could be applied to other key frames as well to avoid any potential drifting problem if needed. The combination of all the networks including Single-image CoarseNet, Tracking CoarseNet and FineNet, makes up a complete framework for real-time dense 3D face reconstruction from monocular video. Note that the entire framework can be easily degenerated to the solution for dense 3D face reconstruction from a single image by combining only Single-image CoarseNet with FineNet.
We would like to point out that although we advocate the CNN based solution, it still needs to work together with optimization based inverse rendering methods. This is because CNN requires large amount of data with labels, which is usually not available, and optimization based inverse rendering methods are a natural solution for generating labels (optimal parameters) and synthesizing new images offline. Thus, our proposed dense face reconstruction and tracking framework includes both optimization based inverse rendering and the two-stage CNN based solution, where the former is for offline training data generation and the latter is for real-time online operations. In the subsequent sections, we first introduce our optimization based inverse face rendering, which will be used to construct training data for CoarseNet and FineNet; and then we present our three convolutional networks.
Inverse rendering is an inverse process of image generation. That is, given a face image, we want to estimate a 3D face with albedo, lighting condition, pose and projection parameters simultaneously. Since directly estimating these unknowns with only one input image is an ill-posed problem, we use the parametric face model as a prior. Fig. 2 illustrates our developed inverse rendering, which consists of three stages: parametric face model fitting, geometry refinement and albedo blending. The first stage is to recover the lighting, a coarse geometry and the albedo based on the parametric face model. The second stage is to further recover the geometry details. The third stage is to blend the albedo so as to make the rendered image closer to the input image. Via the developed inverse rendering, we are able to extract different rendering components of real face images, and then by varying these different components we can create large-scale photo-realistic face images to facilitate the subsequent CNN based training.
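The purpose of model fitting is to estimate the coarse face geometry, albedo, lighting, pose and projection parameters from a face image $\mathbf{I}_{in}$, that is, to estimate the full rendering parameter set $\chi$. For convenience, we group these parameters into the geometry parameters $\chi_g = \{\boldsymbol{\alpha}_{id}, \boldsymbol{\alpha}_{exp}\}$ (identity and expression coefficients), the pose parameters $\chi_p = \{\mathit{pitch}, \mathit{yaw}, \mathit{roll}, s, \mathbf{t}\}$ and the appearance parameters $\chi_l = \{\boldsymbol{\alpha}_{alb}, \mathbf{r}\}$ (albedo coefficients and SH lighting coefficients). The fitting process is based on the analysis-by-synthesis strategy [4, 53], and we seek a solution that minimizes the difference between the input face image and the image rendered with $\chi$. Specifically, we minimize the following objective function:
$$E(\chi) = E_{con} + w_l E_{lan} + w_r E_{reg}, \tag{5}$$
where $E_{con}$ is a photo-consistency term, $E_{lan}$ is a landmark term and $E_{reg}$ is a regularization term, and $w_l$ and $w_r$ are tradeoff parameters. The photo-consistency term, aiming to minimize the difference between the input face image and the rendered image, is defined as
$$E_{con}(\chi) = \frac{1}{|\mathcal{F}|}\,\|\mathbf{I}_{ren} - \mathbf{I}_{in}\|^2, \tag{6}$$
where $\mathbf{I}_{ren}$ is the rendered image, $\mathbf{I}_{in}$ is the input image, and $\mathcal{F}$ is the set of all pixels in the face region. The landmark term aims to make the projected vertices close to the corresponding landmarks in the image plane:
$$E_{lan}(\chi) = \frac{1}{|\mathcal{L}|}\sum_{i \in \mathcal{L}} \|\mathbf{f}_i - (s\,\Pi\,\mathbf{R}\,\mathbf{p}_i + \mathbf{t})\|_2^2, \tag{7}$$
where $\mathcal{L}$ is the set of landmarks, $\mathbf{f}_i$ is a landmark position in the image plane, $\mathbf{p}_i$ is the corresponding vertex location in the fitted 3D face and $\Pi = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$. The regularization term aims to ensure that the fitted parametric face model parameters are plausible:
$$E_{reg}(\chi) = \sum_{i=1}^{100}\left[\left(\frac{\alpha_{id,i}}{\sigma_{id,i}}\right)^2 + \left(\frac{\alpha_{alb,i}}{\sigma_{alb,i}}\right)^2\right] + \sum_{i=1}^{79}\left(\frac{\alpha_{exp,i}}{\sigma_{exp,i}}\right)^2, \tag{8}$$
where $\sigma$ is the standard deviation of the corresponding principal direction. Here we use 100 principal components for identity and albedo, and 79 for expression. In our experiments, we set $w_l$ to be 10 and $w_r$ to be . Eq. (5) is minimized via Gauss-Newton iteration.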
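To make the structure of the fitting objective concrete, the sketch below evaluates the three terms on toy data. The normalizations (averaging over pixels and over landmarks) and all function names are assumptions for illustration. The regularization weight is passed in as a parameter, and the value used in the example call is arbitrary. The Gauss-Newton minimization itself is not shown.

```python
def photo_consistency(rendered, observed):
    """Mean squared color difference over face pixels (lists of RGB triples)."""
    total = sum((r - o) ** 2
                for pr, po in zip(rendered, observed)
                for r, o in zip(pr, po))
    return total / len(observed)


def landmark_term(projected, landmarks):
    """Mean squared 2D distance between projected vertices and landmarks."""
    total = sum((px - lx) ** 2 + (py - ly) ** 2
                for (px, py), (lx, ly) in zip(projected, landmarks))
    return total / len(landmarks)


def regularization(coeffs, sigmas):
    """Sum of squared coefficients, each scaled by its standard deviation."""
    return sum((a / s) ** 2 for a, s in zip(coeffs, sigmas))


def fitting_energy(rendered, observed, projected, landmarks,
                   coeffs, sigmas, w_l, w_r):
    """Photo-consistency + w_l * landmark term + w_r * regularization."""
    return (photo_consistency(rendered, observed)
            + w_l * landmark_term(projected, landmarks)
            + w_r * regularization(coeffs, sigmas))


# Toy call; w_r = 0.5 is an arbitrary illustrative value.
energy = fitting_energy([(0.5, 0.5, 0.5)], [(0.5, 0.5, 0.0)],
                        [(1.0, 1.0)], [(1.0, 2.0)],
                        [2.0], [1.0], w_l=10.0, w_r=0.5)
```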
As the parametric face model is a low-dimensional model, some face details such as wrinkles and dimples are not encoded in it. Thus, the purpose of the second stage is to refine the geometry by adding geometry details as a displacement along the depth direction for every pixel. In particular, by projecting the 3D face fitted in the first stage, we obtain a depth value for every pixel in the face region. Let $\mathbf{z}$ be all stacked depth values of pixels, $\mathbf{d}$ be all stacked displacements and $\tilde{\mathbf{z}} = \mathbf{z} + \mathbf{d}$ be all new depth values. Given the new depth values $\tilde{\mathbf{z}}$, the normal at each pixel can be computed as the normal of a triangle formed from the camera-space coordinates of that pixel and its adjacent pixels. Inspired by prior work, we estimate $\mathbf{d}$ using the following objective function:
$$E(\mathbf{d}) = E_{con} + \mu_1 \|\mathbf{d}\|_2^2 + \mu_2 \|\Delta \mathbf{d}\|_1, \tag{9}$$
where $E_{con}$ is the same as that in Eq. (5), $\|\mathbf{d}\|_2^2$ encourages small displacements, the Laplacian of displacements $\Delta \mathbf{d}$ makes the displacement smooth, and $\mu_1$ and $\mu_2$ are tradeoff parameters. We use the $\ell_1$ norm for the smoothness term as it allows preserving sharp discontinuities while removing noise. We set $\mu_1$ to be  and $\mu_2$ to be 0.3 in our experiments. Eq. (9) is minimized by using an iteratively reweighted approach.
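The two regularizers on the displacement map can be sketched on a small grid as follows. The 4-neighbour Laplacian stencil, the restriction to interior pixels, and the use of an absolute-value (L1) penalty on the Laplacian are assumptions of this sketch. The photo-consistency term and the reweighted solver are omitted, and the value of the small-displacement weight used in the example is arbitrary.

```python
def laplacian(d, i, j):
    """4-neighbour discrete Laplacian at an interior pixel of a 2D grid."""
    return (d[i - 1][j] + d[i + 1][j] + d[i][j - 1] + d[i][j + 1]
            - 4.0 * d[i][j])


def displacement_regularizer(d, mu1, mu2):
    """mu1 * sum(d^2) + mu2 * sum(|Laplacian(d)|) over interior pixels."""
    rows, cols = len(d), len(d[0])
    small = sum(v * v for row in d for v in row)
    smooth = sum(abs(laplacian(d, i, j))
                 for i in range(1, rows - 1)
                 for j in range(1, cols - 1))
    return mu1 * small + mu2 * smooth


# A single bump in the middle of a flat 3x3 displacement map.
bump = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0]]
penalty = displacement_regularizer(bump, mu1=0.5, mu2=0.3)
```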
Similar to the geometry, the albedo encoded in the parametric face model in stage 1 (denoted as $\rho_c$) is also smooth because of the low dimensionality. For photo-realistic rendering, we extract a fine-scale albedo as
$$\rho_f = \mathbf{I} \oslash \Big(\sum_{j} \gamma_j \phi_j(\mathbf{n})\Big), \tag{10}$$
where $\oslash$ represents the elementwise division operation, $\mathbf{I}$ is the color of the input image, $\mathbf{n}$ is the normal computed from the refined geometry, and the denominator is the SH shading of Eq. (4). However, the fine-scale albedo $\rho_f$ might contain some geometry details due to imperfect geometry refinement. To avoid this, we linearly blend $\rho_c$ and $\rho_f$, i.e., $\rho_b = \alpha \rho_c + (1 - \alpha)\rho_f$, with different weights $\alpha$ in different regions. In particular, in the regions where geometry details are likely to appear, such as the forehead and eye corners, we make the blended albedo close to $\rho_c$ by setting $\alpha$ to 0.65, while in the other regions we encourage the blended albedo to be close to $\rho_f$ by setting $\alpha$ to 0.35. Around the border of the regions, $\alpha$ varies continuously from 0.35 to 0.65. Finally, we use this blended albedo $\rho_b$ as $\rho$ in Eq. (4) for our subsequent data generation process.
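The region-dependent blending can be sketched as a per-pixel convex combination. This sketch assumes that the weight multiplies the smooth model albedo, so that regions prone to geometry detail (weight 0.65) stay closer to the model albedo and the remaining regions (weight 0.35) stay closer to the fine-scale albedo. That reading of the weights and the function name are assumptions.

```python
def blend_albedo(coarse, fine, weight):
    """Per-pixel blend: weight * coarse + (1 - weight) * fine.

    coarse, fine and weight are equally long lists for a single channel.
    """
    return [w * c + (1.0 - w) * f for c, f, w in zip(coarse, fine, weight)]


# Two pixels: the first in a wrinkle-prone region, the second elsewhere.
blended = blend_albedo([1.0, 1.0], [0.0, 0.0], [0.65, 0.35])
```

Near region borders, the weight map would take intermediate values so that the blended albedo changes smoothly.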
In this section, we describe how to train a coarse-layer CNN (called Single-image CoarseNet) that can output the parametric face model parameters (corresponding to a coarse shape) and the pose parameters from the input of a single face image or an independent video frame. Although the network structure of Single-image CoarseNet is similar to that of [60, 42], we use our uniquely constructed training data and loss function, which are elaborated below.
To train Single-image CoarseNet, we need a large-scale dataset of face images with ground-truth 3DMM parameters and pose parameters. Recently, [60] proposed to synthesize a large number of face images by varying the 3DMM parameters fitted from a small number of real face images. However, [60] focuses on the face alignment problem: the colors of the synthesized face images are directly copied from the source images without considering the underlying rendering process, which makes the synthesized images not photo-realistic and thus unsuitable for high-quality 3D face reconstruction. Later, [42] follows the idea of using synthetic data for learning detailed 3D face reconstruction and directly renders a large number of face images by varying the existing 3DMM parameters with random texture, lighting, and reflectance. However, since 3DMM is a low-dimensional model and the albedo is also of low frequency, the synthetic images in [42] are not photo-realistic either, not to mention the random background used in the rendered images. In addition, the synthetic images in [42] are not available to the public.
Therefore, in this paper, we propose to use our inverse rendering described in Sec. V to synthesize photo-realistic images at large scale, which addresses the shortcomings of the synthetic face images generated in [60, 42]. In particular, we choose 4000 face images (dataset A), in which faces are not occluded, from 300W and Multi-PIE. For each of the 4000 images, we use our optimization-based inverse rendering method to obtain its parameter set. Then, to make our coarse-layer network robust to expression and pose, we render new face images by randomly changing the pose parameters and the expression parameters, each change leading to a new parameter set. By rasterizing with the new parameter set, we obtain the normals of all pixels in the new face region as described in Sec. III. With these normals and the albedos obtained according to Sec. V-C, a new face is then rendered using Eq. (4). We also warp the background region of the source image to fit the new face region by using image meshing. Fig. 3 shows an example of generating three synthetic images from an input real image by simultaneously changing the expression and pose parameters. In this way, we generate a synthetic dataset of 80,000 face images in total for Single-image CoarseNet training by randomly varying the expression and pose parameters 20 times for each of the 4000 real face images.
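The parameter-side bookkeeping of this augmentation can be sketched as below. The text only says that pose and expression are changed randomly, so the Gaussian sampling, its standard deviations, the dictionary layout and the function name are all hypothetical. Rendering and background warping are not shown.

```python
import random


def augment(params, copies, rng, pose_sigma=0.3, exp_sigma=1.0):
    """Create `copies` variants of one fitted parameter set.

    Only pose and expression are resampled; identity, albedo and lighting
    are kept. The Gaussian distributions are assumptions of this sketch.
    """
    variants = []
    for _ in range(copies):
        new = dict(params)
        new["pose"] = [p + rng.gauss(0.0, pose_sigma) for p in params["pose"]]
        new["expression"] = [rng.gauss(0.0, exp_sigma)
                             for _ in params["expression"]]
        variants.append(new)
    return variants


rng = random.Random(0)
fitted = {"identity": [0.1, -0.2], "pose": [0.0, 0.0, 0.0],
          "expression": [0.0, 0.0]}
variants = augment(fitted, copies=20, rng=rng)
```

With 20 variants for each of 4000 fitted images, this yields the 80,000 training samples mentioned above.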
The input to our Single-image CoarseNet is a face image, and the output is the parameters related to the shape of the 3D face and the projection, i.e., the geometry and pose parameters. The network is based on ResNet-18, with the output size of the fully-connected layer changed to 185 (100 for identity, 79 for expression, 3 for rotation, 2 for translation and 1 for scale). The input image size is .
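The 185-dimensional output can be unpacked into named groups as in the sketch below. The group sizes come from the text, but the ordering of the groups inside the vector and the names used here are assumptions.

```python
# Group sizes as listed in the text; the ordering is an assumption.
COARSE_LAYOUT = [
    ("identity", 100),
    ("expression", 79),
    ("rotation", 3),
    ("translation", 2),
    ("scale", 1),
]


def split_output(vector, layout=COARSE_LAYOUT):
    """Split a flat network output into named parameter groups."""
    expected = sum(size for _, size in layout)
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    parts, start = {}, 0
    for name, size in layout:
        parts[name] = vector[start:start + size]
        start += size
    return parts


parts = split_output([float(i) for i in range(185)])
```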
As pointed out in prior work, different parameters have different influence on the estimated geometry, so a direct MSE (mean squared error) loss on the parameters might not lead to good geometry reconstruction. One approach uses a weighted MSE loss, where the weights are based on the projected vertex distances. Another uses 3D vertex distances to measure the loss from the geometry parameters and MSE for the pose parameters. Because these vertex-based distance measures are calculated on the vertex grid, which might not measure well how the parameters fit the input face image, in this work we use a loss function that computes the distance between the ground-truth parameters and the network output parameters at the per-pixel level.
In particular, we first rasterize with the ground-truth parameters to get the underlying triangle index and the barycentric coordinates for each pixel in the face region. With this information, we then construct the pixels’ 3D mean shape, identity basis and expression basis by barycentrically interpolating the corresponding rows of the mean shape, the identity basis and the expression basis of the parametric model, respectively. In this way, given a set of geometry and pose parameters, we can project all the corresponding 3D locations of the pixels onto the image plane with the weak perspective projection of Eq. (3).
Then the loss between the ground-truth parameters and the network output parameters is defined as the distance between the pixel projections obtained with the two parameter sets (Eq. (12)).
Note that there is no need to compute the projection with the ground-truth parameters, since it corresponds to the original pixel locations in the image plane.
For better convergence, we further separate the loss in Eq. (12) into a pose-dependent loss (Eq. (13)) and a geometry-dependent loss (Eq. (14)). The pose-dependent loss uses the projection with the ground-truth geometry parameters and the network-estimated pose parameters, while the geometry-dependent loss uses the projection with the ground-truth pose parameters and the network-estimated geometry parameters.
The final loss is a weighted sum of the two losses, controlled by a tradeoff parameter. The tradeoff parameter is set to balance the two losses, and we treat it as a constant when computing the derivatives for back-propagation.
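The decoupled per-pixel loss can be sketched as follows. The pixels' 3D locations are taken as already computed from the geometry parameters, the squared Euclidean distance is used, and the two terms are combined as w * pose loss + (1 - w) * geometry loss with w supplied as a constant. These choices, like the function names, are assumptions, since the exact form of the weighting is not given here.

```python
def project(points, scale, rotation, translation):
    """Weak perspective projection of a list of 3D points to 2D."""
    projected = []
    for x, y, z in points:
        u = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z
        v = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z
        projected.append((scale * u + translation[0],
                          scale * v + translation[1]))
    return projected


def squared_distance(a, b):
    """Sum of squared 2D distances between two lists of points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b))


def decoupled_loss(points_gt, pose_gt, points_pred, pose_pred, w):
    """Return (total, pose_loss, geometry_loss).

    pose_* are (scale, rotation, translation) tuples; w is a constant weight.
    """
    target = project(points_gt, *pose_gt)  # equals the original pixel locations
    pose_loss = squared_distance(project(points_gt, *pose_pred), target)
    geo_loss = squared_distance(project(points_pred, *pose_gt), target)
    return w * pose_loss + (1.0 - w) * geo_loss, pose_loss, geo_loss


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
total, pose_loss, geo_loss = decoupled_loss(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], (1.0, IDENTITY, (0.0, 0.0)),
    [(0.0, 2.0, 0.0), (1.0, 0.0, 0.0)], (2.0, IDENTITY, (0.0, 0.0)),
    w=0.5,
)
```

Each term isolates one source of error: the pose term sees only the pose error because the geometry is held at its ground truth, and vice versa.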
The purpose of Tracking CoarseNet is to predict the current frame’s parameters, given not only the current video frame but also the previous frame’s parameters. As no large-scale dataset exists that captures the correlations among adjacent video frames, our Tracking CoarseNet also faces the problem of insufficient well-labelled training data. Similarly, we synthesize training data for Tracking CoarseNet. However, it is non-trivial to reuse the $(k-1)$-th frame’s parameters to predict the $k$-th frame’s parameters. Directly using all the previous frame’s parameters as the input to Tracking CoarseNet will introduce too many uncertainties during training, which results in huge complexity in synthesizing adjacent video frames for training, and makes the training hard to converge and the testing unstable. Through extensive experiments, we find that utilizing only the previous frame’s pose parameters is a good way to inherit the coherence while keeping the network trainable and stable.
Specifically, the input to the tracking network is the -th face frame cropped by the frame’s landmarks and a Projected Normalized Coordinate Code (PNCC)  rendered using the frame’s pose parameters , and the mean 3D face in Eq. (1). The output of the tracking network is parameters , where denotes the difference between the current frame and the previous frame. Note that here the output also includes albedo and lighting parameters, which could be used for different video editing applications.
The network structure is the same as Single-image CoarseNet except that the output number of the fully-connected layer is 312 (100 for identity, 79 for expression, 3 for rotation, 2 for translation, 1 for scale, 100 for albedo and 27 for lighting coefficients). In addition to the loss terms and defined in Eq. (13) and Eq. (14) respectively, Tracking CoarseNet also uses another term for and that measures the distance between the rendered image and the input frame:
where is the rendered face image with the groundtruth geometry and pose, and the estimated albedo and lighting, and is the input face frame. In this way, the final total loss becomes a weighted sum of the three losses:
where and are the tradeoff parameters to balance the three losses, and we assume and are constant when computing the derivatives for back propagation.
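As a small illustration of the 312-dimensional output described above, the sketch below splits a flat output vector into named parameter groups. The group sizes come from the text; the ordering of the groups inside the vector and the names `LAYOUT` and `split_output` are assumptions made for this example.

```python
import numpy as np

# Group sizes as listed in the text; the ordering is an assumption.
LAYOUT = [("identity", 100), ("expression", 79), ("rotation", 3),
          ("translation", 2), ("scale", 1), ("albedo", 100), ("lighting", 27)]

def split_output(vec):
    """Split a flat 312-dimensional network output into named parameter groups."""
    vec = np.asarray(vec)
    assert vec.shape == (sum(n for _, n in LAYOUT),)
    parts, start = {}, 0
    for name, n in LAYOUT:
        parts[name] = vec[start:start + n]
        start += n
    return parts
```

The 27 lighting coefficients are consistent with 9 second-order spherical harmonics coefficients per color channel, although the per-channel layout is not stated here and is left as an assumption.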
Training data generation for Tracking CoarseNet. To train Tracking CoarseNet, large-scale adjacent video frame pairs with ground-truth parameters are needed as training data. Again, there is no such public dataset. To address this problem, we propose to simulate adjacent video frames, i.e., to generate the previous frame for each of the 80,000 synthesized images used in the Single-image CoarseNet training. Randomly varying the parameter set for a training image does not capture the tight correlations among adjacent frames. Thus, we perform the simulation by analysing, on real videos, the distribution of the previous frame’s parameters given the current frame. Since our tracking network only makes use of the previous frame’s pose parameters, we only need the distribution of those pose parameters given the current frame’s parameters. In particular, we assume that each of these parameters follows a normal distribution. We extract about 160,000 adjacent frame pairs from the 300-VW video dataset and use our Single-image CoarseNet to obtain the parameters for fitting the normal distribution. Finally, for each of the 80,000 synthesized images, we simulate its previous frame by sampling the previous frame’s pose parameters from the fitted normal distribution. Examples of several simulated pairs with the previous frame’s PNCC and the current image are shown in Fig. 4.
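The simulation of previous-frame poses can be sketched as follows. This is only an illustration under stated assumptions: the pose is a 6-vector (3 rotation, 2 translation, 1 scale), the modelled quantity is the per-parameter difference between the previous and current pose, and each difference is an independent Gaussian. The function names are hypothetical.

```python
import numpy as np

def fit_pose_deltas(prev_poses, curr_poses):
    """Fit an independent normal distribution to each pose difference.
    prev_poses, curr_poses: (num_pairs, 6) arrays from real adjacent frames."""
    deltas = np.asarray(prev_poses, dtype=float) - np.asarray(curr_poses, dtype=float)
    return deltas.mean(axis=0), deltas.std(axis=0)

def simulate_previous_pose(curr_pose, mu, sigma, rng):
    """Sample a plausible previous-frame pose for one synthesized image."""
    return np.asarray(curr_pose, dtype=float) + rng.normal(mu, sigma)
```

In use, `fit_pose_deltas` would be run once on the poses estimated from real video pairs, and `simulate_previous_pose` once per synthesized image; the sampled pose is then what the PNCC input would be rendered from.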
In this section, we present our solution on how to train a fine-layer CNN (called FineNet). The input to FineNet is a coarse depth map stacked with the face image. The coarse depth map is generated by using the method described in Sec. V-B with the parameters estimated by either Single-image CoarseNet or Tracking CoarseNet. The output of our FineNet is a per-pixel displacement map. Again, the key challenge here is that there is no fine-scale face dataset available that can provide a large number of detailed face geometries with their corresponding 2D images, as pointed out in . In addition, the existing morphable face models such as 3DMM cannot capture the fine-scale face details.  bypasses this challenge by converting the problem into an unsupervised setting, i.e. relating the output depth map to the 2D image by using the shading energy as the loss function. However, to make the back-propagation tractable under the shading energy, they have to use first-order spherical harmonics to model the lighting, which is not accurate.
In our work, instead of doing unsupervised training , we go for fully supervised training of FineNet, i.e. directly constructing a large-scale detailed face dataset based on our developed inverse rendering and a novel face detail transfer approach, which will be elaborated below. Note that our FineNet architecture is based on the U-Net  and we use Euclidean distance as the loss function.
Our synthesized training data for FineNet is generated by transferring the displacement map from a source face image with fine-scale details such as wrinkles and folds to other target face images without the details. Fig. 5 gives such an example. In particular, we first apply our developed inverse rendering in Sec. V on both images. Then we find correspondences between the source image pixels and the target image pixels using the rasterization information described in Sec. III. That is, for a pixel in the target face region, if its underlying triangle is visible in the source image, we find its corresponding 3D location on the target 3D mesh by barycentric interpolation, and then we project the 3D location onto the source image plane using Eq. (3) to get the corresponding pixel . With these correspondences, the original source displacement and the original target displacement , a new displacement for the target image is generated by matching its gradients with the scaled source displacement gradient in the intersected region by solving the following Poisson problem:
where and is a scale factor within the range so as to create different displacement fields. After that, we add into the coarse target depth to get the final depth map. Then the normals of the target face pixels are updated as in Sec. V-B. With the updated normals, a new face image is rendered using Eq. (4).
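To make the gradient-matching step more concrete, here is a minimal sketch of a discrete Poisson solve with Jacobi iterations. It is a generic gradient-domain formulation, not the authors' solver: it assumes the source displacement has already been warped into the target image through the pixel correspondences, that the new displacement equals the original target displacement outside the intersected region (a Dirichlet boundary condition), and that a 5-point Laplacian is used. The function name and arguments are hypothetical.

```python
import numpy as np

def transfer_detail(d_src, d_tgt, mask, scale, iters=2000):
    """Find d with Laplacian(d) = scale * Laplacian(d_src) inside `mask`
    and d = d_tgt elsewhere, using Jacobi iterations."""
    d = d_tgt.astype(float).copy()
    g = scale * d_src.astype(float)
    # 5-point Laplacian of the scaled source displacement (the guidance term).
    lap = np.zeros_like(g)
    lap[1:-1, 1:-1] = (g[:-2, 1:-1] + g[2:, 1:-1] +
                       g[1:-1, :-2] + g[1:-1, 2:] - 4.0 * g[1:-1, 1:-1])
    # Only update pixels that have all four neighbours inside the image.
    inner = mask.copy()
    inner[0, :] = inner[-1, :] = False
    inner[:, 0] = inner[:, -1] = False
    for _ in range(iters):
        nb = np.zeros_like(d)
        nb[1:-1, 1:-1] = (d[:-2, 1:-1] + d[2:, 1:-1] +
                          d[1:-1, :-2] + d[1:-1, 2:])
        d[inner] = (nb[inner] - lap[inner]) / 4.0
    return d
```

Varying `scale` yields displacement fields of different strengths from the same source detail. Jacobi iteration is chosen here only for brevity; a sparse direct or multigrid solver would be the usual choice at image resolution.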
We would like to point out that, besides generating a large number of detailed face images to train the network, such detail transfer has other benefits. First, by rendering the same type of detail information under different lighting conditions, we can train our FineNet to be robust to lighting. Second, by changing the scale of the displacement randomly, our method can be trained to be robust to different scales of details.
For the details of the dataset construction, we first download 1000 real face images (dataset B) that contain rich geometry details from the Internet. Then, we transfer the details from dataset B to the 4000 real face images in dataset A, the one used in constructing synthetic data for Single-image CoarseNet. For every image in A, we randomly choose 30 images in B for transferring. In this way, we construct a synthesized fine-detailed face image dataset of 120,000 images in total.
In this section, we conduct qualitative and quantitative evaluation on the proposed detailed 3D face reconstruction and tracking framework and compare it with the state-of-the-art methods.
Experimental setup and runtime.
We train the CNNs via the CAFFE framework. Single-image CoarseNet takes the input of a color face image with size , and Tracking CoarseNet and FineNet respectively take the inputs of (a color image and a PNCC) and (a gray image and its coarse depth). We train all the networks using Adam solver with the mini-batch size of 100 and 30k iterations. The base learning rate is set to be .
The CNN based 3D face reconstruction and tracking are implemented in C++ and tested on various face images and videos. All experiments were conducted on a desktop PC with a quad-core Intel CPU i7, 4GB RAM and NVIDIA GTX 1070 GPU. As for the running time for each frame, it takes 5 ms for CoarseNet and 15 ms for FineNet.
CoarseNet vs. FineNet. Our approach is to progressively and continuously estimate the detailed facial geometry, albedo and lighting parameters from a monocular face video. Fig. 6 shows the tracking output results of the two stages. The results of CoarseNet include the smooth geometry and the corresponding rendered face image shown in the middle column. FineNet further predicts the pixel-level displacement given in the last column. We can see that CoarseNet produces smooth geometry and well-matched rendered face images, which show the good recovery of pose, albedo, lighting and projection parameters, and FineNet nicely recovers the geometry details such as wrinkles. The complete reconstruction results for all the video frames are given in the accompanying video or via the link: https://youtu.be/dghlMXxD-rk.
Single-image CoarseNet vs. Tracking CoarseNet. Given an RGB video, a straightforward way for dense face tracking is to treat all frames as independent face images, and apply our Single-image CoarseNet on each frame, followed by applying FineNet. Thus, we give a comparison of our proposed Tracking CoarseNet, which estimates the differences of the pose parameters w.r.t. the previous frame, with the baseline that simply uses our Single-image CoarseNet on each frame. As demonstrated in Fig. 7, Tracking CoarseNet achieves more robust tracking than the baseline, since it makes good use of the guidance from the previous frame’s pose.
Comparisons with dense face tracking methods. We compare our method with the state-of-the-art monocular video based dense face tracking methods [48, 18, 22].  performs 3D face reconstruction in an iterative manner. In each iteration, they first reconstruct coarse-scale facial geometry from sparse facial features and then refine the geometry via shape from shading.  employs a multi-layer approach to reconstruct fine-scale details. They encode different scales of 3D face geometry on three different layers and do optimization for each layer.  reconstructs the 3D face shape by only fitting the 2D landmarks via 3DMM, and we can observe that  can only produce smooth face reconstruction. As shown in Fig. 8, our method produces visually better results compared to , and comparable results compared to  and .
Unlike optimization-based methods, our learning-based approach is much faster while obtaining comparable or better results. Our method is several orders of magnitude faster than the state-of-the-art optimization-based approach , i.e., 5 ms for CoarseNet and 15 ms for FineNet with our hardware setting, versus the 175.5 s reported in their paper . It needs to be pointed out that the existing optimization-based dense tracking methods need facial landmark constraints. Therefore, they might not reconstruct well for faces with large poses and extreme expressions. On the other hand, we do large-scale photo-realistic image synthesis that includes many challenging data with well-labelled parameters, and thus we can handle those challenging cases as demonstrated in Fig. 9.
Quantitative results of face reconstruction from monocular video. For quantitative evaluation, we test on the FaceCap dataset. The dataset consists of 200 frames along with 3D meshes constructed using the binocular approach. We compare our proposed inverse rendering approach and our learning-based solutions including Tracking CoarseNet and Tracking CoarseNet+FineNet. For each method, we register the depth cloud to the groundtruth 3D mesh and compare the point-to-point distance. Table I shows the average point-to-point distance results. It can be seen that our proposed inverse rendering achieves an average distance of 1.81mm, which is quite accurate. It demonstrates the suitability of using the inverse rendering results for constructing the training data. On the other hand, our CoarseNet+FineNet achieves an average distance of 2.08mm, which is comparable to that of the inverse rendering but with much faster processing speed (25ms vs 8s per frame). Some samples are shown in Fig. 10. In addition, the reconstruction accuracy of CoarseNet+FineNet is better than that of CoarseNet alone. Because the region containing wrinkles is only a small part of the whole face and the accuracy statistics are computed over the large face region, the difference is not significant. When the reconstruction accuracy is compared on a small region that contains wrinkles, the improvement is more obvious, as shown in Fig. 11.
Table I: Average point-to-point distance (mm).
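The average point-to-point distance used in this evaluation can be sketched as below. This is an illustrative computation only: it assumes the reconstructed depth cloud has already been registered to the ground truth, represents the ground-truth mesh by its vertices, and uses a brute-force nearest-neighbour search. The function name is hypothetical.

```python
import numpy as np

def mean_point_to_point(recon, gt):
    """Mean distance from each reconstructed point to its nearest
    ground-truth point. recon: (N, 3), gt: (M, 3), same units (e.g. mm)."""
    diff = recon[:, None, :] - gt[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (N, M) pairwise distances
    return dist.min(axis=1).mean()
```

The pairwise distance matrix costs O(N·M) memory, so for dense depth clouds a spatial index such as a k-d tree would be the practical choice.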
For the quantitative comparison with the state-of-the-art monocular video based face tracking method , we evaluate the geometric accuracy of the reconstruction of a video frame with rich face details (note that  did not provide the results for the entire video). Fig. 12 shows the results, where our method achieves a mean error of 1.96mm compared to the groundtruth 3D face shape generated by the binocular facial performance capture proposed in . We can see that the result of our learning based face tracking method is quite close to the groundtruth, and is comparable (1.96mm vs. 1.8mm) to that of the complex optimization based approach  but with much faster processing speed.
Visual results of our single-image based reconstruction. To evaluate the single-image based reconstruction performance, we show the reconstruction results of our method (Single-image CoarseNet+FineNet) on some images from AFLW  dataset, VGG-Face dataset  and some face images downloaded from the Internet. The three rows in Fig. 13 from top to bottom respectively show the projected 3D meshes reconstructed by our method under large poses, extreme expressions and face images with detailed wrinkles, which demonstrate that our method is robust to all of them.
Comparisons with inverse rendering. Similar to the video input scenario, directly using our developed inverse rendering approach can also reconstruct detailed geometries from a single image, but our learning-based method does provide some advantages. First, unlike the inverse rendering approach, our learning-based method does not need face alignment information. Therefore, the learning-based method is more robust to input face images with large poses, as shown in Fig. 14. Second, once the two CNNs are trained, our learning method is much faster at reconstructing a face geometry from a single input image. Third, as we render the same type of wrinkles under different lightings and directly learn the geometry in a supervised manner, our method is more robust to lighting, as illustrated in Fig. 15. The reason why the learning-based method can do better in these scenarios lies in the large amount of diverse training data we construct, which facilitates the learning of the two networks, while the inverse rendering approach only exploits the information from each single image.
Comparisons with state-of-the-art single-image based face reconstruction. We compare our method with [42, 3, 24, 46, 14] on single-image based face reconstruction. We thank the authors of  for providing us the same 11 images listed in , as well as their results of another 8 images supplied by us. We show the reconstruction results of 4 images in Fig. 16 and the full comparisons on all the 19 images are given in the accompanying material. It can be observed that our method produces more convincing reconstruction results in both the global geometry (see the mouth regions) and the fine-scale details (see the forehead regions). The reconstruction results of the methods [3, 24, 46, 14] are generated using the source codes provided by the authors (https://github.com/waps101/3DMM_edges, https://github.com/AaronJackson/vrn, https://github.com/unibas-gravis/basel-face-pipeline, https://github.com/unibas-gravis/scalismo-faces).
The reasons why our method produces better results than  are threefold: 1) For CoarseNet training,  only renders face region and uses random background, while our rendering is based on real images and the synthesized images are more photo-realistic. For FineNet training, we render images with fine-scale details, and train FineNet in a supervised manner, while  trains FineNet in an unsupervised manner. 2) For easy back propagation,  adopts the first-order spherical harmonics (SH) to model lighting, while we use the second-order SH, which can reconstruct more accurate geometry details. 3) Our proposed loss function in CoarseNet better fits the goal and calculating the parameters in pixel level can achieve more stable and faster convergence. We did an experiment to compare our loss function in Eq. (17) with the one used in . Specifically, we used the two loss functions separately to train CoarseNet with 15000 iterations and batch size 100. Table II shows the results of the test errors under different metrics on the test set (about 700 AFLW images). We can see that no matter which metric is used, either our defined metrics ( and ), or the metrics employed in  (MSE for pose parameters and vertex distance for geometry parameters), our method always achieves lower testing errors than , which demonstrates the effectiveness of the defined loss function for training.
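|Metric||Loss used in the compared method||Our loss in Eq. (17)|
|Pose-dependent loss in Eq. (13)||26.35||7.69|
|Geometry-dependent loss in Eq. (14)||5.53||4.23|
|MSE (pose parameters)||1.91||0.56|
|Mean vertex distance (geometry parameters)||5.18||4.55|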
Quantitative results of single-image based dense face reconstruction. For quantitative evaluation, we compare our method with the landmark-based method  and the learning-based method  on the Spring2004range subset of Face Recognition Grand Challenge dataset V2 . The Spring2004range has 2114 face images and their corresponding depth images. We use the face alignment method  to detect facial landmarks as the input of . For comparison, we project the reconstructed 3D face on the depth image, and use both Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) metrics to measure the difference between the reconstructed depth and the ground truth depth on the valid pixels. We discard the images in which the projected face region is very far away from the real face region for any of the three methods, which leaves 2100 images for the comparisons. The results are shown in Table III. It can be seen that our method outperforms the other two recent methods in both RMSE and MAE. The results of  and  are generated by directly running their publicly released code.
|Method||RMSE [mm]||MAE [mm]|
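For reference, the two depth metrics reported in Table III can be computed over the valid pixels as in the short sketch below. The function name and the boolean `valid` mask are illustrative assumptions; how the valid pixels are selected is not specified beyond the description above.

```python
import numpy as np

def depth_errors(pred_depth, gt_depth, valid):
    """Return (RMSE, MAE) between predicted and ground-truth depth maps,
    evaluated only where the boolean mask `valid` is True."""
    diff = pred_depth[valid] - gt_depth[valid]
    rmse = np.sqrt(np.mean(diff ** 2))
    mae = np.mean(np.abs(diff))
    return rmse, mae
```

Because RMSE squares the residuals, it penalizes a few large depth errors more heavily than MAE does, which is why reporting both gives a fuller picture.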
Note that we are not able to perform a quantitative comparison with the state-of-the-art method , since their code is not released. Their reported MAE value for the Spring2004range dataset is lower than what we obtain in Table III. We believe it is due to the masks they used in their MAE computation, which are unfortunately not available to us. Although we cannot give a quantitative comparison, the visual comparison shown in Fig. 16 clearly demonstrates the superior face reconstruction performance of our method.
We have presented a coarse-to-fine CNN framework for real-time textured dense 3D face reconstruction and tracking from monocular RGB video as well as from a single RGB image. The training data for our convolutional networks are constructed by the optimization-based inverse rendering approach. In particular, we construct the training data by varying the pose and expression parameters, by detail transfer, and by simulating video-type adjacent frame pairs. With the well-constructed large-scale training data, our framework recovers the detailed geometry, albedo, lighting, pose and projection parameters in real time. We believe that our well-constructed datasets including 2D face images, 3D coarse face models, 3D fine-scale face models, and multi-view face images of the same person could be applied to many other face analysis problems like face pose estimation, face recognition and face normalization.
Our work has limitations. In particular, like many recent 3D face reconstruction works [48, 23, 18], we assume Lambertian surface reflectance and smoothly varying illumination in our inverse rendering procedure, which may lead to inaccurate fitting for face images with specular reflections or self-shadowing. It is worth investigating more powerful formulations to handle general reflectance and illumination.
We thank Thomas Vetter et al. and Kun Zhou et al. for allowing us to use their 3D face datasets. This work was supported by the National Key R&D Program of China (No. 2016YFC0800501) and the National Natural Science Foundation of China (No. 61672481).
IEEE Conference on Computer Vision and Pattern Recognition, pages 1272–1279, 2013.
On the early history of the singular value decomposition. SIAM Review, 35(4):551–566, 1993.
MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In The IEEE International Conference on Computer Vision (ICCV), 2017.