The 3D Morphable Model (3DMM) is a statistical model of 3D facial shape and texture in a space where there are explicit correspondences. The morphable model framework provides two key benefits: first, a point-to-point correspondence between the reconstruction and all other models, enabling “morphing”, and second, modeling underlying transformations between types of faces (male to female, neutral to smile, etc.).
3DMM has been widely applied in numerous areas including, but not limited to, computer vision [1, 2, 3], computer graphics [4, 5, 6, 7], human behavioral analysis [8, 9] and craniofacial surgery.
Traditionally, a 3DMM is learnt through supervision by performing dimension reduction, typically Principal Component Analysis (PCA), on a training set of co-captured 3D face scans and 2D images. To model highly variable 3D face shapes, a large number of high-quality 3D face scans is required. However, this requirement is expensive to fulfill, as acquiring face scans is laborious in both the data capturing and post-processing stages. The first 3DMM was built from scans of 200
subjects of a similar ethnicity and age group. They were also captured in well-controlled conditions, with only neutral expressions. Hence, it is fragile to large variations in face identity. The widely used Basel Face Model (BFM) is likewise built from only 200 subjects in neutral expressions. The lack of expressions can be compensated using expression bases from FaceWarehouse or BU-3DFE, which are learned from the offsets to the neutral pose. More than a decade later, almost all existing models still use no more than a few hundred training scans. Such small training sets are far from adequate to describe the full variability of human faces. Only recently, with significant effort as well as a novel automated and robust model-construction pipeline, did Booth et al. build the first large-scale 3DMM from scans of nearly 10,000 subjects.
Second, the texture model of a 3DMM is normally built with a small number of 2D face images co-captured with the 3D scans, under well-controlled conditions. Despite considerable improvements in 3D acquisition devices over the last few years, these devices still cannot operate in arbitrary in-the-wild conditions. Therefore, all current 3D facial datasets have been captured in laboratory environments. Hence, such models only learn to represent facial texture in similar, rather than in-the-wild, conditions. This substantially limits their application scenarios.
Finally, the representation power of a 3DMM is limited not only by the size or type of training data but also by its formulation. Facial variations are nonlinear in nature. For example, the variations across different facial expressions or poses are nonlinear, which violates the linear assumption of PCA-based models. Thus, a PCA model is unable to interpret facial variations sufficiently well. This is especially true for facial texture. For all current 3DMM models, the low-dimensional albedo subspace faces the same problem of lacking facial hair, e.g., beards. To reduce the fitting error, the model compensates for unexplainable texture by altering the surface normals or shrinking the face shape. Either way, linear 3DMM-based applications often degrade in performance when handling out-of-subspace variations.
Given the barriers of the 3DMM in its data, supervision, and linear bases, this paper aims to revolutionize the paradigm of learning 3DMMs by answering a fundamental question:
Whether and how can we learn a nonlinear 3D Morphable Model of face shape and albedo from a set of in-the-wild 2D face images, without collecting 3D face scans?
If the answer were yes, this would be in sharp contrast to the conventional 3DMM approach, and would remedy all the aforementioned limitations. Fortunately, we have developed approaches that offer positive answers to this question. With the recent development of deep neural networks, we believe it is the right time to undertake this new paradigm of 3DMM learning. Therefore, the core of this paper concerns how to learn this new 3DMM, what the representation power of the model is, and what the benefits of the model are for facial analysis.
We propose a novel paradigm to learn a nonlinear 3DMM model from a large collection of in-the-wild 2D face images, without acquiring 3D face scans, by leveraging the power of deep neural networks to capture variations and structures in complex face data. As shown in Fig. 1, we start with the observation that the linear 3DMM formulation is equivalent to a single-layer network, so using a deep network architecture naturally increases the model capacity. Hence, we utilize two convolutional neural network decoders, instead of two PCA spaces, as the shape and albedo model components, respectively. Each decoder takes a shape or albedo parameter as input and outputs the dense 3D face mesh or a face skin reflectance. These two decoders are essentially the nonlinear 3DMM.
Further, we learn the fitting algorithm for our nonlinear 3DMM, which is formulated as a CNN encoder. The encoder network takes a face image as input and generates the shape and albedo parameters, from which the two decoders estimate the shape and albedo.
The 3D face and albedo would perfectly reconstruct the input face if the fitting algorithm and the 3DMM were well learnt. Therefore, we design a differentiable rendering layer that generates a reconstructed face by fusing the 3D face, albedo, lighting, and the camera projection parameters estimated by the encoder. Finally, an end-to-end learning scheme is constructed, in which the encoder and the two decoders are learnt jointly to minimize the difference between the reconstructed face and the input face. Jointly learning the 3DMM and the model-fitting encoder allows us to leverage a large collection of in-the-wild 2D images without relying on 3D scans. We show significantly improved shape and facial texture representation power over the linear 3DMM. Consequently, this also benefits other tasks such as 2D face alignment, 3D reconstruction, and face editing.
A preliminary version of this work was published in the 2018 IEEE Conference on Computer Vision and Pattern Recognition. We extend it in numerous ways: 1) Instead of having lighting embedded in the texture, we split the texture into albedo and shading. Faithfully modeling the lighting helps to improve the shape modeling, as it can guide the surface normal learning. This results in better performance on downstream tasks, alignment and reconstruction, as demonstrated in our experiment section. 2) We propose to represent the shape component in the 2D UV space, which helps to preserve the spatial relations among its vertices. This also allows us to use a CNN, rather than an expensive multi-layer perceptron, as the shape decoder. 3) To ensure plausible reconstructions, we employ multiple constraints to regularize the model learning.
In summary, this paper makes the following contributions:
We learn a nonlinear 3DMM model that fully models shape, albedo, and lighting, and that has greater representation power than its traditional linear counterpart.
Both shape and albedo are represented as 2D images, which helps to maintain spatial relations and leverages the power of CNNs in image synthesis.
We jointly learn the model and the model-fitting algorithm via weak supervision, by leveraging a large collection of 2D images without 3D scans. The novel rendering layer enables end-to-end training.
The new 3DMM further improves performance in related tasks: face alignment, face reconstruction, and face editing.
2 Prior Work
Linear 3DMM. Blanz and Vetter propose the first generic 3D face model learned from scan data. They define a linear subspace to represent shape and texture using principal component analysis (PCA) and show how to fit the model to data. Since this seminal work, there has been a large amount of effort to improve the 3DMM modeling mechanism. In the original work, the dense correspondence between facial meshes is solved with a regularised form of optical flow. However, this technique is only effective in a constrained setting, where subjects share similar ethnicities and ages. To overcome this challenge, Patel and Smith employ a Thin Plate Splines (TPS) warp to register the meshes into a common reference frame. Alternatively, Paysan et al. use Nonrigid Iterative Closest Point (ICP) to directly align the 3D scans. In a different direction, Amberg et al. extend Blanz and Vetter’s PCA-based model to emotive facial shapes by adopting an additional PCA model of the residuals from the neutral pose. This results in a single linear model of both the identity and expression variation of the 3D facial shape. Vlasic et al. use a multilinear model to represent the combined effect of identity and expression variation on the facial shape. Later, Bolkart and Wuhrer show how such a multilinear model can be estimated directly from 3D scans using a joint optimization over the model parameters and a groupwise registration of the 3D scans.
Improving Linear 3DMM. With PCA bases, the statistical distribution underlying a 3DMM is Gaussian. Koppen et al. argue that a single-mode Gaussian cannot represent real-world distributions well. They introduce the Gaussian Mixture 3DMM, which models the global population as a mixture of Gaussian subpopulations, each with its own mean but a shared covariance. Booth et al. [23, 24] aim to improve the texture of the 3DMM beyond controlled settings by learning an “in-the-wild” feature-based texture model. In another direction, Tran et al. learn to regress robust and discriminative 3DMM representations by leveraging multiple images of the same subject. However, all of these works are still based on statistical PCA bases. Duong et al. address the problem of linearity in face modeling by using Deep Boltzmann Machines. However, they only work with 2D faces and sparse landmarks, and hence cannot handle faces with large pose variations or occlusion well. Concurrent to our work, Tewari et al. learn a (potentially nonlinear) corrective model on top of a linear model. Their final model is a summation of the base linear model and the learned corrective model, in contrast to our unified model. Furthermore, our model has the advantage of using a 2D representation of both shape and albedo, which maintains the spatial relations between vertices and leverages CNN power for image synthesis. Finally, thanks to our novel rendering layer, we are able to employ perceptual and adversarial losses to improve reconstruction quality.
2D Face Alignment. 2D face alignment [28, 29] can be cast as a regression problem in which 2D landmark locations are regressed directly. For large-pose or occluded faces, strong priors from the 3DMM face shape have been shown to be beneficial. Hence, there is increasing attention on conducting face alignment by fitting a 3D face model to a single 2D image [32, 33, 34, 35, 36, 37, 38]. Among the prior works, iterative approaches with a cascade of regressors tend to be preferred. At each cascade stage, there is a single [39, 31] or even two regressors used to improve the prediction. Recently, Jourabloo and Liu propose a CNN architecture that enables end-to-end training of their network cascade. In contrast to the aforementioned works, which use a fixed 3DMM model, our model and model fitting are learned jointly. This results in a more powerful model: a single-pass encoder, learned jointly with the model, achieves state-of-the-art face alignment performance on the AFLW benchmark dataset.
3D Face Reconstruction. Face reconstruction creates a 3D face model from an image collection [41, 42] or even a single image [43, 44]. This long-standing problem draws a lot of interest because of its wide applications. The 3DMM also demonstrates its strength in face reconstruction, especially in the monocular case. This problem is highly under-constrained, as with a single image the available information about the surface is limited. Hence, 3D face reconstruction must rely on prior knowledge such as a 3DMM. A statistical PCA linear 3DMM is the most commonly used approach. Besides 3DMM fitting methods [45, 46, 47, 48, 49, 50], Richardson et al. recently design a refinement network that adds facial details on top of the 3DMM-based geometry. However, this approach can only learn a depth map, which loses the correspondence property of the 3DMM. The follow-up work by Sela et al. tries to overcome this weakness by learning a correspondence map. Despite some impressive reconstruction results, both methods are limited by training data synthesized from the linear 3DMM model. Hence, they fail to handle out-of-subspace variations, e.g., facial hair.
Unsupervised learning of 3DMM. Collecting large-scale 3D scans with detailed labels for learning a 3DMM is not an easy task. A few works try to use large-scale synthetic data, as in [43, 52], but these do not generalize well, as there is still a domain gap with real images. Tewari et al. are among the first to attempt to learn 3DMM fitting from unlabeled images. They use an unsupervised loss that compares the projected, textured face mesh with the original image itself. Sparse landmark alignment is also used as an auxiliary loss. Genova et al. further improve this approach by comparing reconstructed images and the original input using higher-level features from a pretrained face recognition network. Compared to these works, our work has a different objective: learning a nonlinear 3DMM.
3 The Proposed Nonlinear 3DMM
In this section, we start by introducing the traditional linear 3DMM and then present our novel nonlinear 3DMM model.
3.1 Conventional Linear 3DMM
3DMMs provide parametric models for synthesizing faces, where faces are modeled using two components: shape and albedo (skin reflectance). Blanz and Vetter propose to describe the 3D face space with PCA:

$$\mathbf{S} = \bar{\mathbf{S}} + \mathbf{A}\boldsymbol{\alpha},$$

where $\mathbf{S} \in \mathbb{R}^{3Q}$ is a 3D face mesh with $Q$ vertices, $\bar{\mathbf{S}}$ is the mean shape, and $\boldsymbol{\alpha}$ is the shape parameter corresponding to the 3D shape basis $\mathbf{A}$. The shape basis can be further split into $\mathbf{A} = [\mathbf{A}_{id}, \mathbf{A}_{exp}]$, where $\mathbf{A}_{id}$ is trained from 3D scans with neutral expression, and $\mathbf{A}_{exp}$ is trained from the offsets between expression and neutral scans.
The albedo of the face, $\mathbf{T} \in \mathbb{R}^{3Q}$, is defined within the mean shape $\bar{\mathbf{S}}$, describing the R, G, B colors of the corresponding vertices. $\mathbf{T}$ is also formulated as a linear combination of basis functions:

$$\mathbf{T} = \bar{\mathbf{T}} + \mathbf{B}\boldsymbol{\beta},$$

where $\bar{\mathbf{T}}$ is the mean albedo, $\mathbf{B}$ is the albedo basis, and $\boldsymbol{\beta}$ is the albedo parameter.
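The two linear components above reduce to plain matrix algebra. The sketch below illustrates this with toy dimensions and random bases; the sizes, the random data, and the `linear_3dmm` helper are illustrative assumptions, not the actual BFM bases:

```python
import numpy as np

# Toy dimensions: Q vertices, l_S shape bases, l_T albedo bases (assumed).
Q, l_S, l_T = 5, 4, 3
rng = np.random.default_rng(0)

S_mean = rng.standard_normal(3 * Q)     # mean shape: stacked (x, y, z) per vertex
A = rng.standard_normal((3 * Q, l_S))   # shape basis (identity + expression columns)
T_mean = rng.random(3 * Q)              # mean albedo: (R, G, B) per vertex
B = rng.standard_normal((3 * Q, l_T))   # albedo basis

def linear_3dmm(alpha, beta):
    """Linear 3DMM: each component is a single matrix multiplication."""
    S = S_mean + A @ alpha              # 3D shape
    T = T_mean + B @ beta               # per-vertex albedo
    return S, T

# With zero parameters the model returns the mean shape and albedo.
S, T = linear_3dmm(np.zeros(l_S), np.zeros(l_T))
assert np.allclose(S, S_mean) and np.allclose(T, T_mean)
```

Viewed this way, each component is a single fully-connected layer with no bias nonlinearity, which is the observation the nonlinear model builds on.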
The 3DMM can be used to synthesize novel views of the face. First, a 3D face is projected onto the image plane with the weak perspective projection model:

$$\mathbf{V} = g(\mathbf{S}, \mathbf{m}) = f \, \mathbf{P} \, \mathbf{R} \, \mathbf{S} + \mathbf{t}_{2d},$$

where $g(\mathbf{S}, \mathbf{m})$ is the projection function leading to the 2D positions $\mathbf{V}$ of the 3D rotated vertices, $f$ is the scale factor, $\mathbf{P} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$ is the orthographic projection matrix, $\mathbf{R}$ is the rotation matrix constructed from three rotation angles (pitch, yaw, roll), and $\mathbf{t}_{2d}$ is the translation vector. While the projection matrix has more entries, it has six degrees of freedom, which are parameterized by the 6-dim vector $\mathbf{m} = (f, \text{pitch}, \text{yaw}, \text{roll}, t_x, t_y)$. Then, the 2D image is rendered using the texture and an illumination model such as the Phong reflection model or Spherical Harmonics.
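The weak perspective projection $g(\mathbf{S}, \mathbf{m})$ can be sketched in a few lines of numpy; the Euler-angle composition order and the helper names below are assumptions for illustration:

```python
import numpy as np

def rotation_matrix(pitch, yaw, roll):
    """R = Rz(roll) @ Ry(yaw) @ Rx(pitch) -- one common convention (assumed)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def weak_perspective(S, f, angles, t2d):
    """g(S, m) = f * P @ R @ S + t2d, applied to a (3, Q) vertex array."""
    P = np.array([[1.0, 0, 0], [0, 1.0, 0]])   # orthographic projection
    R = rotation_matrix(*angles)
    return f * P @ R @ S + t2d.reshape(2, 1)

S = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])   # two toy vertices
V = weak_perspective(S, f=2.0, angles=(0, 0, 0), t2d=np.array([10.0, 20.0]))
```

With identity rotation, the first vertex maps to (10, 20) and the second to (12, 24): scale, drop the depth, then translate.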
3.2 Nonlinear 3DMM
As mentioned in Sec. 1, the linear 3DMM has problems such as requiring 3D face scans for supervised learning, an inability to leverage massive in-the-wild face images for learning, and limited representation power due to its linear bases. We propose to learn a nonlinear 3DMM model using only large-scale in-the-wild 2D face images.
3.2.1 Problem Formulation
In the linear 3DMM, the factorization of each component (shape, albedo) can be seen as a matrix multiplication between coefficients and bases. From a neural network’s perspective, this can be viewed as a shallow network with only one fully connected layer and no activation function. Naturally, to increase the model’s representation power, the shallow network can be extended to a deep architecture. In this work, we design a novel learning scheme to jointly learn a deep 3DMM model and its inference (or fitting) algorithm.
Specifically, as shown in Fig. 2, we use two deep networks to decode the shape and albedo parameters into the 3D facial shape and albedo, respectively. To make the framework end-to-end trainable, these parameters are estimated by an encoder network, which is essentially the fitting algorithm of our 3DMM. The three deep networks join forces for the ultimate goal of reconstructing the input face image, with the assistance of a physically-based rendering layer. Fig. 2 visualizes the architecture of the proposed framework. Each component will be presented in the following sections.
Formally, given a set of 2D face images $\{\mathbf{I}_i\}_{i=1}^{N}$, we aim to learn an encoder $E: \mathbf{I} \rightarrow \mathbf{m}, \mathbf{L}, \mathbf{f}_S, \mathbf{f}_A$ that estimates the projection parameter $\mathbf{m}$, lighting parameter $\mathbf{L}$, shape parameter $\mathbf{f}_S$, and albedo parameter $\mathbf{f}_A$; a 3D shape decoder $D_S: \mathbf{f}_S \rightarrow \mathbf{S}$ that decodes the shape parameter to a 3D shape $\mathbf{S}$; and an albedo decoder $D_A: \mathbf{f}_A \rightarrow \mathbf{A}$ that decodes the albedo parameter to a realistic albedo $\mathbf{A}$, with the objective that the image rendered with $\mathbf{m}$, $\mathbf{L}$, $\mathbf{S}$, and $\mathbf{A}$ can well approximate the original image. Mathematically, the objective function is:

$$\underset{E,\, D_S,\, D_A}{\arg\min} \; \sum_{i=1}^{N} \left\| \mathcal{R}\big(E_m(\mathbf{I}_i), E_L(\mathbf{I}_i), D_S(E_S(\mathbf{I}_i)), D_A(E_A(\mathbf{I}_i))\big) - \mathbf{I}_i \right\|_1,$$

where $\mathcal{R}(\mathbf{m}, \mathbf{L}, \mathbf{S}, \mathbf{A})$ is the rendering layer (Sec. 3.2.3).
3.2.2 Albedo & Shape Representation
Fig. 3 illustrates three possible albedo representations. In the traditional 3DMM, albedo is defined per vertex (Fig. 3(a)). This representation is also adopted in recent work such as [49, 27]. There is an albedo intensity value corresponding to each vertex in the face mesh. Despite being widely used, this representation has limitations. Since 3D vertices are not defined on a 2D grid, this representation is mostly parameterized as a vector, which not only loses the spatial relations of its vertices, but also prevents leveraging the convenience of deploying CNNs on a 2D albedo. In contrast, given the rapid progress in image synthesis, it is desirable to choose a 2D image, e.g., a frontal-view face image as in Fig. 3(b), as the albedo representation. However, frontal faces contain little information about the two sides of the face, which would lose much albedo information for side-view faces.
In light of these considerations, we use an unwrapped 2D texture as our texture representation (Fig. 3(c)). Specifically, each 3D vertex $\mathbf{v} = (x, y, z)$ is projected onto the UV space using a cylindrical unwrap. Assuming that the face mesh has its top pointing up the $y$ axis, the projection of $\mathbf{v}$ onto the UV space $(u, v)$ is computed as:

$$u \rightarrow \alpha_1 \arctan\!\left(\frac{x}{z}\right) + \beta_1, \qquad v \rightarrow \alpha_2\, y + \beta_2,$$

where $\alpha_1, \alpha_2, \beta_1, \beta_2$ are constant scale and translation scalars that place the unwrapped face within the image boundaries. Here, the per-vertex albedo $\mathbf{A}(\mathbf{v})$ can be easily computed by sampling from its UV space counterpart $\mathbf{A}(u, v)$. Usually, this involves sub-pixel sampling via bilinear interpolation:

$$\mathbf{A}(\mathbf{v}) = \sum_{u' \in \{\lfloor u \rfloor, \lceil u \rceil\}} \; \sum_{v' \in \{\lfloor v \rfloor, \lceil v \rceil\}} \mathbf{A}(u', v')\,\big(1 - |u - u'|\big)\big(1 - |v - v'|\big),$$

where $(u, v)$ is the UV space projection of $\mathbf{v}$ via Eqn. 6.
Albedo information is naturally expressed in the UV space, but spatial data can be embedded in the same space as well. Here, the 3D facial mesh can be represented as a 2D image with three channels, one for each spatial dimension $x$, $y$, and $z$. Fig. 4 gives an example of this UV space shape representation.
Representing the 3D face shape in UV space allows us to use a CNN for the shape decoder instead of a multi-layer perceptron (MLP) as in our preliminary version. Avoiding wide fully-connected layers allows us to use a deeper network for $D_S$, potentially modeling more complex shape variations. This results in better fitting results, as demonstrated in our experiments (Sec. 4.1.2).
It is also worth noting that, unlike our preliminary version, where the reference UV space for texture is built upon the projection of the mean shape with neutral expression, in this version the reference shape used has the mouth open. This change helps the network avoid learning large gradients near the borders of the two lips in the vertical direction when the mouth is open.
To regress these 2D representations of shape and albedo, we employ CNNs as the shape and albedo networks, respectively. Specifically, $D_S$ and $D_A$ are CNNs constructed from multiple fractionally-strided convolution layers. Each convolution is followed by batch normalization and an eLU activation, except for the last convolution layers of the encoder and decoders. The output layer has a tanh activation to constrain the output to the range $[-1, 1]$. The detailed network architecture is presented in Tab. I.
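To illustrate what a fractionally-strided convolution layer does, here is a single-channel numpy sketch: insert zeros between input pixels, then run an ordinary convolution. Real decoder layers use learned multi-channel kernels; this toy version only demonstrates the upsampling mechanics:

```python
import numpy as np

def frac_strided_conv(x, kernel, stride=2):
    """Fractionally-strided (transposed) convolution on one channel:
    zero-insertion upsampling by `stride`, then a same-padded convolution."""
    h, w = x.shape
    up = np.zeros((h * stride, w * stride))
    up[::stride, ::stride] = x                 # zero-insertion upsampling
    k = kernel.shape[0]
    pad = k // 2
    up = np.pad(up, pad)
    out = np.zeros((h * stride, w * stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(up[i:i + k, j:j + k] * kernel)
    return out

x = np.ones((2, 2))
out = frac_strided_conv(x, np.ones((3, 3)) / 9.0)
assert out.shape == (4, 4)                     # each layer doubles the resolution
```

Stacking several such layers lets a compact parameter vector grow into a full-resolution UV shape or albedo map.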
3.2.3 In-Network Physically-Based Face Rendering
To reconstruct a face image from the albedo $\mathbf{A}$, shape $\mathbf{S}$, lighting parameter $\mathbf{L}$, and projection parameter $\mathbf{m}$, we define a rendering layer that renders a face image from the above parameters. This is accomplished in three steps, as shown in Fig. 5. First, the facial texture is computed using the albedo $\mathbf{A}$ and the surface normal map of the rotated shape. Here, following prior work, we assume distant illumination and a purely Lambertian surface reflectance. Hence, the incoming radiance can be approximated using spherical harmonics (SH) basis functions $H_b$, controlled by coefficients $\boldsymbol{\gamma}$. Specifically, the texture in UV space is composed of the albedo $\mathbf{A}$ and the shading $\mathbf{C}$:

$$\mathbf{T} = \mathbf{A} \odot \mathbf{C}, \qquad \mathbf{C}(\mathbf{n}) = \sum_{b=1}^{B^2} \gamma_b H_b(\mathbf{n}),$$

where $B$ is the number of spherical harmonics bands. We use $B = 3$, which leads to $B^2 = 9$ coefficients in $\boldsymbol{\gamma}$ for each of the three color channels. Second, the 3D shape/mesh is projected onto the image plane via Eqn. 4. Finally, the 3D mesh is rendered using a Z-buffer renderer, where each pixel is associated with a single triangle of the mesh. The triangle enclosing a pixel after projection is found, and its vertices are mapped into the reference UV space using Eqn. 6. To handle occlusions, when a single pixel resides in more than one triangle, the triangle closest to the image plane is selected. The final value of each pixel is determined by interpolating the texture at the three vertices via the pixel’s barycentric coordinates.
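The albedo-times-shading composition under SH lighting can be sketched as below. The un-normalized SH basis (per-band constants folded into the coefficients) is a simplifying assumption of this sketch:

```python
import numpy as np

def sh_basis(n):
    """First three SH bands (B = 3 -> 9 basis values) for a unit normal n.
    Un-normalized basis; per-band constants are folded into the coefficients."""
    x, y, z = n
    return np.array([
        1.0,                     # band 0: ambient
        x, y, z,                 # band 1: linear in the normal
        x * y, x * z, y * z,     # band 2 ...
        x**2 - y**2,
        3 * z**2 - 1,
    ])

def shade(albedo, normals, gamma):
    """Texture = albedo * shading, shading = SH basis dotted with coefficients."""
    shading = np.array([sh_basis(n) @ gamma for n in normals])
    return albedo * shading

normals = np.array([[0.0, 0.0, 1.0]])
gamma = np.zeros(9); gamma[0] = 1.0     # ambient-only lighting
tex = shade(np.array([0.5]), normals, gamma)
```

With ambient-only coefficients the shading is 1 everywhere, so the texture equals the albedo; non-zero higher-band coefficients modulate it by surface orientation.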
There are alternative designs to our rendering layer. If the texture representation is defined per vertex, as in Fig. 3(a), one may warp the input image onto the vertex space of the 3D shape, whose distance to the per-vertex texture representation can form a reconstruction loss. This design is adopted by the recent work of [49, 27]. In comparison, our rendered image is defined on a 2D grid, while the alternative is defined on top of the 3D mesh. As a result, our rendered image can enjoy the convenience of applying perceptual or adversarial losses, which are shown to be critical in improving the quality of synthesized texture. Another design for the rendering layer is image warping based on spline interpolation. However, this warping is continuous: every pixel in the input maps to the output. Hence, this warping operation fails in occluded regions. As a result, Cole et al. limit their scope to synthesizing only frontal-view faces by warping from normalized faces.
The CUDA implementation of our rendering layer is publicly available at https://github.com/tranluan/Nonlinear_Face_3DMM.
3.2.4 Occlusion-aware Rendering
Very often, in-the-wild faces are occluded by glasses, hair, hands, etc. Trying to reconstruct these abnormal occluded regions could make model learning more difficult, or result in a model with external occlusions baked in. Hence, we propose to use a segmentation mask $\mathbf{M}$ to exclude occluded regions from the rendering pipeline:

$$\hat{\mathbf{I}} = \mathbf{M} \odot \mathcal{R}(\mathbf{m}, \mathbf{L}, \mathbf{S}, \mathbf{A}) + (1 - \mathbf{M}) \odot \mathbf{I}.$$
As a result, these occluded regions do not affect our optimization process. The foreground mask $\mathbf{M}$ is estimated using the segmentation method of Nirkin et al. Examples of segmentation masks and rendering results can be found in Fig. 6.
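One way to realize this masked rendering is a simple per-pixel composite; the function and variable names below are ours, chosen for illustration:

```python
import numpy as np

def composite(rendered, image, mask):
    """Occlusion-aware output: keep the rendered face inside the un-occluded
    segmentation mask and copy the input image everywhere else."""
    mask = mask[..., None]                  # broadcast over the color channels
    return mask * rendered + (1 - mask) * image

rendered = np.full((2, 2, 3), 0.9)          # toy rendered face
image = np.zeros((2, 2, 3))                 # toy input image
mask = np.array([[1.0, 0.0], [0.0, 1.0]])   # 1 = visible skin, 0 = occluded
out = composite(rendered, image, mask)
```

Since occluded pixels are copied straight from the input, their reconstruction error is exactly zero and no gradient flows from them.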
(Tab. I: detailed network architectures; columns: Layer, Filter/Stride, Output Size.)
3.2.5 Model Learning
The entire network is trained end-to-end to reconstruct the input images, with the loss function:

$$L = L_{rec} + \lambda_{lan} L_{lan} + \lambda_{reg} L_{reg},$$

where the reconstruction loss $L_{rec}$ enforces the rendered image $\hat{\mathbf{I}}$ to be similar to the input $\mathbf{I}$, the landmark loss $L_{lan}$ enforces geometry constraints, and the regularization loss $L_{reg}$ encourages plausible solutions.
Reconstruction Loss. The main objective of the network is to reconstruct the original face via a disentangled representation. Hence, we enforce the reconstructed image to be similar to the original input image:

$$L_{pix} = \frac{1}{|\mathcal{V}|} \sum_{q \in \mathcal{V}} \left\| \hat{\mathbf{I}}(q) - \mathbf{I}(q) \right\|_2,$$

where $\mathcal{V}$ is the set of all pixels in the images covered by the estimated face mesh. Different norms can be used to measure the closeness. To better handle outliers, we adopt the robust $\ell_{2,1}$ norm, where the distance in the 3D RGB color space is based on $\ell_2$, and the summation over all pixels enforces sparsity based on the $\ell_1$-norm [7, 62].
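A numpy sketch of this $\ell_{2,1}$ reconstruction loss over the masked face region (names and the toy data are assumptions):

```python
import numpy as np

def l21_loss(pred, target, mask):
    """Robust l_{2,1} loss: l2 over the RGB channels of each pixel,
    then an l1-style average over the pixels covered by the face mask."""
    diff = np.linalg.norm(pred - target, axis=-1)   # per-pixel l2 in color space
    return np.sum(diff * mask) / np.sum(mask)

pred = np.zeros((1, 2, 3))
target = np.zeros((1, 2, 3))
target[0, 0] = [3.0, 4.0, 0.0]                      # one pixel off by l2 = 5
loss = l21_loss(pred, target, np.ones((1, 2)))      # (5 + 0) / 2 pixels
```

Compared with a plain squared error, a single badly-explained pixel (e.g., a specular highlight) contributes linearly rather than quadratically.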
To improve upon the blurry reconstruction results of pixel-wise losses, in our preliminary work, thanks to our rendering layer, we employed an adversarial loss to enhance image realism. However, the adversarial objective only encourages the reconstruction to be close to the real image distribution, not necessarily to the input image. It is also known to be unstable to optimize. Here, we propose to use a perceptual loss to enforce closeness between the images $\hat{\mathbf{I}}$ and $\mathbf{I}$, which overcomes both of the adversarial loss’s weaknesses. Besides encouraging the pixels of the output image $\hat{\mathbf{I}}$ to exactly match the pixels of the input $\mathbf{I}$, we encourage them to have similar feature representations as computed by a loss network $F$:

$$L_{perc} = \sum_{j \in J} \frac{1}{|F_j(\mathbf{I})|} \left\| F_j(\hat{\mathbf{I}}) - F_j(\mathbf{I}) \right\|_2^2.$$

We choose VGG-Face as our loss network $F$ to leverage its face-related features and also for simplicity. The loss is summed over $J$, a subset of layers of $F$, where $F_j(\mathbf{I})$ is the activation of the $j$-th layer of $F$ when processing the image $\mathbf{I}$, and $|F_j(\mathbf{I})|$ is its dimension. This feature reconstruction loss is one of the perceptual losses widely used in different image processing tasks.
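The structure of the perceptual loss can be sketched with a stand-in feature extractor; VGG-Face itself is not reproduced here, and `feature_stack` is a hypothetical placeholder for the selected layer activations $F_j$:

```python
import numpy as np

def feature_stack(img):
    """Stand-in for the loss network's layer activations F_j (assumed):
    returns a list of fake 'feature maps' for a flat image vector."""
    return [img, np.cumsum(img)]

def perceptual_loss(pred, target, layers=feature_stack):
    """Sum over selected layers of size-normalized squared feature distances."""
    total = 0.0
    for f_p, f_t in zip(layers(pred), layers(target)):
        total += np.sum((f_p - f_t) ** 2) / f_p.size
    return total
```

With a real network, `feature_stack` would run the frozen VGG-Face forward pass and return the chosen intermediate activations; only the loss structure is shown here.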
The final reconstruction loss is a weighted sum of the two terms:

$$L_{rec} = L_{pix} + \lambda_{perc} L_{perc},$$

where $L_{pix}$ is the pixel-wise $\ell_{2,1}$ term and $\lambda_{perc}$ balances the two.
Sparse Landmark Alignment. To help achieve better model fitting, which in turn improves the model learning itself, we employ a landmark alignment loss, measuring the Euclidean distance between estimated and ground-truth landmarks, as an auxiliary task:

$$L_{lan} = \left\| \mathbf{U} - g\big(\mathbf{S}(:, \mathbf{d}), \mathbf{m}\big) \right\|_2,$$

where $\mathbf{U} \in \mathbb{R}^{2 \times 68}$ holds the manually labeled 2D landmark locations and $\mathbf{d}$ is a constant 68-dim vector storing the indexes of the 3D vertices corresponding to the labeled 2D landmarks. Unlike traditional face alignment work, where the shape bases are fixed, our work jointly learns the basis functions (i.e., the shape decoder $D_S$) as well. Minimizing the landmark loss while updating $D_S$ only moves a tiny subset of vertices. If the shape $\mathbf{S}$ is represented as a vector and $D_S$ is an MLP consisting of fully connected layers, the vertices are independent; hence $L_{lan}$ only adjusts the landmark vertices. In case $\mathbf{S}$ is represented in the UV space and $D_S$ is a CNN, local neighboring regions may also be modified. In both cases, updating $D_S$ based on $L_{lan}$ only moves a subset of vertices, which could lead to implausible shapes. Hence, when optimizing the landmark loss, we fix the decoder $D_S$ and only update the encoder.
Also, note that unlike some prior work, our network only requires ground-truth landmarks during training. It is able to predict landmarks via $E$ and $D_S$ at test time.
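A toy sketch of the landmark alignment loss; the orthographic `ortho` projection and the vertex indexes are illustrative assumptions standing in for $g(\cdot, \mathbf{m})$ and $\mathbf{d}$:

```python
import numpy as np

def landmark_loss(S, project, U, d):
    """Mean Euclidean distance between projected 3D landmark vertices and
    2D labels. S: (3, Q) mesh, d: landmark vertex indexes, U: (2, L) labels."""
    V = project(S[:, d])                    # project only the landmark vertices
    return np.mean(np.linalg.norm(V - U, axis=0))

S = np.array([[1.0, 5.0], [2.0, 6.0], [0.0, 0.0]])  # two toy vertices
ortho = lambda X: X[:2]                              # toy projection: drop z
U = np.array([[1.0], [2.0]])                         # one labeled landmark
loss = landmark_loss(S, ortho, U, [0])
```

Only the indexed vertices receive a gradient from this loss, which is exactly why, as noted above, the shape decoder is frozen while optimizing it.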
Regularizations. To ensure plausible reconstruction, we add a few regularization terms:
Albedo Symmetry. As the face is symmetric, we enforce the albedo symmetry constraint:

$$L_{sym} = \left\| \mathbf{A} - \mathrm{flip}(\mathbf{A}) \right\|_1.$$

Employed on the 2D albedo, this constraint can be easily implemented via a horizontal image flip operation $\mathrm{flip}()$.
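On a UV albedo map, the constraint indeed reduces to one array-reversal, e.g.:

```python
import numpy as np

def symmetry_loss(albedo_uv):
    """Albedo symmetry: mean l1 distance to the horizontal flip of the UV map."""
    return np.mean(np.abs(albedo_uv - albedo_uv[:, ::-1]))

sym = np.array([[1.0, 2.0, 1.0]])       # left-right symmetric row -> zero loss
assert symmetry_loss(sym) == 0.0
```

Asymmetric shading residue left in the albedo (e.g., a one-sided highlight) is penalized directly, pushing such effects into the lighting term instead.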
Albedo Constancy. Using the symmetry constraint can help correct the global shading. However, symmetric details, e.g., dimples, can still be embedded in the albedo channel. To further remove shading from the albedo channel, following Retinex theory, which assumes albedo to be piecewise constant, we enforce sparsity in the two directions of its gradient, similar to [66, 67]:

$$L_{con} = \sum_{\mathbf{v}_j \in \mathcal{N}_i} w_{ij} \left\| \mathbf{A}(\mathbf{v}_i) - \mathbf{A}(\mathbf{v}_j) \right\|_2^{p},$$

where $\mathcal{N}_i$ denotes the set of pixel neighbors of pixel $\mathbf{v}_i$. With the assumption that pixels with the same chromaticity (i.e., $\mathbf{c}(\mathbf{v}_i) = \mathbf{c}(\mathbf{v}_j)$) are more likely to have the same albedo, we set the constant weight $w_{ij} = \exp\left(-\alpha \left\| \mathbf{c}(\mathbf{v}_i) - \mathbf{c}(\mathbf{v}_j) \right\|\right)$, where the chromaticity is referenced from the input image using the current estimated projection. Following [66], we set the two hyperparameters $\alpha$ and $p$ accordingly in our experiments.
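A sketch of the constancy penalty on a UV albedo map; the chromaticity weights are assumed precomputed and passed in, and an exponent of $p = 1$ is assumed for simplicity:

```python
import numpy as np

def constancy_loss(albedo, weight):
    """Piecewise-constancy penalty: chromaticity-weighted absolute gradients
    of a single-channel albedo map along the two UV directions."""
    dx = np.abs(albedo[:, 1:] - albedo[:, :-1])     # horizontal neighbors
    dy = np.abs(albedo[1:, :] - albedo[:-1, :])     # vertical neighbors
    return np.sum(weight[:, 1:] * dx) + np.sum(weight[1:, :] * dy)

flat = np.full((3, 3), 0.7)                          # constant albedo
assert constancy_loss(flat, np.ones((3, 3))) == 0.0  # no gradient, no penalty
```

Where neighboring input pixels differ in chromaticity, the weight drops toward zero, so genuine albedo edges (lips, eyebrows) are not flattened.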
Shape Smoothness. For the shape component, we impose smoothness by adding a Laplacian regularization on the vertex locations, over the set of all vertices:

$$L_{smo} = \sum_{\mathbf{v}_i \in \mathbf{S}} \left\| \mathbf{v}_i - \frac{1}{|\mathcal{N}_i|} \sum_{\mathbf{v}_j \in \mathcal{N}_i} \mathbf{v}_j \right\|_2.$$
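The Laplacian smoothness term can be sketched per vertex against the mean of its mesh neighbors; the adjacency dictionary below is an illustrative assumption (in practice it comes from the fixed mesh topology):

```python
import numpy as np

def laplacian_smoothness(V, neighbors):
    """Sum of squared distances between each vertex and the centroid of its
    mesh neighbors. V: (Q, 3) vertices, neighbors: {index: [indexes]}."""
    loss = 0.0
    for i, nbrs in neighbors.items():
        loss += np.sum((V[i] - V[nbrs].mean(axis=0)) ** 2)
    return loss

# A vertex lying on the straight line between its two neighbors is smooth.
V = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
assert laplacian_smoothness(V, {1: [0, 2]}) == 0.0
```

A vertex bulging away from its neighbors (e.g., noise near the hairline) pays a quadratic penalty, which is what suppresses the noisy side regions noted in Sec. 4.1.1.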
Intermediate Semi-Supervised Training. Fully unsupervised training using only the reconstruction and adversarial losses on the rendered images can lead to a degenerate solution, since the initial estimation is far from ideal for rendering meaningful images. Therefore, we introduce intermediate loss functions to guide the training in the early iterations.
With the face profiling technique, Zhu et al. expand the 300W dataset into a much larger set of images with fitted 3DMM shapes $\tilde{\mathbf{S}}$ and projection parameters $\tilde{\mathbf{m}}$. Given $\tilde{\mathbf{S}}$ and $\tilde{\mathbf{m}}$, we create the pseudo ground-truth texture $\tilde{\mathbf{T}}$ by referring every pixel in the UV space back to the input image, i.e., the backward of our rendering layer. With $\tilde{\mathbf{S}}$, $\tilde{\mathbf{m}}$, and $\tilde{\mathbf{T}}$, we define our intermediate loss by:

$$L_0 = L_{0S} + L_{0T} + L_{0m},$$

where $L_{0S} = \|\mathbf{S} - \tilde{\mathbf{S}}\|_2$, $L_{0T} = \|\mathbf{T} - \tilde{\mathbf{T}}\|_1$, and $L_{0m} = \|\mathbf{m} - \tilde{\mathbf{m}}\|_2$.
It is also possible to provide pseudo ground truth for the SH coefficients $\mathbf{L}$ and, subsequently, the albedo, using least-squares optimization with a constant albedo assumption, as has been done in [59, 67]. However, this estimation is not reliable for in-the-wild images with occluded regions. Also, empirically, with the proposed regularizations, the model is able to explore plausible solutions for these components by itself. Hence, we decide to refrain from supervising $\mathbf{L}$ and the albedo to simplify our pipeline.
Because of the pseudo ground truth, using $L_0$ runs the risk that our solution learns to mimic the linear model. Thus, we switch to the loss of Eqn. 12 after $L_0$ converges. Note that the estimated ground truth of $\tilde{\mathbf{S}}$, $\tilde{\mathbf{m}}$, $\tilde{\mathbf{T}}$ and the landmarks are the only supervision used in our training, for which reason our learning is considered weakly supervised.
4 Experimental Results
The experiments study three aspects of the proposed nonlinear 3DMM: its expressiveness, representation power, and applications to facial analysis. Using the facial mesh triangle definition from the Basel Face Model (BFM), we train our 3DMM using the 300W-LP dataset, which contains in-the-wild face images in a wide pose range from $-90°$ to $90°$. Images are loosely cropped to a square around the face and scaled to a fixed resolution. During training, slightly smaller images are randomly cropped from these images to introduce translation variations.
The model is optimized using the Adam optimizer with the same learning rate in both training stages. The loss weights $\lambda$ are set so that the losses have similar magnitudes.
4.1 Ablation Study
4.1.1 Effect of Regularization
Albedo Regularization. In this work, to regularize albedo learning, we employ two constraints to efficiently remove shading from the albedo, namely albedo symmetry and albedo constancy. To demonstrate the effect of these regularization terms, we compare our full model with its partial variants: one without any albedo regularization and one with the symmetry constraint only. Fig. 7 shows a visual comparison of these models. Learning without any constraint results in the lighting being totally explained by the albedo, while the shading is almost constant (Fig. 7(a)). Using symmetry helps to correct the global lighting. However, symmetric geometry details are still baked into the albedo (Fig. 7(b)). Enforcing albedo constancy helps to further remove shading from it (Fig. 7(c)). Combining these two regularizations helps to learn plausible albedo and lighting, which improves the shape estimation.
Shape Smoothness Regularization. We also evaluate the need for shape regularization. Fig. 8 shows a visual comparison between our model and its variant without the shape smoothness constraint. Without the smoothness term, the learned shape becomes noisy, especially on the two sides of the face. The reason is that the hair region is not completely excluded during training because of imprecise segmentation estimation.
4.1.2 Modeling Lighting and Shape Representation
In this work, we make two major algorithmic changes relative to our preliminary work: incorporating lighting into the model and changing the shape representation.
Our previous work models the texture directly, while this work disentangles the shading from the albedo. As argued, modeling the lighting should have a positive impact on shape learning. Hence, we compare our models with results from our preliminary work on the face alignment task.
Also, in our preliminary work, as well as in the traditional 3DMM, shape is represented as a vector, where the vertices are independent. Despite this shortcoming, the approach has been widely adopted because of its simplicity and sampling efficiency. In this work, we explore an alternative: representing the 3D shape as a position map in the 2D UV space. This representation has three channels, one for each spatial dimension, and maintains the spatial relations among the facial mesh’s vertices. Also, we can use a CNN as the shape decoder, replacing an expensive MLP. Here we also evaluate the performance gain from switching to this representation.
Tab. II reports the performance of different variants on the face alignment task. Modeling lighting helps to reduce the error from to . Using the UV position-map representation, with the convenience of using a CNN, the error is further reduced to .
4.1.3 Comparison to Autoencoders
We compare our model-based approach with a convolutional autoencoder in Fig. 9. The autoencoder network has a similar depth and model size to ours. It gives blurry reconstruction results, as the dataset contains large variations in facial appearance, pose angle, and background. Our model-based approach obtains sharper reconstructions and provides semantic parameters that give access to the different components, including 3D shape, albedo, lighting, and the projection matrix.
(Fig. 10 caption: shapes obtained by varying each parameter by ± standard deviations in opposite directions, ordered by the magnitude of shape changes.)
Exploring the feature space. We feed the entire CelebA dataset  with k images to our network to obtain the empirical distribution of our shape and texture parameters. By varying the mean parameter along each dimension proportionally to its standard deviation, we can get a sense of how each element contributes to the final shape and texture. We sort the elements of the shape parameter by their differences to the mean 3D shape. Fig. 10 shows four examples of shape changes, whose differences rank No. , , , and among all elements. Most of the top changes are expression related. Similarly, in Fig. 11, we visualize different texture changes by adjusting only one element off the mean parameter . The elements with the same ranks are selected.
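This traversal can be sketched as follows (assuming a generic `decoder` callable and per-dimension standard deviations `sigma`; all names are illustrative):

```python
import numpy as np

def traverse_dimension(decoder, mean_z, sigma, dim, k=2.0):
    """Decode mean_z with one dimension moved +/- k standard deviations."""
    z_pos, z_neg = mean_z.copy(), mean_z.copy()
    z_pos[dim] += k * sigma[dim]
    z_neg[dim] -= k * sigma[dim]
    return decoder(z_pos), decoder(z_neg)

def rank_dims_by_shape_change(decoder, mean_z, sigma):
    """Sort latent dimensions by how far the decoded output moves
    from the mean when the dimension is traversed."""
    base = decoder(mean_z)
    diffs = []
    for d in range(len(mean_z)):
        pos, neg = traverse_dimension(decoder, mean_z, sigma, d)
        diffs.append(max(np.abs(pos - base).mean(), np.abs(neg - base).mean()))
    return np.argsort(diffs)[::-1]  # most influential dimension first
```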
Attribute Embedding. To better understand the different shape and albedo instances embedded in our two decoders, we investigate their attribute meaning. For a given attribute, e.g., male, we feed images with that attribute into our encoder to obtain two sets of parameters and . These sets represent the corresponding empirical distributions of the data in the low-dimensional spaces. By computing the mean parameters and feeding them into their respective decoders, also using the mean lighting parameter, we can reconstruct the mean shape and texture for that attribute. Fig. 12 visualizes the reconstructed textured 3D meshes for several attributes. Differences among attributes are present in both shape and texture. Here we can observe the power of our nonlinear 3DMM to model small details such as "bags under eyes" or "rosy cheeks".
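A sketch of this attribute-mean computation (assuming a generic `encoder` that returns a parameter vector per image; names are illustrative):

```python
import numpy as np

def attribute_mean_params(encoder, images, labels):
    """Average the encoded parameters over all images carrying a given
    binary attribute (e.g., 'male'); decoding this mean parameter yields
    the mean shape/texture for that attribute."""
    zs = np.stack([encoder(img)
                   for img, has_attr in zip(images, labels) if has_attr])
    return zs.mean(axis=0)
```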
(Fig. 12 attribute labels: Male, Female, Mustache, Rosy Cheeks, Bags Under Eyes, Bushy Eyebrows, Old, Smiling.)
4.3 Representation Power
We compare the representation power of the proposed nonlinear 3DMM with that of the traditional linear 3DMM.
Albedo. Given a face image, assuming we know the groundtruth shape and projection parameters, we can unwarp the texture into the UV space, as we do when generating the "pseudo groundtruth" texture in the weak supervision step. With the groundtruth texture, by using gradient descent, we can jointly estimate a lighting parameter and an albedo parameter whose decoded texture matches the groundtruth. Alternatively, we can minimize the reconstruction error in the image space, through the rendering layer with the groundtruth and . Empirically, the two methods give similar performance, but we choose the first option as it involves only one warping step, instead of rendering in every optimization iteration. For the linear model, we use the albedo bases of the Basel Face Model (BFM) . As shown in Fig. 13, our nonlinear texture is closer to the groundtruth than that of the linear model. This is expected, since the linear model is trained with controlled images. Quantitatively, our nonlinear model has a significantly lower averaged reconstruction error than the linear model ( vs. ).
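The first (UV-space) fitting option can be sketched as follows. For illustration we use a linear decoder, where the gradient of the L2 reconstruction loss is available in closed form; with the nonlinear decoder, the same loop would rely on autograd instead (names and hyperparameters are our own):

```python
import numpy as np

def fit_texture_params(basis, mean_tex, target, lr=0.01, steps=500):
    """Gradient-descent fit of a code z so that the decoded texture
    mean_tex + basis @ z matches an unwarped target UV texture."""
    z = np.zeros(basis.shape[1])
    for _ in range(steps):
        residual = mean_tex + basis @ z - target
        # gradient of the L2 reconstruction loss ||decode(z) - target||^2
        z -= lr * 2.0 * basis.T @ residual
    return z
```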
3D Shape. We also compare the power of the nonlinear and linear 3DMMs in representing real-world 3D scans. We compare with BFM , the most commonly used 3DMM at present. We use ten 3D face scans provided by , which are not included in the training set of BFM. As these face meshes are already registered using the same triangle definition as BFM, no registration is necessary. Given the groundtruth shape, by using gradient descent, we can estimate a shape parameter whose decoded shape matches the groundtruth. We define the matching criterion on both vertex distances and surface-normal directions, which empirically improves the fidelity of the final results compared to optimizing vertex distances alone. Fig. 14 shows the visual quality of the two models' reconstructions. Our reconstructions closely match the face shape details. To quantify the difference, we use NME, the averaged per-vertex error between the recovered and groundtruth shapes, normalized by the inter-ocular distance. Our nonlinear model has a significantly smaller reconstruction error than the linear model, vs. (Tab. III).
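The combined matching criterion can be sketched as follows (the weight and the cosine form of the normal term are our own illustration):

```python
import numpy as np

def shape_matching_loss(pred, gt, pred_normals, gt_normals, w_normal=0.1):
    """Combined fitting criterion: per-vertex L2 distance plus a cosine
    penalty on surface-normal direction (unit normals assumed), which
    helps preserve fine surface details over vertex distance alone."""
    vert_term = np.linalg.norm(pred - gt, axis=1).mean()
    cos = np.sum(pred_normals * gt_normals, axis=1)
    normal_term = (1.0 - cos).mean()
    return vert_term + w_normal * normal_term
```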
Besides, to evaluate our model's power to represent different facial expressions, we make a similar comparison on 3D scans with different expressions from the BU-3DFE dataset . Here we compare with the FaceWarehouse bilinear model , which is directly learned from 3D scans with various facial expressions. Due to the difference in mesh topology, here we optimize the Chamfer distance  between the groundtruth shape and our estimation. Fig. 15 shows qualitative comparisons on scans with different expressions. Our model has comparable performance in capturing facial expressions, while being better at resembling facial details. This is reflected in the averaged Chamfer distance over all scans in the dataset ( vs. for our model and the FaceWarehouse model, respectively).
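A brute-force version of the symmetric Chamfer distance used here (fine for small point sets; large scans would need a KD-tree or batched nearest-neighbor search):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    the average nearest-neighbor distance taken in both directions, which
    needs no shared vertex indexing between the two meshes."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```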
Having shown the capability of our nonlinear 3DMM (i.e., the two decoders), we now demonstrate the applications of our entire network, which has the additional encoder. Many applications of the 3DMM are centered on its ability to fit 2D face images. Similar to the linear 3DMM, our nonlinear 3DMM can be utilized for model fitting, which decomposes a 2D face into its shape, albedo, and lighting. Fig. 16 visualizes our 3DMM fitting results on the AFLW2000 and CelebA datasets. Our encoder estimates the shape , albedo , as well as the lighting and projection parameters . We can recover personal facial characteristics in both shape and albedo. Our albedo can capture facial hair, which is normally hard to recover with a linear 3DMM.
4.4.1 Face Alignment
Face alignment is a critical step for many facial analysis tasks such as face recognition [71, 72]. With the enhancements in modeling, we hope to improve this task (Fig. 17). We compare our face alignment performance with state-of-the-art methods, 3DDFA , DeFA , 3D-FAN , and PRN , on the AFLW2000 dataset in both 2D and 3D settings.
The accuracy is evaluated using the Normalized Mean Error (NME) as the evaluation metric, with the bounding box size as the normalization factor. For a fair comparison with these methods in terms of computational complexity, we use ResNet18  as our encoder in this comparison. Here, 3DDFA and DeFA use the linear 3DMM (BFM). Even though they are trained with a larger corpus (DeFA) or use a cascade of CNNs to iteratively refine the estimation (3DDFA), these methods are still significantly outperformed by our nonlinear model (Fig. 18). Meanwhile, 3D-FAN and PRN achieve competitive performance by bypassing the linear 3DMM. 3D-FAN uses a heat-map representation; PRN uses a position-map representation, which shares a similar spirit with our UV representation. Besides outperforming these methods in regressing landmark locations (Fig. 18), our model also directly provides the head pose as well as the facial albedo and environment lighting.
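The NME metric with bounding-box normalization is simply the following (a sketch; the exact bounding-box size convention, e.g., the square root of width times height, follows the benchmark protocol):

```python
import numpy as np

def nme(pred_lmk, gt_lmk, bbox_size):
    """Normalized Mean Error: mean per-landmark Euclidean distance
    divided by the bounding-box size."""
    return np.linalg.norm(pred_lmk - gt_lmk, axis=1).mean() / bbox_size
```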
4.4.2 3D Face Reconstruction
We compare our approach to recent representative face reconstruction works: 3DMM fitting networks learned in an unsupervised (Tewari et al. [49, 27]) or supervised fashion (Sela et al. ), and also a non-3DMM approach (Jackson et al. ).
MoFA, the monocular reconstruction work by Tewari et al. , is relevant to ours as they also learn to fit a 3DMM in an unsupervised fashion. Even when trained on in-the-wild images, their method is still limited to the linear bases. Hence, their reconstructions suffer from surface shrinkage when dealing with challenging textures, e.g., facial hair (Fig. 19). Our network faithfully models these in-the-wild textures, which leads to better 3D shape reconstruction.
Concurrently, Tewari et al.  try to improve the representation power of the linear 3DMM by learning a corrective space on top of a traditional linear model. Despite sharing a similar spirit, our unified model exploits the spatial relations between neighboring vertices and uses CNNs as shape/albedo decoders, which is more efficient than MLPs. As a result, our reconstructions more closely match the input images in both texture and shape (Fig. 20).
The high-quality 3D reconstruction works by Richardson et al. [43, 51] and Sela et al.  obtain impressive results in adding fine-level details to the face shape when images are within the span of the synthetic training corpus or the employed 3DMM. However, their performance significantly degrades on variations outside the span of the training data, e.g., facial hair. Our approach is not only robust to facial hair and make-up, but also automatically learns to reconstruct such variations with the jointly learned model. We provide comparisons in Fig. 21, using the code provided by the authors.
The current state-of-the-art method by Sela et al.  consists of three steps: an image-to-image network estimating a depth map and a correspondence map, non-rigid registration, and fine-detail reconstruction. Their image-to-image network is trained on synthetic data generated by the linear model. Besides the domain gap between synthetic and real images, this network faces the more serious problem that facial hair is absent from the low-dimensional texture subspace of the linear model. The network's output tends to ignore these unexplainable regions (Fig. 21), which leads to failure in the later steps. Our network is more robust in handling these in-the-wild variations. Furthermore, our approach is orthogonal to Sela et al.'s fine-detail reconstruction module and Richardson et al.'s FineNet; employing such refinement on top of our fitting could lead to promising further improvements.
We also compare our approach with a non-3DMM approach, VRN, by Jackson et al. . To avoid the low-dimensional subspace of the linear 3DMM, it directly regresses a volumetric 3D shape representation via an encoder-decoder network with skip connections. This potentially lets the network explore a larger solution space than the linear model, at the cost of losing correspondence between facial meshes. Fig. 22 shows a visual comparison of 3D reconstructions by VRN and our method. In general, VRN robustly handles in-the-wild texture variations. However, because of the volumetric shape representation, the surface is not smooth and is partially limited in representing medium-level details, unlike ours. Also, our model further provides the projection matrix, lighting, and albedo, enabling more applications.
To quantitatively compare our method with prior works, we evaluate monocular 3D reconstruction performance on the FaceWarehouse  and Florence  datasets, for which groundtruth 3D shapes are available. Due to the difference in mesh topology, ICP  is used to establish correspondence between the estimated shapes and the groundtruth point clouds. As in previous experiments, NME (averaged per-vertex error normalized by the inter-ocular distance) is used as the comparison metric.
FaceWarehouse. We compare our method with prior works that have available pretrained models on all expressions of subjects in the FaceWarehouse database . Visual and quantitative comparisons are shown in Fig. 23. Our model can faithfully resemble the input expression and significantly surpasses all other regression methods (PRN  and 3DDFA+ ) in terms of dense face alignment.
(Figure panels: (a) CED curves; (b) pose-specific NME.)
Florence. Using the experimental setting proposed in , we also quantitatively compare our approach with state-of-the-art methods (e.g., VRN  and PRN ) on the Florence dataset . Each subject is rendered with multiple poses: pitch rotations of , , and , and yaw rotations between and . Our model consistently outperforms the other methods across different view angles (Fig. 24).
4.4.3 Face Editing
Decomposing a face image into individual components gives us the ability to edit the face by manipulating any of them. Here we show two examples of face editing using our model.
Relighting. First, we show an application that replaces the lighting of a target face image with the lighting of a source face (Fig. 25). After estimating the lighting parameters of the source image, we render the transferred shading using the target shape and the source lighting . This transferred shading then replaces the target's original shading. Alternatively, the value of can be chosen arbitrarily based on the SH lighting model, without the need for a source image. Also, here we use the original texture instead of our decoder's output to maintain image details.
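A sketch of rendering shading from 2nd-order spherical harmonics, following the 9-term irradiance basis of Ramamoorthi and Hanrahan (the basis normalization constants are folded into the lighting coefficients here for simplicity; names are illustrative):

```python
import numpy as np

def sh_shading(normals, light):
    """Per-vertex shading from 2nd-order SH lighting.

    normals: (N, 3) unit surface normals; light: (9,) SH coefficients.
    Relighting renders this with the target shape's normals and the
    source image's lighting coefficients.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    H = np.stack([np.ones_like(x), x, y, z,
                  x * y, x * z, y * z,
                  3.0 * z**2 - 1.0, x**2 - y**2], axis=1)  # (N, 9) basis
    return H @ light  # (N,) scalar shading per vertex
```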
Attribute Manipulation. Given faces fitted by the 3DMM, we can edit images by naively modifying one or more elements of the albedo or shape representation. More interestingly, we can even manipulate semantic attributes, such as growing a beard, smiling, etc. The approach is similar to learning the attribute embedding in Sec. 4.2. Assume we would like to edit the appearance only. For a given attribute, e.g., beard, we feed two sets of images, with and without that attribute, into our encoder to obtain two average parameters and . Their difference is the direction to move from the distribution of negative images to that of positive ones. By adding it with different magnitudes, we can generate modified images with different degrees of change. To achieve high-quality, identity-preserving editing, the final result is obtained by adding the residual, i.e., the difference between the modified image and our reconstruction, to the original input image. This is a critical difference from Shu et al.  that improves result quality (Fig. 26).
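The attribute direction and the residual-based compositing can be sketched as follows (all names are illustrative; `encoder`/`decoder` stand in for the trained networks):

```python
import numpy as np

def attribute_direction(encoder, pos_images, neg_images):
    """Direction in parameter space from the mean of images without the
    attribute to the mean of images with it."""
    z_pos = np.stack([encoder(im) for im in pos_images]).mean(axis=0)
    z_neg = np.stack([encoder(im) for im in neg_images]).mean(axis=0)
    return z_pos - z_neg

def edit(decoder, z, direction, alpha, image, reconstruction):
    """Apply the edit at strength alpha, then add the residual between the
    edited and original reconstructions back onto the input image, so
    identity and image details outside the edit are preserved."""
    edited = decoder(z + alpha * direction)
    return image + (edited - reconstruction)
```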
Since its debut in 1999, the 3DMM has become a cornerstone of facial analysis research, with applications to many problems. Despite its impact, it has drawbacks: it requires training data of 3D scans, it learns from controlled 2D images, and its representation power is limited by the linear bases for both shape and texture. These drawbacks can be formidable when fitting a 3DMM to unconstrained faces, or when learning a 3DMM for generic objects such as shoes. This paper demonstrates an alternative approach to 3DMM learning, where a nonlinear 3DMM can be learned from a large set of in-the-wild face images without collecting 3D face scans. Further, the model-fitting algorithm can be learned jointly with the 3DMM, in an end-to-end fashion.
Our experiments cover diverse aspects of our learned model, some of which may require the subjective judgment of the reader. We hope that both this judgment and the quantitative results can be viewed in the context that, unlike the linear 3DMM, no genuine 3D scans are used in our learning. Finally, we believe that unsupervised or weakly supervised learning of 3D models from large-scale in-the-wild 2D images is a promising research direction, and this work is one step in that direction.
-  V. Blanz and T. Vetter, “A morphable model for the synthesis of 3D faces,” in Proceedings of the 26th annual conference on Computer graphics and interactive techniques, 1999.
-  R. Yu, S. Saito, H. Li, D. Ceylan, and H. Li, “Learning dense facial correspondences in unconstrained images,” in ICCV, 2017.
-  A. T. Tran, T. Hassner, I. Masi, E. Paz, Y. Nirkin, and G. Medioni, “Extreme 3D face reconstruction: Looking past occlusions,” in CVPR, 2018.
-  O. Aldrian and W. A. Smith, “Inverse rendering of faces with a 3D morphable model,” TPAMI, 2013.
-  F. Shi, H.-T. Wu, X. Tong, and J. Chai, “Automatic acquisition of high-fidelity facial performances using monocular videos,” ACM TOG, 2014.
-  J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt, “Real-time expression transfer for facial reenactment.” ACM TOG, 2015.
-  J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of RGB videos,” in CVPR, 2016.
-  B. Amberg, R. Knothe, and T. Vetter, “Expression invariant 3D face recognition with a morphable model,” in FG, 2008.
-  X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Towards large-pose face frontalization in the wild,” in ICCV, 2017.
-  F. C. Staal, A. J. Ponniah, F. Angullia, C. Ruff, M. J. Koudstaal, and D. Dunaway, “Describing crouzon and pfeiffer syndrome based on principal component analysis,” Journal of Cranio-Maxillofacial Surgery, 2015.
-  P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3D face model for pose and illumination invariant face recognition,” in AVSS, 2009.
-  C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou, “Facewarehouse: A 3D facial expression database for visual computing,” TVCG, 2014.
-  L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3D facial expression database for facial behavior research,” in FGR, 2006.
-  J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway, “A 3D morphable model learnt from 10,000 faces,” in CVPR, 2016.
-  M. Zollhöfer, J. Thies, D. Bradley, P. Garrido, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt, “State of the art on monocular 3D face reconstruction, tracking, and applications,” Eurographics, 2018.
-  L. Tran and X. Liu, “Nonlinear 3D morphable model,” in CVPR, 2018.
-  A. Patel and W. A. Smith, “3D morphable face models revisited,” in CVPR, 2009.
-  F. L. Bookstein, “Principal warps: Thin-plate splines and the decomposition of deformations,” TPAMI, 1989.
-  B. Amberg, S. Romdhani, and T. Vetter, “Optimal step nonrigid ICP algorithms for surface registration,” in CVPR, 2007.
-  D. Vlasic, M. Brand, H. Pfister, and J. Popović, “Face transfer with multilinear models,” in ACM TOG, 2005.
-  T. Bolkart and S. Wuhrer, “A groupwise multilinear correspondence optimization for 3D faces,” in ICCV, 2015.
-  P. Koppen, Z.-H. Feng, J. Kittler, M. Awais, W. Christmas, X.-J. Wu, and H.-F. Yin, “Gaussian mixture 3D morphable face model,” Pattern Recognition, 2017.
-  J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou, “3D face morphable models “In-the-wild”,” in CVPR, 2017.
-  J. Booth, A. Roussos, E. Ververas, E. Antonakos, S. Poumpis, Y. Panagakis, and S. P. Zafeiriou, “3D reconstruction of “In-the-wild” faces in images and videos,” TPAMI, 2018.
-  A. T. Tran, T. Hassner, I. Masi, and G. Medioni, “Regressing robust and discriminative 3D morphable models with a very deep neural network,” in CVPR, 2017.
-  C. Nhan Duong, K. Luu, K. Gia Quach, and T. D. Bui, “Beyond principal components: Deep Boltzmann Machines for face modeling,” in CVPR, 2015.
-  A. Tewari, M. Zollhöfer, P. Garrido, F. Bernard, H. Kim, P. Pérez, and C. Theobalt, “Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz,” in CVPR, 2018.
-  H. Wu, X. Liu, and G. Doretto, “Face alignment via boosted ranking models,” in CVPR, 2008.
-  X. Liu, “Discriminative face alignment,” TPAMI, 2009.
-  P. Dollár, P. Welinder, and P. Perona, “Cascaded pose regression,” in CVPR, 2010.
-  A. Jourabloo and X. Liu, “Pose-invariant 3D face alignment,” in ICCV, 2015.
-  ——, “Large-pose face alignment via CNN-based dense 3D model fitting,” in CVPR, 2016.
-  X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, “Face alignment across large poses: A 3D solution,” in CVPR, 2016.
-  X. Zhu, X. Liu, Z. Lei, and S. Li, “Face alignment in full pose range: A 3D total solution,” TPAMI, 2017.
-  F. Liu, D. Zeng, Q. Zhao, and X. Liu, “Joint face alignment and 3D face reconstruction,” in ECCV, 2016.
-  J. McDonagh and G. Tzimiropoulos, “Joint face detection and alignment with a deformable Hough transform model,” in ECCV, 2016.
-  A. Jourabloo, X. Liu, M. Ye, and L. Ren, “Pose-invariant face alignment with a single CNN,” in ICCV, 2017.
-  A. Jourabloo and X. Liu, “Pose-invariant face alignment via CNN-based dense 3D model fitting,” IJCV, 2017.
-  S. Tulyakov and N. Sebe, “Regressing a 3D face shape from a single image,” in ICCV, 2015.
-  Y. Wu and Q. Ji, “Robust facial landmark detection under significant head poses and occlusion,” in ICCV, 2015.
-  J. Roth, Y. Tong, and X. Liu, “Unconstrained 3D face reconstruction,” in CVPR, 2015.
-  ——, “Adaptive 3D face reconstruction from unconstrained photo collections,” TPAMI, 2017.
-  E. Richardson, M. Sela, and R. Kimmel, “3D face reconstruction by learning from synthetic data,” in 3DV, 2016.
-  M. Sela, E. Richardson, and R. Kimmel, “Unrestricted facial geometry reconstruction using image-to-image translation,” in ICCV, 2017.
-  V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” TPAMI, 2003.
-  L. Gu and T. Kanade, “A generative shape regularization model for robust face alignment,” in ECCV, 2008.
-  L. Zhang and D. Samaras, “Face recognition from a single training image under arbitrary unknown lighting using spherical harmonics,” TPAMI, 2006.
-  P. Dou, S. K. Shah, and I. A. Kakadiaris, “End-to-end 3D face reconstruction with deep neural networks,” in CVPR, 2017.
-  A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt, “MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction,” in ICCV, 2017.
-  F. Liu, R. Zhu, D. Zeng, Q. Zhao, and X. Liu, “Disentangling features in 3D face shapes for joint face reconstruction and recognition,” in CVPR, 2018.
-  E. Richardson, M. Sela, R. Or-El, and R. Kimmel, “Learning detailed face reconstruction from a single image,” in CVPR, 2017.
-  H. Kim, M. Zollhöfer, A. Tewari, J. Thies, C. Richardt, and C. Theobalt, “Inversefacenet: Deep single-shot inverse face rendering from a single image,” in CVPR, 2018.
-  K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. Kyle, “Unsupervised training for 3D morphable model regression,” in CVPR, 2018.
-  T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” TPAMI, 2001.
-  X. Liu, P. Tu, and F. Wheeler, “Face model fitting on low resolution images,” in BMVC, 2006.
-  X. Liu, “Video-based face model fitting using adaptive active appearance model,” Image and Vision Computing, 2010.
-  B. T. Phong, “Illumination for computer generated pictures,” Communications of the ACM, 1975.
-  R. Ramamoorthi and P. Hanrahan, “An efficient representation for irradiance environment maps,” in Proceedings of the 28th annual conference on Computer graphics and interactive techniques, 2001.
-  Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras, “Face relighting from a single image under arbitrary unknown lighting conditions,” TPAMI, 2009.
-  F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman, “Face synthesis from facial identity features,” in CVPR, 2017.
-  Y. Nirkin, I. Masi, A. T. Tran, T. Hassner, and G. M. Medioni, “On face segmentation, face swapping, and face perception,” in FG, 2018.
-  J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, “FaceVR: Real-time facial reenactment and eye gaze control in virtual reality,” ACM TOG, 2018.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in BMVC, 2015.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016.
-  P. Garrido, M. Zollhöfer, D. Casas, L. Valgaerts, K. Varanasi, P. Pérez, and C. Theobalt, “Reconstruction of personalized 3D face rigs from monocular video,” ACM TOG, 2016.
-  A. Meka, M. Zollhöfer, C. Richardt, and C. Theobalt, “Live intrinsic video,” ACM TOG, 2016.
-  Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras, “Neural face editing with intrinsic image disentangling,” in CVPR, 2017.
-  C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: Database and results,” Image and Vision Computing, 2016.
-  Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in ICCV, 2015.
-  H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in ICCV, 2017.
-  L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose-invariant face recognition,” in CVPR, 2017.
-  ——, “Representation learning by rotating your faces,” TPAMI, 2018.
-  Y. Liu, A. Jourabloo, W. Ren, and X. Liu, “Dense face alignment,” in ICCVW, 2017.
-  A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2D & 3D face alignment problem?(and a dataset of 230,000 3D facial landmarks),” in ICCV, 2017.
-  Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou, “Joint 3D face reconstruction and dense alignment with position map regression network,” in ECCV, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
-  A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos, “Large pose 3D face reconstruction from a single image via direct volumetric CNN regression,” in ICCV, 2017.
-  A. D. Bagdanov, A. Del Bimbo, and I. Masi, “The florence 2D/3D hybrid face dataset,” in Proceedings of the 2011 joint ACM workshop on Human gesture and behavior understanding. ACM, 2011, pp. 79–80.
-  Z. Shu, S. Hadap, E. Shechtman, K. Sunkavalli, S. Paris, and D. Samaras, “Portrait lighting transfer using a mass transport approach,” ACM TOG, 2018.