Three-dimensional (3D) face models have recently been employed to assist pose- or expression-invariant face recognition, achieving state-of-the-art performance [1, 2, 3]. A crucial step in these 3D-face-assisted recognition methods is to reconstruct the 3D face model from a two-dimensional (2D) face image. Besides its applications in face recognition, 3D face reconstruction is also useful in other face-related tasks, e.g., facial expression analysis [4, 5] and facial animation [6, 7]. While many 3D face reconstruction methods are available, they mostly require landmarks on the face image as input, and have difficulty handling large-pose faces whose landmarks are invisible due to self-occlusion.
Existing studies tackle the problems of facial landmark localization (or face alignment) and 3D face reconstruction separately. However, these two problems form a chicken-and-egg pair. On one hand, 2D face images are projections of 3D faces onto the 2D plane. Given a 3D face and a 3D-to-2D mapping function, it is easy to compute the visibility and position of 2D landmarks. On the other hand, the landmarks provide rich information about facial geometry, which is the basis of 3D face reconstruction. Figure 1 illustrates the relationship between 2D landmarks and 3D faces. That is, the visibility and position of landmarks in the projected 2D image are determined by four factors: the 3D shape, the deformation due to expression, the deformation due to pose, and the camera projection parameters. Given such a clear correlation between 2D landmarks and 3D shape, the two problems should ideally be solved jointly rather than separately as in prior work; this is the core of this work.
Motivated by the aforementioned observation, this paper proposes a unified framework to simultaneously solve the two problems of face alignment and 3D face reconstruction. Two sets of regressors are jointly learned from a training set of paired annotated 2D face images and 3D face shapes. Based on the texture features around landmarks on a face image, one set of regressors (called landmark regressors) gradually moves the landmarks towards their true positions. By utilizing the facial landmarks as clues, the other set of regressors (called shape regressors) gradually improves the reconstructed 3D face. These two sets of regressors are alternately and iteratively applied. Specifically, in each iteration, the adjustment to the landmarks is first estimated via the landmark regressors, and this landmark adjustment is then used to estimate the 3D shape adjustment via the shape regressors. The 3D-to-2D mapping is then computed based on the adjusted 3D shape and 2D landmarks, and it is used to further refine the landmarks.
A preliminary version of this work was published in the 14th European Conference on Computer Vision (ECCV) 2016. We extend that work in four aspects. (i) We explicitly reconstruct expression deformation of 3D faces, so that both PEN (pose and expression normalized) and expressive 3D faces can be reconstructed. (ii) We implement the proposed method with both linear and nonlinear regressors. (iii) We present in detail the application of the proposed method to face recognition. (iv) We carry out a more extensive evaluation with comparisons to state-of-the-art methods. In summary, this paper makes the following contributions.
We present a novel cascaded coupled-regressor based method with linear and non-linear regressions for joint face alignment and 3D face reconstruction from a single 2D image of arbitrary pose and expression.
By integrating 3D shape information, the proposed method can more accurately localize landmarks on images of arbitrary view angles in [-].
We explicitly deal with expression deformation of 3D faces, so that both PEN and expressive 3D faces can be reconstructed at a high accuracy.
We propose a 3D-enhanced approach to improve face recognition accuracy on off-angle and expressive face images based on the reconstructed PEN 3D faces.
We achieve state-of-the-art 3D face reconstruction and face alignment performance on the BU3DFE, AFLW, and AFLW2000-3D databases. We investigate the other-race effect on 3D reconstruction of the proposed method on the FRGC v2.0 database. We demonstrate the effectiveness of our proposed 3D-enhanced face recognition method in improving state-of-the-art deep learning based face matchers on the Multi-PIE and CFP databases.
The rest of this paper is organized as follows. Section 2 briefly reviews related work in the literature. Section 3 introduces in detail the proposed joint face alignment and 3D face reconstruction method and two alternative implementations. Section 4 shows its application to face recognition. Section 5 reports the experimental results. Section 6 concludes the paper.
2 Prior Work
2.1 Face Alignment
Classical face alignment methods, e.g., Active Shape Model (ASM) [14, 15] or Active Appearance Model (AAM) [16, 17, 18, 19], search for landmarks based on global shape models and texture models. Constrained Local Model (CLM)  also utilizes global shape models to regularize the landmark locations, but it employs discriminative local texture models. Regression based methods [21, 22, 23, 24] have been recently proposed to directly estimate landmark locations by applying cascaded regressors to an input image. These methods mostly do not consider the visibility of landmarks under different view angles. Consequently, their performance degrades substantially for non-frontal faces, and their detected landmarks could be ambiguous because the anatomically correct landmarks might be invisible due to self-occlusion (see Fig. 1).
A few methods focused on large-pose face alignment, which can be roughly divided into two categories: multi-view based and 3D model based. Multi-view based methods [25, 26] define different sets of landmarks as templates, one for each view range. Given an input image, they fit the multi-view templates to it and choose the best fitted one as the final result. These methods are usually complicated to apply, and cannot detect invisible self-occluded landmarks. 3D model based methods, in contrast, can better handle self-occluded landmarks with the assistance of 3D face models. Their basic idea is to fit a 3D face model to the input image to recover the 3D landmark locations. Most of these methods [27, 28, 10, 29, 30, 31] use 3D morphable models (3DMM)  — either a simplified one with a sparse set of landmarks [28, 10] or a relatively dense one . They estimate the 3DMM parameters by using cascaded regressors with texture features as the input. In , the visibility of landmarks is explicitly computed, and the method can cope with face of yaw angles ranging from - to , whereas the method in  does not work properly for faces of yaw angles beyond . In , Tulyakov and Sebe propose to directly estimate the 3D landmark locations via texture-feature-based regressors for faces of yaw angles up to .
These existing 3D model based methods regress between 2D image features and 3D landmark locations (or indirectly, 3DMM parameters). While our proposed approach is also based on a 3D model, unlike existing methods, it carries out regressions both on 2D images and in the 3D space. Regressions on 2D images predict 2D landmarks, while regressions in the 3D space predict 3D landmark coordinates. By integrating both regressions, our proposed method can more accurately estimate landmarks, and better handle self-occluded landmarks. It thus works well for images of arbitrary view angles in [-].
2.2 3D Face Reconstruction
Estimating the 3D face geometry from a single 2D image is an ill-posed problem. Existing methods, such as Shape from Shading (SFS) and 3DMM, thus heavily depend on priors or constraints. SFS based methods [34, 35] usually utilize an average 3D face model as a reference, and assume the Lambertian lighting model for the 3D face surface. One limitation of SFS methods lies in their assumed connection between 2D texture clues and 3D shape, which could be too weak to discriminate among different individuals. 3DMM [32, 1, 36, 37, 38] establishes statistical parametric models for both texture and shape, and represents a 3D face as a linear combination of basis shapes and textures. To recover the 3D face from a 2D image, 3DMM-based methods estimate the combination coefficients by minimizing the discrepancy between the input image and the image rendered from the reconstructed 3D face. They can better cope with 2D face images of varying illuminations and poses. However, they still suffer from invisible facial landmarks when the input face has large pose angles. To deal with extreme poses, Lee et al., Qu et al. and Liu et al. propose to discard the self-occluded landmarks or treat them as missing data.
All the aforementioned 3D face reconstruction methods require landmarks as input. Consequently, they either manually mark the landmarks, or employ standalone face alignment methods to automatically locate the landmarks. Very recently, Tran et al. propose a convolutional neural network (CNN) based method to estimate discriminative 3DMM parameters directly from single 2D images without requiring input landmarks. Yet, existing methods always generate 3D faces that have the same pose and expression as the input image, which may not be desired in face recognition due to the challenge of matching 3D faces with expressions. In this paper, we improve 3D face reconstruction by (i) integrating the face alignment step into the 3D face reconstruction procedure, and (ii) reconstructing both expressive and PEN 3D faces, which is shown to be useful for face recognition.
2.3 Unconstrained Face Recognition
Face recognition has developed rapidly in the past decade, especially since the emergence of deep learning techniques. Although automated methods [44, 45, 46] outperform humans in face recognition accuracy on the Labeled Faces in the Wild (LFW) benchmark database, it is still very challenging to recognize faces in unconstrained images with large poses or intensive expressions [47, 48]. Potential reasons for degraded accuracy on off-angle and expressive faces include (i) off-angle faces usually have less discriminative texture information for identification than frontal ones, resulting in small inter-class differences, (ii) cross-view faces (e.g., frontal and profile faces) may have very limited features in common, leading to large intra-class differences, and (iii) pose and expression variations could cause substantial deformation to faces.
Existing methods recognize off-angle and expressive faces either by extracting invariant features or by normalizing out the pose or expression deformation. Yi et al. fitted a 3D face mesh to an arbitrary-view face, and extracted pose-invariant features based on the 3D face mesh adaptively deformed to the input face. In DeepFace, the input face was first aligned to the frontal view with the assistance of a generic 3D face model, and then recognized utilizing a deep network. Zhu et al. proposed to generate frontal and neutral face images from the input images by using 3DMM and deep convolutional neural networks. Very recently, generative adversarial networks (GAN) have been explored by Tran et al. [48, 51] for unconstrained face recognition. They devised a novel network, namely DR-GAN, which simultaneously synthesizes frontal faces and learns pose-invariant feature representations. Hu et al. proposed to directly transform a non-frontal face into a frontal face by Learning a Displacement Field network (LDF-Net). LDF-Net achieves state-of-the-art performance for face recognition across poses on Multi-PIE, especially at large poses. To summarize, all these existing methods carry out pose and expression normalization on 2D faces and utilize merely 2D features for recognition. In this paper, on the contrary, we generate pose and expression normalized 3D faces from the input 2D images, and use these resultant 3D faces to improve the unconstrained face recognition accuracy.
3 Proposed Method
In this section, we introduce the proposed joint face alignment and 3D face reconstruction method and its implementations in detail. We start by defining the 3D face model with separable identity and expression components, and based on this model formulate the problem of interest. We then provide the overall procedure of the proposed method. Afterwards, the preparation of training data is presented, followed by the introduction of key steps in the proposed method, including learning 2D landmark and 3D shape regressors, and estimating 3D-to-2D mapping and landmark visibility. Finally, a deep learning based nonlinear implementation of the proposed method is given.
3.1 Problem Formulation
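We denote an $n$-vertex frontal pose 3D face shape of one subject as
$$S = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{4 \times n}, \tag{1}$$
and represent it as a summation of three components:
$$S = \bar{S} + \Delta S_{Id} + \Delta S_{Exp}, \tag{2}$$
where $\bar{S}$ is the mean of frontal pose and neutral expression 3D face shapes, termed pose-and-expression-normalized (PEN) 3D face shape, $\Delta S_{Id}$ is the difference between the subject's PEN 3D shape (denoted as $S_{Id}$) and $\bar{S}$, and $\Delta S_{Exp}$ is the expression-induced deformation in $S$ w.r.t. $S_{Id}$ (Fig. 2).
We use $S_L$ to denote a subset of $S$ with columns corresponding to $l$ landmarks. The projections of these landmarks onto an image $I$ of the subject with arbitrary view are represented by
$$U = \begin{bmatrix} u_1 & u_2 & \cdots & u_l \\ v_1 & v_2 & \cdots & v_l \end{bmatrix} = f_C \circ f_P(S_L) \approx M \times S_L \in \mathbb{R}^{2 \times l}, \tag{3}$$
where $f_C$ and $f_P$ are, respectively, camera projection and pose-induced deformation. In this work, we employ a 3D-to-2D mapping matrix $M$ to approximate the composite effect of pose-induced deformation and camera projection.
Given a face image $I$, our goal is to simultaneously estimate its landmarks $U$, PEN 3D shape $S_{Id}$, and expression deformation $\Delta S_{Exp}$. Note that, in some contexts, we also write the 3D shape and landmarks as column vectors: $S = (x_1, y_1, z_1, \ldots, x_n, y_n, z_n)^{\mathsf{T}}$ and $U = (u_1, v_1, \ldots, u_l, v_l)^{\mathsf{T}}$, where '$\mathsf{T}$' is the transpose operator.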
3.2 The Overall Procedure
Figure 3 shows the flowchart of the proposed method. Given an image $I$, its 3D shape $S$ is initialized as the mean PEN 3D shape of training faces (i.e., $S^0 = \bar{S}$). Its landmarks $U$ are initialized by placing the mean landmarks of training frontal and neutral faces into the face region specified by a bounding box in $I$ via similarity transforms. $U$ and $S$ are iteratively updated by applying a series of regressors. Each iteration contains three steps: (i) updating landmarks, (ii) updating 3D face shape, and (iii) refining landmarks.
Updating landmarks This step updates the landmarks' locations from $U^{k-1}$ to $\hat{U}^k$ based on the texture features in the image. This is similar to the conventional cascaded regressor based 2D face alignment. The adjustment to the landmarks' locations in the $k$th iteration, $\Delta U^k$, is determined by the local texture feature around $U^{k-1}$ via a regressor,
$$\Delta U^k = R_U^k\big(h(I, U^{k-1})\big), \tag{4}$$
where $h(I, U)$ denotes the texture feature extracted around the landmarks $U$ in the image $I$, and $R_U^k$ is a regression function. The landmarks can then be updated by $\hat{U}^k = U^{k-1} + \Delta U^k$. The method for learning these landmark regressors in the linear case will be introduced in Sec. 3.4.
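Updating 3D face shape In this step, the aforementioned landmark location adjustment is used to estimate the adjustment of the 3D shape $\Delta S^k$, which consists of two components, $\Delta S_{Id}^k$ and $\Delta S_{Exp}^k$. Specifically, a regression function $R_S^k$ models the correlation between the landmark location adjustment $\Delta U^k$ and the expected adjustments $\Delta S_{Id}^k$ and $\Delta S_{Exp}^k$, i.e.,
$$\Delta S^k = [\Delta S_{Id}^k; \Delta S_{Exp}^k] = R_S^k(\Delta U^k). \tag{5}$$
The 3D shape can then be updated by $S^k = S^{k-1} + \Delta S_{Id}^k + \Delta S_{Exp}^k$. The method for learning these shape regressors in the linear case will be given in Sec. 3.5.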
Refining landmarks Once a new estimate of the 3D shape is obtained, the landmarks can be further refined with the assistance of the 3D-to-2D mapping matrix. We estimate $M^k$ based on $S^k$ and $\hat{U}^k$. The refined landmarks $U^k$ can be obtained by projecting $S^k$ onto the image via $M^k$ according to Eq. (3). In this process, the landmark visibility is also re-computed. Details of this step will be given in Sec. 3.6.
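As an illustration only, the three steps of one iteration can be sketched in NumPy as follows. The function names (`fit_mapping`, `run_cascade`), the feature callable `extract_features`, and the use of a single matrix per shape regressor that outputs the total shape adjustment (rather than separate identity and expression parts) are assumptions of this sketch and not the implementation used in this work.

```python
import numpy as np

def fit_mapping(U, S_L):
    """Least-squares 2x4 mapping M such that U ~= M @ S_L.

    U is 2 x l (landmarks); S_L is 4 x l (homogeneous 3D landmark vertices).
    """
    # Solve S_L.T @ M.T ~= U.T in the least-squares sense.
    M_T, *_ = np.linalg.lstsq(S_L.T, U.T, rcond=None)
    return M_T.T

def run_cascade(U, S, idx, landmark_regs, shape_regs, extract_features):
    """Apply K coupled linear regressors (hypothetical, simplified interface).

    U: 2 x l initial landmarks; S: 4 x n initial homogeneous shape;
    idx: indices of the l landmark vertices in S;
    landmark_regs[k]: (2l x d) matrix; shape_regs[k]: (3n x 2l) matrix;
    extract_features(U) -> d-dim feature vector around the landmarks.
    """
    U, S = U.copy(), S.copy()
    M = None
    for R_U, R_S in zip(landmark_regs, shape_regs):
        # (i) update landmarks from texture features
        dU = R_U @ extract_features(U)        # (u1, v1, ..., ul, vl)
        U = U + dU.reshape(-1, 2).T
        # (ii) update the 3D shape from the landmark adjustment
        dS = R_S @ dU                         # (x1, y1, z1, ..., xn, yn, zn)
        S[:3] += dS.reshape(-1, 3).T
        # (iii) refine landmarks by projecting the updated shape
        M = fit_mapping(U, S[:, idx])
        U = M @ S[:, idx]
    return U, S, M
```

With all-zero regressors the loop leaves the shape unchanged and step (iii) simply re-projects the landmark vertices, which is a convenient sanity check for the vectorization conventions.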
3.3 Training Data Preparation
Before we provide details of the three steps, we first introduce the training data needed for learning the landmark and shape regressors, which will also facilitate the understanding of our algorithms. Since the purpose of these regressors is to gradually adjust the estimated landmarks and shape towards their ground truth, we need a sufficient number of triplet data $\{(I_i, S_i^*, U_i^*) \mid i = 1, 2, \ldots, N\}$, where $S_i^*$ and $U_i^*$ are, respectively, the ground truth 3D shape and landmarks for the image $I_i$, and $N$ is the total number of training samples. All the 3D shapes have established dense correspondences among their vertices; i.e., they have the same number of vertices, and vertices of the same index in the 3D shapes have the same semantic meaning. Here, each of the ground truth 3D shapes includes two parts, the PEN 3D shape $S_{Id}^*$ and its expression shape $S_{Exp}^*$, i.e., $S^* = [S_{Id}^*; S_{Exp}^*]$. Moreover, both visible and invisible landmarks in $I_i$ have been annotated and included in $U_i^*$. For invisible landmarks, the annotated positions should be anatomically correct positions (e.g., the red points in Fig. 1).
Obviously, to enable regressors to cope with expression and pose variations, the training data should contain faces of these variations. It is, however, difficult to find in the public domain such data sets of 3D faces and corresponding annotated 2D images with various expressions/poses. Thus, we construct two training sets by ourselves: one based on BU3DFE , and the other based on 300W-LP [53, 10].
BU3DFE database contains 3D face scans of females and males, acquired in neutral plus six basic expressions (happiness, disgust, fear, anger, surprise and sadness). All basic expressions are acquired at four intensity levels. These 3D scans have been manually annotated with landmarks ( landmarks provided by the database plus one nose tip marked by ourselves). For each of the subjects, we select the neutral scan and the level-one intensity scans of the remaining six expressions as the ground truth 3D face shapes. From each of the chosen seven scans of a subject, face images are rendered at different poses (- to yaw with a interval) with landmark locations recorded. As a result, each subject has images of different poses and expressions. We use the method in  to establish dense correspondence of the 3D scans of vertices. With the registered 3D scans, we compute the mean PEN 3D face shape by averaging all the subjects' PEN 3D shapes, which are defined by their 3D scans of frontal pose and neutral expression. All the images of one subject share the same PEN 3D shape of that subject, while their expression shapes are obtained by subtracting the subject's PEN 3D face shape from the corresponding 3D scan and then adding the mean PEN 3D shape.
300W-LP database  is created based on 300W  database, which integrates multiple face alignment benchmark datasets (i.e., AFW , LFPW , HELEN , IBUG  and XM2VTS ). It includes in-the-wild images of a wide variety of poses and expressions. For each image, its corresponding registered PEN 3D shape and expression shape are estimated by using the method in  based on BFM  and FaceWarehouse . The obtained 3D faces have vertices. Figures 4 and 5 show example images and corresponding PEN 3D shapes and expression shapes in our training sets.
3.4 Learning Landmark Regressors
According to Eq. (4), landmark regressors estimate the adjustment to $U^{k-1}$ such that the updated landmarks $\hat{U}^k$ are closer to their ground truth, which, along with landmark visibility, is given by $U^*$ in training. Therefore, the objective of the landmark regressors $R_U^k$ is to better predict the difference between $U^{k-1}$ and $U^*$. In this section, we first implement the proposed method in a linear manner, by optimizing:
$$R_U^k = \arg\min_{R_U^k} \sum_{i=1}^{N} \left\| \left(U_i^* - U_i^{k-1}\right) - R_U^k\, h\!\left(I_i, U_i^{k-1}\right) \right\|_2^2, \tag{6}$$
which has a closed-form least-squares solution. Note that, as we will show later, other nonlinear regression schemes, such as CNN, can also be adopted in our framework.
We use -dim SIFT descriptors  as the local feature. The feature vector of is a concatenation of the SIFT descriptors at all the landmarks, i.e., a -dim vector. If a landmark is invisible, no feature will be extracted, and its corresponding entries of will be zero. Note that the regressors estimate the semantic locations of all landmarks including invisible ones.
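The feature assembly described above can be illustrated with a short NumPy sketch. It assumes that per-landmark descriptors have already been computed by some external routine (the descriptor extraction itself is not shown), and the function name `assemble_features` is hypothetical.

```python
import numpy as np

def assemble_features(descriptors, visible):
    """Concatenate per-landmark descriptors into one feature vector.

    descriptors: (l, d) array, one local descriptor per landmark
                 (assumed to be precomputed, e.g., by a SIFT routine).
    visible:     (l,) boolean array of landmark visibility.
    Entries belonging to invisible landmarks are set to zero.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    masked = np.where(visible[:, None], descriptors, 0.0)
    return masked.reshape(-1)   # (l * d,) vector, landmark by landmark
```

Keeping the zeroed slots in place (instead of dropping them) preserves a fixed feature length, so the same linear regressor can be applied regardless of which landmarks are visible.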
3.5 Learning 3D Shape Regressors
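The landmark adjustment $\Delta U^k$ is also used as the input to the 3D shape regressor $R_S^k$. The objective of $R_S^k$ is to compute an update to the initially estimated 3D shape $S^{k-1}$ in the $k$th iteration to minimize the difference between the updated 3D shape and the ground truth. Using similar linear regressors, the 3D shape regressors can be learned by solving the following optimization via least squares:
$$R_S^k = \arg\min_{R_S^k} \sum_{i=1}^{N} \left\| \left(S_i^* - S_i^{k-1}\right) - R_S^k\, \Delta U_i^k \right\|_2^2, \tag{7}$$
with its closed-form solution as
$$R_S^k = \Delta S^k \left(\Delta U^k\right)^{\mathsf{T}} \left(\Delta U^k \left(\Delta U^k\right)^{\mathsf{T}}\right)^{-1}, \tag{8}$$
where $\Delta S^k = S^* - S^{k-1}$ and $\Delta U^k = \hat{U}^k - U^{k-1}$ are, respectively, the 3D shape and landmark adjustment. $S$ and $U$ denote, respectively, the ensemble of 3D face shapes and 2D landmarks of all training samples with one column per sample.
Since $S \in \mathbb{R}^{6n \times N}$ (recall that $S$ has two parts, PEN shape and expression deformation) and $U \in \mathbb{R}^{2l \times N}$, it can be mathematically shown that $N$ should be larger than $2l$ so that $\Delta U^k (\Delta U^k)^{\mathsf{T}}$ is invertible. Fortunately, since the landmark set is usually sparse, this requirement can be easily satisfied in real-world applications.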
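A minimal NumPy sketch of this least-squares step is given below, assuming that the shape and landmark adjustments of all training samples are stacked as matrix columns. The function names are hypothetical; the first variant uses `np.linalg.lstsq`, which is usually numerically safer than forming the inverse explicitly, and the second mirrors the closed-form expression directly.

```python
import numpy as np

def learn_shape_regressor(dS, dU):
    """Linear regressor R minimizing ||dS - R @ dU||_F^2.

    dS: (d_s, N) shape adjustments, one column per training sample.
    dU: (d_u, N) landmark adjustments, one column per training sample.
    """
    if dU.shape[1] < dU.shape[0]:
        raise ValueError("need at least as many samples as landmark coordinates")
    # Solve dU.T @ R.T ~= dS.T in the least-squares sense.
    R_T, *_ = np.linalg.lstsq(dU.T, dS.T, rcond=None)
    return R_T.T

def learn_shape_regressor_closed_form(dS, dU):
    """Same solution via R = dS dU^T (dU dU^T)^{-1}; requires dU dU^T invertible."""
    return dS @ dU.T @ np.linalg.inv(dU @ dU.T)
```

When the number of samples is smaller than the number of landmark coordinates, the Gram matrix `dU @ dU.T` is rank deficient, which is why the sketch rejects that case up front.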
3.6 3D-to-2D Mapping and Landmark Visibility
In order to refine landmarks with the updated 3D shape, we project the 3D shape to the 2D image with a 3D-to-2D mapping matrix. In this paper, we dynamically estimate the mapping matrix based on $S^k$ and $\hat{U}^k$. As discussed in Sec. 3.1, the mapping matrix is a composite effect of pose-induced deformation and camera projection. By assuming a weak perspective camera projection as in prior work [61, 28], the mapping matrix $M^k$ is represented by a $2 \times 4$ matrix, and can be estimated as a least-squares solution to the following fitting problem:
$$M^k = \arg\min_{M^k} \left\| \hat{U}^k - M^k \times S_L^k \right\|_2^2. \tag{9}$$
Once a new mapping matrix is computed, the landmarks can be further refined as $U^k = M^k \times S_L^k$.
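The visibility of the landmarks can then be computed based on the mapping matrix using the method in . Suppose the average surface normal around a landmark in the 3D face shape $S^k$ is $\vec{n}$. Its visibility $v$ is measured by
$$v = \frac{1}{2}\left(1 + \mathrm{sgn}\!\left(\vec{n} \cdot \left(\frac{m_1}{\|m_1\|} \times \frac{m_2}{\|m_2\|}\right)\right)\right), \tag{10}$$
where $\mathrm{sgn}(\cdot)$ is the sign function, '$\cdot$' means dot product and '$\times$' cross product, and $m_1$ and $m_2$ are the left-most three elements at the top two rows of $M^k$. This rotates the surface normal and checks whether it points toward the camera.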
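The visibility test can be sketched in NumPy as below. The sketch assumes that the viewing direction is the cross product of the normalized first two rows of the mapping matrix (restricted to their left-most three elements) and that a positive dot product with the surface normal means "visible"; the actual sign convention depends on the coordinate system, and the function name is hypothetical.

```python
import numpy as np

def landmark_visibility(M, normals):
    """Visibility of landmarks from a 2x4 mapping matrix.

    M:       (2, 4) 3D-to-2D mapping matrix.
    normals: (3, l) average surface normals around the l landmarks.
    Returns an (l,) array with 1 for visible, 0 for invisible
    (0.5 for a normal exactly perpendicular to the viewing direction).
    """
    m1, m2 = M[0, :3], M[1, :3]
    # Direction orthogonal to the two image axes expressed in 3D.
    view = np.cross(m1 / np.linalg.norm(m1), m2 / np.linalg.norm(m2))
    return 0.5 * (1.0 + np.sign(normals.T @ view))
```

For a frontal mapping whose rows are the x and y axes, the viewing direction is the z axis, so normals with a positive z component are reported as visible.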
Algorithm 1 summarizes the process of learning the cascaded coupled linear regressors. Next, we introduce an alternative implementation of our proposed method by using nonlinear regressors, i.e., neural networks.
3.7 Nonlinear Regressors
In the above linear implementation, linear regressors with hand-crafted features are used. Here, we provide a nonlinear implementation, in which landmark and 3D shape regressors are implemented by deep convolutional neural networks (DCNN) and multilayer perceptrons (MLP), respectively. Figure 6 shows its pipeline.
Given a face image, as in linear implementation, its landmarks and 3D shape are initialized as the average landmarks and the average 3D shape. In every iteration, a landmark heatmap , which has the same dimension as the input image, is generated from the current estimated landmarks. The value of pixel in the heatmap is set as the accumulated contributions of the visible landmarks, and the contribution of a landmark is determined by
The heatmap and face image are stacked together as input to the DCNN-based landmark regressor. In this paper, we employ the structure of Deep Alignment Network (DAN), and adapt its output layer so that the landmark adjustment is estimated. The obtained landmark adjustment is then fed into the MLP-based 3D shape regressor (Deep Reconstruction Network, or DRN). DRN, consisting of a fully connected layer and a tanh() activation function, computes the 3D shape adjustment. After updating the 3D shape with the shape adjustment, we further refine the landmarks as in Sec. 3.6.
The DCNN- and MLP-based regressors are learned iteratively. We first train the regressors in prior iteration until convergence, and then move on to the next iteration. We employ the Euclidean loss in training both regressors.
4 Application to Face Recognition
In this section, we apply the reconstructed 3D faces to improve face recognition accuracy on off-angle and expressive faces. The basic idea is to utilize the additional feature provided by the reconstructed PEN 3D faces and fuse it with conventional 2D face matchers. Figure 7 shows the proposed 3D-enhanced face recognition method. As can be seen, 3D face reconstruction methods are applied to both gallery and probe faces to generate PEN 3D faces. The iterative closest point (ICP) algorithm is applied to match the reconstructed normalized 3D face shapes. It aligns the 3D shapes reconstructed from probe and gallery images, and computes their distances, which are then converted to similarity scores by subtracting them from the maximum distance. These scores are finally normalized to the range of [0, 1] via min-max normalization, and fused with the scores of the conventional 2D face matcher (which are also within [0, 1]) by a sum rule. The recognition result for a probe is defined as the subject whose gallery sample has the highest match score with it. Note that we employ the ICP-based 3D face matcher and the sum fusion rule for simplicity. Other more elaborate 3D face matchers and fusion rules can also be applied with our proposed method. Thanks to the additional discriminative feature in PEN 3D face shapes and its robustness to pose and expression variations, the accuracy of conventional 2D face matchers on off-angle and expressive face images can be effectively improved after fusion with the PEN 3D face based matcher. In the next section, we will experimentally demonstrate this.
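The score conversion and fusion described above can be illustrated with a small NumPy sketch. It assumes that the ICP distances between one probe and all gallery samples are already available and that the 2D matcher scores already lie in [0, 1]; the function names are hypothetical.

```python
import numpy as np

def min_max(x):
    """Min-max normalize a score vector to [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def fuse_scores(icp_distances, scores_2d):
    """Sum-rule fusion of 3D (ICP-based) and 2D match scores for one probe.

    icp_distances: (G,) ICP distances between the probe's reconstructed
                   PEN 3D shape and each of the G gallery shapes.
    scores_2d:     (G,) similarity scores of a 2D matcher, assumed in [0, 1].
    """
    d = np.asarray(icp_distances, dtype=float)
    sim_3d = min_max(d.max() - d)   # distance -> similarity, then normalize
    return sim_3d + np.asarray(scores_2d, dtype=float)

# The recognized identity is the gallery entry with the highest fused score.
fused = fuse_scores([1.0, 3.0, 2.0], [0.2, 0.9, 0.5])
best_gallery_index = int(np.argmax(fused))
```

In this toy example the 2D matcher alone would pick the second gallery entry, whereas the fused score favors the first one, whose 3D shape is closest to the probe.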
5 Experiments
We conduct three sets of experiments to evaluate the proposed method in 3D face reconstruction, face alignment, and face recognition.
5.1 3D Face Reconstruction Accuracy
To evaluate the 3D shape reconstruction accuracy, a -fold cross validation is applied to split the BU3DFE data into training and testing subsets, resulting in training and testing samples. We compare the proposed method with its preliminary version in  and three state-of-the-art methods in [41, 3, 42]. The methods in [8, 42] reconstruct PEN 3D faces only, while the methods in [41, 3] reconstruct 3D faces that have the same pose and expression as the input images. Moreover, the method in  requires that visible landmarks are available together with the input images. In the following experiments, we use the visible landmarks projected from ground truth 3D faces for . For the methods of [3, 42], we use the implementation provided by the authors. In the implementation, these two methods are based on the landmarks that are detected by using . As a result, they cannot be applied to faces of large poses (i.e., beyond degrees).
We use two metrics to evaluate the 3D face reconstruction accuracy: Mean Absolute Error (MAE) and Normalized Per-vertex Depth Error (NPDE). MAE is defined as:
$$\mathrm{MAE} = \frac{1}{N_T} \sum_{i=1}^{N_T} \left( \left\| S_i^* - \hat{S}_i \right\| / n \right), \tag{12}$$
where $N_T$ is the total number of testing samples, and $S_i^*$ and $\hat{S}_i$ are the ground truth and reconstructed 3D face shape of the $i$th testing sample.
NPDE measures the depth error at the $j$th vertex in a testing sample as:
$$\mathrm{NPDE}(x_j, y_j) = \left| z_j^* - \hat{z}_j \right| / \left( z_{max}^* - z_{min}^* \right), \tag{13}$$
where $z_{max}^*$ and $z_{min}^*$ are the maximum and minimum depth values in the ground truth 3D face of the testing sample, and $z_j^*$ and $\hat{z}_j$ are the ground truth and reconstructed depth values at the $j$th vertex. We first report the results of our linear implementation, and then those of the nonlinear one. Note that when we mention the proposed method, the linear implementation is referred to unless otherwise specified.
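For concreteness, the two metrics can be sketched in NumPy as follows. The sketch assumes that the reconstructed shapes have already been aligned to the ground truth, that the per-sample error is the norm of the shape difference divided by the number of vertices, and that NPDE normalizes by the depth range of the ground truth face; the function names are hypothetical.

```python
import numpy as np

def mae(gt_shapes, rec_shapes):
    """Average over samples of ||S_gt - S_rec|| / n (shapes are 3 x n arrays)."""
    errs = [np.linalg.norm(g - r) / g.shape[1]
            for g, r in zip(gt_shapes, rec_shapes)]
    return float(np.mean(errs))

def npde(z_gt, z_rec):
    """Per-vertex depth error normalized by the ground-truth depth range."""
    z_gt = np.asarray(z_gt, dtype=float)
    z_rec = np.asarray(z_rec, dtype=float)
    return np.abs(z_gt - z_rec) / (z_gt.max() - z_gt.min())
```

MAE yields one scalar per test set, whereas NPDE yields one value per vertex and is therefore suited to visualizing where on the face the depth error concentrates.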
Reconstruction accuracy across poses Table I shows the average MAE of the proposed method under different poses of the input faces. For a fair comparison with the counterpart methods, we only compute the reconstruction error of neutral testing images. To compute MAE, the reconstructed 3D faces should be first aligned to the ground truth. Since the results of [8, 41] and our proposed method already have the same number of vertices as the ground truth, we employ Procrustes alignment for these methods as suggested by . For the results of [3, 42], however, the number of vertices is different from the ground truth. Hence, we align them using rigid ICP method as  does. It can be seen from Table I that the average MAE of the proposed method (either linear or nonlinear implementation) is lower than that of counterpart methods. Moreover, as the pose becomes large, the error of the proposed method does not increase substantially. This demonstrates the effectiveness of the proposed method in handling arbitrary view faces. Figure 8 shows the reconstruction results of one subject.
Reconstruction accuracy across expressions Figure 9 shows the average MAE of the proposed method and [41, 3] across expressions, based on their reconstructed 3D faces that have the same pose and expression as the input. The proposed method outperforms its counterparts for all expressions. Moreover, as expressions change, the MAE standard deviation of [3, 41] are and , whereas that of the proposed method is in linear implementation and in nonlinear implementation. This indicates the superior robustness of the proposed method to expression variations.
Figure 10 compares the average MAE of the proposed method and [8, 42] across expressions, based on their reconstructed PEN 3D faces. Again, the proposed method shows superiority in both MAE under all expressions and robustness across expressions. We believe that such superiority is owing to its explicit modeling of expression deformation. Figure 11 shows the reconstruction results for one subject under seven expressions.
Reconstruction accuracy across races It is well known that people from different races (e.g., Asian and Caucasian) show different characteristics in facial shapes. Such other-race effect has been reported in face recognition literature . In this experiment, we study the impact of races on 3D face reconstruction using the FRGC v2.0 database . FRGC v2.0 contains 3D faces and images of subjects with different ethnic groups (Table II). Since these faces have no expression variation, the expression shape component in our proposed model is set to zero. We use the method in  to establish dense correspondence of the 3D faces of vertices. We conduct three experiments: (i) training with Asian samples (denoted as Setting I), (ii) training with Caucasian samples (Setting II), and (iii) training with Asian and Caucasian samples (Setting III). The testing set contains samples of remaining subjects in FRGC v2.0, including Asian, African, Hispanic, Caucasian and Unknown races.
Figure 12 compares the 3D face reconstruction accuracy (MAE) across different ethnic groups. Not surprisingly, training on one ethnic group yields higher accuracy when testing on the same ethnic group. As for the other-race effect, the model trained on Caucasian samples achieves comparable accuracy on Caucasian and Hispanic faces, but much worse accuracy on the other races (and the worst on Asian). On the other hand, the model trained on Asian samples performs much worse on all other races than on its own race, and the worst on African. These results reveal the variations in the facial shapes of people from different races. Further, by combining training data of Asian and Caucasian (Setting III), comparable reconstruction accuracy is achieved for both Asian and Caucasian, which is also comparable to those in Setting I and II. This suggests that the proposed method can handle the 3D shape variations among ethnic groups when they are represented in the training data.
Table III: Face alignment accuracy (NME) of different methods on the AFLW database (21 points) and the AFLW2000-3D database (68 points).
5.2 Face Alignment Accuracy
In evaluating face alignment, several state-of-the-art face alignment methods are considered for comparison to the proposed method, including RCPR , ESR , SDM , 3DDFA and 3DDFA+SDM . The dataset constructed from 300W-LP is used for training, the AFLW  and AFLW2000-3D  are for testing. AFLW contains in-the-wild faces with large poses (yaw from - to ). Each image is annotated with up to visible landmarks. For a fair comparison to , we use the same samples as our testing set, and divide the testing set into three subsets according to the absolute yaw angle of the testing image: , and . The resulting three subsets have , and samples, respectively. AFLW2000-3D contains the ground truth 3D faces and the corresponding landmarks of the first AFLW samples. There are samples in , in and in . The bounding boxes provided by AFLW are used in the AFLW testing, while the ground truth bounding boxes enclosing all landmarks are used for the AFLW2000-3D testing.
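Normalized Mean Error (NME)  is employed to measure the face alignment accuracy. It is defined as the mean of the normalized estimation error of visible landmarks for all testing samples:
$$\mathrm{NME} = \frac{1}{N_T} \sum_{i=1}^{N_T} \frac{1}{d_i \, N_i^v} \sum_{j \in \mathcal{V}_i} \left\| \hat{\mathbf{u}}_{ij} - \mathbf{u}_{ij} \right\|_2 ,$$
where $N_T$ is the number of testing samples, $d_i$ is the square root of the bounding box area of the $i$-th testing sample, $N_i^v$ is the number of its visible landmarks, $\mathcal{V}_i$ is the index set of those visible landmarks, and $\mathbf{u}_{ij}$ and $\hat{\mathbf{u}}_{ij}$ are, respectively, the ground truth and estimated coordinates of its $j$-th landmark.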
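The NME definition above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the evaluation code used in the experiments: the function name `normalized_mean_error`, the array layouts, and the representation of the bounding box as a (width, height) pair are all assumptions made for the example.

```python
import numpy as np

def normalized_mean_error(gt, pred, visible, bbox_sizes):
    """Mean over samples of the visible-landmark error normalized by bbox size.

    gt, pred   : (N, L, 2) ground truth / estimated 2D landmark coordinates
    visible    : (N, L) boolean visibility mask
    bbox_sizes : (N, 2) bounding box (width, height) per sample (assumed layout)
    """
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    bbox_sizes = np.asarray(bbox_sizes, dtype=float)

    # Euclidean error of every landmark, shape (N, L).
    err = np.linalg.norm(pred - gt, axis=2)
    # Normalizer: square root of the bounding box area, shape (N,).
    d = np.sqrt(bbox_sizes[:, 0] * bbox_sizes[:, 1])
    # Number of visible landmarks per sample, shape (N,).
    n_vis = visible.sum(axis=1)
    if np.any(n_vis == 0):
        raise ValueError("every sample needs at least one visible landmark")

    # Sum the error over visible landmarks only, then normalize per sample.
    per_sample = (err * visible).sum(axis=1) / (n_vis * d)
    return float(per_sample.mean())
```

Invisible landmarks contribute nothing to either the numerator or the landmark count, so a sample with many self-occluded landmarks is not penalized for them.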
Table III compares the face alignment accuracy on the AFLW and AFLW2000-3D datasets. As can be seen, the proposed method achieves the best accuracy for all poses and on both datasets. In order to assess the robustness of different methods to pose variations, we also report their standard deviations of the NME in Table III. The results again demonstrate the superiority of the proposed method over the counterpart. Figure 13 shows the landmarks detected by the proposed method on some AFLW images.
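As a rough illustration of the pose-robustness measure, the sketch below groups per-sample NME values by absolute yaw angle and reports the mean NME of each subset together with the standard deviation across subsets. It assumes that the reported standard deviation is taken over the per-subset means, which this section does not state explicitly. The function name `nme_by_pose` and the bin edges passed by the caller are hypothetical.

```python
import numpy as np

def nme_by_pose(per_sample_nme, yaw_degrees, bin_edges):
    """Per-subset mean NME, plus mean and std across the pose subsets.

    bin_edges is an increasing sequence such as [e0, e1, ..., eK]; subset k
    holds samples with e_k < |yaw| <= e_{k+1} (the first subset also includes
    |yaw| <= e_1, and the last one also absorbs anything above e_K).
    """
    nme = np.asarray(per_sample_nme, dtype=float)
    abs_yaw = np.abs(np.asarray(yaw_degrees, dtype=float))
    edges = np.asarray(bin_edges, dtype=float)

    # Only the interior edges are needed to assign a subset index 0..K-1.
    idx = np.digitize(abs_yaw, edges[1:-1], right=True)

    means = []
    for k in range(len(edges) - 1):
        mask = idx == k
        if not mask.any():
            raise ValueError(f"pose subset {k} is empty")
        means.append(nme[mask].mean())
    means = np.array(means)

    # np.std defaults to the population standard deviation (ddof=0).
    return means, float(means.mean()), float(means.std())
```

A small standard deviation across subsets indicates that accuracy degrades little as the yaw angle grows.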
Moreover, for the proposed method, the nonlinear regression implementation is better than the linear one. The CNN feature is more powerful and robust than the handcrafted SIFT feature for the face alignment task. In contrast, in the experiments of 3D face reconstruction on the BU3DFE database (see Section 5.1), the reconstruction error of the linear regressors is lower than that of the nonlinear regressors. This might be because the MLP-based nonlinear regressors for 3D face reconstruction need more training samples.
5.3 Face Recognition
While there are many recent face alignment and reconstruction works [69, 70, 71, 72, 73], few of them go one step further and evaluate the contribution of alignment or reconstruction to subsequent tasks, such as face recognition. In contrast, we quantitatively evaluate the contribution of the reconstructed pose-expression-normalized (PEN) 3D faces to face recognition by directly matching 3D shapes to 3D shapes and fusing the result with conventional 2D face recognition. See Sec. 4 for details of the face recognition method enhanced by PEN 3D faces.
In this evaluation, we employ the linear implementation, and use the BU3DFE ( images of subjects; refer to Sec. 3.3) and MICC  databases as training data, and the CMU Multi-PIE database  and the Celebrities in Frontal-Profile (CFP) database  as test data. MICC contains 3D face scans and video clips (in indoor, outdoor and cooperative head-rotation environments) of subjects. We randomly select faces with different poses from the cooperative environment videos, resulting in images of subjects and their corresponding neutral 3D face shapes (whose expression shape components are thus set to zero). The 3D faces are processed by the method in  to establish dense correspondence with vertices.
5.3.1 Face Identification on Multi-PIE Database
CMU Multi-PIE is a widely used benchmark database for face recognition, with faces of subjects collected under various views, expressions and lighting conditions. Here, we consider pose and expression variations, and conduct two experiments. In the first experiment, following the setting of [75, 3], probe images consist of the images of all subjects at poses (, , , , , ) with neutral expression and frontal illumination. In the second experiment, instead of neutral expression, all images with smile, surprise, squint, disgust and scream expressions at the poses and under frontal illumination are the probe images. This protocol extends that of [4, 3] by adding large-pose images (, , ). In both experiments, the frontal images captured in the first session are the gallery. Four state-of-the-art deep learning based (DL-based) face matchers are used as baselines, i.e., VGG , Lightened CNN , CenterLoss  and LDF-Net . The first three matchers are publicly available. We evaluate them with all subjects in Multi-PIE. The last matcher, LDF-Net, is a recent one specially designed for pose-invariant face recognition. It uses the first subjects for training and the remaining subjects for testing. Since it is not publicly available, we requested the match scores from the authors, and fuse our 3D shape match scores with theirs. Note that given the good performance of LDF-Net, we assign a higher weight (i.e., ) to it, whereas the weights for the other three baseline matchers are set to .
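To make the fusion and the rank-1 identification protocol concrete, here is a minimal sketch under several assumptions that are not taken from this section: match scores are stored as probe-by-gallery similarity matrices (higher means more similar, so ICP distances would first need to be converted, e.g. negated), each matcher's scores are min-max normalized before fusion, and fusion is a weighted sum. The function names are hypothetical, and the weight used in any call is a placeholder rather than a value from the experiments.

```python
import numpy as np

def min_max_normalize(scores):
    """Rescale a score matrix to [0, 1] (assumed normalization scheme)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def fuse_scores(scores_2d, scores_3d, weight_2d):
    """Weighted-sum fusion of 2D matcher scores and 3D shape match scores."""
    if not 0.0 <= weight_2d <= 1.0:
        raise ValueError("weight_2d must lie in [0, 1]")
    return (weight_2d * min_max_normalize(scores_2d)
            + (1.0 - weight_2d) * min_max_normalize(scores_3d))

def rank1_accuracy(similarity, probe_ids, gallery_ids):
    """Fraction of probes whose most similar gallery entry has the same ID."""
    similarity = np.asarray(similarity, dtype=float)
    best = np.argmax(similarity, axis=1)
    predicted = np.asarray(gallery_ids)[best]
    return float(np.mean(predicted == np.asarray(probe_ids)))
```

A weight above 0.5 favors the 2D matcher, which corresponds to the choice of giving a stronger baseline more influence in the fused score.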
Table IV reports the rank-1 accuracy of the baseline face matchers in the first experiment, where the baseline matchers are all further improved by our proposed method. Specifically, VGG and Lightened CNN are consistently improved across different poses when fused with 3D, while CenterLoss gains substantial improvement at large poses ( at and at ). Even for the latest LDF-Net, the recognition accuracy is improved by at and at . For all the baseline matchers, the larger the yaw angle is, the more evident the accuracy improvement. Table IV also gives the recognition accuracy of using only the reconstructed 3D faces, in the row headed by “ICP-3D”. Although its average accuracy is much worse than that of its 2D counterparts, it fluctuates more gently as probe faces rotate from frontal to profile. These results demonstrate the effectiveness of the proposed method in dealing with pose variations, as well as in reconstructing individual 3D faces with discriminative details that are complementary to 2D face recognition.
Given its best performance among the three publicly available baseline matchers, we employ the CenterLoss matcher in the second experiment. The results are shown in Table V. As can be seen, the compound impact of pose and expression variations makes face recognition more challenging, resulting in noticeably lower accuracy than in Table IV. Yet, our proposed method still improves the overall accuracy of the baseline, especially for probe faces of large pose or disgust expression. We believe that this performance gain in recognizing non-frontal and expressive faces is due to the capability of the proposed method to provide complementary pose-and-expression-invariant discriminative features in the 3D face shape space.
5.3.2 Face Verification on CFP Database
We further evaluate our reconstructed PEN 3D faces on a more challenging unconstrained face recognition setting by using the CFP database, which has subjects each with frontal and profile images. The evaluation includes frontal-frontal (FF) and frontal-profile (FP) face verification, each having folders with same-person and different-person pairs. Table VI reports the average results with standard deviations in terms of Accuracy, Equal Error Rate (EER), and Area Under the Curve (AUC).
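The three verification metrics can be computed from the similarity scores of same-person (genuine) and different-person (impostor) pairs. The following is a minimal NumPy sketch for a single fold, not the evaluation code behind Table VI: the function names are hypothetical, the EER is approximated at the ROC point where the false accept and false reject rates are closest, and the accuracy is evaluated at a threshold supplied by the caller (in a cross-validation protocol it would typically be chosen on the other folds).

```python
import numpy as np

def roc_curve_points(genuine, impostor):
    """False/true positive rates for thresholds swept from high to low."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # Prepend +inf (after reversing) so that the curve starts at (0, 0).
    thresholds = np.concatenate([thresholds, [np.inf]])[::-1]
    tpr = np.array([(genuine >= t).mean() for t in thresholds])
    fpr = np.array([(impostor >= t).mean() for t in thresholds])
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def equal_error_rate(fpr, tpr):
    """Approximate EER: average of FPR and FNR where they are closest."""
    fnr = 1.0 - tpr
    k = int(np.argmin(np.abs(fpr - fnr)))
    return float((fpr[k] + fnr[k]) / 2.0)

def accuracy_at(genuine, impostor, threshold):
    """Fraction of pairs classified correctly at a fixed threshold."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    correct = (genuine >= threshold).sum() + (impostor < threshold).sum()
    return float(correct / (genuine.size + impostor.size))
```

Averaging these per-fold values and taking their standard deviation gives summary statistics of the kind reported for a multi-fold protocol.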
Given its best performance on the Multi-PIE database, we employ the CenterLoss matcher in this experiment. We also report the recognition accuracy of the reconstructed PEN 3D faces (see “ICP-3D”). Although its average accuracy is much worse than that of the baseline, it further improves the performance of CenterLoss in both frontal-frontal (FF) and frontal-profile (FP) face verification. These results demonstrate the effectiveness of the proposed method in dealing with pose variations, as well as its ability to provide complementary discriminative features in unconstrained environments. Figure 14 shows some example genuine and impostor pairs in CFP, which are incorrectly recognized by CenterLoss, but correctly recognized by the fusion of CenterLoss and our proposed method.
The proposed method has two alternating optimization processes, one in 2D space for face alignment and the other in 3D space for 3D shape reconstruction. We experimentally investigate the convergence of these two processes when training the proposed linear and nonlinear implementations on the BU3DFE database. We conduct ten-fold cross-validation experiments, and compute the average errors over the training data through ten iterations. As shown in Fig. 15, the training errors converge in about five iterations in the linear implementation, while in the nonlinear implementation the training errors converge within two to three iterations. Hence, we set the number of iterations as and in the linear and nonlinear implementations, respectively.
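One simple way to turn such training-error curves into an iteration count is to stop at the first iteration after which the relative improvement becomes negligible. The sketch below illustrates that rule only; the function name `iterations_to_converge` and the tolerance `rel_tol` are assumptions, and this section does not say that the iteration counts were chosen by such an automatic criterion rather than by inspecting the curves.

```python
def iterations_to_converge(errors, rel_tol=0.01):
    """Number of iterations after which further iterations barely help.

    errors[k] is the average training error after iteration k + 1. The
    function returns the smallest count n such that going from n to n + 1
    iterations reduces the error by less than rel_tol (relative to the
    error after n iterations). If that never happens, len(errors) is
    returned.
    """
    for n in range(1, len(errors)):
        prev, curr = errors[n - 1], errors[n]
        if prev <= 0:
            # The error is already zero; nothing is left to improve.
            return n
        if (prev - curr) / prev < rel_tol:
            return n
    return len(errors)
```

With a looser tolerance the rule stops earlier, trading a small amount of accuracy for fewer cascaded regressors at test time.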
5.5 Computational Complexity
According to our experiments on a PC with an i7-4790 CPU and GB memory, the linear implementation of the proposed method runs at FPS, and the nonlinear implementation runs at FPS with an NVIDIA GeForce GTX . This indicates that the proposed method can detect landmarks and reconstruct 3D faces in real time. We also report the efficiency of individual steps in Table VII, and a comparison with existing methods in Table VIII.
In this paper, we present a novel regression based method for joint face alignment and 3D face reconstruction from single 2D images of arbitrary poses and expressions. It utilizes landmarks on a 2D face image as clues for reconstructing 3D shapes, and uses the reconstructed 3D shapes to refine landmarks. By alternately applying cascaded landmark regressors and 3D shape regressors, the proposed method can effectively accomplish the two tasks simultaneously in real-time. Unlike existing 3D face reconstruction methods, the proposed method does not require additional face alignment methods, but can fully automatically reconstruct both pose-and-expression-normalized and expressive 3D faces from a single face image of arbitrary poses and expressions. Compared with existing face alignment methods, the proposed method can effectively handle invisible and expression-deformed landmarks with the assistance of 3D face models. Extensive experiments with comparisons to state-of-the-art methods demonstrate the effectiveness and superiority of the proposed method in both face alignment and 3D face reconstruction, and in facilitating cross-view and cross-expression face recognition as well.
The authors would like to thank the authors of LDF-Net for sharing the match scores of LDF-Net on Multi-PIE with us. This work is supported by the National Key Research and Development Program of China (2017YFB0802300), the National Natural Science Foundation of China (61773270), and the National Key Scientific Instrument and Equipment Development Projects of China (2013YQ49087904).
-  V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” TPAMI, vol. 25, no. 9, pp. 1063–1074, 2003.
-  H. Han and A. K. Jain, “3D face texture modeling from uncalibrated frontal and profile images,” in BTAS, 2012, pp. 223–230.
-  X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li, “High-fidelity pose and expression normalization for face recognition in the wild,” in CVPR, 2015, pp. 787–796.
-  B. Chu, S. Romdhani, and L. Chen, “3D-aided face recognition robust to expression and pose variations,” in CVPR, 2014, pp. 1907–1914.
-  L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3D facial expression database for facial behavior research,” in FG, 2006, pp. 211–216.
-  C. Cao, Y. Weng, S. Lin, and K. Zhou, “3D shape regression for real-time facial animation,” TOG, vol. 32, no. 4, p. 41, 2013.
-  C. Cao, H. Wu, Y. Weng, T. Shao, and K. Zhou, “Real-time facial animation with image-based dynamic avatars,” TOG, vol. 35, no. 4, pp. 126:1–126:12, 2016.
-  F. Liu, D. Zeng, Q. Zhao, and X. Liu, “Joint face alignment and 3D face reconstruction,” in ECCV, 2016, pp. 545–560.
-  M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization,” in ICCVW, 2011, pp. 2144–2151.
-  X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Li, “Face alignment across large poses: A 3D solution,” in CVPR, 2016, pp. 146–155.
-  P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, “Overview of the face recognition grand challenge,” in CVPR, 2005, pp. 947–954.
-  R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-PIE,” IVC, vol. 28, no. 5, pp. 807–813, 2010.
-  S. Sengupta, J. C. Chen, C. Castillo, and V. M. Patel, “Frontal to profile face verification in the wild,” in WACV, 2016, pp. 1–9.
-  T. F. Cootes and A. Lanitis, “Active shape models: Evaluation of a multi-resolution method for improving image search,” in BMVC, 1994, pp. 327–338.
-  D. Cristinacce and T. F. Cootes, “Boosted regression active shape models,” in BMVC, 2007, pp. 1–10.
-  T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” TPAMI, no. 6, pp. 681–685, 2001.
-  I. Matthews and S. Baker, “Active appearance models revisited,” IJCV, vol. 60, no. 2, pp. 135–164, 2004.
-  X. Liu, P. Tu, and F. Wheeler, “Face model fitting on low resolution images,” in BMVC, 2006, pp. 1079–1088.
-  X. Liu, “Discriminative face alignment,” TPAMI, vol. 31, no. 11, pp. 1941–1954, 2009.
-  D. Cristinacce and T. Cootes, “Automatic feature localisation with constrained local models,” Pattern Recognition, vol. 41, no. 10, pp. 3054–3067, 2008.
-  X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in CVPR, 2013, pp. 532–539.
-  X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” IJCV, vol. 107, no. 2, pp. 177–190, 2014.
-  S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in CVPR, 2014, pp. 1685–1692.
-  S. Zhu, C. Li, C. C. Loy, and X. Tang, “Face alignment by coarse-to-fine shape searching,” in CVPR, 2015, pp. 4998–5006.
-  X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in CVPR, 2012, pp. 2879–2886.
-  X. Yu, J. Huang, S. Zhang, W. Yan, and D. N. Metaxas, “Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model,” in ICCV, 2013, pp. 1944–1951.
-  L. A. Jeni, J. F. Cohn, and T. Kanade, “Dense 3D face alignment from 2D videos in real-time,” in FG, vol. 1, 2015, pp. 1–8.
-  A. Jourabloo and X. Liu, “Pose-invariant 3D face alignment,” in ICCV, 2015, pp. 3694–3702.
-  ——, “Large-pose face alignment via CNN-based dense 3D model fitting,” in CVPR, 2016, pp. 4188–4196.
-  ——, “Pose-invariant face alignment via CNN-based dense 3D model fitting,” IJCV, vol. 124, no. 2, pp. 187–203, 2017.
-  A. Jourabloo, X. Liu, M. Ye, and L. Ren, “Pose-invariant face alignment with a single CNN,” in ICCV, 2017, pp. 3219–3228.
-  V. Blanz and T. Vetter, “A morphable model for the synthesis of 3D faces,” in SIGGRAPH, 1999, pp. 187–194.
-  S. Tulyakov and N. Sebe, “Regressing a 3D face shape from a single image,” in ICCV, 2015, pp. 3748–3755.
-  I. Kemelmacher-Shlizerman and R. Basri, “3D face reconstruction from a single image using a single reference face shape,” TPAMI, vol. 33, no. 2, pp. 394–405, 2011.
-  J. Roth, Y. Tong, and X. Liu, “Adaptive 3D face reconstruction from unconstrained photo collections,” TPAMI, vol. 39, no. 11, pp. 2127–2141, 2017.
-  S. Romdhani and T. Vetter, “Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior,” in CVPR, 2005, pp. 986–993.
-  G. Hu, F. Yan, J. Kittler, W. Christmas, C. H. Chan, Z. Feng, and P. Huber, “Efficient 3D morphable face model fitting,” Pattern Recognition, vol. 67, pp. 366–379, 2017.
-  L. Tran and X. Liu, “Nonlinear 3D face morphable model,” in CVPR, 2018, pp. 7346–7355.
-  Y. J. Lee, S. J. Lee, K. R. Park, J. Jo, and J. Kim, “Single view-based 3D face reconstruction robust to self-occlusion,” EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, pp. 1–20, 2012.
-  C. Qu, E. Monari, T. Schuchert, and J. Beyerer, “Fast, robust and automatic 3D face model reconstruction from videos,” in AVSS, 2014, pp. 113–118.
-  F. Liu, D. Zeng, J. Li, and Q. Zhao, “Cascaded regressor based 3D face reconstruction from a single arbitrary view image,” arXiv:1509.06161, 2015.
-  A. T. Tran, T. Hassner, I. Masi, and G. Medioni, “Regressing robust and discriminative 3D morphable models with a very deep neural network,” in CVPR, 2017, pp. 1493–1502.
-  H. Drira, B. Ben Amor, A. Srivastava, M. Daoudi, and R. Slama, “3D face recognition under expressions, occlusions, and pose variations,” TPAMI, vol. 35, no. 9, pp. 2270–2283, 2013.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
-  Y. Sun, D. Liang, X. Wang, and X. Tang, “DeepID3: Face recognition with very deep neural networks,” arXiv:1502.00873, 2015.
-  E. Zhou, Z. Cao, and Q. Yin, “Naive-deep face recognition: Touching the limit of LFW benchmark or not?” arXiv:1501.04690, 2015.
-  C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero, “Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications,” TPAMI, vol. 38, no. 8, pp. 1548–1568, 2016.
-  L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose-invariant face recognition,” in CVPR, 2017, pp. 1283–1292.
-  D. Yi, Z. Lei, and S. Z. Li, “Towards pose robust face recognition,” in CVPR, 2013, pp. 3539–3545.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in CVPR, 2014, pp. 1701–1708.
-  L. Tran, X. Yin, and X. Liu, “Representation learning by rotating your faces,” TPAMI, 2018.
-  L. Hu, M. Kan, S. Shan, X. Song, and X. Chen, “LDF-Net: Learning a displacement field network for face recognition across pose,” in FG, 2017, pp. 9–16.
-  C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: The first facial landmark localization challenge,” in ICCVW, 2013, pp. 397–403.
-  T. Bolkart and S. Wuhrer, “3D faces in motion: Fully automatic registration and statistical analysis,” CVIU, vol. 131, pp. 100–115, 2015.
-  P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, “Localizing parts of faces using a consensus of exemplars,” in CVPR, 2011, pp. 545–552.
-  E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin, “Extensive facial landmark localization with coarse-to-fine convolutional network cascade,” in ICCVW, 2013, pp. 386–391.
-  K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The extended M2VTS database,” in AVBPA, vol. 964, 1999, pp. 965–966.
-  P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3D face model for pose and illumination invariant face recognition,” in AVSS, 2009, pp. 296–301.
-  C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou, “Facewarehouse: A 3D facial expression database for visual computing,” TVCG, vol. 20, no. 3, pp. 413–425, 2014.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
-  X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis, “3D shape estimation from 2D landmarks: A convex relaxation approach,” in CVPR, 2015, pp. 4447–4455.
-  M. Kowalski, J. Naruniec, and T. Trzcinski, “Deep alignment network: A convolutional neural network for robust face alignment,” in CVPRW, 2017, pp. 2034–2043.
-  Y. Chen and G. Medioni, “Object modeling by registration of multiple range images,” IVC, pp. 2724–2729, 1991.
-  V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in CVPR, 2014, pp. 1867–1874.
-  Z. Lei, Q. Bai, R. He, and S. Z. Li, “Face shape recovery from a single image using CCA mapping between tensor spaces,” in CVPR, 2008, pp. 1–7.
-  A. Bas, W. A. Smith, T. Bolkart, and S. Wuhrer, “Fitting a 3D morphable model to edges: A comparison between hard and soft correspondences,” in ACCV, 2016, pp. 377–391.
-  P. J. Phillips, F. Jiang, A. Narvekar, J. Ayyad, and A. J. O’Toole, “An other-race effect for face recognition algorithms,” ACM Tran. on Applied Perception (TAP), vol. 8, no. 2, p. 14, 2011.
-  X. P. Burgos-Artizzu, P. Perona, and P. Dollár, “Robust face landmark estimation under occlusion,” in ICCV, 2013, pp. 1513–1520.
-  R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos, “DenseReg: Fully convolutional dense shape regression in-the-wild,” in CVPR, 2017, pp. 2614–2623.
-  O. Tuzel, T. K. Marks, and S. Tambe, “Robust face alignment using a mixture of invariant experts,” in ECCV, 2016, pp. 825–841.
-  X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas, “A recurrent encoder-decoder network for sequential face alignment,” in ECCV, 2016, pp. 38–56.
-  J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway, “A 3D morphable model learnt from 10,000 faces,” in CVPR, 2016, pp. 5543–5552.
-  E. Richardson, M. Sela, R. Or-El, and R. Kimmel, “Learning detailed face reconstruction from a single image,” in CVPR, 2017, pp. 5553–5562.
-  A. D. Bagdanov, A. Del Bimbo, and I. Masi, “The florence 2D/3D hybrid face dataset,” in ACM J-HGBU, 2011, pp. 79–80.
-  Z. Zhu, P. Luo, X. Wang, and X. Tang, “Multi-view perceptron: a deep model for learning face identity and view representations,” in NIPS, 2014, pp. 217–225.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in BMVC, 2015, pp. 41.1–41.12.
-  X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” TIFS, vol. 13, no. 11, pp. 2884–2896, 2018.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in ECCV, 2016, pp. 499–515.