Face alignment (FA) is the problem of detecting and localizing facial landmarks from a single RGB image. It is an important research topic because it provides input to a variety of computer vision tasks, such as head-pose estimation and tracking, face recognition, facial expression understanding and visual speech recognition (Escalera et al, 2018; Loy et al, 2019). 2D face alignment (2DFA) has been extensively studied over the last decades, yielding a plethora of methods and algorithms (Wu and Ji, 2019). State-of-the-art 2DFA methods based on deep neural networks (DNNs) are the best performing in terms of accuracy and of invariance to facial appearance, shape and expression, as well as in terms of repeatability and reproducibility in the presence of image noise, low image resolution, motion blur, and varying lighting conditions and backgrounds.
Nevertheless, 2DFA methods yield poor landmark detection and localization performance in the presence of occlusions which occur in case of large poses induced by out-of-image-plane head rotations (self occlusions) as well as by the presence of various objects in the camera field of view, such as glasses, hair, hands and handheld objects, etc. Robust facial landmark detection and localization in the presence of occlusions can only be achieved on the premise that 3D information is taken into account. It is well established that 2D facial landmarks (and, more generally, face images) embed 3D information. This information can be retrieved by fitting a 3D face model to a 2D face image, even if the latter is only partially visible. The process of fitting a 3D model to a 2D image constitutes the basis of training 3D face alignment (3DFA) algorithms.
Consider for example a 3D face model that is parameterized both by identities and by facial deformations, e.g. the 3D morphable model (3DMM) (Blanz and Vetter, 1999). The task of fitting a 3DMM to an RGB image of a face consists of estimating the parameters of the mapping from the generic 3D model to a particular face, namely the identity and expression parameters, as well as the pose parameters (scale, rotation, translation and projection), e.g. (Gou et al, 2016; Zhu et al, 2016). Once an optimal set of parameters is found, one can associate 3DMM vertices with facial landmarks. This is the basis of many automatic and semi-automatic methods for annotating 2D faces with 3D landmarks.
Nevertheless, the fitting task just mentioned is a difficult nonlinear optimization problem, in particular in the presence of large poses and occlusions. In the recent past, a number of methods have been developed to perform the 3D-to-2D fitting process necessary for 3D facial landmark annotation. The performance of the vast majority of existing 3DFA methods relies on the quality of landmark annotation. This is true for training with modern discriminative deep learning methods, and it is true for testing as well. Indeed, to date, algorithm performance is computed empirically by measuring the error between the predicted output and the corresponding ground truth, e.g. (Jeni et al, 2016). Under these circumstances, annotation errors are likely to bias both parameter estimation (training) and performance evaluation (testing).
There is a lack of a benchmarking methodology that could assess quantitatively and in a completely unsupervised manner the robustness and effectiveness of 3DFA algorithms, namely a method that computes a confidence score that measures algorithm performance in the absence of the ground truth. This is also crucial in order to decide, without human intervention, whether a 3DFA method, when applied to an unknown image of a face with no annotation available, yields an output that is accurate enough to be further used by other algorithms, such as head-gaze estimation, facial expression analysis or lip reading.
This paper proposes a methodological framework for assessing the performance of 3DFA algorithms based on robust statistics and a parametric confidence test. Unlike supervised metrics, currently in use for 3DFA performance evaluation and based on annotated datasets, the proposed method is fully unsupervised. We show that the robust estimation of the rigid mapping between two sets of 3D facial landmarks, one set associated with a face in an unknown orientation and with an unknown expression, and another set associated with a frontal face, provides an extremely reliable way to separate face pose (due to head motions) from non-rigid face deformations (due to facial expressions), all in the presence of badly located landmarks.
Using a 3DFA algorithm and a very large and unannotated dataset of face images with large variabilities in orientation, expression and identity, we make use of the robust rigid-mapping methodology to build a statistical frontal landmark model and a parametric confidence score. Based on this pipeline, the proposed performance evaluation protocol proceeds as follows. First, 3D landmarks are extracted from a face image. Second, the landmarks are rigidly mapped onto the frontal model. Third, a confidence score is computed for each mapped landmark, which makes it possible to assess whether the landmark lies within a confidence region or not.
We describe in detail an experimental evaluation framework that uses several datasets and two 3DFA algorithms. We empirically show that our methodological pipeline is neither dataset- nor 3DFA algorithm-biased. We also show that the proposed framework can be used not only to assess quantitatively the performance of 3DFA algorithms, but also to test the accuracy of automatic and semi-automatic methods currently used for the annotation of face datasets.
The methodology proposed in this paper is illustrated in Fig. 1. The two images (left) are from the AFLW2000-3D dataset (Zhu et al, 2016). The statistical frontal landmark model (right) is built using the 3DFA method of (Feng et al, 2018) and the YawDD dataset (Abtahi et al, 2014). This model characterizes each landmark with an ellipsoidal confidence region centered at a posterior mean. Fig. 1(a): Landmarks extracted using (Bulat and Tzimiropoulos, 2016) (left) and mapped onto the statistical model (right). In this case most of the landmarks lie inside their confidence regions, which indicates that they are correctly located. Fig. 1(b): Ground-truth landmarks obtained with a semi-automatic annotation process (Zhu et al, 2016) (left) and mapped onto the statistical model (right). One may notice that in this case many mapped landmarks fall outside their confidence regions. The benefit of the proposed method is twofold: (i) an unsupervised assessment of the quality of the detected landmarks, and (ii) a robust and expression-preserving landmark mapping from an arbitrary pose to a frontal pose.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 reviews statistical rigid-mapping estimation and describes two robust methods. Section 4 empirically analyses the proposed robust rigid-mapping methods. Section 5 proposes a methodological pipeline for building a statistical face model and an associated parametric confidence metric. Section 6 presents extensive experimental results, and Section 7 draws some conclusions. Supplemental material for this paper can be found at https://team.inria.fr/perception/research/upa3dfa/
2 Related Work
It is interesting to note that recently proposed methods for 3DFA lie at the crossroads of deformable shape models, model-based image analysis and neural networks. In order to discuss these links we introduce some mathematical notations and concepts. Let the vector $\boldsymbol{\alpha} \in \mathcal{A} \subseteq \mathbb{R}^{P}$ denote the ensemble of parameters of a 3D face model (identity, expression and pose), where $\mathcal{A}$ is the parameter vector space and $P$ is the number of parameters, and let $\mathbf{I}_n$ denote the image of a face from a set of images of size $N$. One class of 3DFA methods directly learns a mapping $f: \mathbf{I} \mapsto \boldsymbol{\alpha}$ from a training dataset of face images and their associated model parameters, $\{(\mathbf{I}_n, \boldsymbol{\alpha}_n)\}_{n=1}^{N}$, e.g. (Zhu et al, 2016; Jourabloo and Liu, 2017; Feng et al, 2018).
Another class of methods proceeds in two steps. First, 2D landmarks are extracted from a face image by learning an image-to-landmark mapping $g: \mathbf{I} \mapsto \boldsymbol{x}$, from a face image to a set of 2D landmarks $\boldsymbol{x}$, using a training dataset $\{(\mathbf{I}_n, \boldsymbol{x}_n)\}_{n=1}^{N}$. Second, a 2D-to-3D mapping $h: \boldsymbol{x} \mapsto \boldsymbol{X}$ is estimated, where $\boldsymbol{X}$ is a set of 3D landmarks. This mapping can be obtained either by learning, using a training dataset conditioned by a 3D model parameterized by $\boldsymbol{\alpha}$, i.e. $\{(\boldsymbol{x}_n, \boldsymbol{X}_n(\boldsymbol{\alpha}_n))\}_{n=1}^{N}$, e.g. (Zhao et al, 2016; Bulat and Tzimiropoulos, 2016, 2017), or by direct optimization over $\boldsymbol{\alpha}$ of a function that maps a 3D model onto the 2D landmarks, e.g. (Yu et al, 2017).
These DNN-based 3DFA methods use a variety of architectures in order to learn the regression functions $f$, $g$ and $h$ mentioned above. Given this variety, it is difficult to compare them directly and to assess their merits based on an analysis of the underlying DNN concepts and methodologies. Hence, 3DFA algorithm performance should be measured empirically, as is often the case in deep learning.
To date and to the best of our knowledge, there has been a single attempt to benchmark 3DFA (Jeni et al, 2016). In detail, four datasets were specifically gathered, annotated and prepared, and two performance metrics were used for this challenge. The BU-4DFE (Yin et al, 2008) and BP-4D-Spontaneous (Zhang et al, 2014) datasets used a structured-light stereo sensor to capture textured 3D meshes of faces in controlled conditions and with various backgrounds. 2,295 meshes were selected from these datasets and manually annotated with 66 landmarks and with self-occlusion information. Then, 16,065 2D views were synthesized (seven views for each mesh) with yaw and pitch rotations sampled within fixed angular intervals. Additionally, there were 7,000 frames from the Multi-PIE dataset (Gross et al, 2010) and 541 frames from the Time-Sliced dataset. Both these datasets contain RGB images gathered with multiple cameras from different viewpoints but with no 3D information, hence a 3D face model was extracted for each image, using the model-based multi-view structure-from-motion technique of (Jeni et al, 2017). As above, each 3D face model was annotated with 66 landmarks and with self-occlusion information.
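Moreover, the following metrics were used: the ground-truth error (GTE) and the cross-view ground-truth consistency error (CVGTCE), namely,
$$E_{\mathrm{GTE}}(n) = \frac{1}{J} \sum_{j=1}^{J} \frac{\| \boldsymbol{X}_{nj} - \boldsymbol{X}^{\mathrm{gt}}_{nj} \|}{d_n}, \tag{1}$$
$$E_{\mathrm{CVGTCE}}(n) = \frac{1}{J} \sum_{j=1}^{J} \frac{\| \rho \mathbf{R} \boldsymbol{X}_{nj} + \boldsymbol{t} - \boldsymbol{X}^{\mathrm{gt}}_{nj} \|}{d_n}, \tag{2}$$
where $\boldsymbol{X}_{nj}$ denotes the $j$-th detected 3D landmark associated with test sample $n$, $\boldsymbol{X}^{\mathrm{gt}}_{nj}$ is the corresponding ground-truth 3D landmark, $d_n$ is the inter-ocular distance of the sample face $n$, $J$ is the number of landmarks, and $\rho$, $\mathbf{R}$ and $\boldsymbol{t}$ are the scale factor, rotation matrix and translation vector associated with a rigid mapping that compensates for the possible discrepancy between the set of detected landmarks and the set of ground-truth landmarks.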
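The two supervised metrics can be sketched in a few lines of NumPy. This is a minimal illustration, not the challenge's reference code: the function names `gte`, `cvgtce` and `similarity_fit` are hypothetical, the inter-ocular distance `iod` is assumed to be supplied by the caller, and the compensating rigid mapping is assumed to be the ordinary least-squares similarity (Umeyama, 1991).

```python
import numpy as np

def similarity_fit(src, dst):
    """Least-squares scale, rotation and translation mapping src onto dst."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)                 # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last singular direction if needed so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (src_c ** 2).sum(axis=1).mean()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def gte(pred, gt, iod):
    """Ground-truth error: mean landmark distance, normalized by iod."""
    return np.linalg.norm(pred - gt, axis=1).mean() / iod

def cvgtce(pred, gt, iod):
    """Same error after compensating a similarity transform pred -> gt."""
    scale, R, t = similarity_fit(pred, gt)
    mapped = scale * pred @ R.T + t
    return np.linalg.norm(mapped - gt, axis=1).mean() / iod
```

If the predicted landmarks differ from the ground truth only by a similarity transform, `cvgtce` is numerically zero while `gte` is not, which is the discrepancy that the second metric is meant to remove.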
These metrics require a test dataset with ground-truth 3D landmarks, which may be prone to errors whether the annotations are performed manually or with an automatic 2D-to-3D fitting process, e.g. (Bulat and Tzimiropoulos, 2017). They cannot provide a precise account of how 3DFA is affected by large poses, by non-rigid facial deformations, or by occlusions. They do not provide a detailed analysis of a particular landmark or of a group of landmarks used in tasks that are likely to require highly accurate landmark detection and localization, e.g. landmarks located on the lips, needed for visual voice activity detection, visual speech enhancement and visual speech recognition.
In contrast, the proposed methodology does not make use of ground-truth annotations for assessing the performance of 3DFA algorithms. We use robust rigid mapping to build frontal landmark models of faces in a completely unsupervised way and we propose a statistical confidence score to assess whether the landmarks associated with a test face have been accurately detected or not. The methodology can indifferently be applied to landmarks obtained with 3DFA algorithms or with an annotation process, be it manual, semi-automatic or fully automatic. Moreover, based on the proposed statistical score, it is possible to remove badly located landmarks from an annotated dataset.
3 Robust Rigid Mapping
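Let us consider the mapping between two sets of facial landmarks, $\{\boldsymbol{X}_j\}_{j=1}^{J}$ and $\{\boldsymbol{Y}_j\}_{j=1}^{J}$. In the general case, this mapping is composed of a rigid transformation, i.e. head motion, and of non-rigid facial deformations. We associate an additive error vector (or residual) $\boldsymbol{e}_j$ with each landmark $j$, $1 \le j \le J$, to account for the non-rigid component of the mapping and for various sources of errors. Without loss of generality, it is assumed that the set $\{\boldsymbol{X}_j\}$ is associated with an unknown face with an unknown pose and that the set $\{\boldsymbol{Y}_j\}$ is associated with the frontal view of a prototype face. Therefore, we seek a mapping from an arbitrary face pose to a frontal face pose. This mapping can be modeled in the following way:
$$\boldsymbol{Y}_j = \rho \mathbf{R} \boldsymbol{X}_j + \boldsymbol{t} + \boldsymbol{e}_j, \tag{3}$$
where the rigid transformation is parameterized by a scale factor $\rho > 0$, a rotation matrix $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ and a translation vector $\boldsymbol{t} \in \mathbb{R}^{3}$. If we assume that the residuals are independent and identically distributed (i.i.d.), the problem of estimating the rigid-transformation parameters can be solved via log-likelihood maximization or, equivalently, via negative log-likelihood minimization, $\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$, with:
$$\mathcal{L}(\boldsymbol{\theta}) = - \sum_{j=1}^{J} \log P(\boldsymbol{e}_j; \boldsymbol{\theta}), \tag{4}$$
where $P(\boldsymbol{e}_j; \boldsymbol{\theta})$ is the probability distribution function (pdf) of $\boldsymbol{e}_j$, parameterized by $\boldsymbol{\theta}$, which is composed of $\rho$, $\mathbf{R}$, $\boldsymbol{t}$ and of the pdf parameters.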
3.1 Gaussian Model
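The simplest statistical model is to assume that the residuals follow a zero-centered Gaussian distribution with covariance matrix $\boldsymbol{\Sigma}$, namely $\boldsymbol{e}_j \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})$. By developing (4) and ignoring terms that do not depend on the model parameters, we obtain:
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2} \sum_{j=1}^{J} \| \boldsymbol{Y}_j - \rho \mathbf{R} \boldsymbol{X}_j - \boldsymbol{t} \|^{2}_{\boldsymbol{\Sigma}} + \frac{J}{2} \log |\boldsymbol{\Sigma}|, \tag{5}$$
where $\| \boldsymbol{a} \|^{2}_{\boldsymbol{\Sigma}} = \boldsymbol{a}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{a}$ is the squared Mahalanobis norm of $\boldsymbol{a}$. The minimization of (5) over $\boldsymbol{t}$ yields:
$$\boldsymbol{t}^{\star} = \bar{\boldsymbol{Y}} - \rho \mathbf{R} \bar{\boldsymbol{X}},$$
where the superscript $\star$ indicates the optimal value of a parameter and with the notations:
$$\bar{\boldsymbol{X}} = \frac{1}{J} \sum_{j=1}^{J} \boldsymbol{X}_j, \qquad \bar{\boldsymbol{Y}} = \frac{1}{J} \sum_{j=1}^{J} \boldsymbol{Y}_j.$$
Substituting $\boldsymbol{t}^{\star}$ into (5) and assuming an isotropic covariance, $\boldsymbol{\Sigma} = \sigma^{2} \mathbf{I}$, leads to closed-form solutions for the rotation and the scale, e.g. (Horn, 1987; Horn et al, 1988; Faugeras and Hebert, 1986; Arun et al, 1987; Umeyama, 1991). Nevertheless, the isotropic-covariance assumption is barely valid in practice. In the case of a full covariance, the optimization becomes
$$\min_{\rho, \mathbf{R}, \boldsymbol{\Sigma}} \; \frac{1}{2} \mathrm{tr} \Big( \boldsymbol{\Sigma}^{-1} \sum_{j=1}^{J} (\boldsymbol{Y}'_j - \rho \mathbf{R} \boldsymbol{X}'_j)(\boldsymbol{Y}'_j - \rho \mathbf{R} \boldsymbol{X}'_j)^{\top} \Big) + \frac{J}{2} \log |\boldsymbol{\Sigma}|,$$
where $\mathrm{tr}(\cdot)$ is the trace operator and with the notations $\boldsymbol{X}'_j = \boldsymbol{X}_j - \bar{\boldsymbol{X}}$, $\boldsymbol{Y}'_j = \boldsymbol{Y}_j - \bar{\boldsymbol{Y}}$. A rotation matrix must satisfy $\mathbf{R}^{\top} \mathbf{R} = \mathbf{I}$ and $|\mathbf{R}| = 1$. This yields a constrained non-linear optimization problem. An elegant formulation consists of parameterizing the rotation with a unit quaternion, thus reducing the number of parameters from 9 to 4, while the number of constraints is reduced from 7 to 1. Let $\mathbf{R} = \mathbf{R}(\boldsymbol{q})$, where $\boldsymbol{q}$ is a unit quaternion (please consult Appendix A). Using this representation, the rotation is described by four parameters and the associated constrained optimization problem writes:
$$\boldsymbol{q}^{\star} = \arg\min_{\boldsymbol{q}} \; \frac{1}{2} \sum_{j=1}^{J} \| \boldsymbol{Y}'_j - \rho \mathbf{R}(\boldsymbol{q}) \boldsymbol{X}'_j \|^{2}_{\boldsymbol{\Sigma}} \quad \text{s.t.} \quad \boldsymbol{q}^{\top} \boldsymbol{q} = 1. \tag{3.1}$$
Similar to (Horn, 1987), the optimal scale factor is obtained in closed form by setting the derivative of the above criterion with respect to $\rho$ to zero:
$$\rho^{\star} = \frac{\sum_{j=1}^{J} (\mathbf{R} \boldsymbol{X}'_j)^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{Y}'_j}{\sum_{j=1}^{J} (\mathbf{R} \boldsymbol{X}'_j)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{R} \boldsymbol{X}'_j)}. \tag{11}$$
Finally, the optimal covariance is estimated with:
$$\boldsymbol{\Sigma}^{\star} = \frac{1}{J} \sum_{j=1}^{J} (\boldsymbol{Y}'_j - \rho \mathbf{R} \boldsymbol{X}'_j)(\boldsymbol{Y}'_j - \rho \mathbf{R} \boldsymbol{X}'_j)^{\top}. \tag{12}$$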
Once the rotation and scale are initialized using the unit-quaternion closed-form method of (Horn, 1987), alternating optimization can be used to iterate through (3.1), (11) and (12). We will refer to this method as generalized Horn.
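The closed-form initialization can be illustrated as follows. This is a sketch of the unit-quaternion method of (Horn, 1987) under the isotropic assumption, using the symmetric scale estimate from that paper; the function names `quat_to_rot` and `horn_similarity` are hypothetical and the landmark sets are assumed to be stored as `(J, 3)` arrays.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix associated with a unit quaternion q = (q0, qx, qy, qz)."""
    q0, qx, qy, qz = q
    return np.array([
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
        [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]])

def horn_similarity(X, Y):
    """Closed-form (rho, R, t) such that Y ~ rho * R X + t."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    S = Xc.T @ Yc                        # S[a, b] = sum_j Xc[j, a] * Yc[j, b]
    N = np.array([
        [S[0,0]+S[1,1]+S[2,2], S[1,2]-S[2,1], S[2,0]-S[0,2], S[0,1]-S[1,0]],
        [S[1,2]-S[2,1], S[0,0]-S[1,1]-S[2,2], S[0,1]+S[1,0], S[2,0]+S[0,2]],
        [S[2,0]-S[0,2], S[0,1]+S[1,0], -S[0,0]+S[1,1]-S[2,2], S[1,2]+S[2,1]],
        [S[0,1]-S[1,0], S[2,0]+S[0,2], S[1,2]+S[2,1], -S[0,0]-S[1,1]+S[2,2]]])
    # The optimal quaternion is the eigenvector of the largest eigenvalue of N.
    _, vecs = np.linalg.eigh(N)          # eigenvalues in ascending order
    R = quat_to_rot(vecs[:, -1])
    rho = np.sqrt((Yc ** 2).sum() / (Xc ** 2).sum())   # symmetric scale
    t = Y.mean(axis=0) - rho * R @ X.mean(axis=0)
    return rho, R, t
```

The output of such a routine would serve as the starting point of the alternating optimization; the anisotropic covariance is not used at this stage.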
3.2 Gaussian-uniform Mixture Model
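Unfortunately, the above statistical model does not behave well in the presence of large residuals, or outliers. For the purpose of explicitly modeling inliers and outliers, a discrete random variable $Z_j$ is associated with each residual $\boldsymbol{e}_j$, and let $z_j \in \{\text{inlier}, \text{outlier}\}$ be a realization of $Z_j$. Now, $\boldsymbol{e}_j$ is drawn either from a zero-centered Gaussian distribution, as above, or from a multivariate uniform distribution:
$$P(\boldsymbol{e}_j | Z_j = z_j) = \begin{cases} \mathcal{N}(\boldsymbol{e}_j; \boldsymbol{0}, \boldsymbol{\Sigma}) & \text{if } z_j = \text{inlier}, \\ \mathcal{U}(\boldsymbol{e}_j; \gamma) = 1/\gamma & \text{if } z_j = \text{outlier}, \end{cases}$$
where $\gamma$ is the volume of the support of the uniform distribution. This yields a two-component mixture model, an inlier component with prior probability $\pi$, and an outlier component with prior probability $1 - \pi$. This naturally leads to solving the problem via expectation-maximization (EM), which alternates between (i) evaluating the posterior probabilities of the residuals to be inliers or outliers, and (ii) minimizing the expected complete-data negative log-likelihood, where the expectation is taken over the realizations of $Z_j$, and where the parameter vector is $\boldsymbol{\theta} = \{\rho, \boldsymbol{q}, \boldsymbol{\Sigma}, \pi\}$. (Note that the translation vector is evaluated outside the EM procedure.) This yields the minimization of:
$$\frac{1}{2} \sum_{j=1}^{J} \alpha_j \big( \| \boldsymbol{Y}_j - \rho \mathbf{R} \boldsymbol{X}_j - \boldsymbol{t} \|^{2}_{\boldsymbol{\Sigma}} + \log |\boldsymbol{\Sigma}| \big) - \sum_{j=1}^{J} \big( \alpha_j \log \pi + (1 - \alpha_j) \log (1 - \pi) \big),$$
where the posterior probability to be an inlier, $\alpha_j$, is:
$$\alpha_j = \frac{\pi \, \mathcal{N}(\boldsymbol{e}_j; \boldsymbol{0}, \boldsymbol{\Sigma})}{\pi \, \mathcal{N}(\boldsymbol{e}_j; \boldsymbol{0}, \boldsymbol{\Sigma}) + (1 - \pi) / \gamma}.$$
The rotation and the scale are then estimated as in (3.1), with each term of the sums weighted by $\alpha_j$.
The prior probability and covariance matrix are estimated with:
$$\pi^{\star} = \frac{1}{J} \sum_{j=1}^{J} \alpha_j, \qquad \boldsymbol{\Sigma}^{\star} = \frac{\sum_{j=1}^{J} \alpha_j \, \boldsymbol{e}_j \boldsymbol{e}_j^{\top}}{\sum_{j=1}^{J} \alpha_j}, \qquad \boldsymbol{e}_j = \boldsymbol{Y}_j - \rho \mathbf{R} \boldsymbol{X}_j - \boldsymbol{t}.$$
We refer to this model as the Gaussian-uniform mixture (GUM) and the associated EM is summarized in Algorithm 1.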
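The alternation between posterior evaluation and parameter updates can be sketched as below. This is a simplified illustration and not Algorithm 1 itself: the rotation and scale are updated with a weighted closed-form similarity fit (an isotropic approximation) instead of the quaternion-constrained solver, the translation is re-estimated at each iteration, and the names `weighted_similarity` and `gum_em`, the initial prior of 0.5 and the small regularization term are all assumptions made for the example.

```python
import numpy as np

def weighted_similarity(X, Y, w):
    """Weighted least-squares (rho, R, t) such that Y ~ rho * R X + t."""
    w = w / w.sum()
    mx, my = w @ X, w @ Y                          # weighted centroids
    Xc, Yc = X - mx, Y - my
    C = (Yc * w[:, None]).T @ Xc                   # weighted cross-covariance
    U, S, Vt = np.linalg.svd(C)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    rho = (S * np.diag(D)).sum() / (w * (Xc ** 2).sum(axis=1)).sum()
    return rho, R, my - rho * R @ mx

def gum_em(X, Y, gamma, n_iter=50):
    """Gaussian-uniform mixture EM; gamma is the volume of the uniform pdf."""
    J = len(X)
    rho, R, t = weighted_similarity(X, Y, np.ones(J))
    E = Y - (rho * X @ R.T + t)
    Sigma, pi = E.T @ E / J, 0.5
    for _ in range(n_iter):
        # E-step: posterior probability of each residual being an inlier.
        E = Y - (rho * X @ R.T + t)
        d2 = np.einsum('ji,ik,jk->j', E, np.linalg.inv(Sigma), E)
        gauss = np.exp(-0.5 * d2) / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(Sigma))
        alpha = pi * gauss / (pi * gauss + (1.0 - pi) / gamma)
        # M-step: prior, rigid mapping (simplified), then covariance.
        pi = alpha.mean()
        rho, R, t = weighted_similarity(X, Y, alpha)
        E = Y - (rho * X @ R.T + t)
        Sigma = (E * alpha[:, None]).T @ E / alpha.sum() + 1e-9 * np.eye(3)
    return rho, R, t, Sigma, pi, alpha
```

The returned `alpha` holds one posterior inlier probability per landmark, which is the per-landmark figure of merit that GUM-EM provides.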
3.3 Generalized Student Model
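Another way to enforce robustness is to use the generalized Student's t-distribution, also known as the Pearson type VII distribution (Sun et al, 2010):
$$\mathcal{P}(\boldsymbol{e}_j; \boldsymbol{\Sigma}, \mu, \nu) = \int_{0}^{\infty} \mathcal{N}(\boldsymbol{e}_j; \boldsymbol{0}, w_j^{-1} \boldsymbol{\Sigma}) \, \mathcal{G}(w_j; \mu, \nu) \, \mathrm{d}w_j = \frac{\Gamma(\mu + 3/2)}{|\boldsymbol{\Sigma}|^{1/2} \, \Gamma(\mu) \, (2 \pi \nu)^{3/2}} \Big( 1 + \frac{\| \boldsymbol{e}_j \|^{2}_{\boldsymbol{\Sigma}}}{2 \nu} \Big)^{-(\mu + 3/2)}, \tag{3.3}$$
where $\mu$ and $\nu$ are the parameters of the prior gamma distribution of $w_j$ and $\Gamma(\cdot)$ is the gamma function. The distribution (3.3) differs from the standard Student's t-distribution in that the weight variable $w_j$, or the precision, is drawn from a gamma distribution with two distinct parameters, $\mu$ and $\nu$, instead of two equal parameters. Notice that in (3.3) $\boldsymbol{\Sigma}$ and $\nu$ appear only through their product, which means that an additional constraint is required to make the parameterization unique. One possibility is to constrain the determinant of the covariance, e.g. $|\boldsymbol{\Sigma}| = 1$, which is equivalent to having an unconstrained $\boldsymbol{\Sigma}$ with $\nu = 1$. Unconstrained parameters are easier to deal with in inference algorithms. Therefore, we will rather assume without loss of generality that $\nu = 1$.
Notice that the posterior distribution of $w_j$ is also a gamma distribution, namely the posterior gamma distribution:
$$P(w_j | \boldsymbol{e}_j) = \mathcal{G}(w_j; a, b_j), \qquad a = \mu + \frac{3}{2}, \qquad b_j = 1 + \frac{1}{2} \| \boldsymbol{e}_j \|^{2}_{\boldsymbol{\Sigma}}.$$
The posterior mean of the weight variable is:
$$\bar{w}_j = \frac{a}{b_j}.$$
As with the Gaussian-uniform model, we need to minimize the expected complete-data negative log-likelihood, and in this case the parameter vector is $\boldsymbol{\theta} = \{\rho, \boldsymbol{q}, \boldsymbol{\Sigma}, \mu\}$ since we set $\nu = 1$. For the rigid-mapping parameters and the covariance, this yields the minimization of:
$$\frac{1}{2} \sum_{j=1}^{J} \bar{w}_j \| \boldsymbol{Y}_j - \rho \mathbf{R} \boldsymbol{X}_j - \boldsymbol{t} \|^{2}_{\boldsymbol{\Sigma}} + \frac{J}{2} \log |\boldsymbol{\Sigma}|.$$
The parameter $\mu$ is updated by solving the following equation, where $\Psi(\cdot)$ is the digamma function and where $a$ and $b_j$ are evaluated with the current parameter values:
$$\Psi(\mu) = \Psi(a) - \frac{1}{J} \sum_{j=1}^{J} \log b_j.$$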
We refer to this model as the generalized Student (GStudent) and the associated EM algorithm is summarized in Algorithm 2.
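The role of the posterior mean of the weight variable can be illustrated with a few lines of NumPy. This is a sketch under the convention that the second gamma parameter is set to one, as stated above; the function name `gstudent_weights` is hypothetical, and the update of the remaining gamma parameter through the digamma equation is deliberately left out.

```python
import numpy as np

def gstudent_weights(E, Sigma, mu):
    """Posterior mean of the precision weight of each residual.

    E: (J, 3) residuals, Sigma: (3, 3) covariance, mu: gamma shape parameter.
    """
    # Squared Mahalanobis norm of each residual.
    d2 = np.einsum('ji,ik,jk->j', E, np.linalg.inv(Sigma), E)
    a = mu + 1.5                  # posterior shape (3D residuals)
    b = 1.0 + 0.5 * d2            # posterior rate, one value per landmark
    return a / b
```

A residual at the origin receives the largest weight, and the weight decays with the squared Mahalanobis norm, so badly located landmarks contribute less to the weighted least-squares criterion.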
3.4 Algorithm Implementation and Analysis
Algorithm 1 and Algorithm 2 are expectation-maximization (EM) procedures, which are known to have good convergence properties. All the computations inside these algorithms are in closed form, with the notable exception of the estimation of the rotation matrix. The latter is parameterized with a unit quaternion, which is estimated via nonlinear constrained optimization. The unit-quaternion parameterization of rotations (Appendix A) has several advantages: (i) the number of parameters to be estimated is reduced from nine to four, (ii) the number of nonlinear constraints is reduced from seven (six quadratic constraints, i.e. $\mathbf{R}^{\top} \mathbf{R} = \mathbf{I}$, and one cubic constraint, i.e. $|\mathbf{R}| = 1$) to one quadratic constraint ($\boldsymbol{q}^{\top} \boldsymbol{q} = 1$), and (iii) the initialization is performed with the closed-form solution of (Horn, 1987), which uses a unit quaternion as well.
In practice, the constrained nonlinear optimization problem (3.1) is solved using the sequential quadratic programming method (Bonnans et al, 2006). More precisely, a sequential least squares programming (SLSQP) solver (https://docs.scipy.org/doc/scipy/reference/optimize.html) is used in combination with a root-finding software package (Kraft, 1988). The SLSQP minimizer found at the previous EM iteration is used as an initial estimate at the current EM iteration. The closed-form method of (Horn, 1987) (please consult Appendix A) is used to initialize the unit-quaternion parameters at the start of the EM algorithm.
Alternative closed-form methods perform singular value decomposition to extract an orthogonal matrix from the measurement matrix, but without the guarantee that the estimated matrix is a rotation, i.e. that its determinant is equal to $+1$. In contrast, the unit-quaternion method of Appendix A amounts to an eigen-decomposition of a $4 \times 4$ positive semi-definite symmetric matrix, a well-known mathematical problem with a straightforward numerical solver.
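The constrained rotation step can be sketched with SciPy's `minimize` and its SLSQP method. This is an illustration under stated assumptions rather than the authors' implementation: the names `quat_to_rot` and `fit_rotation_slsqp` are hypothetical, the scale and the inverse covariance are held fixed during the call, the landmarks are assumed to be centered, and gradients are left to SciPy's finite-difference approximation.

```python
import numpy as np
from scipy.optimize import minimize

def quat_to_rot(q):
    """Rotation matrix associated with a unit quaternion q = (q0, qx, qy, qz)."""
    q0, qx, qy, qz = q
    return np.array([
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
        [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]])

def fit_rotation_slsqp(Xc, Yc, rho, Sigma_inv, q_init):
    """Minimize the Mahalanobis criterion over q subject to q.q = 1."""
    def cost(q):
        E = Yc - rho * Xc @ quat_to_rot(q).T
        return 0.5 * np.einsum('ji,ik,jk->', E, Sigma_inv, E)

    unit_norm = {'type': 'eq', 'fun': lambda q: q @ q - 1.0}
    res = minimize(cost, q_init, method='SLSQP', constraints=[unit_norm],
                   options={'ftol': 1e-12, 'maxiter': 200})
    q = res.x / np.linalg.norm(res.x)      # remove residual constraint drift
    return q, quat_to_rot(q)
```

Inside an EM loop, `q_init` would be the minimizer from the previous iteration, and the closed-form quaternion estimate at the very first iteration.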
4 Analyzing the Robustness of Rigid Mapping
In order to quantify the performance of the proposed robust rigid-mapping algorithms, we devised an experimental protocol on the following grounds. Let $\{\boldsymbol{Y}_j\}_{j=1}^{J}$ be a set of landmarks associated with the frontal view of a face. For each trial, a second set of landmarks is generated by applying a randomly drawn rigid mapping and by adding a residual to each landmark, where the residuals are scaled by $\sigma$, a scalar that controls the level of noise, and where $n$ is the trial index. As described in detail below, the noise level $\sigma$ can be the variance of Gaussian isotropic noise, the total variance of Gaussian anisotropic noise, or the volume of uniformly distributed noise. The landmark coordinates are normalized to lie in a fixed interval. For each noise level, we randomly generate a number of trials, namely rigid mappings and sets of residuals. For each trial we estimate the rigid mapping parameters, $\hat{\rho}_n$, $\hat{\mathbf{R}}_n$, $\hat{\boldsymbol{t}}_n$, and we measure the root mean square error (RMSE) between these estimated parameters and the ground-truth parameters, $\rho_n$, $\mathbf{R}_n$, $\boldsymbol{t}_n$, with one RMSE for each of the scale, the rotation and the translation.
The ground-truth rigid-mapping parameters are generated in the following way. For each trial $n$, the scale $\rho_n$ and the translation vector $\boldsymbol{t}_n$ are generated from uniform distributions. The rotation matrix $\mathbf{R}_n$ is parameterized by the pan, tilt and yaw angles, and it is obtained by randomly generating these three angles from a uniform distribution.
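In order to generate residuals, $\boldsymbol{e}_{jn}$, we simulate three types of noise:
Isotropic Gaussian noise: $\boldsymbol{e}_{jn} \sim \mathcal{N}(\boldsymbol{0}, \sigma \mathbf{I})$, where $\sigma$ is the variance;
Anisotropic Gaussian noise: $\boldsymbol{e}_{jn} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}_n)$, where $\sigma = \mathrm{tr}(\boldsymbol{\Sigma}_n)$ is the total variance, and
Uniform noise: $\boldsymbol{e}_{jn} \sim \mathcal{U}(\boldsymbol{e}; \sigma)$, where $\sigma$ is the volume of the uniform distribution.
In the case of anisotropic noise, a covariance matrix must be randomly generated for each trial. This is done in the following way. Let $\boldsymbol{\Sigma} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top}$, with $\mathbf{Q} \mathbf{Q}^{\top} = \mathbf{I}$ (an orthogonal matrix) and with $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$, where the eigenvalues correspond to the variances along the eigenvectors, i.e. the directions of maximum variance. Let $\sigma = \lambda_1 + \lambda_2 + \lambda_3$ denote the total variance. A sample covariance matrix is simulated by randomly generating an orthogonal matrix $\mathbf{Q}$ and by randomly generating the three eigenvalues from a uniform distribution.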
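The simulation of an anisotropic covariance can be sketched as follows. This is one possible way to realize the description above, not the protocol's actual code: the function name `random_covariance` is hypothetical, the orthogonal matrix is obtained from a QR factorization of a Gaussian matrix, and the rescaling of the uniformly drawn eigenvalues so that they sum to the requested total variance is an assumption.

```python
import numpy as np

def random_covariance(total_variance, rng):
    """Random 3x3 covariance Q diag(lam) Q^T with trace = total_variance."""
    # Random orthogonal matrix from the QR factorization of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    # Eigenvalues drawn uniformly, then rescaled to the requested total variance.
    lam = rng.uniform(0.05, 1.0, size=3)
    lam *= total_variance / lam.sum()
    return Q @ np.diag(lam) @ Q.T

rng = np.random.default_rng(0)
Sigma = random_covariance(0.5, rng)
# Anisotropic Gaussian residuals for one trial with 68 landmarks.
residuals = rng.multivariate_normal(np.zeros(3), Sigma, size=68)
```

Since the sign of the determinant of `Q` does not affect the product, no extra step is needed to turn the orthogonal matrix into a proper rotation here.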
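We tested the following rigid mapping models and associated algorithms:
Horn: the isotropic Gaussian model with the closed-form solution of (Horn, 1987);
Gen-Horn: the Gaussian model with a full covariance, i.e. generalized Horn (Section 3.1);
GUM-EM: the Gaussian-uniform mixture model (Section 3.2, Algorithm 1), and
GStudent-EM: the generalized Student model (Section 3.3, Algorithm 2).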
The experiments were conducted in the following way. For each noise level, we simulated a number of trials for which we computed the RMSEs, namely eqs. (29), (30), and (31). For each trial the landmarks are randomly split into an inlier set and an outlier set. The first experiment determines the percentage of outliers that can be handled by the robust algorithms (Figure 2). For this purpose, the percentage of outliers is increased from 10% to 60%. The inlier noise is drawn from an anisotropic Gaussian distribution with a fixed total variance. The outlier noise is drawn from a uniform distribution with a fixed amplitude (remember that the landmark coordinates are normalized). The curves plotted in Figure 2 show that the RMSE associated with the non-robust methods, i.e. Horn and Gen-Horn, increases monotonically. In contrast, the robust algorithms, GUM-EM and GStudent-EM, have a radically different behavior: after a short increase, the RMSE remains constant, and then it increases again.
In the other experiments, the number of inliers was set to be equal to the number of outliers and we experimented with the three noise types already mentioned. Figure 3 shows the RMSEs when inlier noise is drawn from an isotropic Gaussian distribution with a fixed variance, while outlier noise is drawn from a uniform distribution whose volume is progressively increased. Similarly, Figure 4 shows the RMSEs for the case when inlier noise is drawn from an anisotropic Gaussian distribution with a fixed total variance, while outlier noise is drawn from a uniform distribution whose volume is progressively increased. Finally, Figure 5 shows the RMSEs when inlier noise is drawn from an anisotropic Gaussian distribution with a fixed total variance, while outlier noise is drawn from an anisotropic Gaussian distribution whose total variance is progressively increased.
These experiments clearly show that the two classes of methods (non-robust and robust) behave differently. The performance of non-robust rigid mapping decreases monotonically in the presence of outliers with increasing levels of noise. The robust methods can deal with up to 50% of outliers affected by a substantial noise level (1.5 times the size of the image). There is no evidence that the Gen-Horn algorithm performs better than the standard Horn algorithm. Nevertheless, Gen-Horn provides interesting information about the 3D structure of the estimated anisotropic covariance. The GUM-EM algorithm performs slightly better than the GStudent-EM algorithm, in particular in the presence of outliers drawn from a uniform distribution.
5 Measuring the Performance of 3D Face Alignment
In this section we describe an unsupervised methodology for quantitatively assessing the performance of 3DFA algorithms. The idea of the proposed benchmarking is to apply 3DFA to a dataset of face images in order to extract 3D landmarks, to robustly estimate the rigid transformation that maps these facial landmarks onto a 3D landmark model, and to analyze the discrepancy between the extracted 3D landmarks and the model. Based on a confidence score, it is then possible to decide whether a landmark is correctly localized or not. This makes it possible to assess the overall performance of a 3DFA algorithm as well as its behavior with respect to various perturbations, such as occlusions or motion blur.
5.1 Neutral Frontal Landmark Model
We start by computing a neutral frontal landmark model in the following way. For this purpose, we use a dataset $\mathcal{D}_1$ of $N_1$ images of neutral faces (frontal viewing, no expression and no interfering object causing occlusion) and we extract $J$ landmarks from each one of these faces, $\{\boldsymbol{X}_{nj}\}_{j=1}^{J}$. Then we use the landmark coordinates to compute the directions of maximum variance (or the principal components) of each face. By aligning these directions over the dataset, we compute a mean for each landmark, namely
$$\boldsymbol{Z}_j = \frac{1}{N_1} \sum_{n=1}^{N_1} \tilde{\boldsymbol{X}}_{nj}, \tag{32}$$
where $\tilde{\boldsymbol{X}}_{nj}$ denotes landmark $j$ of face $n$ after this alignment.
5.2 Statistical Frontal Landmark Model
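We now explain how a statistical frontal landmark model is built, namely $\{\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=1}^{J}$, where $\{\boldsymbol{\mu}_j\}$ is the set of means and $\{\boldsymbol{\Sigma}_j\}$ is the set of covariance matrices associated with the statistical frontal landmark model. For this purpose, we use another dataset, $\mathcal{D}_2$, that contains $N_2$ images of faces with the following characteristics: arbitrary poses, arbitrary expressions, both speaking and silent faces, but with no external sources of perturbation such as the presence of interfering objects that may cause occlusions. We extract 3D landmarks from these images using a 3DFA algorithm, namely $\{\boldsymbol{X}_{nj}\}_{j=1}^{J}$ for each image $n$, and we use either GUM-EM (Algorithm 1) or GStudent-EM (Algorithm 2) to robustly estimate the rigid transformation between each landmark set $\{\boldsymbol{X}_{nj}\}_{j=1}^{J}$ and the neutral frontal landmark set $\{\boldsymbol{Z}_j\}_{j=1}^{J}$. Based on this, we obtain $N_2$ sets of rigid-mapping parameters (one for each image $n$), i.e. scale factors, rotations and translations, $\{\hat{\rho}_n, \hat{\mathbf{R}}_n, \hat{\boldsymbol{t}}_n\}_{n=1}^{N_2}$, where the hat denotes an estimate obtained with a robust algorithm, namely either GUM-EM or GStudent-EM. We recall that both algorithms provide a figure of merit for each landmark: posterior probabilities $\alpha_{nj}$ in the case of GUM-EM and precisions $\bar{w}_{nj}$ in the case of GStudent-EM. Applying one of these robust rigid-alignment methods provides frontal landmarks, $\hat{\boldsymbol{Y}}_{nj}$, namely:
$$\hat{\boldsymbol{Y}}_{nj} = \hat{\rho}_n \hat{\mathbf{R}}_n \boldsymbol{X}_{nj} + \hat{\boldsymbol{t}}_n.$$
There are two different expressions for the posterior means and posterior covariances, for GUM-EM and for GStudent-EM, respectively. In both cases the posterior mean is a weighted average of the frontal landmarks, namely
$$\boldsymbol{\mu}_j = \frac{\sum_{n=1}^{N_2} \alpha_{nj} \hat{\boldsymbol{Y}}_{nj}}{\sum_{n=1}^{N_2} \alpha_{nj}} \tag{34}$$
for GUM-EM and
$$\boldsymbol{\mu}_j = \frac{\sum_{n=1}^{N_2} \bar{w}_{nj} \hat{\boldsymbol{Y}}_{nj}}{\sum_{n=1}^{N_2} \bar{w}_{nj}} \tag{36}$$
for GStudent-EM, and the posterior covariances $\boldsymbol{\Sigma}_j$ are the correspondingly weighted scatter matrices of the frontal landmarks about these means.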
5.3 Unsupervised Confidence Test
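We now develop an unsupervised (statistical) confidence test for assessing whether the accuracy of a landmark, i.e. its 3D coordinates, is within (inlier) or outside (outlier) an expected range (Savage, 1972). Let us drop the algorithm superscript and let $\boldsymbol{\Sigma}_j = \mathbf{Q}_j \boldsymbol{\Lambda}_j \mathbf{Q}_j^{\top}$ be the eigen factorization of $\boldsymbol{\Sigma}_j$, where $\mathbf{Q}_j$ is an orthonormal matrix and $\boldsymbol{\Lambda}_j = \mathrm{diag}(\lambda_{j1}, \lambda_{j2}, \lambda_{j3})$ is a diagonal matrix containing the eigenvalues. We can now project each landmark on the space spanned by the three eigenvectors of this matrix:
$$\tilde{\boldsymbol{Y}}_{nj} = \mathbf{Q}_j^{\top} (\hat{\boldsymbol{Y}}_{nj} - \boldsymbol{\mu}_j).$$
Landmark $j$ of sample $n$ is accepted as an inlier if $\tilde{\boldsymbol{Y}}_{nj} = (\tilde{y}_{nj1}, \tilde{y}_{nj2}, \tilde{y}_{nj3})^{\top}$ lies inside the ellipsoid whose half-axes are three times the standard deviations, i.e. $3 \sqrt{\lambda_{jk}}$, $k = 1, 2, 3$, where $\lambda_{jk}$ are the eigenvalues of $\boldsymbol{\Sigma}_j$. The confidence test therefore writes:
$$\sum_{k=1}^{3} \frac{\tilde{y}_{njk}^{2}}{9 \lambda_{jk}} \le 1. \tag{40}$$
Based on this confidence test, we can now build a confidence test accuracy (the higher the better) associated with a sample $n$, namely:
$$u_n = \frac{1}{J} \sum_{j=1}^{J} \mathbb{1} \Big( \sum_{k=1}^{3} \frac{\tilde{y}_{njk}^{2}}{9 \lambda_{jk}} \le 1 \Big), \tag{42}$$
where $\mathbb{1}(\cdot)$ denotes the indicator function. For a test dataset composed of $N$ samples, one can then compute the mean confidence test accuracy (the higher the better):
$$\bar{u} = \frac{1}{N} \sum_{n=1}^{N} u_n. \tag{43}$$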
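The confidence test and the per-sample accuracy can be sketched as below. This is a minimal illustration with hypothetical names (`confidence_accuracy`, `Y_hat`, `means`, `covs`); it assumes that the landmarks have already been mapped onto the frontal model and that the statistical model is stored as one mean and one covariance per landmark.

```python
import numpy as np

def confidence_accuracy(Y_hat, means, covs, k=3.0):
    """Fraction of landmarks inside their k-standard-deviation ellipsoid.

    Y_hat: (J, 3) frontalized landmarks, means: (J, 3), covs: (J, 3, 3).
    """
    # Batched eigen factorization: lam is (J, 3), Q is (J, 3, 3).
    lam, Q = np.linalg.eigh(covs)
    # Project each centered landmark onto its own eigenvectors: Q_j^T (Y - mu).
    proj = np.einsum('jki,jk->ji', Q, Y_hat - means)
    inside = ((proj ** 2) / (k ** 2 * lam)).sum(axis=1) <= 1.0
    return inside.mean(), inside
```

The test is equivalent to requiring that the Mahalanobis distance between a mapped landmark and its model mean does not exceed `k`, so the eigen factorization could be replaced by a linear solve if the per-axis projections are not needed. Averaging the first returned value over a dataset gives the mean confidence test accuracy.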
5.4 Supervised Metrics
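In general, datasets of faces come with their ground truth, and we denote with $\{\boldsymbol{X}^{\mathrm{gt}}_{nj}\}_{j=1}^{J}$ the set of ground-truth landmarks associated with sample $n$ of the test dataset. We modify (1) to be able to build a metric that counts the proportion of inliers, namely the ground-truth accuracy (the higher the better):
$$s_n = \frac{1}{J} \sum_{j=1}^{J} \mathbb{1} \Big( \frac{\| \boldsymbol{X}_{nj} - \boldsymbol{X}^{\mathrm{gt}}_{nj} \|}{d_n} \le \varepsilon \Big), \tag{44}$$
where $\varepsilon$ is a user-defined threshold that corresponds to the quality of the ground-truth landmarks. Based on this we can compute the mean ground-truth accuracy (S):
$$\bar{s} = \frac{1}{N} \sum_{n=1}^{N} s_n. \tag{45}$$
Finally, another interesting metric is the correlation coefficient between the above unsupervised and supervised metrics:
$$r = \frac{\sum_{n=1}^{N} (u_n - \bar{u})(s_n - \bar{s})}{\sqrt{\sum_{n=1}^{N} (u_n - \bar{u})^{2}} \sqrt{\sum_{n=1}^{N} (s_n - \bar{s})^{2}}}. \tag{46}$$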
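The supervised accuracy and its correlation with the unsupervised score can be sketched as follows. The names `ground_truth_accuracy` and `pearson` are hypothetical, the inter-ocular distance `iod` and the threshold `eps` are assumed to be chosen by the user, and the correlation is assumed to be the Pearson coefficient.

```python
import numpy as np

def ground_truth_accuracy(pred, gt, iod, eps):
    """Proportion of landmarks whose normalized error is at most eps."""
    err = np.linalg.norm(pred - gt, axis=1) / iod
    return (err <= eps).mean()

def pearson(u, s):
    """Pearson correlation between per-sample scores u and s."""
    u, s = np.asarray(u, dtype=float), np.asarray(s, dtype=float)
    du, ds = u - u.mean(), s - s.mean()
    return (du @ ds) / np.sqrt((du @ du) * (ds @ ds))
```

Here `u` would hold the per-sample confidence test accuracies and `s` the per-sample ground-truth accuracies computed over the same test samples; the coefficient is undefined when either score is constant over the dataset.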
6 Experimental Results
6.1 Neutral Frontal Landmark Model
The neutral frontal landmark model was trained in the wild by harvesting web images and using a face detector and a head-pose estimator in order to select frontal faces. These images were visually inspected to guarantee shape and aspect variabilities as well as neutral facial expressions. This process yields the dataset of neutral frontal faces. We used the 3DFA method of (Feng et al, 2018) to extract landmarks from each face in the dataset. Next, we aligned them (please consult Section 5.1) and computed the landmark means using (32). Figure 6 shows a few examples of images from this dataset as well as the detected landmarks. Figure 7 shows the neutral frontal landmark model thus obtained.
6.2 Statistical Frontal Landmark Model
The statistical frontal landmark model was trained from the YawDD dataset (Abtahi et al, 2014). This dataset contains 322 videos, hence a very large number of images. The face images in this dataset have large variabilities in terms of face shapes, face aspects, head poses and facial expressions. All the images were processed with no human intervention, namely: face detection, 3D face alignment, and robust rigid alignment with the neutral face landmarks just described. This yields the statistical face landmark model described in Section 5.2. For that purpose we used two 3DFA methods and the two robust alignment algorithms described in this paper. Hence, there are four possible combinations that we used to train four different models, one for each pairing of a 3DFA method with either GUM-EM or GStudent-EM.
Figure 8 shows the statistical frontal landmark models obtained with these four combinations. In this figure, the dots correspond to the posterior means, i.e. (34) and (36), while the elliptical regions correspond to image projections of the ellipsoids defined by (40).
6.3 Performance Evaluation of 3D Face Alignment
Once the neutral frontal and statistical frontal models are computed using datasets and , respectively, we use a third dataset, , to empirically assess the performance of 3DFA algorithms using the unsupervised confidence test introduced in Section 5.3. For this purpose we used AFLW2000-3D (Zhu et al, 2016) as a test dataset, consisting of 2,000 images with large-pose variations. More precisely, the yaw angles (vertical axis of rotation) in the following intervals for 1,306 faces, in the interval for 462 faces and in the interval for 232 faces. The dataset contains a large variety of facial shapes, facial expressions and illuminations conditions. Moreover, there are many faces with partial occlusions caused by the presence of hair, hands, handheld objects, glasses, etc. Notice that large poses induce partial occlusions as well.
Each image in the AFLW2000-3D dataset is annotated with 68 3D landmarks. This semi-automatic annotation is performed by fitting a 3D deformable model to a dataset of 2D face images, e.g. (Ghiasi and Fowlkes, 2014). Nevertheless, as noted in (Bulat and Tzimiropoulos, 2017), many of the annotated landmarks in this dataset have large localization errors, especially in the case of profile views. Hence, performance evaluation based on supervised metrics are prone to errors. In (Bulat and Tzimiropoulos, 2017) it is visually shown that in these extreme poses their 3DFA method yields more precise landmark localization than the automatically annotated ones. Based on these observations, we applied our unsupervised performance analysis to the annotated landmarks as well, yielding the following combinations:
The results based on computing the mean confidence-test accuracies, i.e. (43) are summarized in Table 1. We remind that we used different datasets for training the neutral and statistical face landmark models, i.e. and , and for assessing the performance of the various combinations of 3DFA methods and robust-rigid mappings, i.e. . The means, evaluated over the confidence-test scores obtained with the annotated (ground-truth) landmarks (last two rows of Table 1), are equal to and to , respectively, which seems to confirm that the ground-truth landmark locations in the AFLW2000-3D dataset contain a substantial amount of errors and that, overall, both 3DFA methods that we analyzed, (Bulat and Tzimiropoulos, 2016) and (Feng et al, 2018), predict landmark locations that are more accurate than the ground-truth locations themselves.
We now compute correlation coefficients, i.e. (46), between the unsupervised and supervised metrics, i.e. (42) and (44). The results are reported in Table 2. Notice, however, that the supervised scores depend on the choice of the parameter . As done in (Bulat and Tzimiropoulos, 2017), this parameter was adjusted to eliminate samples yielding a very low score. With in (45) we obtained the following scores:
These scores are comparable with the scores reported in (Bulat and Tzimiropoulos, 2017) which uses a different normalization parameter. Notice that the scores obtained with the proposed confidence test, i.e. Table 1, are higher than these scores.
In light of these results, we attempted to analyse the effect of eliminating inaccurate ground-truth landmark annotations from the benchmark just described. Let be the subset of samples satisfying , where denotes the value of the unsupervised confidence-score associated with ground-truth landmarks of face sample . We see that when is increased, the accuracy of the ground-truth landmark annotations contained in the subset increases as well, at the price of drastically decreasing the number of samples, which in turn lowers the statistical significance of the resulting scores. The correlation coefficient is then computed with the following formula:
Figure 9 (left) shows the correlation as a function of where the 3DFA1/GStudent-EM method was used to build the confidence test, i.e. Fig. 8(c), while Figure 9 (right) shows the corresponding p-value (with a significance level of ): the smaller the p-value, the more statistically significant the correlation. The red dots in these plots correspond to p-values that do not satisfy the significance level.
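The threshold sweep behind such a plot can be sketched as follows. This is a minimal illustration, not the exact procedure of the paper. It assumes that the correlation is a Pearson coefficient between per-sample unsupervised and supervised scores, and that a sample is kept when its ground-truth confidence score is greater than or equal to the threshold. The function names `pearson` and `correlation_vs_threshold`, and the `min_samples` guard, are hypothetical. The p-value is not computed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def correlation_vs_threshold(gt_conf, unsup, sup, thresholds, min_samples=3):
    """Sweep a threshold on the ground-truth confidence scores.

    For each threshold t, keep the samples whose ground-truth confidence
    score is >= t (assumed selection rule) and correlate the unsupervised
    and supervised scores on that subset. Returns a list of tuples
    (t, subset_size, r), with r = None when the subset is too small or
    the correlation is undefined.
    """
    results = []
    for t in thresholds:
        keep = [i for i, c in enumerate(gt_conf) if c >= t]
        r = None
        if len(keep) >= min_samples:
            try:
                r = pearson([unsup[i] for i in keep], [sup[i] for i in keep])
            except ValueError:
                r = None  # constant scores on this subset
        results.append((t, len(keep), r))
    return results


if __name__ == "__main__":
    # Toy scores, purely illustrative (not data from the paper).
    gt_conf = [0.20, 0.50, 0.80, 0.90, 0.95]
    unsup = [0.10, 0.90, 0.30, 0.50, 0.70]
    sup = [0.90, 0.10, 0.30, 0.50, 0.70]
    for t, n, r in correlation_vs_threshold(gt_conf, unsup, sup, [0.0, 0.6, 0.93]):
        print(t, n, r)
```

Raising the threshold shrinks the subset, which is why the sketch reports the subset size next to each coefficient: a high correlation computed on a handful of samples carries little statistical weight.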
In the light of these experiments, we conclude that the proposed methodology for assessing the performance of 3DFA methods is not biased by the quality of landmark annotation, whether the latter is automatic or human-assisted. The experiments suggest that the proposed unsupervised methodology could be used (i) to assess the quality of landmark annotation itself and (ii) to remove badly annotated landmarks.
We now illustrate the proposed performance analysis method with a few examples. Figure 10 shows the results obtained with 3DFA1/GStudent-EM, i.e. (Bulat and Tzimiropoulos, 2016), applied to six samples from the AFLW2000-3D dataset, using the statistical face model built with 3DFA1/GStudent-EM. Figure 11 shows the results obtained with 3DFA2/GStudent-EM, i.e. (Feng et al, 2018), applied to the same samples and using the same statistical face model as in Figure 10.
Finally, we illustrate the results obtained by applying GStudent-EM to the ground-truth landmarks associated with the AFLW2000-3D dataset (Zhu et al, 2016), or GT/GStudent-EM. Some examples are shown in Figure 12 (best scores) and in Figure 13 (worst scores). The results reported in Table 2 and these examples show that the ground-truth annotations should be used with caution.
|Alignment method||Statistical face models trained with and datasets:|
|using dataset :||3DFA1/GUM-EM||3DFA2/GUM-EM||3DFA1/GStudent-EM||3DFA2/GStudent-EM||Mean|
|Alignment method||Statistical face models trained with and datasets:|
|using dataset :||3DFA1/GUM-EM||3DFA2/GUM-EM||3DFA1/GStudent-EM||3DFA2/GStudent-EM||Mean|
We presented a method for analyzing the performance of 3DFA algorithms. Due to the large spectrum of 3DFA formulations, ranging from 3D deformable-model fitting to discriminative deep learning, it is difficult to compare their respective merits on the basis of formal mathematical and algorithmic analyses. Instead, we adopted an empirical data-driven evaluation. To date, performance analysis relies on annotated datasets. In the case of 3DFA, these annotations correspond to the 3D coordinates of a set of pre-defined facial landmarks. This annotation process, be it manual, semi-automatic, or automatic, is prone to errors, which biases the analysis.
In contrast, we proposed a method that bypasses annotations. This is possible because, in the case of facial landmark detection and localization, there is an underlying rigid transformation between the landmarks of a face, with unknown pose and expression, and the landmarks of a frontally-viewed face. If this rigid transformation can be estimated, it can be used to map an unknown face onto a model face. This expression-preserving rigid mapping can subsequently be used to measure face-to-model discrepancies. We proposed such an approach based on well-founded statistical models. This led to an unsupervised parametric confidence test that yields a confidence score well suited to analyzing the performance of 3DFA algorithms.
Because the proposed methodological pipeline makes use of 3DFA and of robust rigid mapping, one may argue that the proposed performance analysis metric is biased. We empirically showed that the analysis is agnostic to various combinations of methods used for building the model and for the test itself. Moreover, we showed that the method could also be used to assess the quality of facial landmark annotations, in particular those annotations that are obtained automatically based on 3D deformable-model fitting. We therefore conclude that the proposed performance analysis methodology yields an interesting framework for assessing the repeatability and reliability of the predictions obtained with 3DFA algorithms.
Appendix A Closed-Form Solution Using Unit Quaternions
Consider (14) with . We immediately obtain the following formulas for the model parameters:
The formula for the posteriors becomes:
It is well known that a rotation matrix can be parameterized by a unit quaternion (Horn, 1987). Let be parameterized by its axis and angle of rotation, , and . The unit quaternion parameterizing the rotation is:
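In standard notation, which may differ from the symbols used in this appendix, a rotation by an angle $\theta$ about a unit axis $\boldsymbol{n}$ corresponds to the unit quaternion $\boldsymbol{q} = (\cos(\theta/2), \sin(\theta/2)\,\boldsymbol{n})$. The following Python sketch illustrates this construction and the conversion back to a rotation matrix. The function names and the scalar-first ordering $(w, x, y, z)$ are assumptions made for the example.

```python
import math


def axis_angle_to_quaternion(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`.

    The axis is normalized internally, so it only needs to be non-zero.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("rotation axis must be non-zero")
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)


def quaternion_to_rotation_matrix(q):
    """3x3 rotation matrix (list of rows) associated with a unit quaternion."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def rotate(R, v):
    """Apply the 3x3 matrix R to the 3-vector v."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    # A quarter turn about the z-axis should map the x-axis onto the y-axis.
    q = axis_angle_to_quaternion((0.0, 0.0, 1.0), math.pi / 2)
    R = quaternion_to_rotation_matrix(q)
    print(q)
    print(rotate(R, [1.0, 0.0, 0.0]))
```

The quaternion has unit norm by construction, and $\boldsymbol{q}$ and $-\boldsymbol{q}$ represent the same rotation.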