Unsupervised Performance Analysis of 3D Face Alignment

by Mostafa Sadeghi, et al.

We address the problem of analyzing the performance of 3D face alignment (3DFA) algorithms. Traditionally, performance analysis relies on carefully annotated datasets. Here, these annotations correspond to the 3D coordinates of a set of pre-defined facial landmarks. However, this annotation process, be it manual or automatic, is rarely error-free, which strongly biases the analysis. In contrast, we propose a fully unsupervised methodology based on robust statistics and a parametric confidence test. We revisit the problem of robust estimation of the rigid transformation between two point sets and we describe two algorithms, one based on a mixture between a Gaussian and a uniform distribution, and another one based on the generalized Student's t-distribution. We show that these methods are robust to up to 50% outliers, which makes them suitable for mapping a face, from an unknown pose to a frontal pose, in the presence of facial expressions and occlusions. Using these methods in conjunction with large datasets of face images, we build a statistical frontal facial model and an associated parametric confidence metric, eventually used for performance analysis. We empirically show that the proposed pipeline is neither method-biased nor data-biased, and that it can be used to assess both the performance of 3DFA algorithms and the accuracy of annotations of face datasets.





1 Introduction

The problem of face alignment (FA) is that of detecting and localizing facial landmarks from a single RGB image. Face alignment is an important research topic as it provides input to a variety of computer vision tasks, such as head-pose estimation and tracking, face recognition, facial expression understanding, visual speech recognition, etc. (Escalera et al, 2018; Loy et al, 2019). 2D face alignment (2DFA) has been extensively studied over the last decades, yielding a plethora of methods and algorithms (Wu and Ji, 2019). State-of-the-art 2DFA methods based on deep neural networks (DNNs) are the best-performing ones in terms of accuracy and invariance with respect to facial appearance, shape and expression, as well as in terms of repeatability and reproducibility in the presence of image noise, varying image resolution, motion blur, lighting conditions and varying backgrounds.

Nevertheless, 2DFA methods yield poor landmark detection and localization performance in the presence of occlusions, which occur in the case of large poses induced by out-of-image-plane head rotations (self-occlusions), as well as in the presence of various objects in the camera field of view, such as glasses, hair, hands and handheld objects. Robust facial landmark detection and localization in the presence of occlusions can only be achieved if 3D information is taken into account. It is well established that 2D facial landmarks (and, more generally, face images) embed 3D information. This information can be retrieved by fitting a 3D face model to a 2D face image, even if the latter is only partially visible. The process of fitting a 3D model to a 2D image constitutes the basis of training 3D face alignment (3DFA) algorithms.

Consider for example a 3D face model that is parameterized both by identities and by facial deformations, e.g. the 3D morphable model (3DMM) (Blanz and Vetter, 1999). The task of fitting a 3DMM to an RGB image of a face consists of estimating the parameters of the mapping from the 3D generic model to a particular face, namely the identity and expression parameters, as well as the pose parameters (scale, rotation, translation and projection), e.g. (Gou et al, 2016; Zhu et al, 2016). Once an optimal set of parameters is found, one can associate 3DMM vertices with facial landmarks. This forms the basis of many automatic and semi-automatic methods for annotating 2D faces with 3D landmarks.

Nevertheless, the fitting task just mentioned is a difficult nonlinear optimization problem, in particular in the presence of large poses and occlusions. In the recent past, a number of methods have been developed to perform this 3D-to-2D fitting process necessary for 3D facial landmark annotation. The performance of the vast majority of existing 3DFA methods relies on the quality of landmark annotation. This is true for training using modern discriminative deep learning methods, but it is true for testing as well. Indeed, to date, algorithm performance is computed empirically by measuring the error between the predicted output and the corresponding ground truth, e.g. (Jeni et al, 2016). Under these circumstances, annotation errors are likely to bias both parameter estimation (training) and performance evaluation (testing).

There is a lack of a benchmarking methodology that could assess quantitatively and in a completely unsupervised manner the robustness and effectiveness of 3DFA algorithms, namely a method that computes a confidence score that measures algorithm performance in the absence of the ground truth. This is also crucial in order to decide, without human intervention, whether a 3DFA method, when applied to an unknown image of a face with no annotation available, yields an output that is accurate enough to be further used by other algorithms, such as head-gaze estimation, facial expression analysis or lip reading.

(a) 3D landmarks extracted with (Bulat and Tzimiropoulos, 2016)
(b) Ground-truth 3D landmarks from the AFLW2000-3D dataset (Zhu et al, 2016)
Figure 1: Two examples from the AFLW2000-3D dataset (Zhu et al, 2016) (left). The landmarks are mapped onto a statistical frontal landmark model (right) built using the YawDD dataset (Abtahi et al, 2014) and (Feng et al, 2018), which makes it possible to verify whether the mapped landmarks lie within their associated ellipsoidal confidence regions or not.

This paper proposes a methodological framework for assessing the performance of 3DFA algorithms based on robust statistics and a parametric confidence test. Unlike supervised metrics, currently in use for 3DFA performance evaluation and based on annotated datasets, the proposed method is fully unsupervised. We show that the robust estimation of the rigid mapping between two sets of 3D facial landmarks, one set associated with a face in an unknown orientation and with an unknown expression, and another set associated with a frontal face, provides an extremely reliable way to separate face pose (due to head motions) from non-rigid face deformations (due to facial expressions), all in the presence of badly located landmarks.

Using a 3DFA algorithm and a very large and unannotated dataset of face images with large variabilities in orientation, expression and identity, we make use of the robust rigid-mapping methodology to build a statistical frontal landmark model and a parametric confidence score. Based on this pipeline, the proposed performance evaluation protocol proceeds as follows. First, 3D landmarks are extracted from a face image. Second, the landmarks are rigidly mapped onto the frontal model. Third, a confidence score is computed for each mapped landmark, thus making it possible to assess whether the landmark lies within a confidence region or not.
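The third step of the protocol above admits a compact numerical sketch: a mapped landmark is accepted if its squared Mahalanobis distance to the model landmark falls below a chi-square quantile. The function name, the chi-square threshold (which assumes Gaussian residuals) and the 0.95 level are our illustrative choices, not the paper's:

```python
import numpy as np
from scipy.stats import chi2

def in_confidence_region(p, mean, cov, level=0.95):
    """Return True if mapped landmark p (shape (3,)) lies inside the
    ellipsoidal confidence region of its model landmark (mean, cov):
    the squared Mahalanobis distance is compared with a chi-square
    quantile with 3 degrees of freedom."""
    d = p - mean
    maha2 = float(d @ np.linalg.solve(cov, d))
    return maha2 <= chi2.ppf(level, df=3)
```

A landmark mapped close to its posterior mean passes the test, while a badly located landmark falls outside its ellipsoid.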

We describe in detail an experimental evaluation framework that uses several datasets and two 3DFA algorithms. We empirically show that our methodological pipeline is neither dataset- nor 3DFA algorithm-biased. We also show that the proposed framework can be used not only to assess quantitatively the performance of 3DFA algorithms, but also to test the accuracy of automatic and semi-automatic methods currently used for the annotation of face datasets.

The methodology proposed in this paper is illustrated in Fig. 1. The two images (left) are from the AFLW2000-3D dataset (Zhu et al, 2016). The statistical frontal landmark model (right) is built using the 3DFA method of (Feng et al, 2018) and the YawDD dataset (Abtahi et al, 2014). This model characterizes each landmark with an ellipsoidal confidence region centered at a posterior mean. Fig. 1(a): Landmarks extracted using (Bulat and Tzimiropoulos, 2016) (left) and mapped onto the statistical model (right). In this case most of the landmarks lie inside their confidence regions, thus confirming their correctness. Fig. 1(b): Ground-truth landmarks obtained with a semi-automatic annotation process (Zhu et al, 2016) and mapped onto the statistical model (right). One may notice that in this case many mapped landmarks fall outside their confidence regions. The benefit of the proposed method is twofold: (i) an unsupervised assessment of the quality of the detected landmarks, and (ii) a robust and expression-preserving landmark mapping from an arbitrary pose to a frontal pose.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 reviews statistical rigid-mapping estimation and describes two robust methods. Section 4 empirically analyses the proposed robust rigid-mapping methods. Section 5 proposes a methodological pipeline for building a statistical face model and an associated parametric confidence metric. Section 6 presents extensive experimental results, and Section 7 draws some conclusions. Supplemental material for this paper can be found at https://team.inria.fr/perception/research/upa3dfa/

2 Related Work

It is interesting to note that recently proposed methods for 3DFA lie at the crossroads of deformable shape models, model-based image analysis and neural networks. In order to discuss these links we introduce some mathematical notations and concepts. Let vector denote the ensemble of parameters of a 3D face model (identity, expression and pose), where is the parameter vector space, is the number of parameters, and let denote the image of a face from a set of images of size . One class of 3DFA methods directly learns a mapping from a training dataset of face images and their associated model parameters , e.g. (Zhu et al, 2016; Jourabloo and Liu, 2017; Feng et al, 2018).

Another class of methods proceeds in two steps. First, 2D landmarks are extracted from a face image by learning an image-to-landmark mapping , from a face image to a set of 2D landmarks , and using a training dataset . Second, a 2D-to-3D mapping is estimated, where is a set of 3D landmarks. This mapping can be obtained either by learning, using a training dataset conditioned by a 3D model parameterized by , i.e. , e.g. (Zhao et al, 2016; Bulat and Tzimiropoulos, 2016, 2017), or by direct optimization over of a function that maps a 3D model onto the 2D landmarks, e.g. (Yu et al, 2017).

These 3DFA DNN-based methods use a variety of architectures in order to learn the regression functions , and mentioned above. Given this variety, it is difficult to directly compare them and assess their merits based on the analysis of the underlying DNN concepts and methodologies. Hence, 3DFA algorithm performance should be measured empirically, as is often the case in deep learning.

To date and to the best of our knowledge, there has been a single attempt to benchmark 3DFA (Jeni et al, 2016). In detail, four datasets were specifically gathered, annotated and prepared, and two performance metrics were used for this challenge. The BU-4DFE (Yin et al, 2008) and BP-4D-Spontaneous (Zhang et al, 2014) datasets used a structured-light stereo sensor to capture textured 3D meshes of faces in controlled conditions and with various backgrounds. 2,295 meshes were selected from these datasets and manually annotated with 66 landmarks and with self-occlusion information. Then, 16,065 2D views were synthesized (seven views for each mesh) with yaw and pitch rotations ranging in the intervals and , respectively. Additionally, there were 7,000 frames from the Multi-PIE (Gross et al, 2010) and 541 frames from the Time-Sliced datasets, respectively. Both these datasets contain RGB images gathered with multiple cameras from different viewpoints but with no 3D information; hence a 3D face model is extracted for each image, using the model-based multi-view structure-from-motion technique of (Jeni et al, 2017). As above, each 3D face model was annotated with 66 landmarks and with self-occlusion information.

Moreover, the following metrics were used: the ground-truth error (GTE) and the cross-view ground-truth consistency error (CVGTCE), namely,


where denotes the -th detected 3D landmark associated with test sample , is the corresponding ground-truth 3D landmark, is the inter-ocular distance of the sample face , is the number of landmarks, and , , and are the scale factor, rotation matrix and translation vector associated with a rigid mapping that compensates for the possible discrepancy between the set of detected landmarks and the set of ground-truth landmarks.
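As a concrete reading of the GTE definition, the sketch below computes the mean landmark error normalized by the inter-ocular distance; the function name and array shapes are our assumptions (CVGTCE would additionally apply the best rigid mapping before measuring the error):

```python
import numpy as np

def ground_truth_error(detected, ground_truth, interocular):
    """GTE for one test sample: mean Euclidean distance between detected
    and ground-truth 3D landmarks, normalized by the inter-ocular
    distance. detected, ground_truth: (L, 3) arrays; interocular: scalar."""
    dists = np.linalg.norm(detected - ground_truth, axis=1)
    return dists.mean() / interocular
```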

These metrics require a test dataset with ground-truth 3D landmarks, which may be prone to errors, whether the annotations are performed manually or via an automatic 2D-to-3D fitting process, e.g. (Bulat and Tzimiropoulos, 2017). They cannot provide a precise account of how 3DFA is affected by large poses, by non-rigid facial deformations, or by occlusions. Nor do they provide a detailed analysis of a particular landmark or of a group of landmarks used in tasks that are likely to require highly accurate landmark detection and localization, e.g. landmarks located on the lips, needed for visual voice activity detection, visual speech enhancement and visual speech recognition.

In contrast, the proposed methodology does not make use of ground-truth annotations for assessing the performance of 3DFA algorithms. We use robust rigid mapping to build frontal landmark models of faces in a completely unsupervised way and we propose a statistical confidence score to assess whether the landmarks associated with a test face have been accurately detected or not. The methodology can indifferently be applied to landmarks obtained with 3DFA algorithms or with an annotation process, be it manual, semi-automatic or fully automatic. Moreover, based on the proposed statistical score, it is possible to remove badly located landmarks from an annotated dataset.

3 Robust Rigid Mapping

Let us consider the mapping between two sets of facial landmarks, and . In the general case, this mapping is composed of a rigid transformation, i.e. head motion, and of non-rigid facial deformations. We associate an additive error vector (or residual) with each landmark , , to account for the non-rigid component of the mapping and for various sources of errors. Without loss of generality, it is assumed that the set is associated with an unknown face with an unknown pose and that the set is associated with the frontal view of a prototype face. Therefore, we seek a mapping from an arbitrary face pose to a frontal face pose. This mapping can be modeled in the following way:


where the rigid transformation is parameterized by a scale factor , a rotation matrix and a translation vector . If we assume that the residuals are independent and identically distributed (i.i.d), the problem of estimating the rigid-transformation parameters can be solved via log-likelihood maximization or, equivalently, via negative log-likelihood minimization, , with:



where is the probability density function (pdf) of the residuals, parameterized by , which is composed of , , and of the pdf parameters.
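The generative model above, a scale factor and a rotation followed by a translation, with an additive per-landmark residual, can be made concrete in a few lines (a sketch; names are ours):

```python
import numpy as np

def rigid_map(X, s, R, t):
    """Apply the rigid transformation y_j = s * R @ x_j + t to every
    landmark in X, an (L, 3) array of 3D points."""
    return s * X @ R.T + t
```

Given observed landmarks Y, the residuals are then simply `Y - rigid_map(X, s, R, t)`, capturing both non-rigid deformation and measurement error.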

3.1 Gaussian Model

The simplest statistical model is to assume that the residuals follow a zero-centered Gaussian distribution with covariance matrix , namely . By developing (4) and ignoring terms that do not depend on the model parameters, we obtain:


where is the squared Mahalanobis norm of . The minimization of (5) over yields:


where the over-script indicates the optimal value of a parameter and with the notations:


By substituting (6) into (5) and by using centered coordinates, i.e. , , we obtain:


Standard approaches to the minimization of (8) with respect to the rotation matrix assume an isotropic covariance, . Indeed, the development of (8) yields

thus leading to closed-form solutions, e.g. (Horn, 1987; Horn et al, 1988; Faugeras and Hebert, 1986; Arun et al, 1987; Umeyama, 1991). Nevertheless, the isotropic-covariance assumption is barely valid in practice. In the case of a full covariance, the optimization becomes


where is the trace operator and with the notations , . A rotation matrix must satisfy and . This yields a constrained nonlinear optimization problem. An elegant formulation consists of parameterizing the rotation with a unit quaternion, thus reducing the number of parameters from 9 to 4 and the number of constraints from 7 to 1. Let , where is a unit quaternion (please consult Appendix A). Using this representation, the rotation is described by four parameters and the associated constrained optimization problem is written as:


Similar to (Horn, 1987) the optimal scale factor is obtained in closed form:


Finally, the optimal covariance is estimated with:


Once the rotation and scale are initialized using the unit-quaternion closed-form method of (Horn, 1987), alternating optimization can be used to iterate through (3.1), (11) and (12). We will refer to this method as generalized Horn.
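The unit-quaternion parameterization used here (and again in Section 3.4) can be made explicit with the standard quaternion-to-rotation conversion, with q = (w, x, y, z):

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion q = (w, x, y, z); q is
    normalized first, so any nonzero quaternion is accepted."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
```

The identity quaternion (1, 0, 0, 0) yields the identity matrix, and a quaternion with w = cos(θ/2), z = sin(θ/2) yields a rotation by θ about the z-axis.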

3.2 Gaussian-uniform Mixture Model

Unfortunately, the above statistical model does not behave well in the presence of large residuals, or outliers. For the purpose of explicitly modeling inliers and outliers, a discrete random variable is associated with each residual , and let be a realization of . Now, is drawn either from a zero-centered Gaussian distribution, as above, or from a multivariate uniform distribution:



where is the volume of the uniform distribution. This yields a two-component mixture model: an inlier component with prior probability , and an outlier component with prior probability . This naturally leads to solving the problem via expectation-maximization (EM), which alternates between (i) evaluating the posterior probabilities of the residuals being inliers or outliers, and (ii) minimizing the expected complete-data negative log-likelihood, , where the expectation is taken over the realizations of , and where the parameter vector is . (Note that the translation vector is evaluated outside the EM procedure.) This yields the minimization of:


where the posterior probability of being an inlier, , is:


and the posterior probability of being an outlier is . The presence of in (14) replaces (7) with:


as well as and from (3.1) with


Hence, (3.1) can be used to estimate the optimal rotation. Moreover, (11) is replaced with:


The prior probability and covariance matrix are estimated with:


We refer to this model as the Gaussian-uniform mixture (GUM) and the associated EM is summarized in Algorithm 1.

Data: Centered point coordinates, i.e. (7). Normalization parameter ;
Initialization of : Use the closed-form solution (Horn, 1987) to evaluate and and then use these parameter values to evaluate using (12) and set ;
while  do
       E-step: Evaluate the posteriors using (15) with ;
       Update the centered coordinates using (16) ;
       M-scale-step: Evaluate using (18);
       M-rotation-step: Estimate via constrained non-linear optimization of (3.1) using (17) ;
       M-covariance-step: Evaluate using (20);
       M-prior-step: Evaluate using (19);
end while
Optimal translation: Evaluate the translation vector using (6);
Result: Estimated scale , rotation , translation , prior , covariance , and posterior probabilities of landmarks .
Algorithm 1 GUM-EM for robust estimation of the rigid transformation between two point sets.
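Algorithm 1 can be sketched numerically as below. To keep the sketch self-contained we make two simplifications the paper does not: the covariance is isotropic, and the rotation M-step uses a weighted SVD (Kabsch) solution instead of the quaternion/SLSQP step; the uniform-volume value and iteration count are arbitrary choices of ours:

```python
import numpy as np

def gum_em(X, Y, volume=8.0, n_iter=50):
    """Simplified GUM-EM: estimate scale s, rotation R, translation t and
    per-landmark inlier posteriors w for the mapping X -> Y (both (L, 3))."""
    L = X.shape[0]
    w = np.ones(L)          # start by trusting every landmark
    pi_in = 0.9             # prior probability of being an inlier
    for _ in range(n_iter):
        # M-step: weighted estimates of R, s, t given current inlier weights
        mx = (w[:, None] * X).sum(0) / w.sum()
        my = (w[:, None] * Y).sum(0) / w.sum()
        Xc, Yc = X - mx, Y - my
        U, _, Vt = np.linalg.svd((w[:, None] * Yc).T @ Xc)  # weighted Kabsch
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
        R = U @ D @ Vt      # closest rotation, det = +1 enforced
        s = (w * np.einsum('ij,ij->i', Yc, Xc @ R.T)).sum() \
            / (w * (Xc ** 2).sum(1)).sum()
        t = my - s * R @ mx
        E = Y - (s * X @ R.T + t)
        var = (w * (E ** 2).sum(1)).sum() / (3.0 * w.sum()) + 1e-12
        # E-step: posterior probability that each residual is an inlier
        maha = (E ** 2).sum(1) / var
        gauss = pi_in * np.exp(-0.5 * maha) / ((2.0 * np.pi * var) ** 1.5)
        w = gauss / (gauss + (1.0 - pi_in) / volume)
        pi_in = w.mean()
    return s, R, t, w
```

On synthetic data with a gross outlier, the outlier's posterior drops toward zero while the rigid parameters are recovered from the inliers, which is the behavior analyzed empirically in Section 4.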

3.3 Generalized Student Model

Another way to enforce robustness is to use the generalized Student’s t-distribution, also known as the Pearson type VII distribution (Sun et al, 2010):


where and are the parameters of the prior gamma distribution of and is the gamma function. The distribution (3.3) differs from the standard Student's t-distribution in that the weight variable, or precision, is drawn from a gamma distribution with parameters and , instead of and . Notice that in (3.3) and appear only through their product, which means that an additional constraint is required to make the parameterization unique. One possibility is to constrain the determinant of the covariance, e.g. , which is equivalent to having an unconstrained with . Unconstrained parameters are easier to deal with in inference algorithms. Therefore, we will rather assume, without loss of generality, that .

Notice that the posterior distribution of is also a gamma distribution, namely the posterior gamma distribution:


with parameters:


The posterior mean of the weight variable is:


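For intuition, the posterior mean weight takes a simple closed form in the standard t-parameterization (where both gamma parameters equal half the degrees of freedom); this is our simplifying assumption, since the paper's generalized (a, b) parameters reduce to it in that special case:

```python
import numpy as np

def student_weights(E, Sigma, mu):
    """Posterior mean precision weight for each residual in E ((L, 3))
    under a Student model with mu degrees of freedom:
    w_j = (mu + d) / (mu + maha_j), with d = 3.
    (Standard t-parameterization, i.e. both gamma parameters set to mu/2.)"""
    Sinv = np.linalg.inv(Sigma)
    maha = np.einsum('ij,jk,ik->i', E, Sinv, E)
    return (mu + 3.0) / (mu + maha)
```

Small residuals receive weights above one, while large residuals are smoothly down-weighted, which is the source of the model's robustness.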
As with the Gaussian-uniform model, we need to minimize the expected complete-data negative log-likelihood, and in this case the parameter vector is since we set . This yields the minimization of:


thus replacing with in (16) and (17) to estimate the optimal rotation (3.1) and scale (18). The covariance matrix is estimated with:


The parameter is updated by solving the following equation, where is the digamma function:


We refer to this model as the generalized Student (GStudent) and the associated EM algorithm is summarized in Algorithm 2.

Data: Centered point coordinates, i.e. (7). ;
Initialization of : Use the closed-form solution (Horn, 1987) to evaluate and ; evaluate using (12). Provide ;
while  do
       E-step: evaluate and using (23) with , then evaluate using (24) ;
       Update the centered coordinates using (16), where are replaced with ;
       M-scale-step: Evaluate using (18);
       M-rotation-step: Estimate with (3.1), (17) ;
       M-covariance-step: Evaluate using (26) ;
       M-mu-step: Evaluate using (27) ;
end while
Optimal translation: Evaluate the translation vector using (6);
Result: Estimated scale , rotation , translation , covariance , and landmark weights .
Algorithm 2 The GStudent-EM for robust estimation of the rigid transformation between two point sets.

3.4 Algorithm Implementation and Analysis

Algorithm 1 and Algorithm 2 are expectation-maximization (EM) procedures, which are well known to have good convergence properties. One should notice that all the computations inside these algorithms are in closed form, with the notable exception of the estimation of the rotation matrix. The latter is parameterized with a unit quaternion, which is estimated via nonlinear constrained optimization. The unit-quaternion parameterization of rotations, i.e. Appendix A, has several advantages: (i) the number of parameters to be estimated is reduced from nine to four, (ii) the number of nonlinear constraints is reduced from seven (six quadratic constraints, i.e. , and one cubic constraint, i.e. ) to one quadratic constraint (), and (iii) the initialization is performed with the closed-form solution of (Horn, 1987), which uses a unit quaternion as well.

In practice, the constrained nonlinear optimization problem (3.1) is solved using the sequential quadratic programming method (Bonnans et al, 2006); more precisely, a sequential least squares programming (SLSQP) solver (https://docs.scipy.org/doc/scipy/reference/optimize.html) is used, in combination with the software package of (Kraft, 1988). The SLSQP minimizer found at the previous EM iteration is used as an initial estimate at the current EM iteration. The closed-form method of (Horn, 1987) (please consult Appendix A) is used to initialize the unit-quaternion parameters at the start of the EM algorithm.
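A minimal version of this rotation M-step, assuming centered coordinates Xc, Yc and an inverse covariance Sinv (the names are ours), can be written with SciPy's SLSQP solver and the single unit-norm constraint; SciPy's quaternions use the scalar-last (x, y, z, w) convention:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def rotation_slsqp(Xc, Yc, Sinv, q0=None):
    """Estimate the rotation minimizing the Mahalanobis alignment cost
    over unit quaternions, with SLSQP and the constraint ||q||^2 = 1."""
    def cost(q):
        E = Yc - Xc @ Rotation.from_quat(q).as_matrix().T
        return np.einsum('ij,jk,ik->', E, Sinv, E)
    q_init = np.array([0.0, 0.0, 0.0, 1.0]) if q0 is None else q0  # identity
    res = minimize(cost, q_init, method='SLSQP',
                   constraints=[{'type': 'eq', 'fun': lambda q: q @ q - 1.0}])
    return Rotation.from_quat(res.x).as_matrix()
```

Inside an EM loop, the previous iteration's minimizer would be passed as q0, mirroring the warm-start strategy described above.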

Other closed-form methods commonly used in computer vision, e.g. (Horn et al, 1988; Arun et al, 1987), perform singular value decomposition to extract an orthogonal matrix from the measurement matrix, but without the guarantee that the estimated matrix is a rotation, i.e. that its determinant equals +1. Appendix A summarizes the unit-quaternion closed-form method, which is based on estimating the smallest eigenvalue-eigenvector pair of a 4×4 positive semi-definite symmetric matrix – a well-known mathematical problem with a straightforward numerical solver.

4 Analyzing the Robustness of Rigid Mapping

In order to quantify the performance of the proposed robust rigid-mapping algorithms, we devised an experimental protocol on the following grounds. Let be a set of landmarks associated with the frontal view of a face. The set is generated with:


where is a scalar that controls the level of noise and is the trial index. As described in detail below, the noise level can be the variance of Gaussian isotropic noise, the total variance of Gaussian anisotropic noise, or the volume of uniformly distributed noise. The landmark coordinates are normalized such that . For each noise level, we randomly generate trials, namely rigid mappings and sets of residuals . For each trial we estimate the rigid mapping parameters, , , , and we measure the root mean square error (RMSE) between these estimated parameters and the ground-truth parameters, , , , namely:


The ground-truth rigid-mapping parameters are generated in the following way. For each trial , the scale and the translation vector are generated from uniform distributions, namely and . The rotation matrix is parameterized by the pan, tilt and yaw angles, namely:

A rotation matrix is obtained by randomly generating the pan, tilt and yaw angles, , , from a uniform distribution, .
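The ground-truth pose generation can be sketched as follows; the scale and translation ranges are illustrative assumptions of ours (the paper's exact intervals are not reproduced here), and the three Euler angles play the role of the pan, tilt and yaw angles:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_pose(rng, angle_range=np.pi / 4):
    """Draw one ground-truth rigid mapping for a simulation trial:
    scale and translation from uniform distributions (illustrative
    ranges), rotation from three Euler angles drawn uniformly in
    [-angle_range, angle_range]."""
    s = rng.uniform(0.5, 2.0)
    t = rng.uniform(-1.0, 1.0, size=3)
    angles = rng.uniform(-angle_range, angle_range, size=3)
    R = Rotation.from_euler('zyx', angles).as_matrix()
    return s, R, t
```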

(a) RMSE error in rotation.
(b) RMSE error in scale.
(c) RMSE error in translation.
Figure 2: RMSE error as a function of the percentage of outliers: inliers are affected by anisotropic Gaussian noise with total variance , while the percentage of outliers, affected by uniform noise with amplitude , increases from to .
(a) RMSE error in rotation.
(b) RMSE error in scale.
(c) RMSE error in translation.
Figure 3: RMSE error as a function of uniform noise affecting a fixed number of outliers: inliers (50%) are affected by isotropic Gaussian noise with variance , while outliers (50%) are affected by uniform noise of increasing amplitude .
(a) RMSE error in rotation.
(b) RMSE error in scale.
(c) RMSE error in translation.
Figure 4: RMSE error as a function of uniform noise affecting a fixed number of outliers: inliers (50%) are affected by anisotropic Gaussian noise with total variance , while outliers (50%) are affected by uniform noise of increasing amplitude .
(a) RMSE error in rotation.
(b) RMSE error in scale.
(c) RMSE error in translation.
Figure 5: RMSE error as a function of anisotropic Gaussian noise affecting a fixed number of outliers: inliers (50%) are affected by anisotropic Gaussian noise with total variance , while outliers (50%) are affected by anisotropic Gaussian noise with total variance .

In order to generate residuals, , we simulate three types of noise:

  • Isotropic Gaussian noise: ;

  • Anisotropic Gaussian noise: , and

  • Uniform noise: .

In the case of anisotropic noise, a covariance matrix must be randomly generated for each trial. This is done in the following way. Let , with (an orthogonal matrix) and with , where the eigenvalues correspond to the variances along the eigenvectors – the directions of maximum variance. Let denote the total variance. A sample covariance matrix is simulated by randomly generating an orthogonal matrix and by randomly generating the three eigenvalues from a uniform distribution, .
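The covariance simulation just described can be sketched in a few lines; drawing the random orthogonal matrix from the QR decomposition of a Gaussian matrix and the eigenvalue range are our implementation choices:

```python
import numpy as np

def random_covariance(rng, total_variance=1.0):
    """Simulate Sigma = Q diag(lam) Q^T: Q is a random orthogonal
    matrix (QR of a Gaussian matrix) and the eigenvalues lam are drawn
    uniformly, then rescaled so that trace(Sigma), the total variance,
    equals the requested value."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    lam = rng.uniform(0.1, 1.0, size=3)
    lam *= total_variance / lam.sum()
    return Q @ np.diag(lam) @ Q.T
```

Since Q is orthogonal, the trace of the resulting matrix equals the sum of the eigenvalues, i.e. the prescribed total variance.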

We tested the following rigid mapping models and associated algorithms:

  • Horn: Gaussian distribution with isotropic covariance, (Horn, 1987) and Appendix A;

  • Gen-Horn: Gaussian distribution with anisotropic covariance, Section 3.1;

  • GUM-EM: Gaussian-uniform mixture distribution, Algorithm 1, and

  • GStudent-EM: Generalized Student’s t-distribution, Algorithm 2.

The experiments were conducted in the following way. For each noise level, we simulated trials for which we computed the RMSEs, namely eqs. (29), (30), and (31). For each trial we split the landmarks into an inlier set and an outlier set, and the landmarks are randomly assigned to one of these sets. The first experiment determines the percentage of outliers that can be handled by the robust algorithms, Figure 2. For this purpose, the percentage of outliers is increased from 10% to 60%. The inlier noise is drawn from an anisotropic Gaussian distribution with a total variance . The outlier noise is drawn from a uniform distribution with amplitude (remember that the landmark coordinates are normalized to lie in the interval ). The curves plotted in Figure 2 show that the RMSEs associated with the non-robust methods, i.e. Horn and Gen-Horn, increase monotonically. By contrast, the robust algorithms, GUM-EM and GStudent-EM, have a radically different behavior: after a short increase, the RMSE remains constant, and then it increases again.

In the other experiments, the number of inliers was set equal to the number of outliers and we experimented with the three noise types already mentioned. Figure 3 shows the RMSEs when inlier noise is drawn from an isotropic Gaussian distribution with , while outlier noise is drawn from a uniform distribution whose volume is increased from to . Similarly, Figure 4 shows the RMSEs for the case when inlier noise is drawn from an anisotropic Gaussian distribution with total variance , while outlier noise is drawn from a uniform distribution whose volume is increased from to . Finally, Figure 5 shows the RMSEs when inlier noise is drawn from an anisotropic Gaussian distribution with total variance , while outlier noise is drawn from an anisotropic Gaussian distribution with total variance varying from to .

These experiments clearly show that the two classes of methods (non-robust and robust) behave differently. The performance of non-robust rigid mapping degrades monotonically in the presence of outliers with increasing levels of noise. The robust methods can deal with up to 50% of outliers affected by a substantial noise level (1.5 times the size of the image). There is no evidence that the Gen-Horn algorithm performs better than the standard Horn algorithm. Nevertheless, Gen-Horn provides interesting information about the 3D structure of the estimated anisotropic covariance. The GUM-EM algorithm performs slightly better than the GStudent-EM algorithm, in particular in the presence of outliers drawn from a uniform distribution.

5 Measuring the Performance of 3D Face Alignment

In this section we describe an unsupervised methodology for quantitatively assessing the performance of 3DFA algorithms. The idea of the proposed benchmark is to apply 3DFA to a dataset of face images in order to extract 3D landmarks, to robustly estimate the rigid transformation that maps these facial landmarks onto a 3D landmark model, and to analyze the discrepancy between the extracted 3D landmarks and the model. Based on a confidence score, it is then possible to decide whether a landmark is correctly localized or not. This makes it possible to assess the overall performance of a 3DFA algorithm as well as its behavior with respect to various perturbations, such as occlusions or motion blur.

5.1 Neutral Frontal Landmark Model

We start by computing a neutral frontal landmark model. For this purpose, we use a dataset of images of neutral faces (frontal viewing, no expression and no interfering object causing occlusion) and we extract landmarks from each of these faces, . Then we use the landmark coordinates to compute the directions of maximum variance (or the principal components) of each face. By aligning these directions over the dataset, we compute a mean for each landmark, namely


5.2 Statistical Frontal Landmark Model

We now explain how a statistical frontal landmark model is built, namely , where is the set of means and is the set of covariance matrices associated with the statistical frontal landmark model. For this purpose, we use another dataset that contains images of faces with the following characteristics: arbitrary poses, arbitrary expressions, both speaking and silent faces, but no external source of perturbation such as an interfering object that may cause occlusions. We extract 3D landmarks from these images using a 3DFA algorithm, namely , and we use either GUM-EM (Algorithm 1) or GStudent-EM (Algorithm 2) to robustly estimate the rigid transformation between each landmark set and the neutral frontal landmark set and . Based on this, we obtain rigid-mapping parameters (one for each ): scale factors, rotations and translations, , where the superscript denotes a robust algorithm, namely either GUM-EM or GStudent-EM. Recall that both algorithms provide a figure of merit for each landmark: posterior probabilities in the case of GUM-EM and precisions in the case of GStudent-EM. Applying one of these robust rigid-alignment methods provides frontal landmarks, , namely:


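Given the estimated parameters, this frontalization step is a one-line map. The sketch below assumes the convention that the estimated rigid transformation sends an observed landmark to the frontal frame; the function names are illustrative:

```python
import numpy as np

def frontalize(X, s, R, t):
    """Map landmarks X (J x 3) to the frontal frame with the estimated
    rigid parameters, assuming the convention q_j = s R x_j + t."""
    return s * X @ R.T + t

def defrontalize(Q, s, R, t):
    """Inverse map, x_j = R^T (q_j - t) / s."""
    return (Q - t) @ R / s
```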
There are two different expressions for the posterior means and posterior covariances for GUM-EM and for GStudent-EM, respectively:




Notice that (36) and (37) compute a mean and a covariance for landmark over the entire dataset. Hence, unlike in (26), the covariance must be normalized by the sum of the weights.
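A minimal sketch of these weighted statistics, with the covariance normalized by the sum of the weights as just noted (the array shapes and function name are our assumptions):

```python
import numpy as np

def landmark_statistics(X, W):
    """X: (N, J, 3) frontalized landmarks over N faces with J landmarks each.
    W: (N, J) per-landmark weights (GUM-EM posteriors or GStudent-EM
    precisions). Returns per-landmark weighted means (J, 3) and covariances
    (J, 3, 3), each normalized by the sum of the weights over the dataset."""
    Wn = W / W.sum(axis=0, keepdims=True)       # normalize weights per landmark
    mu = np.einsum('nj,njk->jk', Wn, X)         # weighted means
    D = X - mu[None]                            # deviations from the means
    cov = np.einsum('nj,njk,njl->jkl', Wn, D, D)  # weighted covariances
    return mu, cov
```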

5.3 Unsupervised Confidence Test

We now develop an unsupervised (statistical) confidence test for assessing whether the accuracy of a landmark, i.e. its 3D coordinates, is within (inlier) or outside (outlier) an expected range (Savage, 1972). Let us drop the algorithm superscript and let be the eigendecomposition of , where is an orthonormal matrix and is a diagonal matrix containing the eigenvalues. We can now project each landmark onto the space spanned by the three eigenvectors of this matrix:


Landmark is an inlier with confidence if:

  • lies inside the ellipsoid whose half-axes are three times the standard deviations, or

  • , , , where are the eigenvalues of , or


where . Combining (38) and (39) yields . With the notation


the confidence test writes:


Based on this confidence test, we can now build a confidence test accuracy (the higher the better) associated with a sample , namely:


where denotes the indicator function. For a test dataset composed of samples, one can then compute the mean confidence test accuracy (the higher the better):


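The test and its accuracy can be sketched as a per-axis three-sigma check in the eigenbasis of the landmark covariance; the three-sigma radius follows the text, everything else (names, shapes) is illustrative:

```python
import numpy as np

def inlier_test(x, mu, cov, n_sigma=3.0):
    """Project the deviation (x - mu) onto the eigenvectors of the landmark
    covariance and require each component to lie within n_sigma standard
    deviations, i.e. inside the corresponding ellipsoid."""
    evals, evecs = np.linalg.eigh(cov)    # cov = evecs @ diag(evals) @ evecs.T
    z = evecs.T @ (x - mu)                # coordinates in the eigenbasis
    return bool(np.all(z ** 2 <= (n_sigma ** 2) * evals))

def confidence_accuracy(X, mu, cov, n_sigma=3.0):
    """Fraction of landmarks of one face sample passing the confidence test."""
    return float(np.mean([inlier_test(x, m, c, n_sigma)
                          for x, m, c in zip(X, mu, cov)]))
```

Averaging `confidence_accuracy` over a test dataset gives the mean confidence-test accuracy used in the experiments.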
5.4 Supervised Metrics

In general, datasets of faces come with their ground truth, and we denote by the set of ground-truth landmarks associated with the dataset . We modify (1) to build a metric that counts the proportion of inliers, namely the ground-truth accuracy (the higher the better):


where is a user-defined threshold that corresponds to the quality of the ground-truth landmarks. Based on this we can compute the mean ground-truth accuracy (S):


Finally, another interesting metric is the correlation coefficient between the above unsupervised and supervised metrics:


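This last metric is a plain Pearson correlation coefficient over per-sample scores; a minimal sketch (the function name is our own):

```python
import numpy as np

def metric_correlation(unsup, sup):
    """Pearson correlation between per-sample unsupervised confidence-test
    accuracies and supervised ground-truth accuracies. Assumes both inputs
    have non-zero variance."""
    u = np.asarray(unsup, dtype=float)
    s = np.asarray(sup, dtype=float)
    u, s = u - u.mean(), s - s.mean()          # center both score vectors
    return float((u @ s) / np.sqrt((u @ u) * (s @ s)))
```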
6 Experimental Results

6.1 Neutral Frontal Landmark Model

The neutral frontal landmark model was trained in-the-wild by harvesting web images and using a face detector and a head-pose estimator in order to select frontal faces. These images were visually inspected to guarantee shape and aspect variabilities as well as neutral facial expressions. This process yields a dataset composed of images. We used the 3DFA method of (Feng et al, 2018) to extract landmarks from each face in the dataset. Next, we aligned them (see Section 5.1) and computed the landmark means using (32). Figure 6 shows a few examples of images from this dataset as well as the detected landmarks. Figure 7 shows the neutral frontal landmark model thus obtained.

Figure 6: Examples of faces and corresponding landmarks used to compute a neutral frontal landmark model.
Figure 7: A 3D view of the neutral frontal landmark model.

6.2 Statistical Frontal Landmark Model

The statistical frontal landmark model was trained from the YawDD dataset (Abtahi et al, 2014). This dataset contains 322 videos, which is equivalent to approximately images. The face images in this dataset have large variabilities in terms of face shapes, face aspects, head poses and facial expressions. All the images were processed with no human intervention, namely: face detection, 3D face alignment, and robust rigid alignment with the neutral face landmarks just described. This yields the statistical face landmark model described in Section 5.2. For that purpose we used two 3DFA methods and the two robust alignment algorithms described in this paper. Hence, there are four possible 3DFA and robust alignment combinations that we used to train four different models:

  • 3DFA1/GUM-EM: (Bulat and Tzimiropoulos, 2016) and GUM-EM (Algorithm 1),

  • 3DFA2/GUM-EM: (Feng et al, 2018) and GUM-EM (Algorithm 1),

  • 3DFA1/GStudent-EM: (Bulat and Tzimiropoulos, 2016) and GStudent-EM (Algorithm 2), and

  • 3DFA2/GStudent-EM: (Feng et al, 2018) and GStudent-EM (Algorithm 2).

Figure 8 shows the statistical frontal landmark models obtained with these four combinations. In this figure, the dots correspond to the posterior means, i.e. (34) and (36), while the elliptical regions correspond to image projections of the ellipsoids defined by (40).

(a) 3DFA1/GUM-EM
(b) 3DFA2/GUM-EM
(c) 3DFA1/GStudent-EM
(d) 3DFA2/GStudent-EM
Figure 8: Statistical frontal landmark models obtained with two 3DFA methods and with the proposed robust rigid-mapping algorithms.

6.3 Performance Evaluation of 3D Face Alignment

Once the neutral frontal and statistical frontal models are computed using datasets and , respectively, we use a third dataset, , to empirically assess the performance of 3DFA algorithms using the unsupervised confidence test introduced in Section 5.3. For this purpose we used AFLW2000-3D (Zhu et al, 2016) as a test dataset, consisting of 2,000 images with large-pose variations. More precisely, the yaw angles (rotation about the vertical axis) lie in the interval for 1,306 faces, in the interval for 462 faces and in the interval for 232 faces. The dataset contains a large variety of facial shapes, facial expressions and illumination conditions. Moreover, there are many faces with partial occlusions caused by the presence of hair, hands, handheld objects, glasses, etc. Notice that large poses induce partial occlusions as well.

Each image in the AFLW2000-3D dataset is annotated with 68 3D landmarks. This semi-automatic annotation is performed by fitting a 3D deformable model to a dataset of 2D face images, e.g. (Ghiasi and Fowlkes, 2014). Nevertheless, as noted in (Bulat and Tzimiropoulos, 2017), many of the annotated landmarks in this dataset have large localization errors, especially in the case of profile views. Hence, performance evaluation based on supervised metrics is prone to errors. In (Bulat and Tzimiropoulos, 2017) it is visually shown that, in these extreme poses, their 3DFA method yields more precise landmark localization than the automatic annotations. Based on these observations, we applied our unsupervised performance analysis to the annotated landmarks as well, yielding the following combinations:

  • GT/GUM-EM: Ground-truth landmarks provided by (Zhu et al, 2016) and GUM-EM, and

  • GT/GStudent-EM: Ground-truth landmarks provided by (Zhu et al, 2016) and GStudent-EM.

The results based on computing the mean confidence-test accuracies, i.e. (43), are summarized in Table 1. Recall that we used different datasets for training the neutral and statistical face landmark models, i.e. and , and for assessing the performance of the various combinations of 3DFA methods and robust rigid mappings, i.e. . The means, evaluated over the confidence-test scores obtained with the annotated (ground-truth) landmarks (last two rows of Table 1), are equal to and to , respectively, which seems to confirm that the ground-truth landmark locations in the AFLW2000-3D dataset contain a substantial amount of errors and that, overall, both 3DFA methods that we analyzed, (Bulat and Tzimiropoulos, 2016) and (Feng et al, 2018), predict landmark locations that are more accurate than the ground-truth locations themselves.

We now compute correlation coefficients, i.e. (46), between the unsupervised and supervised metrics, i.e. (42) and (44). The results are reported in Table 2. Notice however that the supervised scores depend on the choice of the parameter . As done in (Bulat and Tzimiropoulos, 2017), this parameter was adjusted to eliminate samples yielding a very low score. With in (45) we obtained the following scores:

  • 3DFA1: ,

  • 3DFA2: .

These scores are comparable to the scores reported in (Bulat and Tzimiropoulos, 2017), which uses a different normalization parameter. Notice that the scores obtained with the proposed confidence test, i.e. Table 1, are higher than these scores.

In the light of these results, we analyzed the effect of eliminating inaccurate ground-truth landmark annotations from the benchmark just described. Let be the subset of samples satisfying , where denotes the value of the unsupervised confidence score associated with the ground-truth landmarks of face sample . We see that when is increased, the accuracy of the ground-truth landmark annotations contained in the subset increases as well, at the price of drastically decreasing the number of samples, which in turn lowers the statistical significance of the resulting scores. The correlation coefficient is then computed with the following formula:


Figure 9 (left) shows the correlation as a function of where the 3DFA1/GStudent-EM method was used to build the confidence test, i.e. Fig. 8(c), while Figure 9 (right) shows the corresponding p-value (with a significance level of ): the smaller the p-value, the more statistically significant the correlation. The red dots in these plots correspond to p-values not satisfying the significance level.
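The correlation and an associated p-value can be computed, for instance, with a Fisher z-transform (a normal approximation to the null distribution of the Pearson coefficient) rather than the exact t-test; the sketch below makes that assumption explicit:

```python
import math

def pearson_with_pvalue(u, s):
    """Pearson r between two score lists, plus a two-sided p-value obtained
    from the Fisher z-transform: z = atanh(r) * sqrt(n - 3) is approximately
    standard normal under the null hypothesis of no correlation."""
    n = len(u)
    mu_u, mu_s = sum(u) / n, sum(s) / n
    du = [a - mu_u for a in u]
    ds = [b - mu_s for b in s]
    r = sum(a * b for a, b in zip(du, ds)) / math.sqrt(
        sum(a * a for a in du) * sum(b * b for b in ds))
    # clip to avoid atanh(+/-1) blowing up for perfectly correlated data
    z = math.atanh(max(min(r, 0.999999), -0.999999)) * math.sqrt(n - 3)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided N(0,1) tail probability
    return r, p
```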

In the light of these experiments, we conclude that the proposed methodology for assessing the performance of 3DFA methods is not biased by the quality of landmark annotation, whether the latter is automatic or human-assisted. The experiments suggest that the proposed unsupervised methodology could be used (i) to assess the quality of landmark annotation itself and (ii) to remove badly annotated landmarks.

We now illustrate the proposed performance analysis method with a few examples. Figure 10 shows the results obtained with 3DFA1/GStudent-EM, i.e. (Bulat and Tzimiropoulos, 2016), applied to six samples from the AFLW2000-3D dataset, using the statistical face model built with 3DFA1/GStudent-EM. Figure 11 shows the results obtained with 3DFA2/GStudent-EM, i.e. (Feng et al, 2018), applied to the same samples and using the same statistical face model as in Figure 10.

Finally, we illustrate the results obtained by applying GStudent-EM to the ground-truth landmarks associated with the AFLW2000-3D dataset (Zhu et al, 2016), or GT/GStudent-EM. Some examples are shown in Figure 12 (best scores) and in Figure 13 (worst scores). The results reported in Table 2 and these examples show that the ground-truth annotations should be used with caution.

Alignment method Statistical face models trained with and datasets:
using dataset : 3DFA1/GUM-EM 3DFA2/GUM-EM 3DFA1/GStudent-EM 3DFA2/GStudent-EM Mean
3DFA1/GUM-EM 0.89 0.65 0.93 0.80 0.82
3DFA2/GUM-EM 0.93 0.88 0.95 0.93 0.92
3DFA1/GStudent-EM 0.80 0.57 0.88 0.74 0.75
3DFA2/GStudent-EM 0.84 0.76 0.90 0.88 0.84
GT/GUM-EM 0.73 0.54 0.82 0.71 0.70
GT/GStudent-EM 0.67 0.48 0.78 0.66 0.65
Table 1: Performance analysis based on the proposed unsupervised metrics. The numbers correspond to the proportion of inliers (the higher the better) computed using (42) and (43) over a dataset that contains 2,000 face images and 68 landmarks per face.
Alignment method Statistical face models trained with and datasets:
using dataset : 3DFA1/GUM-EM 3DFA2/GUM-EM 3DFA1/GStudent-EM 3DFA2/GStudent-EM Mean
3DFA1/GUM-EM 0.30 0.28 0.28 0.29 0.29
3DFA2/GUM-EM 0.33 0.37 0.30 0.33 0.33
3DFA1/GStudent-EM 0.25 0.26 0.26 0.26 0.26
3DFA2/GStudent-EM 0.25 0.28 0.23 0.26 0.25
Table 2: Correlation coefficients computed with (46) (the higher the better) between unsupervised and supervised metrics.
Figure 9: Correlation between the unsupervised and supervised metrics (left) and corresponding p-value (right) as a function of the accuracy of the ground-truth landmark annotations. The red dots correspond to a p-value not satisfying a significance level, which is set to in these experiments.
Figure 10: A few examples obtained with (Bulat and Tzimiropoulos, 2016) and with GStudent-EM.
Figure 11: A few examples obtained with (Feng et al, 2018) and with GStudent-EM.
Figure 12: Some examples of the best scores obtained with GStudent-EM and the ground-truth landmarks.
Figure 13: Some examples of the worst scores obtained with GStudent-EM and the ground-truth landmarks.

7 Conclusions

We presented a method for analyzing the performance of 3DFA algorithms. Due to the large spectrum of 3DFA formulations, ranging from 3D deformable-model fitting to discriminative deep learning, it is difficult to compare their respective merits on the basis of formal mathematical and algorithmic analyses. Instead, we adopt an empirical data-driven evaluation. To date, performance analysis relies on annotated datasets. In the case of 3DFA, these annotations correspond to the 3D coordinates of a set of pre-defined facial landmarks. This annotation process, be it manual, semi-automatic, or automatic, is prone to errors, which biases the analysis.

In contrast, we proposed a method that bypasses annotations. This is possible because, in the case of facial landmark detection and localization, there is an underlying rigid transformation between the landmarks of a face, with unknown pose and expression, and the landmarks of a frontally-viewed face. If this rigid transformation can be estimated, it can then be used to map an unknown face onto a model face. This expression-preserving rigid mapping can subsequently be used to measure face-to-model discrepancies. We proposed such an approach based on well-founded statistical models. This led to an unsupervised parametric confidence test that yields a confidence score well suited to analyze the performance of 3DFA algorithms.

Because the proposed methodological pipeline makes use of 3DFA and of robust rigid mapping, one may argue that the proposed performance analysis metric is biased. We empirically showed that the analysis is agnostic to various combinations of methods used for building the model and for the test itself. Moreover, we showed that the method could also be used to assess the quality of facial landmark annotations, in particular those annotations that are obtained automatically based on 3D deformable-model fitting. We therefore conclude that the proposed performance analysis methodology yields an interesting framework for assessing the repeatability and reliability of the predictions obtained with 3DFA algorithms.

Appendix A Closed-Form Solution Using Unit Quaternions

Consider (14) with . We immediately obtain the following formulas for the model parameters:


The formula for the posteriors becomes:


It is well known that a rotation matrix can be parameterized by a unit quaternion (Horn, 1987). Let be parameterized by its axis and angle of rotation, , and . The unit quaternion parameterizing the rotation is: