Face alignment refers to the process of localizing a number of landmarks on the face, such as lips, eye corners, pupils or nose tips. It serves as a preprocessing backbone for many applications in computer vision, such as expression recognition, face synthesis, or facial performance reenactment. Depending on the application, the precision of the annotation markup varies a lot, i.e. the number of annotated landmarks can be very different across datasets. How to integrate these heterogeneous annotations in order to robustly localize variable numbers of landmarks belonging to different markups remains an open challenge.
A closely related task is head pose estimation, which is usually formulated as a regression task on the three Euler angles (yaw, pitch and roll). Given a precise landmark localization, head pose can be estimated quite straightforwardly. However, there is no guarantee that such a two-step formulation is optimal: for example, given a rough head pose estimate, one can in principle select more relevant face features in order to specialize landmark localization to a certain pose range, as the face appearance varies widely between e.g. yaw angles close to degrees and smaller angles, particularly when considering cheek regions.
In this paper, we propose to entwine the head pose estimation and face alignment tasks, so that each task benefits from the other. This process is illustrated in Figure 1. To do this, we introduce an attentional cascade with doubly conditional fusion (AC-DC) model. AC-DC is a hybrid architecture between cascaded regression methods and deep learning-based, end-to-end learnable approaches. It is composed of several stages, each of which contains a backbone fully-convolutional U-net block. For each stage, a geometry transfer network (GTN) converts the current landmark estimates to fit multiple, heterogeneous annotation markups, and provides a current head pose estimate. Then, a doubly-conditional fusion block uses both head pose estimates and spatial attention maps to select relevant channels, and regions thereof, to provide richer embeddings to the next cascade stages. The whole architecture is trained in an end-to-end fashion. To sum up, the contributions of this paper are threefold:
We propose an attentional cascade (AC-DC) that iteratively refines head pose and landmark estimates. It uses a doubly conditional pose and spatial masking to refine the predictions of the next stages.
We propose a dual-stream geometry transfer network (GTN) to integrate heterogeneous landmark and head pose prediction objectives.
We experimentally show that AC-DC extends state-of-the-art results by a significant margin for both 2D and 3D landmark alignment, as well as head pose estimation.
II Related work
Face alignment can be divided into two coarse categories, the first of which is 2D alignment. This consists in predicting 2D landmark localization, usually for face images with low to medium yaw angles. Popular methods for 2D face alignment belong either to cascaded regression or to deep learning-based, end-to-end approaches. Popular examples of cascaded regression include SDM, LBF and DAN. A natural pitfall of such approaches is that the regressors are not learned jointly in an end-to-end fashion, thus there is no guarantee that the whole cascade is optimal. Tackling this issue, MDM improves the feature extraction process by sharing CNN layers among cascade stages, which are formulated as a recurrent neural network. This results in a more optimized landmark trajectory throughout the cascade.
Examples of deep methods include TCDCN, which involves pretraining on a wide facial attributes database. More recently, SAN uses generative adversarial networks to convert images from different styles to an aggregated style before performing landmark localization. Authors of  propose to use edge map estimation as an intermediate representation to drive the landmark prediction task. Authors in  use a surrogate loss to enhance training of deep networks. AAN proposes to use intermediate feature maps as attentional masks to select relevant regions.
3D face alignment methods are usually formulated as dense landmark localization objectives, which are reduced to a sparse set of landmarks for evaluation. For instance, PRN learns a direct mapping between an input image and a UV map that contains 3D coordinates for each pixel. Such methods do not explicitly use head pose information. By contrast, 3DDFA fits a 3D morphable model using a deep neural network. In such a case, head pose lies among the parameters of the morphable model and is explicitly estimated. However, such a parametric model uses a restricted number of dimensions and is usually quite rigid, e.g. w.r.t. expressions.
Head pose estimation: like 3DDFA, most methods in the literature assume that head pose can be estimated prior to face alignment and used as a low-dimensional variable for conditioning the landmark localization task. This is due to the fact that head pose can be predicted accurately using a single deep network and without relying on landmark localization, as demonstrated in . For instance, 3DDE first predicts a rough, rigid landmark localization guess using a deep neural network to estimate head pose, then refines its predictions with a coarse-to-fine ensemble of regression trees. PCD-CNN integrates head pose information inside a dendritic CNN architecture. Furthermore, many approaches such as  treat head pose as a byproduct of landmark localization, failing to enrich the latter task with the knowledge of pose estimation.
Dealing with heterogeneous annotations: besides head pose and landmark-wise annotations, face geometry is annotated in a heterogeneous fashion, with various numbers of landmarks. This is problematic, since fine-grained face annotation comes at a significant cost, hence available quality data is rather scarce. As an example, the 300W database contains images labelled in terms of 68 landmarks, whereas WFLW contains images with 98 landmarks and CelebA contains images annotated with only 5 landmarks. Thus, one can wonder whether all those images can be used within the same framework to learn more robust landmark predictions. This problem is usually tackled as a knowledge transfer between heterogeneous datasets. In  the authors use a multi-task formulation, with a separate regression head for every annotation markup. However, this essentially ignores the intrinsic relationship between the landmark alignment tasks. Finally, in prior work  we proposed to chain landmark prediction tasks within a fully-convolutional attentional cascade. This approach, however, does not integrate 3D landmarks along with 2D annotations, nor does it entangle head pose with landmark localization.
III Attentional Cascade with Doubly-Conditional fusion
An overview of AC-DC is illustrated in Figure 2. Four fully-convolutional U-Net blocks are stacked on top of each other. A geometry transfer network (Section III-A) converts the feature maps of each block to address heterogeneous landmark alignment tasks, and provides a current estimate of head pose and landmark coordinates. After each of these blocks, we apply our doubly conditional fusion block (Section III-B) to produce the input for the subsequent U-net block. The whole architecture is trained in an end-to-end manner with intermediate supervision (Section III-C) and the hyperparameter settings specified in Section III-D.
III-A Dual-stream geometry transfer network
Let us denote the embeddings of the U-net block , with channels. As we aim at aligning landmarks belonging to heterogeneous markups, we distinguish two cases: (a) markups that are semantically close to each other, or that can be straightforwardly deduced from each other (such as  and ), and (b) semantically different markups (e.g. 2d  and 3d landmarks). In order to integrate this constraint, we propose the dual-stream geometry transfer network (GTN) illustrated in Figure 3. The GTN is composed of separate 2d and 3d sub-networks with distinct transfer layers. We use depthwise separable convolutions with overlap from so that the 2d and 3d alignment tasks benefit from each other while allowing the predicted landmarks to differ, most notably on landmarks belonging to the jawline. The 2d GTN contains several chained transfer layers for matching the number of landmarks of different markups with landmarks, respectively, sorted in ascending order :
With a conv layer with output channels. By chaining these landmark pre-attention maps, each prediction task can benefit from the others, the finer tasks ( landmarks) benefiting from the coarser ones ( landmarks), as the gradients can flow from the former layers to the latter ones at train time. As for the 3d GTN, we only use one transfer layer as we only benchmark one 3d database in our experiments. From these 2d and 3d pre-attention maps , we can derive landmark-wise attention maps by applying spatial softmax, that generates attention maps, one for each landmark:
An estimate of the coordinates for landmark of a -landmarks markup can be obtained by computing the first-order moments of:
This provides a differentiable estimate of the landmark coordinates. In particular, a head pose estimate can be provided from by applying a single dense layer. It should be noted that we use a single GTN by sharing the weights of all transfer layers, as well as the head pose estimation layer, between the different stages. The rationale behind doing this is that the geometric transformation that maps the different markups shall be intrinsic to the relationships between the tasks, hence not depending on a current estimation.
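The spatial softmax and first-order moment steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the array layout `(L, H, W)`, the pixel-unit coordinates, the function names, and the randomly initialized dense layer `w_pose`, `b_pose` (fed here with the flattened landmark coordinates) are assumptions made for illustration only.

```python
import numpy as np


def spatial_softmax(pre_attention):
    """Turn each (H, W) pre-attention map into a map that sums to 1."""
    n_landmarks, height, width = pre_attention.shape
    flat = pre_attention.reshape(n_landmarks, height * width)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(flat)
    return (exp / exp.sum(axis=1, keepdims=True)).reshape(n_landmarks, height, width)


def soft_argmax(attention):
    """First-order moments of each attention map: expected (x, y) in pixels."""
    _, height, width = attention.shape
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    x = (attention * xs).sum(axis=(1, 2))
    y = (attention * ys).sum(axis=(1, 2))
    return np.stack([x, y], axis=1)  # shape (L, 2)


# Toy pre-attention maps with one sharp peak per landmark.
pre = np.zeros((2, 8, 8))
pre[0, 2, 5] = 50.0  # landmark 0 peaks at row 2, column 5
pre[1, 6, 1] = 50.0  # landmark 1 peaks at row 6, column 1

att = spatial_softmax(pre)
coords = soft_argmax(att)

# Hypothetical (untrained) dense layer mapping coordinates to 3 Euler angles.
rng = np.random.default_rng(0)
w_pose = 0.01 * rng.standard_normal((coords.size, 3))
b_pose = np.zeros(3)
pose = coords.reshape(-1) @ w_pose + b_pose

print(coords)
print(pose.shape)
```

Because every operation is a sum or an exponential, the coordinates remain differentiable with respect to the pre-attention maps, which is the property the text relies on for end-to-end training.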
III-B Doubly-Conditional Fusion Block
Similarly to what is done in cascaded regression, we can then use the attention maps to select relevant regions for refining the landmark estimates. This is illustrated on Figure 4. In order to limit the number of feature maps, we merge all the landmark-wise attention maps into a single spatial mask:
The spatially-conditional fusion output of an attentional cascade (AC) can thus be written:
Where denotes channel concatenation and the Hadamard product. We can use the head pose estimate to emphasize the most relevant channels among the channels of , that shall be used by the next block. To do so, we map to a 130-dimensional output by applying a fully-connected layer with sigmoid activation. Thus the output of the doubly-conditional fusion block is:
Where indicates the Hadamard product replicated along the spatial dimensions of the embeddings. This is similar to Squeeze-and-Excitation . However, we constrain the squeeze layer to describe head pose information, which is known to be highly relevant for landmark alignment . Also note that while the parameters of the GTN (including the head pose estimation layer) are shared across all cascade stages, the parameters of the pseudo-excitation layer are not shared, so that different channels can be selected depending on the cascade stage . In what follows, we call this model the attentional cascade with doubly conditional fusion (AC-DC) network. Note that AC-DC is fairly deep ( convolutional layers), but contains fewer than parameters, and is thus relatively light compared to state-of-the-art approaches.
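The fusion block described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the embedding width of 64 channels is inferred from the 130-dimensional excitation output (130 = 2 × (1 + 64), i.e. image and embeddings plus their masked counterparts), the merge of landmark-wise attention maps into one mask is taken to be a plain sum, the concatenation order is arbitrary, and the excitation weights `w_exc`, `b_exc` are random placeholders.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def doubly_conditional_fusion(image, embeddings, attention, pose, w_exc, b_exc):
    """Spatial masking followed by pose-driven channel excitation.

    image:      (1, H, W) grayscale input
    embeddings: (C, H, W) U-net output
    attention:  (L, H, W) landmark-wise attention maps
    pose:       (3,) current head pose estimate
    """
    # Assumed merge rule: sum all landmark-wise maps into a single mask.
    mask = attention.sum(axis=0, keepdims=True)              # (1, H, W)
    stacked = np.concatenate([image, embeddings], axis=0)    # (1 + C, H, W)
    # Spatially-conditional output: raw channels and their masked versions.
    spatial = np.concatenate([stacked, mask * stacked], axis=0)  # (2(1 + C), H, W)
    # Pose-conditional excitation: one gate in (0, 1) per channel.
    excitation = sigmoid(w_exc @ pose + b_exc)               # (2(1 + C),)
    fused = excitation[:, None, None] * spatial
    return fused, excitation


rng = np.random.default_rng(0)
C, H, W, L = 64, 16, 16, 68
image = rng.random((1, H, W))
embeddings = rng.standard_normal((C, H, W))
logits = rng.standard_normal((L, H * W))
attention = (np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)).reshape(L, H, W)
pose = np.array([0.5, -0.1, 0.05])  # hypothetical yaw, pitch, roll
w_exc = 0.1 * rng.standard_normal((2 * (1 + C), 3))
b_exc = np.zeros(2 * (1 + C))

fused, excitation = doubly_conditional_fusion(image, embeddings, attention, pose, w_exc, b_exc)
print(fused.shape)  # (130, 16, 16)
```

The gate depends only on the pose estimate and not on pooled feature statistics, which is what distinguishes this block from a standard Squeeze-and-Excitation layer.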
III-C Training AC-DC model
AC-DC is trained in an end-to-end manner by minimizing a loss between the predicted landmark locations and pose, and their ground truth counterparts: where denotes the parameters of the four U-net backbone blocks and the GTN, as well as the head pose excitation layers. In order to facilitate training, similarly to what is traditionally done in cascaded approaches, we drive the learning of each stage with intermediate supervision using both landmark and head pose ground truth values with:
the landmark localization objective function, and
the head pose estimation term, and denoting the intermediate supervision weights. In what follows, we use ascending weights to enable proper cascaded alignment as suggested in prior work , with the first stages outputting coarse predictions that are refined throughout the attentional cascade.
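A minimal sketch of such an intermediate-supervision objective is given below. The squared-error form of both terms, the `pose_weight` balancing factor, and the stage weights used in the example are assumptions for illustration; the text only states that each stage is supervised on landmarks and head pose with ascending weights.

```python
import numpy as np


def landmark_loss(pred, target):
    """Mean squared Euclidean distance over landmarks, arrays of shape (L, 2)."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


def pose_loss(pred, target):
    """Mean squared error over the three Euler angles."""
    return float(np.mean((pred - target) ** 2))


def cascade_loss(stage_landmarks, stage_poses, gt_landmarks, gt_pose,
                 stage_weights, pose_weight=1.0):
    """Weighted sum of per-stage landmark and pose losses."""
    if not (len(stage_landmarks) == len(stage_poses) == len(stage_weights)):
        raise ValueError("one prediction and one weight are needed per stage")
    total = 0.0
    for lm, pose, weight in zip(stage_landmarks, stage_poses, stage_weights):
        stage_term = landmark_loss(lm, gt_landmarks) + pose_weight * pose_loss(pose, gt_pose)
        total += weight * stage_term
    return total


# Toy example: predictions get closer to the ground truth at each stage.
rng = np.random.default_rng(0)
gt_landmarks = rng.random((68, 2))
gt_pose = np.array([10.0, -5.0, 2.0])
noise_levels = [0.4, 0.2, 0.1, 0.05]
stage_landmarks = [gt_landmarks + s * rng.standard_normal((68, 2)) for s in noise_levels]
stage_poses = [gt_pose + s * rng.standard_normal(3) for s in noise_levels]
stage_weights = [1.0, 2.0, 4.0, 8.0]  # hypothetical ascending weights

loss = cascade_loss(stage_landmarks, stage_poses, gt_landmarks, gt_pose, stage_weights)
print(loss)
```

With ascending weights, errors at late stages are penalized more than errors at early stages, so the first stages are free to output coarse predictions.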
III-D Implementation details
In what follows, we use 4-stage AC-DC models that take as input grayscale images. Each U-Net block is composed of a convolution layer, an encoder and a decoder part. The input of the conv is the original grayscale image for block 1, and tensor for blocks 2, 3 and 4. The encoder part of each block performs successive applications of conv, batch norm, ReLU, followed by conv with stride 2, batch norm, ReLU. The number of channels is . The decoder part mirrors the encoding part. In order to generate smooth feature maps, we do not use transposed convolutions but instead use bilinear image upsampling followed by convolutional layers. Furthermore, skip connections are used between feature maps of the same size to preserve the full spatial resolution of the input image. The whole architecture is trained using the ADAM optimizer with a learning rate with and learning rate annealing with power . We apply updates with batch size for each database, with alternating updates between the databases.
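The optimization schedule can be sketched as below, assuming that "annealing with power" refers to polynomial decay and that the databases are visited in round-robin order. The base learning rate, the power and the number of steps are placeholders chosen for illustration, since the actual values are not given here.

```python
from itertools import cycle, islice


def poly_lr(step, total_steps, base_lr, power):
    """Polynomial decay from base_lr at step 0 down to 0 at total_steps."""
    frac = min(step, total_steps) / total_steps
    return base_lr * (1.0 - frac) ** power


def alternating_schedule(databases, total_steps):
    """Round-robin order in which the databases provide mini-batches."""
    return list(islice(cycle(databases), total_steps))


databases = ["300W", "300W-LP", "WFLW", "CelebA"]
total_steps = 8      # placeholder
base_lr = 1e-3       # placeholder
power = 0.9          # placeholder

schedule = alternating_schedule(databases, total_steps)
for step, name in enumerate(schedule):
    print(step, name, poly_lr(step, total_steps, base_lr, power))
```

Alternating between databases at every update lets each markup contribute gradients regularly, rather than training on one dataset to completion before moving to the next.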
In this Section, we validate our models on several databases and metrics specified in Section IV-A. We first highlight the benefits of our doubly-conditional fusion scheme in Section IV-B, then proceed to compare AC-DC with recent approaches for 2d alignment in Section IV-C, 3d alignment in Section IV-D, and head pose in Section IV-E. Finally, in Section IV-F we show qualitative results obtained with AC-DC.
IV-A Experimental setup
The 300W database, introduced in , contains moderate variations in pose and expressions. It also includes a few occluded images. It consists of four databases: LFPW (811 images for train / 224 images for test), HELEN (2000 images for train / 330 images for test), AFW (337 images for train) and IBUG (135 images for test), for a total of 3148 images annotated with 68 landmarks for training the models. As state-of-the-art approaches already reach very high accuracy on this dataset, authors of  introduced the 300W-LP database, which is a large-pose dataset synthesized from 300W. It contains 100842 train images and 21608 images following the same partitions as in 300W, but with yaw angles covering the degrees range. Authors of  also proposed the AFLW2000-3D database, which contains examples synthesized from the first 2000 images of the AFLW database using the same protocol as in 300W-LP.
The CelebA database  is a large-scale face attribute database which contains celebrity images coming from identities, each annotated with binary attributes (such as gender, eyeglasses, smile), and the localization of landmarks (nose, left and right pupils, mouth corners). In our experiments, we use the train partition that contains images from identities to train our models. The test partition contains instances from identities that are different from the training set identities.
The Wider Facial Landmarks in the Wild or WFLW database  contains 10000 faces (7500 for training and 2500 for testing) with 98 annotated landmarks. This database also features rich attribute annotations in terms of occlusion, head pose, make-up, illumination, blur and expressions.
In our experiments, we train on the train partitions of 300W, 300W-LP, WFLW, and CelebA and evaluate our models on the test partitions of these datasets as well as AFLW2000-3D. We report three evaluation metrics: the normalized mean error (NME), the failure rate (FR@0.1) and the AUC@0.1. For 2d alignment, the NME denotes the average landmark-wise distance normalized by the inter-ocular distance (distance between the outer eye corners). For 3d face alignment, as is traditionally done in the literature, we normalize the distances using the square root of the bounding box area, as proposed in . The FR@0.1 corresponds to the proportion of examples for which the NME is larger than 0.1, and AUC@0.1 is the integral of the cumulative error distribution (CED) curve for examples for which the NME is below 0.1. For head pose estimation, we report the mean absolute error (MAE) for each Euler angle, as well as the average error over these angles.
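The three landmark metrics can be computed along the following lines. This is a sketch under assumptions: the bounding box is taken from the ground truth landmarks, the AUC is divided by the threshold so that it lies in [0, 1], and the landmark indices of the outer eye corners are passed explicitly because they depend on the markup.

```python
import numpy as np


def interocular(gt, left_outer, right_outer):
    """Distance between the outer eye corners, gt of shape (N, L, 2)."""
    return np.linalg.norm(gt[:, left_outer] - gt[:, right_outer], axis=-1)


def bbox_sqrt_area(gt):
    """Square root of the area of the box enclosing the ground truth landmarks."""
    extent = gt.max(axis=1) - gt.min(axis=1)  # (N, 2) width and height
    return np.sqrt(extent[:, 0] * extent[:, 1])


def nme(pred, gt, norm):
    """Per-image mean landmark distance divided by a normalizing length."""
    per_landmark = np.linalg.norm(pred - gt, axis=-1)  # (N, L)
    return per_landmark.mean(axis=1) / norm


def failure_rate(errors, threshold=0.1):
    """Fraction of images whose NME exceeds the threshold."""
    return float(np.mean(errors > threshold))


def auc(errors, threshold=0.1, steps=1000):
    """Area under the CED curve on [0, threshold], rescaled to [0, 1]."""
    xs = np.linspace(0.0, threshold, steps + 1)
    ced = np.array([(errors <= x).mean() for x in xs])
    area = np.sum((ced[1:] + ced[:-1]) * np.diff(xs)) / 2.0  # trapezoidal rule
    return float(area / threshold)


# Toy example: two images with three landmarks each.
gt = np.array([[[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]]] * 2)
pred = gt.copy()
pred[0] += np.array([2.0, 0.0])  # every landmark off by 2 pixels
pred[1] += np.array([0.0, 0.5])  # every landmark off by 0.5 pixel

errors = nme(pred, gt, interocular(gt, left_outer=0, right_outer=1))
print(errors, failure_rate(errors), auc(errors))
```

For 3d alignment, the same `nme` function can be called with `bbox_sqrt_area(gt)` as the normalizing length.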
IV-B Ablation study
In this Section, we discuss the benefit of using doubly-conditional fusion. For validation of hyperparameters such as the number of cascade stages, task ordering and intermediate supervision weights, the reader shall refer to . In Table I we show the benefit of our doubly conditional fusion scheme on the WFLW database. First, we measure the performance of a simply-conditional attentional cascade (AC). In this case we use only spatial masking and no excitation layer to produce the input of each stage. By contrast, AC+Pose is obtained by adding a head pose loss (Equation (8)), but without using head pose as an excitation variable to select relevant channels to refine the embeddings for the next U-net block. As such, using AC+Pose already provides an improvement over AC on every subset. Furthermore, by selecting relevant channels using head pose information (AC-DC), we improve the landmark localization accuracy of the model on nearly every subset of the database, most notably on the head pose and make-up subsets. This validates the fact that entwining head pose within the landmark alignment task by applying doubly-conditional fusion improves the performance of the model for landmark localization, particularly in case of difficult out-of-plane rotations. Moreover, the head pose yaw MAE on AFLW2000-3D is 2.92 for AC-DC vs. 3.29 for AC+Pose. Hence, using doubly-conditional fusion inside the AC-DC architecture is beneficial for both landmark alignment and head pose estimation.
IV-C 2D face alignment
Next, we compare AC-DC with recent state-of-the-art approaches for 2D alignment on the WFLW database. Most notably, AC-DC improves the landmark alignment accuracy on nearly every subset and metric, as compared to LAB, Wing and 3DDE. Note that Wing uses head pose to balance example sampling at train time and 3DDE first infers head pose to pre-align a rigid set of landmarks before regressing the final estimates with a coarse-to-fine ensemble of trees. By contrast, AC-DC jointly learns head pose and landmark alignment, each one benefiting from the other via doubly conditional fusion, which allows the predictions to be iteratively refined through the subsequent cascade stages. This, in turn, provides higher alignment accuracies. It should also be emphasized that entwined head pose estimation, as well as the integration of the 3D landmark alignment task, heavily benefits landmark localization on the pose subset of WFLW in terms of the NME, AUC and FR metrics.
IV-D 3D face alignment
Next, in Table III we compare the accuracy of our method for 3D face alignment on the AFLW2000-3D database. AC-DC significantly enhances the state-of-the-art results on the and head pose ranges. Furthermore, despite the fact that both 3DDFA and PRN benefit from dense 3D morphable model fitting or UV map ground truth while AC-DC only aligns a sparse set of landmarks, our method achieves performances that are close to the best state-of-the-art approach, 2DASL, on the pose range. The average accuracy on those 3 pose bins is 3.40 vs for 2DASL. Note that 2DASL also uses external 2D data for training. Also, the unweighted average accuracy (normalized error averaged on all images from the AFLW2000-3D database without considering the pose subsets) is 2.83 vs for the approach that uses stacked dense U-Nets, which also integrates data from the MENPO dataset.
IV-E Head pose estimation
| Method | Yaw | Pitch | Roll | Mean |
|---|---|---|---|---|
| AC-DC (stage 1) | 3.48 | 8.26 | 6.87 | 6.20 |
| AC-DC (stage 4) | 2.92 | 6.94 | 5.99 | 5.29 |
Finally, in Table IV we compare our approach with state-of-the-art methods for head pose estimation on the AFLW2000-3D database. Our method performs better than Trees and FAN, which are landmark-based methods. It is also better than 3DDFA and the state-of-the-art Hopenet, which respectively use 3D morphable model fitting and head pose estimation from the raw images without aligning facial landmarks. Note that our method is most significantly better on the yaw angle estimation, which is the main benchmark on AFLW2000-3D. More precisely, after cascade stage 1, AC-DC is roughly as accurate as Hopenet in terms of average head pose estimation whereas, after stage 4, it performs much better: this highlights the fact that using head pose information to refine landmark alignment provides more precise landmark estimates, which in turn helps refine the head pose prediction, further advocating for an entwined landmark alignment and head pose prediction scheme.
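As a small illustration of the head pose metric, the sketch below computes the per-angle MAE and its average over angles, and checks that the last value reported for AC-DC (stage 1) is consistent with the average of the three preceding ones. The (yaw, pitch, roll) column order and the absence of angle wrap-around handling are assumptions made for this example.

```python
import numpy as np


def pose_mae(pred, gt):
    """Per-angle mean absolute error and its average, inputs of shape (N, 3) in degrees."""
    per_angle = np.abs(pred - gt).mean(axis=0)
    return per_angle, float(per_angle.mean())


# Toy predictions for two images (yaw, pitch, roll in degrees).
gt = np.array([[30.0, -10.0, 5.0], [-45.0, 0.0, -2.0]])
pred = np.array([[33.0, -14.0, 4.0], [-44.0, 6.0, -7.0]])
per_angle, average = pose_mae(pred, gt)
print(per_angle, average)

# Consistency check on the AC-DC (stage 1) values quoted in this section.
stage1_angles = np.array([3.48, 8.26, 6.87])
stage1_mean = float(stage1_angles.mean())
print(round(stage1_mean, 2))
```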
IV-F Qualitative analysis
Figure 5 shows the excitation values (computed using a single dense layer from the estimated head pose, as illustrated in Figure 4). For clarity, we only plot the excitation values for the first 200 examples of the AFLW2000-3D database, and the first (left), second (middle) and third (right) channels of the embeddings (and of the corresponding masked embeddings). These charts support several conclusions. First, the excitation values for the original image and raw embeddings are significantly lower than the excitation values of their spatially-masked counterparts. This indicates that the network gives more weight to the spatially-masked channels to refine the landmark-wise attention maps in the subsequent cascade stages, showing the benefit of designing a deep cascade as compared to a traditional deep approach. Second, the respective weights of the channels are heavily influenced by the yaw angle, notably for the masked image and channels. This influence is intrinsic to each individual channel, showing that channels are indeed selected using the head pose estimate and that the head pose conditioning takes place as expected.
Figure 6 shows examples of successful alignment on held-out images from the WFLW test set. Notice how the fused attention maps are coarse after the first and second cascade stages and are refined after the subsequent stages, outputting precise landmark alignment even under facial expressions, non-planar head poses and difficult environmental lighting.
Last but not least, note that landmark-wise attention maps are generally more spread-out in case of an occluded or misaligned landmark: as such, an interesting future direction would be to estimate landmark-wise alignment uncertainty from the spread measurement, which is possible under certain assumptions .
In this paper, we proposed to entwine head pose estimation and facial landmark alignment tasks inside an attentional cascade. The proposed architecture employs a geometry transfer network (GTN), whose parameters are shared among the various cascade stages, to solve the annotation transfer task and integrate multiple heterogeneous landmark prediction tasks as well as head pose estimation within a single deep network. Our model also uses doubly-conditional fusion blocks to select relevant channels, and regions thereof, depending on a current head pose estimate, as well as attention maps corresponding to a current landmark alignment. The proposed AC-DC network can be trained in an end-to-end manner and significantly improves the state-of-the-art results for both 2D and 3D face alignment on recent large-scale datasets, as well as for head pose estimation, advocating for an entwined head pose estimation and face alignment pipeline. As such, AC-DC could be applied to closely related computer vision domains such as body pose estimation.
- Faster than real-time facial alignment: a 3d spatial transformer network approach in unconstrained poses. In ICCV.
- (2017) How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In CVPR.
- (2012) Real-time facial feature detection using conditional regression forests. In CVPR.
- (2019) DeCaFA: deep convolutional cascade for face alignment in the wild. arXiv preprint arXiv:1904.02549.
- Pairwise conditional random forests for facial expression recognition. In ICCV 2015.
- (2018) Style aggregated network for facial landmark detection. In CVPR.
- (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV.
- Wing loss for robust facial landmark localisation with convolutional neural networks. In CVPR.
- (2018) Stacked dense u-nets with dual transformers for robust face alignment. BMVC.
- (2018) Squeeze-and-excitation networks. In CVPR.
- (2014) One millisecond face alignment with an ensemble of regression trees. In CVPR.
- (2017) What uncertainties do we need in bayesian deep learning for computer vision? In NeurIPS.
- (2017) Deep alignment network: a convolutional neural network for robust face alignment. In CVPR workshops.
- (2018) Disentangling 3d pose in a dendritic cnn for unconstrained 2d face alignment. In CVPR.
- (2015) Deep learning face attributes in the wild. In ICCV.
- (2014) Face alignment at 3000 FPS via regressing local binary features. CVPR.
- (2018) Fine-grained head pose estimation without keypoints. In CVPR Workshops.
- (2015) 300 Faces In-The-Wild Challenge: database and results. IVC.
- (2016) Face2face: real-time face capture and reenactment of rgb videos. In CVPR.
- (2016) Mnemonic Descent Method: A Recurrent Process Applied for End-to-End Face Alignment. CVPR.
- Joint 3d face reconstruction and dense face alignment from a single image with 2d-assisted self-supervised learning. arXiv preprint arXiv:1903.09359.
- (2019) Face alignment using a 3d deeply-initialized ensemble of regression trees. arXiv preprint arXiv:1902.01831.
- (2017) Landmark based head pose estimation benchmark and method. In ICIP.
- (2018) Look at boundary: a boundary-aware face alignment algorithm. In CVPR.
- (2017) Leveraging intra and inter-dataset variations for robust face alignment. In CVPR workshops.
- (2013) Supervised descent method and its applications to face alignment. CVPR.
- (2018) Attentional alignment network. BMVC.
- (2015) Leveraging datasets with varying annotations for face alignment via deep regression network. In ICCV.
- (2016) Learning deep representation for face alignment with auxiliary attributes. PAMI.
- (2016) Face alignment across large poses: a 3d solution. In CVPR, pp. 146–155.
- (2019) Face alignment in full pose range: a 3d total solution. PAMI.