LDDMM-Face: Large Deformation Diffeomorphic Metric Learning for Flexible and Consistent Face Alignment

08/02/2021
by   Huilin Yang, et al.

We propose a flexible and consistent face alignment framework, LDDMM-Face, the key contribution of which is a deformation layer that naturally embeds facial geometry in a diffeomorphic way. Instead of predicting facial landmarks via heatmap or coordinate regression, we formulate this task in a diffeomorphic registration manner and predict momenta that uniquely parameterize the deformation between the initial boundary and the true boundary, and then perform large deformation diffeomorphic metric mapping (LDDMM) simultaneously for curves and landmarks to localize the facial landmarks. Due to the embedding of LDDMM into a deep network, LDDMM-Face can consistently annotate facial landmarks without ambiguity, flexibly handle various annotation schemes, and even predict dense annotations from sparse ones. Our method can be easily integrated into various face alignment networks. We extensively evaluate LDDMM-Face on four benchmark datasets: 300W, WFLW, HELEN and COFW-68. LDDMM-Face is comparable or superior to state-of-the-art methods in traditional within-dataset and same-annotation settings, but truly distinguishes itself with outstanding performance when dealing with weakly-supervised learning (partial-to-full), challenging cases (e.g., occluded faces), and different training and prediction datasets. In addition, LDDMM-Face shows promising results on the most challenging task of predicting across datasets with different annotation schemes.


1 Introduction

Face alignment refers to identifying the geometric structure of a human face in a digital image by localizing key landmarks that are usually predefined and characteristic of the face's geometry. Face alignment is a prerequisite in many computer vision tasks, such as face recognition [53], facial expression recognition [36, 52], face verification [39], face reconstruction [33] and face reenactment [29].

For different datasets, there exist different face alignment annotation schemes. For example, COFW [4] annotates 29 landmarks, 300W [34] annotates 68, WFLW [46] annotates 98, and HELEN [23] annotates 194. Most existing face alignment methods can only deal with the specific annotation scheme adopted by the training dataset of interest and cannot flexibly accommodate multiple annotation schemes. Namely, if a model is trained on a dataset with a specific annotation scheme, it can only predict landmarks of that scheme; a model trained on 300W with a 68-landmark annotation scheme can predict the learned 68 landmarks but not other schemes such as a 194-landmark one. In addition, to date, no work can fully utilize partially-annotated data. In other words, it is infeasible to make full predictions based on only partially-annotated data, for example, predicting 194 landmarks when training with only 97 or even fewer landmarks. Even in traditional fully-supervised settings, most existing works cannot handle challenging cases such as occluded faces very well. This is because those existing works usually handle each landmark individually or predict landmarks with non-diffeomorphic deformations. In this way, the predicted landmarks can be inconsistent, resulting in incorrect facial geometry topology.

In this context, we formulate face alignment as a diffeomorphic registration problem. Specifically, we use boundary curves to represent facial geometry [11]. Then, large deformation diffeomorphic metric mapping (LDDMM), applied simultaneously to curves and landmarks [16, 18], between an initial face and the true face is encoded into a neural network for landmark localization. LDDMM delivers a non-linear smooth transformation with a favorable topology-preserving one-to-one mapping property. Once the diffeomorphism, which is parameterized by momenta [11, 16, 18], between the initial face and the true face is obtained, all points on or around the initial face have unique correspondences on or around the true face through the acquired diffeomorphism. This property makes it possible to predict facial landmarks of different annotation schemes with a model trained only on landmarks from a single annotation scheme. Utilizing both landmarks and curves enables LDDMM to handle shape deformations both locally and globally; the landmark term matches the corresponding landmarks, whereas the curve term draws the corresponding facial curves close to each other and preserves facial topology so that consistent landmark predictions can be made. Notably, we predict momenta instead of coordinate increments between the initial face and the true face, which gives the method additional flexibility and is the key novelty of the proposed approach. To our knowledge, this is the first time face alignment has been formulated as a diffeomorphic registration problem, providing novel insights into the face alignment realm.

In this work, our contributions include:

  • We propose a novel face alignment network by integrating LDDMM into deep neural networks to handle various facial annotation schemes. Our proposed approach, LDDMM-Face, can be easily integrated into most face alignment networks to effectively predict facial landmarks with different annotation schemes.

  • Our approach demonstrates, for the first time, the feasibility of predicting consistent facial boundaries and full facial landmarks when supervised with only partial annotations of the training data, a novel form of weakly-supervised learning.

  • We comprehensively evaluate the performance of our proposed LDDMM-Face on multiple widely-used face alignment benchmark datasets [23, 46, 34], in terms of not only landmark prediction accuracy but also overall facial geometry matching degree.

  • We demonstrate the effectiveness of LDDMM-Face: it is superior in handling challenging cases even across datasets, excels at partial-to-full predictions, adapts to various deep network settings, predicts consistent facial boundaries under different training annotations, and flexibly handles multiple annotation schemes either within or across datasets.

2 Related Works

Face alignment has been widely researched with fruitful outcomes. From early Active Appearance Models [7, 25, 19, 35] and Active Shape Models [26] to recently-developed Cascaded Shape Regression [49, 38, 20, 32, 24, 58, 5, 43] and deep learning methods [56, 55, 41, 48, 12, 3, 51, 21, 28], accuracy has improved significantly. Equipped with the powerful image feature extraction capability of convolutional neural networks (CNNs), deep learning methods hold state-of-the-art (SOTA) results. The method proposed in this work, LDDMM-Face, falls within the deep learning scope.

Coordinate regression models Coordinate regression models use neural networks to directly predict the coordinates of facial landmarks. Zhang [55] incorporates a variety of facial attributes such as gender, expression and appearance in a multi-task learning framework. Kowalski [21] presents the Deep Alignment Network (DAN), which employs landmark heatmaps and a transformation of the face and landmarks to a canonical plane within a CNN. Qian [31] leverages the disentangled style and shape space of each individual to augment existing structures via style translation, which can be regarded as a data augmentation approach.

Heatmap regression models Heatmap regression models use neural networks to regress a set of heatmaps, one for each individual landmark, and then indirectly estimate landmarks from the heatmaps. Wu [46] uses facial boundary heatmaps to supervise landmark prediction, utilizing an hourglass module [28] and a message passing scheme [6]. Zou [60] employs hierarchically structured landmark ensembles to depict holistic and local structures of facial landmarks for robust facial landmark detection. Wang [45] proposes an adaptive wing loss for heatmap regression, with the ability to adapt its shape to various types of ground truth heatmap pixels. Kumar [22] uses a mean estimator for the heatmap and jointly predicts landmark locations, the uncertainties of the predicted locations, and landmark visibilities. Browatzki [2] leverages implicit knowledge by training an autoencoder on unannotated facial datasets to perform few-shot face alignment.

Weakly supervised learning for face alignment Weakly supervised face alignment works [2, 31, 10] usually train models on a subset of the data with full annotations. To the best of our knowledge, no previous work has attempted either training on partial annotations while making full predictions, or training on full annotations while predicting additional landmarks.

Cross-dataset/annotation face alignment Cross-dataset face alignment refers to training on a dataset with a specific annotation scheme while evaluating on other datasets with the same annotation scheme. Cross-annotation face alignment refers to training on a dataset with a specific annotation scheme but evaluating on datasets with different annotation schemes. Both cross-dataset and cross-annotation performance measure a method's generalization ability. So far, existing works [46, 59, 54] typically utilize information from multiple datasets and their corresponding annotation schemes to boost training performance on one specific dataset, and no work has investigated flexible and consistent face alignment across datasets or across annotations.

LDDMM is a SOTA registration framework that has been widely used in the biomedical image field [27, 40, 17, 50]. Recently, LDDMM has also shown its effectiveness in face-related fields [52]. A key property of LDDMM is that it yields a diffeomorphism between two manifolds of interest, which inspires the proposed deformation layer in LDDMM-Face and makes it feasible to consistently predict additional landmarks (beyond the training ones), to make cross-annotation predictions, and to effectively deal with challenging cases.

3 Deep LDDMM Network

Figure 2: The overall pipeline of LDDMM-Face, which consists of a backbone model and two functional layers: a momenta estimator and a deformation layer that consists of N flows. In each flow of the deformation layer, the initial facial curve is shown in the same color as that in the mean face, and the deformed facial curve is shown in black connected diamonds. The fine blue lines connecting each initial landmark and the corresponding deformed landmark denote the trajectory of the initial landmarks. Green arrows show the predicted momenta at each time step along the trajectory.

As mentioned in section 2, coordinate and heatmap regression methods cannot predict facial landmarks of different annotation schemes without retraining. In LDDMM-Face, we integrate LDDMM-based facial shape registration into a deep learning framework, which can not only consistently predict facial landmarks across different annotations and datasets but also train a face alignment model in a weakly-supervised, partial-to-full fashion.

Given a normalized RGB face image, LDDMM-Face first extracts both spatial and semantic features from the input image with a replaceable backbone model. Second, the features are passed through a deep LDDMM head, which consists of a momenta estimator and a deformation layer. The momenta estimator contains fully-connected layers and predicts vertical and horizontal momenta for each landmark. Supposing the geometry of a face is characterized by $N$ boundary curves, the deformation layer has $N$ sublayers (flow $1$ to flow $N$). Each sublayer separately deforms the corresponding initial curve, a procedure detailed in subsection 3.1. Two inputs, the mean face serving as the initial face and the estimated momenta, are fed into the deformation layer. The deformed facial curves from each sublayer are sequentially concatenated, yielding an estimate of the true face. Fig. 2 shows the overall pipeline of LDDMM-Face.
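To make the data flow concrete, below is a minimal PyTorch sketch of this pipeline under stated assumptions: the class names (`MomentaEstimator`, `LDDMMFace`), the single pooling-plus-linear head, and the helper `deform_curve` (standing in for one flow of the deformation layer, sketched in subsection 3.1.2) are ours, not the released implementation.

```python
import torch
import torch.nn as nn

class MomentaEstimator(nn.Module):
    """Average pooling + a fully-connected layer that predicts vertical and
    horizontal momenta for each of the facial boundary points."""

    def __init__(self, feat_channels: int, num_points: int):
        super().__init__()
        self.num_points = num_points
        self.pool = nn.AdaptiveAvgPool2d(1)                 # (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Linear(feat_channels, num_points * 2)  # 2 momentum components per point

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(feats).flatten(1)).view(-1, self.num_points, 2)

class LDDMMFace(nn.Module):
    """Backbone -> momenta estimator -> deformation layer (one flow per curve)."""

    def __init__(self, backbone: nn.Module, feat_channels: int, mean_face_curves):
        super().__init__()
        self.backbone = backbone
        self.mean_face_curves = mean_face_curves  # list of (n_i, 2) tensors: the initial face
        self.estimator = MomentaEstimator(
            feat_channels, sum(len(c) for c in mean_face_curves))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        momenta = self.estimator(self.backbone(image))  # (B, total_points, 2)
        deformed, start = [], 0
        for curve in self.mean_face_curves:             # flow 1 ... flow N, one per curve
            m = momenta[:, start:start + len(curve)]
            # deform_curve: one LDDMM flow (broadcast over the batch); see Sec. 3.1.2
            deformed.append(deform_curve(curve, m))
            start += len(curve)
        return torch.cat(deformed, dim=1)               # estimate of the true face
```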

The structure and configurations of the baseline backbone model are identical to an existing SOTA facial landmark detector [44]. We focus on the proposed deformation layer and loss function, since these components can be readily integrated into most deep learning-based face alignment pipelines and detailed investigations of the baseline network go beyond the scope of this work.

3.1 LDDMM Deformation Layer

3.1.1 LDDMM-curvelandmark

Our proposed deformation layer, based on LDDMM-curvelandmark, combines the advantages of LDDMM-curve [16] and LDDMM-landmark [18] to account for both global and local discrepancies in the matching process. LDDMM [16, 18, 11] is a registration framework that provides a diffeomorphic transformation acting on the ambient space. Under the LDDMM framework, objects are placed into a reproducing kernel Hilbert metric space through time-varying velocity vector fields $v_t(x)$, for $x$ in the ambient space and $t \in [0, 1]$. The underlying assumption is that the two objects of interest are of equivalent topology and one can be deformed from the other via a flow of diffeomorphisms. Given a pair of objects $O_1$ and $O_2$, the time-dependent flow of diffeomorphisms $\phi_t$ transforming $O_1$ to $O_2$ is defined according to the ordinary differential equation (ODE) $\frac{d\phi_t}{dt} = v_t(\phi_t)$, with $\phi_0 = \mathrm{id}$. The resulting diffeomorphism is acquired as the end point of the diffeomorphism flow at time $t = 1$ such that $\phi_1(O_1) = O_2$. To ensure the resulting transformation is diffeomorphic, $v_t$ must satisfy the constraint that $\int_0^1 \|v_t\|_V \, dt < \infty$, with $V$ being a Hilbert space associated with a reproducing kernel function $k_V$ and a norm $\|\cdot\|_V$ [42]. In practice, a Gaussian kernel is selected, $k_V$ being $k_V(x, y) = \exp(-\|x - y\|_2^2 / \sigma_V^2)$, where $\sigma_V$ represents the kernel size that is usually selected empirically and $\|\cdot\|_2$ denotes the $L_2$-norm.
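As a reference point for the formulas below, the pairwise Gaussian kernel can be written in a few lines of PyTorch; the function name and tensor shapes are our choices for illustration.

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float) -> torch.Tensor:
    """k_V(x, y) = exp(-||x - y||_2^2 / sigma^2), evaluated for all pairs.
    x: (n, d) and y: (m, d) point sets -> (n, m) kernel matrix."""
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return torch.exp(-sq_dist / sigma ** 2)
```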

In LDDMM-curve, a curve is discretized into a sequence of ordered points $\{p_1, \dots, p_{m+1}\}$. That curve can be encoded by those points along with their tangent vectors such that $C = \{(c_i, \tau_i)\}_{i=1}^{m}$, with $c_i = \frac{1}{2}(p_i + p_{i+1})$ being the center of two sequential points and $\tau_i = p_{i+1} - p_i$ being the tangent vector at point $c_i$. $C$ is associated with a sum of vector-valued Diracs, $C = \sum_{i=1}^{m} \tau_i \delta_{c_i}$, and is embedded into a Hilbert metric space $W^*$ of smooth vectors with the norm being

$$\|C\|_{W^*}^2 = \sum_{i=1}^{m} \sum_{j=1}^{m} \tau_i^\top k_W(c_i, c_j)\, \tau_j \qquad (1)$$

where $k_W$ is the reproducing kernel in the space $W$ ($k_W$ is of the same form as that of $k_V$) and $W^*$ is the dual space of $W$. In LDDMM-landmark, a set of landmarks is represented by its Cartesian coordinates. Thus, a set of ordered points can be modelled as both a curve and landmarks.
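A sketch of this curve encoding and of the norm in Eq. 1, reusing `gaussian_kernel` from the sketch above; the function names are ours.

```python
def curve_to_measure(points: torch.Tensor):
    """Encode an ordered polyline of m+1 points (shape (m+1, 2)) as
    vector-valued Diracs: centers c_i and tangent vectors tau_i."""
    centers = 0.5 * (points[1:] + points[:-1])   # c_i: midpoints of consecutive points
    tangents = points[1:] - points[:-1]          # tau_i: tangent (segment) vectors
    return centers, tangents

def curve_norm_sq(centers: torch.Tensor, tangents: torch.Tensor,
                  sigma_w: float) -> torch.Tensor:
    """||C||^2_{W*} = sum_{i,j} tau_i^T k_W(c_i, c_j) tau_j  (Eq. 1)."""
    k = gaussian_kernel(centers, centers, sigma_w)   # (m, m)
    return (k * (tangents @ tangents.T)).sum()
```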

LDDMM-curve can handle the overall shape whereas LDDMM-landmark is more powerful in dealing with local information. Assume that the template object (the transforming object) and the target object (the object being transformed to) are respectively discretized as $O_1 = \{x_i\}_{i=1}^{n}$ and $O_2 = \{y_i\}_{i=1}^{n}$, and $\phi_1(O_1)$ is the deformed object; then the resulting diffeomorphism is obtained by minimizing the following inexact matching functional

$$J(v) = \gamma \int_0^1 \|v_t\|_V^2 \, dt + E\big(\phi_1(O_1), O_2\big) \qquad (2)$$

where $\gamma \int_0^1 \|v_t\|_V^2 \, dt$ can be interpreted as the energy consumed by the flow of diffeomorphisms, and the second term quantifies the overall discrepancy between the deformed object $\phi_1(O_1)$ and the target object $O_2$. $\gamma$ is a weight serving as the trade-off coefficient between the consumed energy and the overall discrepancy. In LDDMM-curvelandmark, the discrepancy consists of two parts

$$E\big(\phi_1(O_1), O_2\big) = E_{\mathrm{curve}}\big(\phi_1(O_1), O_2\big) + \lambda\, E_{\mathrm{lm}}\big(\phi_1(O_1), O_2\big) \qquad (3)$$

where $E_{\mathrm{curve}}$ measures the discrepancy between the deformed object and the target object when modelled as curves and $E_{\mathrm{lm}}$ quantifies the corresponding discrepancy when modelled as landmarks. $\lambda$ is a trade-off weight deciding the relative importance of curve and landmark. The curve discrepancy is computed as the norm of the difference between the two vector-valued curve representations in the space $W^*$, which is explicitly

$$E_{\mathrm{curve}} = \Big\| \sum_i \tau_i \delta_{c_i} - \sum_j \tilde{\tau}_j \delta_{\tilde{c}_j} \Big\|_{W^*}^2 = \sum_{i, i'} \tau_i^\top k_W(c_i, c_{i'})\, \tau_{i'} - 2 \sum_{i, j} \tau_i^\top k_W(c_i, \tilde{c}_j)\, \tilde{\tau}_j + \sum_{j, j'} \tilde{\tau}_j^\top k_W(\tilde{c}_j, \tilde{c}_{j'})\, \tilde{\tau}_{j'} \qquad (4)$$

where $(c_i, \tau_i)$ and $(\tilde{c}_j, \tilde{\tau}_j)$ respectively denote the centers and tangents of the deformed and target curves, and the landmark discrepancy is computed as the (squared) Euclidean distance averaged across all point pairs

$$E_{\mathrm{lm}} = \frac{1}{n} \sum_{i=1}^{n} \big\| \phi_1(x_i) - y_i \big\|_2^2 \qquad (5)$$

After minimizing $J(v)$, the resulting diffeomorphism is parameterized by the velocity vector field as $v_t(x) = \sum_{i=1}^{n} k_V\big(x, x_i(t)\big)\, \alpha_i(t)$, where $\alpha_i(t)$ denotes the time-dependent momentum at the $i$-th landmark. A diffeomorphism is completely encoded by the initial momenta $\{\alpha_i(0)\}_{i=1}^{n}$ in the template space. These momenta can be obtained by solving the following sets of ODEs

$$\frac{d x_i(t)}{dt} = \sum_{j=1}^{n} k_V\big(x_i(t), x_j(t)\big)\, \alpha_j(t), \qquad \frac{d \alpha_i(t)}{dt} = - \sum_{j=1}^{n} \big(\alpha_i(t)^\top \alpha_j(t)\big)\, \nabla_{x_i(t)} k_V\big(x_i(t), x_j(t)\big) \qquad (6)$$

where $x_i(t)$ denotes the trajectory of the $i$-th landmark on the template object.
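One plausible discretization of Eq. 6 is explicit Euler integration of the coupled point/momentum system, sketched below with the `gaussian_kernel` helper from above; the step count and the exact integration scheme are our assumptions (the paper does not specify them here).

```python
def shoot(x0: torch.Tensor, alpha0: torch.Tensor, sigma_v: float, steps: int = 10):
    """Integrate the ODEs of Eq. 6 from initial points x0 (n, 2) and initial
    momenta alpha0 (n, 2). Returns the point trajectories and the momenta at
    every time step, each as a list of (n, 2) tensors."""
    dt = 1.0 / steps
    x, a = x0.clone(), alpha0.clone()
    traj, moms = [x], [a]
    for _ in range(steps):
        k = gaussian_kernel(x, x, sigma_v)            # (n, n)
        dx = k @ a                                    # dx_i/dt = sum_j k_V(x_i, x_j) alpha_j
        # grad_{x_i} k_V(x_i, x_j) = -2 (x_i - x_j) / sigma^2 * k_V(x_i, x_j)
        diff = x[:, None, :] - x[None, :, :]          # (n, n, 2)
        grad_k = (-2.0 / sigma_v ** 2) * diff * k[..., None]
        da = -((a @ a.T)[..., None] * grad_k).sum(1)  # dalpha_i/dt of Eq. 6
        x, a = x + dt * dx, a + dt * da
        traj.append(x)
        moms.append(a)
    return traj, moms
```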

3.1.2 Deformation Layer

The deformation layer takes the predicted momenta as inputs to perform the LDDMM-induced deformation on the initial face. The trajectory of the $i$-th landmark is

$$x_i(t) = x_i(0) + \int_0^t \sum_{j=1}^{n} k_V\big(x_i(s), x_j(s)\big)\, \alpha_j(s)\, ds \qquad (7)$$

The finally estimated true face (also called the deformed face) is obtained at the end time point $t = 1$ of the transformation flow.

As illustrated in the top panel of Fig. 2, since a face is modelled using $N$ boundary curves, the LDDMM transformation component of the deformation layer is separately implemented for each curve, from flow $1$ to flow $N$; $N$ depends on the annotation scheme. The procedure of each flow is demonstrated in the bottom panel of Fig. 2.

3.2 Loss Function

The loss function in our proposed network is inspired by the objective function of LDDMM-curvelandmark in Eq. 2. Focusing on accuracy, $\gamma$ is chosen to be 0, given that an accurate matching matters more than a geodesic path in face alignment. Although $\gamma$ is 0, the solution of the loss function is embedded into the space $V$ and still yields diffeomorphic transformations. Thus, the loss function, minimized with respect to the vector of LDDMM momenta $\boldsymbol{\alpha}$, is

$$\mathcal{L}(\boldsymbol{\alpha}) = \frac{1}{d_{io}} \sum_{n=1}^{N} \Big[ E_{\mathrm{curve}}\big(C_n^{\phi}, C_n\big) + \lambda\, E_{\mathrm{lm}}\big(L_n^{\phi}, L_n\big) \Big] \qquad (8)$$

where $C_n$ is the vector-measured expression of the ground truth curve of the $n$-th facial curve, $C_n^{\phi}$ denotes the corresponding deformed curve, and $E_{\mathrm{curve}}$ quantifies the discrepancy between the ground truth and the deformed curve computed via Eq. 4. $L_n$ is a vector representing the ground truth landmarks of the $n$-th facial curve, $L_n^{\phi}$ denotes the corresponding deformed landmarks, and $E_{\mathrm{lm}}$ measures the discrepancy between the ground truth and the deformed landmarks computed via Eq. 5. $d_{io}$ is the distance between the pupils of the ground truth face. $\lambda$ is a trade-off coefficient between landmark and curve.

Therefore, our loss function takes discrepancies of both landmark and curve into consideration and consequentially is able to handle local as well as global discrepancies between the ground truth and the deformed face.
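Putting Eqs. 4, 5 and 8 together, a sketch of the loss (with $\gamma = 0$) might look as follows, reusing `gaussian_kernel` and `curve_to_measure` from the earlier sketches; treating the deformed and ground-truth points as index-aligned is our assumption.

```python
def curve_discrepancy(def_pts, gt_pts, sigma_w):
    """E_curve of Eq. 4: squared W*-norm of the difference of two curve measures."""
    c1, t1 = curve_to_measure(def_pts)
    c2, t2 = curve_to_measure(gt_pts)
    return ((gaussian_kernel(c1, c1, sigma_w) * (t1 @ t1.T)).sum()
            - 2.0 * (gaussian_kernel(c1, c2, sigma_w) * (t1 @ t2.T)).sum()
            + (gaussian_kernel(c2, c2, sigma_w) * (t2 @ t2.T)).sum())

def landmark_discrepancy(def_pts, gt_pts):
    """E_lm of Eq. 5: squared Euclidean distance averaged over point pairs."""
    return ((def_pts - gt_pts) ** 2).sum(-1).mean()

def lddmm_face_loss(def_curves, gt_curves, sigma_w, lam, d_io):
    """Eq. 8 with gamma = 0: curve + lambda * landmark discrepancies summed over
    the N facial curves and normalized by the inter-pupil distance d_io."""
    loss = sum(curve_discrepancy(d, g, sigma_w) + lam * landmark_discrepancy(d, g)
               for d, g in zip(def_curves, gt_curves))
    return loss / d_io
```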

3.3 Flexible and Consistent Face Alignment

Once momenta are obtained from LDDMM between an initial face and a true face, the diffeomorphism between that face pair is uniquely defined [16, 18]. This transformation can be used to deform not only the landmarks used in the matching procedure but also any other landmarks sitting around the transforming face boundary. Due to the smooth, topology-preserving and one-to-one mapping property of the obtained diffeomorphism, we can compute the deformed location of any landmark lying on/around the face boundary in a consistent way. Any two deformed landmarks can never cross each other and any deformed boundary can never cross itself, which is practically and intuitively reasonable for the muscle motions of a human face. Suppose the initial locations of $m$ such landmarks lying on/around the face boundary are $\{z_k(0)\}_{k=1}^{m}$; we have

$$\frac{d z_k(t)}{dt} = \sum_{i=1}^{n} k_V\big(z_k(t), x_i(t)\big)\, \alpha_i(t),$$

where $z_k(t)$ represents the location of the deformed $k$-th landmark at time $t$, $t \in [0, 1]$. $x_i(t)$ and $\alpha_i(t)$ respectively denote the location and momentum of the $i$-th landmark that has been used in the matching procedure, and $k_V$ is the reproducing kernel used in the matching procedure. The final locations are obtained at the end point of the transformation flow, namely $\{z_k(1)\}_{k=1}^{m}$. The transformations and acquired momenta are different for the $N$ facial curves, and thus landmarks on/around each curve are deformed separately.

Fig. 3 demonstrates an example of consistent alignment. Notably, some of the newly-annotated cyan star landmarks that were not involved in obtaining the LDDMM-induced diffeomorphism can still be deformed to proper locations through the predicted diffeomorphism.

Figure 3: Demonstration of flexible and consistent face alignment for a right cheek. Red circles and blue diamonds respectively represent the initial and true facial landmarks involved in training. Cyan stars and magenta stars respectively represent newly-annotated landmarks on/around the initial face and the corresponding deformed landmarks of the newly-annotated ones through diffeomorphism obtained in the training stage. Green arrows denote the momenta along the trajectory of the transforming curve and blue lines represent the corresponding trajectory. Gray grids represent the diffeomorphism-induced deformations. Note that even the cyan points not used in computing the diffeomorphism are deformed correctly.

Therefore, given a pair of initial and true faces, once we have the LDDMM derived momenta, we can flexibly as well as consistently predict the deformed location of any extra landmark regardless of the training annotation scheme used.
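This extra-landmark deformation can be sketched by carrying new points through the stored flow; the sketch below assumes the point trajectories and per-step momenta returned by the `shoot` sketch in subsection 3.1.1.

```python
def transport_points(z0: torch.Tensor, traj, moms, sigma_v: float) -> torch.Tensor:
    """Deform extra points z0 (m, 2) through the diffeomorphism encoded by the
    matched landmark trajectories traj[t] and momenta moms[t] (each (n, 2)):
    dz_k/dt = sum_i k_V(z_k, x_i(t)) alpha_i(t), integrated with Euler steps."""
    steps = len(moms) - 1
    dt = 1.0 / steps
    z = z0.clone()
    for t in range(steps):
        k = gaussian_kernel(z, traj[t], sigma_v)  # (m, n)
        z = z + dt * (k @ moms[t])
    return z                                      # z(1): final deformed locations
```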

4 Experiment

In this section, we describe the employed datasets, error metrics and implementation details. Subsection 4.1 shows the adaptability of LDDMM-Face and subsection 4.2 compares our results with SOTA. Subsections 4.3 and 4.4 show the flexibility and consistency of LDDMM-Face through weakly-supervised face alignment in a partial-to-full manner and through evaluations across annotations and across datasets.

Datasets To evaluate the performance of LDDMM-Face, we conduct experiments on 300W [34], WFLW [46], HELEN [23] and COFW-68 [4, 15], all of which are benchmark datasets for face alignment. For more details on these datasets, please refer to our supplementary material.

Error Metrics We use two types of metrics to quantify the face alignment error:

  • $NME_{lm}$: the mean distance between the predicted landmarks and the ground truth landmarks, divided by the inter-ocular distance [32, 58, 48].

  • $NME_{curve}$: the mean iterative closest point (ICP) error between the predicted curves and the ground truth curves, divided by the inter-ocular distance [1].

Specifically, the ICP error is introduced to quantify the overall curve discrepancy; it addresses the problem that the point-wise landmark error is unavailable when there is no point-by-point correspondence between the predicted landmarks and the ground truth landmarks. With the ICP error, fair comparisons can be conducted between a heatmap regression-based baseline method and LDDMM-Face in weakly-supervised, cross-dataset and cross-annotation face alignment settings.

Following [46, 9], the area under the cumulative error distribution curve ($AUC_{0.1}$) and the failure rate ($FR_{0.1}$) are also used when evaluating the WFLW test set, and $FR_{0.1}$ is also used for COFW-68 [47, 58].
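For concreteness, the landmark metric can be sketched directly; for the curve metric we show only a simplified symmetric nearest-neighbor stand-in, since the paper's $NME_{curve}$ uses a full ICP alignment [1] whose details are not given here.

```python
def nme_landmark(pred: torch.Tensor, gt: torch.Tensor, d_io: float) -> torch.Tensor:
    """NME_lm: mean point-to-point distance / inter-ocular distance."""
    return ((pred - gt) ** 2).sum(-1).sqrt().mean() / d_io

def curve_error(pred: torch.Tensor, gt: torch.Tensor, d_io: float) -> torch.Tensor:
    """Simplified stand-in for NME_curve: symmetric mean nearest-point distance,
    normalized by the inter-ocular distance (no ICP alignment step)."""
    d = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1).sqrt()  # (n_pred, n_gt)
    return 0.5 * (d.min(1).values.mean() + d.min(0).values.mean()) / d_io
```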

Implementation LDDMM-Face consists of a backbone model, a momenta estimator, a deformation layer and a loss function, as presented in section 3. For the backbone model, we employ three different networks for evaluating the adaptability of LDDMM-Face, which will be described in subsection 4.1. For the momenta estimator, we adopt a simple yet effective structure consisting of an average pooling layer and a fully-connected layer. For the deformation layer of LDDMM-curvelandmark, $\sigma_V$ and $\sigma_W$ are respectively chosen to be the scale and half the scale of the coordinates of each facial curve of the mean face. $N$ is chosen to be 12 in order to efficiently characterize different parts of a face. For the loss function, $\lambda$ is empirically chosen to be 0.1. All our experiments are conducted with PyTorch 1.7.1 [30] on 4 RTX 3090 GPUs. More details are illustrated in our supplementary material. Code will be released.
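One possible reading of the kernel-size choice above, with the coordinate "scale" of a curve taken as the norm of its bounding-box extent on the mean face (our interpretation; the paper does not define "scale" precisely):

```python
def kernel_sizes(mean_face_curves):
    """Per-curve kernel sizes: sigma_V = curve scale, sigma_W = half that scale."""
    sigmas = []
    for c in mean_face_curves:                          # each c: (n_i, 2) tensor
        scale = (c.max(0).values - c.min(0).values).norm()
        sigmas.append((scale, 0.5 * scale))             # (sigma_V, sigma_W)
    return sigmas
```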

4.1 Adaptive Face Alignment across Different Learning Frameworks

In this subsection, we investigate the adaptability and robustness of LDDMM-Face incorporated into three networks, namely HRNet [44], Hourglass [51] and DAN [21]. HRNet and Hourglass are SOTA heatmap regression methods and DAN is a multi-stage coordinate regression method. Considering computational cost, we use HRNetV2-W18, a 2-stacked hourglass and VGG11 [37] as the corresponding backbone models of the three networks. Experimental results on 300W, WFLW and HELEN (Table 1) demonstrate that LDDMM-Face can be easily integrated into these face alignment networks. Among the three settings, LDDMM-Face (HRNet) gives the best results and is thus employed as the default model in our subsequent experiments.

4.2 Comparison with State-of-the-art Results

We compare LDDMM-Face with SOTA approaches on the test sets of WFLW and 300W, in Table 2 and Table 3 respectively. The experimental results verify the effectiveness of LDDMM-Face. On WFLW, although this dataset exhibits large variations in pose, expression and occlusion, LDDMM-Face yields superior results and outperforms almost all compared approaches. On 300W, the performance of LDDMM-Face is comparable to its baseline HRNet and outperforms most existing methods.

Method                     $NME_{lm}$ (%)
                           300W   WFLW   HELEN
LDDMM-Face (HRNet)         3.53   4.63   3.57
LDDMM-Face (Hourglass)     3.73   5.00   3.89
LDDMM-Face (DAN)           3.91   5.43   3.95
Table 1: $NME_{lm}$ (%) of LDDMM-Face incorporated into three different face alignment settings, obtained on the 300W full set, WFLW test set and HELEN test set.
Method                       $NME_{lm}$ (%)   $FR_{0.1}$ (%)   $AUC_{0.1}$
CFSS [58]                    9.07             29.40            0.3659
DVLN [47]                    6.08             10.84            0.4551
3FabRec [2]                  5.62             8.28             0.4840
LAB (Extra Data) [46]        5.27             7.56             0.5323
AVS [31]                     5.25             7.44             0.5034
SAN [9]                      5.22             6.32             0.5355
HRNet [44]                   4.60             4.64             0.5237
LDDMM-Face                   4.63             3.68             0.5509
LDDMM-Face (Weak-LF: 50%)    4.79             4.12             0.5352
Table 2: $NME_{lm}$, $FR_{0.1}$ and $AUC_{0.1}$ results on the WFLW test set. "Weak-LF" is short for the training landmark fraction in weakly-supervised face alignment.
Method                      $NME_{lm}$ (%)
                            300W Common   300W Challenging   300W Full
MDM [41]                    4.83          10.14              5.88
RAR [48]                    5.03          8.95               5.80
SAN [9]                     3.34          6.60               3.98
ODN [57]                    3.91          5.43               3.95
3FabRec [2]                 3.36          5.74               3.82
DAN [21]                    3.15          5.53               3.62
LAB (Extra Data) [46]       2.98          5.19               3.49
DeCaFa (Extra Data) [8]     2.93          5.26               3.39
HRNet [44]                  2.91          5.11               3.34
LDDMM-Face                  3.07          5.40               3.53
LDDMM-Face (Weak-LF: 50%)   3.18          5.65               3.67
Table 3: $NME_{lm}$ (%) results on the 300W common set, challenging set and full set.

4.3 Flexible and Consistent Face Alignment in a Weakly-supervised Manner

In this subsection, we demonstrate the power of LDDMM-Face in two respects. First, we validate the flexibility and consistency of LDDMM-Face by performing weakly-supervised learning. As described in subsection 3.3, LDDMM-Face can predict any extra landmark lying near the predefined curves. Thus, we can train a model with partial landmarks that minimally describe facial geometry and still predict full landmarks in a consistent way. We conduct such experiments on 300W, WFLW and HELEN. For 300W and WFLW, 50% of the facial landmarks are used for weakly-supervised training. Since HELEN annotates a total of 194 landmarks, a relatively large number, we further reduce the training landmarks to 33% in the HELEN experiment. As tabulated in Table 4, LDDMM-Face outperforms its baseline by a large margin. In terms of the overall $NME_{curve}$, there is a 40% improvement on 300W, a 10% improvement on WFLW and a 20% improvement on HELEN when trained with 50% of the landmarks, and a 35% improvement on HELEN when trained with 33% of the landmarks. Notably, LDDMM-Face is much better than the baseline at detecting the face contour and eyebrows, indicating it works better for curves with large deformations. When trained on partial landmarks, there is only a very mild decline in LDDMM-Face's performance compared to training on full landmarks.

As shown in Tables 2 and 3, on both WFLW and 300W, when trained with 50% of the landmarks, weakly-supervised LDDMM-Face still attains SOTA-level results and performs even better than some of the fully-supervised methods. Other methods cannot perform such partial-to-full predictions. More results are presented in the supplementary material.

Methods        $NME_{curve}$ (%)                          $NME_{lm}$ (%)
               O      F      E      N      I      M       Full landmarks (100%)
300W training landmark fraction: 50%
HRNet          4.82   9.34   6.00   3.18   1.86   3.74    -
LDDMM-Face     2.94   4.82   3.47   2.28   1.87   2.25    3.18
HELEN training landmark fraction: 50%
HRNet          2.95   4.33   3.08   3.16   1.77   2.40    -
LDDMM-Face     2.39   3.37   2.52   2.61   1.46   1.97    3.71
HELEN training landmark fraction: 33%
HRNet          3.73   5.63   3.95   3.92   2.15   3.01    -
LDDMM-Face     2.45   3.29   2.74   2.76   1.56   1.91    3.78
WFLW training landmark fraction: 50%
HRNet          3.95   5.72   4.04   3.34   3.30   3.36    -
LDDMM-Face     3.58   4.67   3.72   3.20   2.95   3.38    4.79
Table 4: Partial-to-full prediction results ($NME_{curve}$ per facial part and $NME_{lm}$ on full landmarks) on the 300W common set, HELEN test set and WFLW test set for weakly-supervised face alignment. The landmark fraction is the fraction of full landmarks used in the training stage. 'O' indicates the overall face, 'F' the face contour, 'E' the eyebrows, 'N' the nose, 'I' the eyes and 'M' the mouth. '-' indicates $NME_{lm}$ is unavailable for HRNet since it cannot make cross-annotation predictions.

4.4 Flexible and Consistent Face Alignment across Annotations and Datasets

We further validate the flexibility and consistency of LDDMM-Face by evaluating cross-dataset/annotation face alignment performance. Existing cross-dataset evaluations mainly utilize the COFW-68 dataset, which has been reannotated with a scheme identical to that of 300W.

As mentioned above, HELEN has two annotation schemes since it is also a subset of 300W. As such, the cross-annotation face alignment experiments between HELEN and 300W can be treated as cross-annotation but within-dataset. By applying an affine transformation from the source mean face to the target mean face, we can easily predict landmarks of different annotation schemes without retraining. From Fig. 4 and Table 5, we observe that LDDMM-Face significantly improves the performance over the baseline. It should be noted that although the 194-landmark annotation scheme of HELEN describes the nose and eyebrows in totally different ways from the 68-landmark annotation scheme of 300W, LDDMM-Face achieves decent performance. We also conduct simultaneous cross-dataset and cross-annotation experiments between 300W and WFLW, on which only slight improvements are observed due to the highly similar annotation schemes of these two datasets. Table 5 shows that LDDMM-Face is much better than the baseline in $NME_{curve}$, but the absolute value of $NME_{lm}$ is still relatively unsatisfactory compared to traditional within-dataset and within-annotation predictions. A plausible reason is that we use an affine transformation between the two different mean faces rather than directly modifying the mean face used in the specific training process, and the two mean faces may be highly inconsistent with each other. With that being said, this is, to the best of our knowledge, the first attempt at simultaneous cross-dataset and cross-annotation face alignment, with satisfactory performance in identifying the overall facial geometry (curve error). This observation further verifies the effectiveness and importance of LDDMM-Face.

To compare with existing SOTA cross-dataset face alignment results, we further conduct experiments on COFW-68, as summarized in Table 6. LDDMM-Face significantly outperforms the compared methods, especially in terms of $FR_{0.1}$, which is very sensitive to challenging cases like large poses and occlusion. The outstanding performance of LDDMM-Face on challenging cases is mainly due to the curve- and landmark-induced diffeomorphism; the diffeomorphic transformation ensures that the deformed facial geometry is consistent with that of the initial face, so that even occluded parts can be accurately predicted. Collectively, LDDMM-Face makes precise facial geometry predictions across different annotations (both within and across datasets), performs outstandingly in cross-dataset settings, and effectively handles challenging cases such as occluded faces. More cross-dataset/annotation results from LDDMM-Face can be found in our supplementary material.

Methods        $NME_{curve}$ (%)                          $NME_{lm}$ (%)
               O      F      E      N      I      M       O
HELEN to 300W
HRNet          5.49   6.72   5.82   9.00   2.72   3.21    -
LDDMM-Face     4.76   6.81   4.25   7.45   2.21   3.09    5.96
300W to HELEN
HRNet          5.60   6.61   6.18   9.07   2.79   3.36    -
LDDMM-Face     4.13   3.47   4.79   7.19   2.41   2.81    7.58
WFLW to 300W
HRNet          3.91   5.29   4.62   3.01   3.14   3.46    -
LDDMM-Face     3.88   5.27   4.08   3.32   2.94   3.76    4.53
300W to WFLW
HRNet          6.61   8.65   6.74   5.63   6.02   6.02    -
LDDMM-Face     6.04   6.82   6.41   5.77   5.33   5.88    9.58
Table 5: Comparison between LDDMM-Face and HRNet on the 300W common set, HELEN test set and WFLW test set for cross-dataset/annotation face alignment. "HELEN to 300W" means training on the HELEN train set and testing on the 300W common set; the same logic applies to the others.
Method                   $NME_{lm}$ (%)   $FR_{0.1}$ (%)
PCPR [4]                 8.76             20.12
TCDCN [55]               7.66             16.17
HPM [14]                 6.72             6.71
SAPM [13]                6.64             5.72
CFSS [58]                6.28             9.07
HRNet [44]               4.97             3.16
LAB (Extra Data) [46]    4.62             2.17
LDDMM-Face               4.54             1.18
Table 6: $NME_{lm}$ and $FR_{0.1}$ results of training on 300W and testing on the COFW-68 test set.
Figure 4: Representative cross-annotation/dataset face alignment results. From left to right, training/testing is respectively conducted on the HELEN/300W, 300W/HELEN, 300W/WFLW and WFLW/300W.

5 Conclusion

In this work, we present and validate a novel face alignment pipeline, LDDMM-Face, that performs flexible and consistent face alignment and effectively deals with challenging cases. The flexibility and consistency delivered by LDDMM-Face arise naturally from embedding LDDMM into deep learning. LDDMM-Face bridges the gap between different annotation schemes and makes face alignment more flexible than existing methods, which can only predict the landmarks annotated in their training data. Most importantly, LDDMM-Face generalizes well and can be integrated into various deep learning based face alignment networks.

References

  • [1] K. S. Arun, T. S. Huang, and S. D. Blostein (1987) Least-squares fitting of two 3-d point sets. IEEE Transactions on pattern analysis and machine intelligence (5), pp. 698–700. Cited by: item -.
  • [2] B. Browatzki and C. Wallraven (2020) 3FabRec: fast few-shot face alignment by reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6110–6120. Cited by: §2, §2, Table 2, Table 3.
  • [3] A. Bulat and G. Tzimiropoulos (2016) Two-stage convolutional part heatmap regression for the 1st 3d face alignment in the wild (3dfaw) challenge. In European Conference on Computer Vision, pp. 616–624. Cited by: §2.
  • [4] X. P. Burgos-Artizzu, P. Perona, and P. Dollár (2013) Robust face landmark estimation under occlusion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1513–1520. Cited by: §1, Table 6, §4.
  • [5] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun (2014) Joint cascade face detection and alignment. In European Conference on Computer Vision, pp. 109–122. Cited by: §2.
  • [6] X. Chu, W. Ouyang, H. Li, and X. Wang (2016) Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4715–4723. Cited by: §2.
  • [7] T. F. Cootes, G. J. Edwards, and C. J. Taylor (2001) Active appearance models. IEEE Transactions on Pattern Analysis & Machine Intelligence (6), pp. 681–685. Cited by: §2.
  • [8] A. Dapogny, K. Bailly, and M. Cord (2019) DeCaFA: deep convolutional cascade for face alignment in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6893–6901. Cited by: Table 3.
  • [9] X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Style aggregated network for facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 379–388. Cited by: Table 2, Table 3, §4.
  • [10] X. Dong and Y. Yang (2019) Teacher supervises students how to learn from partially labeled images for facial landmark detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 783–792. Cited by: §2.
  • [11] P. Dupuis, U. Grenander, and M. I. Miller (1998) Variational problems on flows of diffeomorphisms for image matching. Quarterly of applied mathematics, pp. 587–600. Cited by: §1, §3.1.1.
  • [12] H. Fan and E. Zhou (2016) Approaching human level facial landmark localization by deep learning. Image and Vision Computing 47, pp. 27–35. Cited by: §2.
  • [13] G. Ghiasi, C. C. Fowlkes, and C. Irvine (2015) Using segmentation to predict the absence of occluded parts.. In BMVC, pp. 22–1. Cited by: Table 6.
  • [14] G. Ghiasi and C. C. Fowlkes (2014) Occlusion coherence: localizing occluded faces with a hierarchical deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2385–2392. Cited by: Table 6.
  • [15] G. Ghiasi and C. C. Fowlkes (2015) Occlusion coherence: detecting and localizing occluded faces. arXiv preprint arXiv:1506.08347. Cited by: §4.
  • [16] J. Glaunès, A. Qiu, M. I. Miller, and L. Younes (2008) Large deformation diffeomorphic metric curve mapping. International journal of computer vision 80 (3), pp. 317. Cited by: §1, §3.1.1, §3.3.
  • [17] Z. Jiang, H. Yang, and X. Tang (2018) Deformation-based statistical shape analysis of the corpus callosum in mild cognitive impairment and alzheimer’s disease. Current Alzheimer Research 15 (12), pp. 1151–1160. Cited by: §2.
  • [18] S. C. Joshi and M. I. Miller (2000) Landmark matching via large deformation diffeomorphisms. IEEE transactions on image processing 9 (8), pp. 1357–1370. Cited by: §1, §3.1.1, §3.3.
  • [19] F. Kahraman, M. Gokmen, S. Darkner, and R. Larsen (2007) An active illumination and appearance (aia) model for face alignment. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7. Cited by: §2.
  • [20] V. Kazemi and J. Sullivan (2014) One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1867–1874. Cited by: §2.
  • [21] M. Kowalski, J. Naruniec, and T. Trzcinski (2017) Deep alignment network: a convolutional neural network for robust face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 88–97. Cited by: §2, §2, §4.1, Table 3.
  • [22] A. Kumar, T. K. Marks, W. Mou, Y. Wang, M. Jones, A. Cherian, T. Koike-Akino, X. Liu, and C. Feng (2020) LUVLi face alignment: estimating landmarks’ location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8236–8246. Cited by: §2.
  • [23] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang (2012) Interactive facial feature localization. In European conference on computer vision, pp. 679–692. Cited by: item •, §1, §4.
  • [24] D. Lee, H. Park, and C. D. Yoo (2015) Face alignment using cascade gaussian process regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4204–4212. Cited by: §2.
  • [25] I. Matthews and S. Baker (2004) Active appearance models revisited. International journal of computer vision 60 (2), pp. 135–164. Cited by: §2.
  • [26] S. Milborrow and F. Nicolls (2008) Locating facial features with an extended active shape model. In European conference on computer vision, pp. 504–513. Cited by: §2.
  • [27] M. I. Miller, L. Younes, J. T. Ratnanather, T. Brown, H. Trinh, D. S. Lee, D. Tward, P. B. Mahon, S. Mori, M. Albert, et al. (2015) Amygdalar atrophy in symptomatic alzheimer’s disease based on diffeomorphometry: the biocard cohort. Neurobiology of aging 36, pp. S3–S10. Cited by: §2.
  • [28] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483–499. Cited by: §2, §2.
  • [29] Y. Nirkin, Y. Keller, and T. Hassner (2019) Fsgan: subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7184–7193. Cited by: §1.
  • [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §4.
  • [31] S. Qian, K. Sun, W. Wu, C. Qian, and J. Jia (2019) Aggregation via separation: boosting facial landmark detector with semi-supervised style translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10153–10163. Cited by: §2, §2, Table 2.
  • [32] S. Ren, X. Cao, Y. Wei, and J. Sun (2014) Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1685–1692. Cited by: §2, item -.
  • [33] E. Richardson, M. Sela, R. Or-El, and R. Kimmel (2017) Learning detailed face reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1259–1268. Cited by: §1.
  • [34] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic (2013) 300 faces in-the-wild challenge: the first facial landmark localization challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 397–403. Cited by: item •, §1, §4.
  • [35] J. Saragih and R. Goecke (2007) A nonlinear discriminative approach to aam fitting. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8. Cited by: §2.
  • [36] E. Sariyanidi, H. Gunes, and A. Cavallaro (2015) Automatic analysis of facial affect: a survey of registration, representation, and recognition. IEEE transactions on pattern analysis and machine intelligence 37 (6), pp. 1113–1133. Cited by: §1.
  • [37] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • [38] K. Su and X. Geng (2019) Soft facial landmark detection by label distribution learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5008–5015. Cited by: §2.
  • [39] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708. Cited by: §1.
  • [40] X. Tang, D. Holland, A. M. Dale, L. Younes, M. I. Miller, and A. D. N. Initiative (2015) The diffeomorphometry of regional shape change rates and its relevance to cognitive deterioration in mild cognitive impairment and a lzheimer’s disease. Human brain mapping 36 (6), pp. 2093–2117. Cited by: §2.
  • [41] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou (2016) Mnemonic descent method: a recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4177–4187. Cited by: §2, Table 3.
  • [42] A. Trouvé (1995) An infinite dimensional group approach for physics based models in pattern recognition. preprint. Cited by: §3.1.1.
  • [43] O. Tuzel, T. K. Marks, and S. Tambe (2016) Robust face alignment using a mixture of invariant experts. In European Conference on Computer Vision, pp. 825–841. Cited by: §2.
  • [44] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al. (2020) Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence. Cited by: §3, §4.1, Table 2, Table 3, Table 6.
  • [45] X. Wang, L. Bo, and L. Fuxin (2019) Adaptive wing loss for robust face alignment via heatmap regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6971–6981. Cited by: §2.
  • [46] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou (2018-06) Look at boundary: a boundary-aware face alignment algorithm. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: item •, §1, §2, §2, Table 2, Table 3, Table 6, §4, §4.
  • [47] W. Wu and S. Yang (2017) Leveraging intra and inter-dataset variations for robust face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 150–159. Cited by: Table 2, §4.
  • [48] S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim (2016) Robust facial landmark detection via recurrent attentive-refinement networks. In European conference on computer vision, pp. 57–72. Cited by: §2, item -, Table 3.
  • [49] X. Xiong and F. De la Torre (2013) Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 532–539. Cited by: §2.
  • [50] H. Yang, J. Wang, H. Tang, Q. Ba, G. Yang, and X. Tang (2017) Analysis of mitochondrial shape dynamics using large deformation diffeomorphic metric curve matching. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4062–4065. Cited by: §2.
  • [51] J. Yang, Q. Liu, and K. Zhang (2017) Stacked hourglass network for robust facial landmark localisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 79–87. Cited by: §2, §4.1.
  • [52] P. Yang, H. Yang, Y. Wei, and X. Tang (2018) Geometry-based facial expression recognition via large deformation diffeomorphic metric curve mapping. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1937–1941. Cited by: §1, §2.
  • [53] D. Yi, Z. Lei, and S. Z. Li (2013-06) Towards pose robust face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [54] J. Zhang, M. Kan, S. Shan, and X. Chen (2015) Leveraging datasets with varying annotations for face alignment via deep regression network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3801–3809. Cited by: §2.
  • [55] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2016) Learning deep representation for face alignment with auxiliary attributes. IEEE transactions on pattern analysis and machine intelligence 38 (5), pp. 918–930. Cited by: §2, §2, Table 6.
  • [56] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin (2013) Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 386–391. Cited by: §2.
  • [57] M. Zhu, D. Shi, M. Zheng, and M. Sadiq (2019) Robust facial landmark detection via occlusion-adaptive deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3486–3496. Cited by: Table 3.
  • [58] S. Zhu, C. Li, C. Change Loy, and X. Tang (2015) Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4998–5006. Cited by: §2, item -, Table 2, Table 6, §4.
  • [59] S. Zhu, C. Li, C. C. Loy, and X. Tang (2014) Transferring landmark annotations for cross-dataset face alignment. arXiv preprint arXiv:1409.0602. Cited by: §2.
  • [60] X. Zou, S. Zhong, L. Yan, X. Zhao, J. Zhou, and Y. Wu (2019) Learning robust facial landmark detection via hierarchical structured ensemble. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 141–150. Cited by: §2.