Fast and Accurate: Structure Coherence Component for Face Alignment

06/21/2020 ∙ by Beier Zhu, et al. ∙ SenseTime Corporation ∙ University of Toronto

In this paper, we propose a fast and accurate coordinate regression method for face alignment. Unlike most existing facial landmark regression methods, which usually employ fully connected layers to convert feature maps into landmark coordinates, we present a structure coherence component to explicitly take the relation among facial landmarks into account. Due to the geometric structure of the human face, structure coherence between different facial parts provides important cues for effectively localizing facial landmarks. However, the dense connection in the fully connected layers overuses such coherence, making the important cues indistinguishable among all connections. Instead, our structure coherence component leverages a dynamic sparse graph structure to pass features among the most related landmarks. Furthermore, we propose a novel objective function, named Soft Wing loss, to improve the accuracy. Extensive experiments on three popular benchmarks, including WFLW, COFW and 300W, demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance with fast speed. Our approach is especially robust to challenging cases, resulting in impressively low failure rates on the COFW and WFLW datasets (0% and 2.88%, respectively).




1 Introduction

Face alignment, also known as facial landmark detection, is an important topic in computer vision and has attracted much attention over the past few years [43, 14, 8, 44]. As a fundamental step for face image analysis, face alignment plays a key role in many face applications such as face recognition [58], expression analysis [52] and face editing [38]. Although significant progress has been made, face alignment is still a challenging problem due to issues like occlusion, large pose and complicated expression.

Figure 1: Comparison between the fully connected layer and our graph convolutional layer. (a) Dense connection in the fully connected layer. (b, d) The performance of fully connected based and our Structure Coherence Component based method under different levels of occlusions. Green and red points correspond to ground-truth and prediction, respectively. (c) Sparse and relation-aware graph convolutional layer.

With the success of deep learning in several computer vision tasks such as image classification and object detection, many convolutional neural network (CNN) based face alignment methods have been proposed. Existing CNN-based face alignment methods can mainly be divided into two categories: coordinate regression based [39, 14, 43] and heatmap regression based ones [48, 8, 37]. Heatmap regression based methods commonly produce more precise localization owing to their translation equivariant property [5]. Preserving the sizes of the feature maps and heatmaps is essential for high accuracy. However, it also leads to computationally heavy models which are impractical for deployment in real-world applications. Coordinate regression based methods are relatively simpler and can be built on lighter convolutional networks. Fully connected (FC) layers are commonly used in such methods to convert feature maps to facial landmark coordinates [39, 14, 43]. However, the dense connections of fully connected layers make every landmark correlate with every other landmark. As shown in Fig. 1(a), in the FC layer, every landmark coordinate is connected to the same hidden features. The error of one landmark leads to errors in all other landmarks, especially in hard cases such as occlusion. As shown in Fig. 1(b), when we progressively occlude the human face, the error on the face contour leads to errors in other parts of the face.

Structure coherence between different facial parts provides important cues for effectively localizing facial landmarks, which helps keep the structure of the face and predict occluded landmarks. In this paper, we propose the Structure Coherence Component (SCC) to convert feature maps to facial landmark coordinates by explicitly exploring the relation among facial landmarks. With the help of deep geometric learning, we treat the intermediate features of each landmark as a node, and leverage a sparse graph structure to propagate features among the neighboring nodes, see Fig. 1(c). The sparse graph structure endows the model with the capability of using the facial structure coherence appropriately. The sparse graph structure is learnt by data-driven neighborhood construction and dynamic weight adjustment. Fig. 1(d) shows that reasoning with structure coherence cues allows our model to correctly localize the key points in challenging real-world situations such as occlusion and large pose. As shown in Fig. 2, the Structure Coherence Component consists of four parts: attention guided multi-scale feature fusion, mapping to node, a dynamic adjacency matrix weighting module and a graph relation network. The attention guided multi-scale feature fusion provides features rich in spatial details and semantic information. The mapping to node module converts these convolutional features into graph node representations, and the relation is learnt via the dynamic adjacency matrix weighting module, based on which the graph relation network effectively regresses the coordinates of facial landmarks. The proposed SCC, simple yet effective, permits more precise localization without burdening the model.

Furthermore, we propose Soft Wing loss to handle the side-effect of Wing loss [14] on small range errors. Since the facial landmarks are not strictly defined, the annotations vary among annotators, introducing some shifts [29]. In such a case, forcing the model to fit the ground-truth with a large gradient would cause unstable training. Therefore, we make the model focus more on errors of medium range.

We evaluate the proposed method on three widely-used face alignment benchmarks: WFLW [43], COFW [3] and 300W [33]. Experimental results demonstrate the effectiveness of our approach, which outperforms existing state-of-the-art regression based methods by a large margin. In addition to the strong performance, our model is much faster and lighter than the closest competitors. We conduct extensive ablation studies to show the effectiveness of each proposed module.

2 Related Work

Traditional models: Traditional facial landmark detection models mainly fall into two categories, i.e., fitting models and constrained local models. Taylor et al. introduce the Active Appearance Model (AAM) [6][11] to fit the facial images with a small number of coefficients, controlling both the facial appearance and the facial shape. Constrained local models [7][34] are introduced to predict the landmarks based on the global facial shape constraints as well as the independent local appearance information around each landmark. Locating facial landmarks with a graph structure is related to some previous works [17][57][40], which apply deformable part models (DPM) [12] to face analysis. These methods belong to probabilistic graphical models, which require hand-crafted potential functions and iterative optimization for inference. In contrast, our method is a deep learning based graph network, which generates richer and more expressive feature embeddings and enjoys faster inference.

CNN based coordinate regression models: Coordinate regression models directly map the face image to the landmark coordinates. Zhang et al. [53] improve the robustness of detection through multi-task learning, i.e., learning landmark coordinates and predicting facial attributes at the same time. Feng et al. [14] introduce a modified log loss, named Wing loss, to increase the contribution of small and medium errors to the training process. LAB [43] regresses facial landmark coordinates with the help of boundary information to reduce the annotation ambiguities. In spite of the advantage of explicit inference of landmark coordinates without any post-processing, the coordinate regression models generally underperform heatmap regression models.

CNN based heatmap regression models: Heatmap regression models leverage fully convolutional networks (FCNs) to maintain structure information throughout the whole network, and therefore outperform coordinate regression models. In recent work, the stacked hourglass (HG) [30] is widely used to achieve state-of-the-art performance. Yang et al. [48] first normalize faces with a supervised transform and then predict heatmaps using an HG. Liu et al. [29] develop a latent variable optimization strategy to reduce the impact of ambiguous annotations when training a 4-stacked HG. In addition to HG, architectures like HR-Net [37] are also able to yield excellent performance. Despite their higher accuracy, heatmap regression models are computationally much more costly than coordinate regression models.

Figure 2: An overview of the proposed method. The convolutional backbone computes hierarchical feature maps from the input face image. These features are forwarded to our Structure Coherence Component, which extracts spatial details and semantic information from different convolutional layers via attention. The map-to-node module then maps these attentive features into graph node representations. The node features extracted by the map-to-node module, together with the graph adjacency matrix learned from the dataset, are fed into the graph relation network to infer the facial landmarks.

Graph Neural Networks (GNNs): GNNs are a class of models that generalize deep learning to graph-structured data. They were first introduced in [35] and have become increasingly popular [1]. There are mainly two types of GNNs: Message Passing based Neural Networks [35, 24, 19] and Graph Convolution based Neural Networks [2, 21, 25]. Many recent works have shown that GNNs are very effective in computer vision tasks, e.g., RGBD semantic segmentation [31], visual situation recognition [23], scene graph generation and reasoning [49, 36], image annotation [51], object detection [47] and 3D shape analysis [41]. Specifically, in this work, we closely follow the graph convolutional network (GCN) [21], which greatly simplifies the graph convolution operator by exploiting an approximation to the Chebyshev polynomial based graph spectral filters. It provides a simple yet effective way to integrate the features of local neighboring nodes following the graph topology.

3 Approach

In this section, we present the proposed method in detail. As illustrated in Fig. 2, our Structure Coherence Component is mainly composed of four key parts: attention guided multi-scale feature fusion, a mapping to node module, a dynamic adjacency matrix weighting module and a graph relation network. Given an input face image, the convolutional backbone computes feature maps of different resolutions, which are carefully fused via attention guidance. A sparse graph structure is learnt by the dynamic adjacency matrix weighting module. The features extracted from the attention module are then mapped into graph node representations and fed into the graph relation network, which outputs the coordinates of the facial landmarks.

3.1 Attention Guided Multi-scale Features

Since facial landmark detection requires extremely precise localization, preserving the spatial information is crucial for an accurate model. Heatmap based methods usually use several hourglass structures [30] to preserve the spatial information. However, such an encoder-decoder architecture is extremely heavy and slows down inference. We propose an efficient attention guided multi-scale features module to improve the localization capability. Fig. 3 illustrates the architecture of this module.

Multi-scale Features: The feature maps from shallower layers encode low-level information and spatial details, while deep layers encode high-level semantic information [4, 27, 26]. We introduce two bottom-up branches to propagate the spatial details from shallow layers to the deepest layer. Specifically, consider a convolutional backbone composed of several convolutional blocks, and take the last feature maps of each block. We exploit the spatial information in the feature maps of two shallower blocks to augment the localization precision of the features of the deepest block. Each branch is composed of a Conv-BN-ReLU, an attention mechanism to filter out noisy information and a down-sampling operation. These feature maps with spatial details are then concatenated with the deepest features to form more expressive feature maps.

Semantic-guided Attention: Although the feature maps from shallow layers have rich spatial information, they also contain noisy information which are not informative from the perspective of semantic meaning. We propose semantic-guided attention module to filter out such information. Unlike existing self-attention which uses self-features to compute an attention map, we exploit the high-semantics features maps to guide the feature maps to suppress noisy information while keeping spatial details. We first upsample the feature maps , concatenate it with and reduce the channel dimension into the channel of via an convolution, obtaining . As merges the information from and , it contains both semantic information and spatial details. We then use the attention module described in [42] to generate spatial attention and channel-wise attention from as:


where and , is the channel number, denotes an convolution operation, , and denote sigmoid function, ReLU activation and concatenation operation respectively, / and / denote spatially/channel-wise average-pooled features and max-pooled features, respectively. The notation is omitted for more clarity. The attentive features are then obtained via element-wise multiplication and residual addition. Similarly, we compute the attention for feature maps with messages from features and , and obtain the attentive features. Finally, we concatenate these features to form the attention guided multi-scale features . Note that designing the attention module is not our main focus; we adopt the commonly used attention module [42] in our semantic-guided process.

Figure 3: The architecture of the attention guided multi-scale feature learning module. We propagate the semantic information from deeper layers to guide shallower layers generating effective attention maps. After filtering out the noisy information with attention, we propagate spatial details from these shallower layers into the final feature maps.

3.2 Graph Relation Network

As the relative spatial relationship of facial landmarks is stable, it is desirable to capture and exploit such important cues. We statically calculate the correlation between face landmarks from data analysis and leverage graph relation networks to effectively explore this relational information.

Map-to-Node Module: In order to make our network end-to-end trainable, we design the map-to-node module to seamlessly map convolutional feature maps to graph node representations. The input convolutional feature maps (where , and represents number of channels, height and width) are first transformed to the hidden feature maps by non-linear function , where is the expansion coefficient and is the number of landmarks. In this paper, we consider two convolution-BN-ReLU blocks with as the non-linear function . is then reshaped to to represent the node feature.

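Graph Convolution: Unlike standard convolutions that operate on local Euclidean structures, e.g., an image grid, the goal of a GCN is to learn a function $f(\cdot, \cdot)$ on a graph $\mathcal{G}$, which takes the node features $H^{l} \in \mathbb{R}^{N \times d_{in}}$ and the corresponding adjacency matrix $A \in \mathbb{R}^{N \times N}$ as input, and outputs the node features $H^{l+1} \in \mathbb{R}^{N \times d_{out}}$. Here $N$, $l$, $d_{in}$ and $d_{out}$ denote the number of nodes, the layer index, the dimension of the input node features and the dimension of the output node features, respectively. Every GCN layer can be written as a non-linear function,

$$H^{l+1} = f(H^{l}, A). \tag{2}$$

With the specific graph convolutional operator employed by [21], the layer can be represented as

$$H^{l+1} = \sigma(\hat{A} H^{l} W^{l}), \tag{3}$$

where $W^{l} \in \mathbb{R}^{d_{in} \times d_{out}}$ is a transformation matrix to be learned, $\hat{A} = \tilde{D}^{-1/2} A \tilde{D}^{-1/2}$, $\tilde{D}$ is the degree matrix of $A$, $\hat{A}$ is the symmetric normalized version of $A$ and $\sigma(\cdot)$ denotes the BN-ReLU operation.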
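To make the propagation rule concrete, here is a minimal pure-Python sketch of a single graph convolution with symmetric normalization. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`matmul`, `normalize_adjacency`, `gcn_layer`) and the toy graph are hypothetical, the adjacency matrix is assumed to already contain self-loops, and a plain ReLU stands in for the BN-ReLU operation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}.

    Assumption: adj already contains self-loops, so every degree is > 0.
    """
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in adj]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def gcn_layer(h, adj, weight):
    """One graph convolution: ReLU(A_hat @ H @ W).

    Plain ReLU is used here in place of BN-ReLU to keep the sketch short.
    """
    a_hat = normalize_adjacency(adj)
    out = matmul(matmul(a_hat, h), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy graph: 3 landmarks in a chain (0-1-2), self-loops included.
adj = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 nodes, 2 input features each
w = [[1.0], [1.0]]                        # maps 2 features to 1
out = gcn_layer(h, adj, w)
```

In this toy graph node 0 is not connected to node 2, so changing the features of node 2 leaves the output of node 0 untouched after one layer. This is the sparsity property that a fully connected layer lacks.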

Neighborhoods Construction: The graph relation network propagates information between nodes based on the adjacency matrix, so constructing this matrix correctly is crucial. In our problem, due to the lack of a pre-defined adjacency matrix for facial landmarks, we build it in a data-driven way, i.e., treating each landmark as a node and mining the correlation between landmarks within the dataset. Specifically, we assemble the landmark coordinates of the dataset into a rank-three data tensor indexed by image, landmark and coordinate, where the last dimension holds the $(x, y)$ coordinates. We then slice the tensor along the last dimension to obtain the $x$-coordinates and the $y$-coordinates of all landmarks. Based on these two slices, we calculate Pearson's correlation coefficient in the $x$ and $y$ directions respectively to form two correlation matrices. Then, the correlation between nodes is defined as:


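where the absolute value is taken element-wise. Considering the computation cost and noisy edges, we only retain the top-$k$ largest values of each row of the correlation matrix to form a sparse adjacency matrix. In other words, the $k$ most relevant landmarks are picked as the neighborhood of each landmark. The binary adjacency matrix with self-loops can be written as:

$$A_{ij} = \begin{cases} 1 & \text{if } i = j \text{ or landmark } j \text{ is among the } k \text{ landmarks most correlated with landmark } i \\ 0 & \text{otherwise} \end{cases}$$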
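The data-driven neighborhood construction described above can be sketched in a few lines of pure Python. This is only an illustration: the function names and the toy data are hypothetical, and two details are assumptions, because the text does not pin them down. First, the per-direction absolute correlations are combined by averaging. Second, the diagonal is excluded from the top-k selection and the self-loop is added explicitly.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def build_adjacency(landmarks, k):
    """Build a binary adjacency matrix with self-loops.

    landmarks[m][n] is the (x, y) position of landmark n in image m.
    """
    num_nodes = len(landmarks[0])
    xs = [[img[n][0] for img in landmarks] for n in range(num_nodes)]
    ys = [[img[n][1] for img in landmarks] for n in range(num_nodes)]
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        # Assumption: average the absolute correlations of the two directions.
        score = {j: 0.5 * (abs(pearson(xs[i], xs[j])) + abs(pearson(ys[i], ys[j])))
                 for j in range(num_nodes) if j != i}
        # Keep the k most correlated landmarks as neighbors of landmark i.
        for j in sorted(score, key=score.get, reverse=True)[:k]:
            adj[i][j] = 1
        adj[i][i] = 1  # self-loop
    return adj

# Toy dataset: 4 images, 3 landmarks. Landmarks 0 and 1 move together.
landmarks = [
    [(0.0, 0.0), (1.0, 0.1), (5.0, 3.0)],
    [(1.0, 1.0), (2.0, 1.1), (5.5, 1.0)],
    [(2.0, 2.0), (3.0, 2.1), (4.0, 2.0)],
    [(3.0, 3.0), (4.0, 3.1), (6.0, 0.0)],
]
adjacency = build_adjacency(landmarks, k=1)
```

Because each row keeps its own top-k entries, the resulting matrix is not necessarily symmetric: landmark i may list j as a neighbor without the reverse being true.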


Dynamic Adjacency Matrix Weighting: The static adjacency matrix is constructed based on the geometric structure of facial landmarks, while learning relationship among landmarks for each face aims to take the facial appearance factors like occlusion and head pose into consideration. Given the binary matrix which determines the node neighborhoods, we seek to adaptively adjust its weights.

Formally, given the features extracted from the map-to-node module, we use a global average pooling layer followed by two fully connected layers to map them to a vector whose size equals the number of non-zero entries in the binary adjacency matrix. Finally, we replace the non-zero values in the binary adjacency matrix with the entries of this vector to form the dynamic adjacency matrix. Following the strategy in [54], we adopt a row-wise softmax operation to replace the symmetric normalization in Eq. 3. The softmax operation makes the weights of each node behave like probabilities over its neighboring nodes, which stabilizes the training process:


We use the binary matrix to hold the neighborhoods and only learn their weights because facial shape patterns are stable; fixing the sparse connections greatly reduces the number of trainable parameters, which makes the learning process easier.
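The weighting step can be illustrated with a short sketch. It assumes the edge scores are already available as a flat list in row-major order of the non-zero entries. In the method they come from global average pooling followed by two fully connected layers, whereas here they are hand-written numbers. The function name `dynamic_adjacency` is hypothetical.

```python
import math

def dynamic_adjacency(binary_adj, edge_scores):
    """Place edge_scores into the non-zero slots of binary_adj (row-major),
    then apply a softmax over each row restricted to that row's neighbors."""
    scores = iter(edge_scores)
    weighted = []
    for row in binary_adj:
        vals = [next(scores) if m else None for m in row]
        # Subtract the row maximum before exponentiating for numerical stability.
        peak = max(v for v in vals if v is not None)
        exps = [math.exp(v - peak) if v is not None else 0.0 for v in vals]
        total = sum(exps)
        weighted.append([e / total for e in exps])
    return weighted

binary_adj = [[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]]
# One score per non-zero entry (7 in total); values are made up.
edge_scores = [0.0, 0.0, 1.0, 2.0, 3.0, -1.0, -1.0]
weights = dynamic_adjacency(binary_adj, edge_scores)
```

Entries outside the fixed neighborhood stay exactly zero, so only the strength of existing edges changes from face to face, not the graph topology.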

Graph Relation Network: Inspired by the success of ResNet[20], we adopt the graph residual block architecture. Each block consists of two graph convolutional layers and can be formulated based on Eq. (2) as


The overall graph relation network architecture is shown in Fig. 2. The input node features are first fed to a graph convolution, followed by several graph residual blocks. The last graph convolution block (without batch normalization and ReLU) maps the hidden node features to landmark coordinates
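As a rough illustration of this architecture, the sketch below stacks two graph convolutions with an identity skip connection and finishes with an activation-free graph convolution that outputs two values per node. The exact placement of the activation relative to the skip connection is an assumption, plain ReLU replaces BN-ReLU, and all names and numbers are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def graph_conv(h, a_hat, weight, activate=True):
    """A_hat @ H @ W, optionally followed by ReLU (standing in for BN-ReLU)."""
    out = matmul(matmul(a_hat, h), weight)
    if activate:
        out = [[max(0.0, v) for v in row] for row in out]
    return out

def graph_residual_block(h, a_hat, w1, w2):
    """Two graph convolutions plus an identity skip connection.

    Assumption: input and output feature dimensions match, so the
    skip connection needs no projection.
    """
    out = graph_conv(graph_conv(h, a_hat, w1), a_hat, w2)
    return [[o + x for o, x in zip(row_o, row_x)] for row_o, row_x in zip(out, h)]

# Normalized adjacency for 3 nodes (each row sums to 1), values made up.
a_hat = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
h = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

hidden = graph_residual_block(h, a_hat, identity, identity)
# Final layer without activation: one (x, y) pair per node.
coords = graph_conv(hidden, a_hat, identity, activate=False)
```

With all-zero weights the block reduces to the identity mapping, which is the usual motivation for residual connections: deeper stacks can fall back to passing features through unchanged.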


Comparison with FC-based regression methods. The fully connected layer and our graph convolutional layers embed the features of landmarks in two different ways. As shown in Fig. 1(a), the CNN backbone and the hidden fully connected layer map the input facial image to a hidden vector, which embeds the features of all landmarks globally. Thus, errors in some parts of the prediction affect the other parts, as they share the same hidden feature. As shown in Fig. 1(b), for the FC-based method, the errors of the occluded part interfere with the prediction of the visible parts. In contrast, our SCC embeds a node feature for each landmark, and propagates node features according to their relationships. If some parts of the prediction fail because of occlusion, large pose or other hard conditions, the node features of the other parts degrade gracefully because of the sparse connections among the node features and the dynamic adjustment of the relationships. As shown in Fig. 1(d), the SCC-based method is more robust to hard cases. Besides, fully connected layers are prone to overfitting because of their large number of trainable parameters, while the graph convolution layers require fewer trainable parameters.

3.3 Soft Wing Loss

Figure 4: Illustration of L1, Wing and Soft Wing loss functions. is set 2. and are set to 20. Unlike Wing loss, our loss is linear for small errors.

Wing loss [14] has a constant gradient when the error is large, and a large gradient for small or medium range errors. It is defined as:

$$\mathrm{Wing}(x) = \begin{cases} \omega \ln(1 + |x|/\epsilon) & \text{if } |x| < \omega \\ |x| - C & \text{otherwise} \end{cases}$$

where $x$ is the error and $C = \omega - \omega \ln(1 + \omega/\epsilon)$ is a constant that smoothly links the two piece-wise functions. According to our experiments, the performance of Wing loss is not consistently better than L1 loss, especially when we train the neural networks on a difficult dataset with heavy occlusion and blur, such as WFLW. As mentioned in [29], this may be caused by inconsistent annotations due to various reasons, e.g., unclear or inaccurate definitions of some landmarks, or poor quality of some facial images. Imposing a large gradient magnitude around very small errors to force the model to exactly fit the ground truth landmarks makes the training process unstable. To alleviate this problem, we present Soft Wing loss to focus more on errors of medium range:


$$\mathrm{SoftWing}(x) = \begin{cases} |x| & \text{if } |x| < \omega_1 \\ \omega_2 \ln(1 + |x|/\epsilon) + B & \text{otherwise} \end{cases}$$

which is linear for small values, and takes the curve of $\ln(\cdot)$ for medium and large values. Similar to Wing loss, we use the non-negative $\omega_1$ to switch between the linear and non-linear parts, and $\epsilon$ to limit the curvature of the non-linear part. $B$ is set to $\omega_1 - \omega_2 \ln(1 + \omega_1/\epsilon)$ to make the function continuous at $|x| = \omega_1$. The visualization of L1, Wing and our Soft Wing loss is shown in Fig. 4. Note that we discard the linear part of Wing loss, since our proposed loss can adaptively adjust the magnitude of the gradient between medium and large errors. The magnitude of the gradient of the non-linear part is $\omega_2 / (\epsilon + |x|)$ ($\epsilon$ is commonly set to a small value). Our proposed loss is insensitive to outliers where the gradient varies between ( is the image size). Note that $\omega_2$ should not be set to a small value because it will cause a gradient vanishing problem.
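The following sketch implements a loss with the properties described in this section (linear below a switching threshold, logarithmic above it, continuous at the switch) next to Wing loss as defined in [14]. The parameter names `omega1`, `omega2`, `epsilon` and the default values are assumptions chosen for illustration, not the settings used in the experiments.

```python
import math

def wing_loss(x, omega=20.0, epsilon=0.5):
    """Wing loss [14]: logarithmic for |x| < omega, shifted L1 otherwise."""
    ax = abs(x)
    if ax < omega:
        return omega * math.log(1.0 + ax / epsilon)
    c = omega - omega * math.log(1.0 + omega / epsilon)
    return ax - c

def soft_wing_loss(x, omega1=2.0, omega2=20.0, epsilon=0.5):
    """Linear for |x| < omega1, logarithmic otherwise.

    The offset b makes the two pieces meet at |x| = omega1.
    """
    ax = abs(x)
    if ax < omega1:
        return ax
    b = omega1 - omega2 * math.log(1.0 + omega1 / epsilon)
    return omega2 * math.log(1.0 + ax / epsilon) + b

def soft_wing_grad(x, omega1=2.0, omega2=20.0, epsilon=0.5):
    """Gradient magnitude: 1 in the linear part, omega2 / (epsilon + |x|) beyond."""
    ax = abs(x)
    return 1.0 if ax < omega1 else omega2 / (epsilon + ax)

# Compare gradient magnitudes for a tiny, a medium and a large error.
for err in (0.1, 3.0, 50.0):
    print(err, soft_wing_grad(err))
```

With these illustrative settings the gradient magnitude is 1 for tiny errors, jumps to a larger value just above the threshold, and then decays as the error grows, which is the "focus on medium errors" behavior the section describes.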

Metric Method Fullset Pose Expression Illumination Make-up Occlusion Blur
NME DVLN17[44] 6.08 11.54 6.78 5.73 5.98 7.33 6.88
LAB18 [43] 5.27 10.24 5.51 5.23 5.15 6.79 6.32
Wing18 [14] 5.11 8.75 5.36 4.93 5.41 6.37 5.81
AGCFN19 [28] 4.90 8.78 5.00 4.93 4.85 6.26 5.73
LAB18 [43] + AVS19 [32] 4.76 8.21 5.14 4.51 5.00 5.76 5.43
DeCaFA19 [8] 4.62 8.11 4.65 4.41 4.63 5.74 5.38
HRNet19 [37] 4.60 7.94 4.85 4.55 4.29 5.44 5.42
Ours 4.40 7.52 4.65 4.31 4.36 5.23 5.04
FR DVLN17[44] 10.84 46.93 11.15 7.31 11.65 16.30 13.71
LAB18 [43] 7.56 28.83 6.37 6.73 7.77 13.72 10.74
Wing18 [14] 6.00 22.72 4.78 4.30 7.77 12.50 7.76
AGCFN19 [28] 5.92 24.23 5.41 4.72 5.82 11.00 8.79
LAB18 [43] + AVS19 [32] 5.24 20.86 4.78 3.72 6.31 9.51 7.24
DeCaFA19 [8] 4.84 21.4 3.73 3.22 6.15 9.26 6.61
Ours 2.88 13.80 2.55 2.29 2.43 5.98 4.14
AUC DVLN17[44] 0.4551 0.1474 0.3889 0.4743 0.4494 0.3794 0.3973
LAB18 [43] 0.5323 0.2345 0.4951 0.5433 0.5394 0.4490 0.4630
Wing18 [14] 0.5504 0.3100 0.4959 0.5408 0.5582 0.4885 0.4918
AGCFN19 [28] 0.5452 0.2826 0.5267 0.5511 0.5547 0.4621 0.4823
LAB18 [43] + AVS19 [32] 0.5460 0.2764 0.5098 0.5660 0.5349 0.4700 0.4923
DeCaFA19 [8] 0.563 0.292 0.546 0.579 0.575 0.485 0.494
Ours 0.5666 0.2981 0.5430 0.5761 0.5710 0.4936 0.5095
Table 1: Evaluation of our method and state-of-the-art approaches on the Fullset and six typical subsets of WFLW. The results in terms of normalized mean error, NME (%), failure rate at 10%, FR (%), and AUC are reported.

4 Experiments

In this section, we evaluate our method on three popular face alignment benchmarks, compare with state-of-the-art approaches and conduct the ablation study.

4.1 Experimental Setup

Datasets: We conduct evaluation on three widely-adopted challenging datasets: WFLW [43], COFW [3] and 300W [33]. WFLW is among the most challenging face alignment benchmarks and includes various hard cases such as heavy occlusion, blur and large pose. COFW is collected to present faces with large variations in shape and occlusions in real-world conditions. Various types of occlusions are introduced and result in a occlusion on facial parts on average. We also use the re-annotated test set [16] with 68-landmark annotations for cross-dataset validation. 300W contains face images with moderate variations in pose, expression and illumination.

Evaluation Metric: We evaluate the proposed method with normalized mean error and failure rate. We use the inter-ocular distance as the normalization factor. Following the protocol in [43], the failure rate for a maximum error of 0.1 is reported. The area under the curve (AUC) is also calculated based on the cumulative error distribution for the WFLW dataset.
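For readers who want to reproduce these metrics, the sketch below shows one common way to compute them. The function names are hypothetical, the eye-corner indices depend on the annotation scheme of each dataset and are passed in as arguments, and the AUC is normalized by the threshold so that it lies in [0, 1].

```python
import math

def nme(pred, gt, left_eye_idx, right_eye_idx):
    """Mean point-to-point error divided by the inter-ocular distance."""
    inter_ocular = math.dist(gt[left_eye_idx], gt[right_eye_idx])
    mean_err = sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
    return mean_err / inter_ocular

def failure_rate(nmes, threshold=0.1):
    """Fraction of images whose NME exceeds the threshold."""
    return sum(e > threshold for e in nmes) / len(nmes)

def auc(nmes, threshold=0.1, steps=1000):
    """Area under the cumulative error distribution on [0, threshold],
    divided by the threshold (trapezoidal rule)."""
    xs = [threshold * i / steps for i in range(steps + 1)]
    ced = [sum(e <= x for e in nmes) / len(nmes) for x in xs]
    area = sum((ced[i] + ced[i + 1]) / 2 * (xs[i + 1] - xs[i])
               for i in range(steps))
    return area / threshold

# Toy example: three landmarks, every prediction is off by one pixel in x.
gt = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
pred = [(1.0, 0.0), (11.0, 0.0), (6.0, 5.0)]
score = nme(pred, gt, left_eye_idx=0, right_eye_idx=1)
```

In the toy example the mean error is one pixel and the inter-ocular distance is ten pixels, so the NME is 0.1, i.e. 10% when reported as a percentage.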

4.2 Implementation details

All training images are center-cropped and resized to . Data augmentation is performed with random rotation (), translation (), flipping (), rescaling () and occlusion ( of image size). To mitigate the issue of pose variations, we adopt the Pose-based Data Balancing (PDB) [14] strategy with 9 bins. We use ResNet34 [20] as our backbone. During training, we employ vanilla SGD for optimization with a batch size of for epochs. We set the weight decay and the momentum to and respectively. The initial learning rate is which is dropped by 5 every epochs. The parameters of the Soft Wing loss are set to , and . The top-k value is set to 3 for the adjacency matrix. We use 4 graph residual blocks with hidden feature dimension 128. Our models are trained from scratch using PyTorch.

4.3 Comparison with the State of the Art

WFLW: We evaluate our approach on the WFLW dataset and compare with state-of-the-art methods in terms of mean error, failure rate and AUC. To better understand the effectiveness of the proposed method, we analyse the performance on six subsets with specific issues, e.g., large pose, occlusion and exaggerated expression [43]. The overall results are tabulated in Table 1. The proposed method achieves 4.40% NME, 2.88% failure rate and 0.5666 AUC on the Fullset, which outperforms most state-of-the-art approaches. Our method fails on only 2.88% of all images, which demonstrates the robustness of our model. Qualitative results are depicted in Fig. 5, where our model successfully localizes landmarks in hard cases.

Method Trained on COFW (NME, FR) Trained on 300W (NME, FR)
TCDCN14[53] - - 7.66 16.17
SAPM15[18] - - 6.64 5.72
CFSS15[56] - - 6.28 9.07
HPM14[16] 7.50 13.00 6.72 6.71
CCR15[13] 7.03 10.9 - -
DRDA16[50] 6.46 6.00 - -
RAR16[46] 6.03 4.14 - -
SFPD17[45] 6.40 - - -
DAC-CSR17[15] 6.03 4.73 - -
Wing18[14] 5.44 3.37 - -
ODN19[55] 5.30 - - -
LAB18[43] 3.92 0.39 4.62 2.17
SAN18[9] + AVS19[32] - - 4.43 2.82
Ours 3.63 0 4.18 0
Table 2: Evaluation on the COFW dataset in terms of NME (%) and failure rate (%) at 10%.
Figure 5: Visualization of some hard cases from the WFLW testset.

COFW: As shown in Table 2, our method achieves state-of-the-art performance with 3.63% mean error and 0% failure rate. To further verify the generalization capability of our method, we conduct a cross-dataset evaluation using the COFW-68 dataset annotated with 68 landmarks [16]. Our method outperforms the existing best approaches by a large margin, with 4.18% mean error and 0% failure rate. Since the COFW dataset is mainly composed of occluded faces, this performance indicates the robustness of our graph relation framework to heavy occlusions.

300W: We compare our approach against the existing best performing methods on the 300W dataset. The results are reported in Table 3. Our method outperforms most existing approaches. Note that our method achieves the best results on the challenging subset, which highlights the robustness of the proposed approach in hard cases.

Method Common Challenging Full
PCD-CNN18 [22] 3.67 7.62 4.44
CPM+SBR18 [10] 3.28 7.58 4.10
SAN18 [9] 3.34 6.60 3.98
LAB18 [43] 2.98 5.19 3.49
DeCaFA19 [8] 2.93 5.26 3.39
HRNet19 [37] 2.87 5.15 3.32
Ours 2.88 4.93 3.28
Table 3: Evaluation on the 300W Common subset, Challenging subset and Fullset in terms of mean error (%).

Model # params (M) FLOPS (G) RT (ms)
SAN [9] 199.63 - 343
LAB [43] 32.05 28.583 60
Wing [14] 24.75 5.396 30
Ours 24.68 5.165 23
Table 4: Efficiency comparison in terms of number of parameters, FLOPS and runtime.

Efficiency Comparison: Since facial landmark detection is widely deployed in real-time applications, the model size, FLOPS and processing speed are key criteria. We evaluate the runtime of our model on a 1080Ti GPU and compare with existing methods in Table 4. Our model takes only 23 ms and 5.165 GFLOPS to process an input image and consists of 24.68 M parameters. Overall, our model is faster and smaller than the listed competitors.

4.4 Ablation Study

Our framework is composed of several pivotal modules such as the graph relation network, attention guided multi-scale features, and Soft Wing loss. Based on the baseline ResNet34 with layer stages, we examine the contribution of each proposed module on the WFLW dataset and report the overall results in Table 5.

Component Choice Fully connected GN Attention m-s F. Soft-Wing loss GN w/o DW
NME (%) 5.95 4.64 4.53 4.52 4.47 4.40
Table 5: Ablation study on components on the WFLW dataset. GN: Graph Network. DW: Dynamic adjacency matrix Weighting.

Design Choice NME (%)
Self-attention 4.61
Semantic-guided attention 4.53
Feature maps 4.64
Feature maps 4.57
Feature maps 4.53
Feature maps 4.56
Table 6: Ablation study on attention generation methods and feature maps for spatial message propagation.

Baseline Model: We first utilize FC layers to directly regress the facial landmarks. This model is our baseline, which achieves an NME of 5.95%.

Graph Relation Network: The graph relation network is a key part of our Structure Coherence Component. Replacing the FC layers with our graph relation network lowers the NME from 5.95% to 4.64%.

Top-k value for adjacency matrix: We report the results with different values of k from 1 to 97 in Table 7. When k = 3, our model achieves the best performance on the WFLW dataset. Note that the performance degrades if the adjacency matrix is too sparse or too dense. When k is too small, each graph node cannot get sufficient information from its correlated neighborhood; when k is too large, the adjacency matrix becomes dense, which leads to oversmoothing of the node features.

Dynamic adjacency matrix weighting: We replace the dynamic adjacency matrix with the binary adjacency matrix and we observe a degradation.

Top k 1 2 3 4 5 10 20 40 97
NME 4.44 4.43 4.40 4.46 4.50 4.59 4.69 4.71 4.75
Table 7: NME (%) comparison with different values of k, where k is the number of neighbors used for building the adjacency matrix.
Figure 6: Qualitative analysis of the attention effects. The max along the channel of is illustrated. Our semantic-guided attention highlights all visible key facial parts whereas the self-attention only highlights a few facial parts such as eyes or mouth.

Attention guided multi-scale feature: The attention guided multi-scale feature fusion plays a key role in improving the representation capability of the features. By endowing the high-semantics features with spatial details, the model reaches an NME of 4.53%, down from 4.64%.

Semantic-guided attention: We examine the importance of incorporating additional semantic information from deeper layers to guide the generation of attention maps. To this end, we reduce our semantic-guided attention structure to a general self-attention mechanism. As shown in Table 6, we observe a drop in performance, resulting in an NME of 4.61%. This experiment indicates that the semantic information from high-level features is crucial for generating high quality attention. The quantitative results are supported by the qualitative results illustrated in Fig. 6. The semantic guidance makes the feature maps focus on all visible key facial parts. Since self-attention only explores self-information, it only highlights the highly activated parts of the feature maps.

Features combination: To improve the localization capability, we propagate the spatial information from the shallower layers. We study which combination of feature layers is optimal. As tabulated in Table 4.4, the performance increases with the additional spatial information propagation, and the combination of features yields the best results. Since layer 2 is quite shallow, it contains little useful information and limits the performance due to noise.

Soft Wing Loss: The Soft Wing loss improves the results of the graph relation network by . We compare the performance of the L1, Wing and Soft Wing losses based on our baseline model; the results are shown in Table 8. Our Soft Wing loss consistently outperforms both the Wing loss and the L1 loss. As discussed in Section 3.3, the performance of the Wing loss degrades when ε decreases, while our loss benefits from imposing larger gradients on medium-range errors. The performance of the Wing loss is even worse than that of the L1 loss when ε is very small.

ε         0.1   0.2   0.5   1     1.5   2
L1        5.95  (single value; the L1 loss has no ε)
Wing      6.52  5.98  5.81  5.78  5.75  5.72
SoftWing  5.70  5.66  5.67  5.71  5.70  5.71
Table 8: Comparison of different loss functions. Analysis shows the effectiveness of Soft Wing loss in terms of the NME (%).
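To make the gradient argument behind Table 8 concrete, the following NumPy sketch compares the three losses on a scalar error. The Wing loss follows the commonly cited form from [14]. The Soft Wing form used here (linear for small errors, logarithmic beyond a threshold) and all parameter values are assumptions for illustration, since the definition from Section 3.3 is not repeated in this section.

```python
import numpy as np

def l1_loss(x):
    """Plain L1 loss: gradient magnitude is 1 everywhere."""
    return np.abs(x)

def wing_loss(x, omega=10.0, epsilon=2.0):
    """Wing loss: logarithmic for |x| < omega, linear (shifted by c) otherwise."""
    ax = np.abs(x)
    c = omega - omega * np.log(1.0 + omega / epsilon)   # keeps the loss continuous
    return np.where(ax < omega, omega * np.log(1.0 + ax / epsilon), ax - c)

def soft_wing_loss(x, omega1=2.0, omega2=20.0, epsilon=0.5):
    """Assumed Soft Wing form: linear for |x| < omega1, logarithmic otherwise."""
    ax = np.abs(x)
    b = omega1 - omega2 * np.log(1.0 + omega1 / epsilon)  # keeps the loss continuous
    return np.where(ax < omega1, ax, omega2 * np.log(1.0 + ax / epsilon) + b)

def numeric_grad(loss, x, h=1e-6, **kwargs):
    """Central finite-difference estimate of d loss / d x at a scalar x."""
    return float((loss(x + h, **kwargs) - loss(x - h, **kwargs)) / (2.0 * h))

# Gradient magnitude at a small error (0.01) and a medium error (5.0).
for eps in (0.1, 0.5, 2.0):
    g_wing_small = numeric_grad(wing_loss, 0.01, epsilon=eps)
    g_soft_small = numeric_grad(soft_wing_loss, 0.01, epsilon=eps)
    g_soft_medium = numeric_grad(soft_wing_loss, 5.0, epsilon=eps)
    print(f"eps={eps}: wing@0.01={g_wing_small:.1f}  "
          f"softwing@0.01={g_soft_small:.1f}  softwing@5={g_soft_medium:.2f}")
```

Under these assumptions, the Wing gradient near zero scales as omega / epsilon, so it grows rapidly as ε shrinks and small errors dominate the update. The Soft Wing sketch keeps the gradient at 1 for small errors and applies the larger logarithmic gradient only to medium-range errors, which matches the behavior described in the ablation.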

5 Conclusion

In this paper, we propose a fast and accurate face alignment method. We present a structure coherence component, which consists of attention-guided multi-scale feature fusion, mapping to node, a dynamic adjacency matrix weighting module and a graph relation network. We utilize the relation among facial parts appropriately, which permits precise localization of facial landmarks in hard cases. Experimental results on three challenging face alignment benchmarks demonstrate the effectiveness of the proposed method.


References

  • [1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017-07) Geometric deep learning: going beyond euclidean data. IEEE SPM 34 (4), pp. 18–42. Cited by: §2.
  • [2] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §2.
  • [3] X. P. Burgos-Artizzu, P. Perona, and P. Dollar (2013-12) Robust face landmark estimation under occlusion. In ICCV, Cited by: §1, §4.1.
  • [4] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos (2016) A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, pp. 354–370. Cited by: §3.1.
  • [5] T. Cohen and M. Welling (2016) Group equivariant convolutional networks. In ICML, pp. 2990–2999. Cited by: §1.
  • [6] T. F. Cootes, G. J. Edwards, and C. J. Taylor (2001-06) Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6), pp. 681–685. External Links: ISSN 1939-3539 Cited by: §2.
  • [7] D. Cristinacce and T. Cootes (2006-01) Feature detection and tracking with constrained local models. Pattern Recognit. 41, pp. 929–938. Cited by: §2.
  • [8] A. Dapogny, K. Bailly, and M. Cord (2019) DeCaFA: deep convolutional cascade for face alignment in the wild. In ICCV, Cited by: §1, §1, Table 1, §4.3.
  • [9] X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Style aggregated network for facial landmark detection. In CVPR, pp. 379–388. Cited by: §4.3, §4.3, Table 2.
  • [10] X. Dong, S. Yu, X. Weng, S. Wei, Y. Yang, and Y. Sheikh (2018) Supervision-by-registration: an unsupervised approach to improve the precision of facial landmark detectors. In CVPR, pp. 360–368. Cited by: §4.3.
  • [11] G. J. Edwards, C. J. Taylor, and T. F. Cootes (1998) Interpreting face images using active appearance models. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 300–305. Cited by: §2.
  • [12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §2.
  • [13] Z. Feng, G. Hu, J. Kittler, W. Christmas, and X. Wu (2015-11) Cascaded collaborative regression for robust facial landmark detection trained using a mixture of synthetic and real images with dynamic weighting. IEEE TIP 24 (11), pp. 3425–3440. Cited by: Table 2.
  • [14] Z. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu (2017) Wing loss for robust facial landmark localisation with convolutional neural networks. In CVPR, pp. 2235–2245. Cited by: §1, §1, §1, §2, §3.3, Table 1, §4.2, §4.3, Table 2.
  • [15] Z. Feng, J. Kittler, W. Christmas, P. Huber, and X. Wu (2017-07) Dynamic attention-controlled cascaded shape regression exploiting training data augmentation and fuzzy-set sample weighting. In CVPR, Cited by: Table 2.
  • [16] G. Ghiasi and C. C. Fowlkes (2014-06) Occlusion coherence: localizing occluded faces with a hierarchical deformable part model. In CVPR, Cited by: §4.1, §4.3, Table 2.
  • [17] G. Ghiasi and C. C. Fowlkes (2014) Occlusion coherence: localizing occluded faces with a hierarchical deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2385–2392. Cited by: §2.
  • [18] G. Ghiasi and C. Fowlkes (2015-09) Using segmentation to predict the absence of occluded parts. In BMVC, pp. 22.1–22.12. Cited by: Table 2.
  • [19] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, pp. 1263–1272. Cited by: §2.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §3.2, §4.2.
  • [21] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, pp. 1–10. Cited by: §2, §3.2.
  • [22] A. Kumar and R. Chellappa (2018) Disentangling 3d pose in a dendritic cnn for unconstrained 2d face alignment. In CVPR, pp. 430–439. Cited by: §4.3.
  • [23] R. Li, M. Tapaswi, R. Liao, J. Jia, R. Urtasun, and S. Fidler (2017) Situation recognition with graph neural networks. In ICCV, pp. 4173–4182. Cited by: §2.
  • [24] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.
  • [25] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel (2019) Lanczosnet: multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484. Cited by: §2.
  • [26] C. Lin, J. Lu, G. Wang, and J. Zhou (2018-09) Graininess-aware deep feature learning for pedestrian detection. In The European Conference on Computer Vision (ECCV), Cited by: §3.1.
  • [27] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §3.1.
  • [28] X. Liu, H. Wang, J. Zhou, and L. Tao (2019) Attention-guided coarse-to-fine network for 2d face alignment in the wild. IEEE Access 7, pp. 97196–97207. Cited by: Table 1.
  • [29] Z. Liu, X. Zhu, G. Hu, H. Guo, M. Tang, Z. Lei, N. M. Robertson, and J. Wang (2019-06) Semantic alignment: finding semantically consistent ground-truth for facial landmark detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.3.
  • [30] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, pp. 483–499. Cited by: §2, §3.1.
  • [31] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun (2017) 3d graph neural networks for rgbd semantic segmentation. In ICCV, pp. 5199–5208. Cited by: §2.
  • [32] S. Qian, K. Sun, W. Wu, C. Qian, and J. Jia (2019) Aggregation via separation: boosting facial landmark detector with semi-supervised style translation. In ICCV, Cited by: Table 1, Table 2.
  • [33] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic (2016) 300 faces in-the-wild challenge: database and results. Image and Vision Computing 47, pp. 3 – 18. Note: 300-W, the First Automatic Facial Landmark Detection in-the-Wild Challenge External Links: ISSN 0262-8856 Cited by: §1, §4.1.
  • [34] J. Saragih, S. Lucey, and J. Cohn (2011-01) Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision 91, pp. 200–215. Cited by: §2.
  • [35] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009-01) The graph neural network model. IEEE NN 20 (1), pp. 61–80. Cited by: §2.
  • [36] J. Shi, H. Zhang, and J. Li (2019-06) Explainable and explicit visual reasoning over scene graphs. In CVPR, Cited by: §2.
  • [37] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. CoRR abs/1904.04514. Cited by: §1, §2, Table 1, §4.3.
  • [38] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2face: real-time face capture and reenactment of rgb videos. In CVPR, pp. 2387–2395. Cited by: §1.
  • [39] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou (2016-06) Mnemonic descent method: a recurrent process applied for end-to-end face alignment. In CVPR, Cited by: §1.
  • [40] M. Valstar, B. Martinez, X. Binefa, and M. Pantic (2010) Facial point detection using boosted regression and graph models. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2729–2736. Cited by: §2.
  • [41] N. Verma, E. Boyer, and J. Verbeek (2018) FeaStNet: Feature-Steered Graph Convolutions for 3D Shape Analysis. In CVPR - IEEE Conference on Computer Vision & Pattern Recognition, Salt Lake City, United States, pp. 2598–2606. Cited by: §2.
  • [42] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In ECCV, pp. 3–19. Cited by: §3.1.
  • [43] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou (2018-06) Look at boundary: a boundary-aware face alignment algorithm. In CVPR, Cited by: §1, §1, §1, §2, Table 1, §4.1, §4.1, §4.3, §4.3, §4.3, Table 2.
  • [44] W. Wu and S. Yang (2017-07) Leveraging intra and inter-dataset variations for robust face alignment. In CVPRW, Cited by: §1, Table 1.
  • [45] Y. Wu, C. Gou, and Q. Ji (2017-07) Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion. In CVPR, Cited by: Table 2.
  • [46] S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. A. Kassim (2016) Robust facial landmark detection via recurrent attentive-refinement networks. In ECCV, pp. 57–72. Cited by: Table 2.
  • [47] H. Xu, C. Jiang, X. Liang, and Z. Li (2019-06) Spatial-aware graph relation network for large-scale object detection. In CVPR, Cited by: §2.
  • [48] J. Yang, Q. Liu, and K. Zhang (2017-07) Stacked hourglass network for robust facial landmark localisation. In CVPRW, pp. 2025–2033. Cited by: §1, §2.
  • [49] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Graph r-cnn for scene graph generation. In ECCV, pp. 670–685. Cited by: §2.
  • [50] J. Zhang, M. Kan, S. Shan, and X. Chen (2016-06) Occlusion-free face alignment: deep regression networks coupled with de-corrupt autoencoders. In CVPR, Cited by: Table 2.
  • [51] J. Zhang, Q. Wu, J. Zhang, C. Shen, and J. Lu (2019-06) Mind your neighbours: image annotation with metadata neighbourhood graph co-attention networks. In CVPR, Cited by: §2.
  • [52] Y. Zhang, R. Zhao, W. Dong, B. Hu, and Q. Ji (2018) Bilateral ordinal relevance multi-instance regression for facial action unit intensity estimation. In CVPR, pp. 7034–7043. Cited by: §1.
  • [53] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2014) Facial landmark detection by deep multi-task learning. In ECCV, Cham, pp. 94–108. Cited by: §2, Table 2.
  • [54] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas (2019) Semantic graph convolutional networks for 3d human pose regression. In CVPR, pp. 3425–3435. Cited by: §3.2.
  • [55] M. Zhu, D. Shi, M. Zheng, and M. Sadiq (2019-06) Robust facial landmark detection via occlusion-adaptive deep networks. In CVPR, Cited by: Table 2.
  • [56] S. Zhu, C. Li, C. C. Loy, and X. Tang (2015) Face alignment by coarse-to-fine shape searching. In CVPR, pp. 4998–5006. Cited by: Table 2.
  • [57] X. Zhu and D. Ramanan (2012) Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE conference on computer vision and pattern recognition, pp. 2879–2886. Cited by: §2.
  • [58] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li (2015) High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, pp. 787–796. Cited by: §1.