Face alignment, also known as facial landmark localization, seeks to localize pre-defined landmarks on human faces. Face alignment plays an essential role in many face-related applications such as face recognition [51, 38, 36, 66, 11], face frontalization [23, 56, 28] and 3D face reconstruction [15, 48, 34, 20]. In recent years, Convolutional Neural Network (CNN) based heatmap regression has become one of the mainstream approaches to face alignment and has achieved considerable performance on frontal faces. However, landmarks on faces with large pose, occlusion and significant blur are still challenging to localize.
In heatmap regression, the ground truth heatmap is generated by plotting a Gaussian distribution centered at each landmark on each channel. The model regresses against the ground truth heatmap at pixel level and then uses the predicted heatmaps to infer landmark locations. Prediction accuracy on foreground pixels (pixels with positive values), especially the ones near the mode of each Gaussian distribution (Fig. 1), is essential to accurately localize landmarks: even small prediction errors on these pixels can cause the prediction to shift from the correct modes. In contrast, accurately predicting the values of background pixels (pixels with zero values) is less important, since small errors on these pixels will not affect landmark prediction in most cases. However, prediction accuracy on difficult background pixels (background pixels near foreground pixels, Fig. 1) is also important, since they are often incorrectly regressed as foreground pixels and could cause inaccurate predictions.
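The heatmap encoding and decoding described above can be sketched in a few lines. This is only an illustration: the function names, `sigma=1.0` and the 7-pixel window (`radius=3`) are assumptions chosen for the example, not settings confirmed by the text.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.0, radius=3):
    """Render one heatmap channel: a Gaussian centred on (cx, cy), zero elsewhere.

    sigma and radius are illustrative assumptions, not values from the paper.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            heatmap[y][x] = math.exp(-d2 / (2.0 * sigma ** 2))
    return heatmap

def decode_landmark(heatmap):
    """Return the (x, y) position of the highest response (the mode)."""
    best_x, best_y, best_v = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best_v:
                best_x, best_y, best_v = x, y, v
    return best_x, best_y

hm = gaussian_heatmap(64, 64, cx=20, cy=30)
foreground = sum(1 for row in hm for v in row if v > 0)
print(decode_landmark(hm), foreground / (64 * 64))
```

With these assumed sizes only 49 of the 4096 pixels are positive, which shows how heavily background pixels outnumber foreground pixels on a single channel.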
From this discussion, we identify two issues with the widely used Mean Square Error (MSE) loss in heatmap regression: i) MSE is not sensitive to small errors, which hurts the capability to correctly locate the mode of the Gaussian distribution; ii) during training all pixels have the same loss function and equal weights, yet background pixels vastly outnumber foreground pixels on a heatmap. As a result of i) and ii), models trained with the MSE loss tend to predict a blurry and dilated heatmap with low intensity on foreground pixels compared to the ground truth (Fig. 1(c)). This low-quality heatmap could cause wrong estimation of facial landmarks. Wing loss has been shown to be effective for coordinate regression; however, according to our experiments, it is not applicable to heatmap regression: small errors on background pixels accumulate significant gradients and thus cause the training process to diverge. We thus propose a new loss function, named Adaptive Wing loss (Sec. 4.2), that is able to significantly improve the quality of heatmap regression results.
Due to the translation invariance of the convolution operation in bottom-up and top-down CNN structures such as stacked Hourglass (HG), the network is not able to capture coordinate information, which we believe is useful for facial landmark localization since the structure of human faces is relatively stable. Inspired by the CoordConv layer proposed by Liu et al., we encode into our model both the full coordinate information and the coordinate information only on boundaries predicted by the previous HG module. The encoded coordinate information further improves the performance of our approach. To encode boundary coordinates, we also add a sub-task of boundary prediction by concatenating an additional boundary channel to the ground truth heatmap and training it together with the other channels.
In summary, our main contributions include:
- We propose a novel loss function for heatmap regression, named Adaptive Wing loss, that adapts its curvature to ground truth pixel values. This adaptive property reduces small errors on foreground pixels for accurate landmark localization, while tolerating small errors on background pixels for a better convergence rate. With the proposed Weighted Loss Map, it is also able to focus on foreground pixels and difficult background pixels during training.
- We encode coordinate information, including coordinates on boundaries, into the face alignment algorithm using CoordConv.
- Our approach outperforms state-of-the-art algorithms by a significant margin on mainstream face alignment datasets including 300W, COFW and WFLW. We also show the validity of the Adaptive Wing loss on the human pose estimation task, which also uses heatmap regression.
2 Related Work
CNN based heatmap regression models leverage CNNs to perform heatmap regression. In recent work [64, 52, 6, 7], joint bottom-up and top-down architectures such as stacked HG were able to achieve state-of-the-art performance. Bulat et al. proposed a hierarchical, parallel and multi-scale block as a replacement for the original ResNet block to further improve the localization accuracy of HG. Tang et al. achieved the current state of the art with quantized densely connected U-Nets that have fewer parameters than stacked HG models. Other architectures are also able to achieve excellent performance. Merget et al. proposed a fully convolutional neural network (FCN) that combines global and local context information for a refined prediction. Valle et al. combined a CNN with an ensemble of regression trees in a coarse-to-fine fashion to achieve state-of-the-art accuracy.
Loss functions for heatmap regression were rarely studied in previous work. GoDP used a distance-aware softmax loss to assign a large penalty to incorrectly classified positive samples, while gradually reducing the penalty on misclassified negative samples as the distance from nearby positive samples decreases. Wing loss is a modified log loss for direct regression of landmark coordinates. Compared with MSE, it amplifies the influence of small errors. Although Wing loss is able to achieve state-of-the-art performance in coordinate regression, it is not applicable to heatmap regression due to its high sensitivity to small errors on background pixels and the discontinuity of its gradient at zero. Our proposed Adaptive Wing loss is novel in that it adapts its curvature to different ground truth pixel values, so that it is sensitive to small errors on foreground pixels yet tolerates small errors on background pixels. Hence, our loss can be applied to heatmap regression while the original Wing loss cannot.
Boundary information was first introduced into face alignment by Wu et al. LAB proposed a two-stage network with a stacked HG model to generate a facial boundary map, and then regressed facial landmark coordinates directly with the help of the boundary map. We believe that including boundary information is beneficial to heatmap regression, and we add a modified version to our model.
Coordinate Encoding. Translation invariance is inherent to the convolution operation. Although CNNs greatly benefit from this parameter sharing scheme, Liu et al. showed the inability of the convolution operation to handle simple coordinate transforms, and proposed a new operation called CoordConv, which encodes coordinate information as additional channels before the convolution operation. CoordConv was shown to improve vision tasks such as object detection and generative modeling. For face alignment, the input images are always generated from a face detector with small variance in translation and scale. These properties inspire us to include CoordConv to help the CNN learn the relationship among facial landmarks based on their coordinate information.
3 Our Model
Our model is based on the stacked HG architecture from Bulat et al., which improved on the original convolution block design from Newell et al. For each HG, the output heatmap is trained with the ground truth heatmap as supervision. We also add a sub-task of boundary prediction as an additional channel of the heatmap. Coordinate encoding is added before the first convolution layer of our network and before the first convolution block of each HG module. An overview of our model is shown in Figure 3.
4 Adaptive Wing Loss for Face Alignment
4.1 Loss function rationale
Before starting our analysis, we would like to introduce a concept from robust statistics. Influence is a heuristic tool used in robust statistics to investigate the properties of an estimator. In the context of our paper, the influence function is proportional to the gradient of our loss function. So if the gradient magnitude is large at a given point (indicating the error), then we say the loss function has a large influence at that point. If the gradient magnitude is close to zero at this point, then we say the loss function has a small influence at that point. Theoretically, for heatmap regression, training is converged only if:
where $N$ is the total number of training samples, $H$, $W$ and $C$ are the height, width and channels of the heatmap, respectively, $L_n$ is the loss of the $n$-th sample, and $y_{i,j,k}$ and $\hat{y}_{i,j,k}$ are the ground truth and predicted pixel intensities, respectively. At convergence, the influence of all errors must balance each other. Hence, a positive error on a pixel with large gradient magnitude (hence large influence) would need to be balanced by negative errors on many pixels with smaller influence. Errors with large gradient magnitude will also receive more focus during training than errors with small gradient magnitude.
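The convergence condition referred to above can be read as a vanishing sum of per-pixel loss gradients. The display below is a sketch of that reading, not the original equation: the symbols $N$, $H$, $W$, $C$, $L_n$, $y_{i,j,k}$ and $\hat{y}_{i,j,k}$ are assumed names for the quantities the paragraph describes.

```latex
\sum_{n=1}^{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{C}
  \frac{\partial L_n\!\left(y_{i,j,k}, \hat{y}_{i,j,k}\right)}{\partial \hat{y}_{i,j,k}} = 0
```

Under this reading, a single pixel with a large gradient has to be offset by many pixels with small gradients of the opposite sign, which is the balancing argument made in the text.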
The essence of heatmap regression is to regress a Gaussian distribution centered at each ground truth landmark. Thus the accuracy of estimating pixel intensity at the mode of the Gaussian plays a vital role in correctly localizing landmarks. The two issues we illustrated in Sec. 1 result in inaccurate estimation of landmark positions due to a lack of focus on foreground pixels during training. In this section and Sec. 4.2, we discuss the causes of the first issue and how our proposed Adaptive Wing loss is able to remedy it. The second issue is discussed in Sec. 4.3.
The first issue is due to the commonly used MSE loss function for heatmap regression. The gradient of the MSE loss is linear, so pixels with small errors have small influence, as shown in Figure 3(b). This property could cause training to converge while many pixels still have small errors. As a result, models trained with MSE loss tend to predict a blurry and dilated heatmap. Even worse, the predicted heatmap often has low intensity on foreground pixels around difficult landmarks, e.g. occluded landmarks or faces with unusual illumination conditions. Accurately localizing landmarks from these low-intensity pixels can be difficult. A good example can be found in Figure 2.
L1 loss has a constant gradient, so pixels with small errors have the same influence as pixels with large errors. However, the gradient of L1 loss is not continuous at zero, which means that, for convergence, the number of pixels with positive errors has to be exactly equal to the number of pixels with negative errors. The difficulty of achieving such a delicate balance could cause the training process to be unstable and to oscillate.
Feng et al. improve on the above loss functions by proposing the Wing loss, which has a constant gradient when the error is large and a large gradient when the error is small. Thus the influence of pixels with small errors is amplified. The Wing loss is defined as follows:
$$\mathrm{Wing}(y, \hat{y}) = \begin{cases} \omega \ln\left(1 + \dfrac{|y - \hat{y}|}{\epsilon}\right) & \text{if } |y - \hat{y}| < \omega \\ |y - \hat{y}| - C & \text{otherwise} \end{cases}$$
where $y$ and $\hat{y}$ are the pixel values on the ground truth heatmap and the predicted heatmap respectively, and $C = \omega - \omega \ln(1 + \omega/\epsilon)$ is used to make the function continuous at $|y - \hat{y}| = \omega$. The Wing loss is, however, still not able to overcome the discontinuity of its gradient at $y = \hat{y}$; with its large gradient magnitude around this point, training is even more difficult to converge than with L1 loss. This property makes Wing loss not applicable to heatmap regression: with Wing loss calculated on all background pixels, small errors on background pixels have a disproportionate influence, and training a neural network that outputs exactly zero on these pixels is very difficult. In our experiments, training a heatmap regression network with the Wing loss never converged.
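A small numerical sketch makes the gradient jump at zero error concrete. It implements the Wing loss in the form published by Feng et al. for a single pixel; the defaults `omega=10.0` and `epsilon=2.0` are illustrative assumptions, and `wing_grad` is a hypothetical helper written for this example.

```python
import math

def wing_loss(y, y_hat, omega=10.0, epsilon=2.0):
    """Wing loss for one pixel; omega and epsilon are illustrative defaults."""
    diff = abs(y - y_hat)
    c = omega - omega * math.log(1.0 + omega / epsilon)  # keeps the loss continuous at diff == omega
    if diff < omega:
        return omega * math.log(1.0 + diff / epsilon)
    return diff - c

def wing_grad(y, y_hat, omega=10.0, epsilon=2.0):
    """Derivative of wing_loss with respect to y_hat (set to 0 at exactly zero error)."""
    diff = y_hat - y
    if diff == 0.0:
        return 0.0
    sign = 1.0 if diff > 0 else -1.0
    if abs(diff) < omega:
        return sign * omega / (epsilon + abs(diff))
    return sign

# Just either side of zero error the gradient jumps between about -5 and +5.
print(wing_grad(0.0, 1e-6), wing_grad(0.0, -1e-6))
```

On a background pixel with target 0, an arbitrarily small prediction error therefore still receives a gradient of magnitude close to `omega / epsilon`, which is the disproportionate influence described above.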
The above analysis leads us to define the desired properties of an ideal loss function for heatmap regression. We expect our loss function to have a constant influence when the error is large, so that it is able to converge to a better location quickly at the beginning of the training process. As training proceeds and errors get smaller, there are two scenarios: i) For foreground pixels, the influence (as well as the gradient) should start to increase so that training is able to focus on reducing these errors. The influence should then decrease rapidly as the errors get very close to zero, so that these "good enough" points are no longer focused on. The reduced influence of correct estimations helps the network stay converged, instead of oscillating as with L1 and Wing loss. ii) For background pixels, the gradient should behave more like that of the MSE loss, that is, it should gradually decrease to zero as the training error decreases, so that the influence is relatively small when the errors are small. This property reduces the focus of training on background pixels and stabilizes the training process.
A fixed loss function cannot achieve both properties simultaneously. Thus, the loss function should be able to adapt to different pixel intensities on the ground truth heatmaps. For ground truth pixels close to the mode (with intensities close to 1), the influence of small errors should increase. For ground truth pixel intensities close to 0, the loss function should behave more like the MSE loss. Since pixel values on the ground truth heatmap range from 0 to 1, we also expect our loss function to have a smooth transition across different pixel values.
4.2 The Adaptive Wing Loss
Following the intuitions above, we propose our Adaptive Wing (AWing) loss, defined as follows:
$$\mathrm{AWing}(y, \hat{y}) = \begin{cases} \omega \ln\left(1 + \left|\dfrac{y - \hat{y}}{\epsilon}\right|^{\alpha - y}\right) & \text{if } |y - \hat{y}| < \theta \\ A\,|y - \hat{y}| - C & \text{otherwise} \end{cases}$$
where $y$ and $\hat{y}$ are the pixel values on the ground truth heatmap and the predicted heatmap respectively, $\omega$, $\theta$, $\epsilon$ and $\alpha$ are positive values, and $A = \omega \left(1/(1 + (\theta/\epsilon)^{\alpha - y})\right)(\alpha - y)\left((\theta/\epsilon)^{\alpha - y - 1}\right)(1/\epsilon)$ and $C = \theta A - \omega \ln\left(1 + (\theta/\epsilon)^{\alpha - y}\right)$ are used to make the loss function continuous and smooth at $|y - \hat{y}| = \theta$. Unlike Wing loss, which uses $\omega$ as the threshold, we introduce a new variable $\theta$ as the threshold to switch between the linear and nonlinear parts. For heatmap regression, we often regress a value between 0 and 1, so we expect our threshold to lie in this range. When $|y - \hat{y}| < \theta$, we consider the error to be small and in need of stronger influence. More importantly, we adopt an exponential term $\alpha - y$, which is used to adapt the shape of the loss function to $y$ and makes the loss function smooth at zero. Note that $\alpha$ has to be slightly larger than 2 to maintain the ideal properties we discussed in Sec. 4.1; this is due to the normalization of $y$ to the range $[0, 1]$. For pixels with $y$ close to 1 (the landmarks we want to localize), the power term $\alpha - y$ will be slightly larger than 1, and the nonlinear part will behave like Wing loss, which has large influence on smaller errors. But unlike Wing loss, the influence decreases to zero rapidly as errors get very close to zero (see Fig. 4). As $y$ decreases, the loss function shifts to an MSE-like loss function, which allows training not to focus on pixels that still have errors but small influence. Figure 5 shows how the power term $\alpha - y$ facilitates the smooth transition across different values of $y$, so that the influence of small errors gradually increases as the value of $y$ increases. Larger $\omega$ and smaller $\epsilon$ values increase the influence on small errors and vice versa; large values are shown to be effective according to our experiments.
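As a sanity check on the properties claimed above, the loss can be evaluated for a single pixel. The sketch below follows the published form of the Adaptive Wing loss; the default parameter values are placeholders chosen for illustration (only `theta=0.5` is taken from the ablation discussion in Sec. 7.6.1, and `alpha` is merely set "slightly larger than 2").

```python
import math

def awing_loss(y, y_hat, alpha=2.1, omega=14.0, epsilon=1.0, theta=0.5):
    """Adaptive Wing loss for one pixel.

    y is the ground truth intensity in [0, 1], y_hat the prediction.
    The default hyperparameters are illustrative assumptions.
    """
    diff = abs(y - y_hat)
    power = alpha - y                      # exponent adapts to the ground truth value
    t = (theta / epsilon) ** power
    # A and C make the two branches meet with equal value and slope at diff == theta.
    a = omega * (1.0 / (1.0 + t)) * power * (theta / epsilon) ** (power - 1.0) / epsilon
    c = theta * a - omega * math.log(1.0 + t)
    if diff < theta:
        return omega * math.log(1.0 + (diff / epsilon) ** power)
    return a * diff - c

# The same small error costs far more on a foreground pixel (y = 1)
# than on a background pixel (y = 0).
print(awing_loss(1.0, 0.95), awing_loss(0.0, 0.05))
```

The printed pair illustrates the adaptive behaviour: with `y = 1` the exponent is close to 1 and the loss is Wing-like, while with `y = 0` the exponent is close to 2 and the loss is MSE-like for small errors.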
The nonlinear part of our Adaptive Wing loss function behaves similarly to the Lorentzian (a.k.a. Cauchy) loss, in a more generalized fashion. But unlike robust loss functions such as Lorentzian and Geman-McClure, we do not need the gradient to decrease to zero as the error increases. This is due to the nature of heatmap regression. In robust regression, the learner learns to ignore noisy outliers with large errors. In the context of face alignment, all facial landmarks are annotated with relatively small noise, so we do not have noisy outliers to ignore. A linear loss is sufficient for training to converge to a location where predictions are fairly close to the ground truth heatmap, after which the loss function switches to its nonlinear part to refine the prediction with increased influence on small errors. In practice, we found that the linear form for large errors achieves better performance than keeping the nonlinear form when the error is large.
We empirically use in our model. In our experiments, we found , , to be most effective, detailed ablation studies on parameter settings are shown at Sec. 7.6.1.
4.3 Weighted loss map
In this section we discuss the second issue raised in Sec. 4.1. In a typical setting for facial landmark localization with a $64 \times 64$ heatmap and a Gaussian of size $7 \times 7$, foreground pixels constitute only 1.2% of all pixels. Assigning equal weight to such unbalanced data could make the training process slow to converge and result in inferior performance. To further establish the network's ability to focus on foreground pixels and difficult background pixels (background pixels that are close to foreground pixels), we introduce the Weighted Loss Map to balance the loss from different types of pixels. We first define our loss map mask $M$ to be:
where $H^d$ is generated from the ground truth heatmap $H$ by a gray dilation. The loss map mask $M$ assigns 1 to foreground pixels and difficult background pixels, and 0 to all other pixels.
With the loss map mask $M$, we define our Weighted Loss Map as follows:
$$\mathrm{Loss}_{\mathrm{weighted}}(H, \hat{H}) = \mathrm{Loss}(H, \hat{H}) \otimes (W \cdot M + 1)$$
where $\otimes$ is the element-wise product and $W$ is a scalar hyperparameter that controls how much weight is added. See Figure 6 for a visualization of weight map generation. In our experiments we use $W = 10$. The intuition is to assign different weights to different pixels on the heatmap. Foreground pixels have to be focused on during training, since these pixels are the most useful for localizing the mode of the Gaussian distribution. Difficult background pixels should also be focused on, since these pixels are relatively difficult to regress; accurately regressing them could help narrow down the area of foreground pixels and improve localization accuracy.
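The weight map construction can be sketched with a plain 3x3 gray dilation followed by thresholding. The kernel size and the `threshold=0.2` cut-off are assumptions made for this illustration (the text only states that a gray dilation is used), `w=10.0` mirrors the value quoted for the baseline weight map in the ablation study, and both function names are hypothetical.

```python
def grey_dilation_3x3(heatmap):
    """Gray-scale dilation: each pixel becomes the maximum of its 3x3 neighbourhood."""
    h, w = len(heatmap), len(heatmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = max(
                heatmap[a][b]
                for a in range(max(0, i - 1), min(h, i + 2))
                for b in range(max(0, j - 1), min(w, j + 2))
            )
    return out

def weight_map(heatmap, w=10.0, threshold=0.2):
    """Return W * M + 1, where M marks foreground and nearby (difficult) background pixels.

    threshold and the 3x3 kernel are assumptions for this sketch.
    """
    dilated = grey_dilation_3x3(heatmap)
    return [[w * (1.0 if v >= threshold else 0.0) + 1.0 for v in row] for row in dilated]

# A single bright pixel: it and its 8 neighbours get weight 11, everything else weight 1.
hm = [[0.0] * 5 for _ in range(5)]
hm[2][2] = 1.0
for row in weight_map(hm):
    print(row)
```

The per-pixel loss would then be multiplied element-wise by this map, so that the ring of background pixels immediately around each landmark is weighted as heavily as the foreground itself.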
5 Boundary Information
Inspired by LAB, we introduce boundary prediction into our network as a sub-task, but in a different manner. Instead of breaking boundaries into different parts, we add only one channel to our heatmap, a boundary channel that combines all boundary lines. We believe this efficiently captures the global information on a human face. The boundary information is then aggregated into the network naturally via convolution operations in a forward pass, and is also used in Section 6 to generate the boundary coordinate map, which further improves localization accuracy according to our ablation study in Sec. 7.6.1.
6 Coordinate aggregation
We integrate CoordConv into our model to improve the capability of a traditional convolutional neural network to capture coordinate information. In addition to the X, Y and radius coordinate encoding in CoordConv, we also leverage our boundary prediction to generate X and Y coordinates only at the boundary. More specifically, we define the X coordinate encoding to be $C_x$ and the boundary prediction from the previous HG to be $B$; the boundary coordinate encoding $C_{bx}$ is defined as:
$C_{by}$ is generated in a similar fashion from $C_y$. The coordinate channels are generated at runtime and then concatenated with the original input to perform regular convolution.
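One way to picture the coordinate channels is the sketch below, which builds X, Y and radius channels normalized to [-1, 1] and then keeps coordinates only where a boundary map responds. The function names, the normalization range and the `threshold=0.05` cut-off are assumptions for illustration; the text does not specify them.

```python
def coord_channels(height, width):
    """X, Y and radius channels with coordinates normalized to [-1, 1] (needs height, width > 1)."""
    xs = [[2.0 * j / (width - 1) - 1.0 for j in range(width)] for _ in range(height)]
    ys = [[2.0 * i / (height - 1) - 1.0 for _ in range(width)] for i in range(height)]
    rs = [[(xs[i][j] ** 2 + ys[i][j] ** 2) ** 0.5 for j in range(width)] for i in range(height)]
    return xs, ys, rs

def boundary_coords(coord, boundary, threshold=0.05):
    """Keep a coordinate only where the predicted boundary exceeds the (assumed) threshold."""
    return [
        [c if b > threshold else 0.0 for c, b in zip(coord_row, boundary_row)]
        for coord_row, boundary_row in zip(coord, boundary)
    ]

xs, ys, rs = coord_channels(3, 3)
boundary = [[0.0, 0.0, 1.0] for _ in range(3)]   # toy boundary along the right edge
print(boundary_coords(xs, boundary))
print(boundary_coords(ys, boundary))
```

In a network these channels would be concatenated with the feature map along the channel axis before a regular convolution, which is the CoordConv idea the section builds on.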
7.2 Evaluation Metrics
Normalized Mean Error (NME) is commonly used to evaluate the quality of face alignment algorithms. The NME for each image is defined as:
$$\mathrm{NME}(P, \hat{P}) = \frac{1}{N_P} \sum_{i=1}^{N_P} \frac{\lVert p_i - \hat{p}_i \rVert_2}{d}$$
where $P$ and $\hat{P}$ are the ground truth and the predicted landmark coordinates for each image respectively, $N_P$ is the number of landmarks of each image, $\hat{p}_i$ is the i-th predicted landmark coordinate in $\hat{P}$, $p_i$ is the i-th ground truth landmark coordinate in $P$, and $d$ is the normalization factor. For the COFW dataset, we use the inter-pupil distance (distance between eye centers) as the normalization factor. For the 300W dataset, we provide results with both the inter-ocular distance (distance between outer eye corners), used in the original evaluation protocol, and the inter-pupil distance. For the WFLW dataset, we use the inter-ocular distance.
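The per-image NME described above amounts to a mean of point-to-point distances divided by a normalization length. A minimal sketch, with landmarks given as `(x, y)` tuples and a hypothetical function name:

```python
import math

def nme(gt_landmarks, pred_landmarks, norm_factor):
    """Mean Euclidean landmark error for one image, divided by the normalization factor."""
    total = sum(
        math.hypot(px - gx, py - gy)
        for (gx, gy), (px, py) in zip(gt_landmarks, pred_landmarks)
    )
    return total / (len(gt_landmarks) * norm_factor)

# Two landmarks, one off by 5 pixels, normalized by a distance of 10 pixels.
print(nme([(0, 0), (10, 0)], [(3, 4), (10, 0)], norm_factor=10.0))  # 0.25
```

The choice of `norm_factor` (inter-pupil or inter-ocular distance) is what differs between the evaluation protocols listed in the text.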
Failure Rate (FR) is another metric to evaluate localization quality. For one image, if the NME is larger than a threshold, the prediction is considered failed. For the 300W private test dataset, we use 8% and 10% respectively to compare with different approaches. For the WFLW dataset, we follow [17, 58] and use 10% as the threshold.
The Cumulative Error Distribution (CED) curve shows the proportion of test samples whose NME falls below a given value. The curve is usually plotted from zero up to the NME failure rate threshold (e.g. 8%, 10%). Area Under Curve (AUC) is calculated based on the CED curve. A larger AUC reflects that a larger portion of the test dataset is well predicted.
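Both metrics can be computed from a list of per-image NME values. The sketch below is one plausible implementation: normalizing the AUC by the threshold (so that a perfect model scores 1.0) and the trapezoid integration with `steps=1000` are assumptions, and the function names are hypothetical.

```python
def failure_rate(nmes, threshold):
    """Fraction of images whose NME exceeds the threshold."""
    return sum(1 for e in nmes if e > threshold) / len(nmes)

def ced_auc(nmes, threshold, steps=1000):
    """Area under the CED curve on [0, threshold], divided by threshold."""
    xs = [threshold * k / steps for k in range(steps + 1)]
    # CED(x): proportion of images with NME <= x
    ced = [sum(1 for e in nmes if e <= x) / len(nmes) for x in xs]
    area = sum((ced[k] + ced[k + 1]) / 2.0 * (xs[k + 1] - xs[k]) for k in range(steps))
    return area / threshold

errors = [0.02, 0.05, 0.12]          # toy per-image NMEs
print(failure_rate(errors, 0.10))    # one image out of three fails
print(ced_auc(errors, 0.10))
```

Because the curve is truncated at the threshold, images that fail contribute nothing to the AUC, which is why FR and AUC tend to move in opposite directions.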
7.3 Implementation details
The input of the network is , the output of each stacked HG is
. We use four stacks of HG, same with other baselines. During training, we use RMSProp with an initial learning rate of . We set the momentum to be 0 (adopted from [7, 42]) and the weight decay to be
. We train for 240 epochs, and the learning rate is reduced after 80 and 160 epochs. Data augmentation is performed with random rotation, translation, flipping and rescaling. Random Gaussian blur, noise and occlusion are also used. All models are trained from scratch. During inference, we adopt the same strategy used by Newell et al.: the location of the pixel with the highest response is shifted a quarter pixel toward the second highest nearby pixel. The boundary line is generated from landmarks via a distance transform; different boundary lines are merged into one channel by selecting the maximum value at each pixel across all channels.
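The quarter-pixel refinement used at inference can be sketched as follows. This is one common reading of the strategy, applied per axis by comparing the two neighbours of the peak; it is an assumption about the details rather than a confirmed reproduction of the original code.

```python
def decode_with_quarter_offset(heatmap):
    """Arg-max location shifted by 0.25 px toward the stronger neighbour on each axis."""
    h, w = len(heatmap), len(heatmap[0])
    # Find the peak; tuples compare by value first, then by position.
    _, x, y = max((heatmap[i][j], j, i) for i in range(h) for j in range(w))
    fx, fy = float(x), float(y)
    if 0 < x < w - 1:
        d = heatmap[y][x + 1] - heatmap[y][x - 1]
        fx += 0.25 * ((d > 0) - (d < 0))   # sign of the horizontal difference
    if 0 < y < h - 1:
        d = heatmap[y + 1][x] - heatmap[y - 1][x]
        fy += 0.25 * ((d > 0) - (d < 0))   # sign of the vertical difference
    return fx, fy

hm = [[0.0] * 5 for _ in range(5)]
hm[2][2], hm[2][3], hm[2][1] = 1.0, 0.8, 0.2   # right neighbour is stronger
hm[1][2], hm[3][2] = 0.5, 0.5                  # vertical neighbours tie
print(decode_with_quarter_offset(hm))          # (2.25, 2.0)
```

The offset compensates for the quantization introduced by predicting on a low-resolution heatmap, since the true landmark rarely sits exactly at a pixel centre.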
Table 1: Evaluation on the WFLW dataset.
| Metric | Method | Testset | Pose | Expression | Illumination | Make-up | Occlusion | Blur |
|---|---|---|---|---|---|---|---|---|
| NME (%) | ESR (CVPR 14) | 11.13 | 25.88 | 11.47 | 10.49 | 11.05 | 13.75 | 12.20 |
| | SDM (CVPR 13) | 10.29 | 24.10 | 11.45 | 9.32 | 9.38 | 13.03 | 11.28 |
| | CFSS (CVPR 15) | 9.07 | 21.36 | 10.09 | 8.30 | 8.74 | 11.76 | 9.96 |
| | DVLN (CVPR 17) | 6.08 | 11.54 | 6.78 | 5.73 | 5.98 | 7.33 | 6.88 |
| | LAB (CVPR 18) | 5.27 | 10.24 | 5.51 | 5.23 | 5.15 | 6.79 | 6.32 |
| | Wing (CVPR 18) | 5.11 | 8.75 | 5.36 | 4.93 | 5.41 | 6.37 | 5.81 |
| FR10% (%) | ESR (CVPR 14) | 35.24 | 90.18 | 42.04 | 30.80 | 38.84 | 47.28 | 41.40 |
| | SDM (CVPR 13) | 29.40 | 84.36 | 33.44 | 26.22 | 27.67 | 41.85 | 35.32 |
| | CFSS (CVPR 15) | 20.56 | 66.26 | 23.25 | 17.34 | 21.84 | 32.88 | 23.67 |
| | DVLN (CVPR 17) | 10.84 | 46.93 | 11.15 | 7.31 | 11.65 | 16.30 | 13.71 |
| | LAB (CVPR 18) | 7.56 | 28.83 | 6.37 | 6.73 | 7.77 | 13.72 | 10.74 |
| | Wing (CVPR 18) | 6.00 | 22.70 | 4.78 | 4.30 | 7.77 | 12.50 | 7.76 |
| AUC10% | ESR (CVPR 14) | 0.2774 | 0.0177 | 0.1981 | 0.2953 | 0.2485 | 0.1946 | 0.2204 |
| | SDM (CVPR 13) | 0.3002 | 0.0226 | 0.2293 | 0.3237 | 0.3125 | 0.2060 | 0.2398 |
| | CFSS (CVPR 15) | 0.3659 | 0.0632 | 0.3157 | 0.3854 | 0.3691 | 0.2688 | 0.3037 |
| | DVLN (CVPR 17) | 0.4551 | 0.1474 | 0.3889 | 0.4743 | 0.4494 | 0.3794 | 0.3973 |
| | LAB (CVPR 18) | 0.5323 | 0.2345 | 0.4951 | 0.5433 | 0.5394 | 0.4490 | 0.4630 |
| | Wing (CVPR 18) | 0.5504 | 0.3100 | 0.4959 | 0.5408 | 0.5582 | 0.4885 | 0.4918 |
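Table 2: Evaluation on the COFW dataset.
| Method | NME (%) | AUC10% | FR10% (%) |
|---|---|---|---|
| TCDCN (ECCV 14) | 8.05 | - | - |
| Wu (ICCV 15) | 5.93 | - | - |
| RAR (ECCV 16) | 6.03 | - | 4.14 |
| DAC-CSR (CVPR 17) | 6.03 | - | 4.73 |
| SHN (CVPRW 17) | 5.60 | - | - |
| PCD-CNN (CVPR 18) | 5.77 | - | 3.73 |
| Wing (CVPR 18) | 5.44 | - | 3.75 |
| DCFE (ECCV 18) | 5.27 | 35.86 | 7.29 |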
7.3.1 Evaluation on COFW
Experiment results on the COFW dataset are shown in Table 2. Our approach outperforms the previous state of the art by a significant margin, especially on failure rate. We are able to reduce the failure rate measured at 10% NME from 3.73% to 0.99%. As for NME, our method performs much better than human (5.60%). Our performance on COFW shows the robustness of our approach against faces with large pose and heavy occlusion.
7.4 Evaluation on 300W
Our method achieves state-of-the-art performance on the 300W testing dataset; see Table 3. On the challenge subset (iBUG dataset), we outperform Wing by a significant margin, which also demonstrates the robustness of our approach against occlusion and large pose variation. Furthermore, on the 300W private test dataset (Table 4), we again outperform the previous state of the art on various metrics including NME, AUC and FR measured at either 8% or 10% NME. Note that we more than halved the failure rate of the next best baseline to 0.83%, which means only 5 faces out of 600 have an NME larger than 8%.
Table 3: Evaluation on the 300W testing dataset (NME, %).
| Method | Common | Challenge | Full |
|---|---|---|---|
| CFAN (ECCV 14) | 5.50 | 16.78 | 7.69 |
| SDM (CVPR 13) | 5.57 | 15.40 | 7.50 |
| LBF (CVPR 14) | 4.95 | 11.98 | 6.32 |
| CFSS (CVPR 15) | 4.73 | 9.98 | 5.76 |
| MDM (CVPR 16) | 4.83 | 10.14 | 5.88 |
| RAR (ECCV 16) | 4.12 | 8.35 | 4.94 |
| DVLN (CVPR 17) | 3.94 | 7.62 | 4.66 |
| TSR (CVPR 17) | 4.36 | 7.56 | 4.99 |
| DSRN (CVPR 18) | 4.12 | 9.68 | 5.21 |
| LAB (CVPR 18) | 4.20 | 7.41 | 4.92 |
| RCN+ (L+ELT) (CVPR 18) | 4.20 | 7.78 | 4.90 |
| DCFE (ECCV 18) | 3.83 | 7.54 | 4.55 |
| Wing (CVPR 18) | 3.27 | 7.18 | 4.04 |
| PCD-CNN (CVPR 18) | 3.67 | 7.62 | 4.44 |
| CPM+SBR (CVPR 18) | 3.28 | 7.58 | 4.10 |
| SAN (CVPR 18) | 3.34 | 6.60 | 3.98 |
| LAB (CVPR 18) | 2.98 | 5.19 | 3.49 |
| DU-Net (ECCV 18) | 2.90 | 5.15 | 3.35 |
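Table 4: Evaluation on the 300W private test dataset.
| Method | NME (%) | AUC | FR (%) |
|---|---|---|---|
| ESR (CVPR 14) | - | 32.35 | 17.00 |
| cGPRT (CVPR 15) | - | 41.32 | 12.83 |
| CFSS (CVPR 15) | - | 39.81 | 12.30 |
| MDM (CVPR 16) | 5.05 | 45.32 | 6.80 |
| DAN (CVPRW 17) | 4.30 | 47.00 | 2.67 |
| SHN (CVPRW 17) | 4.05 | - | - |
| DCFE (ECCV 18) | 3.88 | 52.42 | 1.83 |
| Fan (16) | - | 48.02 | 14.83 |
| DR + MDM (CVPR 17) | - | 52.19 | 3.67 |
| LAB (CVPR 18) | - | 58.85 | 0.83 |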
7.5 Evaluation on WFLW
Our method again achieves the best results on the WFLW dataset (Table 1), which is significantly more difficult than COFW and 300W (see Fig. 7 for visualizations). On every subset we outperform the previous state-of-the-art approaches by a significant margin. Note that the Wing baseline uses ResNet50 as the backbone architecture, which already performs better than the CNN6/7 architecture it used on COFW and 300W. We are also able to reduce the failure rate and increase the AUC dramatically, and hence improve the overall localization quality significantly. All in all, our approach fails on only 2.04% of all images, almost a threefold improvement over the previous best results.
7.6 Ablation study
7.6.1 Evaluation on different Adaptive Wing loss parameters
To find the optimal parameter settings of the Adaptive Wing loss for heatmap regression, we examined different parameter combinations and evaluated them on the WFLW dataset. However, the search space is too large and we only have limited resources. To reduce the search space, we set our initial $\theta$ to 0.5: since the pixel values of the ground truth heatmap range from 0 to 1, we believe focusing on errors smaller than 0.5 is more than enough. Table 5 shows NMEs for different combinations of $\omega$ and $\epsilon$, from which we picked our values of $\omega$ and $\epsilon$. The experiments also show that our Adaptive Wing loss is not very sensitive to $\omega$ and $\epsilon$, since the differences in NME are not significant within a certain range of settings. We then fixed $\omega$ and $\epsilon$ and examined different values of $\theta$; the results are shown in Table 6.
7.6.2 Evaluation on different modules
Evaluation of the effectiveness of different modules is shown in Table 7. The dataset used for the ablation study is WFLW. Note that the baseline model (the model trained with MSE) underperforms the state of the art. To compare with a naive weight mask, we introduced a baseline weight map WMbase = ĤW+1, where W = 10. Note that with only a different loss function, our method is able to outperform Wing and LAB by a significant margin. The major contribution comes from the Adaptive Wing loss, which improves the benchmark by 0.74%. All other modules contribute incrementally to localization performance: our Weighted Loss Map improves it by 0.25%, and boundary prediction and coordinate encoding contribute another 0.09%. Our Weighted Loss Map also outperforms WMbase by a considerable margin, thanks to its ability to focus on hard background pixels.
7.6.3 Effectiveness of Adaptive Wing loss on Training
Table 8 shows the effectiveness of our Adaptive Wing loss compared with the MSE loss in terms of training loss with respect to the number of training epochs. The model trained with the Adaptive Wing loss reduces the pixel-wise average MSE loss by almost 30%, and by more than 23% on foreground pixels. Notably, this improvement appears early in training, showing that the AWing loss improves convergence speed.
7.7 Evaluation on human pose estimation
Although this paper mainly deals with face alignment, we have also performed experiments to demonstrate the ability of the proposed Adaptive Wing loss in another heatmap regression task, human pose estimation. We choose LSP (using person-centric (PC) annotations) as the evaluation dataset. The LSP dataset consists of 11000 training images and 1000 testing images. Each image is labeled with 14 keypoints. The goal of this experiment is to examine the capability of the proposed Adaptive Wing loss to handle the pose estimation task compared with the baseline MSE loss, rather than to achieve the state of the art in human pose estimation. Some other works [10, 57, 26, 44] obtain better results by adding MPII to the training data, by finetuning an MPII-pretrained model, or by using re-annotated labels with high resolution images. Besides the MSE loss baseline, we also report baselines from methods that were trained solely on the LSP dataset. We trained our model from scratch with the original labeling and low resolution images to see how well our Adaptive Wing loss could handle labeling noise and low quality images. Percentage Correct Keypoints (PCK) is used as the evaluation metric, with torso dimension as the normalization factor. Please refer to the supplemental materials for more implementation details. Results are shown in Table 9. Our proposed Adaptive Wing loss significantly boosts performance compared with MSE, which supports its general applicability to other heatmap regression tasks.
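For readers unfamiliar with PCK, the metric counts a keypoint as correct when its error is within a fraction of a reference length. The sketch below assumes the usual form of the metric; the fraction `alpha=0.2` and the function name are assumptions, since the text only states that the torso dimension is the normalization factor.

```python
import math

def pck(gt_keypoints, pred_keypoints, torso_size, alpha=0.2):
    """Fraction of keypoints whose error is at most alpha * torso_size (alpha is assumed)."""
    correct = sum(
        1
        for (gx, gy), (px, py) in zip(gt_keypoints, pred_keypoints)
        if math.hypot(px - gx, py - gy) <= alpha * torso_size
    )
    return correct / len(gt_keypoints)

# One keypoint is 1 px off (correct), the other 10 px off (incorrect) for a 10 px torso.
print(pck([(0, 0), (10, 10)], [(1, 0), (20, 10)], torso_size=10.0))  # 0.5
```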
8 Conclusion
In this paper, we identified two issues with the MSE loss function in heatmap regression. To resolve these issues, we proposed the Adaptive Wing loss and the Weighted Loss Map for accurate localization of facial landmarks. To further improve localization results, we also introduced boundary prediction and CoordConv with boundary coordinates into our model. Experiments show that our approach outperforms the state of the art on multiple datasets by a significant margin under various evaluation metrics, especially failure rate and AUC, which indicates that our approach is more robust in difficult scenarios.
9 Supplementary Material
9.1 Implementation Detail of CoordConv on Boundary Information
9.2 Datasets Used in Our Experiments
The COFW  dataset includes 1,345 training images and 507 testing images annotated with 29 landmarks. This dataset is aimed to test the effectiveness of face alignment algorithms on faces with large pose and heavy occlusion. Various types of occlusions are introduced and result in a 23% occlusion on facial parts in average.
The 300W  is widely used as a 2D face alignment benchmark with 68 annotated landmarks. 300W consists of the following subsets: LFPW , HELEN , AFW , XM2VTS  and an additional dataset with 135 images with large pose, occlusion and expressions called iBUG. To compare with other approaches, we adopt the widely used protocol described in  to train and evaluate our approach. More specifically, we use the training dataset of LFPW, HELEN, and the full AFW dataset as training dataset, and the test dataset of LFPW, HELEN and the full iBUG dataset as full test dataset. The full test dataset is then further split into two subsets, the test dataset of LFPW and HELEN is called the common test dataset, and iBUG is called the challenge test dataset. There is also a 300W private test dataset for the 300W contest, which contains 300 indoor and 300 outdoor faces. We also evaluated our approach on this dataset.
The WFLW  is a newly introduced dataset with 98 manually annotated landmarks that constitutes of 7,500 training images and 2,500 testing images. In addition to denser annotations, it also provides attribute annotations including pose, expression, illumination, make-up, occlusion and blur. The six different subsets can be used for analyzing algorithm performance on subsets with different properties separately. The WFLW is considered more difficult than commonly used datasets such as AFLW and 300W due to its more densely annotated landmarks and difficult faces with occlusion, blur, large pose, makeup, expression and illumination.
For the LSP dataset, we used the original labels from the author's official websites (http://sam.johnson.io/research/lsp.html and http://sam.johnson.io/research/lspet.html). Although images at their original resolutions are also provided, we chose not to use them. We also did not use the re-annotated labels for the 10,000 LSP extended training images. Note that occluded keypoints are annotated in the original LSP dataset but not in the LSP extended training dataset. During training, we did not calculate the loss on occluded keypoints for the LSP extended training dataset. During training and testing, we did not crop single persons from images containing multiple persons, as done in prior work, in order to retain the difficulty of this dataset. Data augmentation is performed in the same way as for training on the face alignment datasets.
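Skipping the loss on occluded keypoints amounts to masking out the corresponding heatmap channels. The following is a minimal sketch under stated assumptions: plain squared error stands in for the actual training loss, heatmaps are flattened lists of pixel values, and the function name `masked_heatmap_loss` and the `visible` flag are hypothetical names introduced for illustration.

```python
from typing import Sequence


def masked_heatmap_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    visible: Sequence[bool],
) -> float:
    """Mean squared error over the pixels of visible keypoint channels only.

    pred, target: one flattened heatmap (list of pixel values) per keypoint.
    visible: assumed per-keypoint flag; False marks an occluded keypoint
             whose channel is excluded from the loss.
    """
    total, count = 0.0, 0
    for p_chan, t_chan, vis in zip(pred, target, visible):
        if not vis:
            continue  # occluded keypoint: contributes no loss
        total += sum((p - t) ** 2 for p, t in zip(p_chan, t_chan))
        count += len(t_chan)
    # With no visible keypoints there is nothing to penalize.
    return total / count if count else 0.0


# Two keypoints with three pixels each; the second keypoint is occluded,
# so its large error is ignored.
pred = [[0.0, 0.5, 0.0], [0.9, 0.9, 0.9]]
target = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
loss = masked_heatmap_loss(pred, target, visible=[True, False])
```

Normalizing by the number of visible pixels rather than all pixels keeps the loss scale comparable between images with different numbers of occluded keypoints; this normalization is a design choice of the sketch, not something stated in the text.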
9.3 Additional Ablation Study
9.4 Experiment on different number of HG stacks
We compare the performance of models with different numbers of stacked HG modules (see details in Table 10). With a reduced number of HGs, the performance of our approach remains strong. Even with only one HG block, our approach still outperforms the previous state of the art on all datasets except the common subset and the full dataset of 300W. Note that the one-HG model runs at 120 FPS on an Nvidia GTX 1080Ti graphics card. This result reflects the effectiveness of our approach under limited computational resources.
9.5 Result Visualization
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
-  P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. IEEE transactions on pattern analysis and machine intelligence, 35(12):2930–2940, 2013.
-  M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer vision and image understanding, 63(1):75–104, 1996.
-  M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International Journal of Computer Vision, 19(1):57–91, 1996.
-  A. Bulat and G. Tzimiropoulos. Two-stage convolutional part heatmap regression for the 1st 3d face alignment in the wild (3dfaw) challenge. In European Conference on Computer Vision, pages 616–624. Springer, 2016.
-  A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In The IEEE International Conference on Computer Vision (ICCV), volume 1, page 4, 2017.
-  A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, volume 1, page 4, 2017.
-  X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1513–1520, 2013.
-  X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, 2014.
-  X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1831–1840, 2017.
-  J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
-  J. Deng, Q. Liu, J. Yang, and D. Tao. M3 csr: Multi-view, multi-scale and multi-component cascade shape regression. Image and Vision Computing, 47:19–26, 2016.
-  J. Deng, G. Trigeorgis, Y. Zhou, and S. Zafeiriou. Joint multi-view face alignment in the wild. arXiv preprint arXiv:1708.06023, 2017.
-  X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Style aggregated network for facial landmark detection. In CVPR, volume 2, page 6, 2018.
-  P. Dou, S. K. Shah, and I. A. Kakadiaris. End-to-end 3d face reconstruction with deep neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 21–26, 2017.
-  H. Fan and E. Zhou. Approaching human level facial landmark localization by deep learning. Image and Vision Computing, 47:27–35, 2016.
-  Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  Z.-H. Feng, J. Kittler, W. Christmas, P. Huber, and X.-J. Wu. Dynamic attention-controlled cascaded shape regression exploiting training data augmentation and fuzzy-set sample weighting. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3681–3690. IEEE, 2017.
-  S. Ganan and D. McClure. Bayesian image analysis: An application to single photon emission tomography. Amer. Statist. Assoc, pages 12–18, 1985.
-  P. Garrido, M. Zollhöfer, D. Casas, L. Valgaerts, K. Varanasi, P. Pérez, and C. Theobalt. Reconstruction of personalized 3d face rigs from monocular video. ACM Transactions on Graphics (TOG), 35(3):28, 2016.
-  R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. Densereg: Fully convolutional dense shape regression in-the-wild. In CVPR, volume 2, page 5, 2017.
-  F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
-  T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4295–4304, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz. Improving landmark localization with semi-supervised learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
-  S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, volume 2, page 5, 2010.
-  S. Kang, J. Lee, K. Bong, C. Kim, Y. Kim, and H.-J. Yoo. Low-power scalable 3-d face frontalization processor for cnn-based face recognition in mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018.
-  M. Kowalski, J. Naruniec, and T. Trzcinski. Deep alignment network: A convolutional neural network for robust face alignment. In Proceedings of the International Conference on Computer Vision & Pattern Recognition (CVPRW), Faces-in-the-wild Workshop/Challenge, volume 3, page 6, 2017.
-  A. Kumar and R. Chellappa. Disentangling 3d pose in a dendritic cnn for unconstrained 2d face alignment.
-  A. Kumar and R. Chellappa. Disentangling 3d pose in a dendritic cnn for unconstrained 2d face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 430–439, 2018.
-  V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In European conference on computer vision, pages 679–692. Springer, 2012.
-  D. Lee, H. Park, and C. D. Yoo. Face alignment using cascade gaussian process regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4204–4212, 2015.
-  F. Liu, D. Zeng, Q. Zhao, and X. Liu. Joint face alignment and 3d face reconstruction. In European Conference on Computer Vision, pages 545–560. Springer, 2016.
-  R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. arXiv preprint arXiv:1807.03247, 2018.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 1, 2017.
-  J.-J. Lv, X. Shao, J. Xing, C. Cheng, X. Zhou, et al. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In CVPR, volume 1, page 4, 2017.
-  I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4838–4846, 2016.
-  D. Merget, M. Rock, and G. Rigoll. Robust facial landmark detection via a fully-convolutional local-global context network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 781–790, 2018.
-  K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. Xm2vtsdb: The extended m2vts database. In Second international conference on audio and video-based biometric person authentication, volume 964, pages 965–966, 1999.
-  X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang. Direct shape regression networks for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5040–5049, 2018.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
-  X. Peng, Z. Tang, F. Yang, R. S. Feris, and D. Metaxas. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2226–2234, 2018.
-  L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4929–4937, 2016.
-  S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014.
-  S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment via regressing local binary features. IEEE Transactions on Image Processing, 25(3):1233–1245, 2016.
-  S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment via regressing local binary features. IEEE Transactions on Image Processing, 25(3):1233–1245, 2016.
-  J. Roth, Y. Tong, and X. Liu. Unconstrained 3d face reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In 2013 IEEE International Conference on Computer Vision Workshops, pages 397–403. IEEE, 2013.
-  X. Shao, J. Xing, J.-J. Lv, C. Xiao, P. Liu, Y. Feng, C. Cheng, and F. Si. Unconstrained face alignment without face detection. In CVPR Workshops, pages 2069–2077, 2017.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
-  Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. Metaxas. Quantized densely connected u-nets for efficient landmark localization. In European Conference on Computer Vision (ECCV), 2018.
-  T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
-  G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4177–4187, 2016.
-  R. Valle and M. José. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In Proceedings of the European Conference on Computer Vision (ECCV), pages 585–601, 2018.
-  Y. Wang, H. Yu, J. Dong, B. Stevens, and H. Liu. Facial expression-aware face frontalization. In Asian Conference on Computer Vision, pages 375–388. Springer, 2016.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
-  W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou. Look at boundary: A boundary-aware face alignment algorithm. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  W. Wu and S. Yang. Leveraging intra and inter-dataset variations for robust face alignment. In Proceedings of the International Conference on Computer Vision & Pattern Recognition (CVPR), Faces-in-the-wild Workshop/Challenge, volume 3, page 6, 2017.
-  Y. Wu and Q. Ji. Robust facial landmark detection under significant head poses and occlusion. In Proceedings of the IEEE International Conference on Computer Vision, pages 3658–3666, 2015.
-  Y. Wu, S. K. Shah, and I. A. Kakadiaris. Godp: Globally optimized dual pathway deep network architecture for facial landmark localization in-the-wild. Image and Vision Computing, 73:1–16, 2018.
-  S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim. Robust facial landmark detection via recurrent attentive-refinement networks. In European conference on computer vision, pages 57–72. Springer, 2016.
-  X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 532–539, 2013.
-  J. Yang, Q. Liu, and K. Zhang. Stacked hourglass network for robust facial landmark localisation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 2025–2033. IEEE, 2017.
-  J. Yang, Q. Liu, and K. Zhang. Stacked hourglass network for robust facial landmark localisation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 2025–2033. IEEE, 2017.
-  J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua. Neural aggregation network for video face recognition. In CVPR, volume 4, page 7, 2017.
-  Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE transactions on pattern analysis and machine intelligence, 35(12):2878–2890, 2013.
-  J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In European Conference on Computer Vision, pages 1–16. Springer, 2014.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning deep representation for face alignment with auxiliary attributes. IEEE transactions on pattern analysis and machine intelligence, 38(5):918–930, 2016.
-  S. Zhu, C. Li, C. Change Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, 2015.
-  X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.