Attentive One-Dimensional Heatmap Regression for Facial Landmark Detection and Tracking

04/05/2020 ∙ by Shi Yin, et al. ∙ USTC

Although heatmap regression is considered a state-of-the-art method to locate facial landmarks, it suffers from huge spatial complexity and is prone to quantization error. To address this, we propose a novel attentive one-dimensional heatmap regression method for facial landmark localization. First, we predict two groups of 1D heatmaps to represent the marginal distributions of the x and y coordinates. These 1D heatmaps reduce spatial complexity significantly compared to current heatmap regression methods, which use 2D heatmaps to represent the joint distributions of x and y coordinates. With much lower spatial complexity, the proposed method can output high-resolution 1D heatmaps despite limited GPU memory, significantly alleviating the quantization error. Second, a co-attention mechanism is adopted to model the inherent spatial patterns existing in x and y coordinates, and therefore the joint distributions on the x and y axes are also captured. Third, based on the 1D heatmap structures, we propose a facial landmark detector capturing spatial patterns for landmark detection on an image; and a tracker further capturing temporal patterns with a temporal refinement mechanism for landmark tracking. Experimental results on four benchmark databases demonstrate the superiority of our method.




1 Introduction

As a fundamental computer vision task, face alignment [32, 7] consists of two sub-tasks: facial landmark detection on a static image, and facial landmark tracking in a video with continuous frames. Regression-based methods for face alignment fall into two categories: coordinate regression and heatmap regression. Coordinate regression approaches directly map facial appearances to the continuous values of landmark coordinates or their displacements. This technique has difficulty handling spatial distributions around ground truth coordinates, especially for coordinates with large variances. Therefore, as reported by Sun et al. [27], its spatial generalization performance is sub-optimal.

Figure 1: Illustration of the quantization error caused by low-resolution 2D heatmap.

Recently, heatmap regression approaches have been proposed. These approaches assume that a landmark coordinate follows a Gaussian distribution around its ground truth label. They predict the discrete probability that a landmark occurs at each point of the heatmap, and transform points on the heatmaps into the predicted landmark coordinates. Heatmap regression approaches successfully capture the spatial distributions around ground truths. They are thus theoretically capable of better spatial modeling than coordinate regression approaches. However, the quantization process, i.e., using discrete heatmaps to represent continuous landmark coordinates with both integer and fractional parts, may cause information loss, which is called quantization error. One way to alleviate this problem is to increase the heatmap resolution. Unfortunately, current heatmap regression methods suffer from high spatial complexity, as they use 2D heatmaps to represent joint distributions on the x and y axes. Specifically, a landmark produces a heatmap with $L^2$ points, where $L$ is the output resolution on the x and y axes. The total output size of $N$ landmarks is $L^2 N$. The space occupation of 2D heatmaps increases dramatically as $L$ increases. Due to limited machine memory, $L$ is typically restricted to a value lower than the resolution of the input face [18, 2]. This compromise may cause severe quantization errors, as shown in Fig. 1, limiting the practical performance of the heatmap regression method. To reduce quantization errors, some methods [27, 29] estimate the fractional part of coordinates from a 2D heatmap. However, they still suffer from the information loss caused by low resolution.

(a) 1D heatmaps on the x and y axes are generated by CNN networks. The co-attention mechanism is applied between the features extracted by the CNNs for the two axes to capture the joint distribution on the two axes.
(b) Based on the spatial patterns captured by the detector, the tracker further captures temporal patterns by fusing features on the current frame with features from past frames. Temporal patterns are used to refine the detected heatmaps.
Figure 2: The proposed detector (a) and tracker (b).

To overcome the disadvantages of 2D heatmaps, we propose a novel attentive one-dimensional heatmap regression approach that achieves strong spatial and temporal modeling capability while significantly decreasing the output complexity. The basic idea is to replace the 2D heatmaps with 1D heatmaps that represent marginal distributions on the x and y axes as the output structure. Considering the correlation between the x and y coordinates, we capture the x-y joint distribution implicitly by applying a co-attention mechanism to the distributional features of the two axes. Our method significantly decreases the spatial complexity of the output. For $N$ landmarks at output resolution $L$, the total output size of 1D heatmaps on the two axes is $2LN$, much smaller than that of the 2D heatmaps ($L^2 N$). The small output size allows us to fully boost the resolution despite limited GPU memory, significantly alleviating quantization errors. Based on the proposed heatmap structure, we design a facial landmark detector and a tracker, as shown in Fig. 2(a) and 2(b), respectively. The detector captures spatial patterns in a static image, while the tracker integrates both spatial and temporal patterns from the facial sequence in a video to locate landmarks. For the detector, we design two groups of convolutional neural networks (CNNs) to compress the facial representation tensor to 1D heatmaps on the x and y axes. The co-attention mechanism is adopted between the features extracted by the CNNs. In the tracker, spatial patterns from the current frame are extracted by the proposed detector, and temporal patterns among multiple frames are embedded by a novel temporal refinement mechanism that integrates features on the current frame with features from past frames.

We conduct facial landmark detection experiments on the 300W dataset and the AFLW dataset, and conduct facial landmark tracking experiments on the 300VW dataset and the TF dataset. Experimental results on these datasets show that the proposed method outperforms state-of-the-art coordinate regression and heatmap regression methods.

The main contributions of our method are threefold. First, we are the first to propose predicting 1D heatmaps on the x and y axes instead of using 2D heatmaps to locate landmarks, and we successfully alleviate the quantization error with a fully boosted output resolution. Second, we propose a co-attention module to capture the joint coordinate distribution on the two axes. Third, based on the proposed heatmap regression method, we design a facial landmark detector and tracker which achieve state-of-the-art performance.

2 Related Work

2.1 Coordinate Regression Methods

Coordinate regression approaches [34, 4, 3, 21, 28, 36, 14, 33] directly predict coordinate values or their increments by a mapping function. In early work, Xiong et al. [34] proposed the Supervised Descent Method (SDM), which maps Scale-Invariant Feature Transform (SIFT) features to landmark displacements between the current output and the ground truth by minimizing a nonlinear least squares objective function. Cao et al. [4] and Burgos-Artizzu et al. [3] learned fern regressors to predict landmark increments. Ren et al. [21] proposed to learn local binary features for every facial landmark by random forests.

Recently, deep learning based methods have been proposed for face alignment. Sun et al. [28] proposed a CNN based method to predict facial landmarks in a cascaded way. Zhang et al. [36] proposed the Tasks-Constrained Deep Convolutional Network (TCDCN), a multi-task learning method that learns to predict landmark coordinates as well as other facial attributes, including expression, gender, etc. Liu et al. [14] combined a CNN with an RNN based encoder-decoder network to learn spatial and temporal patterns of landmarks in adjacent frames.

To capture dependencies among landmark labels, some works adopted probabilistic graphical models, such as a Dynamic Bayesian Network (DBN) or a Restricted Boltzmann Machine (RBM) [33], as shape constraints. Yin et al. [35] utilized adversarial learning to explore the inherent dependencies among the movements of facial landmarks.

Despite this progress, it is still challenging for a coordinate regression method to capture spatial distributions around ground truth coordinates, especially when the coordinate values vary over a wide range. This weakness leads to sub-optimal spatial generalization performance.

2.2 Heatmap Regression Methods

Heatmap regression methods [18, 2, 19, 8, 31, 5, 27, 29] capture spatial distributions around ground truths by likelihood heatmaps of landmarks. Newell et al. [18] proposed a stacked hourglass network to generate heatmaps for 2D human pose estimation. Bulat and Tzimiropoulos [2] enhanced the stacked hourglass network with hierarchical, parallel and multi-scale residual blocks. Xi et al. [19] proposed a spatial and temporal recurrent learning method for landmark detection and tracking. Based on the heatmap technique, Chu et al. [8] proposed a multi-context attention mechanism to focus on informative feature regions. Wu et al. [31] proposed to estimate the heatmap of the facial boundary as auxiliary features to locate landmarks. Chen et al. [6] designed an adversarial learning method to learn structural patterns among landmarks. Chen et al. [5] proposed a Conditional Random Field (CRF) method to embed geometric relationships among landmarks based on their heatmaps. Liu et al. [15] proposed a heatmap correction unit which uses global shape constraints to refine heatmaps.

The heatmap structure achieves good theoretical performance for spatial generalization. However, it suffers from huge spatial complexity. Under limited space, the heatmap resolution is typically compressed to a value smaller than the input face. That leads to serious quantization errors. To address this, Sun et al. [27] proposed to integrate all point locations weighted by their probabilities in the 2D heatmap as the predicted coordinate. Tai et al. [29] proposed Fractional Heatmap Regression (FHR), which uses three heatmap points to estimate the fractional parts of landmark coordinates according to a Gaussian function. However, these 2D heatmap-based methods still suffer from the information loss caused by low resolution.

To address the disadvantages of 2D heatmap regression, we propose a new regression method based on 1D heatmaps which represent the marginal distribution on each axis. We capture joint distributions between the x and y axes by a co-attention mechanism, instead of using 2D heatmaps. The proposed regression method is much more space-efficient, and the output resolution can be fully boosted despite limited space. Therefore, the quantization error is significantly alleviated.

3 Analysis on the Quantization Error of 2D Heatmap

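Conventional heatmap regression methods, denoted as $f_{2D}$, predict discrete 2D joint distributions (heatmaps) for $N$ predefined landmarks from a facial image $I$, as shown in Equation (1):

$$\{h_n\}_{n=1}^{N} = f_{2D}(I; \theta_{2D}) \tag{1}$$

where $\theta_{2D}$ denotes the parameters of $f_{2D}$ and $h_n$ denotes the heatmap of the $n$th landmark with resolution $L \times L$. The ground truth heatmap, represented as $h_n^*$, is considered a discrete Gaussian distribution centered on the ground truth position, i.e., $(i_n^*, j_n^*)$. The probability density at an arbitrary heatmap point $(i, j)$ follows Equation (2):

$$h_n^*(i, j) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(i - i_n^*)^2 + (j - j_n^*)^2}{2\sigma^2}\right) \tag{2}$$

where $\sigma^2$ is the variance of the Gaussian distribution.

The heatmap structure models the spatial distributions of landmarks to obtain spatial generalization. However, using a discrete heatmap to represent continuous landmark coordinates with both integer and fractional parts may cause quantization error. This is because the rounding-down operation is applied to convert the continuous coordinate $(x_n^*, y_n^*)$ to the discrete heatmap point $(i_n^*, j_n^*)$, as shown in Equation (3):

$$i_n^* = \left\lfloor \frac{L}{F} x_n^* \right\rfloor, \quad j_n^* = \left\lfloor \frac{L}{F} y_n^* \right\rfloor \tag{3}$$

where $F$ is the resolution of the input face. The rounding-down operation drops the fractional part of its input. Therefore, we can only recover an approximate value of $(x_n^*, y_n^*)$ from $(i_n^*, j_n^*)$, as shown in Equation (4):

$$\tilde{x}_n = \frac{F}{L} i_n^*, \quad \tilde{y}_n = \frac{F}{L} j_n^* \tag{4}$$

where $(\tilde{x}_n, \tilde{y}_n)$ is the coordinate recovered from the heatmap. Quantization error is defined as the Euclidean distance between $(x_n^*, y_n^*)$ and $(\tilde{x}_n, \tilde{y}_n)$, as shown in Equation (5):

$$e_n = \sqrt{(x_n^* - \tilde{x}_n)^2 + (y_n^* - \tilde{y}_n)^2} \tag{5}$$

From Equation (3) and Equation (4), we find that the higher the value of $L$, the closer $\tilde{x}_n$ comes to $x_n^*$ and $\tilde{y}_n$ comes to $y_n^*$. In other words, the quantization error decreases. Since rounding down discards less than one heatmap cell per axis, each recovered coordinate deviates from its ground truth by less than $F/L$, so $e_n < \sqrt{2} F / L$. Therefore, with a given $F$, one way to reduce $e_n$ is to increase $L$.

Unfortunately, the 2D heatmap structure has very high spatial complexity. For $N$ landmarks, a total of $L^2 N$ heatmap points are generated. Due to limited machine memory, current heatmap regression methods usually set $L$ to a value smaller than $F$, resulting in severe quantization errors.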
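The effect of resolution on quantization error can be illustrated with a minimal Python sketch. It assumes that a coordinate is mapped to a heatmap index by scaling with the ratio of heatmap resolution to face size and rounding down, and is mapped back by the inverse scaling. The function names and the sample landmark are hypothetical and chosen only for illustration.

```python
import math

def quantize(coord, face_size, resolution):
    """Map a continuous coordinate to a heatmap index by rounding down."""
    return math.floor(coord * resolution / face_size)

def recover(index, face_size, resolution):
    """Map a heatmap index back to image coordinates."""
    return index * face_size / resolution

def quantization_error(x, y, face_size, resolution):
    """Euclidean distance between the true and the recovered coordinate."""
    x_rec = recover(quantize(x, face_size, resolution), face_size, resolution)
    y_rec = recover(quantize(y, face_size, resolution), face_size, resolution)
    return math.hypot(x - x_rec, y - y_rec)

if __name__ == "__main__":
    # Hypothetical landmark on a 256-pixel face; values are illustrative only.
    x, y, face = 101.7, 57.3, 256
    for res in (64, 128, 256, 768):
        err = quantization_error(x, y, face, res)
        bound = math.sqrt(2) * face / res
        print(f"resolution={res:4d}  error={err:.3f}  upper bound={bound:.3f}")
```

For a single point the error is not guaranteed to shrink at every step, but its upper bound shrinks in proportion to the resolution, which is the reason a higher output resolution helps.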

4 Methodology

The key to reducing quantization errors is to boost the output resolution despite limited machine memory. For that purpose, we propose a new method with strong spatial and temporal modeling capability and significantly lower output complexity than 2D heatmap regression methods. Instead of predicting joint distributions on the x and y axes explicitly by 2D heatmaps that occupy a large amount of space, we propose to model joint distributions implicitly and predict only 1D heatmaps that represent marginal distributions.

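Based on such an idea, we propose a new landmark detector, as depicted in Fig. 2(a). The detector $f_D$ with parameters $\theta_D$ is formalized as Equation (6), where $I$ is the input facial image, $N$ is the number of landmarks, and $h_n^x$ and $h_n^y$ are the predicted 1D heatmaps on the x and y axes, respectively, for the $n$th landmark.

$$\{(h_n^x, h_n^y)\}_{n=1}^{N} = f_D(I; \theta_D) \tag{6}$$

Coordinate prediction for the $n$th landmark ($\hat{x}_n$, $\hat{y}_n$) on a facial image is obtained from the maximum points of $h_n^x$ and $h_n^y$, as shown in Equation (7), where $L$ is the heatmap resolution and $F$ is the resolution of the input face:

$$\hat{x}_n = \frac{F}{L} \arg\max_i h_n^x(i), \quad \hat{y}_n = \frac{F}{L} \arg\max_j h_n^y(j) \tag{7}$$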
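Decoding a landmark from its two 1D heatmaps can be sketched as follows. This is a minimal illustration, assuming each heatmap is a plain list of scores and that the peak index is scaled back to image coordinates by the ratio of face size to heatmap resolution. The names `decode_axis` and `decode_landmark` are hypothetical.

```python
def decode_axis(heatmap, face_size):
    """Return the image coordinate of the highest-scoring bin of a 1D heatmap."""
    resolution = len(heatmap)
    peak = max(range(resolution), key=heatmap.__getitem__)
    return peak * face_size / resolution

def decode_landmark(heatmap_x, heatmap_y, face_size):
    """Decode one landmark from its 1D heatmaps on the x and y axes."""
    return decode_axis(heatmap_x, face_size), decode_axis(heatmap_y, face_size)
```

Each axis is decoded independently, so the cost of decoding grows linearly with the resolution rather than quadratically.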


We also extend the detector to a tracker, as depicted in Fig. 2(b). The tracker $f_T$ with parameters $\theta_T$ predicts landmark positions in a facial video, denoted as $\{I_1, I_2, \ldots, I_T\}$, where $I_t$ is the $t$th frame of the video. For the $t$th frame, the tracker captures not only spatial patterns on $I_t$, but also temporal patterns inherent in the sequence of frames up to $I_t$. The tracker is formalized as Equation (8):

$$\{(h_{n,t}^x, h_{n,t}^y)\}_{n=1}^{N} = f_T(I_1, \ldots, I_t; \theta_T) \tag{8}$$

where $h_{n,t}^x$ and $h_{n,t}^y$ are the output heatmaps on the $t$th frame.

The detector and the tracker each output one 1D heatmap per axis for each of the $N$ landmarks. At heatmap resolution $L$, the total output size for a face is $2LN$, much smaller than that of the 2D heatmaps ($L^2 N$). In other words, the light-weight structure of the 1D heatmap allows us to boost its resolution to a large value without heavy space occupation, and therefore the quantization error is significantly alleviated.
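The gap between the two output structures can be made concrete with a short Python sketch. It only counts output values, not the intermediate activations that dominate actual GPU memory use, and the function names are hypothetical.

```python
def output_size_2d(num_landmarks, resolution):
    """Number of output values when every landmark gets a 2D heatmap."""
    return num_landmarks * resolution * resolution

def output_size_1d(num_landmarks, resolution):
    """Number of output values when every landmark gets two 1D heatmaps."""
    return 2 * num_landmarks * resolution

if __name__ == "__main__":
    # 68 landmarks at three times a 256-pixel face, as in the 300W setting.
    n, res = 68, 768
    print(output_size_2d(n, res))  # 40108032
    print(output_size_1d(n, res))  # 104448
```

The ratio between the two counts is half the resolution, so the saving grows as the resolution is raised.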

4.1 Detector

4.1.1 Generating 1D Heatmaps Using CNNs:

First, a facial representation F is learned from the input facial image through a stacked hourglass network [2], as shown in the left part of Fig. 2(a). Then, 1D heatmaps on the two axes are generated by two groups of CNNs, respectively, as depicted in the green border boxes of Fig. 2(a). The first group, composed of two CNNs, converts F to 1D heatmaps on the x axis by compressing features along the y axis with a striding operation. At the end of the group, a deconvolution module is adopted to generate heatmaps, and the heatmap resolution is proportional to the kernel and stride size of the deconvolution. The second group, also composed of two CNNs, generates heatmaps on the y axis by compressing features along the x axis.

4.1.2 Capturing Joint Distribution on the x and y Axes by Co-Attention:

Co-attention [16] is a category of attention methods which capture correlation between pairwise features. We design a co-attention module to capture the x-y joint distributions of landmark coordinates, as shown in the red border box of Fig. 2(a). First, we encode correlations between features representing distributions on the x and y axes as affinity matrices. Second, features are converted by the affinity matrices and then fused together to embed the joint distribution into their representations.

The co-attention mechanism is applied between the output features of the first CNN in each group. Both features have multiple channels and share the same shape. One feature represents the distribution on the x axis, while the other represents the distribution on the y axis. For each channel, two affinity matrices encode the correlation between the two features, as shown in Equation (9):


where P and Q are parameter matrices. The softmax operation is applied to each row vector of its input matrix, and $d$ is the column number of P and Q. Following Vaswani et al. [30], $\sqrt{d}$ is used as a normalization factor to keep the softmax away from the region with an extremely small gradient. Based on the affinity matrices, the two features are fused to capture the joint distribution on the x and y axes, as shown in Equation (10):


where a weight parameter controls the contribution of the attentive feature. Next, each fused feature is fed into the second CNN of its group as an input channel.
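The following Python sketch shows one plausible form of such a co-attention step. It is an assumption, not the exact formulation of Equations (9) and (10): the affinity is taken to be a scaled dot-product between linearly projected features with a row-wise softmax, and the fusion is taken to be a weighted residual sum. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def affinity(src, other, p, q):
    """Assumed affinity: softmax of scaled dot products of projected features."""
    d = len(p[0])  # column number of the parameter matrices
    scores = matmul(matmul(src, p), transpose(matmul(other, q)))
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])

def fuse(src, other, p, q, weight):
    """Assumed fusion: add the attended feature of the other axis, scaled by weight."""
    attended = matmul(affinity(src, other, p, q), other)
    return [[s + weight * a for s, a in zip(rs, ra)] for rs, ra in zip(src, attended)]
```

With a weight of zero the fused feature equals the original one, which matches the ablation setting in which the co-attention module is discarded.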

4.1.3 Loss Function:

The detector is trained by supervised regression. The error between the prediction and the ground truth is minimized as shown in Equation (11). The detector parameters are optimized by an Adam optimizer with a learning rate of 1e-4.


The ground truths are 1D heatmaps on the x and y axes. They are the marginal distributions of the ground truth 2D heatmap, as shown in Equation (12):
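The relation between the 2D ground truth and its 1D marginals can be checked numerically. The sketch below assumes a normalized discrete Gaussian as the 2D ground truth. Summing it along one axis then yields a normalized 1D Gaussian on the other axis. The function names, resolution, and center values are illustrative assumptions.

```python
import math

def gaussian_1d(center, resolution, sigma):
    """Normalized discrete 1D Gaussian heatmap."""
    weights = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(resolution)]
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_2d(cx, cy, resolution, sigma):
    """Normalized discrete 2D Gaussian heatmap, indexed as heatmap[y][x]."""
    weights = [
        [math.exp(-((i - cx) ** 2 + (j - cy) ** 2) / (2 * sigma ** 2)) for i in range(resolution)]
        for j in range(resolution)
    ]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

def marginals(heatmap_2d):
    """Sum a 2D heatmap over y and over x to obtain the two 1D marginals."""
    marginal_x = [sum(col) for col in zip(*heatmap_2d)]  # sum over y for each x
    marginal_y = [sum(row) for row in heatmap_2d]        # sum over x for each y
    return marginal_x, marginal_y
```

Because the Gaussian factorizes over the two axes, the 1D targets can be generated directly without ever materializing the 2D heatmap.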


4.2 Tracker

4.2.1 Integrating Spatial and Temporal Patterns:

First, spatial patterns on the current frame are encoded by the proposed detector, as shown in the left part of Fig. 2(b). For each frame, the detected heatmaps on the x and y axes are stacked into two matrices, one per axis.

Second, the detected heatmaps are refined by temporal patterns. This is beneficial because the facial appearance on the current frame may be unreliable due to “in the wild” disturbances, such as occlusions or uneven illumination. Integrating temporal patterns from previous frames may help locate landmarks when spatial features on the current frame are unreliable. As depicted in the green border boxes of Fig. 2(b), the two heatmap matrices are encoded into features by two encoders, one per axis. In the feature space, these features are fused with features from the past frames, as shown in Equation (13):


The fused features are embedded with temporal patterns, and a hyper-parameter attenuates the weights of frames far from the current one. To generate heatmap refinements, two decoders, one per axis, decode the fused features into refinement terms. These terms are added to the detected heatmaps on the x and y axes, and the sums are the tracking results.
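A temporal fusion of this kind can be sketched in Python as follows. The exponential decay over frame distance is an assumption made for illustration and may differ from the exact form of Equation (13). The encoder and decoder networks are omitted, and the names `fuse_temporal` and `refine` are hypothetical.

```python
def fuse_temporal(features, decay):
    """Fuse per-frame feature vectors (oldest first) with exponentially decaying weights.

    The current (last) frame has weight 1; a frame k steps in the past has weight decay**k.
    """
    current = len(features) - 1
    fused = [0.0] * len(features[0])
    for t, feature in enumerate(features):
        weight = decay ** (current - t)
        for d, value in enumerate(feature):
            fused[d] += weight * value
    return fused

def refine(detected_heatmap, refinement):
    """Add a decoded refinement term to a detected 1D heatmap."""
    return [h + r for h, r in zip(detected_heatmap, refinement)]
```

Under this assumed form, a decay of zero keeps only the current frame, which corresponds to discarding temporal refinement.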

4.2.2 Loss Function:

Similar to the detector, the tracker is also trained by supervised regression. The training loss is shown in Equation (14):


5 Experiments

5.1 Experimental Conditions

Facial landmark detection experiments are conducted on the 300W [23] and AFLW datasets, both of which are image datasets. Facial landmark tracking experiments are conducted on the 300VW [24] and Talking Face (TF) [12] datasets, which are video datasets.

The 300W dataset contains 68 pre-defined landmarks. For experiments on the 300W dataset, the proposed method is trained on its training set and evaluated on its public testing set, which is composed of a common subset and a challenging subset. To save space, in the rest of the paper these subsets are abbreviated as 300W com and 300W cha, respectively, and the full testing set as 300W full.

The AFLW dataset has 21 pre-defined landmarks. For experiments on the AFLW dataset, we just use 19 landmarks as previous work did [10]. The proposed method is trained on the training set with 20000 images, and evaluated on the testing set with 4386 images.

The 300VW dataset contains 68 pre-defined landmarks. It has a training set of 50 videos with a total of 95192 frames, and a testing set of 60 videos from three difficulty levels, i.e., well-lit (scenario 1), mildly unconstrained (scenario 2) and challenging (scenario 3). Their names are abbreviated as 300VW S1, S2, and S3, respectively. For experiments on the 300VW dataset, following Yin et al. [35], our method is trained on the union of the training sets of the 300VW and 300W datasets, and evaluated on the 300VW testing set. Since the 300W dataset only contains images with no temporal information, for the tracking task we only use it to train the detector inside the tracker.

Since the TF dataset only contains one video with 5000 frames, the method is trained on 300VW dataset and evaluated on the TF dataset. Due to the different landmark definitions between the TF and the 300VW dataset, we follow Liu et al. [14] to apply the seven common landmarks for testing.

The detector and tracker are evaluated by the accuracy of their predictions, which is quantified by the Normalized Root Mean Squared Error (NRMSE) between the predicted landmark coordinates and the ground truths. A lower NRMSE corresponds to a better accuracy. Following previous works [23, 10, 35], on the 300W, 300VW, and TF datasets, the error is normalized by the inter-ocular distance of a face. On the AFLW dataset, it is normalized by the face size.
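As an illustration of the metric, the sketch below computes the normalized error in the way it is commonly computed in face alignment work: the mean point-to-point distance over landmarks, divided by a normalizer such as the inter-ocular distance. The source does not spell out the formula, so this form and the function name `nrmse_percent` are assumptions.

```python
import math

def nrmse_percent(predicted, ground_truth, normalizer):
    """Mean landmark distance divided by a normalizer, in percent.

    predicted and ground_truth are lists of (x, y) pairs for one face;
    normalizer is, for example, the inter-ocular distance or the face size.
    """
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return 100.0 * sum(distances) / (len(distances) * normalizer)
```

The dataset-level score is then the average of this per-face value over all test faces.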

All experiments are conducted with TensorFlow 1.9.0 on an NVIDIA Tesla V100 GPU with 32 GiB of memory. A face is first cropped from the bounding box and scaled to 256×256 pixels, then fed into the detector or the tracker. The training batch size is set as . The resolution of the 1D heatmap is set to three times the face size. For the experiment on each dataset, the training set is split into 10 folds to conduct cross validation and select optimal values for the attentive-feature weight in Equation (10) and the past-frame weight in Equation (13). Both weights are searched over a predefined set of candidate values. The optimal attentive-feature weight is selected separately on the 300W, AFLW, and 300VW training sets, and the optimal past-frame weight is selected on the 300VW training set.

In the following parts, we quantitatively evaluate the proposed detector and tracker under different parameter settings, and compare them with related works. All hyper-parameters are assigned their optimal values except for the one under study. We also visualize the detection and tracking results of our method in the supplementary material.

| Dataset | 2D: 64 | 2D: 128 | 2D: 256 | 1D: 64 | 1D: 128 | 1D: 256 | 1D: 384 | 1D: 512 | 1D: 640 | 1D: 768 |
|---|---|---|---|---|---|---|---|---|---|---|
| Resolution / F | 0.25 | 0.5 | 1.0 | 0.25 | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 |
| 300W com (detector) | 3.53 | 3.32 | OOM | 3.50 | 3.34 | 3.19 | 3.11 | 3.03 | 2.96 | 2.91 |
| 300W cha (detector) | 6.46 | 6.14 | OOM | 6.22 | 5.89 | 5.74 | 5.52 | 5.37 | 5.36 | 5.31 |
| 300W full (detector) | 4.10 | 3.88 | OOM | 4.03 | 3.84 | 3.69 | 3.58 | 3.49 | 3.43 | 3.38 |
| 300VW S1 (tracker) | 4.53 | 4.27 | OOM | 3.61 | 3.47 | 3.37 | 3.29 | 3.20 | 3.12 | 3.06 |
| 300VW S2 (tracker) | 4.60 | 4.34 | OOM | 3.90 | 3.71 | 3.54 | 3.43 | 3.33 | 3.24 | 3.17 |
| 300VW S3 (tracker) | 5.97 | 5.72 | OOM | 4.75 | 4.52 | 4.39 | 4.32 | 4.24 | 4.18 | 4.12 |

Table 1: NRMSE (%) of the 2D heatmap-based detector, the proposed 1D heatmap-based detector, the 2D heatmap-based tracker and the proposed 1D heatmap-based tracker with different output resolutions. Column headers give the heatmap type (2D or 1D) and the output resolution. The input face resolution (F) is fixed as 256. OOM is the abbreviation of “Out of Memory”.

5.2 Analysis on the Effect of Different Heatmap Resolutions

Table 1 shows the NRMSE performance of the proposed detector and tracker with different heatmap resolutions. Due to the space limitation of the paper, Table 1 only displays the results on the 300W and the 300VW dataset. To compare with 2D heatmaps, we also implement a 2D heatmap-based detector and tracker and display their performance in Table 1. The 2D heatmap-based detector is the same as FAN [2], which takes the output of the stacked hourglass network as 2D heatmaps. Based on the 2D heatmaps predicted by the detector, the temporal recurrent learning tracker proposed by Xi et al. [19] is adopted as the compared tracker.

From Table 1, we have the following observations. First, when the heatmap resolution increases to 256, the 2D heatmap-based methods crash from memory overload, which means the memory requirement exceeds the capacity (32 GiB) of the GPU. The huge output complexity of 2D heatmaps restricts the resolution to a low value, i.e., 128, which is lower than the face size and causes large quantization errors. Second, compared to the 2D heatmap regression methods, the proposed method can achieve a much larger output resolution under limited GPU memory, because the output size of 1D heatmaps grows only linearly with the resolution and does not incur significant memory overhead. Third, as the resolution increases, NRMSE decreases. This is because with the increase of the resolution, the quantization error is reduced and more detailed spatial and temporal patterns are captured. As the resolution increases from 64 to 768, the NRMSE of our detector drops from 3.50 to 2.91, from 6.22 to 5.31, and from 4.03 to 3.38 on 300W com, cha, and full, and the NRMSE of our tracker drops from 3.61 to 3.06, from 3.90 to 3.17, and from 4.75 to 4.12 on the three scenarios of 300VW. The light-weight 1D heatmap allows us to fully boost its resolution despite limited space, and the quantization error is well alleviated. With a much higher output resolution, the proposed detector and tracker are significantly more accurate than the 2D heatmap-based detector and tracker.

Figure 3: NRMSE (%) performance on (a) 300W com, (b) 300W cha, (c) 300W full, (d) AFLW, (e) 300VW S1, (f) 300VW S2, (g) 300VW S3 and (h) TF with different .

5.3 Ablation Study for the Co-Attention Module

The proposed co-attention module shares distributional features between the x and y axes to implicitly capture joint distributions. According to Equation (10), the contribution of the attentive feature is controlled by a weight parameter. When this weight is 0, the co-attention module is discarded and the method only models marginal distributions separately on the two axes. As the weight increases, the share of the attentive feature in the fused representations grows. We display NRMSE performance with different weight values in Fig. 3.

From Fig. 3, we find that there is a significant boost in detection and tracking accuracy when the weight increases from 0.0 to 0.4. Specifically, NRMSE decreases on 300W com, cha and full, on the AFLW dataset, on the three scenarios of the 300VW dataset, and on the TF dataset. That demonstrates the effectiveness of the co-attention module.

5.4 Ablation Study for the Temporal Refinement Mechanism of the Tracker

The tracker refines the heatmaps predicted by the detector by integrating temporal patterns from past frames, as shown in Equation (13). The weight of features from past frames is determined by a hyper-parameter. We conduct an ablation study for the temporal refinement mechanism by comparing the results of two experimental settings. In the first setting, temporal refinement is discarded by setting this weight to zero. In the second setting, temporal refinement is kept by setting the weight to the optimal value found by cross validation. Results of the two settings are shown in Table 2, separated by a slash. From Table 2, we find that the temporal refinement mechanism is beneficial to tracking accuracy. With temporal refinement, NRMSE decreases from 3.31 to 3.06, from 3.45 to 3.17, and from 4.42 to 4.12 on the three scenarios of the 300VW dataset, and from 2.02 to 1.97 on the TF dataset. This comparison demonstrates the effectiveness of the temporal refinement mechanism.

Dataset 300VW S1 300VW S2 300VW S3 TF
NRMSE 3.31/3.06 3.45/3.17 4.42/4.12 2.02/1.97
Table 2: NRMSE (%) of the proposed tracker without/with temporal refinement.

5.5 Comparison with State-of-the-art Methods

The proposed method is compared to other state-of-the-art landmark localization methods. Among these approaches, coordinate regression methods include SDM [34], TSCN [25], IFA [1], CFSS [38], TCDCN [36], TSTN [14], DSRN [17], ODN [37], STA [29], Sun et al.’s work [26] and GAN [35]. Heatmap regression methods include HG [18], SAN [9], LAB [31], CNN-CRF [5], LaplaceKL [22], FHR [29], GHCU [15] and Chen et al.’s work [6]. Among these methods, CFSS, TCDCN, DSRN, ODN, HG, SAN, LaplaceKL, FHR, GHCU, Chen et al.’s work, LAB and CNN-CRF are detection methods. TSCN, TSTN, STA, Sun et al.’s work and GAN are tracking methods. SDM and IFA contain both a detection method and a tracking method. Some newly proposed semi-supervised or unsupervised learning methods [11, 10, 20] for landmark detection are trained under conditions different from ours and are therefore not included in our comparison.

Table 3 lists the NRMSE performance of the proposed detector and the compared methods on the image datasets, i.e., the 300W public testing set and the AFLW dataset. Tables 4 and 5 respectively list the NRMSE performance of the proposed tracker and the compared methods on the video datasets, i.e., the 300VW and the TF dataset. Performances of the compared methods are directly copied from the literature, except for that of FHR in Table 3, because we could not find its published results on the respective datasets; we re-implement it using its open source code. Other methods lacking published results or evaluated under different metrics or normalization standards are left blank.

From Table 3, we find that our method achieves state-of-the-art detection accuracy on the image datasets. We outperform FHR, which also tries to capture the fractional parts of coordinates and reduce quantization errors. FHR uses a Gaussian function to approximate the fractional parts of landmark coordinates from a heatmap. However, it still maintains a 2D heatmap structure with low resolution (128) as the output, which also suffers from the information loss caused by the quantization process. Our method significantly boosts the output resolution with a light-weight output structure, i.e., the 1D heatmap, which captures more detailed distributional information and significantly alleviates the quantization error. We also combine the proposed method with LAB [31], an approach that uses the heatmaps of the facial boundary to help locate landmarks, by substituting the 2D heatmap-based boundary recognizer used in LAB with our method. As shown in Table 3, the result of Ours+LAB outperforms the original LAB. This comparison further demonstrates the superiority of our method over methods based on 2D heatmaps.

| Method | SAN | CNN-CRF | DSRN | LAB | LaplaceKL | ODN | FHR | Chen et al. [6] | HG | Ours | Ours+LAB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 300W com | 3.34 | 3.33 | 4.12 | 2.98 | 3.19 | 3.56 | 3.04 | - | 3.30 | 2.91 | 2.75 |
| 300W cha | 6.60 | 6.29 | 9.68 | 5.19 | 6.87 | 6.67 | 6.21 | - | 5.69 | 5.31 | 4.94 |
| 300W full | 3.98 | 3.91 | 5.21 | 3.49 | 3.91 | 4.17 | 3.66 | - | 3.77 | 3.38 | 3.18 |
| AFLW | 1.91 | - | 1.86 | 1.25 | 1.97 | 1.63 | 1.58 | 1.39 | 1.95 | 1.32 | 1.20 |

Table 3: NRMSE (%) of the proposed detector and the compared methods on the image datasets.

300VW S1 7.41 12.54 7.68 7.66 5.36 3.85 3.56 4.82 4.21 3.50 3.06
300VW S2 6.18 7.25 6.42 6.77 4.51 3.46 3.88 4.23 4.02 3.67 3.17
300VW S3 13.04 13.13 13.67 14.98 12.84 7.51 5.02 7.09 5.64 4.43 4.12
Table 4: NRMSE (%) of the proposed tracker and the compared methods on the 300VW dataset.

TF 4.01 3.52 2.36 3.45 2.13 2.07 2.10 2.03 1.97
Table 5: NRMSE (%) of the proposed tracker and the compared methods on the TF dataset.

Figure 4: NRMSE (%) performance on each facial area.
Figure 5: Average value of coordinate variances within each facial area.

Tables 4 and 5 show that the proposed tracker achieves state-of-the-art tracking accuracy on the video datasets. Among the compared methods, GAN, a combination of coordinate regression and adversarial learning, achieves the best performance after our method. Our method achieves lower NRMSE than GAN on the 300VW scenarios 1, 2 and 3 and on the TF dataset. For fine-grained analysis, we gather all testing samples in the three scenarios of 300VW and group their landmarks into five facial areas, i.e., eyes, contour, nose, eyebrows and mouth. We display the NRMSE performance on each area in Fig. 4. We also calculate the variances of the x and y coordinates for each landmark and depict their average value within each area in Fig. 5. From Fig. 4, our method decreases NRMSE on every facial area compared to GAN, especially on the contour. From Fig. 5, we find that the contour area has the highest coordinate variances on both the x and y axes. Although GAN uses adversarial learning for spatio-temporal modeling and improves the robustness of landmark localization in challenging facial areas, it is still sub-optimal at handling coordinates with large variances due to the intrinsic weakness of the coordinate regression technique. We significantly alleviate this problem with the proposed heatmap regression methods, which are more adept at capturing spatial and temporal distributions.

6 Conclusion

To address the issues caused by the huge spatial complexity of 2D heatmaps, we propose a new method that predicts 1D heatmaps as marginal distributions on each axis to detect facial landmarks. Instead of explicitly modeling the joint distribution, which consumes much memory, we design a co-attention mechanism to share features between the x and y axes and implicitly capture joint distributions. The light-weight 1D heatmap structure enables us to boost the output resolution to a large value despite limited GPU memory, yielding more accurate predictions. Based on this idea, we propose a novel landmark detector and a tracker. The detector captures spatial patterns in an image, while the tracker further captures temporal patterns in a video through a temporal refinement mechanism. Experiments on the 300W, AFLW, 300VW and TF datasets demonstrate that the proposed detector and tracker outperform state-of-the-art methods. Our method can be extended to other applications that require key point detection, such as pose estimation.
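The memory and quantization argument summarized above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the network described in the paper: the Gaussian target shape, the argmax decoding and all function names are hypothetical, and the landmark count and resolution used in the usage note are example values.

```python
import math


def gaussian_1d(center, length, sigma=2.0):
    """1D heatmap of `length` bins with a Gaussian bump at `center` (bin units)."""
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(length)]


def decode(heatmap_x, heatmap_y):
    """Argmax decoding (an assumed scheme) of one landmark from its two 1D heatmaps."""
    x = max(range(len(heatmap_x)), key=heatmap_x.__getitem__)
    y = max(range(len(heatmap_y)), key=heatmap_y.__getitem__)
    return x, y


def heatmap_cells(num_landmarks, resolution):
    """Output values needed for 2D heatmaps versus paired 1D heatmaps."""
    cells_2d = num_landmarks * resolution ** 2
    cells_1d = num_landmarks * 2 * resolution
    return cells_2d, cells_1d


def quantization_error(coord, resolution):
    """Error after snapping a normalized coordinate in [0, 1) to the nearest bin."""
    bin_idx = min(round(coord * resolution), resolution - 1)
    return abs(coord - bin_idx / resolution)
```

For example, `heatmap_cells(68, 256)` compares 68 * 256 * 256 values for 2D heatmaps with 68 * 2 * 256 values for 1D heatmaps, a reduction by a factor of 128. For a coordinate away from the upper border, `quantization_error` is at most half a bin, i.e. `0.5 / resolution`, so raising the resolution tightens the bound.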


  • [1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic (2014) Incremental face alignment in the wild. In CVPR, pp. 1859–1866. Cited by: §5.5.
  • [2] A. Bulat and G. Tzimiropoulos (2017) How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, pp. 1021–1030. Cited by: §1, §2.2, §4.1.1, §5.2.
  • [3] X. P. Burgos-Artizzu, P. Perona, and P. Dollár (2013) Robust face landmark estimation under occlusion. In ICCV, pp. 1513–1520. Cited by: §2.1.
  • [4] X. Cao, Y. Wei, F. Wen, and J. Sun (2014) Face alignment by explicit shape regression. IJCV 107 (2), pp. 177–190. Cited by: §2.1.
  • [5] L. Chen, H. Su, and Q. Ji (2019) Deep structured prediction for facial landmark detection. In NeurIPS, Cited by: §2.2, §5.5.
  • [6] Y. Chen, C. Shen, H. Chen, X. Wei, L. Liu, and J. Yang (2019) Adversarial learning of structure-aware fully convolutional networks for landmark localization. TPAMI. Cited by: §2.2, §5.5, Table 3.
  • [7] G. G. Chrysos, E. Antonakos, P. Snape, A. Asthana, and S. Zafeiriou (2018) A comprehensive performance evaluation of deformable face tracking “in-the-wild”. IJCV 126 (2-4), pp. 198–232. Cited by: §1.
  • [8] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang (2017) Multi-context attention for human pose estimation. In CVPR, pp. 5669–5678. Cited by: §2.2.
  • [9] X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Style aggregated network for facial landmark detection. In CVPR, pp. 379–388. Cited by: §5.5.
  • [10] X. Dong and Y. Yang (2019) Teacher supervises students how to learn from partially labeled images for facial landmark detection. In ICCV, Cited by: §5.1, §5.1, §5.5.
  • [11] X. Dong, S. Yu, X. Weng, S. Wei, Y. Yang, and Y. Sheikh (2018) Supervision-by-registration: an unsupervised approach to improve the precision of facial landmark detectors. In CVPR, pp. 360–368. Cited by: §5.5.
  • [12] FGNET (2014) Talking face video. Cited by: §5.1.
  • [13] Y. Li, S. Wang, Y. Zhao, and Q. Ji (2013) Simultaneous facial feature tracking and facial expression recognition. IEEE Transactions on Image Processing 22 (7), pp. 2559–2573. Cited by: §2.1.
  • [14] H. Liu, J. Lu, J. Feng, and J. Zhou (2018) Two-stream transformer networks for video-based face alignment. TPAMI 40 (11), pp. 2546–2554. Cited by: §2.1, §2.1, §5.1, §5.5.
  • [15] Z. Liu, X. Zhu, G. Hu, H. Guo, M. Tang, Z. Lei, N. M. Robertson, and J. Wang (2019) Semantic alignment: finding semantically consistent ground-truth for facial landmark detection. In CVPR, Cited by: §2.2, §5.5.
  • [16] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, pp. 289–297. Cited by: §4.1.2.
  • [17] X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang (2018) Direct shape regression networks for end-to-end face alignment. In CVPR, pp. 5040–5049. Cited by: §5.5.
  • [18] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, pp. 483–499. Cited by: §1, §2.2, §5.5.
  • [19] X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas (2016) A recurrent encoder-decoder network for sequential face alignment. In ECCV, pp. 38–56. Cited by: §2.2, §5.2.
  • [20] S. Qian, K. Sun, W. Wu, C. Qian, and J. Jia (2019) Aggregation via separation: boosting facial landmark detector with semi-supervised style translation. In ICCV, Cited by: §5.5.
  • [21] S. Ren, X. Cao, Y. Wei, and J. Sun (2014) Face alignment at 3000 FPS via regressing local binary features. In CVPR, pp. 1685–1692. Cited by: §2.1.
  • [22] J. P. Robinson, Y. Li, N. Zhang, Y. Fu, and S. Tulyakov (2019) Laplace landmark localization. In ICCV, Cited by: §5.5.
  • [23] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic (2016) 300 faces in-the-wild challenge: database and results. Image and vision computing 47, pp. 3–18. Cited by: §5.1, §5.1.
  • [24] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic (2015) The first facial landmark tracking in-the-wild challenge: benchmark and results. In ICCV Workshops, pp. 50–58. Cited by: §5.1.
  • [25] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NIPS, pp. 568–576. Cited by: §5.5.
  • [26] K. Sun, W. Wu, T. Liu, S. Yang, Q. Wang, Q. Zhou, Z. Ye, and C. Qian (2019) FAB: a robust facial landmark detection framework for motion-blurred videos. In ICCV, Cited by: §5.5, Table 4.
  • [27] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In ECCV, pp. 536–553. Cited by: §1, §1, §2.2, §2.2.
  • [28] Y. Sun, X. Wang, and X. Tang (2013) Deep convolutional network cascade for facial point detection. In CVPR, pp. 3476–3483. Cited by: §2.1, §2.1.
  • [29] Y. Tai, Y. Liang, X. Liu, L. Duan, J. Li, C. Wang, F. Huang, and Y. Chen (2019) Towards highly accurate and stable face alignment for high-resolution videos. In AAAI, pp. 8893–8900. Cited by: §1, §2.2, §2.2, §5.5.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §4.1.2.
  • [31] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou (2018) Look at boundary: A boundary-aware face alignment algorithm. In CVPR, pp. 2129–2138. Cited by: §2.2, §5.5, §5.5.
  • [32] Y. Wu and Q. Ji (2019) Facial landmark detection: A literature survey. IJCV 127 (2), pp. 115–142. Cited by: §1.
  • [33] Y. Wu, Z. Wang, and Q. Ji (2014) A hierarchical probabilistic model for facial feature detection. In CVPR, pp. 1781–1788. Cited by: §2.1, §2.1.
  • [34] X. Xiong and F. De la Torre (2013) Supervised descent method and its applications to face alignment. In CVPR, pp. 532–539. Cited by: §2.1, §5.5.
  • [35] S. Yin, S. Wang, G. Peng, X. Chen, and B. Pan (2019) Capturing spatial and temporal patterns for facial landmark tracking through adversarial learning. In IJCAI, pp. 1010–1017. Cited by: §2.1, §5.1, §5.1, §5.5.
  • [36] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2016) Learning deep representation for face alignment with auxiliary attributes. TPAMI 38 (5), pp. 918–930. Cited by: §2.1, §2.1, §5.5.
  • [37] M. Zhu, D. Shi, M. Zheng, and M. Sadiq (2019) Robust facial landmark detection via occlusion-adaptive deep networks. In CVPR, Cited by: §5.5.
  • [38] S. Zhu, C. Li, C. Change Loy, and X. Tang (2015) Face alignment by coarse-to-fine shape searching. In CVPR, pp. 4998–5006. Cited by: §5.5.