## I Introduction

Facial landmark detection (FLD), also known as face alignment, refers to locating predefined landmarks (eye corners, nose tip, mouth corners, etc.) on a face. As a topical issue in computer vision, FLD has drawn much attention recently, as it provides rich geometric information for other face analysis tasks such as face recognition [huang2017learning, cao2019fr], face verification [Sun2014DeepLF], face frontalization [Hassner2015EffectiveFF] and 3D face reconstruction [Garrido2016ReconstructionOP].

In recent years, heatmap regression-based methods [Yang2017StackedHN, Dong2018StyleAN, Liu2019SemanticAF] have become one of the mainstream approaches to FLD. Owing to their effective encoding of part constraints and context information, heatmap regression-based FLD methods achieve considerable performance on faces under variations in pose, expression, occlusion and illumination. However, as shown in Fig. 1, for faces with severe occlusions or extremely large poses, their accuracy degrades greatly because 1) such methods predict landmarks in isolation and therefore weaken or even lose the shape/geometric constraints between landmarks, and 2) occlusions or large poses may mislead the network in learning feature representations and shape/geometric constraints. Recent works [Gao2018GlobalSP, Dai2019SecondOrderAN] have shown that second-order statistics in deep convolutional neural networks can effectively model the spatial and channel correlations between features and mine the useful information contained in the features themselves; they explore more representative information and are more discriminative than first-order statistics. However, how to introduce such high-order information to enhance shape/geometric constraints (i.e., to generate more effective landmark and boundary heatmaps) and improve the performance of facial landmark detection remains an open question.

In this paper, we propose a Multi-order Multi-constraint Deep Network (MMDN) for both implicit and explicit shape constraints learning (see Fig. 2). Specifically, we propose an Implicit Multi-order Correlating Geometry-aware (IMCG) model to enhance shape constraints by introducing the multi-order spatial correlations and multi-order channel correlations. The multi-order spatial correlations contain both the auto-correlation of intra-layer features and the cross-correlation of inter-layer features, which can be used to capture hierarchical patterns and attain global receptive fields. The multi-order channel correlations can be utilized to selectively emphasize informative features and suppress less useful ones by considering both the first-order and second-order statistics of features, thus performing feature recalibration and improving the quality of representations. Moreover, an Explicit Probability-based Boundary-adaptive Regression (EPBR) method is developed to handle facial landmark detection by integrating boundary constraints and specific semantically consistent constraints into the final predicted landmarks, enhancing the robustness to faces with large poses and heavy occlusions. Therefore, our method obtains better robustness and accuracy compared with other state-of-the-art FLD methods.

The main contributions of this work are summarized as follows:

1) With the well-designed IMCG model, multi-order spatial and channel correlations can be introduced to implicitly enhance geometric constraints and context information for more effective feature representations when dealing with complicated cases, especially for faces with extremely large poses and heavy occlusions.

2) An EPBR method is presented to explore how to incorporate the boundary heatmap regression with landmark heatmap regression to generate boundary-adaptive landmark heatmaps through the proposed searching mechanism, and further explicitly add both boundary and semantically consistent constraints to the final predicted landmarks. Hence, the EPBR method is able to enhance the robustness when the dataset is corrupted by strong noise.

3) A novel algorithm called MMDN is developed to seamlessly integrate the IMCG model and EPBR method into a Multi-order Multi-constraint Deep Network to handle facial landmark detection in the wild. With more effective representations and regression, our algorithm outperforms state-of-the-art methods on challenging benchmark datasets such as COFW [Burgosartizzu2013Robust], 300W [Sagonas2016300FI], AFLW [Zhu2016UnconstrainedFA] and WFLW [Wu2018LookAB].

The rest of the paper is organized as follows. Section II gives an overview of the related work. Section III shows the proposed method, including the IMCG model and the EPBR method. A series of experiments are conducted to evaluate the performance of the proposed method in Section IV. Finally, Section V concludes the paper.

## II Related Work

During the past decades, plenty of facial landmark detection (FLD) methods have been proposed in the computer vision area. In general, existing methods can be categorized into three groups: model-based methods, coordinate regression-based methods, and heatmap regression-based methods.

Model-based methods. The model-based FLD methods, represented by the active shape model [cootes1995active], the active appearance model [cootes2001active] and the constrained local model [Cristinacce2006FeatureDA], depend on the goodness of the error function. They use parametric models (a facial shape model [cootes1995active], a facial appearance model [cootes2001active] or a part model [Cristinacce2006FeatureDA]) to constrain shape variations and improve the performance of FLD. The active appearance model [cootes2001active] uses Principal Component Analysis to establish a facial shape model and an appearance model; these two models are then combined to enhance shape constraints and texture information for more accurate landmark detection. The constrained local model [Cristinacce2006FeatureDA] first constructs a shape model and a patch model, and then detects more accurate landmarks by searching and matching over the neighborhood regions around each landmark of the initial shape. By jointly optimizing a global shape model and a part-based appearance model with an efficient and robust Gauss-Newton optimization, Gauss-Newton deformable part models [Tzimiropoulos2014GaussNewtonDP] improve performance while reducing computational cost. However, these methods are sensitive to faces under large poses and partial occlusions.

Coordinate regression-based methods. This category of FLD methods directly learns the mapping from facial appearance features to landmark coordinate vectors by using different kinds of regressors, such as ferns [cao2014face, ren2014face], convolutional neural networks [Merget2018RobustFL, shi2018deep, Trigeorgis2016MnemonicDM] or hourglass networks [Yang2017StackedHN]. In explicit shape regression [cao2014face] and local binary features [ren2014face], ferns and random forests are used respectively to extract features and regress; with these two excellent regressors, both achieve good results. The tasks-constrained deep convolutional network [zhang2014facial] uses a convolutional neural network for feature extraction, exploits auxiliary attributes in the fitting process, and applies a multi-task learning framework to improve the performance of face alignment. The mnemonic descent method [Trigeorgis2016MnemonicDM] uses a convolutional recurrent neural network to extract task-based features and enhance the generalization ability of the algorithm against faces with large poses and partial occlusions. Merget et al. [Merget2018RobustFL] directly introduce global context into a fully convolutional neural network, where kernel convolutions and dilated convolutions are utilized to smooth the gradients and reduce over-fitting, thus enhancing robustness. Wu et al. [Wu2018LookAB] first introduce an adversarial network and message passing layers to generate more robust boundary heatmaps; the generated boundary heatmaps are then used as part of the input to enhance shape constraints and further improve the accuracy of FLD. Occlusion-adaptive deep networks [Zhu2019RobustFL] combine a geometry-aware module, a distillation module and a low-rank learning module to overcome the occlusion problem for robust FLD. With such discriminative features and effective constraints, these methods are more robust to faces with variations in pose, expression, illumination and partial occlusion.

Heatmap regression-based methods. Heatmap regression-based FLD methods [Yang2017StackedHN, Dong2018StyleAN, Liu2019SemanticAF] directly predict the landmark coordinates by traversing the generated landmark heatmaps. These methods better encode part constraints and context information, and effectively drive the network to focus on the important regions in facial landmark detection, thus achieving state-of-the-art accuracy. Yang et al. [Yang2017StackedHN] first adopt a supervised face transformation to reduce the variance of the regression target and then use a stacked hourglass network to increase the capacity of the regression model; by paying more attention to features with high confidence in an explicit manner, the robustness of the method is enhanced. Dong et al. propose a style-aggregated network [Dong2018StyleAN] that generates different styles of images from one image; the generated images and the original ones are used together to train a more robust model that can explicitly handle problems caused by the variation of image styles in FLD. Liu et al. [Liu2019SemanticAF] propose a novel latent variable optimization strategy to find semantically consistent annotations and alleviate random noise during training, so that the ground-truth shape can be updated and the predicted shape becomes more accurate.

Until now, most heatmap regression-based FLD methods have not fully utilized inter-feature and inter-channel dependencies, and they ignore information of order higher than first, so they cannot effectively exploit the discriminative power of features. Recent works [Gao2018GlobalSP, Dai2019SecondOrderAN] have shown that second-order statistics in deep convolutional neural networks are more discriminative than first-order ones. Moreover, heatmap regression-based FLD methods predict landmarks in isolation, which causes some predicted landmarks to deviate from the facial boundary, and the shape constraints among landmarks may be lost during training. Inspired by these observations, we propose a Multi-order Multi-constraint Deep Network that incorporates the IMCG model with the EPBR method for robust FLD.

## III Robust Facial Landmark Detection by Multi-order Multi-constraint Deep Networks

In this section, we first elaborate on the proposed Implicit Multi-order Correlating Geometry-aware (IMCG) model in Section III-A, and then present the Explicit Probability-based Boundary-adaptive Regression (EPBR) method in Section III-B. Section III-C illustrates the proposed Multi-order Multi-constraint Deep Network, and Section III-D describes its optimization.

### III-A Implicit Multi-order Correlating Geometry-aware (IMCG) Model

Heatmap regression-based methods [Yang2017StackedHN, Dong2018StyleAN, Liu2019SemanticAF] have achieved state-of-the-art accuracy as they can effectively encode the part constraints and context information. However, they suffer from performance degradation on faces with extremely large poses and heavy occlusions, as in these cases the extracted features are not robust enough and the facial geometric constraints (e.g., part constraints and global constraints) among landmarks may be lost. More recently, high-order information [Gao2018GlobalSP, Wang2019DeepGG] has been shown to be beneficial for obtaining more robust and discriminative representations in many computer vision tasks. However, how to effectively introduce and utilize high-order information to enhance shape constraints for robust FLD is still challenging. Hence, in this paper, an Implicit Multi-order Correlating Geometry-aware (IMCG) model is proposed to explore more discriminative representations and more effective geometric constraints by introducing high-order information, i.e., the multi-order spatial correlations and the multi-order channel correlations. The proposed IMCG model contains multiple multi-order spatial correlations modules (see Fig. 3) and one multi-order channel correlations module (see Fig. 4), which can be integrated into a stacked hourglass network to generate more effective and accurate heatmaps for robust FLD.

#### III-A1 Multi-order Spatial Correlations Module

The stacked hourglass network [Yang2017StackedHN, Wu2018LookAB] is able to extract multi-scale discriminative features thanks to its bottom-up and top-down structure, and can function as a regressor to predict the final landmarks. The proposed multi-order spatial correlations module follows an hourglass network unit (see Fig. 3). We use $\mathbf{P}$ and $\mathbf{Q}$ to denote the input and output of the hourglass network unit, reshaped to $\mathbf{P}, \mathbf{Q} \in \mathbb{R}^{N \times C}$ with $N = W \times H$, where $W$, $H$ and $C$ represent the width, height and number of channels of the feature maps, respectively. $\mathbf{P}$ and $\mathbf{Q}$ carry the first-order spatial correlations that can be introduced by max or average pooling of features in neural networks. $\mathbf{Q}_i$ denotes the $i$-th row of the matrix $\mathbf{Q}$ and represents the feature corresponding to the $i$-th pixel of the feature maps, so $\mathbf{Q}_i^T \mathbf{Q}_j$ can be regarded as the pairwise interaction of two features in the feature maps. Therefore, we use the outer products $\mathbf{P}^T\mathbf{P}$ and $\mathbf{Q}^T\mathbf{Q}$ to represent the auto-correlation of intra-layer features for capturing geometric correlations of facial landmark regions, and $\mathbf{P}^T\mathbf{Q}$ to denote the cross-correlation of inter-layer features for capturing landmark geometric correlations at different scales (deep layer and shallow layer). As the outer product operation is similar to a quadratic kernel expansion and is indeed a non-local operation, it can effectively model local pairwise feature correlations for capturing long-range dependencies [Zhu2019RobustFL, lin2017bilinear]. Therefore, $\mathbf{P}^T\mathbf{P}$, $\mathbf{Q}^T\mathbf{Q}$ and $\mathbf{P}^T\mathbf{Q}$ can effectively preserve long-range geometric constraints, and we compute their sum to construct the second-order spatial correlations:

$$\mathbf{S} = \mathbf{P}^T\mathbf{P} + \mathbf{Q}^T\mathbf{Q} + \mathbf{P}^T\mathbf{Q} \tag{1}$$

where $\mathbf{S} \in \mathbb{R}^{C \times C}$ represents the second-order spatial correlations. The second-order spatial correlations capture the auto-correlation of intra-layer features and the cross-correlation of inter-layer features, which helps effectively mine the information inherent in different convolutional layers and explore more discriminative representations for enhancing geometric constraints.

To better explore the spatial correlations, we multiply $\mathbf{Q}$ and $\mathbf{S}$ to obtain the third-order spatial correlations $\mathbf{T}$:

$$\mathbf{T} = \mathbf{Q}\,\mathbf{S} \tag{2}$$

The third-order spatial correlations can be regarded as high-order cross-layer attention feature maps, i.e., they transform the pairwise feature correlations into a high-order cross-layer attention that acts on the original output feature maps $\mathbf{Q}$. Based on this, we derive the multi-order spatial correlations $\mathbf{M}$, which contain the first-order, second-order and third-order spatial correlations:

$$\mathbf{M} = \mathbf{Q} + \gamma\,\mathbf{T} \tag{3}$$

where $\mathbf{M}$ denotes the multi-order spatial correlations. The weight $\gamma$ is initialized to 0 and gradually learns to assign a larger weight to $\mathbf{T}$ during training. Hence, by fusing the first-order, second-order and third-order spatial correlations, the multi-order spatial correlations module can capture effective hierarchical patterns, which are further used to enhance geometric constraints. The module is then integrated into stacked hourglass networks, so that the multi-order spatial correlations can be updated and propagated to obtain more effective geometric constraints for robust FLD.
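As a concrete illustration, the module above can be sketched in a few lines of NumPy. The names `P`, `Q` and `gamma` follow the notation assumed for Eqs. (1)-(3), and the sketch omits the convolutional plumbing of the hourglass unit:

```python
import numpy as np

def multi_order_spatial(P, Q, gamma=0.0):
    """Sketch of the multi-order spatial correlations module.

    P, Q : (N, C) arrays -- the hourglass unit's input and output
           feature maps, flattened so that N = W * H spatial positions
           each carry a C-dimensional feature (row vectors).
    gamma: learnable fusion weight, initialized to 0 as in the text.
    """
    # Second-order correlations (Eq. 1): intra-layer auto-correlations
    # plus the inter-layer cross-correlation, all of shape (C, C).
    S = Q.T @ Q + P.T @ P + P.T @ Q
    # Third-order correlations (Eq. 2): project the output features
    # through S, acting as a cross-layer attention on Q.
    T = Q @ S
    # Multi-order fusion (Eq. 3): residual combination with Q itself.
    return Q + gamma * T
```

With `gamma` initialized to 0 the module starts as an identity mapping on `Q`, so the correlation terms are blended in only as training assigns them weight.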

#### III-A2 Multi-order Channel Correlations Module

With multiple multi-order spatial correlations modules, we can obtain effective spatial correlations and robust representations for enhancing the geometric constraints. Then, to fully explore the discriminative ability of features, we further propose a multi-order channel correlations module that adaptively re-calibrates channel-wise feature maps by explicitly modeling inter-channel dependencies. As shown in Fig. 4, the multi-order channel correlations module mainly contains a first-order channel correlations block and a second-order channel correlations block. By introducing the global average pooling and global covariance pooling [Li2017TowardsFT] operations, the first-order and second-order channel correlations can be introduced to effectively model the inter-channel dependencies. Then, by fusing the first-order and second-order channel correlations, the multi-order channel correlations module can selectively emphasize informative features and suppress less useful ones, i.e., perform more effective channel re-calibration.

Global covariance pooling. Compared with average pooling, covariance pooling [Li2017TowardsFT] can better explore the distributions of features and realize the full potential of CNNs. Moreover, covariance pooling introduces second-order statistics, which are more discriminative and representative than first-order statistics. Therefore, we begin by briefly reviewing the global covariance pooling operation. Let $\mathbf{X} \in \mathbb{R}^{N \times C}$ denote the output (after ReLU) of the IMCG model, reshaped so that $N = W \times H$, where $W$, $H$ and $C$ denote the width, height and number of channels of the feature maps, respectively. The sample covariance matrix can then be computed as:

$$\boldsymbol{\Sigma} = \mathbf{X}^T \bar{\mathbf{I}} \mathbf{X} \tag{4}$$

$$\bar{\mathbf{I}} = \frac{1}{N}\left(\mathbf{I} - \frac{1}{N}\mathbf{1}\right) \tag{5}$$

where $\mathbf{I}$ and $\mathbf{1}$ are the $N \times N$ identity matrix and matrix of all ones, respectively.
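In NumPy, the covariance computation of Eqs. (4)-(5) reduces to a couple of matrix products. This is a minimal sketch assuming the features have already been flattened to an `(N, C)` matrix:

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance of Eqs. (4)-(5): Sigma = X^T @ Ibar @ X,
    with Ibar = (1/N) * (I - (1/N) * ones), for X of shape (N, C).

    Equivalent to centering the rows of X and averaging the outer
    products of the centered features.
    """
    N = X.shape[0]
    Ibar = (np.eye(N) - np.ones((N, N)) / N) / N
    return X.T @ Ibar @ X
```

The result is a symmetric positive semi-definite `(C, C)` matrix, matching the biased (divide-by-N) sample covariance of the channel features.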

$\boldsymbol{\Sigma}$ is a symmetric positive semi-definite matrix, which has a unique square root that can be computed accurately by Eigenvalue Decomposition (EVD) or Singular Value Decomposition (SVD):

$$\boldsymbol{\Sigma} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T \tag{6}$$

where $\mathbf{U}$ is an orthogonal matrix and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \dots, \lambda_C)$ denotes the diagonal matrix of eigenvalues of $\boldsymbol{\Sigma}$. The square root of $\boldsymbol{\Sigma}$ can then be computed as:

$$\mathbf{Y} = \boldsymbol{\Sigma}^{\frac{1}{2}} = \mathbf{U} \boldsymbol{\Lambda}^{\frac{1}{2}} \mathbf{U}^T \tag{7}$$

where $\boldsymbol{\Lambda}^{\frac{1}{2}} = \mathrm{diag}(\lambda_1^{\frac{1}{2}}, \dots, \lambda_C^{\frac{1}{2}})$. To obtain $\mathbf{Y}$, we would need to perform EVD or SVD on $\boldsymbol{\Sigma}$. However, neither is well supported on GPUs, which leads to inefficient training. Therefore, following [Li2017TowardsFT], the Newton-Schulz iteration [Higham2008FunctionsOM] is applied to compute the square root $\mathbf{Y}$ efficiently. Specifically, given $\mathbf{Y}_0 = \boldsymbol{\Sigma}$ and $\mathbf{Z}_0 = \mathbf{I}$, for $k = 1, \dots, K$, the Newton-Schulz iteration is updated alternately as follows:

$$\mathbf{Y}_k = \frac{1}{2}\mathbf{Y}_{k-1}\left(3\mathbf{I} - \mathbf{Z}_{k-1}\mathbf{Y}_{k-1}\right) \tag{8}$$

$$\mathbf{Z}_k = \frac{1}{2}\left(3\mathbf{I} - \mathbf{Z}_{k-1}\mathbf{Y}_{k-1}\right)\mathbf{Z}_{k-1} \tag{9}$$

where $K$ is the number of iterations. After several iterations, $\mathbf{Y}_k$ and $\mathbf{Z}_k$ converge to $\boldsymbol{\Sigma}^{\frac{1}{2}}$ and $\boldsymbol{\Sigma}^{-\frac{1}{2}}$, respectively. However, the Newton-Schulz iteration only converges locally. Following [Li2017TowardsFT], we therefore first pre-normalize $\boldsymbol{\Sigma}$ by its trace (or Frobenius norm), and then perform a post-compensation step to restore the data magnitude changed by the pre-normalization, producing the final normalized covariance matrix:

$$\hat{\mathbf{Y}} = \sqrt{\mathrm{tr}(\boldsymbol{\Sigma})}\,\mathbf{Y}_K \tag{10}$$

where $\mathrm{tr}(\boldsymbol{\Sigma})$ denotes the trace of $\boldsymbol{\Sigma}$. With such global covariance pooling, our proposed multi-order channel correlations module is ready to utilize the second-order statistics to enhance the representative ability of features.
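A minimal NumPy sketch of the trace-normalized Newton-Schulz iteration of Eqs. (8)-(10); `K` is the iteration count, and this is an illustration of the numerical scheme rather than the paper's exact GPU implementation:

```python
import numpy as np

def matrix_sqrt_ns(Sigma, K=10):
    """Newton-Schulz iteration for the matrix square root (Eqs. 8-10).

    Pre-normalize by the trace so the iteration converges (all
    eigenvalues are pulled into (0, 1]), then post-compensate to
    restore the original magnitude.
    """
    C = Sigma.shape[0]
    tr = np.trace(Sigma)
    A = Sigma / tr                       # pre-normalization
    Y, Z = A.copy(), np.eye(C)
    for _ in range(K):
        T = 0.5 * (3.0 * np.eye(C) - Z @ Y)
        Y = Y @ T                        # Y_k -> A^{1/2}
        Z = T @ Z                        # Z_k -> A^{-1/2}
    return np.sqrt(tr) * Y               # post-compensation (Eq. 10)
```

Only matrix multiplications appear in the loop, which is exactly why this scheme is GPU-friendly compared with EVD or SVD.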

The proposed multi-order channel correlations module contains a first-order channel correlations block and a second-order channel correlations block (see Fig. 4). The normalized covariance matrix characterizes the second-order channel correlations, which can be used as a channel descriptor by the multi-order channel correlations module to perform channel re-calibration. At the same time, the first-order channel correlations can be introduced by the first-order channel correlations block. Then, by integrating the first-order channel correlations block and the second-order channel correlations block into a multi-order channel correlations module, the multi-order channel correlations can be utilized to perform effective channel re-calibration and enhance the representative capability of features.

First-order channel correlations block. Let $\mathbf{X} = [\mathbf{X}_1, \dots, \mathbf{X}_C]$; the first-order channel correlations $\mathbf{z} \in \mathbb{R}^{C}$ can be obtained by applying global average pooling to $\mathbf{X}$ (see Fig. 4). The $c$-th dimension of $\mathbf{z}$ is computed as:

$$z_c = F_{GAP}\left(\mathbf{X}_c\right) = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H} \mathbf{X}_c(i, j) \tag{11}$$

where $F_{GAP}(\cdot)$ denotes global average pooling, through which the first-order channel correlations are introduced. After that, a simple gating mechanism [Hu2017SqueezeandExcitationN] with a sigmoid activation is employed to learn a non-mutually-exclusive channel relationship, which ensures that multiple channels are allowed to be emphasized (rather than enforcing a one-hot activation). The whole process can be stated as:

$$\boldsymbol{\omega} = \sigma\left(\mathbf{W}_U\left(\mathbf{W}_D\,\mathbf{z}\right)\right) \tag{12}$$

where $\sigma$ denotes the sigmoid activation and $\boldsymbol{\omega}$ denotes the scaling factor. $\mathbf{W}_D$ and $\mathbf{W}_U$ are the parameters of convolution layers, whose output channel dimensions are set to $C/r$ and $C$, respectively. The output of the first-order channel correlations block can be stated as follows:

$$\tilde{\mathbf{X}}_c = \omega_c \cdot \mathbf{X}_c \tag{13}$$

where $\omega_c$ and $\mathbf{X}_c$ denote the scaling factor and the feature map of the $c$-th channel, respectively. With the first-order channel correlations block, the first-order channel correlations are able to perform channel recalibration.

Second-order channel correlations block. Let $\hat{\mathbf{Y}} = [\mathbf{y}_1, \dots, \mathbf{y}_C]$; $\mathbf{y}_c$ can be used to represent the second-order correlations between the $c$-th channel and all channels. The $c$-th dimension of the second-order channel correlations $\mathbf{v} \in \mathbb{R}^{C}$ is computed as:

$$v_c = \frac{1}{C}\sum_{j=1}^{C} \mathbf{y}_c(j) \tag{14}$$

To fully exploit feature inter-dependencies from the information aggregated by global covariance pooling, we apply a gating mechanism. As explored in [Hu2017SqueezeandExcitationN], the simple sigmoid function can serve as a proper gating function:

$$\boldsymbol{\omega}' = \sigma\left(\mathbf{W}'_U\left(\mathbf{W}'_D\,\mathbf{v}\right)\right) \tag{15}$$

where $\boldsymbol{\omega}'$ denotes the scaling factor. $\mathbf{W}'_D$ and $\mathbf{W}'_U$ are the parameters of convolution layers, whose output channel dimensions are set to $C/r$ and $C$, respectively. The output of the second-order channel correlations block can be represented as follows:

$$\bar{\mathbf{X}}_c = \omega'_c \cdot \mathbf{X}_c \tag{16}$$

where $\omega'_c$ and $\mathbf{X}_c$ denote the scaling factor and the feature map of the $c$-th channel, respectively. After that, we combine the first-order and second-order channel correlations blocks to obtain the multi-order channel correlations:

$$\mathbf{X}^{mc} = \tilde{\mathbf{X}} + \bar{\mathbf{X}} \tag{17}$$
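The two gating blocks and their fusion can be sketched as follows. The weight matrices stand in for the 1x1 convolutions of the gating blocks (channel dims C -> C/r -> C); all names are illustrative, and for brevity the sketch uses the plain sample covariance in place of the normalized square-root covariance of Eq. (10):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_order_channel(X, W1d, W1u, W2d, W2u):
    """Sketch of the multi-order channel correlations module.

    X : (N, C) flattened feature maps (N = W * H).
    W*d, W*u : stand-ins for the 1x1 convolutions of the first- and
               second-order gating blocks (C -> C/r -> C).
    """
    N, C = X.shape
    # First-order descriptor (Eq. 11): global average pooling per channel.
    z = X.mean(axis=0)
    w1 = sigmoid(W1u @ (W1d @ z))          # Eq. (12): gating weights
    X1 = X * w1                            # Eq. (13): recalibration
    # Second-order descriptor (Eq. 14): row means of the channel
    # covariance matrix (global covariance pooling).
    Xc = X - X.mean(axis=0, keepdims=True)
    Sigma = Xc.T @ Xc / N
    v = Sigma.mean(axis=1)
    w2 = sigmoid(W2u @ (W2d @ v))          # Eq. (15)
    X2 = X * w2                            # Eq. (16)
    # Eq. (17): fuse the two recalibrated feature maps.
    return X1 + X2
```

Because both gates are sigmoids, each output channel is a softly re-weighted copy of the input channel, and the fusion simply adds the two recalibrations.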

(17) |

By integrating the multi-order spatial correlations module and multi-order channel correlations module via the proposed IMCG model, the multi-order spatial correlations and the multi-order channel correlations can be utilized to enhance the geometrical constraints and context information for generating more robust and accurate heatmaps.

### III-B Explicit Probability-based Boundary-adaptive Regression (EPBR) Method

The facial boundary information [Wu2018LookAB] has been shown to be beneficial for facial landmark detection, as almost all landmarks are defined to lie on facial boundaries. By utilizing facial boundary heatmaps, Wu et al. [Wu2018LookAB] effectively exploit the boundary information to enhance the shape constraints between landmarks and improve the accuracy of FLD. However, for faces with extremely large poses or heavy occlusions, it is hard to generate accurate boundary heatmaps and to impose effective shape constraints on the predicted landmarks, resulting in performance degradation. Therefore, in this paper, we propose an Explicit Probability-based Boundary-adaptive Regression (EPBR) method to explore how boundary information can be used for robust facial landmark detection. Specifically, by integrating boundary heatmap regression and landmark heatmap regression via the proposed searching mechanism, the EPBR method generates more accurate and effective boundary-adaptive landmark heatmaps and then explicitly adds both boundary and semantically consistent constraints to the final predicted landmarks to improve the accuracy of FLD.

In the probabilistic view, training a CNN-based landmark detector can be formulated as a likelihood maximization problem:

$$\theta^{*} = \arg\max_{\theta}\; P\left(\mathbf{s} \mid \mathbf{I}; \theta\right) \tag{18}$$

where $\mathbf{s} = \{(x_i, y_i)\}_{i=1}^{N_L}$ are the coordinates of the ground-truth landmarks and $N_L$ is the number of landmarks, $\mathbf{I}$ denotes the input image and $\theta$ represents the parameters of the neural network. For heatmap regression-based FLD methods, each pixel value of a heatmap acts as the confidence that a particular landmark lies at that pixel; hence, the whole heatmap represents a probability distribution over the image. In order to introduce the boundary information during learning, Eq. (18) can be reformulated as:

$$\theta^{*} = \arg\max_{\theta}\; P\left(\mathbf{B}, \mathbf{L} \mid \mathbf{I}; \theta\right) \tag{19}$$

where $\mathbf{B}$ denotes the boundary heatmaps and $\mathbf{L}$ represents the landmark heatmaps. We aim to generate more accurate and effective boundary-adaptive landmark heatmaps by designing and optimizing $P(\mathbf{B}, \mathbf{L} \mid \mathbf{I}; \theta)$.

#### III-B1 Landmark Heatmap

The ground-truth landmark heatmap is generated by a two-dimensional Gaussian distribution:

$$\mathbf{L}_i^{*}(p) = \exp\left(-\frac{\left\|p - s_i\right\|_2^2}{2\sigma^2}\right) \tag{20}$$

where $s_i = (x_i, y_i) \in \mathbf{s}^{*}$ and $\mathbf{s}^{*}$ denotes the ground-truth face shape; $\mathbf{L}_i^{*}$ denotes the ground-truth heatmap of the $i$-th landmark, and $p$ is a pixel in $\mathbf{L}_i^{*}$. For any position $p$, the more closely the heatmap region around $p$ follows a standard Gaussian, the more likely $p$ is the landmark position $s_i$. Therefore, we can use the distribution distance between the predicted and the ground-truth landmark heatmap to model $P(\mathbf{L} \mid \mathbf{I}; \theta)$; that is, the Jensen-Shannon divergence loss (JSDL) is used. Compared with the mean squared error, 1) JSDL pays more attention to the foreground area of the heatmaps rather than treating the whole heatmap equally, and 2) JSDL accurately measures the difference between two distributions. Therefore, JSDL can be utilized to generate more effective heatmaps and better fit deep neural networks:

$$JS\left(P(\mathbf{L}_i)\,\|\,P(\mathbf{L}_i^{*})\right) = \frac{1}{2} KL\left(P(\mathbf{L}_i)\,\Big\|\,\frac{P(\mathbf{L}_i) + P(\mathbf{L}_i^{*})}{2}\right) + \frac{1}{2} KL\left(P(\mathbf{L}_i^{*})\,\Big\|\,\frac{P(\mathbf{L}_i) + P(\mathbf{L}_i^{*})}{2}\right) \tag{21}$$

where $P(\mathbf{L}_i)$ and $P(\mathbf{L}_i^{*})$ denote the distributions of the generated and the ground-truth landmark heatmaps, respectively, and $KL(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence. After that, the joint probability $P(\mathbf{L} \mid \mathbf{I}; \theta)$ can be modeled as follows:

$$P\left(\mathbf{L} \mid \mathbf{I}; \theta\right) = \prod_{i=1}^{N_L} \exp\left(-JS\left(P(\mathbf{L}_i)\,\|\,P(\mathbf{L}_i^{*})\right)\right) \tag{22}$$

where $i$ denotes the landmark index. By optimizing Eq. (22), we can generate more accurate and effective landmark heatmaps.
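Generating a ground-truth landmark heatmap and scoring it with the Jensen-Shannon divergence can be sketched as follows. Heatmaps are normalized into distributions before the divergence is taken; `sigma` and the normalization are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=1.5):
    """Ground-truth landmark heatmap in the sense of Eq. (20):
    a 2-D Gaussian centered on the landmark (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (Eq. 21) between two heatmaps,
    treated as probability distributions after normalization."""
    p = p.ravel() / (p.sum() + eps)
    q = q.ravel() / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the KL divergence, the JS divergence is symmetric and bounded by log 2, which makes it a well-behaved distance between predicted and ground-truth heatmaps.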

#### III-B2 Boundary Heatmap

Following LAB [Wu2018LookAB], there are 13 boundary heatmaps, each associated with its corresponding landmarks. For each boundary, the landmarks on it are first interpolated to obtain a dense boundary line and then transformed into a ground-truth boundary heatmap using a Gaussian distribution with standard deviation $\sigma$. Specifically, we first use a binary boundary map and a distance transform function to obtain the distance map $D_j$, and then transform the distance map into the ground-truth boundary heatmap $\mathbf{B}_j^{*}$ (see Fig. 5). With such facial boundary heatmaps, the geometric structure of the human face can be identified easily, and more accurate landmarks can be detected. The whole process of generating the ground-truth boundary heatmap from the distance map can be illustrated as follows:

$$\mathbf{B}_j^{*}(p) = \begin{cases} \exp\left(-\dfrac{D_j(p)^2}{2\sigma^2}\right), & \text{if } D_j(p) < \epsilon \\ 0, & \text{otherwise} \end{cases} \tag{23}$$

where $\epsilon$ is a small constant and $j = 1, \dots, 13$ denotes the boundary index. We also use the JSDL to measure the distribution difference:

$$JS\left(P(\mathbf{B}_j)\,\|\,P(\mathbf{B}_j^{*})\right) = \frac{1}{2} KL\left(P(\mathbf{B}_j)\,\Big\|\,\frac{P(\mathbf{B}_j) + P(\mathbf{B}_j^{*})}{2}\right) + \frac{1}{2} KL\left(P(\mathbf{B}_j^{*})\,\Big\|\,\frac{P(\mathbf{B}_j) + P(\mathbf{B}_j^{*})}{2}\right) \tag{24}$$

where $P(\mathbf{B}_j)$ and $P(\mathbf{B}_j^{*})$ denote the distributions of the generated and the ground-truth boundary heatmaps, respectively. After that, the joint probability $P(\mathbf{B} \mid \mathbf{I}; \theta)$ can also be modeled as a product of boundary heatmap similarities maximized over all boundaries:

$$P\left(\mathbf{B} \mid \mathbf{I}; \theta\right) = \prod_{j=1}^{13} \exp\left(-JS\left(P(\mathbf{B}_j)\,\|\,P(\mathbf{B}_j^{*})\right)\right) \tag{25}$$
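A brute-force sketch of the boundary-heatmap construction: interpolated boundary points are distance-transformed and passed through the Gaussian of Eq. (23). The `thresh` parameter plays the role of the small constant cutoff, and a real implementation would use an efficient distance transform (e.g., `scipy.ndimage.distance_transform_edt`) rather than the pairwise loop below:

```python
import numpy as np

def boundary_heatmap(h, w, boundary_pts, sigma=1.5, thresh=None):
    """Ground-truth boundary heatmap in the spirit of Eq. (23).

    boundary_pts : iterable of (x, y) points on the (dense,
                   interpolated) boundary line.
    For every pixel, compute the distance to the nearest boundary
    point, apply a Gaussian fall-off, and zero out far-away pixels.
    """
    thresh = 3.0 * sigma if thresh is None else thresh
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.asarray(boundary_pts, dtype=float)        # (K, 2) as (x, y)
    # Brute-force distance map: min distance to any boundary point.
    d2 = ((xs[..., None] - pts[:, 0]) ** 2 +
          (ys[..., None] - pts[:, 1]) ** 2).min(axis=-1)
    D = np.sqrt(d2)
    B = np.exp(-D ** 2 / (2.0 * sigma ** 2))
    B[D >= thresh] = 0.0
    return B
```

The resulting heatmap is 1 on the boundary line itself and decays smoothly into a narrow band around it, which is the continuity that makes boundary heatmaps robust to occlusions.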

#### III-B3 Boundary-adaptive Landmark Heatmap

After generating the landmark heatmaps and boundary heatmaps, we can predict each landmark by traversing the generated landmark heatmaps, and the accuracy of the predicted landmarks is improved by using the boundary information. However, for faces with large poses and partial occlusions, a predicted landmark $s_i^{p}$ may deviate from the corresponding facial boundary (i.e., it does not lie on the boundary) or from the ground-truth landmark $s_i$ (it lies on the boundary but far away from $s_i$). As the boundary heatmaps are more robust to partial occlusions and large poses due to their continuous characteristics, we first construct the boundary-adaptive landmark heatmaps $\mathbf{H}_i$ and then search for the optimal landmark around the predicted landmark in $\mathbf{H}_i$. $\mathbf{H}_i$ can be formulated as follows:

$$\mathbf{H}_i = \mathbf{L}_i \odot \mathbf{B}_{j(i)} \tag{26}$$

where $\mathbf{H}_i$ denotes the boundary-adaptive landmark heatmap of the $i$-th landmark, $\mathbf{B}_{j(i)}$ represents the corresponding boundary heatmap and $\odot$ denotes element-wise multiplication. Through this operation, the boundary information is ready to improve the accuracy of the predicted landmarks.

#### III-B4 Searching Mechanism

For the ground-truth boundary-adaptive landmark heatmaps, $\mathbf{L}_i^{*}$, $\mathbf{B}_{j(i)}^{*}$ and $\mathbf{H}_i^{*}$ should represent the same landmark. However, under large poses and heavy occlusions, the predicted landmark $s_i^{p}$ usually deviates from the ground-truth landmark $s_i$. Hence, a searching mechanism is proposed that first searches for the optimal landmark $s_i^{o}$ around $s_i^{p}$ in $\mathbf{H}_i$, and then uses the error between $s_i^{o}$ and $s_i$ to fit the neural network, so that semantically consistent boundary-adaptive landmark heatmaps can be generated. The optimal $s_i^{o}$ should therefore meet two criteria. First, $s_i^{o}$ should have high confidence in the generated boundary-adaptive landmark heatmap, which corresponds to the term $\mathbf{H}_i(s_i^{o})$. Second, $s_i^{o}$ should be close to $s_i^{p}$ (i.e., $\mathbf{L}_i$, $\mathbf{B}_{j(i)}$ and $\mathbf{H}_i$ should represent the same landmark), which corresponds to a Gaussian closeness term. Hence, $P(\mathbf{s}^{o})$ can be defined as the Gaussian similarity over all pairs:

$$P\left(\mathbf{s}^{o}\right) = \prod_{i=1}^{N_L} \mathbf{H}_i\left(s_i^{o}\right) \exp\left(-\frac{\left\|s_i^{o} - s_i^{p}\right\|_2^2}{2\sigma^2}\right) \tag{27}$$
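The boundary-adaptive fusion and the searching mechanism can be sketched together as follows. `radius` (the size of the search window around the prediction) is an assumed hyper-parameter, and the scoring follows the confidence-times-closeness form of Eq. (27):

```python
import numpy as np

def search_optimal_landmark(L, B, pred, sigma=1.5, radius=4):
    """Sketch of the boundary-adaptive searching mechanism.

    L, B  : landmark and corresponding boundary heatmaps, shape (H, W).
    pred  : integer (x, y) landmark predicted from L alone.
    Eq. (26) fuses the heatmaps by element-wise multiplication; each
    candidate pixel around `pred` is then scored by its fused
    confidence times a Gaussian closeness term, and the best is kept.
    """
    H = L * B                                     # Eq. (26)
    h, w = H.shape
    px, py = pred
    best, best_score = pred, -np.inf
    for y in range(max(0, py - radius), min(h, py + radius + 1)):
        for x in range(max(0, px - radius), min(w, px + radius + 1)):
            closeness = np.exp(-((x - px) ** 2 + (y - py) ** 2)
                               / (2.0 * sigma ** 2))
            score = H[y, x] * closeness
            if score > best_score:
                best, best_score = (x, y), score
    return best
```

Intuitively, a prediction that drifts off the facial boundary is pulled back onto it: off-boundary candidates score zero in the fused heatmap, while the closeness term keeps the correction local.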

Finally, by integrating $P(\mathbf{L} \mid \mathbf{I}; \theta)$, $P(\mathbf{B} \mid \mathbf{I}; \theta)$ and $P(\mathbf{s}^{o})$ into one framework and taking the logarithm, the objective function of the EPBR method is formulated as follows:

$$\max_{\theta}\; \log P\left(\mathbf{L} \mid \mathbf{I}; \theta\right) + \log P\left(\mathbf{B} \mid \mathbf{I}; \theta\right) + \lambda \log P\left(\mathbf{s}^{o}\right) \tag{28}$$

where $\lambda$ denotes the weight corresponding to $\log P(\mathbf{s}^{o})$. By incorporating the JSDL with the searching mechanism, the EPBR method can better fit the neural networks and generate more effective boundary-adaptive landmark heatmaps for more robust FLD.

Table I: The detailed network structure of the proposed MMDN.

| Layer | Shape_in | Shape_out | Kernel/Stride |
| --- | --- | --- | --- |
| input | - | - | |
| conv/batch_norm | | | |
| residual block | | | |
| avg_pooling | | | |
| residual block | | | |
| conv | | | |
| multi-order spatial correlations module | | | |
| branch1 | | | |
| max_pooling1 | | | |
| residual block | | | |
| branch2 | | | |
| max_pooling2 | | | |
| residual block | | | |
| branch3 | | | |
| max_pooling3 | | | |
| residual block | | | |
| branch4 | | | |
| max_pooling4 | | | |
| residual block | | | |
| upsampling4 | | | - |
| add_branch4 | | | - |
| residual block | | | |
| upsampling3 | | | - |
| add_branch3 | | | - |
| residual block | | | |
| upsampling2 | | | - |
| add_branch2 | | | - |
| residual block | | | |
| upsampling1 | | | - |
| add_branch1 | | | - |
| residual block (Q) | | | |
| multi-order channel correlations module | | | |
| GAP | | | - |
| conv1_1 | | | /1 |
| conv1_2 | | | /1 |
| GCP | | | - |
| conv2_1 | | | /1 |
| conv2_2 | | | /1 |
| deconv | | | /2 |
| out | | | /1 |

### III-C Multi-order Multi-constraint Deep Networks

The proposed Implicit Multi-order Correlating Geometry-aware (IMCG) model obtains more discriminative representations by introducing high-order information (i.e., the multi-order spatial correlations and the multi-order channel correlations) to implicitly enhance the geometric constraints and context information. Then, the Explicit Probability-based Boundary-adaptive Regression (EPBR) method is proposed to incorporate the boundary heatmap regression with the landmark heatmap regression and to further explicitly add both boundary constraints and semantically consistent constraints to the final predicted landmarks. Finally, by integrating the IMCG model and the EPBR method into a Multi-order Multi-constraint Deep Network (MMDN) via a seamless formulation, we can generate more accurate landmark heatmaps and boundary heatmaps, which can be further used to improve the accuracy of FLD under large poses and heavy occlusions. The detailed network structure of the proposed MMDN is shown in Table I.

### III-D Optimization

Combining Eqs. (18), (19), (22), (25), (27) and (28), the objective function of MMDN can be reformulated as:

(29)

where the minimization is over the parameters of the proposed MMDN. Given an input face image, MMDN outputs the corresponding landmark heatmaps and boundary heatmaps. We then construct and search the boundary-adaptive landmark heatmaps to obtain the optimal landmarks.

Since the proposed MMDN contains multiple variables, we design an iterative algorithm to optimize them alternately. In each iteration, the network parameters of MMDN are first fixed, and the optimal landmarks are predicted according to the boundary-adaptive landmark heatmaps. The predicted optimal landmarks are then fixed, and the network parameters are updated under the supervision of the objective in Eq. (31). The whole optimization process iterates between the following two steps.

Step 1: With the network parameters fixed, we have the following sub-problem:

(30)

where all the variables are known except the landmark position being searched for. We search by traversing all the pixels around the initially predicted landmark, and the pixel with the maximum value in Eq. (30) is the solution.
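As a minimal NumPy illustration of the local search in Step 1 (function name and window radius are hypothetical; the score being maximized is the boundary-adaptive heatmap value of Eq. (30)):

```python
import numpy as np

def search_optimal_landmark(heatmap, init_xy, radius=4):
    """Refine a landmark by taking the highest-response pixel within a
    (2*radius+1)^2 window around the initial prediction, clipped to the
    heatmap borders."""
    h, w = heatmap.shape
    x0, y0 = init_xy
    xs = slice(max(x0 - radius, 0), min(x0 + radius + 1, w))
    ys = slice(max(y0 - radius, 0), min(y0 + radius + 1, h))
    window = heatmap[ys, xs]
    # Index of the maximum response inside the search window.
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    return (xs.start + dx, ys.start + dy)

# Toy example: the true peak sits 2 pixels right of the initial guess.
hm = np.zeros((64, 64))
hm[30, 34] = 1.0
print(search_optimal_landmark(hm, (32, 30)))  # -> (34, 30)
```

In the paper's formulation the searched window is centered on the landmark predicted from the landmark heatmap, and the scores come from the fused boundary-adaptive landmark heatmap rather than a raw heatmap.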

Step 2: When the optimal landmarks are fixed, we have the following sub-problem:

(31)

The optimization is a typical network training process under the supervision of Eq. (31). With the proposed IMCG model, we can obtain more representative and discriminative features. Then, by integrating the IMCG model and the EPBR method into the MMDN, the optimal landmarks can be easily predicted, which helps achieve more robust FLD.

## IV Experiments

In this section, we first introduce the evaluation settings, including the datasets and the methods used for comparison. Then, we compare our algorithm with state-of-the-art FLD methods on challenging benchmark datasets: 300W [Sagonas2016300FI], COFW [Burgosartizzu2013Robust], AFLW [Zhu2016UnconstrainedFA] and WFLW [Wu2018LookAB].

### IV-A Datasets and Implementation Details

300W (68 landmarks): It combines several existing datasets, including LFPW [Belhumeur2011LocalizingPO], AFW [Zhu2012FaceDP], Helen [le2012interactive] and IBUG [Sagonas2016300FI]. With a total of 3148 training pictures, the training set is made up of the training sets of AFW, LFPW and Helen, while the testing set includes 689 images drawn from the IBUG dataset and the LFPW and Helen test sets. The testing set is further divided into three subsets: 1) the Challenging subset (i.e., the IBUG dataset), which contains 135 images collected from more general “in the wild” scenarios, making experiments on it more challenging; 2) the Common subset (554 images: 224 from the LFPW test set and 330 from the Helen test set); 3) the Full set (689 images, combining the Challenging and Common subsets).

COFW (29 landmarks): It is another very challenging dataset focusing on occlusions, published by Burgos-Artizzu et al. [Burgosartizzu2013Robust]. It contains 1345 training images, while its testing set contains 507 face images with heavy occlusions and large pose and expression variations.

AFLW (19 landmarks): It contains 25993 face images with large variations in yaw and pitch angles, and the pictures also differ greatly in pose and occlusion. The AFLW-full protocol divides 24386 of the images into two parts: 20000 for training and 4386 for testing. AFLW-frontal selects 1165 of the 4386 testing images to evaluate alignment algorithms on frontal faces.

WFLW (98 landmarks): It contains 10000 faces (7500 for training and 2500 for testing) annotated with 98 landmarks. Apart from landmark annotations, WFLW also provides rich attribute annotations (occlusion, pose, make-up, illumination, blur and expression) that can be used for a comprehensive analysis of existing algorithms.

Evaluation metric The Normalized Mean Error (NME) [Belhumeur2011LocalizingPO, cao2014face] is commonly used to evaluate FLD methods. First, the error between the predicted and ground-truth positions is calculated; it is then normalized to avoid larger errors on higher-resolution images. For 300W [Sagonas2016300FI] and COFW [Burgosartizzu2013Robust], the NME is normalized by the inter-pupil distance; for AFLW [Zhu2016UnconstrainedFA], by the face size; and for WFLW [Wu2018LookAB], by the inter-ocular distance.
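The metric above can be sketched in a few lines (function and parameter names are hypothetical, not from the paper):

```python
import numpy as np

def nme(pred, gt, norm):
    """Normalized Mean Error: mean point-to-point Euclidean error divided
    by a normalizing distance (inter-pupil, inter-ocular, or face size,
    depending on the dataset protocol).

    pred, gt : (N, 2) arrays of landmark coordinates
    norm     : scalar normalizing distance
    """
    errors = np.linalg.norm(pred - gt, axis=1)  # per-landmark error
    return errors.mean() / norm

pred = np.array([[10.0, 10.0], [20.0, 20.0]])
gt = np.array([[10.0, 13.0], [24.0, 20.0]])
print(nme(pred, gt, norm=35.0))  # mean error 3.5 / 35 -> 0.1
```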

Implementation Details In our experiments, all training and testing images are cropped and resized according to the provided bounding boxes. To perform data augmentation, we randomly sample the rotation angle and the bounding-box scale from a Gaussian distribution. We use the Hourglass Network [Yang2017StackedHN] as the backbone to construct the proposed Multi-order Multi-constraint Deep Network. Training MMDN takes 100000 iterations, and a staircase function sets the learning rate: the initial learning rate is divided by 5, 2 and 2 at iterations 5000, 20000 and 50000, respectively. Two hyperparameters are set to 3, and two others are set to 4 and 10, respectively. During the search for the optimal landmark, we predict it from a region centered on the initially detected landmark in the generated boundary-adaptive landmark heatmap. MMDN is trained with PyTorch on 8 Nvidia Tesla V100 GPUs.
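The staircase learning-rate schedule described above can be sketched as follows (a minimal illustration; `base_lr` is an assumed placeholder, not the paper's initial value):

```python
def staircase_lr(iteration, base_lr=1e-4):
    """Staircase schedule: divide the learning rate by 5, 2 and 2 at
    iterations 5000, 20000 and 50000, as in the training setup described
    in the text. base_lr is a placeholder value, not from the paper."""
    lr = base_lr
    for boundary, divisor in [(5000, 5), (20000, 2), (50000, 2)]:
        if iteration >= boundary:
            lr /= divisor
    return lr

assert staircase_lr(0) == 1e-4            # before the first boundary
assert abs(staircase_lr(10000) - 2e-5) < 1e-15   # after /5
assert abs(staircase_lr(60000) - 5e-6) < 1e-15   # after /5, /2, /2
```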

Experiment Settings To better evaluate the effectiveness of each proposed module, we first use the Stacked Hourglass Network (SHN) [Yang2017StackedHN] as the baseline, and then separately combine SHN with the proposed multi-order spatial correlations (MSC) module, the multi-order channel correlations (MCC) module, the IMCG model and the EPBR method. We also combine SHN with a first-order channel correlations (FCC) block to test the effectiveness of channel re-calibration. The detailed experimental results are shown below.

### IV-B Evaluation under Normal Circumstances

Faces in the 300W common subset and 300W full set have fewer variations on the head pose, facial expression and occlusion. Therefore, we evaluate the effectiveness of our method under normal circumstances on these two subsets. Table II shows the experimental results in comparison to the existing benchmarks. From Table II, we can see that our method achieves 3.17% NME on the 300W common subset and 3.74% NME on the 300W full set, which outperforms state-of-the-art methods on faces under normal circumstances. These results indicate that our MMDN can improve the accuracy of FLD under normal circumstances, mainly because 1) the IMCG model can obtain more discriminative representations by introducing the multi-order spatial and channel correlations to enhance the geometrical constraints. 2) the EPBR method can effectively incorporate the boundary heatmap regression with landmark heatmap regression and further explicitly add both boundary and semantically consistent constraints to the predicted landmarks.

Method | Common | Challenging | Full
---|---|---|---

LBF [ren2014face] | 4.95 | 11.98 | 6.32 |

RAR(ECCV16) [Xiao2016Robust] | 4.12 | 8.35 | 4.94 |

DCFE(ECCV18) [Valle2018ADC] | 3.83 | 7.54 | 4.55 |

Wing(CVPR18) [Feng2017WingLF] | 3.27 | 7.18 | 4.04 |

LAB(CVPR18) [Wu2018LookAB] | 3.42 | 6.98 | 4.12 |

SBR(CVPR18) [Dong2018SupervisionbyRegistrationAU] | 3.28 | 7.58 | 4.10 |

Liu et al. (CVPR19) [Liu2019SemanticAF] | 3.45 | 6.38 | 4.02 |

ODN(CVPR19) [Zhu2019RobustFL] | 3.56 | 6.67 | 4.17 |

SHN | 4.43 | 7.56 | 5.04 |

SHN+FCC | 4.22 | 7.26 | 4.82 |

SHN+MCC | 4.07 | 7.04 | 4.65 |

SHN+MSC | 3.67 | 6.63 | 4.25 |

SHN+IMCG | 3.43 | 6.38 | 4.01 |

SHN+EPBR | 3.82 | 6.97 | 4.44 |

SHN+IMCG+EPBR (MMDN) | 3.17 | 6.08 | 3.74 |

Method | NME | Failure
---|---|---

human | 5.6 | - |

PCPR [Burgosartizzu2013Robust] | 8.50 | 20.00 |

HPM [Ghiasi2014Occlusion] | 7.50 | 13.00 |

CCR [Feng2015Cascaded] | 7.03 | 10.9 |

DRDA [Zhang2016OcclusionFreeFA] | 6.46 | 6.00 |

RAR [Xiao2016Robust] | 6.03 | 4.14 |

DAC-CSR(CVPR17) [Feng2017DynamicAC] | 6.03 | 4.73 |

CAM [Wan2019FaceAB] | 5.95 | 3.94 |

CRD [wan2020robust] | 5.72 | 3.76 |

LAB(CVPR18) [Wu2018LookAB] | 5.58 | 2.76 |

ODN(CVPR19) [Zhu2019RobustFL] | 5.30 | - |

SHN | 6.21 | 5.52 |

SHN+FCC | 6.01 | 4.54 |

SHN+MCC | 5.86 | 3.75 |

SHN+MSC | 5.52 | 2.76 |

SHN+IMCG | 5.32 | 2.17 |

SHN+EPBR | 5.62 | 3.16 |

SHN+IMCG+EPBR (MMDN) | 5.01 | 1.78 |

Method | Testset | Pose Subset | Expression Subset | Illumination Subset | Make-Up Subset | Occlusion Subset | Blur Subset
---|---|---|---|---|---|---|---

ESR [cao2014face] | 11.13 | 25.88 | 11.47 | 10.49 | 11.05 | 13.75 | 12.20 |

SDM [xiong2013supervised] | 10.29 | 24.10 | 11.45 | 9.32 | 9.38 | 13.03 | 11.28 |

CCFS(CVPR15) [zhu2015face] | 9.07 | 21.36 | 10.09 | 8.30 | 8.74 | 11.76 | 9.96 |

LAB(CVPR18) [Wu2018LookAB] | 5.27 | 10.24 | 5.51 | 5.23 | 5.15 | 6.79 | 6.32 |

Wing(CVPR18) [Feng2017WingLF] | 5.11 | 8.75 | 5.36 | 4.93 | 5.41 | 6.37 | 5.81 |

SHN | 6.78 | 14.42 | 7.19 | 6.13 | 6.21 | 7.97 | 7.64 |

SHN+IMCG | 5.25 | 8.32 | 5.21 | 4.87 | 4.98 | 6.42 | 6.01 |

SHN+EPBR | 5.46 | 8.55 | 5.46 | 5.15 | 5.21 | 6.76 | 6.39 |

SHN+IMCG+EPBR (MMDN) | 4.87 | 8.15 | 4.99 | 4.61 | 4.72 | 6.17 | 5.72 |

### IV-C Evaluation of Robustness against Occlusion

Most state-of-the-art methods have achieved promising FLD results in constrained environments. However, when a face image suffers from heavy occlusions and complicated illumination, these methods degrade greatly. To test the performance of MMDN on occluded faces, we conduct experiments on three heavily occluded benchmark datasets: the COFW dataset, the 300W challenging subset and the WFLW dataset.

On the COFW dataset, the failure rate is defined as the percentage of test images with more than 10% detection error. As illustrated in Table III, we compare the proposed MMDN with other representative methods on the COFW dataset. Our MMDN reduces the NME to 5.01% and the failure rate to 1.78%, outperforming the state-of-the-art methods. This suggests that the proposed IMCG model and EPBR method play an important role in addressing occlusion problems.
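The failure-rate metric just defined can be sketched as (hypothetical names; errors are assumed to already be NMEs):

```python
import numpy as np

def failure_rate(nmes, threshold=0.10):
    """Fraction of test images whose normalized error exceeds the
    threshold (10% under the COFW protocol)."""
    nmes = np.asarray(nmes)
    return float((nmes > threshold).mean())

# Two of four images exceed the 10% threshold.
print(failure_rate([0.05, 0.12, 0.08, 0.20]))  # -> 0.5
```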

We also compare our approach against the state-of-the-art methods on the 300W challenging subset in Table II and Fig. 6. As shown in Table II, our method achieves 6.08% NME on the 300W challenging subset, outperforming state-of-the-art methods on occluded faces. Furthermore, the Cumulative Error Distribution (CED) curve in Fig. 6 also shows that our model achieves superior performance compared with other methods.
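A CED curve like the one in Fig. 6 reports, for each error threshold, the fraction of test images whose NME falls below it; a minimal sketch (hypothetical names):

```python
import numpy as np

def ced(nmes, thresholds):
    """Cumulative Error Distribution: fraction of images whose NME is at
    most each threshold. Plotting thresholds (x) against these fractions
    (y) gives the CED curve."""
    nmes = np.asarray(nmes)
    return [float((nmes <= t).mean()) for t in thresholds]

# One of three images is within 4% error; all three are within 10%.
print(ced([0.03, 0.05, 0.09], [0.04, 0.10]))
```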

The WFLW dataset contains challenging subsets such as the Illumination, Make-Up and Occlusion subsets. As shown in Table IV, our MMDN outperforms other state-of-the-art FLD methods on them, which further demonstrates its effectiveness.

Hence, from the experimental results on the COFW dataset, the 300W challenging subset and the WFLW dataset, we can conclude that 1) the IMCG model can be used to enhance the geometric constraints and context information for faces with heavy occlusions. 2) by fusing the boundary heatmap regression and landmark heatmap regression via the proposed EPBR method, we can generate more effective boundary-adaptive landmark heatmaps, and more accurate landmarks can be predicted. 3) by integrating the proposed IMCG model and EPBR method into a Multi-order Multi-constraint Deep Network (MMDN), our method is more robust against occlusions.

Method | full | frontal
---|---|---

SDM [xiong2013supervised] | 4.05 | 2.94 |

PCPR [Burgosartizzu2013Robust] | 3.73 | 2.87 |

ERT [Kazemi2014OneMF] | 4.35 | 2.75 |

LBF [ren2014face] | 4.25 | 2.74 |

CCL [Zhu2016UnconstrainedFA] | 2.72 | 2.17 |

DAC-CSR(CVPR17) [Feng2017DynamicAC] | 2.27 | 1.81 |

SAN [Dong2018StyleAN](CVPR18) | 1.91 | 1.85 |

ODN [Zhu2019RobustFL](CVPR19) | 1.63 | 1.38 |

SHN | 2.46 | 1.92 |

SHN+FCC | 2.28 | 1.78 |

SHN+MCC | 2.17 | 1.69 |

SHN+MSC | 1.74 | 1.48 |

SHN+IMCG | 1.62 | 1.37 |

SHN+EPBR | 2.05 | 1.58 |

SHN+IMCG+EPBR (MMDN) | 1.41 | 1.21 |

### IV-D Evaluation of Robustness against Large Poses

Faces with large poses are another great challenge for FLD. To further evaluate the effectiveness of our method, we carry out experiments on the AFLW dataset, the 300W challenging subset and the WFLW dataset. For the AFLW dataset, Table V shows that our method achieves 1.41% NME on the AFLW-full testing set and 1.21% NME on the AFLW-frontal testing set, outperforming the state-of-the-art methods. Furthermore, the Cumulative Error Distribution (CED) curve in Fig. 7 also shows that our model outperforms the other methods. For the WFLW dataset, our NMEs on the Pose and Expression subsets are lower than those of the other methods. Hence, from the experimental results on the three datasets (see Tables II, IV and V, Fig. 6 and Fig. 7), we can conclude that our method is more robust to faces with large poses, mainly because the proposed IMCG model and EPBR method can utilize the implicit and explicit geometric constraints to generate more accurate boundary-adaptive landmark heatmaps and thus further improve the accuracy of FLD.

### IV-E Self-Evaluations

Time and memory analysis. The IMCG model contains multiple multi-order spatial correlations modules and one multi-order channel correlations module. The multi-order spatial correlations module only increases the computation cost, as it introduces matrix operations such as matrix addition and the matrix outer product. The multi-order channel correlations module increases both computation and memory costs: the first-order and second-order channel correlations blocks introduce additional convolutional kernels, and the second-order block also needs 0.125 MB to store the covariance matrix. The EPBR method only increases computation costs that need not be incurred at inference. During training, MMDN takes three times as long as SHN [Yang2017StackedHN]; at inference, MMDN takes twice as long as SHN, which can reach 100 FPS on a single Tesla V100. Overall, compared to SHN, MMDN introduces some additional parameters that are insignificant compared to the 16 GB memory of a Tesla V100, and the impact of the extra computation will decrease with the rapid development of hardware. To reduce the computation cost, we also conduct an experiment replacing the last hourglass unit of SHN with the proposed MSC module; the NME on the 300W challenging subset only rises from 6.63% to 6.81%, which still achieves state-of-the-art accuracy.
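The covariance matrix mentioned above is the core of the second-order channel statistics; a minimal NumPy sketch of such global covariance pooling (hypothetical names; this omits the convolutions and re-calibration the MCC module applies afterwards):

```python
import numpy as np

def channel_covariance(feat):
    """Second-order channel statistics: treat each channel's H*W
    responses as observations and compute the C x C covariance matrix,
    which models pairwise channel correlations.

    feat : array of shape (C, H, W)
    """
    c = feat.shape[0]
    x = feat.reshape(c, -1)                # C x (H*W) observations
    x = x - x.mean(axis=1, keepdims=True)  # center each channel
    return (x @ x.T) / x.shape[1]          # C x C covariance

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
cov = channel_covariance(feat)
assert cov.shape == (8, 8)
assert np.allclose(cov, cov.T)  # a covariance matrix is symmetric
```

For a 256-channel map, this matrix has 256 x 256 float32 entries, i.e. 0.25 MB; the 0.125 MB figure in the text corresponds to half that (e.g. storing only the symmetric half or a 128-channel map).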

Datasets | MSE1 | MSE2 | JSDL1 | JSDL2 | EPBR1 | EPBR2
---|---|---|---|---|---|---

Challenging | 7.56 | 7.34 | 7.10 | 7.05 | 7.01 | 6.97 |

COFW | 6.21 | 6.04 | 5.82 | 5.77 | 5.68 | 5.62 |

AFLW-full | 2.46 | 2.32 | 2.14 | 2.10 | 2.08 | 2.05 |

The effect of loss functions on benchmark datasets (NME (%))

Loss function. To evaluate the effectiveness of the proposed EPBR method, we conduct experiments on the 300W, COFW and AFLW datasets, separately using the Mean Squared Error (MSE1 and MSE2), the Jensen-Shannon Divergence Loss (JSDL1 and JSDL2) and the proposed EPBR method (EPBR1 and EPBR2). MSE1 uses the MSE to generate landmark heatmaps and traverses them to predict landmarks, i.e., the baseline SHN [Yang2017StackedHN]. MSE2 and JSDL1 use the MSE and the JSDL, respectively, to generate landmark heatmaps and boundary heatmaps, and then predict landmarks by traversing the generated landmark heatmaps. JSDL2 uses the JSDL to generate landmark heatmaps and boundary heatmaps, and predicts landmarks by constructing and searching the boundary-adaptive landmark heatmaps. EPBR1 and EPBR2 both use the EPBR method to generate landmark heatmaps and boundary heatmaps, but EPBR1 predicts landmarks by traversing the generated landmark heatmaps, while EPBR2 predicts landmarks by searching the generated boundary-adaptive landmark heatmaps. From Table VI, we can conclude that: 1) generating landmark heatmaps and boundary heatmaps at the same time improves the performance of FLD; 2) compared to the MSE, the JSDL helps generate more effective landmark and boundary heatmaps, which in turn helps detect more accurate facial landmarks; 3) the searching mechanism not only generates more accurate boundary-adaptive landmark heatmaps, but also helps boundary-regression FLD methods detect more accurate landmarks as a post-processing step; 4) by integrating the JSDL and the searching mechanism via the proposed EPBR method, we can generate more robust boundary-adaptive landmark heatmaps, which further enhance robustness to large poses and heavy occlusions.
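A minimal sketch of a Jensen-Shannon divergence between two heatmaps treated as probability distributions (hypothetical names; the paper's JSDL may add weighting or combine landmark and boundary terms):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two heatmaps, each normalized
    to sum to 1: JSD = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2.
    It is symmetric and zero iff the distributions coincide."""
    p = p.ravel() / p.sum()
    q = q.ravel() / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
hm = rng.random((64, 64))
assert js_divergence(hm, hm) < 1e-12          # identical maps -> ~0
assert js_divergence(hm, np.ones((64, 64))) > 0.0
```

Unlike a per-pixel MSE, this loss compares the two maps as whole distributions, which is why it concentrates on the foreground peak rather than treating every pixel equally.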

Heatmap quality. Comparisons of predicted landmark heatmaps and boundary heatmaps corresponding to the upper side of the upper lip boundary are shown in Fig. 8. The first, second, third and fourth rows represent the boundary regression method with MSE, IMCG+MSE, IMCG+JSDL and IMCG+EPBR, respectively. MSE means using the mean squared error (MSE) to generate landmark heatmaps and boundary heatmaps at the same time. IMCG+MSE, IMCG+JSDL and IMCG+EPBR mean combining the MSE, JSDL and our proposed EPBR method with the proposed IMCG model to generate landmark and boundary heatmaps, respectively. From Fig. 8, we have the following observations and corresponding analyses.

(1) From the first row and the second row in Fig. 8, we can see that IMCG+MSE can generate more accurate landmark heatmaps and boundary heatmaps than MSE, mainly because our proposed IMCG model can effectively encode the geometric constraints and context information by introducing the multi-order spatial correlations and the multi-order channel correlations.

(2) From the second row and the third row in Fig. 8, we can see that IMCG+JSDL outperforms IMCG+MSE. This indicates that generating heatmaps with probability-based distribution constraints is more effective for enhancing the accuracy and robustness of heatmap regression-based FLD than pure pixel-difference constraints, mainly because (a) the JSDL can accurately measure the difference between two distributions and pays more attention to the foreground area of heatmaps instead of treating the whole heatmap equally, and (b) the probability-based regression method is closer to reality, with better universality and generalization.

(3) From the third row and the fourth row in Fig. 8, we can see that IMCG+EPBR exceeds IMCG+JSDL. This indicates that (a) by integrating the JSDL and the searching mechanism via the proposed EPBR method, the quality of heatmaps can be further improved. (b) Compared to IMCG+JSDL, IMCG+EPBR can add semantically consistent constraints to the predicted landmarks by introducing the searching mechanism, i.e., the landmarks in facial boundary heatmaps can be semantically moved to improve the accuracy of FLD.

### IV-F Ablation Study

Our proposed Multi-order Multi-constraint Deep Network contains two pivotal components, namely the IMCG model and the EPBR method. Hence, we conduct the following experiments.

(1) As the IMCG model contains the multi-order spatial correlations (MSC) module and the multi-order channel correlations (MCC) module, we conduct experiments by combining the MSC module and the MCC module with the baseline SHN [Yang2017StackedHN], respectively. From the results in Tables II, III and V, we can conclude that 1) performing feature re-calibration helps obtain more discriminative representations for robust FLD (SHN+FCC vs SHN); 2) the second-order statistics help better model the channel correlations and detect more accurate landmarks (SHN+MCC vs SHN+FCC); 3) the multi-order spatial correlations can be utilized to enhance the geometric constraints for robust FLD (SHN+MSC vs SHN); 4) by fusing the multi-order spatial correlations and the multi-order channel correlations, the accuracy of FLD is further improved (SHN+IMCG vs SHN+MSC and SHN+MCC).

(2) To evaluate the effectiveness of the proposed EPBR method, we conduct experiments by combining the SHN [Yang2017StackedHN] with the EPBR method. The experimental results on benchmark datasets (see Tables II – VI and Fig. 8) demonstrate that, by integrating the JSDL and the searching mechanism via the proposed EPBR method, we can explicitly add both boundary and semantically consistent constraints to the final predicted landmarks, achieving robust FLD.

(3) Finally, the proposed MMDN is obtained by combining the SHN [Yang2017StackedHN] with the IMCG model and the EPBR method. From the experimental results in Tables II – V, we can see that MMDN (SHN+IMCG+EPBR) surpasses both SHN+IMCG and SHN+EPBR, showing the complementarity of these two components.

## V Conclusion

Unconstrained facial landmark detection is still a very challenging topic due to large poses and partial occlusions. In this work, we present a Multi-order Multi-constraint Deep Network (MMDN) to address FLD under extremely large poses and heavy occlusions. By fusing the Implicit Multi-order Correlating Geometry-aware (IMCG) model and the Explicit Probability-based Boundary-adaptive Regression (EPBR) method through a seamless formulation, our MMDN achieves more robust FLD. It is shown that the IMCG model can enhance the representative and discriminative abilities of features and effectively encode the geometric constraints and context information by introducing the multi-order spatial and channel correlations. The EPBR method can explicitly utilize the boundary information to enhance the geometric constraints and generate more effective boundary-adaptive landmark heatmaps, which further improves the performance of FLD. Experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art FLD methods. The experiments also show that, by fusing the searching mechanism and the JSDL via the proposed EPBR method, the landmarks in a facial boundary can be semantically moved to further improve the accuracy of FLD.
