Style Aggregated Network for Facial Landmark Detection, CVPR 2018
Recent advances in facial landmark detection achieve success by learning discriminative features from rich deformations of face shapes and poses. Beyond the variance of faces themselves, the intrinsic variance of image styles, e.g., grayscale vs. color images, light vs. dark, intense vs. dull, and so on, has constantly been overlooked. This issue becomes inevitable as increasing numbers of web images are collected from various sources for training neural networks. In this work, we propose a style-aggregated approach to deal with the large intrinsic variance of image styles for facial landmark detection. Our method transforms original face images into style-aggregated images via a generative adversarial module. The style-aggregated image preserves the face content while being more robust to environmental changes. The original face images and the style-aggregated ones are then used together to train a landmark detector, each providing complementary information. In this way, for each face, our method takes two images as input, i.e., one in its original style and the other in the aggregated style. In experiments, we observe that the large variance of image styles degrades the performance of facial landmark detectors. Moreover, we show the robustness of our method to the large variance of image styles by comparing it to a variant of our approach in which the generative adversarial module is removed and no style-aggregated images are used. Our approach is demonstrated to perform well when compared with state-of-the-art algorithms on the benchmark datasets AFLW and 300-W. Code is publicly available on GitHub: https://github.com/D-X-Y/SAN.
Facial landmark detection aims to detect the locations of predefined facial landmarks, such as the corners of the eyes, the eyebrows, and the tip of the nose. It has drawn much attention recently because it is a prerequisite for many computer vision applications. For example, facial landmark detection can be applied to a large variety of tasks, including face recognition [74, 30], head pose estimation, facial reenactment, and 3D face reconstruction, to name a few.
Recent advances in facial landmark detection mainly focus on learning discriminative features from abundant deformations of face shapes and poses, different expressions, partial occlusions, and more [58, 73, 59, 20]. A typical framework constructs features that depict facial appearance and shape information using convolutional neural networks (ConvNets) or hand-crafted descriptors, and then learns a model, i.e., a regressor, that maps the features to the landmark locations [64, 10, 7, 42, 72, 67, 40]. Most of these methods apply a cascade strategy, concatenating prediction modules to progressively update the predicted landmark locations [67, 10, 73].
However, the issue of image style variation has been overlooked by recent studies on facial landmark detection. In real-world applications, face images collected in the wild are usually subject to additional unconstrained variations [46, 73]. A large intrinsic variance of image styles, e.g., grayscale vs. color images, light vs. dark, intense vs. dull, is introduced when face images are collected under different environments and camera settings. This variation in image style causes variation in prediction results. For example, Figure 1 shows three different styles of a face image and the facial landmark predictions on them from a well-trained detector. The contents of the three images are the same, but the visual styles are quite distinct: original, grayscale, and light. We can observe that the predicted locations of the same facial landmark differ across them. The zoom-in parts show the detailed deviation among the predicted locations of the same facial landmark on differently styled images. This intrinsic variance of image styles distorts the predictions of the facial landmark detector and further degrades its accuracy, which will be empirically demonstrated later. This problem commonly exists in face in-the-wild landmark detection datasets [23, 46] (see Figure 2), and becomes inevitable for face images captured under uncontrolled conditions.
Motivated by the issue of large variance across image styles, we propose a Style-Aggregated Network (SAN) for facial landmark detection that is insensitive to this variance. The key idea of SAN is to first generate a pool of style-aggregated face images using a generative adversarial network (GAN). SAN then exploits the complementary information from both the original images and the style-aggregated ones. The original images contain undistorted appearance contents of faces but may vary in image style. The style-aggregated images contain stationary environments around faces, but may lack certain shape information due to the limited fidelity of GAN-generated images. Therefore, our SAN takes both the original and style-aggregated faces together as complementary input, and applies a cascade strategy to generate heatmap predictions that are robust to the large variance of image styles.
To summarize, our contributions include:
To the best of our knowledge, we are the first to explicitly handle the problem caused by the variation of image styles in facial landmark detection problems, which has been overlooked in recent studies. We further empirically verify the performance degeneration caused by the large variance of image styles.
We design a ConvNets architecture, i.e., the Style-Aggregated Network (SAN), which exploits the mutual benefits of the genuine appearance contents of faces and the stationary environments around faces by simultaneously taking both original face images and style-unified ones as input.
In empirical studies, we verify the observation that the large variance of image styles would degenerate the performance of facial landmark detectors. Moreover, we show the insensitivity of SAN to the large variance of image styles and the state-of-the-art performance of SAN on benchmark datasets.
A growing number of researchers focus on facial landmark detection. The goal of facial landmark detection is to detect key-points on human faces, e.g., the tip of the nose, the eyebrows, the eye corners, and the mouth. Facial landmark detection is a prerequisite for a variety of computer vision applications. For example, Zhu et al. take facial landmark detection results as input to a 3D Morphable Model. Wu et al. propose a unified framework that simultaneously handles facial landmark detection, head pose estimation, and facial deformation analysis, tasks that are coupled with each other. Thies et al. use the detection confidences of facial keypoints in feature alignment for facial reenactment. Therefore, it is important to predict precise and accurate facial landmark locations.
A common approach to the facial landmark detection problem is to learn a regression model [31, 64, 75, 5, 73, 7, 63]. Many of these leverage deep CNNs to learn facial features and regressors in an end-to-end fashion [51, 31, 73] with a cascade architecture that progressively updates the landmark estimates [73, 51, 10]. Yu et al. propose a deep deformation network to incorporate geometric constraints within the CNN framework. Zhu et al. leverage cascaded regressors to handle extreme head poses and rich shape deformation. Zhu et al. utilize a coarse search over a shape space with diverse shapes to overcome the poor-initialization problem. Lv et al. present a deep regression architecture with two-stage reinitialization to explicitly deal with the initialization problem.
Another category of facial landmark detection methods takes advantage of end-to-end training of deep CNN models to learn robust heatmaps for facial landmark detection [27, 57, 6, 4]. Wei et al. and Newell et al. take the location with the highest response on the heatmap as the coordinate of the corresponding landmark. Li et al. enhance facial landmark detection via multi-task learning. Bulat et al. propose a robust network structure utilizing state-of-the-art residual architectures.
These existing facial landmark detection algorithms usually focus on facial shape information, e.g., extreme head poses or rich facial deformation. However, few of them consider the intrinsic variance of image styles, e.g., grayscale vs. color images, light vs. dark, and intense vs. dull. We empirically demonstrate the performance drop caused by such intrinsic variance of image styles. This issue has been overlooked by recent studies but becomes inevitable as increasing numbers of web images are collected from various sources. Therefore, it is necessary to investigate approaches to dealing with style variance, which is the focus of this paper.
We leverage the generators of a trained GAN to transfer faces into different styles, which we then aggregate to combat the large variance of face image styles.
GANs were first proposed in order to estimate generative models via an adversarial process. Since then, many researchers have devoted great effort to improving this research topic in both theory [2, 8, 25, 35, 54] and applications [36, 41, 50, 71]. Some of these contribute to face applications, such as makeup-invariant face verification and face aging. In this work, we leverage a recently proposed technique, CycleGAN, to integrate a face generation model into our detection network. This work differs from previous works in two main respects. First, we aim to group images into specific styles in an unsupervised manner, whereas they usually assume a single stationary style per dataset. Second, sophisticated face generation is not our goal.
How can we design a neural network that is insensitive to style variations for facial landmark detection? As illustrated in Figure 3, we design a network combining two sub-modules to solve this problem: (1) the face generation module learns a neutral style of face images to combat the effect of style variations, i.e., it transforms faces with different styles into an aggregated style; (2) the landmark prediction module leverages the complementary information from the neutral face and the original face to jointly predict the final coordinates of each landmark.
This module is motivated by recent advances in image-to-image translation [19, 71] and style transfer [14, 15, 56]. These methods can transform face images into a different style, but they require that the styles of the images be known during both training and testing. However, face images in facial landmark detection datasets are usually collected from multiple sources; these images have various styles, but no style labels are available. Therefore, current facial landmark datasets do not match the settings of image-to-image translation, and we cannot directly apply those techniques to our problem.
We design an unsupervised approach to learn a face generation model that first transfers faces into different styles and then combines them into an aggregated style. We first transfer the original dataset into three different styles using Adobe Photoshop (PS) (the three styles are Light, Gray, and Sketch; see details in Sec. 4.5). These three transferred datasets, together with the original dataset, are regarded as four classes used to fine-tune a classification model [48, 17, 52, 11, 65, 62, 18]. The fine-tuned features from the average-pooling layer are thus style-discriminative, because the style information is learned during training from this machine-generated style supervision.
To learn the stylized face generation model, we need to obtain the style information. For most face in-the-wild datasets, we can observe that faces appear in different styles; Figure 2 illustrates some examples of faces in various styles from 300-W. However, it is hard to label such datasets with style annotations for two reasons: (1) some style definitions are ambiguous, e.g., a face in the light style could also be classified as color; (2) labeling style information requires substantial manual effort. Therefore, we leverage the learned style-discriminative features to automatically cluster the whole dataset into K hidden styles via k-means.
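The clustering step above can be sketched as follows. This is an illustrative NumPy re-implementation of Lloyd's k-means, not the authors' code; the `features` array stands in for the average-pooling activations of the fine-tuned style classifier, and the evenly spaced center initialization is a simplification.

```python
import numpy as np

def kmeans_styles(features: np.ndarray, k: int, iters: int = 50):
    """Cluster style features (N x D) into k hidden styles with Lloyd's k-means.

    Illustrative sketch: centers are seeded from evenly spaced samples
    for simplicity (the choice of initialization is an assumption).
    """
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centers = features[idx].astype(float)
    for _ in range(iters):
        # assign each face to its nearest style center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each center from its members; keep the old center if empty
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels, centers
```

Each resulting cluster is then treated as one hidden style.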
Lastly, we regard the face images in different clusters as having different hidden styles, and we then train face generation models to transfer between styles via CycleGAN. CycleGAN is capable of preserving the structure of the input image because its cycle-consistency loss guarantees that the reconstructed images closely match the input images. The overall pipeline is illustrated in Figure 4. The final output is several face generation models that transfer face images into different styles; we then average the transferred faces to obtain the style-aggregated ones.
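The cycle-consistency term that lets CycleGAN preserve facial structure can be sketched as below. The generators `g_ab` and `g_ba` are placeholders for the two learned style mappings; the full CycleGAN objective also includes adversarial terms, which are omitted here.

```python
import numpy as np

def cycle_consistency_loss(x_a, x_b, g_ab, g_ba, lam=10.0):
    """L1 cycle-consistency term of CycleGAN (sketch).

    Translating A -> B -> A (and B -> A -> B) should reconstruct the input,
    which keeps the face structure intact while only the style changes.
    """
    recon_a = g_ba(g_ab(x_a))   # A -> B -> A should reconstruct x_a
    recon_b = g_ab(g_ba(x_b))   # B -> A -> B should reconstruct x_b
    return lam * (np.abs(recon_a - x_a).mean() + np.abs(recon_b - x_b).mean())
```

With perfect generators the reconstruction is exact and the term vanishes.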
The facial landmark prediction module leverages the mutual benefit of both the original images and the style-aggregated ones to overcome the negative effects caused by style variations. This module is illustrated in Figure 3, where the green stream indicates the style-aggregated face and the blue stream represents the face in its original style. The blue stream contains undistorted appearance contents of faces but may vary in image style. The green stream contains stationary environments around faces, but may lack certain shape information due to the limited fidelity of GAN-generated images. By leveraging their complementary information, we can generate more robust predictions. The architecture is inspired by CPM. We use the first four convolutional blocks from VGG-16
followed by two additional convolution layers as the feature extraction part. The feature extraction part takes the face image in the original style and the image from the style-aggregated stream as input, where W and H denote the width and height of the images. In this part, each of the first three convolution blocks is followed by one pooling layer; the extracted features are thus down-sampled eight times relative to the input image, i.e., of spatial size W/8 x H/8 with C channels, where C is the number of channels of the last convolutional layer. The output features from the original and the style-aggregated faces are denoted F^o and F^s, respectively. Three subsequent stages are used to produce 2D belief maps. Each stage is a fully-convolutional structure whose output tensor has the same spatial size as its input tensor and K+1 channels, where K indicates the number of landmarks. The first stage takes F^o and F^s as inputs and generates belief maps for each of them, B_1^o and B_1^s. The second stage takes the concatenation of F^o, F^s, B_1^o, and B_1^s as input, and outputs the belief maps for stage 2:

    B_2 = f_2(F^o ⊕ F^s ⊕ B_1^o ⊕ B_1^s),

where ⊕ denotes channel-wise concatenation. The last stage is similar to the second one and can be formulated as:

    B_3 = f_3(F^o ⊕ F^s ⊕ B_2).

For the predicted belief maps B_t of each stage t, we minimize the following loss function for each face image during the training procedure:

    L = Σ_t || B_t − B* ||_2^2,

where B* represents the ideal belief maps constructed from the ground-truth landmark locations.
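The stage-wise loss can be sketched numerically as below; the array shapes and the simple sum over stages are an illustration of the squared-L2 objective, not the exact training code.

```python
import numpy as np

def landmark_loss(stage_beliefs, ideal_belief):
    """Sum of squared L2 distances between each stage's predicted belief
    maps and the ideal (ground-truth) belief maps -- a sketch of the loss.

    stage_beliefs: list of arrays, each shaped (K+1, H/8, W/8), one per stage.
    ideal_belief:  array of the same shape.
    """
    return sum(float(np.sum((b - ideal_belief) ** 2)) for b in stage_beliefs)
```

A perfect prediction at every stage yields zero loss, and each stage contributes its own error term, which is what drives the cascade to refine its belief maps.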
To generate the final landmark coordinates, we first up-sample the belief maps of the last stage to the original image size using bicubic interpolation. We then apply the argmax function to each belief map to obtain the coordinate of each landmark.
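The decoding step can be sketched as follows. Note the paper up-samples with bicubic interpolation; this self-contained sketch substitutes nearest-neighbor up-sampling (via `np.kron`), so the recovered coordinates snap to the stride grid.

```python
import numpy as np

def decode_landmarks(belief_maps: np.ndarray, stride: int = 8):
    """Turn per-landmark belief maps (K, H/8, W/8) into (x, y) coordinates.

    Sketch: each map is up-sampled back to input resolution and the peak
    location is taken as the landmark coordinate. Nearest-neighbor
    up-sampling replaces the paper's bicubic interpolation for simplicity.
    """
    coords = []
    for bmap in belief_maps:
        up = np.kron(bmap, np.ones((stride, stride)))  # nearest-neighbor upsample
        y, x = np.unravel_index(np.argmax(up), up.shape)
        coords.append((int(x), int(y)))
    return coords
```

With bicubic interpolation the peak can land between grid cells, giving sub-stride precision, which is why the paper prefers it over nearest-neighbor.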
300-W. This dataset annotates five face datasets with 68 landmarks: LFPW, AFW, HELEN, XM2VTS, and IBUG. Following the common settings in [72, 31], we regard all the training samples from LFPW and HELEN and the full set of AFW as the training set, in which there are 3,148 training images. The 554 testing images from LFPW and HELEN form the common testing subset; the 135 images from IBUG are regarded as the challenging testing subset. These two subsets together form the full testing set.
AFLW. This dataset contains 21,997 real-world images with 25,993 faces in total. It provides at most 21 landmark coordinates for each face, excluding invisible landmarks. Faces in AFLW usually vary in pose, expression, occlusion, and illumination, which makes it difficult to train a robust detector. Following the same settings as in [31, 73], we do not use the landmarks of the two ears. There are two types of AFLW splits, AFLW-Full and AFLW-Frontal. AFLW-Full contains 20,000 training samples and 4,386 testing samples. AFLW-Frontal uses the same training samples as AFLW-Full, but only the 1,165 frontal-face samples as the testing set.
We use PyTorch for all experiments. To train the style-discriminative features, we regard the original dataset and the three PS-generated datasets as four different classes. We then use them to fine-tune a ResNet-152 ImageNet pre-trained model with a learning rate of 0.01 for two epochs in total. We use k-means to cluster the whole dataset into K groups, and by default regard the largest and the smallest clusters as two different style sets. These two groups are then used to train our style-unified face generation module via CycleGAN. We follow training settings similar to the original CycleGAN, except that we train our model with a batch size of 32 on two GPUs, and we set the weight of the identity loss to 0.1. To train the facial landmark prediction module, the first four convolutional blocks are initialized from a VGG-16 ImageNet pre-trained model, and the other layers are initialized from a Gaussian distribution with variance 0.01. Lastly, we train the facial landmark prediction model with a batch size of 8 and a weight decay of 0.0005 on two GPUs. We start with a learning rate of 0.00005, halve it at the 30th, 35th, 40th, and 45th epochs, and stop training at the 50th epoch. The face bounding box is expanded by a ratio of 0.2, and we use random cropping during training for data augmentation.
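The stepped learning-rate schedule above (start at 5e-5, halve at epochs 30/35/40/45, stop at 50) can be expressed as a small helper; this is a sketch of the schedule as described, not the authors' training script.

```python
def learning_rate(epoch: int, base_lr: float = 5e-5, milestones=(30, 35, 40, 45)):
    """Learning rate at a given epoch for the paper's schedule:
    start at 5e-5 and multiply by 0.5 at each milestone epoch
    (training stops at epoch 50)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= 0.5
    return lr
```

In PyTorch, the equivalent behavior is typically obtained with a multi-step scheduler (gamma 0.5 at those milestone epochs).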
Table 1 compares SAN with SDM, ERT, LBF, CFSS, CCL, and Two-Stage.
Evaluation. Normalized Mean Error (NME) is commonly applied to evaluate the performance of facial landmark predictions [31, 43, 73]. For the 300-W dataset, we use the interocular distance to normalize the mean error, following the same setting as in [46, 31, 7, 43]. For the AFLW dataset, we use the face size to normalize the mean error. We also use Cumulative Error Distribution (CED) curves to compare algorithms. Area Under the Curve (AUC) at a 0.08 error threshold is also employed for evaluation [6, 55].
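The NME metric can be sketched as below; the normalizing distance is the interocular distance for 300-W and the face size for AFLW, and is passed in precomputed here for simplicity.

```python
import numpy as np

def nme(pred: np.ndarray, gt: np.ndarray, norm_dist: float) -> float:
    """Normalized Mean Error (sketch): mean point-to-point Euclidean distance
    between predicted and ground-truth landmarks, each shaped (K, 2),
    divided by a normalizing distance (interocular distance or face size)."""
    per_point = np.linalg.norm(pred - gt, axis=-1)   # (K,) distances in pixels
    return float(per_point.mean() / norm_dist)
```

Dividing by a per-face distance makes the metric comparable across faces of different scales.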
Results on 300-W. Table 1 shows the performance of different facial landmark detection algorithms on 300-W. We compare our approach with recently proposed state-of-the-art algorithms [31, 61, 20], based on two types of face bounding boxes: (1) the ground-truth bounding box, denoted GT; (2) the official detector, denoted OD. SAN achieves very competitive results compared with the others when using the same face bounding box (OD). We improve the NME on the 300-W common set by a relative 21.8% compared to the state-of-the-art method. Our approach can be further enhanced by applying a better initialization (GT). This implies that SAN has the potential to become more robust by incorporating face alignment or landmark refinement [73, 55] methods.
Results on AFLW. We use the training/testing splits and the bounding boxes provided by [73, 72]. Table 2 shows the performance comparison on AFLW. Our SAN also achieves very competitive NME results, better than the previous state-of-the-art by more than 11% on AFLW-Full. On the AFLW-Frontal testing set, our result is also better than the state-of-the-art by more than 14%. We find that using more clusters and more generation models in the style-aggregated face generation module yields similar results, so we use the default setting.
SAN achieves new state-of-the-art results on two benchmark datasets, i.e., 300-W and AFLW. It takes two complementary images to generate predictions that are insensitive to style variations. The idea of using a two-stream input for facial landmark detection can be complementary to other algorithms [20, 31, 61, 73]: they usually do not consider the effect of image style, while the style-aggregated face in our two-stream input handles this problem.
In this section, we first verify the significance of each component of our proposed SAN. Figure 5 compares the CED curves of SAN and two of its variants on the 300-W common and full testing sets. As we can observe, performance deteriorates significantly if we remove either the original face image or the generated style-aggregated face image. This observation demonstrates that taking the two complementary face images as input benefits the facial landmark prediction results.
Figure 6 shows the results of k-means clustering on the 300-W dataset. 300-W is a face in-the-wild dataset whose face images have large style variations, but this style information is not accessible. Our style-discriminative features are capable of distinguishing images with different hidden styles: most of the face images in one cluster share a similar style, and the mean face images of the three clusters exhibit different styles. If we instead directly use ImageNet pre-trained features for k-means clustering, we cannot guarantee that faces are grouped into different hidden styles. In experiments, we find that ImageNet pre-trained features tend to group face images by gender or other information.
Facial landmark detection datasets with constrained face images usually have a similar environment for each image. These datasets exhibit only small style changes, and they may also not be applicable to real-world applications due to the small face variance; we therefore do not discuss them in this paper. The face in-the-wild datasets [46, 23] contain face images with large intrinsic variance. However, this intrinsic variance information is not available from the official datasets, yet it can affect the predictions of the detector. Therefore, we propose two new datasets, 300W-Style and AFLW-Style, to facilitate style analysis for the facial landmark detection problem.
As shown in Figure 7, 300W-Style consists of four different styles: original, sketch, light, and gray. The original part is the original 300-W dataset, and the other three synthetic styles are generated using PS. Each image in 300W-Style corresponds to one image in the 300-W dataset, so we directly use the annotations provided by 300-W for 300W-Style. AFLW-Style is constructed similarly, transferring the AFLW dataset into the same three styles. For the training and testing splits, we follow the common settings of the original datasets [46, 23].
Can PS-generated images be realistic? Internet users usually use PS (or similar software) to change image styles and/or edit image content; thus PS-generated images are indeed realistic in many real-world applications. In addition, we have chosen three representative filters to generate images of different styles. These filters have been widely used by users to edit their photos and upload to the Internet. Therefore, the proposed datasets are realistic.
Effect of SAN on style variances. These two proposed datasets can be used to analyze the effect of face image styles on facial landmark detection. We consider the situation where the testing set has a different style from the training set. For example, we train the detector on the light-style 300-W training set and evaluate it on 300-W testing sets with different styles. Table 3, Table 4, and Table 5 show the evaluation results of 16 training/testing style combinations, i.e., four training styles times four testing styles. Our SAN algorithm is specifically designed to deal with style variances for facial landmark detection. When the style variance between the training and testing sets is large (e.g., light and gray), our approach usually obtains a significant improvement. However, when the style variance is smaller (e.g., gray and sketch), the improvement of SAN is less pronounced. On average, SAN obtains a 7% relative improvement on the full testing set of the 300W-Style dataset when the training style differs from the testing style. Moreover, SAN achieves consistent improvements over all 16 train-test style combinations, which demonstrates the effectiveness of our method.
Self-Evaluation: We compare two variants of our SAN. (1) We train SAN without GAN using the training set of AFLW-Style and the testing set of AFLW. This can be considered data augmentation, because the amount of training data is four times larger than the original. In this case, our SAN achieves 79.82 AUC@0.08 on AFLW-Full using only the original AFLW training set, while the data-augmentation variant achieves a worse 78.99 AUC@0.08. SAN is thus better than data augmentation that uses our PS-generated images as additional training data. (2) We replace the style-aggregated stream of SAN with a PS-generated face image. If we train the detector on the original-style 300-W training set and test it on the gray-style 300-W challenging test set, our SAN achieves 6.91 NME, whereas replacing the style-aggregated stream with light-style images achieves only 7.30 NME, which is worse than ours. SAN always achieves better results than this replaced variant, except when the style-aggregated stream is replaced by images in the testing style. SAN automatically learns the hidden styles in the dataset and generates the style-aggregated face images; this automatic approach is better than providing images in a fixed style.
Error Analysis: Faces captured in uncontrolled conditions have large variations in image style. Detectors usually fail when the image style changes substantially, whereas our SAN is insensitive to such changes. Figure 8 shows qualitative results of our SAN and the base detector on 300-W. The first row shows the ground-truth landmarks. The second and third rows show the predictions from SAN without GAN and from SAN, respectively. In the first column, the base detector fails on the face contour, while the predictions from SAN still preserve the overall structure. In the fourth column, some predictions from the base detector drift to the right, while SAN's do not.
The large intrinsic variance of image styles, which comes from uncontrolled collection sources, has been overlooked by recent studies in facial landmark detection. To deal with this issue, we propose the Style-Aggregated Network (SAN). SAN takes two complementary images for each face: one in the original style and the other in the aggregated style generated by a GAN. Empirical studies verify that style variations degrade the performance of landmark detection, and that SAN is robust to the large variance of image styles. Additionally, SAN achieves state-of-the-art performance on the 300-W and AFLW datasets.
The first step of SAN is to generate the style-aggregated images. This step can be decoupled from our landmark detector, and potentially used to improve other landmark detectors [7, 43, 72, 68, 37]. Moreover, the intrinsic variance of image styles also exists in other computer vision tasks, such as object detection [12, 44, 38, 9, 29] and person re-identification [60, 69, 70, 32]. Therefore, the style-aggregation method can also be used to solve the problem of the style variance in other applications. In our future work, we will explore how to generalize the style-aggregation method for other computer vision tasks.
Acknowledgment. Yi Yang is the recipient of a Google Faculty Research Award. Wanli Ouyang is supported by SenseTime Group Limited. We acknowledge the Data to Decisions CRC (D2D CRC) and the Cooperative Research Centres Programme for funding this research.
Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
Multi-source deep learning for human pose estimation. In CVPR, 2014.
Image classification by cross-media active learning with privileged information. IEEE Transactions on Multimedia, 18(12):2494-2502, 2016.