A High-Efficiency Framework for Constructing Large-Scale Face Parsing Benchmark

05/13/2019 ∙ by Yinglu Liu, et al.

Face parsing, which assigns a semantic label to each pixel in a face image, has recently attracted increasing interest due to its broad application potential. Although many face-related fields (e.g., face recognition and face detection) have been well studied for years, existing datasets for face parsing remain severely limited in scale and quality; for example, the widely used Helen dataset contains only 2,330 images. This is mainly because pixel-level annotation is a costly and time-consuming task, especially for facial parts without clear boundaries. The lack of accurately annotated datasets has become a major obstacle to progress in face parsing. Utilizing dense facial landmarks to guide parsing annotation is a feasible approach; however, annotating dense landmarks on human faces encounters the same issues as parsing annotation. To overcome these problems, in this paper we develop a high-efficiency framework for face parsing annotation, which considerably simplifies and speeds up parsing annotation with two consecutive modules. Benefiting from the proposed framework, we construct a new Dense Landmark Guided Face Parsing (LaPa) benchmark. It consists of 22,000 face images with large variations in expression, pose, occlusion, etc. Each image is provided with an accurate 11-category pixel-level label map along with the coordinates of 106-point landmarks. To the best of our knowledge, it is currently the largest public dataset for face parsing. To make full use of our LaPa dataset with its abundant face shape and boundary priors, we propose a simple yet effective Boundary-Sensitive Parsing Network (BSPNet). Our network serves as a baseline model on the proposed LaPa dataset and, meanwhile, achieves state-of-the-art performance on the Helen dataset without resorting to extra face alignment.


1. Introduction

Face parsing, aiming to assign pixel-level semantic labels to face images, has attracted much attention due to its wide application potential, such as facial beautification (Ou et al., 2016) and face image synthesis (Zhang et al., 2018). In recent years, deep learning has driven the development of artificial intelligence in computer vision and multimedia, especially in face-related fields. As is well known, adequate training data is crucial for deep learning methods to achieve good results. However, public datasets for face parsing are rare due to the difficulty and high cost of pixel-level annotation.

Utilizing dense facial landmarks to guide face parsing annotation is a feasible approach. However, annotating over a hundred landmarks encounters the same issues as parsing annotation, since well-trained annotators are needed. Therefore, the public datasets (Sagonas et al., 2013; Le et al., 2012; Ramanan and Zhu, 2012; Gross et al., 2010; Messer et al., 1999; Wu et al., 2018; Koestinger et al., 2011) for facial landmark localization are limited either in the number of training samples or in the number of landmarks. Specifically, most existing landmark datasets are annotated with fewer than 100 points, which are not enough to depict the shapes of facial parts in fine detail. For example, the widely used 68-point annotation in 300W (Sagonas et al., 2013) describes the eyebrow with only 5 points on the upper boundary while leaving the lower boundary unmarked. The recent 98-point annotation in WFLW (Wu et al., 2018) does not include the position of the nose wing. Other configurations, such as the 21 points in AFLW (Koestinger et al., 2011) and the 6 points in AFW (Baltrusaitis et al., 2013), can be applied for geometric face normalization but are insufficient to represent the boundaries of facial parts. In contrast, the Helen (Le et al., 2012) dataset contains 194-point landmarks, but the number of samples is only 2,330 and no landmarks are located on the nose bridge.

To remedy the above problems, in this paper we develop a high-efficiency framework for face parsing annotation, which is composed of two consecutive modules. In the first module, we develop a semi-automatic labeling tool for 106-point facial landmarks. With the help of an auxiliary landmark localization model, a coarse position for each landmark is given first, so that annotators only need to adjust a small number of points in difficult cases, as shown in Fig. 3. The initial landmarks serve as a strong and confident reference, so annotators do not need heavy training to learn each landmark's definition. Benefiting from these advantages, this tool significantly reduces the workload and speeds up the annotation of dense landmarks. In the second module, we propose a category-wise fitting approach, which draws an accurate contour for each facial part according to the landmarks from the first module; moreover, a coarse-to-fine segmentation strategy is employed to label the hair and face skin regions. Finally, the outputs are merged hierarchically to produce an 11-category pixel-level semantic label map. Using the proposed method, we construct a new benchmark for face parsing, named LaPa. It consists of 22,000 images, which cover large variations in facial expression, pose, occlusion, etc. Each image is provided with an accurate 11-category pixel-level label map, covering hair, face skin, left/right eyebrows, left/right eyes, nose, upper/lower lips, inner mouth and background, along with the coordinates of 106-point landmarks.

Furthermore, to make full use of our LaPa dataset with abundant face shape and boundary priors, we propose a simple yet effective Boundary-Sensitive Parsing Network (BSPNet), which improves face parsing performance by focusing more on boundary pixels from two aspects: 1) boundary-aware features are integrated into semantic-aware features to preserve more boundary details implicitly; 2) the semantic loss of boundary pixels is weighted by the boundary map to reinforce the boundary effect explicitly. Experiments on the Helen and LaPa datasets demonstrate the effectiveness of our network.

The contributions of this paper are summarized as follows:

  • We develop a high-efficiency framework for face parsing annotation, which is composed of a Dense Landmark Annotation (DLA) module and a Pixel-Level Parsing Annotation (PPA) module. This framework considerably simplifies and speeds up the pixel-level parsing annotation.

  • Based on the proposed framework, we construct a new large-scale benchmark for face parsing. It contains 22,000 images. Each image is provided with an 11-category pixel-level label map and coordinates of 106-point landmarks. To the best of our knowledge, this is the largest public dataset for face parsing so far. It will be released to the community soon.

  • We propose an effective boundary-sensitive parsing network, which serves as the baseline method on the proposed LaPa dataset. Meanwhile, we evaluate it on the public Helen dataset, and our model achieves state-of-the-art performance on all categories.

2. Related Work

2.1. Face Parsing

2.1.1. Dataset

Due to the aforementioned difficulty of pixel-level annotation, few face parsing datasets have been published. The most commonly used public datasets for face parsing are LFW-PL (Kae et al., 2013) and Helen (Le et al., 2012; Smith et al., 2013). LFW-PL is a subset of the Labeled Faces in the Wild (LFW) funneled images, a database of face photographs dedicated to unconstrained face recognition. This dataset contains 2,927 face images. All the images are first segmented into superpixels, and then each superpixel is manually assigned one of the hair/skin/background categories. Annotations for facial parts are not provided in this dataset. The original Helen dataset (Le et al., 2012) is composed of 2,330 face images with densely-sampled, manually-annotated keypoints around the semantic facial parts. Smith et al. (Smith et al., 2013) generated segmentation ground truths for the eye, eyebrow, nose, inner mouth, upper lip and lower lip automatically from these contours, together with facial skin and hair categories generated from manually annotated boundaries and an automatic matting algorithm (Levin et al., 2008).

2.1.2. Methods

In recent years, face parsing has drawn increasing attention due to its great application potential. Early works mainly focus on hand-crafted features and probabilistic graphical models. Warrell et al. (Warrell and Prince, 2009) proposed to use priors to model facial structure and obtain facial part labels through a Conditional Random Field (CRF). Smith et al. (Smith et al., 2013) adopted SIFT features to select exemplars and computed the segmentation map of a test image by propagating labels from the aligned exemplar images. Kae et al. (Kae et al., 2013) combined a CRF with a Restricted Boltzmann Machine (RBM) to model both local and global structures for face labeling. More recently, several works tackle the face parsing task with the help of deep learning to break the performance bottleneck of traditional methods. Luo et al. (Luo et al., 2012) proposed a hierarchical face parsing framework with Deep Belief Networks (DBNs) as facial part and component detectors. Liu et al. (Liu et al., 2015) exploited a Convolutional Neural Network (CNN) to model both unary likelihoods and pairwise label dependencies. Yamashita et al. (Yamashita et al., 2015) proposed a weighted cost function to improve performance on certain classes such as eyes. Jackson et al. (Jackson et al., 2016) proposed a two-stage parsing framework with Fully Convolutional Networks (FCNs). Liu et al. (Liu et al., 2017) designed a light-weight network which combines a shallow CNN with a spatially variant Recurrent Neural Network (RNN) and a coarse-to-fine approach for accurate face parsing. Wei et al. (Wei et al., 2017) introduced an automatic method for selecting receptive fields and achieved accurate parsing results for face images. Guo et al. (Guo et al., 2018) adopted a prior mechanism to refine the Residual Encoder Decoder Network (RED-Net), achieving state-of-the-art performance on both LFW-PL and Helen datasets.

2.2. Facial Landmark Localization

2.2.1. Datasets

There are more public datasets for facial landmark localization than for face parsing. However, the existing datasets have limitations either in the scale of training samples or in the number of landmarks. For example, the Helen dataset (Le et al., 2012) contains densely defined 194-point landmarks, but only 2,330 images are included. The AFLW dataset (Koestinger et al., 2011) contains about 25k annotated faces in real-world images while annotating at most 21 landmarks per image. Moreover, some of them are captured under controlled conditions; for example, images in the XM2VTS dataset (Messer et al., 1999) are captured under laboratory conditions with the same illumination and neutral expressions. To remedy these shortcomings, IBUG (https://ibug.doc.ic.ac.uk/home) built a new dataset called 300W (Sagonas et al., 2013), which collects several datasets (AFW (Baltrusaitis et al., 2013), Helen (Le et al., 2012), LFPW (Jaiswal et al., 2013), IBUG (Sagonas et al., 2013), etc.) and re-annotates them with 68-point landmarks. Wu et al. (Wu et al., 2018) released a new dataset called WFLW, which contains 10,000 faces annotated with 98 landmarks per face image.

2.2.2. Methods

Methods for facial landmark localization mainly fall into two categories, model-based methods and regression-based methods, both of which have pushed the state of the art. 1) Model-based methods usually build a basic model first and then learn the transformation for each specific sample. For example, Cootes et al. (Cootes et al., 1995) proposed the Active Shape Model (ASM). The first step of this approach is to train a mean shape model, represented by the concatenation of a sequence of landmarks; the second is to search for the locations according to the basic shape model and the local features of each landmark. In a later work (Cootes et al., 2001), the Active Appearance Model (AAM) was proposed by adding a texture model alongside the shape model. Besides, Zhu et al. (Zhu et al., 2016) proposed to fit a dense 3D Morphable Model (3DMM (Blanz and Vetter, 2003)) to the image via cascaded convolutional neural networks, which can be used to synthesize face images in profile views and thus provide abundant samples for training. 2) There are usually two strategies for regression-based methods. One is to directly regress the coordinates of the landmarks from the input image. For example, Xiong et al. (Xiong and De la Torre, 2013) proposed the Supervised Descent Method (SDM), which concatenates the SIFT features around each landmark as a shape-indexed feature and learns the regression matrix by minimizing a Non-linear Least Squares (NLS) function. Sun et al. (Sun et al., 2013) first proposed a three-level cascaded CNN method for facial landmark detection. In (Zhou et al., 2013), the 68 landmarks are divided into two categories (inner points and contour landmarks) and localized from coarse to fine. Zhang et al. (Zhang et al., 2014) adopted a multi-task method to learn the coordinates of the landmarks along with facial attributes. Lai et al. (Lai et al., 2018) proposed an end-to-end recurrent convolutional system for coarse-to-fine face alignment. The other strategy is to generate a heatmap for each landmark and obtain the coordinates by post-processing (Yang et al., 2017; Merget et al., 2018; Bulat and Tzimiropoulos, 2017; Dong et al., 2018). For example, Yang et al. (Yang et al., 2017) adopted the hourglass network for facial landmark localization and achieved first place in the 300W challenge. Heatmap regression methods usually achieve higher accuracy than coordinate regression methods, but at a higher computational cost.

3. High-efficiency Framework for Face Parsing Annotation

In this section, we describe the proposed efficient framework for face parsing annotation. As Fig. 2 shows, it is composed of two consecutive modules, named Dense Landmark Annotation (DLA) and Pixel-level Parsing Annotation (PPA). These two modules are introduced in detail in Section 3.1 and Section 3.2, respectively. In Section 3.3, we introduce the proposed LaPa dataset.

Figure 2. The proposed framework for face parsing. DLA denotes Dense Landmark Annotation and PPA denotes Pixel-level Parsing Annotation. First, the DLA module outputs 106-point landmarks for the input color image, and then the PPA module produces the pixel-level label map automatically according to the landmarks.

3.1. Dense Landmark Annotation (DLA) Module

The purpose of the DLA module is to annotate face images with dense landmarks efficiently. First, we develop a semi-automatic facial landmark labeling tool with a user interface (Fig. 3). This tool gives a reference position for each landmark via an auxiliary facial landmark localization model, so that annotators only need to adjust a small number of points in difficult cases rather than annotating everything from scratch. In this paper, a 1-stack hourglass network (Newell et al., 2016; Bulat and Tzimiropoulos, 2017) is employed, which is trained with a mere 2,000 manually annotated images at the beginning and updated each time 2,000 additional images are labeled. The Normalized Mean Error (NME) and Area Under Curve (AUC) with respect to the number of training samples are reported in Fig. 4. We can see that the performance of the auxiliary model keeps improving as the samples accumulated by the semi-automatic labeling process increase. This tool significantly simplifies and speeds up the process of dense landmark annotation. In this work, the DLA module adopts the 106-point landmark definition. The outputs of this module are fed to the PPA module.
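For illustration, the sketch below shows one way the coarse landmark proposals presented to annotators can be read off the auxiliary hourglass model, assuming it outputs one confidence heatmap per landmark (a common setup for hourglass-based localization); the peak-picking scheme and the `img_size` convention are our assumptions, not the authors' released tooling.

```python
import numpy as np

def heatmaps_to_landmarks(heatmaps, img_size):
    """Convert per-landmark heatmaps into coarse 106-point proposals.

    heatmaps: (106, H, W) array of per-landmark confidence maps (assumed output
              of the auxiliary 1-stack hourglass model).
    img_size: (width, height) of the original image.
    """
    n, h, w = heatmaps.shape
    pts = np.zeros((n, 2), dtype=np.float32)
    for i in range(n):
        idx = np.argmax(heatmaps[i])          # peak of the i-th heatmap
        y, x = divmod(idx, w)                 # flat index -> (row, col)
        pts[i] = (x * img_size[0] / w, y * img_size[1] / h)  # rescale to image
    return pts
```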

Figure 3. Our semi-automatic facial landmark labeling tool. The pink points refer to the initial results by the auxiliary model. The green points refer to the results after light manual correction. The white arrows show the adjustment trajectories. Our tool provides common functions such as undo, next, save, etc. Best viewed in color and zoom.

3.2. Pixel-Level Parsing Annotation (PPA) Module

This module takes a color image and coordinates of 106-point landmarks as input, and outputs a pixel-level semantic label map corresponding to 11 categories. As Fig. 6 shows, it consists of three stages:

1) Coarse-to-fine segmentation for hair and face skin

Parsing hair and skin is important for many facial applications, such as face beautification, hair coloring, etc. However, conventional facial landmarks are not defined on the forehead and hair regions, partly because the forehead is usually covered by hair of different styles. We solve this problem with the help of the human parsing dataset CIHP (http://sysu-hcp.net/lip/overview.php), the first standard and comprehensive benchmark for instance-level human parsing. Because the test set does not provide ground truth, we adopt only the training and validation sets, 33,280 images in total.

In the coarse segmentation stage, we first map the twenty categories in CIHP into two by taking hair and skin as foreground and everything else as background. Then we crop the regions of interest from the original images according to the mapped labels combined with the instance labels. Usually, one image in CIHP produces several sub-images, each containing only one major face. After filtering out images whose width or height is less than 80 pixels, we collect about 26,000 images for training. Here we adopt the Pyramid Scene Parsing Network (PSPNet) (Zhao et al., 2017) to segment the foreground (hair and skin) from the background. This stage can be considered a face detection operation that preserves the hair region, whereas regular face detectors usually focus on the face region and may lose part of the hair.
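A minimal sketch of this pre-processing is given below; the CIHP category indices used for hair and face (2 and 13) are assumptions for illustration and should be checked against the dataset's own label definition.

```python
import numpy as np

# Assumed CIHP indices for illustration only: 2 = hair, 13 = face.
HAIR_ID, FACE_ID = 2, 13

def to_foreground_mask(cihp_label):
    """Collapse the 20 CIHP categories into a binary hair/skin-vs-rest mask."""
    return np.isin(cihp_label, (HAIR_ID, FACE_ID)).astype(np.uint8)

def keep_crop(crop, min_side=80):
    """Keep only sub-images whose width and height are at least 80 pixels."""
    h, w = crop.shape[:2]
    return h >= min_side and w >= min_side
```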

In the fine segmentation stage, we process the data in a similar way as in the coarse segmentation stage but keep hair and skin as two separate categories. To obtain more accurate segmentation results, the proposed Boundary-Sensitive Parsing Network (BSPNet) is adopted in this stage. The network is introduced in detail in Sec. 4.

Figure 4. Performance of the auxiliary model for landmark localization w.r.t. the number of training samples. The horizontal axis is the number of training samples accumulated by our semi-automatic facial landmark labeling tool; the vertical axis is the corresponding evaluation performance.

2) Category-wise fitting for facial parts

Facial parts include the left/right eyebrows, left/right eyes, nose, upper/lower lips and inner mouth. In order to obtain more natural and accurate contours, we develop different fitting schemes for different facial parts according to their characteristics. For the eyebrows, the outer contour of the mouth and the jawline, we adopt polygon fitting to generate approximate contours; the pixels within each polygon are assigned to the corresponding category. In some cases, directly connecting long-distance neighboring landmarks may cause a piecewise-linear effect, so prior knowledge is leveraged to make the results smoother via interpolation. For the eyes and inner mouth, two parabolas are applied to sketch the upper and lower boundaries separately. For the nose, we separate it into left and right parts to handle profile faces, and piecewise fitting is adopted due to its complex shape. Note that all the partial landmarks are fitted in a transformed space where each part is aligned with a standard pose. The visualization results shown in Fig. 5 demonstrate the effectiveness of our approach.
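As an illustration of the parabola scheme for the eyes and inner mouth, the sketch below fits one quadratic to the upper-lid landmarks and one to the lower-lid landmarks and samples both to obtain a dense, smooth contour; the landmark grouping and the prior alignment to a standard pose are simplifying assumptions, not the exact fitting rules used by the annotation tool.

```python
import numpy as np

def fit_eye_contour(upper_pts, lower_pts, num_samples=50):
    """upper_pts, lower_pts: (K, 2) arrays of (x, y) landmarks on each eyelid."""
    def parabola(points):
        # Fit y = a*x^2 + b*x + c through the lid landmarks, then sample densely.
        coeffs = np.polyfit(points[:, 0], points[:, 1], deg=2)
        xs = np.linspace(points[:, 0].min(), points[:, 0].max(), num_samples)
        return np.stack([xs, np.polyval(coeffs, xs)], axis=1)

    upper = parabola(upper_pts)
    lower = parabola(lower_pts)
    return np.concatenate([upper, lower[::-1]], axis=0)  # closed contour
```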

Figure 5. The effectiveness of category-wise fitting approach for facial parts. The left image shows the result by directly connecting neighboring landmarks in each category. The right image is the result by our method. Best viewed in zoom.
Figure 6. The framework for the proposed PPA module. First, the coarse segmentation is applied on the whole image and then the face region is cropped and finely segmented into three categories including hair, skin and background. Meanwhile, the annotation of facial parts are produced by our category-wise fitting approach according to the landmarks. Finally, the outputs are merged hierarchically as a complete annotation.

3) Fusion

After the above two steps, we obtain the label maps for hair/skin and for the facial parts. We then merge them hierarchically into a unified label map. We emphasize that the order of fusion is important for producing correct results. For example, an eye always lies on top of the skin but can be occluded by hair or sunglasses, so if the label map of the eye were painted over the label map of the hair, the result would be unreasonable. Therefore, we merge the label maps in the order of skin, facial parts, hair and background.
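The sketch below illustrates this fusion order: each step overwrites the previous one, so hair, painted last, correctly occludes any facial part it covers. The category ids are illustrative, not the official LaPa ids.

```python
import numpy as np

def fuse_label_maps(skin_mask, part_maps, hair_mask, skin_id=1, hair_id=10):
    """skin_mask, hair_mask: (H, W) binary masks; part_maps: dict of id -> mask."""
    label = np.zeros_like(skin_mask, dtype=np.uint8)  # 0 = background
    label[skin_mask > 0] = skin_id                    # 1) face skin first
    for cat_id, mask in part_maps.items():            # 2) facial parts on top of skin
        label[mask > 0] = cat_id
    label[hair_mask > 0] = hair_id                    # 3) hair on top of everything
    return label
```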

3.3. Dense Landmark Guided Face Parsing Benchmark

We collect 22,000 images from two popular datasets: the landmark localization dataset 300W-LP (Sagonas et al., 2013; Zhu et al., 2016) and the face recognition dataset MegaFace (Kemelmacher-Shlizerman et al., 2016). We randomly select 1,000/2,000 images from MegaFace as the validation/test set, and the remaining images form the training set. All the images are first annotated by the DLA module, and then the color image and the landmarks are fed to the PPA module to obtain the 11-category pixel-level semantic annotation. The label of each pixel denotes the semantic category of the visible texture, so some categories may not be present due to occlusion; for example, an eye may be invisible due to a large pose or occlusion by objects such as sunglasses or hair. In this work, we focus on single-face parsing, and thus only the major face is annotated even if multiple faces exist in an image. Fig. 1 shows some examples from the proposed LaPa dataset.

Figure 7. The network structure of BSPNet. It consists of three branches. The semantic-aware branch runs a multi-category semantic segmentation task. The boundary-aware branch runs a two-category boundary segmentation task. The fusion branch takes the combination of the features from the former two branches as input and employs the boundary map to weight the semantic segmentation loss.

Comparison with relevant datasets. Helen (Smith et al., 2013) is a widely used dataset for face parsing, but it still has several limitations: 1) the labeling is not accurate enough, especially for the hair and face skin categories produced by matting, so most works based on Helen focus only on facial components while ignoring hair and skin; 2) the limited number of samples and the lack of pose diversity make it difficult to train large-scale practical models. The LFW-PL (Kae et al., 2013) dataset suffers from the same lack of training images, and only hair and facial skin are annotated, without facial parts. Compared to the existing datasets, the proposed LaPa dataset contains sufficient training samples covering a wide range of variations and provides annotations of fine-grained facial component categories along with accurate hair and facial skin segments. Besides, with the advantage of the semi-automatic labeling framework, the LaPa dataset can easily be scaled up in the future. Tab. 1 gives statistics of the public datasets for face parsing. Meanwhile, we re-annotate the hair and facial skin categories of the Helen dataset with the framework introduced in Sec. 3.2; the visual comparison in Fig. 8 demonstrates the superiority of our framework.

Dataset #Training #Validation #Test #Category
LFW-PL (Kae et al., 2013) 1500 500 927 3
Helen (Smith et al., 2013) 2000 230 100 11
LaPa (Ours) 19000 1000 2000 11
Table 1. Statistics of the public datasets for face parsing. We show the number of images in the training, validation and test sets as well as the number of categories, including background.

4. Boundary-Sensitive Parsing Network

Although a face can be approximately considered a rigid body with limited deformation, face parsing is still difficult due to variations in facial expression and pose. Furthermore, the parsed regions such as eyes and nose are usually smaller than general objects. All these factors make it hard to solve this specific task with general object segmentation or scene parsing methods (Zhao et al., 2017; Long et al., 2015; Badrinarayanan et al., 2017; Lin et al., 2017; Chen et al., 2018; Chen et al., 2017).

To overcome the above limitations and make full use of our LaPa dataset, we propose a novel Boundary-Sensitive Parsing Network (BSPNet). As Fig. 7 shows, BSPNet consists of three branches. The upper branch, called the semantic-aware branch, learns semantic-aware features and infers accurate semantic label maps from input images. Any existing segmentation network structure could be adopted in this branch; here we employ ResNet-101 (He et al., 2016) as the feature extraction backbone. To reduce the resolution loss caused by pooling or strided convolution, dilated convolution (Chen et al., 2018) is adopted in the fifth residual block, so the resolution of the output is 1/16 rather than 1/32 of the input. To leverage global texture information, pyramid spatial pooling with different scales is applied before the classifier. The pooled feature maps at different resolutions are then interpolated to the same scale and concatenated with the high-resolution feature map generated by the last residual block. The integrated feature maps are used to predict the semantic label of each pixel.
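For concreteness, a condensed PyTorch sketch of the pyramid pooling step is shown below; the pooling scales (1, 2, 3, 6) follow the standard PSPNet setting and are an assumption here rather than a detail taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool backbone features at several scales, project, upsample and concatenate."""

    def __init__(self, in_channels, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        out_c = in_channels // len(pool_sizes)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_channels, out_c, kernel_size=1, bias=False),
                          nn.BatchNorm2d(out_c),
                          nn.ReLU(inplace=True))
            for s in pool_sizes])

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)  # integrated semantic-aware features
```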

Similar to (Tao Ruan, 2018), the lower branch of the network, called the boundary-aware branch, is designed for boundary-aware feature learning. It extracts shared features from different layers of the ResNet-101 in the semantic-aware branch and projects them into a new space where boundary details are well preserved. The output of this branch is a boundary map in which each value is the confidence that the corresponding pixel lies on a boundary, regardless of semantics. The ground truth of this branch is computed from the gradients of the label map.
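A simple way to derive such a boundary ground truth from a parsing label map is sketched below: a pixel is marked as boundary wherever the semantic label changes between neighboring pixels, i.e., where the discrete gradient of the label map is non-zero.

```python
import numpy as np

def boundary_ground_truth(label_map):
    """label_map: (H, W) integer semantic labels -> (H, W) binary boundary map."""
    boundary = np.zeros(label_map.shape, dtype=np.uint8)
    boundary[:-1, :] |= (label_map[:-1, :] != label_map[1:, :]).astype(np.uint8)
    boundary[:, :-1] |= (label_map[:, :-1] != label_map[:, 1:]).astype(np.uint8)
    return boundary
```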

It is a common issue that boundary pixels are difficult to distinguish because they are easily confused with adjacent pixels belonging to different categories. Therefore, we further develop a fusion branch to boost the segmentation performance for these "hard samples". This branch takes as input the combination of the features learned by the semantic-aware and boundary-aware branches, which are rich in semantics while preserving boundary details. As in the semantic-aware branch, the output of the fusion branch is a confidence map with C channels, where C denotes the number of semantic categories. Meanwhile, a weight map computed from the boundary map is used to enlarge the loss of boundary pixels.

The loss functions are defined as follows:

L = \lambda_1 L_s + \lambda_2 L_b + \lambda_3 L_f    (1)

L_s = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log p_{ij}    (2)

L_b = -\frac{1}{N} \sum_{i=1}^{N} \left[ b_i \log s_i + (1 - b_i) \log (1 - s_i) \right]    (3)

L_f = -\frac{1}{N} \sum_{i=1}^{N} w_i \sum_{j=1}^{C} y_{ij} \log f_{ij}    (4)

where L refers to the total loss; L_s, L_b and L_f denote the losses of the semantic-aware, boundary-aware and fusion branches, respectively; and λ_1, λ_2 and λ_3 are hyper-parameters balancing the losses of the different branches. N denotes the number of pixels in the whole image and C denotes the number of parsing categories. Here y_ij equals 1 if the semantic label of pixel i is j, and 0 otherwise; b_i is an indicator variable whose value is 1 if pixel i is located on a boundary and 0 otherwise; and p_ij, s_i and f_ij are the prediction values of the semantic-aware, boundary-aware and fusion branches, respectively. To enhance the effect of boundaries, we introduce a weight w_i, with w_i = α if b_i = 1 and w_i = 1 otherwise, where α is set to a positive number to increase the weight of boundary pixels. During the test phase, the fusion-branch predictions f_ij are taken as the output of BSPNet.
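The PyTorch sketch below mirrors Eqs. (1)-(4) as reconstructed above: standard cross-entropy for the semantic-aware branch, binary cross-entropy for the boundary-aware branch, and a boundary-weighted cross-entropy for the fusion branch. It is a hedged illustration rather than the authors' released code; the default hyper-parameter values follow Section 5.1.

```python
import torch
import torch.nn.functional as F

def bspnet_loss(sem_logits, bnd_logits, fus_logits, sem_gt, bnd_gt,
                lambda1=1.0, lambda2=1.0, lambda3=2.0, alpha=200.0):
    """sem_logits, fus_logits: (B, C, H, W); bnd_logits: (B, 1, H, W);
    sem_gt: (B, H, W) long labels; bnd_gt: (B, H, W) float in {0, 1}."""
    loss_s = F.cross_entropy(sem_logits, sem_gt)                                # Eq. (2)
    loss_b = F.binary_cross_entropy_with_logits(bnd_logits.squeeze(1), bnd_gt)  # Eq. (3)
    weights = torch.where(bnd_gt > 0.5,
                          alpha * torch.ones_like(bnd_gt),
                          torch.ones_like(bnd_gt))                              # w_i
    per_pixel = F.cross_entropy(fus_logits, sem_gt, reduction='none')
    loss_f = (weights * per_pixel).mean()                                       # Eq. (4)
    return lambda1 * loss_s + lambda2 * loss_b + lambda3 * loss_f               # Eq. (1)
```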

5. Experiments

In this section, we first demonstrate the effectiveness of the proposed BSPNet on the LaPa dataset, and then evaluate our network on the public Helen dataset. Without utilizing any prior, our model outperforms other state-of-the-art methods.

Model | hair | skin | left eyebrow | right eyebrow | left eye | right eye | nose | upper lip | inner mouth | lower lip | background | mean
Model A | 94.86 | 95.95 | 81.79 | 81.61 | 81.50 | 81.69 | 93.79 | 80.10 | 84.10 | 80.45 | 98.76 | 85.58
Model B | 95.30 | 96.27 | 83.32 | 82.82 | 82.45 | 82.72 | 94.43 | 81.25 | 84.73 | 81.24 | 98.86 | 86.45
Model C (Baseline) | 95.32 | 96.54 | 84.34 | 84.27 | 84.86 | 85.17 | 94.66 | 82.45 | 85.63 | 82.31 | 98.86 | 87.55
Table 2. Ablation study on the LaPa dataset. Model A is trained with only the semantic-aware branch. Model B adds the boundary-aware features to Model A. Model C adds the boundary-aware weighted loss to Model B. The performance of each category, together with the mean F1-score over the 10 foreground categories, is listed.
Method | skin | nose | upper-lip | inner-mouth | lower-lip | brows | eyes | mouth | overall
Smith et al. (Smith et al., 2013) | 88.2 | 92.2 | 65.1 | 71.3 | 70.0 | 72.2 | 78.5 | 85.7 | 80.4
Liu et al. (Liu et al., 2015) | 91.2 | 91.2 | 60.1 | 82.4 | 68.4 | 73.4 | 76.8 | 84.9 | 85.9
Liu et al. (Liu et al., 2017) | 92.1 | 93.0 | 74.3 | 89.1 | 81.7 | 77.0 | 86.8 | 89.1 | 88.6
Guo et al. (Guo et al., 2018) | 93.8 | 94.1 | 75.8 | 83.7 | 83.1 | 80.4 | 87.1 | 92.4 | 90.5
BSPNet | 94.8 | 94.5 | 78.0 | 86.2 | 86.9 | 81.3 | 87.5 | 93.5 | 91.0
BSPNet+LaPa | 95.1 | 94.7 | 80.2 | 86.6 | 86.9 | 81.9 | 87.8 | 93.8 | 91.4
Table 3. Comparison with state-of-the-art methods on the Helen dataset. To keep consistent with other methods, the performances of the hair category and other fine-grained categories (e.g. left/right eyes) are not given. The overall scores are computed by combining the merged brows/eyes/mouth and nose categories. BSPNet+LaPa means the model pretrained on the LaPa dataset is employed as the model initialization.

5.1. Experimental Setting

For both the LaPa and Helen datasets, we adopt similar network configurations. For the semantic-aware branch, the ResNet-101 parameters are initialized from the model pretrained on the ImageNet dataset (Deng et al., 2009). The input size of the network is 473×473, and dilated convolution is used in the last residual block to retain the feature resolution. The extracted features are then processed by a spatial pyramid pooling module with four different scales to aggregate global and local contextual information. For the boundary-aware branch, the feature maps from conv2_3, conv3_4 and conv4_23 in ResNet-101 are concatenated as input. As in (Tao Ruan, 2018), we adopt a positive/negative sample balancing strategy, which takes the ratio of pixels belonging to one class as the weight of the opposite one. For the fusion branch, the last feature maps before the predictors of the semantic-aware and boundary-aware branches are concatenated as the input features. The ground truth of this branch is the same as that of the semantic-aware branch, while the ground truth of the boundary-aware branch is used to generate the weight map.

The network is trained by minimizing the objective function defined in Eq. (1). We use mini-batch gradient descent as the optimizer with a momentum of 0.9, a weight decay of 0.0005 and a batch size of 64. The "poly" learning rate policy is used to update the parameters, and the initial learning rate is set to 0.001. Synchronized Batch Normalization is adopted to accelerate training. λ_1, λ_2 and λ_3 are set to 1, 1, and 2, respectively, and α is set to 200. All hyper-parameters are determined on the validation set. Our experiments are implemented with the PyTorch framework, and all models are trained on 4 NVIDIA Tesla P40 GPUs with 24GB memory.
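As a brief sketch, the "poly" policy decays the learning rate polynomially over training iterations; the decay power of 0.9 below is a common choice and an assumption here, not a value stated in the text.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Learning rate after cur_iter iterations under the "poly" policy."""
    return base_lr * (1 - cur_iter / max_iter) ** power
```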

Following (Smith et al., 2013; Liu et al., 2015; Guo et al., 2018), we adopt the F1-score as the quantitative evaluation metric, which is the harmonic mean of precision and recall:

F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (5)
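Computed per category from a predicted and a ground-truth label map, the metric can be sketched as follows (a small epsilon guards against empty categories):

```python
import numpy as np

def f1_per_category(pred, gt, category, eps=1e-10):
    """pred, gt: (H, W) integer label maps; returns the F1-score of one category."""
    tp = np.sum((pred == category) & (gt == category))
    fp = np.sum((pred == category) & (gt != category))
    fn = np.sum((pred != category) & (gt == category))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)
```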
Figure 8. Comparison results on the Helen dataset. The first row refers to the original color image. The second row is the ground truth provided by Smith et al. (Smith et al., 2013). The third row shows the results where the hair and face skin categories are re-annotated by the proposed framework.
Figure 9. Visualizing the results on the LaPa dataset. The first column is the original image, in which the red dashed rectangles indicate the challenging parts. The next three columns are zoom-in results of the comparison methods (i.e., Model A, Model B and Model C). The last column shows the ground truth of the corresponding images.

5.2. Ablation Study

To evaluate the effectiveness of the proposed network, three kinds of models are trained on the LaPa dataset. Model A is trained with only the semantic-aware branch and does not utilize any auxiliary boundary information. Model B is trained with all three branches but without additional loss weights for boundary pixels, which is equivalent to setting w_i = 1 for all pixels in Eq. (4). Model C is the proposed BSPNet, which utilizes both the boundary-aware features and the boundary-aware weighted loss. As Tab. 2 shows, the performance of Model B is consistently better than that of Model A on all categories, while Model C achieves the best results of all. Specifically, the mean F1-score over the 10 semantic categories (without background) of Model B is 86.45%, 0.87% higher than Model A, while Model C further improves the accuracy to 87.55%, which is 1.97% and 1.1% higher than Model A and Model B, respectively. For the hair, face skin and nose categories, whose sizes are relatively large, Model C achieves 0.46%, 0.59% and 0.87% improvements over Model A, respectively. For the small-size categories of left/right eyebrow, left/right eye, upper/lower lip and inner mouth, Model C improves over Model A by 2.55%/2.66%, 3.36%/3.48%, 2.35%/1.86% and 1.53%, respectively. Fig. 9 shows the visualization results of the three models. The experimental results demonstrate that the proposed BSPNet is effective for face parsing, especially for small categories. The performance of BSPNet (Model C in Tab. 2) can be taken as a baseline result for further research on the proposed LaPa benchmark.

5.3. Comparison with State-of-the-art Methods

5.3.1. Dataset and ground truth

Since the original Helen dataset (Le et al., 2012) was built for facial feature localization rather than parsing, Smith et al. (Smith et al., 2013) took several steps to convert the densely labeled landmarks into segmentation maps. Specifically, ground truth segments of facial parts are automatically generated from manually annotated contours. For facial skin, the jawline contour is used as the lower boundary, while for the upper boundary, an automatic matting algorithm (Levin et al., 2008) is used to separate the forehead from hair; the same matting strategy is adopted to recover the hair region. To make the dataset suitable for semantic segmentation, we first convert the confidence maps, which range from 0 to 255, into label maps by selecting the category with the maximum confidence value. As Fig. 8 shows, this causes incorrect ground truth in some cases, especially for the hair category. As the number of training images is limited, many methods adopt an exemplar-based approach, which needs to set aside some images as exemplars during the training and test phases; as in (Liu et al., 2015; Guo et al., 2018), 230 images are split off as exemplars. In contrast, our method directly outputs a per-pixel prediction from the network, so we use all the training and exemplar images to train the model, while the test images remain the same as in (Liu et al., 2015; Guo et al., 2018).

5.3.2. Experimental results

As the previous works (Smith et al., 2013; Liu et al., 2015, 2017; Guo et al., 2018) do not report the performance of the hair and fine-grained (i.e., left/right eyebrow and left/right eye) categories on the Helen dataset, the mean score over the foreground categories cannot be computed as in Tab. 2. To keep consistent with the previous methods, we report our results on the skin, nose, upper-lip, inner-mouth, lower-lip, merged brows, merged eyes and merged mouth categories. The overall scores are computed by combining the merged brows, merged eyes, merged mouth and nose categories, without considering the fine-grained categories. As Tab. 3 shows, our model achieves the best results over other state-of-the-art methods on all categories. We emphasize that our performance on the upper-lip, inner-mouth and lower-lip categories improves by 2.2%, 2.5% and 3.8% over Guo et al. (Guo et al., 2018), while the accuracy of the merged mouth category is only 1.1% higher; this indicates that the upper lip, inner mouth and lower lip are severely confused by Guo et al. (Guo et al., 2018), a confusion that the merged accuracy hides. Our model achieves the best overall score of 91.0%, outperforming Smith et al. (Smith et al., 2013), Liu et al. (Liu et al., 2015), Liu et al. (Liu et al., 2017) and Guo et al. (Guo et al., 2018) by 10.6%, 5.1%, 2.4% and 0.5%, respectively. In addition, when the model pretrained on the LaPa dataset is used for initialization and fine-tuned on the Helen dataset, the accuracy of each category is further improved and the overall score reaches 91.4%, which demonstrates the superiority of our dataset.

6. Conclusion

In this paper, we develop a high-efficiency framework for face parsing annotation, which significantly simplifies accurate pixel-level semantic annotation for face parsing. Benefiting from this framework, we construct a new benchmark for face parsing. It consists of 22,000 face images, and each image is provided with an 11-category semantic label map along with the coordinates of 106-point landmarks. To the best of our knowledge, this is the largest public dataset for face parsing so far. Furthermore, we propose a simple yet effective boundary-sensitive parsing network, which boosts segmentation performance by integrating boundary-aware features implicitly and weighting the boundary-pixel loss explicitly. Experiments on Helen and the proposed LaPa dataset demonstrate the effectiveness of our network.

References

  • Badrinarayanan et al. (2017) Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39, 12 (2017), 2481–2495.
  • Baltrusaitis et al. (2013) Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. 2013. Constrained local neural fields for robust facial landmark detection in the wild. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 354–361.
  • Blanz and Vetter (2003) Volker Blanz and Thomas Vetter. 2003. Face recognition based on fitting a 3d morphable model. IEEE Transactions on pattern analysis and machine intelligence 25, 9 (2003), 1063–1074.
  • Bulat and Tzimiropoulos (2017) Adrian Bulat and Georgios Tzimiropoulos. 2017. How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision. 1021–1030.
  • Chen et al. (2018) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2018. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2018), 834–848.
  • Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).
  • Cootes et al. (2001) Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. 2001. Active appearance models. IEEE Transactions on Pattern Analysis & Machine Intelligence 6 (2001), 681–685.
  • Cootes et al. (1995) Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. 1995. Active shape models-their training and application. Computer vision and image understanding 61, 1 (1995), 38–59.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
  • Dong et al. (2018) Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. 2018. Style aggregated network for facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 379–388.
  • Gross et al. (2010) Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. 2010. Multi-pie. Image and Vision Computing 28, 5 (2010), 807–813.
  • Guo et al. (2018) Tianchu Guo, Youngsung Kim, Hui Zhang, Deheng Qian, ByungIn Yoo, Jingtao Xu, Dongqing Zou, Jae-Joon Han, and Changkyu Choi. 2018. Residual Encoder Decoder Network and Adaptive Prior for Face Parsing. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Hasan et al. (2013) Md Hasan, Christopher Pal, and Sharon Moalem. 2013. Localizing facial keypoints with global descriptor search, neighbour alignment and locally linear models. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 362–369.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Jackson et al. (2016) Aaron S Jackson, Michel Valstar, and Georgios Tzimiropoulos. 2016. A CNN cascade for landmark guided semantic part segmentation. In European Conference on Computer Vision. Springer, 143–155.
  • Jaiswal et al. (2013) Shashank Jaiswal, Timur Almaev, and Michel Valstar. 2013. Guided unsupervised learning of mode specific models for facial point detection in the wild. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 370–377.
  • Kae et al. (2013) Andrew Kae, Kihyuk Sohn, Honglak Lee, and Erik Learned-Miller. 2013. Augmenting CRFs with Boltzmann machine shape priors for image labeling. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2019–2026.
  • Kemelmacher-Shlizerman et al. (2016) Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. 2016. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4873–4882.
  • Koestinger et al. (2011) Martin Koestinger, Paul Wohlhart, Peter M Roth, and Horst Bischof. 2011. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE international conference on computer vision workshops (ICCV workshops). IEEE, 2144–2151.
  • Lai et al. (2018) Hanjiang Lai, Shengtao Xiao, Yan Pan, Zhen Cui, Jiashi Feng, Chunyan Xu, Jian Yin, and Shuicheng Yan. 2018. Deep recurrent regression for facial landmark detection. IEEE Transactions on Circuits and Systems for Video Technology 28, 5 (2018), 1144–1157.
  • Le et al. (2012) Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S Huang. 2012. Interactive facial feature localization. In European conference on computer vision. Springer, 679–692.
  • Levin et al. (2008) Anat Levin, Alex Rav-Acha, and Dani Lischinski. 2008. Spectral matting. IEEE transactions on pattern analysis and machine intelligence 30, 10 (2008), 1699–1712.
  • Lin et al. (2017) Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. 2017. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1925–1934.
  • Liu et al. (2017) Sifei Liu, Jianping Shi, Ji Liang, and Ming-Hsuan Yang. 2017. Face parsing via recurrent propagation. arXiv preprint arXiv:1708.01936 (2017).
  • Liu et al. (2015) Sifei Liu, Jimei Yang, Chang Huang, and Ming-Hsuan Yang. 2015. Multi-objective convolutional learning for face labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3451–3459.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3431–3440.
  • Luo et al. (2012) Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2012. Hierarchical face parsing via deep learning. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2480–2487.
  • Merget et al. (2018) Daniel Merget, Matthias Rock, and Gerhard Rigoll. 2018. Robust facial landmark detection via a fully-convolutional local-global context network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 781–790.
  • Messer et al. (1999) Kieron Messer, Jiri Matas, Josef Kittler, Juergen Luettin, and Gilbert Maitre. 1999. XM2VTSDB: The extended M2VTS database. In Second international conference on audio and video-based biometric person authentication, Vol. 964. 965–966.
  • Newell et al. (2016) Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer, 483–499.
  • Ou et al. (2016) Xinyu Ou, Si Liu, Xiaochun Cao, and Hefei Ling. 2016. Beauty eMakeup: A deep makeup transfer system. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, 701–702.
  • Ramanan and Zhu (2012) Deva Ramanan and Xiangxin Zhu. 2012. Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2879–2886.
  • Sagonas et al. (2013) Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 2013. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 397–403.
  • Smith et al. (2013) Brandon M Smith, Li Zhang, Jonathan Brandt, Zhe Lin, and Jianchao Yang. 2013. Exemplar-based face parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3484–3491.
  • Sun et al. (2013) Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2013. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3476–3483.
  • Tao Ruan (2018) Tao Ruan, Ting Liu, Zilong Huang, Yunchao Wei, Shikui Wei, Yao Zhao, and Thomas Huang. 2018. Devil in the Details: Towards Accurate Single and Multiple Human Parsing. arXiv:1809.05996 (2018).
  • Warrell and Prince (2009) Jonathan Warrell and Simon JD Prince. 2009. Labelfaces: Parsing facial features by multiclass labeling with an epitome prior. In 2009 16th IEEE international conference on image processing (ICIP). IEEE, 2481–2484.
  • Wei et al. (2017) Zhen Wei, Yao Sun, Jinqiao Wang, Hanjiang Lai, and Si Liu. 2017. Learning adaptive receptive fields for deep image parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2434–2442.
  • Wu et al. (2018) Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. 2018. Look at boundary: A boundary-aware face alignment algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2129–2138.
  • Xiong and De la Torre (2013) Xuehan Xiong and Fernando De la Torre. 2013. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition. 532–539.
  • Yamashita et al. (2015) Takayoshi Yamashita, Takaya Nakamura, Hiroshi Fukui, Yuji Yamauchi, and Hironobu Fujiyoshi. 2015. Cost-alleviative learning for deep convolutional neural network-based facial part labeling. IPSJ Transactions on Computer Vision and Applications 7 (2015), 99–103.
  • Yang et al. (2017) Jing Yang, Qingshan Liu, and Kaihua Zhang. 2017. Stacked hourglass network for robust facial landmark localisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 79–87.
  • Zhang et al. (2018) He Zhang, Benjamin S Riggan, Shuowen Hu, Nathaniel J Short, and Vishal M Patel. 2018. Synthesis of High-Quality Visible Faces from Polarimetric Thermal Faces using Generative Adversarial Networks. International Journal of Computer Vision (2018), 1–18.
  • Zhang et al. (2014) Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2014. Facial landmark detection by deep multi-task learning. In European conference on computer vision. Springer, 94–108.
  • Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2881–2890.
  • Zhou et al. (2013) Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. 2013. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 386–391.
  • Zhu et al. (2016) Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. 2016. Face alignment across large poses: A 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition. 146–155.