End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks

03/09/2017 ∙ by Umut Güçlü, et al.

Recent years have seen a sharp increase in the number of related yet distinct advances in semantic segmentation. Here, we tackle this problem by leveraging the respective strengths of these advances. That is, we formulate a conditional random field over a four-connected graph as end-to-end trainable convolutional and recurrent networks, and estimate them via an adversarial process. Importantly, our model learns not only unary potentials but also pairwise potentials, while aggregating multi-scale contexts and controlling higher-order inconsistencies. We evaluate our model on two standard benchmark datasets for semantic face segmentation, achieving state-of-the-art results on both of them.

1 Introduction

Semantic segmentation is an important topic in computer vision, primarily because of its many applications in object recognition, image annotation, image coding, scene understanding and biomedical image processing. One specific field of semantic segmentation is face segmentation, in which the task is to correctly assign labels of face regions such as nose, mouth, eye, hair, etc. to each pixel in a face image. Face segmentation techniques are frequently used in security systems and in the field of human computer interaction, mainly in order to facilitate face detection and recognition [1, 2, 3, 4], and emotion/expression recognition [5, 6, 7]. Further specialized entertainment oriented applications of face segmentation include style transfer [8], virtual make-up application [9], virtual face-swapping [10] and 3D performance capturing [11]. Additionally, utilization of face segmentation techniques could potentially improve the performance in several other computer vision tasks involving the processing of face images such as apparent personality prediction [12] and face hallucination [13].

Semantic segmentation of faces is a difficult problem because of the large number of variable conditions that need to be considered, especially when applied to face pictures taken in uncontrolled environments. These conditions include variations in facial expression, skin color, lighting, image quality, pose, hair texture and style, as well as the presence of varying amounts of background clutter and occlusions. Furthermore, despite extensive studies in face segmentation, correctly classifying hair pixels still remains a particularly challenging task [14], largely due to the inherent properties of hair such as color similarity to background, non-rigidity and non-unique shape.

Recently, there have been a sizable number of advances in semantic segmentation. In the context of semantic image segmentation, [15] showed that formulating the iterative update equation of a conditional random field (CRF) over a fully-connected graph [16] as a recurrent neural network (RNN) resulted in state-of-the-art accuracy on the PASCAL VOC 2012 dataset [17]. While this model did not learn the pairwise potential of the CRF and relied on fixed Gaussian kernels, it was end-to-end trainable. In the context of semantic face segmentation, [18] showed that formulating the unary potential and the pairwise potential of a CRF over a four-connected graph as a convolutional neural network (CNN) resulted in state-of-the-art accuracy on the Part Labels dataset [19, 20] and the Helen dataset [21, 22]. While this model was not end-to-end trainable and relied on graph cuts, it learned both the unary potential and the pairwise potential of the CRF.

Furthermore, [23] showed that the results of convolutional semantic segmentation models can be improved by using dilated kernels instead of regular kernels, which increase receptive field size without decreasing receptive field resolution. Similarly, [24] showed that the results of convolutional semantic segmentation models can be improved by using an adversarial loss function in addition to a segmentation loss function, which enforces higher-order consistencies without explicitly taking into account any higher-order potentials.

Here, our goal is to formulate a model for semantic face segmentation by combining the respective strengths of the aforementioned models. That is, the model should be end-to-end trainable like [15], and learn both the unary potential and the pairwise potential of a CRF over a four-connected graph like [18] while aggregating multi-scale contexts like [23] and controlling higher-order inconsistencies like [24]. Table 1 shows an overview of the differences and the similarities between our model (i.e., CnnRnnGan), its variants (i.e., Cnn, CnnGan and CnnRnn), and the recent models that they are based on.

Columns: adversarial training; conditional random field ($\psi_u$), ($\psi_p$), (end-to-end); dilated conv.
Yu and Koltun (2015) [23] - - -
Liu et al. (2015) [18]
Zheng et al. (2015) [15]
Luc et al. (2016) [24] - - -
Cnn (Ours) - - -
CnnGan (Ours) - - -
CnnRnn (Ours)
CnnRnnGan (Ours)
Table 1: An overview of the differences and the similarities between the variants of our model, and the recent semantic segmentation models that they are based on. $\psi_u$ and $\psi_p$ denote learned instead of fixed unary potentials and pairwise potentials, respectively.

The contributions of our work are the following:

  1. We propose an end-to-end trainable convolutional and recurrent network formulation of a conditional random field over a four-connected graph with learnable unary potentials and pairwise potentials in which dilated convolutions and adversarial training are used for aggregating multi-scale contexts and controlling higher-order inconsistencies, respectively.

  2. We exploit the structured nature of faces by conditioning the model on face landmarks, and/or training multiple models for different face landmarks and combining their outputs akin to part-based models.

  3. We evaluate the model on two standard semantic face segmentation datasets (i.e., Part Labels and Helen), achieving state-of-the-art results on both of them while considerably improving the segmentation accuracy of challenging face parts such as hair.

The rest of this paper is organized as follows: In the next section, we overview the recent work on semantic segmentation in general and semantic face segmentation in particular. In Section 3, we present our model. In Section 4, we present the results of the main experiments, in which we evaluate our model on both the Part Labels dataset and the Helen dataset, compare the obtained results versus the state-of-the-art, and present the results of the ablation experiments, in which we evaluate variants of our model on the same datasets. In the last section, we conclude with an overview of our work.

2 Related work

Semantic segmentation has been widely studied in computer vision in a wide spectrum of domains. For a comprehensive review of classical approaches for semantic segmentation, we refer the reader to [25]. In this section, we review recent work on semantic segmentation in general and semantic face segmentation in particular.

The most recent state-of-the-art semantic segmentation models almost exclusively rely on convolutional neural networks. In contrast to earlier approaches where recognition architectures were directly used for semantic segmentation [26], current approaches utilize architectures that are carefully adapted for the task at hand. [27] proposed the first such approach, where the fully-connected layers of popular architectures such as AlexNet [28], VGGNet [29] and GoogLeNet [30] were replaced with (de)convolution layers and combined with earlier layers to enable dense and high resolution predictions. Since then, this approach has been continuously improved by the introduction of more sophisticated architectures, which enabled denser [31, 32, 33], higher resolution [34] and/or encoder-decoder [35] predictions. In particular, [23] proposed dilated convolutions for dense prediction, where contexts could be aggregated at multiple scales without loss of either resolution or coverage. This idea has been extended by [36] to enable a larger field of view through spatial pyramid pooling. Such approaches enjoy the benefits of dense and high resolution predictions without the burden of extra parameters.

At the same time, conditional random fields have been used in semantic segmentation for postprocessing outputs of region-level or pixel-level semantic segmentation models. While the relatively small number of outputs of region-based semantic segmentation models could be postprocessed by CRFs with dense pairwise connectivity [20], the relatively large number of outputs of pixel-level semantic segmentation models could only be postprocessed by CRFs with sparse pairwise connectivity [18]. In a seminal work, [16] proposed an efficient iterative algorithm for approximate inference in fully-connected CRFs with Gaussian edge potentials, which has been widely adopted for postprocessing outputs of pixel-level segmentation models [31, 32, 33]. [15] formulated this algorithm as a recurrent neural network, which is trained along with a pixel-level segmentation model instead of postprocessing it. This formulation is reminiscent of the pixel-level semantic segmentation model in [37], whose outputs were iteratively refined with a recurrent convolutional neural network.

Recently, generative adversarial networks (GANs) [38] have received particular attention in computer vision [39, 40, 41]. The idea behind GANs is training a discriminator and a generator by letting them play a two-player minimax game. In this game, the objective of the discriminator is distinguishing samples that are drawn from the data distribution from samples that are drawn from the model distribution, and the objective of the generator is fooling the discriminator. While GANs have been proposed for estimating generative models via an adversarial process, they have been widely adopted as loss functions for other tasks such as inpainting [42], style transfer [43] and super-resolution [44]. In particular, [24] estimated a semantic segmentation model via an adversarial process by training a discriminator for distinguishing ground-truths from outputs of the semantic segmentation model and the semantic segmentation model for fooling the discriminator. They showed that this process leads to improved results on the Stanford Background dataset [45] and the PASCAL VOC 2012 dataset [17].

There have been relatively few semantic face segmentation models that rely on convolutional neural networks. Most earlier models were based on CRFs [20], hand designed features [46] and exemplars [22]. Kae et al. [20] modeled global part dependencies using a restricted Boltzmann machine to have an overall realistic shape while local shape details were modeled through a CRF, whereas Smith et al. [22] used exemplar-based non-rigid warping for face segmentation. Despite the progress in the models, hair segmentation is still the most challenging part due to its color and style variability. Earlier works include attempts at modeling hair, skin and background color [47, 48], mixtures of hair styles [49] or MRF/CRF labeling [50]. As a specialized hair segmentation approach, Wang et al. [51] applied co-occurrence probabilities of face components identified by a Markov random field. The final segmentations were constrained in a tree-structured model built over part co-occurrences. Among CNN-based face segmentation methods, [52] segmented faces based on a hierarchical part detection process, where the face was detected as the root of the hierarchy and the smallest components of the face were detected at the bottom of the hierarchy. Then, an autoencoder network transformed those detected components into label maps. As a result of using hierarchies, partially occluded faces could be easily handled. Recently, [53] applied the part detection idea by training one network for each part and mapping the segmentation result to the original image. [18] generated pairwise terms as class edge potentials through a four-connected graph. Such edge potentials extracted in a multi-objective network along with unary terms were trained using non-structured loss functions, and provided prior knowledge to the network by including inaccurate segmentations as an additional network input. This study showed the benefits of including prior knowledge for improving face segmentation.

3 End-to-end semantic face segmentation

For end-to-end semantic face segmentation, we formulate a conditional random field as a composition of a convolutional neural network and a recurrent neural network (Section 3.1). The convolutional neural network is used for obtaining the unary potential and the pairwise kernels of the conditional random field as a function of an input face and its initial segmentation (Section 3.2). The recurrent neural network is used for obtaining the label compatibility function and a mean field approximation of the Gibbs distribution of the conditional random field as a function of the unary potential and the pairwise kernels of the conditional random field (Section 3.3). In the training phase, a discriminator and the conditional random field play a two-player minimax game, in which the objective of the discriminator is distinguishing ground-truth segmentations from final segmentations, and the objective of the conditional random field is fooling the discriminator (Section 3.4). Fig. 1 illustrates our model. The following sections present the components of our model in detail.

Figure 1: Our semantic face segmentation model. The conditional random field is formulated as a composition of two neural networks: i) A convolutional neural network, which nonlinearly transforms an input face and its initial segmentation to the unary potential and the pairwise kernels of the conditional random field. ii) A recurrent neural network, which transforms the unary potential and the pairwise kernels of the conditional random field to the final segmentation of the input face. In the training phase, a discriminator and the conditional random field play a two-player minimax game, in which the objective of the discriminator is distinguishing ground-truth segmentations from final segmentations, and the objective of the conditional random field is fooling the discriminator.

Prior to entering the model, an input face is preprocessed as follows: A template face is obtained by averaging the faces in the training set of the Part Labels dataset. Sixty-eight landmarks of the template face and the input face are detected by using the dlib implementation [54] of an ensemble of regression trees [55]. (Note that the dataset that was used for training the landmark detection model provided by dlib contains some of the images that we use to test our final segmentation model. To avoid circular analysis, we retrained the landmark detection model on the same dataset that it was originally trained on after removing these images.) An initial segmentation of the input face is obtained by filling the regions that are formed by connecting the landmarks around background, face skin, left eyebrow, right eyebrow, left eye, right eye, nose, upper lip, inner mouth and lower lip. A similarity transformation from the landmarks of the input face to the landmarks of the template face is estimated. The input face and its initial segmentation are warped to the template face by using the similarity transformation, and resized to 500 pixels × 500 pixels. The final segmentation of the input face is obtained by using our model. Optionally, the final segmentation of the input face can be resized back from 500 pixels × 500 pixels and warped back from the template face by using the inverse of the similarity transformation. Fig. 2 illustrates our preprocessing pipeline.

Figure 2: Our semantic face segmentation pipeline. 1. Sixty-eight landmarks of the input face are detected. 2. An initial segmentation of the input face is obtained. 3. The input face and the initial segmentation of the input face are warped to the template face by using a similarity transformation, and resized to 500 pixels × 500 pixels. 4. The final segmentation of the face is obtained by using our model. 5. Optionally, the final segmentation of the input face can be resized back from 500 pixels × 500 pixels and warped back from the template face by using the inverse of the similarity transformation.
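The similarity transformation step of the pipeline can be sketched as follows. This is a minimal illustration rather than the implementation used for the experiments: it assumes landmarks are given as (N, 2) NumPy arrays, estimates the transformation by ordinary least squares, and does not cover image warping. The function names and the random "landmarks" are hypothetical.

```python
import numpy as np


def estimate_similarity(src, dst):
    """Least-squares similarity transform that maps src landmarks onto dst.

    Points are (N, 2) arrays. Each point is treated as a complex number, so the
    transform is w = a * z + b, where |a| is the scale, angle(a) the rotation
    and b the translation.
    """
    z = src[:, 0] + 1j * src[:, 1]
    w = dst[:, 0] + 1j * dst[:, 1]
    zc, wc = z - z.mean(), w - w.mean()
    a = np.vdot(zc, wc) / np.vdot(zc, zc)  # np.vdot conjugates its first argument
    b = w.mean() - a * z.mean()
    return a, b


def apply_similarity(a, b, points):
    """Apply w = a * z + b to an (N, 2) array of points."""
    w = a * (points[:, 0] + 1j * points[:, 1]) + b
    return np.stack([w.real, w.imag], axis=1)


def invert_similarity(a, b):
    """Parameters of the inverse transform z = (w - b) / a."""
    return 1.0 / a, -b / a


# Hypothetical example: 68 random "landmarks" and a known transform.
rng = np.random.default_rng(0)
input_landmarks = rng.uniform(0.0, 100.0, size=(68, 2))
true_a, true_b = 1.5 * np.exp(1j * np.pi / 6), 10.0 - 5.0j
template_landmarks = apply_similarity(true_a, true_b, input_landmarks)

a, b = estimate_similarity(input_landmarks, template_landmarks)
inv_a, inv_b = invert_similarity(a, b)
restored = apply_similarity(inv_a, inv_b, template_landmarks)
```

The inverse parameters correspond to the optional last step of the pipeline, in which the final segmentation is warped back from the template face.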

3.1 Conditional random field

We begin the exposition of our model by considering a conditional random field over a four-connected graph. Let $X = \{x_i\}$ and $Y = \{y_i\}$ be random fields, where $x_i$ and $y_i$ are the color vector and the label of the pixel $i$, respectively. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a four-connected graph, where $\mathcal{V}$ contains all pixels, and $\mathcal{E}$ contains all pixel pairs that have a taxicab metric of one.

The conditional random field over $\mathcal{G}$ is defined by the following Gibbs distribution:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left(-E(Y \mid X)\right) \qquad (1)$$

where $Z(X)$ is the partition function, and $E(Y \mid X)$ is the following Gibbs energy:

$$E(Y) = \sum_{i \in \mathcal{V}} \psi_u(y_i) + \sum_{(i, j) \in \mathcal{E}} \psi_p(y_i, y_j) \qquad (2)$$

where $\psi_u(y_i)$ is the unary potential, which is the cost of assigning the label $y_i$ to the pixel $i$, and $\psi_p(y_i, y_j)$ is the pairwise potential, which is the cost of assigning the labels $y_i$ and $y_j$ to the pixels $i$ and $j$, respectively. Note that we omit conditioning on $X$ for notational convenience. The pairwise potential is of the following form:

$$\psi_p(y_i, y_j) = \mu(y_i, y_j) \, k_{ij} \qquad (3)$$

where $\mu$ is a label compatibility function, which is not assumed to be symmetric since relaxing this assumption was shown to improve semantic segmentation results [15], and $k_{ij}$ are arbitrary pairwise kernels.

Following [16], we approximate the Gibbs distribution with the mean field distribution that minimizes the Kullback–Leibler divergence between the Gibbs distribution and the distributions that are of the following form:

$$Q(Y) = \prod_{i \in \mathcal{V}} Q_i(y_i) \qquad (4)$$

(While the Gibbs energy can be converted to a submodular energy, which makes exact inference (e.g. with combinatorial min cut/max flow algorithms) possible, we resort to approximate inference (i.e. with mean field theory) to be able to formulate it as a recurrent neural network, which makes end-to-end training possible.)

This approximation results in the following iterative update equation:

$$Q_i(y_i = l) = \frac{1}{Z_i} \exp\left(-\psi_u(l) - \sum_{l'} \mu(l, l') \sum_{j : (i, j) \in \mathcal{E}} k_{ij} \, Q_j(l')\right) \qquad (5)$$

3.2 Convolutional neural network

Following [18], we formulate the unary potential $\psi_u$ and the pairwise kernels $k$ as a convolutional neural network, whose architecture is inspired by recent architectures proposed in [23, 56, 57].

The network comprises the following layers:

  1. One convolution layer that has 32 kernels of size with no nonlinearities.

  2. Five blocks, where each block comprises the following layers:

    1. Two parallel convolution layers that have 64 kernels of size with no nonlinearities (i.e. bias layer) and 64 dilated kernels of size with gated activation units [56] (i.e. weight layer). The input of the bias layer is the initial segmentation. The output of the bias layer is summed with the activation of the weight layer. The output of the weight layer becomes the input of the next layer.

    2. Two parallel convolution layers that have 64 kernels of size with no nonlinearities (i.e. residual layer) and 64 kernels of size with rectified linear units (i.e. skip layer). The output of the residual layer is summed with the input of the block, which becomes the input of the next layer. The output of the skip layer is concatenated with the outputs of the skip layers of the remaining blocks along the channel axis, which becomes the input of the next layer after the last block.

  3. One convolution layer that has 160 kernels of size with rectified linear units.

  4. Two parallel convolution layers that have kernels of size with no nonlinearities (i.e., the unary potential $\psi_u$) and four kernels of size with exponential units (i.e., the pairwise kernels $k$).

Dilated kernels are the same as the regular kernels with the exception that successive kernel elements have holes between each other, whose size is determined by a dilation factor. As a result, they increase receptive field size without decreasing receptive field resolution. Note that regular convolution layers can be considered dilated convolution layers with a dilation factor of one.

The dilation factor of the first block is one, which is doubled after every block. The number of blocks (i.e., five) is chosen to be the largest possible value such that the receptive field dimensions of the last block are less than or equal to the pixel dimensions. That is:

(6)

where is the number of blocks.
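The choice of five blocks can be checked with a short calculation. The sketch below is illustrative only and rests on two assumptions that are not stated in the surviving text: the dilated kernels are taken to be 3 × 3, and all other layers are taken not to enlarge the receptive field. The input sizes of 96 and 80 pixels are the crop sizes reported in Section 4.4.

```python
def receptive_field(num_blocks, kernel_size=3):
    """Receptive field along one axis of a stack of dilated convolutions.

    The dilation factor is one in the first block and doubles after every
    block. kernel_size=3 is an assumption, not a value taken from the text.
    """
    field = 1
    for block in range(num_blocks):
        dilation = 2 ** block
        field += (kernel_size - 1) * dilation
    return field


def max_blocks(input_size, kernel_size=3):
    """Largest number of blocks whose receptive field fits in input_size."""
    blocks = 0
    while receptive_field(blocks + 1, kernel_size) <= input_size:
        blocks += 1
    return blocks


print(receptive_field(5))  # 63
print(receptive_field(6))  # 127
print(max_blocks(96))      # 5
print(max_blocks(80))      # 5
```

Under these assumptions, a sixth block would push the receptive field to 127 pixels, which exceeds both crop sizes, so five is the largest admissible number of blocks.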

3.3 Recurrent neural network

Following [15], we formulate the label compatibility function $\mu$ and the iterative update equation as a recurrent neural network. The network comprises (i) a message passing layer, (ii) a compatibility transform layer, and (iii) a local update and normalization layer. Note that only the compatibility transform layer has free parameters.

The layers are implemented as follows: Let $\psi_u$ be a $C \times H \times W$ tensor and $k$ be a $4 \times H \times W$ tensor, which are the outputs of the convolutional neural network, where $C$ is the number of labels and $H \times W$ is the image size. Prior to the first iteration, $Q$ is initialized with $\mathrm{softmax}(-\psi_u)$, and the channels of $k$ are broadcasted to the shape of $Q$, which results in a set of four $C \times H \times W$ tensors.

  • In the message passing layer, $Q$ is shifted up, right, down and left by one pixel, and multiplied (i.e. Hadamard product) with the corresponding elements of $k$, which results in a set of four $C \times H \times W$ tensors. The elements of this set are summed. As a result, this layer outputs a $C \times H \times W$ tensor, which becomes the input of the next layer.

  • In the compatibility transform layer, the input tensor is convolved with $C$ kernels of size 1 × 1. As a result, this layer outputs a $C \times H \times W$ tensor, which becomes the input of the next layer.

  • In the local update and normalization layer, the input tensor is subtracted from $-\psi_u$, exponentiated and normalized (i.e., softmax function). As a result, this layer outputs a $C \times H \times W$ tensor, which becomes the input of the first layer after the first four iterations and the output of the network after the fifth iteration.
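The three layers above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the original Chainer implementation: the unary tensor is assumed to hold costs of shape (C, H, W), the kernel tensor is assumed to hold one non-negative weight map per neighbour in the order up, right, down, left, pixels outside the image are assumed to contribute nothing, and the 1 × 1 convolution is written as a matrix product over the label axis. All names are hypothetical.

```python
import numpy as np


def shift(q, dy, dx):
    """Shift a (C, H, W) tensor by (dy, dx) pixels, filling with zeros."""
    out = np.zeros_like(q)
    h, w = q.shape[-2:]
    dst_y, src_y = slice(max(0, dy), min(h, h + dy)), slice(max(0, -dy), min(h, h - dy))
    dst_x, src_x = slice(max(0, dx), min(w, w + dx)), slice(max(0, -dx), min(w, w - dx))
    out[:, dst_y, dst_x] = q[:, src_y, src_x]
    return out


def softmax(x):
    """Softmax over the label axis (axis 0)."""
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def mean_field(unary, kernels, compat, num_iterations=5):
    """Mean field inference on a four-connected grid.

    unary:   (C, H, W) costs of assigning each label to each pixel
    kernels: (4, H, W) pairwise kernels for the up, right, down, left shifts
    compat:  (C, C) label compatibility matrix (the only free parameters)
    """
    offsets = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # assumed shift convention
    q = softmax(-unary)
    for _ in range(num_iterations):
        # Message passing: shift, weight by the broadcast kernels and sum.
        message = sum(kernels[n][None] * shift(q, dy, dx)
                      for n, (dy, dx) in enumerate(offsets))
        # Compatibility transform: a 1 x 1 convolution over the label axis.
        pairwise = np.einsum('lm,mhw->lhw', compat, message)
        # Local update and normalization.
        q = softmax(-unary - pairwise)
    return q


# Hypothetical example with three labels on a 6 x 6 image.
rng = np.random.default_rng(0)
unary = rng.normal(size=(3, 6, 6))
kernels = np.exp(rng.normal(size=(4, 6, 6)))  # exponential units keep them positive
q = mean_field(unary, kernels, compat=1.0 - np.eye(3))
```

With an all-zero compatibility matrix the pairwise term vanishes and the output reduces to the softmax of the negated unary costs, which is a convenient sanity check.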

3.4 Adversarial training

While this end-to-end trainable convolutional and recurrent neural network formulation of a conditional random field can learn both the unary potential and the pairwise potential, it does not take into account any higher-order potentials that can enforce higher-order consistencies. To be able to enforce higher-order consistencies without explicitly taking into account any higher-order potentials, we train the model by minimizing an adversarial loss function in addition to a segmentation loss function [24].

To this end, we train a discriminator along with our model, which is from now on referred to as the generator. We denote the output of the discriminator as $D(\cdot)$, which is the probability that the input of the discriminator is a ground-truth segmentation. We denote the output of the generator as $G(\cdot)$, which is the probabilities of assigning each of the labels to each of the pixels of the input.

In this context, the goal of the discriminator is to distinguish ground-truth segmentations from generated segmentations, whereas the goal of the generator is to generate segmentations that are indistinguishable from ground-truth segmentations. That is, they play the following minimax game:

$$\min_G \max_D \; \mathbb{E}_{y \in \mathcal{Y}}\left[\log D(y)\right] + \mathbb{E}_{x \in \mathcal{X}}\left[\log\left(1 - D(G(x))\right)\right] \qquad (7)$$

where $\mathcal{X}$ is a set of images, and $\mathcal{Y}$ is a set of corresponding ground-truth segmentations.

We formulate the discriminator as a convolutional neural network whose architecture is inspired by the architecture in [39]. The network comprises four convolution layers and a fully-connected layer. The th convolution layer has kernels with a size of , a stride of , a pad of and leaky rectified units [58]. The activations of the first four convolution layers are normalized along the mini-batch (i.e., batch normalization [59]). The output of the last convolution layer is averaged along the spatial axes (i.e., global average pooling [60]). The fully-connected layer has one kernel with a sigmoid unit.

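The discriminator is trained by iteratively minimizing the following discriminator loss function:

$$\mathcal{L}_D = -\sum_{y \in \mathcal{Y}} \log D(y) - \sum_{x \in \mathcal{X}} \log\left(1 - D(G(x))\right) \qquad (8)$$

Note that $\mathcal{L}_D$ is the sum of two sigmoid cross entropy loss functions.

The generator is trained by iteratively minimizing the following linear combination of an adversarial loss function and a segmentation loss function:

$$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda \mathcal{L}_{seg} \qquad (9)$$

where $\lambda$ is the coefficient of the segmentation loss function and the constituent loss functions are of the following forms:

$$\mathcal{L}_{adv} = -\sum_{x \in \mathcal{X}} \log D(G(x)) \qquad (10)$$
$$\mathcal{L}_{seg} = -\sum_{(x, y)} \sum_{i} \log G(x)_{i, y_i} \qquad (11)$$

where $G(x)_{i, y_i}$ is the probability that the generator assigns to the ground-truth label $y_i$ of the pixel $i$. Note that $\mathcal{L}_{adv}$ is a sigmoid cross entropy loss function, and $\mathcal{L}_{seg}$ is a softmax cross entropy loss function.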
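The loss functions of this section can be sketched numerically as follows. This is a hedged illustration, not the original training code: it assumes the standard cross entropy forms (sigmoid cross entropy on discriminator outputs, softmax cross entropy on generator outputs), averages over samples and pixels instead of summing, and takes already computed probabilities as inputs. The coefficient of 100 is the value reported in Section 4.1, and all function names are hypothetical.

```python
import numpy as np


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Sum of two sigmoid cross entropies: real -> 1 and generated -> 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))


def adversarial_loss(d_fake, eps=1e-12):
    """Sigmoid cross entropy that rewards the generator for fooling D."""
    return -np.mean(np.log(d_fake + eps))


def segmentation_loss(probs, labels, eps=1e-12):
    """Softmax cross entropy.

    probs:  (N, C, H, W) label probabilities produced by the generator
    labels: (N, H, W) integer ground-truth labels
    """
    picked = np.take_along_axis(probs, labels[:, None], axis=1)
    return -np.mean(np.log(picked + eps))


def generator_loss(d_fake, probs, labels, lam=100.0):
    """Adversarial loss plus lam times the segmentation loss."""
    return adversarial_loss(d_fake) + lam * segmentation_loss(probs, labels)


# Hypothetical example: an undecided discriminator and a uniform generator.
d_real = np.full(8, 0.5)
d_fake = np.full(8, 0.5)
probs = np.full((8, 4, 5, 5), 0.25)
labels = np.zeros((8, 5, 5), dtype=int)

loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake, probs, labels)
```

In this example the discriminator loss equals 2 log 2 and the generator loss equals log 2 + 100 log 4, which shows how strongly the segmentation term dominates when the coefficient is 100.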

4 Results

4.1 Implementation details

The models were implemented in Chainer with CUDA and cuDNN [61].

The biases of the models were initialized with zero, the weights of the models were initialized with samples drawn from a scaled Gaussian distribution [62], and the coefficient of the segmentation loss function (i.e., $\lambda$) was set to 100.

Adam [63] with initial $\alpha$ = 0.001, $\beta_1$ = 0.9, $\beta_2$ = 0.999 and $\epsilon$ = 1e-8 was used to iteratively train the models on the combination of the training set and the validation set for 111 epochs. (The hyperparameters, including the number of epochs, were optimized prior to combining the training set and the validation set.) The learning rate (i.e., $\alpha$) was reduced by a factor of 10 after 100 and 110 epochs.
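The learning rate schedule can be written as a small step function. The sketch below is illustrative only and assumes zero-indexed epochs, so that "after 100 epochs" means from the epoch with index 100 onwards; the function name is hypothetical.

```python
def learning_rate(epoch, base=0.001, milestones=(100, 110), factor=10.0):
    """Learning rate used during a zero-indexed epoch."""
    drops = sum(epoch >= milestone for milestone in milestones)
    return base / factor ** drops


# Learning rate in each of the 111 training epochs.
schedule = [learning_rate(epoch) for epoch in range(111)]
```

Under this reading, the first 100 epochs use 0.001, the next ten use a rate ten times smaller, and the final epoch uses a rate a hundred times smaller than the initial one.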

At each iteration, the discriminator and the generator were updated sequentially. To prevent them from overpowering each other, the training of the discriminator was suspended or resumed if the following conditions were satisfied, respectively:

(12)

Similarly, the training of the generator was suspended or resumed if the following conditions were satisfied, respectively:

(13)

These conditions were selected based on [64].

Our source code and pretrained models will be shared post-publication. Further details can be found at https://github.com/umuguc.

4.2 Datasets

We analyzed the Part Labels dataset and the Helen dataset in our experiments. These datasets are the standard benchmark datasets for semantic face segmentation, which comprise pairs of in-the-wild faces and ground-truth segmentations. The Part Labels dataset comprises 2927 pairs of in-the-wild faces and ground-truth segmentations of background, face skin (including ear skin and neck skin) and hair (including facial hair), which is split into a 1500 pair training set, a 500 pair validation set and a 927 pair test set. The Helen dataset comprises 2330 pairs of in-the-wild faces and ground-truth segmentations of face skin (excluding ear skin and neck skin), left eyebrow, right eyebrow, left eye, right eye, nose, upper lip, inner mouth, lower lip and hair (excluding facial hair), which is split into a 2000 pair training set, a 230 pair validation set and a 100 pair test set.

4.3 Evaluation metrics

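Results are reported in terms of confusion matrix and Jaccard index (i.e., intersection over union). The confusion matrix is defined as the square matrix $M$, where $M_{ij}$ is the number of pixels whose true class is $i$ and predicted class is $j$. The Jaccard index of class $i$ is defined as follows:

$$J_i = \frac{M_{ii}}{\sum_j M_{ij} + \sum_j M_{ji} - M_{ii}} \qquad (14)$$

The Jaccard index of all classes is defined as the arithmetic mean of the Jaccard indices of the classes:

$$J = \frac{1}{C} \sum_{i=1}^{C} J_i \qquad (15)$$

where $C$ is the number of classes.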
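Both metrics follow directly from a confusion matrix of pixel counts. The sketch below is a minimal illustration with a hypothetical two-class matrix; it assumes that rows index the true class and columns index the predicted class, and that the Jaccard index of all classes is the arithmetic mean of the per-class values, as stated in the captions of the result tables.

```python
import numpy as np


def jaccard_indices(confusion):
    """Per-class intersection over union from a matrix of pixel counts."""
    confusion = np.asarray(confusion, dtype=float)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    return intersection / union


def mean_jaccard(confusion):
    """Arithmetic mean of the per-class Jaccard indices."""
    return jaccard_indices(confusion).mean()


# Hypothetical pixel counts for two classes.
confusion = [[8, 2],
             [1, 9]]
per_class = jaccard_indices(confusion)  # 8 / 11 and 9 / 12
overall = mean_jaccard(confusion)
```

Note that the confusion matrices in the result tables are reported as row percentages, so they cannot be plugged into this function to reproduce the reported Jaccard indices without the underlying pixel counts.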

4.4 Main experiments

We conducted two main experiments on the Part Labels and the Helen datasets, in which we evaluated the CnnRnnGan model.

4.4.1 Part Labels dataset

We iteratively trained one global CnnRnnGan model for segmenting background, face skin and hair. Before the first iteration, the images in the dataset were resized to 106 pixels × 106 pixels. At each iteration, a mini-batch of size 16 was randomly selected without replacement, horizontally and vertically translated by 5 pixels, and mirrored in the left-right direction. Then, the mini-batch was cropped to the central 96 pixels × 96 pixels.

In the test phase, the inputs were oversampled (i.e., center and corners) and mirrored (i.e., left-right direction). The outputs were placed to their corresponding locations in the original inputs and averaged. Table 2 shows the resulting confusion matrix and Jaccard index. The most common cause of errors was mislabeling the classes as background. The least common cause of errors was mislabeling the classes as hair. All of the classes were segmented with a relatively high accuracy (). Background was the most accurately segmented class (). Hair was the least accurately segmented class ().

                          predicted class
true class       background   face skin   hair     mean
background       97.97        00.73       01.30    -
face skin        01.83        96.37       01.79    -
hair             06.35        05.44       88.21    -
Jaccard index    .9656        .9182       .7808    .8882
Table 2: The results of the main experiment on the Part Labels dataset. Confusion matrix is reported in terms of percentage. The rest of the results are reported in terms of Jaccard index (i.e. intersection over union) of the classes and their arithmetic mean, respectively.
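The test-time procedure of oversampling and mirroring can be sketched as follows. This is an illustrative reading of the description rather than the original code: `model` is a hypothetical callable that maps a cropped input of shape (channels, crop, crop) to an output with the same spatial size, and overlapping outputs are averaged per pixel.

```python
import numpy as np


def predict_oversampled(image, model, crop):
    """Average model outputs over center and corner crops and their mirrors.

    image: (channels, H, W) array
    model: hypothetical callable, (channels, crop, crop) -> (C, crop, crop)
    """
    _, h, w = image.shape
    cy, cx = (h - crop) // 2, (w - crop) // 2
    origins = [(cy, cx), (0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop)]
    total, count = None, np.zeros((h, w))
    for y, x in origins:
        patch = image[:, y:y + crop, x:x + crop]
        plain = model(patch)
        # Mirror the input left-right, then mirror the output back.
        mirrored = model(patch[:, :, ::-1])[:, :, ::-1]
        averaged = 0.5 * (plain + mirrored)
        if total is None:
            total = np.zeros((averaged.shape[0], h, w))
        # Place the output at its location in the original input.
        total[:, y:y + crop, x:x + crop] += averaged
        count[y:y + crop, x:x + crop] += 1
    return total / count


# Hypothetical example: a 106 x 106 input, 96 x 96 crops, identity "model".
rng = np.random.default_rng(0)
image = rng.uniform(size=(3, 106, 106))
output = predict_oversampled(image, lambda patch: patch, crop=96)
```

With 106 × 106 inputs and 96 × 96 crops, the four corner crops already cover every pixel, so the per-pixel count is never zero. Mirroring the output back is only valid for classes without a left or right identity, which holds for background, face skin and hair.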

4.4.2 Helen dataset

We iteratively trained the following five CnnRnnGan models for segmenting different classes:

  • One global model for segmenting background, face skin and hair.

  • Three local models for segmenting eyebrows, eyes and nose, respectively.

  • One local model for segmenting upper lip, inner mouth and lower lip.

The outputs of the global model and the local models were aggregated by resizing the output of the global model to 500 pixels × 500 pixels and placing the non-background outputs of the local models to their corresponding locations in the resized output of the global model.
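This aggregation can be sketched on label maps as follows. The sketch makes several assumptions that the text does not specify: the outputs are taken to be integer label maps rather than probabilities, the resizing uses nearest-neighbour sampling, background has label 0, and each local output comes with a hypothetical (top, left) position in the 500 × 500 frame.

```python
import numpy as np


def resize_nearest(labels, size):
    """Nearest-neighbour resize of a 2-D label map to size x size."""
    h, w = labels.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return labels[np.ix_(rows, cols)]


def aggregate(global_labels, local_outputs, size=500, background=0):
    """Paste non-background local labels into the resized global label map.

    local_outputs: list of (labels, top, left) tuples, where labels is a 2-D
    label map and (top, left) is its position in the resized global map.
    """
    merged = resize_nearest(global_labels, size)
    for labels, top, left in local_outputs:
        region = merged[top:top + labels.shape[0], left:left + labels.shape[1]]
        mask = labels != background
        region[mask] = labels[mask]  # region is a view, so merged is updated
    return merged


# Hypothetical example: a 2 x 2 global map and one 2 x 2 local map.
global_labels = np.array([[0, 1],
                          [2, 0]])
local_labels = np.array([[5, 0],
                         [0, 0]])
merged = aggregate(global_labels, [(local_labels, 1, 1)], size=4)
```

Only the single non-background pixel of the local map overwrites the global map; background pixels of the local map leave the global labels untouched.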

The global model was trained on the Helen dataset in the exact same way as it was trained on the Part Labels dataset. The local models were trained in a slightly different way than the global model. Before the first iteration, the images in the dataset were cropped to 90 pixels × 90 pixels such that their centers coincided with the centers of the corresponding classes of the average face. At each iteration, a mini-batch of size 16 was randomly selected without replacement, rotated by 7.5 degrees, scaled by a factor of 1 ± 0.05, horizontally and vertically translated by 5 pixels, and randomly flipped in the left-right direction. Additionally, the initial segmentations were further randomly rotated by 0.75 degrees, scaled by a factor of 1 ± 0.005, and horizontally and vertically translated by 0.5 pixels. The additional data augmentation was used to further avoid overfitting the training set since the training set had a small overlap with the training set of the landmark detection model. Finally, the mini-batch was cropped to the central 80 pixels × 80 pixels.

In the test phase, the inputs were oversampled (i.e., center and corners) and mirrored (i.e., left-right direction). The outputs were placed to their corresponding locations in the original inputs and averaged. Table 3 shows the resulting confusion matrix and Jaccard index. The most common cause of errors was mislabeling the classes as face skin and background. The least common cause of errors was mislabeling the classes as eyes and nose. Importantly, when the non-background outputs of the local models were misclassified, they were almost always misclassified as the output of the global model and almost never as one another, which suggests that the simple post-hoc aggregation of the outputs of the global model and the local models was sufficient. All of the classes were segmented with a relatively high accuracy (). Background and face skin were the most accurately segmented classes ( and ). Hair and upper lip were the least accurately segmented classes ( and ).

predicted class (column order): background, face skin, eyebrows, eyes, nose, upper lip, inner mouth, lower lip, hair, mean
true class       non-empty entries, in column order
background       97.28 00.41 02.30 -
face skin        01.83 95.46 00.43 00.16 00.36 00.13 00.01 00.18 01.44 -
eyebrows         00.05 19.66 80.22 00.06 -
eyes             13.25 00.02 86.73 -
nose             07.23 92.77 -
upper lip        14.16 00.01 80.90 03.63 01.30 -
inner mouth      02.35 09.49 82.20 05.96 -
lower lip        00.06 09.65 04.46 84.83 -
hair             16.38 02.70 00.11 80.81 -
Jaccard index    .9452 .8933 .6987 .7974 .8884 .6619 .7467 .7580 .6962 .7873
Table 3: The results of the main experiment on the Helen dataset. Confusion matrix is reported in terms of percentage. The rest of the results are reported in terms of Jaccard index (i.e. intersection over union) of the classes and their arithmetic mean, respectively.

Compared to the accuracy of hair segmentations on the Part Labels dataset, the accuracy of hair segmentations on the Helen dataset was considerably lower. This discrepancy can be attributed to the way in which hair was annotated in the datasets. In the Part Labels dataset, hair was annotated by automatically segmenting images into superpixels and manually labeling the superpixels. In the Helen dataset, hair was automatically annotated by alpha matting.

In the Helen dataset, we observed relatively lower accuracy for hair, eyebrows and upper lips compared to the rest of the classes. The relatively low accuracy of hair and eyebrows can be attributed to the fact that these classes do not have well-defined boundaries, making it difficult to isolate them from background and/or face skin. Similarly, the relatively low accuracy of upper lip can be attributed to the fact that this class shares borders with three other classes (i.e., face skin, inner mouth and lower lip) and is often misclassified as belonging to one of them. However, the discrepancy between upper lip and inner mouth or lower lip is surprising since these classes have similar properties to upper lip; it might be explained by class imbalance.

4.5 Comparison of results versus state-of-the-art

After the main experiments, we compared the results of the CnnRnnGan model on the Part Labels dataset and the Helen dataset versus the earlier results reported in the literature.

4.5.1 Part Labels dataset

First, we compared our results on the Part Labels dataset versus the following:

  • RBM and CRF based image labeling method of Kae et al. (2013) [20].

  • CNN, RBM and CRF based semantic part segmentation method of Tsogkas et al. (2015) [65].

  • CNN and CRF based face labeling method of Liu et al. (2015) [18].

  • Convolutional VAE based semantic segmentation method of Zheng et al. (2015) [35].

  • Convolutional neural fabric based semantic segmentation method of Saxena et al. (2016) [66].

To the best of our knowledge, the CnnRnnGan model achieved state-of-the-art results on the Part Labels dataset (Table 4). The best overall results in the literature [18, 65] were improved by 1.55 and 0.19 percentage points (pp), from 95.12 to 96.67 for pixels and from 96.97 to 97.16 for superpixels. (Note that the CnnRnnGan model was trained on pixels only. The superpixel results were obtained by averaging the corresponding outputs of the CnnRnnGan model. While these results are suboptimal since the CnnRnnGan model was not trained on superpixels, they are reported for completeness.) The improvements in the best existing hair results were more pronounced than those in the rest of the best existing results. (Note that background, face skin and hair results were reported in [18] only.)

Method                      Background  Face skin  Hair   Overall (pixels)  Overall (superpixels)
Kae et al. (2013) [20]      —–          —–         —–     —–                94.95
Tsogkas et al. (2015) [65]  —–          —–         —–     —–                96.97
Liu et al. (2015) [18]      97.10       93.93      80.70  95.12             —–
Zheng et al. (2015) [35]    —–          —–         —–     —–                96.59
Saxena et al. (2016) [66]   —–          —–         —–     94.82             95.63
Ours                        98.25       95.74      87.69  96.67             97.16
Table 4: Comparison of our results versus the previous state-of-the-art on the Part Labels dataset. The overall results are reported in terms of pixel and superpixel accuracy, respectively. The rest of the results are reported in terms of score.
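The percentage-point improvements over the best previous result in each column of Table 4 can be recomputed with a short script. The column labels are an assumption inferred from the caption and from the note that only [18] reported background, face skin and hair results. Missing entries are encoded as `None`.

```python
columns = ["background", "face skin", "hair",
           "overall (pixels)", "overall (superpixels)"]

# Values copied from Table 4; None marks results that were not reported.
previous = {
    "Kae et al. (2013)":     [None, None, None, None, 94.95],
    "Tsogkas et al. (2015)": [None, None, None, None, 96.97],
    "Liu et al. (2015)":     [97.10, 93.93, 80.70, 95.12, None],
    "Zheng et al. (2015)":   [None, None, None, None, 96.59],
    "Saxena et al. (2016)":  [None, None, None, 94.82, 95.63],
}
ours = [98.25, 95.74, 87.69, 96.67, 97.16]

def improvements(previous, ours):
    """Difference (in pp) between our result and the best previous one per column."""
    gains = []
    for i, value in enumerate(ours):
        best = max(row[i] for row in previous.values() if row[i] is not None)
        gains.append(round(value - best, 2))
    return gains

gains = dict(zip(columns, improvements(previous, ours)))
```

On these numbers the hair gain is 6.99 pp, compared with 1.15 pp for background and 1.81 pp for face skin, and the overall gains are the 1.55 pp and 0.19 pp quoted in the text.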

4.5.2 Helen dataset

Second, we compared our results on the Helen dataset versus the following:

  • Exemplar based face parsing method of Smith et al. (2013) [22].

  • CNN and CRF based face labeling method of Liu et al. (2015) [18].

  • CNN based face parsing method of Zhou et al. (2015) [53].

To the best of our knowledge, our model achieved state-of-the-art results on the Helen dataset (Table 5). The best overall result in the literature [53] was improved by 3.69 pp, from 87.30 to 90.99. The improvements in the best existing face skin, upper lip and lower lip results were more pronounced than those in the rest of the best existing results.

Method                    Face skin  Eyebrows  Eyes   Nose
Smith et al. (2013) [22]  88.20      72.20     78.50  92.20
Liu et al. (2015) [18]    91.20      73.40     76.80  91.20
Zhou et al. (2015) [53]   —–         81.30     87.40  95.00
Ours                      94.36      82.26     88.73  94.09

Method                    Upper lip  Inner mouth  Lower lip  Mouth  Overall
Smith et al. (2013) [22]  65.10      71.30        70.00      85.70  80.40
Liu et al. (2015) [18]    60.10      82.40        68.40      84.90  85.40
Zhou et al. (2015) [53]   75.40      83.60        80.90      92.60  87.30
Ours                      79.66      85.50        86.23      92.82  90.99
Table 5: Comparison of our results versus the state-of-the-art on the Helen dataset. All of the results are reported in terms of score. The additional mouth results include the upper lip results, the inner mouth results and the lower lip results. The overall results exclude the background results and the face skin results.

4.6 Ablation experiments

Finally, we conducted two sets of ablation experiments on the Part Labels dataset and the Helen dataset, in which we evaluated the variants of the CnnRnnGan model.

4.6.1 Part Labels dataset

First, we evaluated the effect of removing the different components of the CnnRnnGan model on the Part Labels dataset (Table 6).

The CnnRnnGan model achieved the best results for all classes except background. Removing the Gan component while keeping the Rnn component (CnnRnn model) degraded the results. Removing the Rnn component while keeping the Gan component (CnnGan model) decreased the accuracy further. Removing both the Rnn component and the Gan component (Cnn model) degraded the results once again. Among all of the classes, the most notable change was observed for hair.

Model      Background  Face skin  Hair   Mean
Cnn        .9617       .9111      .7525  .8751
CnnGan     .9622       .9114      .7574  .8770
CnnRnn     .9663       .9177      .7795  .8878
CnnRnnGan  .9656       .9182      .7808  .8882
Table 6: The results of the ablation experiment on the Part Labels dataset. The results are reported in terms of Jaccard index (i.e. intersection over union) of the classes and their arithmetic mean, respectively.
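For reference, the per-class Jaccard index and its arithmetic mean can be computed as in the minimal sketch below. The function names and the toy labels are illustrative. The class order assumed for the Table 6 values (background, face skin, hair) follows the order in which the classes are listed elsewhere in the paper.

```python
def jaccard_per_class(true_labels, pred_labels, num_classes):
    """Intersection over union of each class, from flat per-pixel label lists."""
    scores = []
    for k in range(num_classes):
        pairs = list(zip(true_labels, pred_labels))
        intersection = sum(1 for t, p in pairs if t == k and p == k)
        union = sum(1 for t, p in pairs if t == k or p == k)
        scores.append(intersection / union if union else float("nan"))
    return scores

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

# Toy example: four pixels, two classes.
toy_scores = jaccard_per_class([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)

# Per-class values and reported means copied from Table 6.
table6 = {
    "Cnn":       ([.9617, .9111, .7525], .8751),
    "CnnGan":    ([.9622, .9114, .7574], .8770),
    "CnnRnn":    ([.9663, .9177, .7795], .8878),
    "CnnRnnGan": ([.9656, .9182, .7808], .8882),
}
# Absolute difference between the recomputed and the reported mean.
mean_errors = {name: abs(mean(per_class) - reported)
               for name, (per_class, reported) in table6.items()}
```

Each recomputed mean agrees with the reported mean to four decimal places, which confirms that the last column of Table 6 is the unweighted average over the three classes.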

We illustrate qualitative examples of these results in Fig. 3. In the first column, it can be observed that even though the ground truth had a mistake (the hands were incorrectly labeled as face skin), the CnnRnn and CnnRnnGan models in particular correctly segmented most pixels. The example in the second column demonstrates the performance of the models in a difficult facial hair case. In this example, all models performed well in segmenting the mustache, but only the CnnRnnGan model correctly identified the beard pixels. The third column showcases an example on which all models performed well. The examples in the fourth and fifth columns highlight the gradual improvement provided by each additional model component in the correct classification of hair pixels. Especially in the example in the fifth column, it is possible to observe the improvements in the identification of fine details of hair. The first failure case example in column six demonstrates a difficult case for all models. The pixels to the left of the face skin are indeed hair pixels; however, they belong to another person in the photograph. All models failed to make this distinction. The last failure case example shows that all models failed to segment the facial hair pixels and incorrectly labeled them as face skin pixels. This error could be attributed to the low contrast between the face skin and facial hair pixels.

Figure 3: Example segmentations of the variants of the CnnRnnGan model that were evaluated in the ablation experiment on the Part Labels dataset. The last two columns show failure cases in which none of the model variants achieved satisfactory results. gt. denotes ground-truth.

4.6.2 Helen dataset

Second, we evaluated the effect of conditioning the CnnRnnGan model on the initial segmentation and/or training multiple CnnRnnGan models for segmenting different classes on the Helen dataset (Table 7). The initial segmentation (init. model) failed to achieve competitive results. These results were considerably improved by training a single CnnRnnGan model for segmenting all of the classes (1c. model). Conditioning the single CnnRnnGan model on the initial segmentation (init.+1c. model) slightly increased the accuracy. The results were once again considerably improved by training multiple CnnRnnGan models for segmenting different classes and conditioning them on the initial segmentation (init.+5c. model), which made them the best for all of the classes except for background and hair. Among all of the classes, the most notable improvements were observed for eyebrows, eyes, upper lip, inner mouth and lower lip.

init. .8253 .6358 .4855 .6527 .5325 .5568 .5757 .6001 .5405
1c. .9465 .8770 .6074 .6811 .8562 .5666 .6655 .6667 .7030 .7300
init.+1c. .9408 .8805 .6189 .6880 .8618 .5724 .6804 .6738 .6717 .7320
init.+5c. .9452 .8933 .6987 .7974 .8884 .6619 .7467 .7580 .6962 .7873
Table 7: The results of the ablation experiment on the Helen dataset. The results are reported in terms of Jaccard index (i.e. intersection over union) of the classes and their arithmetic mean, respectively.

We illustrate qualitative examples of these results in Fig. 4. In the first five columns of this figure, it is possible to see an increase in performance from the simplest initial segmentation model to the more complex variants of the CnnRnnGan model. While the initial segmentation does a good job in determining the general locations of each face region, it does not provide a detailed solution. Furthermore, it can be observed that the initial segmentation performs rather poorly in the nose and eyebrow regions, and in the mouth regions whenever the expression of the face diverges from neutral. Among the variants of the CnnRnnGan model, the qualitative differences were minimal. However, training multiple CnnRnnGan models for segmenting different classes and conditioning them on the initial segmentation (i.e. init.+5c.) resulted in visually distinguishable improvements. This model was able to capture the details better than the remaining two model variants. The last two columns in the figure demonstrate failure cases where all model variants had errors. The models performed poorly in distinguishing hair from background when the background color was similar to the hair color (column 6) and in identifying the mouth regions when the person in the photograph had an extreme facial expression (column 7).

Figure 4: Example segmentations of the variants of the CnnRnnGan model that were evaluated in the ablation experiment on the Helen dataset. The last two columns show failure cases in which none of the model variants achieved satisfactory results. gt. denotes ground-truth.

5 Conclusion

Here, we proposed an end-to-end trainable semantic face segmentation model, which leverages the recent advances in the field. To this end, we formulated a conditional random field over a four-connected graph as convolutional and recurrent networks and estimated them via an adversarial process. Crucially, this formulation made it possible for this model to learn not only unary potentials but also pairwise potentials while aggregating multiscale contextual information and controlling higher-order inconsistencies. We showed that our model can exploit the structured nature of faces by conditioning it on face landmarks, and/or training it for different face landmarks and combining the outputs akin to part-based models. We evaluated our model on the Part Labels dataset and the Helen dataset, achieving state-of-the-art results on both of them while considerably improving the accuracy of challenging face parts such as hair. Future work will evaluate our model on other semantic segmentation datasets to assess its generalizability beyond faces.

Acknowledgements

This work has been partially supported by VIDI grant number 639.072.513 of the Netherlands Organization for Scientific Research (NWO), the Spanish projects TIN2015-66951-C2-2-R, TIN2015-65464-R and TIN2016-74946-P (MINECO/FEDER, UE), by the European Commission Horizon 2020 granted project SEE.4C under call H2020-ICT-2015, and by the CERCA Programme/Generalitat de Catalunya.

References