Semantic segmentation is an important topic in computer vision, primarily because of its many applications in object recognition, image annotation, image coding, scene understanding and biomedical image processing. One specific field of semantic segmentation is face segmentation, in which the task is to correctly assign labels of face regions such as nose, mouth, eye and hair to each pixel in a face image. Face segmentation techniques are frequently used in security systems and in the field of human-computer interaction, mainly to facilitate face detection and recognition [1, 2, 3, 4], and emotion/expression recognition [5, 6, 7]. Further specialized, entertainment-oriented applications of face segmentation include style transfer, virtual make-up application, virtual face-swapping and 3D performance capturing. Additionally, face segmentation techniques could potentially improve the performance in several other computer vision tasks that involve processing face images, such as apparent personality prediction and face hallucination.
Semantic segmentation of faces is a difficult problem because of the large number of variable conditions that need to be considered, especially when applied to face pictures taken in uncontrolled environments. These conditions include variations in facial expression, skin color, lighting, image quality, pose, hair texture and style, as well as the presence of varying amounts of background clutter and occlusions. Furthermore, despite extensive studies in face segmentation, correctly classifying hair pixels still remains a particularly challenging task, largely due to the inherent properties of hair such as color similarity to background, non-rigidity and non-unique shape.
Recently, there has been a sizable number of advances in semantic segmentation. In the context of semantic image segmentation, Zheng et al. (2015) showed that formulating the iterative update equation of a conditional random field (CRF) over a fully-connected graph as a recurrent neural network (RNN) resulted in state-of-the-art accuracy on the Pascal VOC 2012 dataset. While this model did not learn the pairwise potential of the CRF and relied on fixed Gaussian kernels, it was end-to-end trainable. In the context of semantic face segmentation, Liu et al. (2015) showed that formulating the unary potential and the pairwise potential of a CRF over a four-connected graph as a convolutional neural network (CNN) resulted in state-of-the-art accuracy on the Part Labels dataset [19, 20] and the Helen dataset [21, 22]. While this model was not end-to-end trainable and relied on graph cuts, it learned both the unary potential and the pairwise potential of the CRF.
Furthermore, Yu and Koltun (2015) showed that the results of convolutional semantic segmentation models can be improved by using dilated kernels instead of regular kernels, which increase receptive field size without decreasing receptive field resolution. Similarly, Luc et al. (2016) showed that the results of convolutional semantic segmentation models can be improved by using an adversarial loss function in addition to a segmentation loss function, which enforces higher-order consistencies without explicitly taking into account any higher-order potentials.
Here, our goal is to formulate a model for semantic face segmentation by combining the respective strengths of the aforementioned models. That is, the model should be end-to-end trainable like Zheng et al. (2015), and learn both the unary potential and the pairwise potential of a CRF over a four-connected graph like Liu et al. (2015), while aggregating multi-scale contexts like Yu and Koltun (2015) and controlling higher-order inconsistencies like Luc et al. (2016). Table 1 shows an overview of the differences and the similarities between our model (i.e., CnnRnnGan), its variants (i.e., Cnn, CnnGan and CnnRnn), and the recent models that they are based on.
|conditional random field||
|Yu and Koltun (2015) ||—–||—–||—–||✓|
|Liu et al. (2015) ||✓||✓|
|Zheng et al. (2015) ||✓||✓|
|Luc et al. (2016) ||✓||—–||—–||—–||✓|
The contributions of our work are the following:
We propose an end-to-end trainable convolutional and recurrent network formulation of a conditional random field over a four-connected graph with learnable unary potentials and pairwise potentials in which dilated convolutions and adversarial training are used for aggregating multi-scale contexts and controlling higher-order inconsistencies, respectively.
We exploit the structured nature of faces by conditioning the model on face landmarks, and/or training multiple models for different face landmarks and combining their outputs akin to part-based models.
We evaluate the model on two standard semantic face segmentation datasets (i.e., Part Labels and Helen), achieving state-of-the-art results on both of them while considerably improving the segmentation accuracy of challenging face parts such as hair.
The rest of this paper is organized as follows: In the next section, we overview the recent work on semantic segmentation in general and semantic face segmentation in particular. In Section 3, we present our model. In Section 4, we present the results of the main experiments, in which we evaluate our model on both the Part Labels dataset and the Helen dataset, compare the obtained results versus the state-of-the-art, and present the results of the ablation experiments, in which we evaluate variants of our model on the same datasets. In the last section, we conclude with an overview of our work.
2 Related work
Semantic segmentation has been widely studied in computer vision in a wide spectrum of domains. For a comprehensive review of classical approaches for semantic segmentation, we refer the reader to . In this section, we review recent work on semantic segmentation in general and semantic face segmentation in particular.
The most recent state-of-the-art semantic segmentation models almost exclusively rely on convolutional neural networks. In contrast to earlier approaches where recognition architectures were directly used for semantic segmentation, current approaches utilize architectures that are carefully adapted for the task at hand.  proposed the first such approach, where the fully-connected layers of popular architectures such as AlexNet, VGGNet and GoogLeNet were replaced with (de)convolution layers and combined with earlier layers to enable dense and high resolution predictions. Since then, this approach has been continuously improved by the introduction of more sophisticated architectures, which enabled denser [31, 32, 33], higher resolution, and/or encoder-decoder predictions. In particular, Yu and Koltun (2015) proposed dilated convolutions for dense prediction, where contexts could be aggregated at multiple scales without loss of either resolution or coverage. This idea has been extended by  to enable a larger field of view through spatial pyramid pooling. Such approaches enjoy the benefits of dense and high resolution predictions without the burden of extra parameters.
At the same time, conditional random fields have been used in semantic segmentation for postprocessing outputs of region-level or pixel-level semantic segmentation models. While the relatively small number of outputs of region-based semantic segmentation models could be postprocessed by CRFs with dense pairwise connectivity, the relatively large number of outputs of pixel-level semantic segmentation models could only be postprocessed by CRFs with sparse pairwise connectivity. In a seminal work,  proposed an efficient iterative algorithm for approximate inference in fully-connected CRFs with Gaussian edge potentials, which has been widely adopted for postprocessing outputs of pixel-level segmentation models [31, 32, 33]. Zheng et al. (2015) formulated this algorithm as a recurrent neural network, which is trained along with a pixel-level segmentation model instead of postprocessing it. This formulation is reminiscent of the pixel-level semantic segmentation model in , whose outputs were iteratively refined with a recurrent convolutional neural network.
Recently, generative adversarial networks (GANs) have received particular attention in computer vision [39, 40, 41]. The idea behind GANs is training a discriminator and a generator by letting them play a two-player minimax game. In this game, the objective of the discriminator is distinguishing samples that are drawn from the data distribution from samples that are drawn from the model distribution, and the objective of the generator is fooling the discriminator. While GANs have been proposed for estimating generative models via an adversarial process, they have been widely adopted as loss functions for other tasks such as inpainting, style transfer and super-resolution. In particular, Luc et al. (2016) estimated a semantic segmentation model via an adversarial process by training a discriminator for distinguishing ground-truths from outputs of the semantic segmentation model and the semantic segmentation model for fooling the discriminator. They showed that this process leads to improved results on the Stanford Background dataset and the PASCAL VOC 2012 dataset.
There have been relatively few semantic face segmentation models that rely on convolutional neural networks. Most earlier models were based on CRFs, hand-designed features and exemplars. Kae et al. modeled global part dependencies using a restricted Boltzmann machine to obtain an overall realistic shape while local shape details were modeled through a CRF, whereas Smith et al. used exemplar-based non-rigid warping for face segmentation. Despite the progress in the models, hair remains the most challenging part to segment due to its color and style variability. Earlier works include attempts at modeling hair, skin and background color [47, 48], mixtures of hair styles or MRF/CRF labeling. As a specialized hair segmentation approach, Wang et al. applied co-occurrence probabilities of face components identified by a Markov random field. The final segmentations were constrained in a tree-structured model built over part co-occurrences. Among CNN-based face segmentation methods,  segmented faces based on a hierarchical part detection process, where the face was detected as the root of the hierarchy and the smallest components of the face were detected at the bottom of the hierarchy. Then, an autoencoder network transformed those detected components into label maps. As a result of using hierarchies, partially occluded faces could be easily handled. Recently,  applied the part detection idea by training one network for each part and mapping the segmentation result to the original image. Liu et al. (2015) generated pairwise terms as class edge potentials through a four-connected graph. Such edge potentials, extracted in a multi-objective network along with unary terms, were trained using non-structured loss functions, and prior knowledge was provided to the network by including inaccurate segmentations as an additional network input. This study showed the benefits of including prior knowledge for improving face segmentation.
3 End-to-end semantic face segmentation
For end-to-end semantic face segmentation, we formulate a conditional random field as a composition of a convolutional neural network and a recurrent neural network (Section 3.1). The convolutional neural network is used for obtaining the unary potential and the pairwise kernels of the conditional random field as a function of an input face and its initial segmentation (Section 3.2). The recurrent neural network is used for obtaining the label compatibility function and a mean field approximation of the Gibbs distribution of the conditional random field as a function of the unary potential and the pairwise kernels of the conditional random field (Section 3.3). In the training phase, a discriminator and the conditional random field play a two-player minimax game, in which the objective of the discriminator is distinguishing ground-truth segmentations from final segmentations, and the objective of the conditional random field is fooling the discriminator (Section 3.4). Fig. 1 illustrates our model. The following sections present the components of our model in detail.
Prior to entering the model, an input face is preprocessed as follows: A template face is obtained by averaging the faces in the training set of the Part Labels dataset. Sixty-eight landmarks of the template face and the input face are detected by using the dlib implementation of an ensemble of regression trees. (Note that the dataset that was used for training the landmark detection model provided by dlib contains some of the images that we use to test our final segmentation model. To avoid circular analysis, we retrained the landmark detection model on the same dataset that it was originally trained on after removing these images.) An initial segmentation of the input face is obtained by filling the regions that are formed by connecting the landmarks around background, face skin, left eyebrow, right eyebrow, left eye, right eye, nose, upper lip, inner mouth and lower lip. A similarity transformation from the landmarks of the input face to the landmarks of the template face is estimated. The input face and its initial segmentation are warped to the template face by using the similarity transformation, and resized to 500 × 500 pixels. The final segmentation of the input face is obtained by using our model. Optionally, the final segmentation of the input face can be resized back from 500 × 500 pixels and warped back from the template face by using the inverse of the similarity transformation. Fig. 2 illustrates our preprocessing pipeline.
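The similarity transformation step above can be sketched as follows. This is a minimal illustration, assuming a standard least-squares (Umeyama-style) estimate from corresponding landmarks; the function name `estimate_similarity` and the 2 × 3 matrix layout are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares similarity transform (scale, rotation, translation).

    src, dst: (N, 2) arrays of corresponding landmarks.
    Returns a 2x3 matrix A such that dst ~= src @ A[:, :2].T + A[:, 2].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    # Cross-covariance between the centered landmark sets.
    cov = dst_c.T @ src_c / len(src)
    u, s, vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that the result is a rotation,
    # not a reflection.
    d = np.ones(2)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        d[-1] = -1.0
    rotation = u @ np.diag(d) @ vt
    scale = (s * d).sum() / src_c.var(axis=0).sum()
    translation = dst_mean - scale * rotation @ src_mean
    return np.hstack([scale * rotation, translation[:, None]])
```

In this sketch, `src` would hold the 68 landmarks of the input face and `dst` those of the template face; the inverse transformation used for warping back can be obtained by inverting the matrix after appending the row `[0, 0, 1]`.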
3.1 Conditional random field
We begin the exposition of our model by considering a conditional random field over a four-connected graph. Let $I = \{I_1, \ldots, I_N\}$ and $X = \{X_1, \ldots, X_N\}$ be random fields, where $I_i$ and $X_i$ are the color vector and the label of the pixel $i$, respectively. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a four-connected graph, where $\mathcal{V}$ contains all pixels, and $\mathcal{E}$ contains all pixel pairs that have a taxicab metric of one.
The conditional random field $(I, X)$ over $\mathcal{G}$ is defined by the following Gibbs distribution:
$$P(X = x \mid I) = \frac{1}{Z(I)} \exp\left(-E(x \mid I)\right)$$
where $Z(I)$ is the partition function, and $E(x \mid I)$ is the following Gibbs energy:
$$E(x) = \sum_{i \in \mathcal{V}} \psi_u(x_i) + \sum_{(i, j) \in \mathcal{E}} \psi_p(x_i, x_j)$$
where $\psi_u(x_i)$ is the unary potential, which is the cost of assigning the label $x_i$ to the pixel $i$, and $\psi_p(x_i, x_j)$ is the pairwise potential, which is the cost of assigning the labels $x_i$ and $x_j$ to the pixels $i$ and $j$, respectively. Note that we omit conditioning on $I$ for notational convenience. The pairwise potential is of the following form:
$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \, k_{ij}$$
where $\mu$ is a label compatibility function, which is not assumed to be symmetric since it was shown that relaxing this assumption improves semantic segmentation results, and $k_{ij}$ are arbitrary pairwise kernels.
We approximate the Gibbs distribution with the mean field distribution that minimizes the Kullback–Leibler divergence between the Gibbs distribution and the distributions that are of the following form:
$$Q(X) = \prod_{i \in \mathcal{V}} Q_i(X_i)$$
(While the Gibbs energy can be converted to a submodular energy, which makes exact inference (e.g. with combinatorial min cut/max flow algorithms) possible, we resort to approximate inference (i.e. with mean field theory) to be able to formulate it as a recurrent neural network, which makes end-to-end training possible.)
This approximation results in the following iterative update equation:
$$Q_i(x_i = l) = \frac{1}{Z_i} \exp\left(-\psi_u(x_i = l) - \sum_{l'} \mu(l, l') \sum_{j : (i, j) \in \mathcal{E}} k_{ij} \, Q_j(x_j = l')\right)$$
3.2 Convolutional neural network
The network comprises the following layers:
One convolution layer that has 32 kernels of size with no nonlinearities.
Five blocks, where each block comprises the following layers:
Two parallel convolution layers that have 64 kernels of size with no nonlinearities (i.e. bias layer) and 64 dilated kernels of size with gated activation units  (i.e. weight layer). The input of the bias layer is the initial segmentation. The output of the bias layer is summed with the activation of the weight layer. The output of the weight layer becomes the input of the next layer.
Two parallel convolution layers that have 64 kernels of size with no nonlinearities (i.e. residual layer) and 64 kernels of size with rectified linear units (i.e. skip layer). The output of the residual layer is summed with the input of the block, which becomes the input of the next layer. The output of the skip layer is concatenated with the outputs of the skip layers of the remaining blocks along the channel axis, which becomes the input of the next layer after the last block.
One convolution layer that has 160 kernels of size with rectified linear units.
Two parallel convolution layers that have kernels of size with no nonlinearities (i.e., the unary potential $\psi_u$) and four kernels of size with exponential units (i.e., the pairwise kernels $k$).
Dilated kernels are the same as the regular kernels with the exception that successive kernel elements have holes between each other, whose size is determined by a dilation factor. As a result, they increase receptive field size without decreasing receptive field resolution. Note that regular convolution layers can be considered dilated convolution layers with a dilation factor of one.
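To make the effect of the dilation factor concrete, the following is a minimal one-dimensional sketch in NumPy. The function `dilated_conv1d` is a hypothetical helper written for illustration only; it is not the layer used in the model, which operates on two-dimensional feature maps.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation whose kernel taps are `dilation` apart."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Number of input elements seen by one output element.
    span = (len(kernel) - 1) * dilation + 1
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for tap, weight in enumerate(kernel):
        start = tap * dilation
        out += weight * signal[start:start + out_len]
    return out
```

With a three-tap kernel, a dilation factor of one covers three neighboring inputs per output, whereas a dilation factor of two covers a span of five inputs with the same three weights, so the receptive field grows without any additional parameters or any downsampling.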
The dilation factor of the first block is one, which is doubled after every block. The number of blocks (i.e., five) is chosen to be the largest possible value such that the receptive field dimensions of the last block are less than or equal to the pixel dimensions. That is:
where is the number of blocks.
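The choice of five blocks can be checked with a short calculation. The sketch below rests on an assumption that is not stated in the surviving text: that the first convolution layer and the dilated layers use 3 × 3 kernels, and that all other layers use 1 × 1 kernels, which do not enlarge the receptive field. The function names are hypothetical.

```python
def receptive_field(num_blocks, kernel_size=3, first_kernel_size=3):
    """Receptive field (in pixels, per axis) after `num_blocks` blocks.

    Assumes one initial convolution followed by blocks whose dilation
    factor starts at one and doubles after every block.
    """
    rf = first_kernel_size
    dilation = 1
    for _ in range(num_blocks):
        rf += (kernel_size - 1) * dilation
        dilation *= 2
    return rf

def max_blocks(input_size, **kwargs):
    """Largest number of blocks whose receptive field fits in the input."""
    n = 0
    while receptive_field(n + 1, **kwargs) <= input_size:
        n += 1
    return n
```

Under these assumed kernel sizes, five blocks give a receptive field of 65 pixels and six blocks give 129 pixels, so five is the largest value that fits both the 96 × 96 and the 80 × 80 training crops.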
3.3 Recurrent neural network
Following Zheng et al. (2015), we formulate the label compatibility function and the iterative update equation as a recurrent neural network. The network comprises (i) a message passing layer, (ii) a compatibility transform layer, and (iii) a local update and normalization layer. Note that only the compatibility transform layer has free parameters.
The layers are implemented as follows: Let the unary potential $\psi_u$ and the pairwise kernels $k$ be the tensors that are the outputs of the convolutional neural network, and let $Q$ be the tensor that holds the mean field distribution. Prior to the first iteration, $Q$ is initialized with the softmax of $-\psi_u$, and the channels of $k$ are broadcast to the shape of $Q$, which results in a set of four tensors.
In the message passing layer, $Q$ is shifted up, right, down and left by one pixel, and multiplied (i.e. Hadamard product) with the corresponding elements of $k$, which results in a set of four tensors. The elements of this set are summed. As a result, this layer outputs a tensor, which becomes the input of the next layer.
In the compatibility transform layer, the input tensor is convolved with kernels of size 1 × 1. As a result, this layer outputs a tensor, which becomes the input of the next layer.
In the local update and normalization layer, the input tensor is subtracted from $-\psi_u$, exponentiated and normalized (i.e., softmax function). As a result, this layer outputs a tensor, which becomes the input of the first layer after the first four iterations and the output of the network after the fifth iteration.
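The three layers described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the original Chainer code: tensors are laid out as (labels, height, width), pixels outside the image contribute zero messages, the order of the four kernel channels (up, right, down, left) is a convention chosen here, and the distribution is initialized with the softmax of the negated unary potential. The names `shift`, `mean_field` and `OFFSETS` are hypothetical.

```python
import numpy as np

OFFSETS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shift(q, dy, dx):
    """Shift a (L, H, W) tensor by (dy, dx) pixels, padding with zeros."""
    out = np.zeros_like(q)
    h, w = q.shape[1:]
    ys_src = slice(max(-dy, 0), h - max(dy, 0))
    ys_dst = slice(max(dy, 0), h - max(-dy, 0))
    xs_src = slice(max(-dx, 0), w - max(dx, 0))
    xs_dst = slice(max(dx, 0), w - max(-dx, 0))
    out[:, ys_dst, xs_dst] = q[:, ys_src, xs_src]
    return out

def mean_field(unary, kernels, compat, num_iters=5):
    """unary: (L, H, W) costs; kernels: (4, H, W); compat: (L, L)."""
    q = softmax(-unary)
    for _ in range(num_iters):
        # Message passing: kernel-weighted sum of the four shifted copies.
        msg = sum(kernels[n][None] * shift(q, dy, dx)
                  for n, (dy, dx) in enumerate(OFFSETS))
        # Compatibility transform: a 1 x 1 convolution over the label axis.
        pairwise = np.einsum("lm,mhw->lhw", compat, msg)
        # Local update and normalization.
        q = softmax(-unary - pairwise)
    return q
```

With a Potts-like compatibility matrix (zero on the diagonal, one elsewhere) and positive kernels, a pixel whose unary potential weakly prefers a label that disagrees with all four of its neighbors is pulled toward the label of its neighbors after a few iterations.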
3.4 Adversarial training
While this end-to-end trainable convolutional and recurrent neural network formulation of a conditional random field can learn both the unary potential and the pairwise potential, it does not take into account any higher-order potentials that can enforce higher-order consistencies. To be able to enforce higher-order consistencies without explicitly taking into account any higher-order potentials, we train the model by minimizing an adversarial loss function in addition to a segmentation loss function .
To this end, we train a discriminator along with our model, which is from now on referred to as the generator. We denote the output of the discriminator as $D(\cdot)$, which is the probability that the input of the discriminator is a ground-truth segmentation. We denote the output of the generator as $G(\cdot)$, which is the probabilities of assigning each of the labels to each of the pixels of the input.
In this context, the goal of the discriminator is to distinguish ground-truth segmentations from generated segmentations, whereas the goal of the generator is to generate segmentations that are indistinguishable from ground-truth segmentations. That is, they play the following minimax game:
$$\min_G \max_D \; \mathbb{E}_{y \in \mathcal{Y}}\left[\log D(y)\right] + \mathbb{E}_{I \in \mathcal{I}}\left[\log\left(1 - D(G(I))\right)\right]$$
where $\mathcal{I}$ is a set of images, and $\mathcal{Y}$ is a set of corresponding ground-truth segmentations.
We formulate the discriminator as a convolutional neural network whose architecture is inspired by the architecture in . The network comprises four convolution layers and a fully-connected layer. The th convolution layer has kernels with a size of , a stride of , a pad of and leaky rectified units. The activations of the first four convolution layers are normalized along the mini-batch (i.e., batch normalization). The output of the last convolution layer is averaged along the spatial axes (i.e., global average pooling). The fully-connected layer has one kernel with a sigmoid unit.
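The discriminator is trained by iteratively minimizing the following discriminator loss function:
$$\mathcal{L}_D = -\mathbb{E}_{y \in \mathcal{Y}}\left[\log D(y)\right] - \mathbb{E}_{I \in \mathcal{I}}\left[\log\left(1 - D(G(I))\right)\right]$$
Note that $\mathcal{L}_D$ is the sum of two sigmoid cross entropy loss functions.
The generator is trained by iteratively minimizing the following linear combination of an adversarial loss function and a segmentation loss function:
$$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda \mathcal{L}_{seg}$$
where $\lambda$ is the coefficient of the segmentation loss function and the constituent loss functions are of the following forms:
$$\mathcal{L}_{adv} = -\mathbb{E}_{I \in \mathcal{I}}\left[\log D(G(I))\right]$$
$$\mathcal{L}_{seg} = -\mathbb{E}_{(I, y)}\left[\sum_{i} \log G(I)_{i, y_i}\right]$$
where $G(I)_{i, l}$ is the probability of assigning the label $l$ to the pixel $i$, and $y_i$ is the ground-truth label of the pixel $i$. Note that $\mathcal{L}_{adv}$ is a sigmoid cross entropy loss function, and $\mathcal{L}_{seg}$ is a softmax cross entropy loss function.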
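The loss functions described in this section can be sketched in NumPy as follows. This is a hedged illustration that operates on precomputed probabilities rather than on networks; the function names, the small constant `eps` for numerical stability, and the choice to average (rather than sum) over pixels are assumptions made here and are not taken from the original implementation.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Sum of two sigmoid cross entropies: ground truth -> 1, generated -> 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def adversarial_loss(d_fake, eps=1e-12):
    """Sigmoid cross entropy that rewards the generator for fooling D."""
    return -np.mean(np.log(d_fake + eps))

def segmentation_loss(probs, labels, eps=1e-12):
    """Softmax cross entropy.

    probs: (N, L, H, W) label probabilities; labels: (N, H, W) integers.
    """
    picked = np.take_along_axis(probs, labels[:, None], axis=1)
    return -np.mean(np.log(picked + eps))

def generator_loss(d_fake, probs, labels, lam=100.0):
    """Adversarial loss plus the segmentation loss weighted by lam."""
    return adversarial_loss(d_fake) + lam * segmentation_loss(probs, labels)
```

Here `d_real` and `d_fake` stand for the discriminator outputs on ground-truth and generated segmentations, respectively, and `lam` plays the role of the coefficient of the segmentation loss function.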
4.1 Implementation details
The models were implemented in Chainer with CUDA and cuDNN .
The biases of the models were initialized with zero, the weights of the models were initialized with samples drawn from a scaled Gaussian distribution, and the coefficient of the segmentation loss function (i.e., $\lambda$) was set to 100.
Adam with initial $\alpha$ = 0.001, $\beta_1$ = 0.9, $\beta_2$ = 0.999 and $\epsilon$ = 1e-8 was used to iteratively train the models on the combination of the training set and the validation set for 111 epochs. (The hyperparameters, including the number of epochs, were optimized prior to combining the training set and the validation set.) The learning rate (i.e., $\alpha$) was reduced by a factor of 10 after 100 and 110 epochs.
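The learning rate schedule described above amounts to a simple step function. The sketch below assumes that epochs are counted from one and that each reduction takes effect from the following epoch onward; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base=1e-3, milestones=(100, 110), factor=10.0):
    """Learning rate used during the given 1-indexed epoch."""
    # Count how many milestones have already been passed.
    drops = sum(epoch > m for m in milestones)
    return base / factor ** drops
```

Under these assumptions, epochs 1 to 100 use 0.001, epochs 101 to 110 use 0.0001, and the final epoch uses 0.00001.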
At each iteration, the discriminator and the generator were updated sequentially. To prevent them from overpowering each other, the training of the discriminator was suspended or resumed if the following conditions were satisfied, respectively:
Similarly, the training of the generator was suspended or resumed if the following conditions were satisfied, respectively:
These conditions were selected based on .
Our source code and pretrained models will be shared post-publication. Further details can be found at https://github.com/umuguc.
4.2 Datasets
We analyzed the Part Labels dataset and the Helen dataset in our experiments. These are the standard benchmark datasets for semantic face segmentation, and comprise pairs of in-the-wild faces and ground-truth segmentations. The Part Labels dataset comprises 2927 pairs of in-the-wild faces and ground-truth segmentations of background, face skin (including ear skin and neck skin) and hair (including facial hair), which is split into a 1500 pair training set, a 500 pair validation set and a 927 pair test set. The Helen dataset comprises 2330 pairs of in-the-wild faces and ground-truth segmentations of face skin (excluding ear skin and neck skin), left eyebrow, right eyebrow, left eye, right eye, nose, upper lip, inner mouth, lower lip and hair (excluding facial hair), which is split into a 2000 pair training set, a 230 pair validation set and a 100 pair test set.
4.3 Evaluation metrics
Jaccard index of all classes is defined as follows:
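For reference, the standard per-class Jaccard index (intersection over union) can be computed from a confusion matrix as sketched below. This is the textbook definition and is offered as an assumption about the metric; how the original work aggregates the index over classes is not recoverable from the text. The function name `jaccard_index` and the row/column convention are hypothetical.

```python
import numpy as np

def jaccard_index(confusion):
    """Per-class Jaccard index from a confusion matrix.

    confusion[t, p] counts pixels of true class t predicted as class p.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)            # correctly labeled pixels
    fp = confusion.sum(axis=0) - tp    # pixels wrongly assigned to the class
    fn = confusion.sum(axis=1) - tp    # pixels of the class that were missed
    return tp / (tp + fp + fn)
```

For example, a two-class confusion matrix `[[8, 2], [1, 9]]` yields 8/11 for the first class and 9/12 for the second.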
4.4 Main experiments
We conducted two main experiments on the Part Labels dataset and the Helen dataset, in which we evaluated the CnnRnnGan model.
4.4.1 Part Labels dataset
We iteratively trained one global CnnRnnGan model for segmenting background, face skin and hair. Before the first iteration, the images in the dataset were resized to 106 × 106 pixels. At each iteration, a mini-batch of size 16 was randomly selected without replacement, horizontally and vertically translated by ±5 pixels, and mirrored in the left-right direction. Then, the mini-batch was cropped to the central 96 × 96 pixels.
In the test phase, the inputs were oversampled (i.e., center and corners) and mirrored (i.e., left-right direction). The outputs were placed to their corresponding locations in the original inputs and averaged. Table 2 shows the resulting confusion matrix and Jaccard index. The most common cause of errors was mislabeling the classes as background. The least common cause of errors was mislabeling the classes as hair. All of the classes were segmented with a relatively high accuracy (). Background was the most accurately segmented class (). Hair was the least accurately segmented class ().
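The test-time procedure described above (center and corner crops, each with its left-right mirror, placed back and averaged) can be sketched as follows. This is a minimal illustration, assuming that `model` is any callable that maps a (C, crop, crop) array to an (L, crop, crop) array of scores and that the crops jointly cover the whole image; the function name `oversample_predict` is hypothetical.

```python
import numpy as np

def oversample_predict(model, image, crop):
    """Average model outputs over center/corner crops and their mirrors.

    image: (C, H, W) array. Returns an (L, H, W) array of averaged scores.
    Assumes crop is large enough for the five crops to cover every pixel.
    """
    _, h, w = image.shape
    tops = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    total, count = None, np.zeros((h, w))
    for top, left in zip(tops, lefts):
        patch = image[:, top:top + crop, left:left + crop]
        # Mirror the prediction of the flipped patch back before averaging.
        out = 0.5 * (model(patch) + model(patch[:, :, ::-1])[:, :, ::-1])
        if total is None:
            total = np.zeros((out.shape[0], h, w))
        # Place the output at its location in the original input.
        total[:, top:top + crop, left:left + crop] += out
        count[top:top + crop, left:left + crop] += 1
    return total / count
```

With 106 × 106 inputs and 96 × 96 crops, the four corner crops already cover every pixel, so the per-pixel count is never zero.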
4.4.2 Helen dataset
We iteratively trained the following five CnnRnnGan models for segmenting different classes:
One global model for segmenting background, face skin and hair.
Three local models for segmenting eyebrows, eyes and nose, respectively.
One local model for segmenting upper lip, inner mouth and lower lip.
The outputs of the global model and the local models were aggregated by resizing the output of the global model to 500 × 500 pixels and placing the non-background outputs of the local models at their corresponding locations in the resized output of the global model.
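The aggregation step can be sketched as follows. This is a simplified illustration that works on integer label maps and assumes the output of the global model has already been resized; the function name `aggregate`, the `top`/`left` placement arguments and the background label of zero are hypothetical.

```python
import numpy as np

def aggregate(global_labels, local_labels, top, left, background=0):
    """Paste non-background local labels into a copy of the global label map."""
    out = global_labels.copy()
    h, w = local_labels.shape
    region = out[top:top + h, left:left + w]  # a view into `out`
    mask = local_labels != background
    region[mask] = local_labels[mask]
    return out
```

Calling `aggregate` once per local model, each with its own crop location, overwrites only those pixels that the local model labeled as non-background and leaves the global prediction intact everywhere else.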
The global model was trained on the Helen dataset in exactly the same way as it was trained on the Part Labels dataset. The local models were trained slightly differently from the global model. Before the first iteration, the images in the dataset were cropped to 90 × 90 pixels such that their centers coincided with the centers of the corresponding classes of the average face. At each iteration, a mini-batch of size 16 was randomly selected without replacement, rotated by ±7.5 degrees, scaled by a factor of 1 ± 0.05, horizontally and vertically translated by ±5 pixels, and randomly flipped in the left-right direction. Additionally, the initial segmentations were further randomly rotated by ±0.75 degrees, scaled by a factor of 1 ± 0.005, and horizontally and vertically translated by ±0.5 pixels. The additional data augmentation was used to further avoid overfitting the training set since the training set had a small overlap with the training set of the landmark detection model. Finally, the mini-batch was cropped to the central 80 × 80 pixels.
In the test phase, the inputs were oversampled (i.e., center and corners) and mirrored (i.e., left-right direction). The outputs were placed to their corresponding locations in the original inputs and averaged. Table 3 shows the resulting confusion matrix and Jaccard index. The most common cause of errors was mislabeling the classes as face skin and background. The least common cause of errors was mislabeling the classes as eyes and nose. Importantly, when the non-background outputs of the local models were misclassified, they were almost always misclassified as the output of the global model and almost never as one another, which suggests that the simple post-hoc aggregation of the outputs of the global model and the local models was sufficient. All of the classes were segmented with a relatively high accuracy (). Background and face skin were the most accurately segmented classes ( and ). Hair and upper lip were the least accurately segmented classes ( and ).
Compared to the accuracy of hair segmentations on the Part Labels dataset, accuracy of hair segmentations on the Helen dataset was considerably lower ( versus ). This discrepancy can be attributed to the way in which hair was annotated in the datasets. In the Part Labels dataset, hair was annotated by automatically segmenting images to superpixels and manually labeling the superpixels. In the Helen dataset, hair was automatically annotated by alpha matting.
In the Helen dataset, we observed relatively lower accuracy for hair, eyebrows and upper lip compared to the rest of the classes. The relatively low accuracy of hair and eyebrows can be attributed to the fact that these classes do not have well defined boundaries, which makes it difficult to isolate them from background and/or face skin. Similarly, the relatively low accuracy of upper lip can be attributed to the fact that this class shares borders with three other classes (i.e., face skin, inner mouth and lower lip) and is often misclassified as belonging to one of them. However, the discrepancy between upper lip and inner mouth or lower lip is surprising since these classes have similar properties to upper lip; it might be explained by class imbalance.
4.5 Comparison of results versus state-of-the-art
After the main experiments, we compared the results of the CnnRnnGan model on the Part Labels dataset and the Helen dataset versus the earlier results reported in the literature.
4.5.1 Part Labels dataset
First, we compared our results on the Part Labels dataset versus the following:
RBM and CRF based image labeling method of Kae et al. (2013) .
CNN, RBM and CRF based semantic part segmentation method of Tsogkas et al. (2015) .
CNN and CRF based face labeling method of Liu et al. (2015) .
Convolutional VAE based semantic segmentation method of Zheng et al. (2015) .
Convolutional neural fabric based semantic segmentation method of Saxena et al. (2016) .
To the best of our knowledge, the CnnRnnGan model achieved state-of-the-art results on the Part Labels dataset (Table 4). The best overall results in the literature [18, 65] were improved by 1.55 and 0.19 percentage points (pp), from 95.12 to 96.67 and from 96.97 to 97.16, for pixels and superpixels, respectively. (Note that the CnnRnnGan model was trained on pixels only. The superpixel results were obtained by averaging the corresponding outputs of the CnnRnnGan model. While these results are suboptimal since the CnnRnnGan model was not trained on superpixels, they are reported for completeness.) The improvements in the best existing hair results were more pronounced compared to those in the rest of the best existing results ( pp versus pp). (Note that background, face skin and hair results were reported in Liu et al. (2015) only.)
|Method||Background||Face skin||Hair||Overall (pixels)||Overall (superpixels)|
|Kae et al. (2013) ||—–||—–||—–||—–||94.95|
|Tsogkas et al. (2015) ||—–||—–||—–||—–||96.97|
|Liu et al. (2015) ||97.10||93.93||80.70||95.12||—–|
|Zheng et al. (2015) ||—–||—–||—–||—–||96.59|
|Saxena et al. (2016) ||—–||—–||—–||94.82||95.63|
4.5.2 Helen dataset
Second, we compared our results on the Helen dataset versus the following:
To the best of our knowledge, our model achieved state-of-the-art results on the Helen dataset (Table 5). The best overall result in the literature  was improved by 3.69 pp, from 87.30 to 90.99. The improvements in the best existing face skin, upper lip and lower lip results were more pronounced compared to those in the rest of the best existing results ( pp versus pp).
|Smith et al. (2013) ||88.20||72.20||78.50||92.20||…|
|Liu et al. (2015) ||91.20||73.40||76.80||91.20||…|
|Zhou et al. (2015) ||—–||81.30||87.40||95.00||…|
|Smith et al. (2013) ||65.10||71.30||70.00||85.70||80.40|
|Liu et al. (2015) ||60.10||82.40||68.40||84.90||85.40|
|Zhou et al. (2015) ||75.40||83.60||80.90||92.60||87.30|
4.6 Ablation experiments
Finally, we conducted two sets of ablation experiments on the Part Labels dataset and the Helen dataset, in which we evaluated the variants of the CnnRnnGan model.
4.6.1 Part Labels dataset
First, we evaluated the effect of removing the different components of the CnnRnnGan model on the Part Labels dataset (Table 6).
The CnnRnnGan model achieved the best results except for background. The results deteriorated when the Gan component was removed and the Rnn component was kept (CnnRnn model). Removing the Rnn component and keeping the Gan component (CnnGan model) further decreased the accuracy. The results deteriorated once again when both the Rnn component and the Gan component were removed (Cnn model). Among all of the classes, the most notable change was observed for hair ().
We illustrate qualitative examples of these results in Fig. 3. In the first column, it can be observed that even though the ground truth had a mistake (the hands were incorrectly labeled as face skin), particularly the CnnRnn and CnnRnnGan models correctly segmented most pixels. The example in the second column demonstrates the performance of the models in a difficult facial hair case. In this example, all models performed well in segmenting the mustache, but only CnnRnnGan model correctly identified the beard pixels. The third column showcases an example that all models performed well. The examples in the fourth and fifth columns highlight the gradual improvement provided by each additional model component in the correct classification of hair pixels. Especially in the example in the fifth column, it is possible to observe the improvements in the identification of fine details of hair. The first failure case example in column six demonstrates a difficult case for all models. The pixels to the left of the face skin are indeed hair pixels, however they belong to another person in the photograph. All models failed to make this distinction. The last failure case example shows that all models failed to segment the facial hair pixels and incorrectly labeled them as facial skin pixels. This error could be attributed to the low contrast difference between the face skin and facial hair pixels.
4.6.2 Helen dataset
Second, we evaluated the effect of conditioning the CnnRnnGan model on the initial segmentation and/or training multiple CnnRnnGan models for segmenting different classes on the Helen dataset (Table 7). The initial segmentation (init. model) failed to achieve competitive results. These results were considerably improved by training a single CnnRnnGan model for segmenting all of the classes (1c. model). Conditioning the single CnnRnnGan model on the initial segmentation (init.+1c. model) slightly increased the accuracy. The results were again considerably improved by training multiple CnnRnnGan models for segmenting different classes and conditioning them on the initial segmentation (init.+5c. model), which yielded the best results for all of the classes except background and hair. Among all of the classes, the most notable improvements were observed for eyebrows, eyes, upper lip, inner mouth and lower lip.
We illustrate qualitative examples of these results in Fig. 4. In the first five columns of this figure, performance increases from the simplest initial segmentation model to the more complex variants of the CnnRnnGan model. While the initial segmentation does a good job of determining the general location of each face region, it does not provide a detailed solution. Furthermore, the initial segmentation performs rather poorly in the nose and eyebrow regions, and in the mouth regions whenever the facial expression diverges from neutral. Among the variants of the CnnRnnGan model, the qualitative differences were mostly small. However, training multiple CnnRnnGan models for segmenting different classes and conditioning them on the initial segmentation (i.e. init.+5c.) resulted in visually distinguishable improvements in accuracy. This model captured details better than the remaining two model variants. The last two columns in the figure demonstrate failure cases in which all model variants made errors. The models performed poorly in distinguishing hair from background when the background color was similar to the hair color (column 6) and in identifying the mouth regions when the person in the photograph had an extreme facial expression (column 7).
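The init.+5c. variant relies on several class-specific models whose outputs must be merged into one label map. This excerpt does not describe how that merging is done, so the following is only a sketch of one simple possibility: part-specific label crops are overlaid on a full-image label map. The function `merge_part_predictions`, the crop coordinates and the convention that label 0 means "no opinion" are all assumptions made for illustration.

```python
import numpy as np

def merge_part_predictions(base_labels, part_predictions):
    """Overlay part-specific label crops onto a full-image label map.

    base_labels: (H, W) integer label map covering the whole image.
    part_predictions: iterable of (top, left, crop_labels) tuples, where
        crop_labels is a 2-D integer array and 0 means "no opinion"
        (a hypothetical convention, not taken from the paper).
    """
    merged = np.array(base_labels, copy=True)  # leave the input untouched
    for top, left, crop in part_predictions:
        crop = np.asarray(crop)
        height, width = crop.shape
        # Basic slicing returns a view, so writes below update `merged`.
        region = merged[top:top + height, left:left + width]
        mask = crop > 0
        region[mask] = crop[mask]
    return merged

# Toy example: a 4x4 map of face skin (1) with a 2x2 mouth crop (label 3).
base = np.ones((4, 4), dtype=int)
mouth_crop = np.array([[0, 3], [3, 3]])
merged = merge_part_predictions(base, [(1, 1, mouth_crop)])
```

In this sketch, later crops overwrite earlier ones where they overlap, so the order of `part_predictions` acts as a priority order. A probabilistic merge over class scores would be an alternative design.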
Here, we proposed an end-to-end trainable semantic face segmentation model that leverages recent advances in the field. To this end, we formulated a conditional random field over a four-connected graph as convolutional and recurrent networks and estimated them via an adversarial process. Crucially, this formulation made it possible for the model to learn not only unary potentials but also pairwise potentials while aggregating multiscale contextual information and controlling higher-order inconsistencies. We showed that our model can exploit the structured nature of faces by conditioning it on face landmarks, and/or by training it for different face landmarks and combining the outputs akin to part-based models. We evaluated our model on the Part Labels dataset and the Helen dataset, achieving state-of-the-art results on both while considerably improving the accuracy of challenging face parts such as hair. Future work will evaluate our model on other semantic segmentation datasets to assess its generalizability beyond faces.
This work has been partially supported by VIDI grant number 639.072.513 of the Netherlands Organization for Scientific Research (NWO), by the Spanish projects TIN2015-66951-C2-2-R, TIN2015-65464-R and TIN2016-74946-P (MINECO/FEDER, UE), by the European Commission Horizon 2020 granted project SEE.4C under call H2020-ICT-2015, and by the CERCA Programme/Generalitat de Catalunya.