I Introduction
The task of face parsing is to assign a categorical label to every pixel in a face image. It enables many high-level applications, e.g., hair editing and face beautification. While there has been previous work on face parsing based on landmark representations [1], conditional random fields [2, 3] and exemplars [4], none of these methods performs dense prediction over raw image pixels in a fully end-to-end way.
Over the last few years, the success of deep convolutional neural network (CNN) models
[5, 6, 7] has resulted in dramatic progress on the task of pixel-wise semantic segmentation using rich hierarchical features [8, 9, 10, 11, 12, 13]. Most current semantic segmentation methods are derived from the fully convolutional network (FCN), which was first introduced in [8]. In the FCN, the last few fully connected layers are replaced by convolutional layers to enable efficient end-to-end learning and inference. A dominant research direction for improving semantic segmentation with deep learning is the combination of the powerful classification capabilities of FCNs with structured prediction, which aims at improving classification by capturing the interactions between predicted labels. Probabilistic graphical models have long been popular for structured prediction of labels, with constraints enforcing label consistency. The conditional random field (CRF) is a common framework which utilizes both the local and global dependencies within an image to refine the prediction map. Various models [14, 15, 16] based on higher-order clique potentials have been developed to improve segmentation performance. Most current state-of-the-art methods [12, 17, 18, 10] have incorporated graphical models into a deep learning framework. One of the first works combining a deep learning framework with structured prediction was proposed in [12]: it applied the densely connected conditional random field (DenseCRF) [19] as a post-processing step on the FCN output to generate a better segmentation with refined image boundaries. Zheng et al. [10] combined the DenseCRF with a CNN into a single recurrent neural network (RNN) to transform the DenseCRF post-processing into an end-to-end procedure.
Although the CNN is a powerful tool for semantic segmentation, there are some technical hurdles when existing CNN architectures are applied to pixel-wise prediction for face parsing. First, the diverse, contextual and mutual relationships among the key facial components should be well addressed when predicting pixel-wise labels. Second, the predicted label maps should be detail-preserving and of high resolution in order to recognize or highlight very small labels (e.g. the eyebrow regions). However, most previous works on semantic segmentation with a CNN can only predict labels at very low resolution, for example, the eight-times-downsampled features in the fully convolutional network (FCN) [8, 12]. Their prediction is very coarse and not optimal for the required fine-grained segmentation. Third, critical segmentation-specific context constraints, such as local superpixel smoothness or the integrity and uniqueness of each semantic region, have not been well considered in previous works on face parsing. For instance, the pixels within the same superpixel or neighboring superpixels should have a high probability of being assigned the same semantic label, and the label probabilities from neighboring superpixels should help guide the label inference by leveraging location priors. Furthermore, the pixels within the same semantic region (e.g. an eye region) should be predicted to have the same semantic label to retain the region integrity. Lastly, although structured prediction is a powerful tool for improving segmentation, its training and inference are computationally expensive.
In this paper, we present a novel fully-convolutional continuous CRF neural network (FCCNN) that successfully addresses the above-mentioned issues. FCCNN aims to capture cross-layer context and local superpixel context by combining three sub-networks: a unary network, a pairwise network and a continuous CRF network. Firstly, to recover the image details, we propose a carefully designed unary network composed of convolutional blocks and deconvolutional blocks, in which deconvolution layers trained in an end-to-end way upsample the feature maps layer by layer. Secondly, within-superpixel smoothing and the cross-superpixel neighborhood relationship are leveraged to retain local boundaries and label consistency within superpixels; they are formulated as natural sub-components of the FCCNN in both the training and the testing process. Thirdly, a pairwise network is designed to learn the pixel-wise affinity so as to capture the spatial relationship between superpixels. Finally, to incorporate the outputs of the unary network and the pairwise network into a unified framework, a particular type of graphical model, the continuous conditional random field (CCRF), is used, which allows us to perform exact and efficient maximum a posteriori (MAP) inference. A differentiable superpixel-pooling layer and a continuous CRF layer are designed to combine the unary network and the pairwise network. Even though Gaussian random fields are unimodal and as such less expressive, continuous CRFs are unimodal conditioned on the data, effectively reflecting the fact that, given the image, one solution dominates the posterior distribution. The CCRF model thus allows us to construct rich, expressive structured prediction models that still lend themselves to efficient inference. To solve the Gaussian CRF effectively, we apply very efficient algorithms for inference and learning, as well as a customized technique adapted to the semantic segmentation task, building on standard tools from numerical analysis. Our contributions are summarized as follows:
(1) We propose a deep fully-convolutional continuous CRF network composed of a unary network, a pairwise network and a continuous CRF network. The proposed architecture effectively integrates superpixel content, semantic edge information and the continuous CRF model into a unified framework.
(2) We introduce a pairwise network to learn the pixel-wise similarity, and design a superpixel-pooling layer that enables fully end-to-end training of the proposed superpixel-based semantic segmentation network.
(3) We present a continuous CRF layer whose forward and backward passes are formulated through the solutions of linear systems.
(4) We compare FCCNN with the state-of-the-art methods on the LFW-PL and HELEN datasets. Better segmentation performance in terms of class average accuracy is achieved.
II Related Work
II-A Face Parsing
The task of face parsing is to parse an input face image into semantic regions, e.g., eyes, eyebrows and mouth, for further processing. Face parsing provides a robust representation by assigning a semantic label to every pixel of a face image. Recently, researchers have proposed several face parsing algorithms [20, 21, 2, 3, 4]. The first category is deep learning based methods. Liu et al. [20] proposed a deep convolutional network that jointly models pixel-wise likelihoods and label dependencies through a multi-objective learning method. In [21], Luo et al. proposed a face parsing method based on deep hierarchical features and several trained models. The second category is CRF-based models. Warrell and Prince [2] used a CRF for labeling facial components by combining a family of multinomial priors to model facial structures. In [3], Kae et al. modeled the face shape prior with a restricted Boltzmann machine and combined it with a CRF model for labeling three classes of pixels (face, hair and background). The third category is exemplar-based methods. In [4], Smith et al. developed a method that transfers labeling masks from registered exemplars to classification probabilities in order to label facial skin and facial components such as the nose and eyes.

II-B Semantic Segmentation
In recent years, deep convolutional neural networks (CNNs) [22, 5] have demonstrated excellent performance on semantic segmentation [23, 8, 24, 9, 12, 11]. In [23], CNN features are applied to classify each region into one of the semantic classes. Different from the region-based approaches, the FCN [8] applies the full convolution only once on an entire image to directly extract features at each pixel. However, the output of FCNs tends to have poorly localized object boundaries due to the deployment of max-pooling layers and downsampling. Several approaches have been introduced to handle this problem. The method in [8] proposes to extract features from the intermediate layers of a deep network to better estimate object boundaries and recover image details; a single deconvolutional layer is added in the decoding stage to generate prediction results using stacked feature maps from intermediate layers. In [24] and [9], deconvolutional layers are constructed by mirroring the convolutional layers, using the pooling locations stored in an unpooling step. The deconvolutional layers and the unpooling layers are employed to compensate for the spatial information lost in the max-pooling layers. Noh et al. [24] showed that coarse-to-fine structures are crucial to recover fine-detailed information along the propagation of deconvolutional layers. Bilinear interpolation [12, 11] is also commonly used because it is fast and memory-efficient. The approaches in [25, 26, 27] use the superpixel representation, which is essentially generated by low-level image segmentation methods, to improve localization and segmentation accuracy. Although CNNs have been shown to work very well for semantic segmentation, they may not be optimal as they cannot model the interactions between variables. Combining the strengths of CNNs and CRFs for segmentation is another way to recover the fine-detailed information in an image, and this has been the focus of recently developed approaches. DeepLab-CRF [12] trains an FCN and uses a dense CRF [19] in a post-processing step to refine the object boundaries by leveraging color contrast information. CRFASRNN [10] implements recurrent layers for end-to-end learning of the dense CRF and the FCN network; it uses Potts-model-based pairwise potential functions to enforce smoothness only. Lin et al. [17] proposed a method that combines CNNs and CRFs to exploit complex contextual information for semantic image segmentation, where CNN-based pairwise potentials are formulated to model the semantic relations between image regions. In [28], an end-to-end trainable Gaussian conditional random field network is proposed, which unfolds a fixed number of Gaussian mean-field inference steps. In [13], a structured prediction technique that combines the virtues of Gaussian conditional random fields with deep learning is proposed, which learns features and model parameters simultaneously in an end-to-end FCN training algorithm.
In contrast to the work described above, our approach shows that it is more efficient to perform face parsing by designing an architecture that integrates fully convolutional layers, deconvolutional layers, superpixel information, the continuous conditional random field model and semantic edge context into a unified framework.
III The Proposed Network Architecture
Figure 1 displays the flowchart of the proposed deep neural network for face parsing at a high level. The proposed architecture is composed of three parts: a unary network, a pairwise network and a continuous CRF network. The unary network includes convolutional blocks and their corresponding deconvolutional blocks. The convolutional blocks are designed to transform an input image into multi-dimensional feature representations. The deconvolutional blocks are applied to recover the pixel-level prediction information from the features extracted by the convolutional layers. The pairwise network, whose output is fed into the continuous CRF network, is used to learn the similarity between pixels. The continuous CRF network is composed of a superpixel-pooling layer, a continuous CRF layer and a final softmax classification layer. The role of the superpixel-pooling layer is to transform pixel-level features into superpixel-level features. The softmax classification layer generates a probability map over the predefined classes.
III-A Unary Network
We formulate the unary function of the CRF by stacking the unary network for generating the feature maps and a fully convolutional network for generating the final output of the unary potential function. The proposed unary network structure encodes local details in an early stage, and different spatial resolutions are used for capturing different levels of semantic information. The unary part of our network (illustrated in Figure 2) is built on top of the SEGNET architecture [9], which we extend to address the task of face parsing. The convolutional part computes a convolutional feature map of the input image. It is initialized from the ImageNet-trained VGG-16 network [6] and then fine-tuned on a face segmentation dataset. Each convolutional block (C1-C5) performs convolution with a filter bank to produce a set of feature maps. To reduce the internal covariate shift problem, a batch normalization [29] layer is added to the output of every convolutional layer, followed by an element-wise rectified linear non-linearity (ReLU). After that, max-pooling with a 2x2 window and stride 2 is performed and the resulting output is subsampled by a factor of 2.
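As an aside, the pooling-with-indices used by this kind of encoder, and the index-based unpooling that the deconvolution blocks rely on later, can be sketched in a few lines. This is a minimal pure-Python toy operating on 2-D lists rather than multi-channel tensors, not the paper's actual implementation:

```python
def max_pool_2x2(fmap):
    """2x2, stride-2 max pooling that also records the argmax locations.

    fmap: 2-D list (H x W, H and W even). Returns (pooled, indices),
    where indices stores the (row, col) of each maximum in the input.
    """
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for i in range(0, h, 2):
        prow, irow = [], []
        for j in range(0, w, 2):
            # Pair each value with its location so max() gives both.
            window = [(fmap[r][c], (r, c))
                      for r in (i, i + 1) for c in (j, j + 1)]
            val, loc = max(window)
            prow.append(val)
            irow.append(loc)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices

def unpool_2x2(pooled, indices, out_h, out_w):
    """Place each activation back at its memorized location; all other
    positions stay zero (the unpooling used by the decoder blocks)."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for i, row in enumerate(pooled):
        for j, val in enumerate(row):
            r, c = indices[i][j]
            out[r][c] = val
    return out
```

The recorded indices are exactly what the unpooling layers of the deconvolution blocks reuse to place activations back at their original spatial locations.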
The feature maps from deep layers often focus on global structure and are insensitive to local boundaries and spatial displacements. It is observed that the activation maps obtained by the convolutional layers are not sufficient for semantic segmentation, since they assign high scores to only a few discriminative parts of an object and their resolution is too low to recover object shape accurately. We thus add a non-linear deconvolution module to recover the image details from the feature maps generated by the convolutional blocks. This module consists of five deconvolution blocks (D1-D5). In each deconvolution block, an unpooling layer is employed to reconstruct the original size of the activations. The unpooling operation is applied for upsampling the feature maps [24, 9]. The locations of the maximum activations selected during the pooling operation on the input feature maps are recorded, and the activations are upsampled using the memorized max-pooling indices from the corresponding convolutional activations. The output of an unpooling layer is an enlarged activation map, which is then convolved with a trainable convolutional filter bank to produce dense feature maps. Each convolutional filter bank in the deconvolution blocks is followed by a batch normalization layer and a rectified linear unit [9]. To recover the features effectively, a hierarchical structure of deconvolutional blocks (D1-D5) is used to recover the image details layer by layer. The filters in lower layers tend to capture the overall shape of an object, while the class-specific fine details are encoded in the filters of higher layers [24].

III-B Pairwise Network
The pairwise network is designed to learn the pixel-wise similarity. As illustrated in Figure 3, a new branch of convolutional blocks C1-C5 is used to generate feature maps. Then an interpolation layer is applied to recover the feature maps to the same resolution as the original image. To compute the pairwise similarity, we create a similarity graph in which each location in the feature map (which corresponds to a pixel in the input image) corresponds to a node. Pairwise connections in the pixel graph are constructed by connecting each node to its neighboring nodes. We consider two kinds of spatial relations by defining horizontal and vertical connections, and each type of spatial relation is modeled by a specific pairwise potential function. Hence the pairwise network generates a similarity matrix between pixels based on the learned horizontal and vertical relations.
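To make the neighborhood construction concrete, the following is a minimal sketch of enumerating horizontal and vertical pixel pairs and concatenating their feature vectors; the function name and the list-based feature format are ours, for illustration only:

```python
def edge_features(fmap):
    """Concatenate feature vectors of 4-connected pixel pairs.

    fmap: H x W grid of feature vectors (lists). Returns two lists of
    (location_pair, concatenated_feature): one for horizontal edges,
    one for vertical edges.
    """
    h, w = len(fmap), len(fmap[0])
    horizontal, vertical = [], []
    for i in range(h):
        for j in range(w):
            if j + 1 < w:  # right neighbour -> horizontal relation
                horizontal.append((((i, j), (i, j + 1)),
                                   fmap[i][j] + fmap[i][j + 1]))
            if i + 1 < h:  # bottom neighbour -> vertical relation
                vertical.append((((i, j), (i + 1, j)),
                                 fmap[i][j] + fmap[i + 1][j]))
    return horizontal, vertical
```

In the actual network these concatenated edge features would be fed to the convolutional layers that predict the horizontal and vertical relations.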
The edge features are computed by concatenating the corresponding feature vectors of two connected nodes (similar to [17]). The edge feature of each pixel pair is then fed to a convolutional layer to compute the horizontal and vertical relations. In our experiments, we use two separate convolution kernels to learn the horizontal and vertical pairwise relations, respectively. The output of the pairwise network is then fed into the continuous CRF network. Note that the first three convolutional blocks are shared with the unary network and are followed by two further convolutional blocks. Finally, an interpolation layer recovers the detailed information through an interpolation operation.

III-C Continuous CRF Network
As shown in Figure 1, the continuous CRF network is an important part of the proposed architecture. The outputs of the unary network and the pairwise network are taken as the input of the continuous CRF network. Two novel layers, the superpixel-pooling layer (SP-layer) and the continuous CRF layer (CCRF layer), are designed for the continuous CRF network. The superpixel-pooling layer transforms pixel-level feature representations into superpixel-level feature representations. The continuous CRF model is integrated into the whole framework via the CCRF layer. The architectural details of the proposed SP-layer and CCRF layer are illustrated in Figure 4.
Superpixel-pooling layer (SP-layer):
Unlike the traditional practice of treating the complex superpixel random field regularization as post-processing [30], we embed the within-superpixel smoothing into both the training stage and the testing stage. Before the CCRF layer of the CCRF network, the within-superpixel smoothing projects the pixel-level features to region-level features. Unlike traditional pooling layers, the pooling layout of the SP-layer is not predefined but determined by the superpixels of the input image. Through the SP-layer, we aggregate the feature vectors spatially aligned with each superpixel by average pooling. To simplify the notation, we assume the image is divided into $K$ superpixels after the over-segmentation step. The information from the SP-layer is later propagated through the CCRF layer.
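The average pooling performed by this layer can be sketched as follows, assuming a precomputed superpixel label map; this toy uses nested lists in place of the actual tensors:

```python
def superpixel_pool(features, labels, n_sp):
    """Average-pool pixel features into superpixel features.

    features: H x W grid of feature vectors; labels: H x W superpixel
    ids in [0, n_sp). Returns one mean feature vector per superpixel.
    """
    dim = len(features[0][0])
    sums = [[0.0] * dim for _ in range(n_sp)]
    counts = [0] * n_sp
    for frow, lrow in zip(features, labels):
        for feat, sp in zip(frow, lrow):
            counts[sp] += 1
            for d, v in enumerate(feat):
                sums[sp][d] += v
    # Divide each accumulated sum by the superpixel's pixel count.
    return [[s / counts[k] for s in sums[k]] for k in range(n_sp)]
```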
The roles of the SP-layer can be summarized in two parts. The first role is to process the output of the pairwise network. Based on the pixel-level similarity learnt by the pairwise network, we compute the superpixel-level similarity by projecting the pixel-level similarity to the superpixel level. The similarity between superpixels $m$ and $n$ is then defined as:

$w_{mn} = \frac{1}{|B_{mn}|}\sum_{(p,q)\in B_{mn}} s_{pq}$   (1)

where $S_m$ denotes the set of pixels in superpixel $m$, $B_{mn}$ represents the set of neighboring pixel pairs on the boundary between $S_m$ and $S_n$, and $s_{pq}$ is the pixel-level similarity of the pair $(p, q)$.
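A sketch of this projection, under our assumption that the superpixel similarity averages the pixel-level similarities of 4-adjacent pixel pairs straddling each superpixel boundary (one plausible reading of Eq. (1)); `pixel_sim` stands in for the pairwise network output:

```python
def superpixel_similarity(pixel_sim, labels):
    """Average pixel-level similarities over the boundary between each
    pair of adjacent superpixels.

    pixel_sim(p, q) -> similarity of neighbouring pixels p, q.
    labels: H x W superpixel ids. Returns {(m, n): similarity}, m < n.
    """
    h, w = len(labels), len(labels[0])
    sums, counts = {}, {}
    for i in range(h):
        for j in range(w):
            for di, dj in ((0, 1), (1, 0)):  # right and bottom neighbours
                r, c = i + di, j + dj
                if r < h and c < w and labels[i][j] != labels[r][c]:
                    key = tuple(sorted((labels[i][j], labels[r][c])))
                    sums[key] = sums.get(key, 0.0) + pixel_sim((i, j), (r, c))
                    counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```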
The second role of the SP-layer is to process the output of the unary network. Let $\mathbf{z}$ stand for the output of the unary network transformed through the SP-layer (Figure 4). The smoothed confidence map of superpixel $m$ can be represented by:

$z_m = \frac{1}{|S_m|}\sum_{p\in S_m} x_p$   (2)

where $p$ is the index of the pixels in $S_m$ and $x_p$ represents the output of the unary network at pixel $p$. The output of the SP-layer then becomes a $K \times C$ matrix, where $K$ is the number of superpixels and $C$ is the number of channels in the feature maps.
Continuous CRF layer (CCRF layer): The output of the SP-layer corresponding to the unary network and the output of the SP-layer corresponding to the pairwise network are fed into the CCRF layer, which implements the continuous conditional random field model to improve segmentation performance. Let $\mathbf{y}$ denote the output of the CCRF layer. The unary potential of the CCRF is constructed from the output of the SP-layer by considering the least-squares loss:

$\phi_u(\mathbf{y}, \mathbf{z}) = \frac{\lambda}{2}\sum_{m=1}^{K}(y_m - z_m)^2$   (3)

where $\lambda$ is a weighting coefficient and $\mathbf{z} = \mathbf{z}(\theta_u)$ depends on the parameters $\theta_u$ of the unary network. The pairwise potentials are defined as:

$\phi_p(y_m, y_n) = \frac{1}{2}\, w_{mn}\,(y_m - y_n)^2$   (4)

where $w_{mn}$ is the pairwise term for the superpixel pair $(m, n)$ learnt from the pairwise network. With the unary and pairwise potentials defined, we can now write the energy function:

$E(\mathbf{y}) = \frac{\lambda}{2}\sum_{m=1}^{K}(y_m - z_m)^2 + \frac{1}{2}\sum_{(m,n)\in\mathcal{N}} w_{mn}\,(y_m - y_n)^2$   (5)

where $\mathcal{N}$ is the set of pairs of adjacent superpixels. To simplify the equations, the following notation is introduced:

$A = \lambda I + D - W$   (6)

where $I$ is the identity matrix, $W$ is the superpixel similarity matrix comprised of the $w_{mn}$, and $D$ is the diagonal matrix with $D_{mm} = \sum_n w_{mn}$. Then we have:

$E(\mathbf{y}) = \frac{1}{2}\,\mathbf{y}^T A\,\mathbf{y} - \lambda\,\mathbf{z}^T\mathbf{y} + \frac{\lambda}{2}\,\mathbf{z}^T\mathbf{z}$   (7)

Given $\mathbf{z}$ and $W$, the inference involves solving for the value of $\mathbf{y}$ that minimizes the energy function in Eq. (7), which reduces to the linear system:

$A\,\mathbf{y} = \lambda\,\mathbf{z}$   (8)

To solve the linear system of Eq. (8), the sequential mean-field method using the Gauss-Seidel algorithm [31] is applied. The high-dimensional feature representation at the output of the CCRF layer is then fed to a trainable softmax classifier, which classifies each pixel independently. The output of the softmax classifier is a multi-channel map of class probabilities, and the predicted segmentation corresponds to the class with the maximum probability at each pixel.
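The Gauss-Seidel solve can be sketched as follows for a toy system of the form $A\mathbf{y} = \lambda\mathbf{z}$, assuming the standard continuous-CRF system matrix $A = \lambda I + D - W$ with nonnegative similarities (which makes $A$ diagonally dominant, so the sweeps converge); this is an illustrative pure-Python version, not the paper's implementation:

```python
def gauss_seidel(A, b, iters=100):
    """Solve A x = b by Gauss-Seidel sweeps.

    Each sweep updates x[i] in place using the latest values of the
    other components; convergence is guaranteed when A is diagonally
    dominant (or symmetric positive definite).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x
```

For example, with $\lambda = 1$, two superpixels and $w_{01} = 1$, we get $A = [[2, -1], [-1, 2]]$, and the solver recovers the exact minimizer of the energy.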
IV End-to-end Training for Face Parsing
In this section, we describe how to train the network in an end-to-end way. The methods for training the continuous CRF layer, the superpixel-pooling layer, the unary network and the pairwise network are introduced.
IV-A Training the Continuous CRF Network
Firstly, we introduce how to compute the derivatives of the loss $L$ with respect to $\mathbf{z}$ and $A$. Note that $\mathbf{y}$ is the output of the CCRF layer and $\mathbf{z}$ is the output of the SP-layer related to the unary network. As shown in Figure 4, the continuous CRF layer is connected to the softmax loss layer. When the proposed network is trained, the derivative $\partial L/\partial \mathbf{y}$ is back-propagated from the softmax loss layer above. Since the SP-layer related to the unary network is connected to the CCRF layer, the derivative of the loss with respect to $\mathbf{z}$ is computed using the following chain rule:

$\frac{\partial L}{\partial \mathbf{z}} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{z}}\right)^{T} \frac{\partial L}{\partial \mathbf{y}}$   (9)

Based on Eq. (8), the application of the chain rule (Eq. (9)) yields a closed-form expression, which is again a system of linear equations:

$A^{T}\,\frac{\partial L}{\partial \mathbf{z}} = \lambda\,\frac{\partial L}{\partial \mathbf{y}}$   (10)

Based on the above equations, the expression for the partial derivatives with respect to $A$ is derived by using the following chain rule of differentiation:

$\frac{\partial L}{\partial \mathrm{vec}(A)} = \left(\frac{\partial \mathbf{y}}{\partial \mathrm{vec}(A)}\right)^{T} \frac{\partial L}{\partial \mathbf{y}}$   (11)

Using the expression for the derivative of a matrix inverse, we have:

$\frac{\partial \mathbf{y}}{\partial \mathrm{vec}(A)} = -\,\mathbf{y}^{T} \otimes A^{-1}$   (12)

where $\otimes$ denotes the Kronecker product. Combining Eqs. (10)-(12), the following expression can be obtained:

$\frac{\partial L}{\partial A} = -\,\frac{1}{\lambda}\,\frac{\partial L}{\partial \mathbf{z}}\,\mathbf{y}^{T}$   (13)
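These backward equations amount to solving one more linear system (with the matrix transposed) and taking an outer product. A minimal pure-Python sketch, assuming the forward solve has the form $A\mathbf{y} = \lambda\mathbf{z}$ (solver and function names are ours):

```python
def solve(A, b):
    """Tiny dense Gaussian-elimination solver for A x = b."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]               # partial pivoting
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                    # back-substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ccrf_backward(A, y, dL_dy, lam):
    """Backward pass through the solve A y = lam * z.

    Returns dL/dz (solve the transposed system, Eq. (10)) and dL/dA
    (negative outer product of the adjoint with y, Eq. (13))."""
    n = len(y)
    At = [[A[j][i] for j in range(n)] for i in range(n)]
    g = solve(At, dL_dy)                    # A^T g = dL/dy
    dL_dz = [lam * gi for gi in g]          # dL/dz = lam * g
    dL_dA = [[-g[i] * y[j] for j in range(n)] for i in range(n)]
    return dL_dz, dL_dA
```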
IV-B Training the Unary Network
To determine the partial derivatives through the superpixel-pooling layer, we observe that it does not have any weights, so we only need to compute the sub-gradients of the loss with respect to the pixel-level scores. We further integrate the within-superpixel smoothing into the training and testing process to utilize the detailed information. The superpixel guidance is only used to process the outputs of the unary network and the pairwise network, instead of all convolutional layers, so as not to influence the learning of the convolution filters.

In the backward process, the output of the unary network is connected to the SP-layer, and the derivative is back-propagated through the SP-layer to the unary network. The derivative of the loss with respect to the pixel-level output $x_p$ can be computed using the following rule:

$\frac{\partial L}{\partial x_p} = \frac{1}{|S_m|}\,\frac{\partial L}{\partial z_m}, \quad p \in S_m$   (14)

where $p \in S_m$ means that pixel $p$ is contained in superpixel $m$. We then have a loss gradient tensor with the same dimensions as the pixel-level feature maps. The parameters $\theta_u$ of the unary network can be trained by SGD based on the back-propagation chain rule in an end-to-end way:

$\frac{\partial L}{\partial \theta_u} = \sum_{p}\frac{\partial L}{\partial x_p}\,\frac{\partial x_p}{\partial \theta_u}$   (15)
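The backward pass of the average superpixel pooling in Eq. (14) simply spreads each superpixel gradient uniformly over its member pixels; a minimal list-based sketch (names ours):

```python
def superpixel_pool_backward(dL_dz, labels):
    """Backward pass of average superpixel pooling.

    dL_dz: one gradient value per superpixel; labels: H x W superpixel
    ids. Each pixel receives its superpixel's gradient divided by the
    superpixel's pixel count.
    """
    counts = {}
    for row in labels:
        for sp in row:
            counts[sp] = counts.get(sp, 0) + 1
    return [[dL_dz[sp] / counts[sp] for sp in row] for row in labels]
```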
IV-C Training the Pairwise Network
The pairwise term is learnt through the pairwise branch. Since $A = \lambda I + D - W$ and $D_{mm} = \sum_n w_{mn}$, the derivative of the loss with respect to each similarity $w_{mn}$ is defined as:

$\frac{\partial L}{\partial w_{mn}} = \frac{\partial L}{\partial A_{mm}} + \frac{\partial L}{\partial A_{nn}} - \frac{\partial L}{\partial A_{mn}} - \frac{\partial L}{\partial A_{nm}}$   (16)

The derivative of the loss with respect to $A$ is given in Eq. (13). The rule for back-propagating gradients through the SP-layer related to the pairwise network is then defined as:

$\frac{\partial L}{\partial s_{pq}} = \sum_{(m,n)\,:\,(p,q)\in B_{mn}} \frac{1}{|B_{mn}|}\,\frac{\partial L}{\partial w_{mn}}$   (17)

where the sum runs over all superpixel pairs $(m, n)$ whose boundary set $B_{mn}$ includes the pixel pair $(p, q)$. In the pairwise network, the pairwise similarities are learnt through the horizontal and vertical convolutions, and the derivatives are back-propagated to the pairwise network to learn its parameters.
V Experimental Results
We perform a thorough comparison of FCCNN to the state of the art along with comprehensive
experiments. We use the LFW-PL dataset [3] and the HELEN dataset [4] to evaluate the proposed method. We report the F-measure metric [20, 32] to measure the per-pixel segmentation accuracy.
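For reference, the per-class F-measure (F1) can be computed as below; this is the standard definition, with flat label lists standing in for label maps:

```python
def f_measure(pred, gt, cls):
    """Per-class F1: harmonic mean of precision and recall for pixels
    of class `cls`, given flat predicted and ground-truth label lists."""
    tp = sum(p == cls and g == cls for p, g in zip(pred, gt))
    fp = sum(p == cls and g != cls for p, g in zip(pred, gt))
    fn = sum(p != cls and g == cls for p, g in zip(pred, gt))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```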
V-A Datasets and Settings
The LFW-PL dataset [3] contains 2927 face images acquired in unconstrained environments. All of them are manually annotated with skin, hair and background labels using superpixels. The dataset is divided into a training set of 1500 images, a testing set of 927 images, and a validation set of 500 images. The HELEN dataset [4], used for the second set of experiments, contains face labels with 11 classes. It is composed of 2330 face images with labeled facial components generated through manually-annotated contours along the eyes, eyebrows, nose, lips and jawline. The dataset is divided into a training set with 2000 images, a testing set with 100 images, and a validation set with 230 images.
V-B Implementation
The high-level architecture of our system is implemented using the popular Caffe [33] deep learning library. We initialize the convolutional blocks and deconvolutional blocks of the unary network from the SegNet network [9], which acts as a strong baseline for a purely feed-forward network. The proposed network has a symmetrical configuration of convolution and deconvolution networks centered around the last pooling layer. In the implementation of the pairwise network, the convolutional blocks C1-C3 are shared with the unary stream; two further convolutional blocks follow, and a bilinear interpolation layer is designed to upsample the activation maps. The superpixel-pooling layer and the continuous CRF layer are the core of our architecture. In the implementation of the CCRF layer, we extend the pixel-based Gaussian CRF layer of [13] to the superpixel-based CCRF layer by incorporating the information generated by the SP-layer.
V-C Evaluation on the LFW-PL Dataset
Table I: Segmentation results on the LFW-PL dataset (%).

Method                  F-skin  F-hair  F-bg   Overall
MOGC [20]               93.93   80.70   97.10  95.12
SEGNET [9]              93.15   84.18   95.25  93.56
FCN [8]                 92.91   82.69   96.32  94.13
CRFASRNN [10]           92.79   82.75   96.32  94.12
DEEPLAB [12]            92.54   80.14   95.65  93.44
DEEPLABDT [11]          91.17   78.85   94.95  92.49
FCCNN w/o superpixels   94.02   84.73   96.06  95.06
FCCNN                   94.10   85.16   96.46  95.28
As a first step, we directly compare the proposed FCCNN with current face parsing methods [32, 20] and state-of-the-art semantic segmentation methods, including FCN [8], CRFASRNN [10], SEGNET [9], DEEPLAB [12], DEEPLAB+CRF [12], DEEPLABRTF [13] and DEEPLABDT [11], on the task of labeling facial components such as skin, hair and background. In this task, the superpixels are generated using LSC [34]; note that our method can be used with any over-segmentation algorithm. In the experiments, we train our method on the training and validation sets of LFW-PL and evaluate it on the test images. All images are cropped to a fixed input resolution to adapt to our network architecture. In the construction of the pairwise network, each pixel is connected to its left, right, top and bottom adjacent neighbours. When FCCNN is compared with the other methods, the images are padded or cropped to adapt to the different deep learning models: for example, the images are zero-padded at the border regions for FCN, CRFASRNN and SEGNET, and cropped for DEEPLAB and DEEPLABDT. The quantitative results of the proposed method and the competitors are presented in Table I. We can see that FCCNN achieves higher accuracy on facial skin and hair segmentation compared to the pixel-classification-based face parsing method MOGC. An overall segmentation accuracy improvement is achieved by FCCNN compared with the baseline SEGNET. Compared to the other fully convolutional methods, FCCNN achieves the highest accuracy over all three classes.
We also evaluate the role of the superpixel information and the superpixel-pooling layer on the LFW-PL dataset. As shown in Table I, the integration of superpixel information improves the segmentation accuracy. The qualitative results demonstrate the effectiveness of pixel-wise prediction for face parsing, and we can observe that FCCNN performs better than the compared methods by recovering detailed image information.
Table II: F-measures on the HELEN dataset.

Method                 brows  eyes    nose   in mouth  upper lip  lower lip  mouth  skin   overall
MOGC [20]              0.734  0.768   0.912  0.601     0.824      0.684      0.857  0.912  0.854
FCN [8]                0.677  0.7429  0.886  0.624     0.764      0.751      0.719  0.880  0.862
CRFASRNN [10]          0.682  0.769   0.885  0.627     0.769      0.774      0.732  0.896  0.877
DEEPLAB [12]           0.661  0.704   0.878  0.585     0.701      0.724      0.678  0.881  0.858
DEEPLABDT [11]         0.700  0.754   0.901  0.638     0.738      0.762      0.721  0.901  0.880
Smith et al. [32]      0.722  0.785   0.922  0.651     0.713      0.700      0.857  0.882  0.804
DEEPLABCRF [12]        0.401  0.728   0.807  0.460     0.702      0.717      0.643  0.910  0.871
DEEPLABRFT [13]        0.701  0.736   0.886  0.624     0.719      0.749      0.706  0.889  0.868
SEGNET [9]             0.747  0.810   0.898  0.708     0.756      0.796      0.762  0.902  0.887
FCCNN w/o superpixels  0.751  0.819   0.901  0.711     0.793      0.811      0.771  0.905  0.893
FCCNN                  0.757  0.828   0.906  0.717     0.799      0.817      0.782  0.911  0.897
V-D Evaluation on the HELEN Dataset
In the second experiment, we evaluate on the HELEN dataset, which differs from the LFW-PL dataset. The labels of the images in the HELEN dataset comprise two eyes, two eyebrows, nose, upper and lower lips, inner mouth, facial skin and hair. Unlike LFW-PL, some facial components (e.g. eyes, lips) are rare classes in the HELEN dataset. In the preprocessing step of our experiment, the faces are extracted from the original images via face detection and facial landmark detection [35]. All the cropped facial images are resized and zero-padded to the network input size; however, all the segmentation labels are transformed back to the original sizes in the evaluation process. In this experiment, we merge the ground-truth hair label with the background to train a 7-class network, allowing a fair comparison with the work of [20] and [32]. Based on the same subset of images with the same criteria, the experimental results on the HELEN dataset are presented in Table II. A large variation in F-measure across the facial components can be seen. The non-deep-learning method of [32], based on exemplar transfer, obtains better results on relatively rare facial classes such as the nose. Compared with fully convolutional methods, the region-classification-based method [20] performs better on classes such as "in mouth", but its F-measures for most of the classes are lower than those of the fully convolutional methods.
We compare the proposed FCCNN with a number of recent fully convolutional semantic segmentation methods with competitive performance. Better segmentation performance is achieved compared with methods such as FCN, CRFASRNN, DEEPLAB and DEEPLABDT. This is primarily because the facial component regions are small, and architectures such as DEEPLAB and FCN are not specially designed for segmenting small objects. Moreover, FCCNN generates better segmentation performance over all the classes compared with the baseline SEGNET. We can see that the primary advantage of our model comes from delineating the objects and refining the segmentation boundaries through the carefully designed deconvolutional layers, the integration of superpixel information and the continuous CRF model.
We also evaluate the role of the superpixels and the superpixel-pooling layer in improving the segmentation accuracy. The corresponding results are listed in Table II. The results indicate the importance of superpixels for improving segmentation performance: the segmentation accuracies for all seven classes are improved by incorporating the superpixel representation into the segmentation architecture. Consistency over regions helps to remove spurious regions of wrong labels, and the superpixels work as guidelines for recovering detailed image information. Moreover, the computational cost of the continuous CRF is reduced by performing inference on the superpixel-based affinity matrix.
Qualitative results of FCCNN and the compared methods are also presented. Overall, FCCNN produces finer segmentations than the other methods, and handles small objects (such as eyes, mouths and eyebrows) by integrating deconvolutional layers and superpixel information. Methods such as FCN and DEEPLAB tend to fail on very small objects owing to their fixed-size receptive fields. Our network generally returns object masks that are closer to the true object boundaries.
VI Conclusions
We proposed a fully-convolutional continuous CRF network for face parsing. This architecture combines the adaptive representation of superpixel context and a continuous conditional random field with end-to-end training directly optimized for semantic segmentation. We achieve this by introducing a segmentation framework that comprises three subnetworks: a unary network, a pairwise network and a continuous CRF network. We apply deconvolution blocks to recover image details in the unary network. The pairwise network is designed to learn the pairwise relationships between pixels. In the continuous CRF network, a differentiable superpixel pooling layer and a continuous CRF layer are designed to combine the unary and pairwise networks.
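For intuition, a continuous CRF of this kind can be viewed as Gaussian MAP inference over superpixel-level scores: minimizing a quadratic energy with a unary term ||y − z||² and pairwise terms w_ij(y_i − y_j)² reduces to solving a linear system. The sketch below is our own illustration under those assumptions (z denotes the unary scores and W the pairwise affinity matrix), not the paper's implementation:

```python
import numpy as np

def continuous_crf_infer(z, W):
    """MAP inference for a Gaussian (continuous) CRF over n superpixels:
    minimize ||y - z||^2 + sum_{i<j} W_ij * (y_i - y_j)^2.
    Setting the gradient to zero gives (I + D - W) y = z,
    where D is the diagonal degree matrix of the symmetric affinity W."""
    n = z.shape[0]
    laplacian = np.diag(W.sum(axis=1)) - W  # graph Laplacian D - W
    return np.linalg.solve(np.eye(n) + laplacian, z)

# two superpixels with unary scores 1 and 0, linked by affinity 1
z = np.array([[1.0], [0.0]])
W = np.array([[0.0, 1.0], [1.0, 0.0]])
print(continuous_crf_infer(z, W))  # scores are pulled toward each other
```

With a zero affinity matrix the unary scores pass through unchanged; stronger affinities smooth the scores of linked superpixels, which is the boundary-refining behavior the continuous CRF layer exploits.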
Importantly, we show that continuous CRF inference is an efficient way to exploit superpixel information and improve segmentation accuracy. Extensive experimental results on face parsing tasks clearly demonstrate the superiority of the proposed method over other state-of-the-art methods on the LFW-PL and HELEN datasets. In future work, we will extend our FCCNN architecture to more generic image parsing tasks, e.g., scene semantic segmentation.
References

[1] X. X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2012, pp. 2879–2886.
 [2] J. Warrell and S. J. Prince, “Labelfaces: Parsing facial features by multiclass labeling with an epitome prior,” in Proc. IEEE 16th Int. Conf. Image Proc. (ICIP), 2009, pp. 2481–2484.
 [3] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller, “Augmenting crfs with boltzmann machine shape priors for image labeling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013, pp. 2019–2026.
 [4] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang, “Exemplar-based face parsing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013, pp. 3484–3491.
 [5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [7] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.
 [8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015.
 [9] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
 [10] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015.
 [11] L. C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, “Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 4545–4554.
 [12] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
 [13] S. Chandra and I. Kokkinos, “Fast, exact and multiscale inference for semantic image segmentation with deep gaussian crfs,” in Euro. Conf. Comput. Vis. Springer, 2016, pp. 402–418.
 [14] C. Russell, P. Kohli, P. H. S. Torr, et al., “Associative hierarchical crfs for object class image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009, pp. 739–746.
 [15] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr, “Graph cut based inference with co-occurrence statistics,” in Euro. Conf. Comput. Vis. Springer, 2010, pp. 239–253.
 [16] L. Ladický, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr, “What, where and how many? Combining object detectors and crfs,” in Euro. Conf. Comput. Vis. Springer, 2010, pp. 424–437.
 [17] G. Lin, C. Shen, A. van den Hengel, and I. Reid, “Efficient piecewise training of deep structured models for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 3194–3203.
 [18] X. J. Qi, J. P. Shi, S. Liu, R. J. Liao, and J. Y. Jia, “Semantic segmentation with object clique potential,” in Proc. IEEE 14th Int. Conf. Comput. Vis. (ICCV), 2015, pp. 2587–2595.
 [19] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in NIPS, 2011.
 [20] S. Liu, J. Yang, C. Huang, and M.-H. Yang, “Multi-objective convolutional learning for face labeling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015.
 [21] P. Luo, X. G. Wang, and X. O. Tang, “Hierarchical face parsing via deep learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2012, pp. 2480–2487.

[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
 [23] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in Euro. Conf. Comput. Vis. Springer, 2014, pp. 297–312.
 [24] H. Noh, S. Hong, and B. Y. Han, “Learning deconvolution network for semantic segmentation,” in Proc. IEEE 14th Int. Conf. Comput. Vis. (ICCV), 2015, pp. 1520–1528.
 [25] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, 2013.
 [26] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedforward semantic segmentation with zoom-out features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 3376–3385.
 [27] R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. V. Gehler, “Superpixel convolutional networks using bilateral inceptions,” in Euro. Conf. Comput. Vis. Springer, 2016, pp. 597–613.
 [28] R. Vemulapalli, O. Tuzel, M. Y. Liu, and R. Chellappa, “Gaussian conditional random field network for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 3224–3233.
 [29] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [30] G. Papandreou, L. C. Chen, K. Murphy, and A. L. Yuille, “Weakly- and semi-supervised learning of a dcnn for semantic image segmentation,” arXiv preprint arXiv:1502.02734, 2015.
 [31] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical recipes in C, vol. 2, Cambridge Univ Press, 1982.
 [32] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang, “Exemplar-based face parsing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013.
 [33] Y. Q. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM 22nd Int. Conf. Multimedia, 2014, pp. 675–678.

[34] Z. Q. Li and J. S. Chen, “Superpixel segmentation using linear spectral clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015.
 [35] X. D. Cao, Y. C. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” Int. J. Comput. Vis., vol. 107, no. 2, pp. 177–190, 2014.