Segment-based Methods for Facial Attribute Detection from Partial Faces

01/10/2018 ∙ by Upal Mahbub, et al. ∙ University of Maryland

State-of-the-art methods of attribute detection from faces almost always assume the presence of a full, unoccluded face. Hence, their performance degrades for partially visible and occluded faces. In this paper, we introduce SPLITFACE, a deep convolutional neural network-based method that is explicitly designed to perform attribute detection in partially occluded faces. Taking several facial segments and the full face as input, the proposed method adopts a data-driven approach to determine which attributes are localized in which facial segments. The unique architecture of the network allows each attribute to be predicted by multiple segments, which permits the implementation of committee machine techniques for combining local and global decisions to boost performance. With access to segment-based predictions, SPLITFACE can predict well those attributes which are localized in the visible parts of the face, without having to rely on the presence of the whole face. We use the CelebA and LFWA facial attribute datasets for standard evaluations. We also modify both datasets to occlude the faces, so that we can evaluate the performance of attribute detection algorithms on partial faces. Our evaluation shows that SPLITFACE significantly outperforms other recent methods, especially for partial faces.


1 Introduction and Motivation

The problem of attribute detection from face images has received much attention from the computer vision community in recent years [1] [2] [3] [4]. Successful detection of facial attributes has numerous practical applications, such as user verification [5], image search [6], video surveillance [7], age and gender estimation to assist salutation in HCI [3], and facial expression estimation for mood analysis [3]. Most attribute detection algorithms assume the availability of a full, near-frontal and aligned face, and we find that their performance degrades significantly in domains where partially visible faces are frequent. One such domain is front-camera images of smartphones, which are used for continuous active authentication of users [8] [9].

To develop a method that detects attributes from full as well as partial faces, we consider the following key observations:

  • Some attributes can be inferred correctly even if the face is partially occluded. For example, it is possible for humans to infer the gender from only the left half or upper half of the face.

  • Some attributes are strongly localized in certain parts of the face; for example, a beard or mustache can only be inferred from the lower half of the face.

Given these observations, it is desirable to design an attribute detection technique whose performance degrades gracefully with increasing occlusion, rather than suffering catastrophic failures.

In this paper, we present a two-step deep convolutional neural network-based method for facial attribute detection that takes into account the relative strength of different facial segments in detecting different facial attributes. We analyze the detection results obtained in the first step, where all facial segments are tasked with deciding all the attributes. We then present a method to automatically assign selective sets of attributes to different facial regions, resulting in a performance boost in the second step. We also determine the appropriate thresholds for deciding on each attribute at each segment based on the detection results obtained from the validation set. Finally, we combine the predictions from different facial segments to produce the final result. Some special features of the proposed algorithm are:

  • We have implemented a local-to-global attribute detection approach that harnesses the strength of different facial segments in determining different attributes. For example, the bottom half of a face has information about the beard, while the upper half has information about the hair. Our divide-and-conquer approach extracts intermediate results from each segment and combines them at the end to boost the overall performance.

  • Not all facial segments have to be present for the proposed method to work. The method relies on the whole face and one or more facial segments to estimate all the attributes. The individual facial segments are self-sufficient for estimating the attributes they are assigned to. Hence, the method demonstrates superior performance when the full face is not visible due to partial occlusion or pose variation.

  • We analyze the local aspects of facial attributes by associating them with facial segments and develop an automated method to utilize the local information.

  • It is well known that an ensemble of networks generally outperforms a single network. However, training an ensemble is very time-consuming. Our proposed architecture provides multiple predictors with only one round of training. We show that a significant increase in the final accuracy is achieved by combining scores from these predictors.

In section 2, a summary of related work on facial attribute detection is given. In section 3, the proposed Segmentwise, Partial, Localized Inference in Training Facial Attribute Classification Ensembles network is described in detail. All the analyses and experimental results for the proposed method and comparisons with state-of-the-art methods are provided in section 4. Finally, a brief summary of this work as well as future directions of research are included in section 5.

2 Related Works

There has been a significant amount of research on attribute extraction, ranging from learning separate models for each attribute [10] [11] to jointly learning multiple attributes in a multi-task fashion [12] [2] [3] [1] [4]. Multi-task optimization is found to improve performance in comparison to training independent models for each attribute detection task [3] [2].

In recent times, research on attribute detection has mostly revolved around two challenging, publicly available datasets, namely CelebA and LFWA [13]. Both datasets have annotations for forty different attributes along with identity information. The CelebA dataset contains over 200,000 images partitioned into training, validation and test sets. It is a very challenging dataset with wide variations in pose, illumination and image quality. The LFWA dataset is much smaller, with roughly 13,000 images split between training and testing. The datasets were introduced in [13], where the authors proposed a cascaded system of two DCNNs to jointly perform face localization and attribute detection. In [2], the authors addressed the multi-label imbalance problem of the CelebA dataset and proposed a mixed objective optimization network (MOON) that utilizes a unique loss function comprised of a mixed multitask objective with domain adaptive re-weighting.

Some authors, such as in [3] [4], categorized the attributes into different groups to take advantage of their mutual relationships. The authors in [3] suggested an auxiliary network on top of the multi-task DCNN to further exploit the relationships among the attributes. On the other hand, the authors in [4] defined a modified AlexNet with both shared and category-specific feature learning to assist attribute extraction.

Some researchers have also implemented attribute detection as an auxiliary task of another problem. For example, in [1], the authors proposed a DCNN architecture similar to Faster R-CNN [14] with additional losses for joint detection of faces and associated facial attributes, without requiring explicit face alignment. However, the method does not address partial face detection, which is a challenging problem in itself [15]. Other notable attribute detectors for unaligned faces are proposed in [16] and [17]. In [16], the authors proposed a cascade network to concurrently localize face regions corresponding to different attributes and perform attribute classification. While this method might be suitable for attribute extraction from partially visible faces if trained properly, the authors presented no such extension or analysis. Also, the original network is huge, consisting of separate DCNN branches for each attribute, and is therefore not easily scalable. In [17], the authors introduced a data augmentation technique to assist attribute detection from unaligned faces. They improved detection performance by augmenting the test data and combining the results. Even though their reported accuracy using an ensemble of three ResNets is very good on unaligned faces, the architecture does not incorporate any mechanism for partially visible faces and also requires combining scores from transformations of the test image to achieve the best performance.

3 Proposed Method

The basis of our approach is dividing the task of detecting attributes among different segments of the face. By segments, we mean portions of the face such as the left half, right half, upper half, bottom half, nose region, etc. We divide a face into fourteen such facial segments (adopted from [18]) using 21 fiducial keypoints (as shown in Fig. 1): upper-left-half (UL12), upper-half (U12), upper-right-half (UR12), upper-left-three-fourth (UL34), upper-three-fourth (U34), upper-right-three-fourth (UR34), left-half (L12), left-three-fourth (L34), eye-pair (EP), nose region (NS), right-half (R12), right-three-fourth (R34), bottom-three-fourth (B34), and bottom-half (B12). Let us denote the i-th fiducial point shown in Fig. 1 by p_i = (x_i, y_i), where x_i and y_i are the horizontal and vertical pixel distances from the top-left corner of an image of width W and height H. TL and BR correspond to the top-left and bottom-right coordinates of the full-face bounding box. The fiducials and full-face bounding boxes are obtained from All-in-One Face [19] along with per-keypoint visibility scores. Given a visibility threshold, the bounding box of each segment is defined by the extreme horizontal and vertical coordinates of its visible constituent keypoints:

(1)
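To make the segment construction concrete, the sketch below shows one way a segment's bounding box could be computed from the 21 fiducials and their visibility scores. The keypoint-to-segment mapping and the visibility threshold used here are illustrative placeholders, not the paper's exact values.

```python
# Hypothetical sketch: computing a facial-segment bounding box from fiducial
# keypoints and their visibility scores. The keypoint subset per segment and
# the visibility threshold are illustrative placeholders.
import numpy as np

def segment_bbox(keypoints, visibility, indices, vis_thresh=0.5):
    """keypoints: (21, 2) array of (x, y); visibility: (21,) array in [0, 1]."""
    visible = [i for i in indices if visibility[i] > vis_thresh]
    if not visible:
        return None  # segment considered occluded
    pts = keypoints[visible]
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return (x_min, y_min, x_max, y_max)

# Example: a hypothetical 'upper-half' segment built from the brow/eye keypoints.
keypoints = np.random.rand(21, 2) * 128
visibility = np.random.rand(21)
print(segment_bbox(keypoints, visibility, indices=range(0, 11)))
```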

Fig. 1: The 21 fiducial key points and the full face bounding box.

Intuitively, certain segments are more effective at predicting a subset of attributes than others. For example, we can expect that segments related to the upper part of the face (e.g. U12, U34) would contain information about the person being bald or having a certain type and color of hair. Therefore, even if some other part of the face is occluded (e.g. the nose or mouth region not being visible), one can still predict attributes related to hair by looking at the upper portion of the face. Thus, detecting attributes from parts, as opposed to the whole face, has the advantage of allowing graceful degradation of performance rather than catastrophic failure with increasing occlusion.

While some attributes can be easily predicted from facial segments, some attributes reflect more global characteristics. For example, one can get hints if a person is young or not from multiple parts of the face, but youth is a global attribute. Therefore it is important to combine the segment predictions into a global prediction, so that multiple parts can contribute to the final prediction. Naturally, the following questions arise:

  • Global vs local attributes: How does one decide if an attribute is better predicted by facial segments or by a global predictor?

  • Optimal segment selection: How does one decide which facial segment is more suitable for predicting a particular attribute?

  • Combining results from multiple networks: Given that each attribute is predicted by multiple segments of the proposed network, how does one combine the results optimally?

  • Handling occlusion: If a certain facial segment responsible for predicting a certain attribute is not visible, how does one get a reasonable prediction?

We can summarize the solutions to these problems as follows.

  • Network architecture: The first problem is solved by choosing a DCNN architecture that is not only able to predict attributes from facial segments, but also performs feature-level fusion of intermediate features through a global prediction network to produce accurate global predictions. Also, the sub-modules of the network include a Global Average Pooling layer, which endows the networks with localization ability [20].

  • Output pruning: The second question is answered by the two-stage training approach adopted in this work, where the first stage is primarily used to prune the outputs of the segment networks by deciding which segments are good at predicting which attributes.

  • Committee Machines: For the third problem, we use two committee machines to perform score-level fusion of the multiple predictors, which significantly improves the performance of any single constituent network.

  • Hierarchy of best predictors and Segment Dropout: Finally, to address the fourth problem, we keep track, for each attribute, of a hierarchy of segments that are good at predicting it. Therefore, even if the best segment for an attribute is not present, one can fall back on other segments that are known to do reasonably well for that attribute. We also train our network with ‘Segment Dropout’ [21] to make it more robust to partial faces.

These ideas are core to our proposed method: Segmentwise, Partial, Localized Inference in Training Facial Attribute Classification Ensembles (SPLITFACE). The algorithm looks at the facial segments and learns to infer local attributes, to better handle partial faces. The next four subsections expand on these ideas.

3.1 Local to Global Network Architecture

Fig. 2: SPLITFACE network architecture showing the Facial Segment Networks and the Full Face Network, which culminate in the Global Prediction Network.

Here, the three constituents of the proposed network, namely the Full Face Network, the Facial Segment Networks and the Global Prediction Network, and their training losses are described:

Facial Segment Networks: Let R_1, …, R_14 denote the face regions for the aforementioned facial segments. Each segment has some predictive power, which is unknown initially. In the next section, we describe our data-driven approach to find which attributes are predicted more accurately by each segment. For now, let us say that segment R_i predicts a set of attributes A_i, where the number of attributes predicted by each segment is at most the total number of attributes. Initially, all segments predict all attributes, but later each segment is allowed to specialize, as described in the next section. We denote the corresponding Segment Networks S_i, i = 1, …, 14. When the facial segment R_i is passed through its segment network S_i, it yields attribute scores y_i for the attributes in A_i and a feature f_i for that segment, i.e.

(2)    [y_i, f_i] = S_i(R_i)

where f_i is tapped from the last convolutional layer of S_i. The architectures of all the segment networks are the same, as described in table I, and each segment network is independent of the others.

Full Face Network: Let R_F represent the full-face region, which is passed through a DCNN S_F. We have adopted a seven layer deep convolutional network as S_F. Details on the network architecture are provided in table I. The full face region is expected to always predict all attributes. Hence, it outputs a score vector y_F over all attributes, and also a compact feature representation f_F after globally pooling the last convolutional feature, i.e.

(3)    [y_F, f_F] = S_F(R_F)

Global Prediction Network for local feature fusion: In the Global Prediction Network, we combine the results from the local segment networks and the full face network to produce predictions for all the attributes. To do so, we first concatenate the convolutional features from the segment networks, convolve them, and then apply global pooling to get a flattened feature from the segments, f_S. This is concatenated with f_F, the flattened feature from the Full Face Network, and passed through a few fully connected layers to finally yield predictions for all the attributes. The Global Prediction Network can be thought of as a feature-level fusion of the different segments, as opposed to the score-level fusion of the committee machines described in the next section.

(4)

The color-coded network architecture of SPLITFACE is shown in Fig. 2. It illustrates the above-mentioned architectural choices, namely predictions from segments, predictions from the full face, and fusion of segment and full-face features to provide a global prediction.
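As a rough illustration of this local-to-global layout, the following Keras sketch wires up segment branches and a full-face branch that end in global average pooling, fuses the segment feature maps, and concatenates them with the full-face feature for the global prediction. All layer widths, depths and input resolutions are placeholders, not the values in table I.

```python
# Minimal Keras sketch of the local-to-global structure; sizes are illustrative.
from tensorflow.keras import layers, Model, Input

N_ATTR, N_SEG = 40, 14

def conv_branch(inp, widths):
    x = inp
    for w in widths:
        x = layers.Conv2D(w, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(2)(x)
    return x  # last convolutional feature map

seg_inputs, seg_scores, seg_maps = [], [], []
for i in range(N_SEG):
    inp = Input((64, 64, 3), name=f"seg_{i}")
    fmap = conv_branch(inp, [32, 64, 128])
    feat = layers.GlobalAveragePooling2D()(fmap)   # GAP before the dense layer
    seg_inputs.append(inp)
    seg_maps.append(fmap)
    seg_scores.append(layers.Dense(N_ATTR, activation="sigmoid",
                                   name=f"seg_{i}_out")(feat))

full_inp = Input((128, 128, 3), name="full_face")
full_map = conv_branch(full_inp, [32, 64, 128, 256])
full_feat = layers.GlobalAveragePooling2D()(full_map)
full_out = layers.Dense(N_ATTR, activation="sigmoid", name="full_out")(full_feat)

# Feature-level fusion for the Global Prediction Network.
merged = layers.Concatenate()(seg_maps)
merged = layers.Conv2D(256, 1, activation="relu")(merged)
gp_feat = layers.Concatenate()([layers.GlobalAveragePooling2D()(merged), full_feat])
gp_feat = layers.Dropout(0.5)(layers.Dense(512, activation="relu")(gp_feat))
gp_out = layers.Dense(N_ATTR, activation="sigmoid", name="gp_out")(gp_feat)

model = Model(seg_inputs + [full_inp], seg_scores + [full_out, gp_out])
```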

For further discussions, we shall use the word ‘predictor’ to mean any of the sub-networks described above, that is, any of the Facial Segment Networks, the Full Face Network or the Global Prediction Network.

Localization using Global Average Pooling: It has been shown in [20] that Global Average Pooling (GAP) introduced in [22] has remarkable localization properties. Since we are aiming to predict localized attributes well from partial segments, we use a GAP layer in the architecture to transition from convolutional to fully connected layers. Using Class Activation Maps (CAM) in section 4.3 we observe that this provides the network with the desired property of being able to focus on regions of interest, thus making the process more interpretable.

Loss: We use a binary cross-entropy loss for all the predictor outputs described in (2), (3) and (4), weighted by the inverse of the class priors. The loss incurred on an image is the weighted binary cross-entropy summed over all predictors and the attributes assigned to them:

(5)

In (5), the weight is based on the ground truth and the prior probability of the attribute being present, which is precomputed on the training set. The weight defined in (6) helps to mitigate the challenges due to the unbalanced class distributions that are prevalent in datasets like CelebA [2].

(6)
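The following numpy sketch illustrates the idea of an inverse-prior-weighted binary cross-entropy. The exact weighting of (6) is not reproduced here, so the scaling shown should be read as an assumption consistent with the description above.

```python
# Sketch of a prior-weighted binary cross-entropy: positive and negative terms
# are scaled by the inverse of the class priors estimated on the training set.
# The exact weighting used in the paper may differ.
import numpy as np

def weighted_bce(y_true, y_score, priors, eps=1e-7):
    """y_true, y_score: (batch, n_attr) arrays; priors: (n_attr,) P(attribute=1)."""
    y_score = np.clip(y_score, eps, 1.0 - eps)
    w_pos = 1.0 / np.clip(priors, eps, 1.0)        # up-weight rare positives
    w_neg = 1.0 / np.clip(1.0 - priors, eps, 1.0)  # up-weight rare negatives
    loss = -(w_pos * y_true * np.log(y_score)
             + w_neg * (1.0 - y_true) * np.log(1.0 - y_score))
    return loss.mean()
```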
[Each segment branch and the full face branch: stacked conv + BN + ReLU blocks interleaved with 2D max pooling, ending in global average pooling. Global prediction branch: merge of the pooled features, a dense layer with ReLU, and dropout.]
TABLE I: Detailed network architecture.

3.2 Optimal segment selection for output pruning

Intuitively, not all segments predict all attributes well. Therefore it is counterproductive to train the network to produce all predictions from all segment networks. Instead, we follow a data-driven approach to prune the segment networks.

Stage 1: Initially, we predict all attributes from all segment networks, the full face network and the global prediction network. Hence, each attribute is predicted by all sixteen predictors (the fourteen segment networks, the full face network and the global prediction network). After training for several epochs, we evaluate the detection accuracy of each predictor. For each attribute, we sort the predictors according to their accuracy on the validation set and pick the top-ranked ones (seven in our experiments, as can be seen in Fig. 3). The global predictor (GP) and the full face network can be expected to be among the best predictors most of the time, since they have a top view of the sum of parts. The remaining top predictors of each attribute are segment networks, and the set of attributes most associated with each segment is determined this way.

Fig. 3 shows the resulting assignment: for the segment networks (row three and below), the non-zero entries in the attribute columns denote the attributes assigned to that segment after pruning. The total number of attributes assigned to each segment network after pruning is shown in the last column.

Fig. 3: The top-ranked segments (including GP and Full face, row-wise) for each attribute (in the columns). The blue cells indicate that a particular segment was not used to predict that attribute in the second stage of training. The segments predict attributes that are localized in that region. For example, the bottom-half segment predicts attributes related to facial hair.

Stage 2: After the association of attributes with segments as described above, a second round of training is performed. The pruning process in stage 1 allows the segment networks to focus on the attributes they perform best on, without having to worry about attributes they are simply not capable of predicting. Also, we intentionally assign all the attributes to the GP and FULL networks (as shown in Fig. 3), since the receptive field of these two networks encompasses the entire face. Thus, we make the inherent assumption that these two networks are capable of successfully predicting all the attributes.
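A minimal sketch of the pruning step might look as follows, assuming a table of per-attribute validation accuracies for every predictor; the value of k and the convention of always retaining GP and FULL mirror the description above.

```python
# Sketch of Stage-1 output pruning: rank predictors per attribute by validation
# accuracy, then keep GP, FULL and the best-ranked remaining predictors up to k
# in total. The accuracy table and k are illustrative placeholders.
def prune_predictors(val_acc, k=7, always_keep=("GP", "FULL")):
    """val_acc: dict attribute -> {predictor_name: validation accuracy}."""
    ranked, assigned = {}, {}
    for attr, accs in val_acc.items():
        order = sorted(accs, key=accs.get, reverse=True)
        keep = list(dict.fromkeys(list(always_keep) + order))[:k]
        ranked[attr] = order           # full hierarchy, reused later by HRP
        assigned[attr] = set(keep)     # predictors that keep this attribute output
    return ranked, assigned
```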

3.3 Committee Machines for Score-Level Fusion

While the global predictor and the full face network have good predictive power, using only those two predictors does not harness the full potential of the proposed network architecture. To utilize all the segment networks along with the global and full face predictors, we describe two committee machines, namely the Highest Ranked Predictor (HRP) and the Normalized Score Aggregation (NSA) methods, that perform score-level fusion. For both methods, using the validation set, we first compute the optimal threshold for each attribute and for each predictor responsible for predicting it. For CelebA, the table in Fig. 3 lists, for each attribute, an ordered set of predictors sorted in descending order of validation accuracy. For example, for the attribute ‘goatee’ the ordered predictors are B12, FULL, B34, R12, GP, R34 and L34. Denoting the ground truth of an attribute for a validation sample and using an indicator function over the validation set, the optimal threshold of each predictor for each attribute is the one that maximizes validation accuracy:

(7)
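The threshold search of (7) can be sketched as a simple grid search over candidate thresholds on the validation scores; the grid resolution here is an assumption, as the paper's search procedure is not detailed in this text.

```python
# Sketch of the per-predictor, per-attribute threshold search: pick the
# threshold that maximizes accuracy on the validation set.
import numpy as np

def optimal_threshold(scores, labels, grid=np.linspace(0.0, 1.0, 101)):
    """scores: (n_val,) predictor outputs in [0, 1]; labels: (n_val,) in {0, 1}."""
    accs = [np.mean((scores > t).astype(int) == labels) for t in grid]
    return float(grid[int(np.argmax(accs))])
```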

We denote the visibility of a segment for a given image by a binary indicator. The Full Face Network and the Global Prediction Network are always treated as visible, since both can predict attributes regardless of the occlusion. Finally, we define an ordered set containing the top usable predictors for an attribute (those whose segments are visible) as

(8)

3.3.1 Highest Ranked Predictor (HRP) Committee Machine

After the completion of the two training stages, we evaluate the performance of each of the predictors (segment, full face and global networks) on the validation set to find a hierarchy of best performing predictors for each attribute. The results are shown in the table in Fig. 3. For example, we can see that the best predictors for ‘goatee’ are B12, FULL, B34, R12, GP, R34 and L34, in descending order of validation accuracy.

When making a prediction for an attribute, we find the topmost predictor in the hierarchy that is usable/visible for that image. The score from this predictor is then thresholded with the optimal threshold of that predictor for that attribute, which was precomputed from the validation set following (7). The prediction based on the optimal threshold is

(9)

Continuing our example, to predict ‘goatee’, we use the prediction of B12, and if that segment is not visible, we fall back on FULL.
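A compact sketch of the HRP decision rule, assuming the hierarchy, score and threshold tables have already been computed as described above:

```python
# Sketch of the HRP committee machine: walk down the precomputed hierarchy of
# predictors for an attribute and use the first one whose segment is visible.
# `hierarchy`, `scores`, `thresholds` and `visible` are illustrative structures.
def hrp_predict(attr, hierarchy, scores, thresholds, visible):
    for predictor in hierarchy[attr]:               # ordered by validation accuracy
        if predictor in ("GP", "FULL") or visible.get(predictor, False):
            return int(scores[predictor][attr] > thresholds[predictor][attr])
    raise ValueError("no usable predictor for attribute: " + attr)
```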

3.3.2 Normalized Score Aggregation (NSA) Committee Machine

In general, different predictors trying to predict the same attribute may have different optimal thresholds. Once their scores are aggregated (say, by taking their mean, product or median), one needs to recalculate the optimal threshold for the aggregate score. Instead, we can normalize the scores of the predictors so that, after aggregation, the optimal threshold for the aggregate score is 0.5. [23] suggests a double-sigmoid score normalization function for fusing scores from multiple predictors. However, it involves hyper-parameters, which need to be found by cross-validation. Instead, we propose a simpler normalization function below, which does not require any hyper-parameters.

Linear Threshold Normalization: Consider a binary classification problem where we have to decide a class given a score s. We assume that the optimal threshold t* that maximizes the separation between the two classes is known; the positive class is therefore chosen when s > t*. Consider a transformation s' = φ(s). We wish to identify the function φ such that the same decision is made when s' > 0.5.

If we choose an invertible, monotonically increasing function φ such that φ(t*) = 0.5, then the above condition is satisfied. Thus, given multiple scores s_1, …, s_n and their optimal thresholds t*_1, …, t*_n, we can transform the scores to s'_k = φ_k(s_k), so that after aggregating them, say by averaging, the optimal threshold is 0.5.

For our algorithm, we use the piecewise linear transformation in (11), which satisfies the criterion discussed above by mapping scores below the optimal threshold linearly onto [0, 0.5] and scores above it linearly onto [0.5, 1]:

(11)    φ(s) = s / (2 t*) if s ≤ t*, and φ(s) = 0.5 + (s − t*) / (2 (1 − t*)) otherwise

We transform the scores of the top usable predictors for each attribute (the visible ones) using (11) to yield the normalized scores in (12).

(12)

Finally, we use an aggregation function on the normalized scores for the prediction (see the sketch after the list below). The decision rule compares the aggregated, normalized score against 0.5:

(13)

Possible choices of aggregator functions are:

  • Bayes’ Rule or Product Rule: As discussed in the score-fusion literature [24], we can use a product rule to combine the decisions of the binary classifiers according to the following decision rule:

    (14)

    That is, the attribute is declared present if the product of the normalized scores exceeds the product of their complements.

  • Median Rule: As proposed in [25], the median aggregator function takes the median of the normalized scores.
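Putting the pieces together, the sketch below normalizes each usable predictor's score with the piecewise-linear map so that its optimal threshold lands at 0.5, and then aggregates with either the product rule or the median rule; the data structures are the same illustrative ones used in the HRP sketch.

```python
# Sketch of the NSA committee machine: normalize, then aggregate and compare
# against 0.5. `usable` is the ordered list of visible predictors for the
# attribute; `scores` and `thresholds` are illustrative lookup tables.
import numpy as np

def normalize(score, t_opt):
    if score <= t_opt:
        return 0.5 * score / max(t_opt, 1e-7)
    return 0.5 + 0.5 * (score - t_opt) / max(1.0 - t_opt, 1e-7)

def nsa_predict(attr, usable, scores, thresholds, rule="product"):
    s = np.array([normalize(scores[p][attr], thresholds[p][attr]) for p in usable])
    if rule == "median":
        return int(np.median(s) > 0.5)
    # product rule: compare the evidence for presence against absence
    return int(np.prod(s) > np.prod(1.0 - s))
```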

3.4 Segment Dropout and Hierarchy of Best Predictors for Handling Occlusion

Segment Dropout: When training the network with an image, only a subset of the segments might be present. Additionally, the visible segments are randomly dropped with a fixed probability during training. This is called Segment Dropout, which was introduced in [21] to augment the dataset for handling occlusion. When a certain segment is not present in a face, the input to the corresponding segment branch is zero. Random segment dropout makes SPLITFACE robust against such cases and helps it generalize better to detect attributes from the available segments.
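A sketch of segment dropout applied to a batch of segment inputs; the dropout probability shown is illustrative, since the rate used in the paper is not reproduced in this text.

```python
# Sketch of Segment Dropout during training: each segment input is zeroed out
# independently, per sample, with a fixed probability (0.25 is a placeholder).
import numpy as np

def segment_dropout(segment_batches, drop_prob=0.25, rng=np.random.default_rng()):
    """segment_batches: list of (batch, H, W, 3) arrays, one per facial segment."""
    out = []
    for seg in segment_batches:
        mask = rng.random(seg.shape[0]) >= drop_prob      # keep with prob 1 - p
        out.append(seg * mask[:, None, None, None])       # zero dropped samples
    return out
```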

Hierarchy of Best Predictors: As described in the earlier section, we compute a hierarchy of predictors that are visible. Thus even if a face is partially occluded and the best segment is not available, the other segments provide reasonable predictive power to the committee machine.

The unique architecture of SPLITFACE allows the use of the predictor hierarchy and thereby improves the detection accuracy. In addition, it ensures that even if some part of the input face is not visible due to occlusion or failure of the face detector, the attribute detection network can rely on the visible segments to still make a good prediction. Note that the inputs to GP are the features from all the segment networks and the full face network, and our partial-face augmentation approach during training enables it to handle missing segments while predicting attributes.

4 Experimental Setup and Evaluation

4.1 Datasets

We use the CelebA [13] and LFWA [13] datasets for both training and evaluation. Also, to evaluate SPLITFACE’s capability for handling partially visible faces when estimating facial attributes, we created several variations of these two datasets and evaluate the performance of SPLITFACE on those variations. We follow the data augmentation scheme described in [15] for generating partially visible faces by cropping the images to keep only the L12, L34, R12, R34, U12 or U34 portion, and replacing the rest of the pixels with white pixels. Hence, we create six variations of both datasets, named C-X and L-X for CelebA and LFWA respectively, where X ∈ {U12, U34, L12, L34, R12, R34}. (Bounding boxes for the partial CelebA and partial LFWA datasets are available at https://drive.google.com/open?id=16hL7g3d6dfvbdvwarYfT6zNcNNXcRLlr.) Some sample images from the modified CelebA datasets are shown in Fig. 4.
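For illustration, one of the six partial-face variants could be generated as in the sketch below, which keeps only the chosen region and paints the remaining pixels white; the region fractions are the obvious ones implied by the segment names and should be treated as an approximation of the augmentation in [15].

```python
# Sketch of generating a partial-face variant: keep one region (e.g. the upper
# half, U12) and replace the rest with white pixels.
import numpy as np

def keep_region(image, region="U12"):
    """image: (H, W, 3) uint8 array; returns a copy with the rest painted white."""
    h, w = image.shape[:2]
    out = np.full_like(image, 255)
    if region == "U12":
        out[: h // 2] = image[: h // 2]            # upper half
    elif region == "L12":
        out[:, : w // 2] = image[:, : w // 2]      # left half
    elif region == "U34":
        out[: 3 * h // 4] = image[: 3 * h // 4]    # upper three-fourths
    else:
        raise ValueError("unsupported region in this sketch: " + region)
    return out
```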

Fig. 4: Modified CelebA dataset samples for partial faces.

4.2 Implementation Details

The proposed network’s trainable parameters are tuned using the adaptive moment estimation (ADAM) optimizer [26] with a fixed initial learning rate. The network is trained in two stages on both CelebA and LFWA. The full face region is resized and given as input to the full face branch, while the inputs to the facial segment branches are all resized to a common, smaller resolution. The experiments were performed on NVIDIA Quadro P6000 GPUs, and the code was written using the KERAS python library [27] with tensorflow [28] backend. Apart from segment dropout, horizontal flipping was applied for data augmentation. Among state-of-the-art methods, the authors of AFFACT [17] have provided the source code for their paper, which we used for performance comparison on the partial face datasets. However, the accuracy obtained from this implementation is slightly lower than the accuracy reported in [17], perhaps because we did not apply test-time data augmentation. For all other methods, we directly report the results from the corresponding publications.

4.3 Visualizing Network Response using Class Activation Maps

Fig. 5: Visualization of Class Activation Maps for four different facial segments (UR12, EP, NS and B12 in the four quarters from left to right) and some attributes estimated by the corresponding block.

The class activation map (CAM) was proposed in [20] to visualize the localization properties of a network. Given a network that terminates in a Global Average Pooling (GAP) layer followed by a dense layer, the CAM of a particular class c is computed as a weighted sum of the activation maps of the layer just before the GAP layer:

(15)    M_c(x, y) = Σ_k w_k^c f_k(x, y)

where f_k is the k-th feature map in the feature tensor just before the GAP layer and w_k^c is the corresponding weight of the dense layer after the GAP layer. In Fig. 5, we show the CAM superimposed on some facial segments from CelebA. Clearly, the activation maps are localized in interpretable, meaningful regions. It can be seen from Fig. 5 that for all three attributes shown for UR12 (‘bald’, ‘receding hairline’ and ‘wearing hat’) and B12 (‘5 o’clock shadow’, ‘goatee’ and ‘sideburns’), the network focuses on the same region: the top corner of the head for UR12 and the chin and cheeks for B12. On the other hand, for segments EP and NS, the attention shifts to different regions for different attributes. For example, for segment EP, the attribute ‘bags under eyes’ is predicted when the network has a high response near the eyes, ‘bangs’ is predicted when the response is high near the forehead, and ‘eyeglasses’ is predicted by looking at the bridge of the nose. Similarly, the NS segment network shifts its attention to the nose, eyes or lips to predict ‘big nose’, ‘narrow eyes’ and ‘wearing lipstick’, respectively.
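Equation (15) amounts to a weighted sum over feature maps, as in the following sketch; variable names are illustrative.

```python
# Sketch of a Class Activation Map: weight the last convolutional feature maps
# by the dense-layer weights of the attribute of interest, then normalize.
import numpy as np

def class_activation_map(feature_maps, dense_weights, attr_index):
    """feature_maps: (H, W, K) tensor before GAP; dense_weights: (K, n_attr)."""
    cam = np.tensordot(feature_maps, dense_weights[:, attr_index], axes=([2], [0]))
    cam -= cam.min()
    return cam / (cam.max() + 1e-7)   # normalized (H, W) map for visualization
```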

4.4 Performance Comparison on Original CelebA and LFWA datasets

The performance of the proposed method is compared with state-of-the-art methods in tables II and III on the original CelebA and LFWA datasets, respectively. Among state-of-the-art methods, the result of AFFACT is directly reported from [17], while the column AFFACT Unaligned contains results that we obtained by evaluating the full faces used in our experiments. Since source code was not available for any of the other state-of-the-art methods, we report the results directly from the corresponding publications. The column titled ‘Prior’ shows the accuracies obtainable by only applying knowledge of the prior probabilities of the presence or absence of an attribute in the datasets. It can be seen that a staggering mean accuracy of 80.57% on CelebA and 71.27% on LFWA is achievable by only using the prior probabilities in decision making. Even though the state-of-the-art methods and the proposed method increase these numbers by a large margin, for certain attributes such as Big Lips and Narrow Eyes in table II the prior is higher than most of the trained approaches. We include the prior column in the tables as a baseline for evaluation.

The last five columns in tables II and III show the attribute-wise accuracy and the mean accuracy for Full, GP, HRP, NSA Product rule and NSA Median rule, respectively. It can be seen from these tables that the committee machine approaches boost the results obtained from Full and GP for most of the attributes. The mean accuracies of 90.42% for CelebA and 85.85% for LFWA obtained from the NSA Product Rule closely match the state-of-the-art results presented in the tables. Note that we adopted a very simple convolutional network for the Full face branch, which by itself achieves 90.72% accuracy on CelebA and 84.02% on LFWA. The CelebA result is boosted for HRP but degrades slightly for the NSA methods. On the other hand, for LFWA, the committee machine approaches improve the overall performance. Since LFWA is a much smaller dataset, the trained network over-fits considerably on the training set; this boost therefore shows that the proposed committee machine approaches, especially NSA, are effective for generalization due to their ensemble aggregation mechanism. In later sections, we present results for partially visible faces, where the committee machine approaches consistently improve over the Full and GP branches, and hence the adoption of such methods is justified for practical purposes even with a slight loss in accuracy on the original dataset.

Attributes | Prior | LNets+ANet [13] | MOON [2] | MCNN+AUX [3] | DMTL [4] | AFFACT [17] | AFFACT Unaligned [17] | PaW [16] | FULL (proposed) | GP (proposed) | HRP (proposed) | NSA Prod. Rule (committee) | NSA Med. Rule (committee)
5_o_Clock_Shadow 88.83 91.00 94.03 94.51 95.00 94.21 94.09 94.64 93.96 90.00 93.96 93.01 93.13
Arched_Eyebrows 73.41 79.00 82.26 83.42 86.00 82.12 81.27 83.01 83.39 83.44 83.39 82.44 82.56
Attractive 51.36 81.00 81.67 83.06 85.00 82.83 80.36 82.86 82.71 82.86 82.86 83.13 82.76
Bags_Under_Eyes 79.55 79.00 84.92 84.92 85.00 83.75 84.89 84.58 85.12 79.72 85.12 84.63 84.86
Bald 97.72 98.00 98.77 98.90 99.00 99.06 97.82 98.93 98.46 97.88 98.46 97.98 98.03
Bangs 84.83 95.00 95.80 96.05 99.00 96.05 95.49 95.93 95.65 95.72 95.72 95.73 95.71
Big_Lips 75.91 68.00 71.48 71.47 96.00 70.88 71.42 71.46 67.29 67.29 67.29 69.78 69.28
Big_Nose 76.44 78.00 84.00 84.53 85.00 83.82 81.83 83.63 83.91 81.85 83.36 81.31 83.81
Black_Hair 76.10 88.00 89.40 89.78 91.00 90.32 85.88 89.84 88.88 72.85 88.88 88.82 89.03
Blond_Hair 85.09 95.00 95.86 96.01 96.00 96.07 95.17 95.85 95.70 95.68 95.70 95.04 95.76
Blurry 94.86 84.00 95.67 96.17 96.00 95.50 94.52 96.11 95.87 94.95 95.87 95.04 95.96
Brown_Hair 79.61 80.00 89.38 89.15 88.00 89.16 87.72 88.50 88.42 87.64 88.42 85.59 88.25
Bushy_Eyebrows 85.63 90.00 92.62 92.84 92.00 92.41 90.59 92.62 92.41 92.20 92.41 91.82 92.66
Chubby 94.23 91.00 95.44 95.67 96.00 94.98 95.10 95.46 94.69 94.69 94.69 93.90 94.94
Double_Chin 95.35 92.00 96.32 96.32 97.00 96.18 95.94 96.26 95.43 95.43 95.68 95.23 95.80
Eyeglasses 93.54 99.00 99.47 99.63 99.00 99.61 99.38 99.59 99.43 99.48 99.30 99.58 99.51
Goatee 93.65 95.00 97.04 97.24 99.00 97.31 97.21 97.38 96.51 95.41 96.70 95.88 96.68
Gray_Hair 95.76 97.00 98.10 98.20 98.00 98.28 97.89 98.21 97.57 95.99 97.57 95.80 97.45
Heavy_Makeup 61.57 90.00 90.99 91.55 92.00 91.10 90.82 91.53 91.18 91.51 91.51 91.55 91.59
High_Cheekbones 54.76 87.00 87.01 87.58 88.00 86.88 86.11 87.44 87.08 87.54 87.54 87.62 87.61
Male 58.06 98.00 98.10 98.17 98.00 98.26 97.29 98.39 97.58 98.14 98.14 98.09 97.95
Mouth_Slightly_Open 51.78 92.00 93.54 93.74 94.00 92.60 92.82 94.05 93.62 93.91 93.91 93.90 93.78
Mustache 95.92 95.00 96.82 96.88 97.00 96.89 96.89 96.90 96.12 96.12 96.12 96.16 95.86
Narrow_Eyes 88.41 81.00 86.52 87.23 90.00 87.23 87.15 87.56 86.79 85.13 86.84 87.31 86.88
No_Beard 83.42 95.00 95.58 96.05 97.00 95.99 95.33 96.22 95.77 96.17 96.17 95.57 96.17
Oval_Face 71.68 66.00 75.73 75.84 78.00 75.79 74.87 75.03 75.40 70.45 75.40 75.75 74.93
Pale_Skin 95.70 91.00 97.00 97.05 97.00 97.04 96.97 97.08 96.90 95.80 96.90 96.72 97.00
Pointy_Nose 72.45 72.00 76.46 77.47 78.00 74.83 76.24 77.35 76.13 71.45 76.13 76.46 76.47
Receding_Hairline 91.99 89.00 93.56 93.81 94.00 93.29 91.74 93.44 92.55 91.52 92.55 92.40 92.25
Rosy_Cheeks 93.53 90.00 94.82 95.16 96.00 94.45 94.54 95.07 94.59 92.83 94.59 94.51 94.79
Sideburns 94.37 96.00 97.59 97.85 98.00 97.83 97.46 97.64 96.83 96.09 96.83 96.01 97.17
Smiling 52.03 92.00 92.60 92.73 94.00 91.77 90.45 92.73 92.42 92.74 92.74 92.89 92.70
Straight_Hair 79.14 73.00 82.26 83.58 85.00 84.10 82.17 83.52 83.11 79.04 83.11 82.36 80.41
Wavy_Hair 68.06 80.00 82.47 83.91 87.00 85.65 83.37 84.07 83.28 63.58 83.28 83.10 81.70
Wearing_Earrings 81.35 82.00 89.60 90.43 91.00 90.20 90.33 89.93 90.41 90.48 90.41 89.72 89.44
Wearing_Hat 95.06 99.00 98.95 99.05 99.00 99.02 98.66 99.02 98.71 95.79 98.71 98.42 98.74
Wearing_Lipstick 53.04 93.00 93.93 94.11 93.00 91.69 92.99 94.24 92.66 93.23 93.23 94.00 93.21
Wearing_Necklace 87.86 71.00 87.04 86.63 89.00 87.85 87.55 87.70 87.54 86.22 87.54 87.50 85.61
Wearing_Necktie 92.70 93.00 96.63 96.51 97.00 96.90 96.43 96.85 96.66 95.61 96.66 95.24 96.05
Young 77.89 87.00 88.08 88.48 90.00 88.66 86.21 88.59 87.95 88.45 88.45 86.93 88.01
Mean Accuracy 80.57 87.30 90.94 91.29 92.60 91.01 90.32 91.23 90.72 88.87 90.80 90.42 90.61
TABLE II: Attribute detection performance comparison on the CelebA dataset in terms of individual and mean detection accuracy for the attributes.
Attributes | Prior | LNets+ANet [13] | MCNN+AUX [3] | DMTL [4] | FULL (proposed) | GP (proposed) | HRP (proposed) | NSA Prod. Rule (committee) | NSA Med. Rule (committee)
5_o_Clock_Shadow 59.76 84 77.06 80 74.72 74.72 74.72 77.47 77.59
Arched_Eyebrows 72.35 82 81.78 86 78.78 78.78 78.78 81.82 81.72
Attractive 62.09 83 80.31 82 77.44 77.44 77.44 80.25 80.16
Bags_Under_Eyes 59.52 83 83.48 84 79.11 79.11 79.11 82.98 82.62
Bald 88.94 88 91.94 92 91.69 91.51 91.51 90.97 91.88
Bangs 83.57 88 90.08 93 89.72 89.72 89.72 90.89 90.71
Big_Lips 64.07 75 79.24 77 75.47 77.54 77.54 79.10 78.97
Big_Nose 69.62 81 84.98 83 80.23 80.23 80.23 82.95 83.13
Black_Hair 85.53 90 92.63 92 91.63 92.22 92.22 92.34 92.49
Blond_Hair 95.75 97 97.41 97 97.31 97.31 97.31 97.47 97.47
Blurry 84.66 74 85.23 89 85.41 85.41 85.41 86.41 86.42
Brown_Hair 62.02 77 80.85 81 79.22 79.22 79.22 81.12 80.93
Bushy_Eyebrows 53.58 82 84.97 80 80.73 82.41 82.41 84.42 84.26
Chubby 64.31 73 76.86 75 74.13 75.19 75.19 76.13 76.06
Double_Chin 65.58 78 81.52 78 77.82 79.19 79.19 80.76 80.49
Eyeglasses 80.23 95 91.30 92 89.69 90.76 90.76 91.72 91.50
Goatee 77.41 78 82.97 86 81.72 81.72 81.72 83.30 83.01
Gray_Hair 83.94 84 88.93 88 87.94 87.94 87.94 88.37 88.46
Heavy_Makeup 87.21 95 95.85 95 94.80 94.80 94.80 95.38 95.39
High_Cheekbones 63.34 88 88.38 89 86.53 86.53 86.53 88.34 88.34
Male 76.02 94 94.02 93 92.17 92.17 92.17 92.81 92.60
Mouth_Slightly_Open 57.02 82 83.51 86 79.03 79.03 79.03 82.70 82.50
Mustache 89.03 92 93.43 95 91.92 91.92 91.92 93.27 92.97
Narrow_Eyes 63.45 81 82.86 82 78.94 80.07 80.07 82.86 82.75
No_Beard 73.08 79 82.15 81 79.27 79.27 79.27 80.65 80.77
Oval_Face 52.37 74 77.39 75 74.19 74.19 74.19 76.51 76.80
Pale_Skin 50.82 84 93.32 91 88.36 90.16 90.16 91.00 90.97
Pointy_Nose 68.4 80 84.14 84 81.50 82.92 82.92 83.63 84.20
Receding_Hairline 56.36 85 86.25 85 83.91 83.91 83.91 85.09 84.90
Rosy_Cheeks 81.46 78 87.92 86 85.55 85.55 85.55 87.19 87.08
Sideburns 69.38 77 83.13 80 79.42 79.42 79.42 81.89 81.76
Smiling 56.65 91 91.83 92 88.65 88.65 88.65 90.77 90.80
Straight_Hair 60.1 76 78.53 79 77.09 78.10 78.10 79.27 78.91
Wavy_Hair 57.94 76 81.61 80 77.02 77.02 77.02 78.55 78.28
Wearing_Earrings 85.1 94 94.95 94 94.20 94.20 94.20 94.59 94.75
Wearing_Hat 86.57 88 90.07 92 89.81 90.23 90.23 90.25 90.23
Wearing_Lipstick 83.22 95 95.04 93 93.71 93.71 93.71 94.07 94.07
Wearing_Necklace 78.54 88 89.94 91 88.71 88.71 88.71 89.45 89.59
Wearing_Necktie 63.13 79 80.66 81 79.55 79.55 79.55 81.70 81.40
Young 78.59 86 85.84 87 83.90 83.90 83.90 85.55 85.68
Mean Accuracy 71.27 83.85 86.31 86.15 84.02 84.36 84.36 85.85 85.82
TABLE III: Attribute detection performance comparison on the LFWA dataset in terms of individual and mean detection accuracy for the attributes.

4.5 Cross-Dataset Testing Accuracies

In table IV, we present the cross-dataset testing performance of AFFACT, DMTL and SPLITFACE (NSA product rule). For AFFACT and the proposed method, we present two accuracies separated by a slash: the first is obtained by using the optimal threshold computed on the validation set to make detection decisions (for AFFACT) or to normalize scores before applying the product rule (for the proposed method), and the second is obtained by simply using the mid-value of each method’s score range as the threshold. Higher accuracies are obtained with optimal thresholds for the proposed method, while for AFFACT the accuracies drop slightly. It can be seen from this table that, for cross-dataset testing (trained on CelebA and tested on LFWA, or vice versa), the proposed method outperforms both AFFACT and DMTL by a relatively large margin. This again demonstrates the generalization capability of SPLITFACE, which is achieved by combining its unique architecture with the committee machine.

Train/Test CelebA LFWA
CelebA 89.07/90.32 92.6 90.39/87.14 79.5/73.84 73 79.32/74.56
LFWA -/- 70.2 78.15/77.88 -/- 86 85.99/85.28
TABLE IV: Cross dataset results. The three numbers for each Train-Test pair are for AFFACT [17], DMTL [4] and the Proposed method, respectively, from left to right.

Next, we evaluated the performance of SPLITFACE on the modified CelebA and LFWA partial-face datasets. The results for same-dataset and cross-dataset evaluation are presented in tables V and VI. In table V, results are presented for the AFFACT and SPLITFACE networks, both trained on the original CelebA training set and tested on the original and modified CelebA and LFWA datasets. Results both with and without the optimal thresholds (using the mid-value of the score range as the threshold instead) are shown in the table. It can be seen that SPLITFACE, especially NSA with the product rule and optimal thresholds, outperforms AFFACT in terms of accuracy on the full-face dataset as well as on both the cross-domain and partial datasets. The differences are more prominent when using the optimal thresholds, which shows that the threshold normalization step with a piecewise linear function can boost the overall performance. A similar scenario is found in table VI, where SPLITFACE is trained on the original LFWA training set and tested on both the original and partial CelebA and LFWA datasets. Since no pre-trained version of AFFACT on LFWA is publicly available, results for AFFACT could not be provided in this table. Note that in both tables V and VI, the committee machine approaches improve over the full face branch, especially for the partial-face datasets. This improvement can be attributed to the unique architecture of SPLITFACE, which harnesses local information from unoccluded facial segments, and to the ensemble aggregation approach of the committee machines.

Method CelebA C-U12 C-U34 C-L12 C-L34 C-R12 C-R34 LFWA L-U12 L-U34 L-L12 L-L34 L-R12 L-R34
AFFACT 90.32 77.98 81.86 80.56 84.93 80.18 85.07 73.84 68.83 71.12 69.01 73 69.21 73.44
Without Full 86.76 80.99 84.34 83.86 85.39 83.45 85.71 73.52 67.28 70.25 69.94 72.69 70.01 72.67
Optimal HRP 86.93 81.46 84.51 84.23 85.97 84.3 86.29 73.54 67.51 70.51 70.08 72.13 70.38 72.64
Threshold NSA Prod rule 87.14 83.24 85.53 84.6 86.79 84.84 86.5 74.56 68.61 71.45 70.34 73.2 70.35 73.32
NSA Med rule 87.07 83.22 85.47 84.51 86.75 84.76 86.44 74.3 68.49 71.59 70.36 73.24 70.27 73.1
AFFACT 89.07 83.05 85.60 84.98 87.47 85.33 87.69 79.5 74.77 77.55 74.97 78.34 74.85 78.19
With Full 90.72 83.99 87.14 87.19 89.33 87.46 89.63 72.32 66.96 69.37 68.22 71.1 68.72 71.25
Optimal HRP 90.8 84.27 87.59 87.11 89.31 87.82 89.71 72.67 67.28 69.94 67.94 70.91 68.42 71.35
Threshold NSA Prod rule 90.39 85.3 88.08 88.12 89.87 88.42 90.02 79.32 75.76 77.81 76.7 78.77 76.51 78.57
NSA Med rule 90.61 85.47 88.16 87.59 89.76 88.1 90.01 79.86 75.28 78.34 76.1 78.73 75.74 78.49
TABLE V: Networks trained on CelebA and tested on both full and partial CelebA and LFWA datasets.
Method CelebA C-U12 C-U34 C-L12 C-L34 C-R12 C-R34 LFWA L-U12 L-U34 L-L12 L-L34 L-R12 L-R34
Without Full 75.87 64.84 71.67 70.93 73.67 70.77 74 83.52 67.11 74.76 75.08 79.74 75.8 80.24
Optimal HRP 75.94 65.03 71.94 70.78 73.64 70.58 74.02 83.84 67.59 75.36 74.93 79.81 75.7 80.45
Threshold NSA Prod rule 77.88 67.85 75.63 71.39 75.95 71.23 75.82 85.28 70.47 79.78 75.74 81.62 76.35 82.52
NSA Med rule 78.22 67.72 75.55 71.3 75.84 71.16 75.72 85.18 70.3 79.64 75.62 81.51 76.25 82.43
With Full 76.37 65.46 71.74 70.97 74 70.83 74.35 84.02 69.38 76.05 76.09 80.51 76.74 80.86
Optimal HRP 76.58 66 72.18 71.22 74.21 71.16 74.61 84.36 70.13 76.66 76.42 80.85 77.06 81.28
Threshold NSA Prod rule 78.15 69.54 76.14 72.3 76.68 72.22 76.63 85.99 73.26 81.4 77.4 82.84 78.32 83.46
NSA Med rule 78.13 68.13 75.61 71.75 75.92 71.79 76.03 85.82 72.12 80.79 76.75 82.22 77.41 83.06
TABLE VI: Networks trained on LFWA and tested on both full and partial CelebA and LFWA datasets.

4.6 Analysis of Performance Degradation with Occlusion

Fig. 6: Attribute-wise comparison of performance changes (with respect to the performance on the unoccluded faces in CelebA) on the C-U12, C-L12 and C-L34 modified datasets. The vectors of differences are denoted delta-U12, delta-L12 and delta-L34, respectively.

One obvious observation from tables V and VI is that the attribute detection accuracies decrease with increasing occlusion. For example, all the methods achieve higher accuracies for the upper three-fourth faces in C-U34 and L-U34 than for C-U12 and L-U12, respectively. In this section, we explore the effect of occlusion on the accuracy of each attribute using Fig. 6, which plots the decrease in accuracy of SPLITFACE (after stage 1, before output pruning) on the partial CelebA datasets described in section 4.1, relative to the full-face accuracy. The differences are denoted delta-U12, delta-L12 and delta-L34, respectively. We observe that SPLITFACE fails on C-L12 and C-L34 for the same attributes, such as wavy hair, high cheekbones and wearing lipstick, since the part of the right side of the face that contains vital information in this regard is occluded in both cases. On the other hand, C-U12 shows reduced performance for attributes like ‘mouth slightly open’, ‘no beard’ and ‘smiling’, which are localized in the lower part of the face and therefore not visible in C-U12. Thus, SPLITFACE avoids catastrophic failures under occlusion: the prediction accuracy of the other attributes remains nearly constant and only the performance of invisible, localized attributes degrades. The output pruning step of SPLITFACE removes the attributes for which a segment performs badly in the first stage. Once trained, SPLITFACE utilizes information from different segments to bolster its decision about an attribute, and fills the gaps in one segment’s attribute set by using information from other segments that predict those missing attributes.

4.7 Performance for Partial Face Augmentation

We also trained SPLITFACE with training samples from the modified partial-face datasets. During training, samples from the modified datasets were picked with a fixed probability in each batch (a roughly 70-30 ratio of original to modified samples), while the rest of the samples came from the original datasets. The performance of the networks trained in this manner is presented in tables VII and VIII. In comparison to tables V and VI, we can see that for both the partially modified CelebA and LFWA datasets, performance improves greatly when partial faces augment the training samples in addition to segment dropout.

Methods CelebA C-U12 C-U34 C-L12 C-L34 C-R12 C-R34
Full 90.42 88.01 89.7 89.56 90.02 89.54 90.07
HRP 90.18 87.94 89.55 89.23 89.77 89.21 89.71
NSA Prod rule 90.39 88.43 89.77 89.76 90.02 89.86 90.11
NSA Med rule 90.52 88.46 90.05 89.85 90.16 89.85 90.21
TABLE VII: Performance of SPLITFACE trained on original and modified CelebA (70-30 ratio).
Methods LFWA L-U12 L-U34 L-L12 L-L34 L-R12 L-R34
Full 83.93 80.31 83.01 81.6 83 81.93 83.37
HRP 85.46 82.39 84.45 83.36 84.59 83.56 84.79
NSA Prod rule 86.04 82.06 84.87 83.07 84.97 83.43 85.14
NSA Med rule 85.97 82.17 84.88 83.06 84.87 83.4 85.23
TABLE VIII: Performance of SPLITFACE trained on original and modified LFWA (70-30 ratio).

5 Conclusion and Future Work

In this paper, we introduced SPLITFACE, an algorithm for facial attribute extraction utilizing multiple facial segments, a unique deep convolutional network, and a committee machine approach for ensemble aggregation. Through extensive experimentation, we have shown that the proposed method outperforms state-of-the-art facial attribute extraction methods when the faces are partially visible. Also, utilizing a committee machine approach, SPLITFACE achieved better generalization and therefore superior performance across domains. Moreover, when trained with both segment dropout and partial-face data, the network achieved even higher attribute detection accuracy for partially visible faces. The overall accuracies might be boosted by replacing the full face and segment branches with more advanced deep neural network architectures such as ResNet. On the other hand, since the segments overlap heavily and can therefore assist each other greatly, similar performance might be achievable with smaller input images. Finally, it would be interesting to see whether a cross-stitch network [29] can improve performance when connected to the segment network branches at certain intervals, by allowing the segment networks to share information.

Acknowledgments

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References