Face Attribute Prediction Using Off-the-Shelf CNN Features

02/12/2016
by Yang Zhong, et al.
KTH Royal Institute of Technology

Predicting attributes from face images in the wild is a challenging computer vision problem. To automatically describe face attributes from face-containing images, one traditionally needs to cascade three technical blocks --- face localization, facial descriptor construction, and attribute classification --- in a pipeline. As a typical classification problem, face attribute prediction has been addressed using deep learning. The current state-of-the-art performance was achieved with two cascaded Convolutional Neural Networks (CNNs), specifically trained to learn face localization and attribute description. In this paper, we experiment with an alternative way of employing the power of deep representations from CNNs. Combined with conventional face localization techniques, we use off-the-shelf architectures trained for face recognition to build facial descriptors. Recognizing that describable face attributes are diverse, our face descriptors are constructed from different levels of the CNNs for different attributes to best facilitate face attribute prediction. Experiments on two large datasets, LFWA and CelebA, show that our approach is entirely comparable to the state-of-the-art. Our findings not only demonstrate an efficient face attribute prediction approach, but also raise an important question: how best to leverage the power of off-the-shelf CNN representations for novel tasks.



1 Introduction

The recent success of Convolutional Neural Networks (CNNs) has driven advances in many areas of computer vision, such as image classification and object detection, and pushed the boundaries of image content understanding. In face recognition, we have witnessed great improvements brought by CNNs in solving challenging large-scale face verification and recognition tasks [5, 8].

Like recognizing identities, describing attributes from face images in the wild has been an active research topic for years. Being able to automatically describe face attributes from face images in the wild is very challenging but can be very helpful. For instance, one can not only build identifiers directly based on attributes [10], but also efficiently construct highly flexible large-scale hierarchical datasets, which can further benefit image classification and attribute-to-image generation [23, 22].

The general process of predicting face attributes is to construct face representations and train domain classifiers for prediction. As summarized in Figure 1, traditional approaches (Pipeline 1) construct low-level descriptors, such as SIFT [11] and LBP [1], guided by landmark detection. These descriptors are then used to build attribute classifiers. Alternatively, with CNNs one can employ massive sentence and image training instances to construct end-to-end deep architectures (Pipeline 2) that learn semantic-visual correspondences, as in [6, 21]. However, such approaches are rather resource demanding.

Figure 1: Potential pipelines of automatic attribute estimation.

An intuitive alternative (Pipeline 3) is to decompose the end-to-end network by functionality into a localization network, a feature construction network, and attribute classifiers, and build them individually as in [25]. By cascading the trained components, such a pipeline can achieve state-of-the-art performance. However, this approach demands enormous amounts of data and training effort. In addition, it appears that to reach the best performance, high-level features must be used in the concatenated deep networks; fine-tuning the pre-trained off-the-shelf high-level abstraction yields significantly better performance.

However, given that many face attributes are locally oriented and that different layers of CNN features encode different levels of visual information, we believe face attributes would not be best represented by merely the high-level features from deep neural networks. Thus, in this paper, we alternatively tackle the face attribute prediction problem using a pipeline composed of a conventional localization component, an off-the-shelf CNN, and attribute classifiers (Pipeline 4 in Figure 1). Our focus is finding proper feature representations from pre-trained CNNs to boost attribute prediction. We use off-the-shelf architectures and a publicly accessible model intended for face recognition to perform feature construction, and investigate which types of feature representations from the network can efficiently improve face attribute prediction.

Our investigations show that intermediate representations from pre-trained CNNs have distinct advantages over high-level features for the target face attribute prediction problem. By simply utilizing these features, we achieved very promising results, on a par with the state-of-the-art produced by the intensively trained two-stage CNN, on two recently released face attribute prediction datasets, CelebA and LFWA [25]. Our findings also suggest that off-the-shelf intermediate CNN representations can be readily utilized when transferring from the source problem to novel detection and classification tasks.

2 Related Work

Traditionally, face descriptors were built from hand-crafted features. These features were constructed either from the whole face area, or extracted from detected local components and concatenated into a single descriptor [9]. Classifiers were trained on these features to recognize the presence and quantitative degree of the domain attributes. Recently, Liu et al. [25] proposed a cascaded learning framework to perform attribute prediction in the wild. By pre-training and fine-tuning on large object and face datasets, it efficiently localizes faces and produces semantic attributes for arbitrary face sizes without alignment.

As strong feature learners, CNNs have been successfully applied to face recognition, especially for solving the challenging face-recognition-in-the-wild problem [5]. Besides the DeepID series of approaches [18, 16, 17], related efforts have also been made in pose correction [20], architecture design [19, 14], and data collection [12]. With recently launched hardware platforms and publicly accessible large-scale datasets [24], developing deep-learning-based face recognition approaches has become feasible with fewer resources.

3 Attribute Prediction using CNNs Off-the-shelf

3.1 Overview

To describe face appearance using CNN features, it is critical to first consider a proper face representation from the deep neural network. One natural way is to represent faces using the discriminatively learned features from the high-level hidden layers, which are mostly used to represent identities in face recognition tasks, as in [18]. In this case, appearance attributes are embedded in the activations of neurons of the discriminative feature.

However, to describe appearance using deep representations from CNNs, the selected representation should clearly preserve enough variability to capture appearance variations in facial physical characteristics, such as “big (eyes)” and “open (mouth)”. Conversely, when attributes are identity-correlated (e.g. gender and ethnicity), the representation should be robust to non-identity-related interference. Thus, the representation most suitable for describing a certain attribute depends highly on the property of the attribute itself (e.g. whether it is tied to identity). Given that a CNN's intermediate representations maintain both discriminability and rich spatial information [2, 13], it is therefore tenable to select feature representations flexibly when predicting face attributes.

3.2 Experiments

3.2.1 Procedures

To identify the most effective deep representations, our method explores the attribute prediction power of the intermediate representations versus the final representation (the high-level abstraction used for representing identity, often extracted from the last FC layer) of CNNs trained for face recognition. We therefore first trained a face classification CNN (or used a publicly available model), and then evaluated the prediction performance of the representations extracted from different levels of the CNN. The training of the CNNs and the evaluation of prediction power were conducted on two independent datasets: WebFace [24] was used for CNN training, and CelebA and LFWA for evaluation. We used two well-known off-the-shelf architectures (configurations of filter stacks) in our experiments to benefit from the latest developments in CNN architecture design.

Network architecture: The networks used in our experiments shared the same format: they were composed of off-the-shelf filter stacks followed by two Fully Connected layers (denoted by FC1 and FC2). Considering ease of training and efficient inference during the test phase, we selected Google’s FaceNet NN.1 [14] (shortened to “FaceNet” in the following) and VGG’s “very deep” model [15] as the structure of the convolutional (conv.) layers. The CNNs were trained in the most fundamental flat classification manner, with a Softmax layer attached to the last FC layer during training. We used dropout regularization between the FC layers to prevent overfitting; the dropout rate was set to 0.5 for all FC layers in our networks. PReLU [4] rectification was attached to each conv. and FC layer.
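As a concrete illustration, the following is a minimal PyTorch sketch of this network format, assuming an arbitrary conv. filter stack; the feature dimensions, identity count, and class name are placeholders rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class FaceRecognitionCNN(nn.Module):
    """Off-the-shelf conv. filter stack + FC1/FC2 + softmax head (training only)."""
    def __init__(self, conv_stack: nn.Module, conv_out_dim: int,
                 fc_dim: int = 512, num_identities: int = 10000):
        super().__init__()
        self.conv_stack = conv_stack  # e.g. FaceNet NN.1 or VGG "very deep" layers
        self.fc1 = nn.Sequential(
            nn.Linear(conv_out_dim, fc_dim), nn.PReLU(), nn.Dropout(p=0.5))
        self.fc2 = nn.Sequential(
            nn.Linear(fc_dim, fc_dim), nn.PReLU(), nn.Dropout(p=0.5))
        self.classifier = nn.Linear(fc_dim, num_identities)  # softmax head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spat = self.conv_stack(x)                    # intermediate spatial maps
        feat = self.fc2(self.fc1(spat.flatten(1)))   # FC1 then FC2
        return self.classifier(feat)                 # CrossEntropyLoss applies softmax
```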

Training: Around 10k identities and their image instances from the WebFace dataset were used (cf. Table 2). Random mirroring, slight rotation, and jittering were utilized for data augmentation. The learning rate was decreased by a factor of 10 whenever the validation set accuracy stopped increasing, and the networks were trained through 3 such decreases. Faces were segmented and normalized to a fixed size, and randomly cropped patches were fed into the network.
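This schedule can be sketched with standard PyTorch components; the initial learning rate of 0.01, the crop size of 128, and the augmentation magnitudes below are assumptions, since the paper's exact values are not given here:

```python
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(),        # random mirroring
    T.RandomRotation(degrees=5),     # slight rotation (magnitude assumed)
    T.ColorJitter(brightness=0.1),   # jittering (magnitude assumed)
    T.RandomCrop(128),               # random patch from the normalized face (size assumed)
    T.ToTensor(),
])

def make_training_tools(model: torch.nn.Module):
    # lr=0.01 is an assumption; the paper's initial value is not stated here.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # Drop the learning rate by 10x when validation accuracy plateaus;
    # training stops after three such decreases.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.1, patience=3)
    return optimizer, scheduler
# Per epoch: scheduler.step(validation_accuracy)
```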

Feature Extraction: To extract face descriptors from the CNNs, only the center patch of each aligned face image and its mirrored version were fed into the CNNs, unless otherwise stated. We aligned faces using feature points detected by random forests [7]. We took the averaged representations of the two fed-in patches at different levels of the network, i.e. “Spat. 3×3”, “Spat. 1×1”, “FC1”, and “FC2”, as shown in Figure 2, and evaluated their attribute estimation performance to identify the most effective representation for each attribute.
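A minimal sketch of this extraction step follows, reusing the attribute names of the hypothetical FaceRecognitionCNN sketch above (the real networks and layer names may differ):

```python
import torch

@torch.no_grad()
def extract_representations(model, center_patch: torch.Tensor) -> dict:
    """center_patch: (1, 3, H, W) center crop of one aligned face."""
    model.eval()
    mirrored = torch.flip(center_patch, dims=[3])  # horizontal mirror
    reps = {"spat": [], "fc1": [], "fc2": []}
    for x in (center_patch, mirrored):
        spat = model.conv_stack(x)                 # last conv. stack output
        fc1 = model.fc1(spat.flatten(1))
        fc2 = model.fc2(fc1)
        reps["spat"].append(spat)
        reps["fc1"].append(fc1)
        reps["fc2"].append(fc2)
    # Average each representation type over the two fed-in patches.
    return {k: torch.stack(v).mean(dim=0) for k, v in reps.items()}
```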

The output of the last conv. filter stack was selected as the representative of the intermediate representations, since it has been shown to carry the most discriminability and spatial information for recognition and image retrieval [2]. Extra max pooling steps were applied to reduce the dimension of the intermediate spatial representations: Spat. 3×3 and Spat. 1×1 were of size 3×3×C and 1×1×C in our experiments regardless of the network, where C denotes the channel depth of the employed network.
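One way to realize the extra pooling is adaptive max pooling, as in the sketch below; the paper's exact pooling parameters are not specified here, so this choice is an assumption:

```python
import torch.nn.functional as F

def spatial_representations(conv_out):
    """conv_out: (1, C, H, W) output of the last conv. filter stack."""
    spat_3x3 = F.adaptive_max_pool2d(conv_out, output_size=3)  # (1, C, 3, 3)
    spat_1x1 = F.adaptive_max_pool2d(conv_out, output_size=1)  # (1, C, 1, 1)
    # Flatten to descriptors of length 9*C and C respectively.
    return spat_3x3.flatten(1), spat_1x1.flatten(1)
```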

Figure 2: Pipeline for extracting deep representations from the trained CNN. The intermediate features (Spat. 3×3, Spat. 1×1, FC1) and the final representation FC2 are extracted from the trained network; 3×3 and 1×1 refer to the side of the deep feature map after the extra pooling step. In total, 4 types of representations were studied for face attribute prediction.

Attribute prediction: The face attribute prediction performance was evaluated on the released versions of the CelebA and LFWA datasets (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, released in Oct. 2015). CelebA contains approximately 200k images of about 10k identities, and LFWA has 13,233 images of 5,749 identities. Each image in CelebA and LFWA is annotated with 40 binary attribute tags. We used the same procedure to build our attribute classifiers as in [25]: binary linear SVM [3] classifiers were trained directly on each level of representation (i.e. Spat. 3×3, Spat. 1×1, FC1, and FC2) to classify face attributes. On CelebA, the training set for each attribute classifier had a fixed number of image instances (where available). Since this dataset and the data used for training our CNN are independent (and the learning targets differ), we tested the attribute prediction accuracy of our classifiers across the whole dataset through random selection of training and testing face instances. On LFWA, we took the training instances defined by the dataset. We report the prediction accuracy as the mean of the True Acceptance Rate and True Rejection Rate for each attribute on both datasets.
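A minimal sketch of this classification protocol, assuming the feature matrices have already been extracted; scikit-learn's LinearSVC stands in for the linear SVM of [3], and the regularization constant is an assumed default:

```python
import numpy as np
from sklearn.svm import LinearSVC

def balanced_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of True Acceptance Rate and True Rejection Rate."""
    pos, neg = y_true == 1, y_true == 0
    tar = (y_pred[pos] == 1).mean()   # true acceptance rate
    trr = (y_pred[neg] == 0).mean()   # true rejection rate
    return (tar + trr) / 2

def train_attribute_classifier(X_train, y_train, X_test, y_test):
    clf = LinearSVC(C=1.0)            # C=1.0 is an assumed default
    clf.fit(X_train, y_train)
    return clf, balanced_accuracy(np.asarray(y_test), clf.predict(X_test))
```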

Evaluations and Comparison: The same evaluation protocol as in [25] was used in our experiments. Since the features in our experiments were extracted from aligned face images and the alignment process was independent of the network, we selected the corresponding approach (“[17]+ANet” in [25]) as the baseline method. The current state of the art in [25] is denoted by “Two-stage CNN” and “LNet+ANet” in this paper.

The above-mentioned procedures were used in the following experiments. We first employed our FaceNet to thoroughly study the discrepancy between the different representation types for face attribute prediction. The identified best performing off-the-shelf features were then used to compete with [25] on CelebA and LFWA. We then extended our experiments by investigating different configurations of the FaceNet, the VGG “very deep” architecture, and the publicly available VGG-Face model (http://www.robots.ox.ac.uk/~vgg/software/vgg_face/, accessed in Nov. 2015) to confirm the discrepancy in attribute prediction power among the deep representations. The per-attribute selection step is sketched below.
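For concreteness, selecting the best representation per attribute can be sketched as follows, reusing the hypothetical train_attribute_classifier helper from the previous sketch; all names here are illustrative:

```python
def best_representation_per_attribute(features, labels_train, labels_test,
                                      attribute_names):
    """features: {rep_name: (X_train, X_test)}; labels_*: (n_faces, n_attributes)."""
    best = {}
    for a, attr in enumerate(attribute_names):
        scores = {}
        for name, (X_tr, X_te) in features.items():
            _, acc = train_attribute_classifier(
                X_tr, labels_train[:, a], X_te, labels_test[:, a])
            scores[name] = acc
        best[attr] = max(scores, key=scores.get)  # winning representation level
    return best
```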

3.2.2 Performance Discrepancy between Deep Representations

Our intuition, as stated above, was that the intermediate face representations would be more suitable for describing diverse types of attributes with respect to their physical characteristics and image conditions. To validate this, we trained a face recognition CNN with the FaceNet structure. The length of both FC layers was set to 512 to reduce the risk of overfitting. The recognition rate of the trained FaceNet on the validation set was less than 98%, and its face verification performance on LFW [5] was 97.5%. We then extracted the four types of face representations, Spat. 3×3, Spat. 1×1, FC1, and FC2, from the trained model, and linear classifiers were constructed and evaluated for each of them. The prediction performance of each representation type on all attributes is shown in Figure 3.

Figure 3: Comparing prediction performance of deep representations on CelebA and LFWA. FC2 was set as the reference for each attribute, and the relative prediction power of each representation type is plotted as its difference from the reference. The attributes are sorted by the absolute prediction accuracy of the baseline method on each dataset.

While it is intuitive that FC2, the identity-discriminative feature, is unlikely to always be the best choice for describing facial attributes, it is still striking that FC2 was significantly outperformed in prediction accuracy by the other representations on a substantial number of attributes. Similar disadvantages can also be observed on the LFWA dataset. It is easy to see that:

  1. Representations at different levels of the network exhibit quite diverse performance in attribute description.

  2. Intermediate representations, especially Spat. 3×3, are likely more effective at capturing weakly identity-related attributes describing expressions and image conditions, which rely more on spatial information.

For instance, for attributes related to the mouth and eyes, which produce dynamic facial expressions, such performance gaps are significant. Spat. 3×3, which better preserves spatial information, is more effective at specifying the shape and motion of facial components (e.g. Attributes 20 and 34 on CelebA in Figure 3). This is natural, since intermediate representations contain mid-level features composed of low-level ones, making them more suitable for describing local facial attributes.

Our investigations show that the best performing representations achieved attribute prediction accuracies of 86.6% on CelebA and 84.7% on LFWA, on a par with the state-of-the-art “Two-stage CNN” approach, which was trained with massive image classification and face data. The comparative results are listed in Table 1 and shown in Figure 4. One can see that:

  1. By leveraging intermediate deep representations from various levels of the CNN, the equivalent baseline approach is outperformed by a large margin.

  2. Even without fine-tuning the pre-trained CNN, our average prediction performance is still comparable to the state-of-the-art on both datasets.

Figure 4: Comparing attribute prediction results on CelebA and LFWA.

Here we noticed that the intermediate representations dominated the best representations of the attributes. This indicates that spatial information, i.e. the location and magnitude of activations in conv. filter responses, is significant for describing attributes; if one wants to use high-level features, which only implicitly embed spatial information, fine-tuning must be conducted on the high-level abstractions to recover such useful spatial information.

                        CelebA                       LFWA
Attribute            Baseline  LNet+ANet  Ours   Baseline  LNet+ANet  Ours
5 o'Clock Shadow        86        91       89       78        84       77
Arched Eyebrows         75        79       83       66        82       83
Attractive              79        81       82       75        83       79
Bags Under Eyes         77        79       79       72        83       83
Bald                    92        98       96       86        88       91
Bangs                   94        95       94       84        88       91
Big Lips                63        68       70       70        75       78
Big Nose                74        78       79       73        81       83
Black Hair              77        88       87       82        90       90
Blond Hair              86        95       93       90        97       97
Blurry                  83        84       87       75        74       88
Brown Hair              74        80       79       71        77       76
Bushy Eyebrows          80        90       87       69        82       83
Chubby                  86        91       88       68        73       75
Double Chin             90        92       89       70        78       80
Eyeglasses              96        99       99       88        95       91
Goatee                  92        95       94       68        78       83
Gray Hair               93        97       95       82        84       87
Heavy Makeup            87        90       91       89        95       95
High Cheekbones         85        87       87       79        88       88
Male                    95        98       99       91        94       94
Mouth S. Open           85        92       92       76        82       81
Mustache                87        95       93       79        92       94
Narrow Eyes             83        81       78       74        81       81
No Beard                91        95       94       69        79       80
Oval Face               65        66       67       66        74       75
Pale Skin               89        91       85       68        84       73
Pointy Nose             67        72       73       72        80       83
Receding Hairline       84        89       87       70        85       86
Rosy Cheeks             85        90       88       71        78       82
Sideburns               94        96       95       72        77       82
Smiling                 92        92       92       82        91       90
Straight Hair           70        73       73       72        76       77
Wavy Hair               79        80       79       65        76       77
Wearing Earrings        77        82       82       87        94       94
Wearing Hat             93        99       96       82        88       90
Wearing Lipstick        91        93       93       86        95       95
Wearing Necklace        70        71       73       81        88       90
Wearing Necktie         90        93       91       72        79       81
Young                   81        87       86       79        86       86

Table 1: Comparing prediction accuracy (in %) on CelebA and LFWA. The corresponding average values of our approach are 86.6% and 84.7%; for the baseline method, 83% and 76%; for the current best LNet+ANet approach, 87% and 84%.

3.2.3 Further Validations

To further verify the potential utility of intermediate spatial representations for face attribute prediction, we also evaluated various network architectures trained under different configurations, listed for each model in Table 2. CelebA was selected as the evaluation dataset due to its larger scale.

Specifically, we first evaluated two different networks of the FaceNet architecture. Model 1 was trained on the first ~8k identities with the most images in the WebFace dataset (i.e. discarding the long-tail data). Since the length of the representing features plays a vital role in face representation [12], we then evaluated the influence of varying FC layer lengths on face attribute prediction with Model 2, by increasing the length of the FC layers to 1024. The receptive field was kept the same.

We also cross-validated the utility of the deep representations with VGG-type architectures in Models 3 and 4. Model 3 had the same filter stack as VGG Config. C [15], but with a duplicated conv. and pooling section appended after the fifth pooling so that it was even deeper. (Thus, for this configuration, the filter stack directly gave the 3×3 output, which was then max pooled to obtain the 1×1 output.) To bring more divergence, we decreased the input size (still cropped from the normalized faces) and set the FC lengths to 4096. Model 4 was the off-the-shelf VGG-Face network; its receptive area was 224×224. The corresponding results, in terms of averaged prediction accuracy, are provided in Table 3.

Model   Conv. Filter Stack Architecture   FC1 Dim.   FC2 Dim.   Training Dataset / Identities
1       FaceNet                            512        512       WebFace, ~8k
2       FaceNet                            1024       1024      WebFace, ~10k
3       VGG, Config. C                     4096       4096      WebFace, ~10k
4       VGG-Face                           4096       4096      private, >2.6k

Table 2: CNNs and training data used for further validations.
Model   # of Best Rep. from                       Ave. accuracy
        Spat.3×3   Spat.1×1   FC1   FC2           Best Rep.   FC2
1          33          0       6     1               86%      84%
2          31          0       6     3               86%      84%
3          28         11       1     0               85%      83%
4          37          1       0     2               86%      85%

Table 3: Decomposition of the best representations for the architectures in Table 2. The table gives the number of attributes for which each representation type formed the best representation for each model, together with the average prediction accuracy achieved by the best representations and by the FC2 representation.

We observed that, on average, the spatial representations excelled on more than 80% of the attributes. The spatial representations from the off-the-shelf VGG-Face model even dominated the best representations, which we attribute to the dramatic increase of its receptive area. The intermediate representations, which embed more detailed spatial information, also further boosted overall performance. The slightly worse performance of the features from Model 3 can be attributed to its smaller receptive field and the extra 6th pooling, which transferred prediction power from Spat. 3×3 to Spat. 1×1.

Through further analysis of the results, we found that the intermediate spatial representations predicted 5 attributes (“Bags Under Eyes”, “Blurry”, “Mouth S. Open”, “Pale Skin”, and “Narrow Eyes”) much better than the last FC representations.

We believe the reason the intermediate spatial representations outperformed on so many attributes is that these human-describable attributes are more naturally identified from spatial information, much as they are by human observers. Considering that these attributes are semantic concepts tied to specific facial regions, and that such regions alone can hardly pinpoint a specific identity in a crowd, the utility of the intermediate CNN features, which are less identity-discriminative than the high-level abstraction, makes sense.

4 Conclusions

In this paper, we address the problem of predicting face attributes using CNNs trained for face recognition. We employ CNNs with off-the-shelf architectures and publicly available models --- Google’s FaceNet and VGG’s “very deep” model --- in a conventional pipeline to study the prediction power of different representations from the trained CNNs. Our investigations reveal the diverse correspondence between the best performing representations and the human-describable attributes. They also show that intermediate CNN representations are very effective for predicting facial attributes in general. Although previous work has shown that fine-tuning pre-trained networks brings significant improvements when transferring to novel tasks, we empirically demonstrate that intermediate deep features from pre-trained networks form a promising alternative. By properly leveraging these off-the-shelf CNN representations, we achieved attribute prediction accuracy on a par with the current state-of-the-art.

Acknowledgments

We gratefully acknowledge the support from NVIDIA Corporation for GPU donations. We have enjoyed discussions with Ali Sharif Razavian and Atsuto Maki.

References