A Survey to Deep Facial Attribute Analysis

12/26/2018, by Xin Zheng et al., Dalian University of Technology

Facial attribute analysis has received considerable attention with the development of deep neural networks in the past few years. Facial attribute analysis contains two crucial issues: Facial Attribute Estimation (FAE), which recognizes whether facial attributes are present in given images, and Facial Attribute Manipulation (FAM), which synthesizes or removes desired facial attributes. In this paper, we provide a comprehensive survey on deep facial attribute analysis covering FAE and FAM. First, we present the basic knowledge of the two stages (i.e., data pre-processing and model construction) in the general deep facial attribute analysis pipeline. Second, we summarize the commonly used datasets and performance metrics. Third, we create a taxonomy of the state-of-the-arts and review detailed algorithms in FAE and FAM, respectively. Furthermore, we introduce several additional facial attribute related issues and applications. Finally, the possible challenges and future research directions are discussed.


1 Introduction

Facial attributes represent intuitive semantic features that describe human-understandable visual properties of face images, such as smiling, eyebrows, and mustache. Moreover, as vital information about faces, facial attributes contribute to numerous successful real-world applications, e.g., face verification kumar2009attribute ; berg2013poof ; song2014verification ; zhang2018demeshnet , face recognition he2018wasserstein ; shi2015person_reid ; he2018learning ; song2018adversarial , face retrieval li2015twobrtdonestone ; nguyen2018large ; fang2018attribute , and face image synthesis huang2018introvae ; cao2018load ; huang2018variational ; song2018geometry . Facial attribute analysis aims to build a bridge between such human-understandable visual descriptions and the abstract feature representations required by many computer vision tasks. Recently, deep neural networks have gained popularity in facial attribute analysis research and dramatically improved the state-of-the-art performance. Specifically, deep facial attribute analysis mainly contains two subproblems: Facial Attribute Estimation (FAE) and Facial Attribute Manipulation (FAM). Given a face image, FAE recognizes whether a describable attribute of visual appearance is present by training attribute classifiers kumar2009attribute . As an inverse problem of FAE, FAM modifies face images to synthesize or remove desired attributes by constructing generative models.

Deep facial attribute estimation (FAE) methods can be generally categorized into two groups: part-based methods and holistic methods. Part-based FAE methods first locate the positions of facial attributes and then extract features according to the obtained localization cues. Depending on the scheme used to locate facial attributes, part-based methods can be further classified into two sub-categories: separate auxiliary localization and end-to-end localization. More specifically, separate auxiliary localization methods seek help from existing part detectors or auxiliary localization algorithms, e.g., facial key point detection mahbub2018segment and semantic segmentation kalayeh2017improving , and then learn features from different positions for the subsequent estimation. Note that the localization and estimation are operated in a separate and independent manner. By contrast, end-to-end localization methods exploit the locations of facial attributes and predict their presence simultaneously in an end-to-end way. Compared with part-based methods, holistic methods mainly solve a multi-task problem, which takes each attribute as a single task and predicts multiple attribute labels simultaneously. All processes are completed in a unified framework, without any extra localization modules. Specifically, such methods model the association and distinction among different attributes to explore complementary information, where the general strategy to achieve this goal is designing various networks that share features from different layers. Besides, other prior or auxiliary information, such as manual attribute grouping or identity information assistance, is also taken into consideration when predicting attributes in the multi-task framework.

Deep facial attribute manipulation (FAM) methods are mainly based on generative models and can be grouped into two categories: model-based methods and extra condition-based methods, where the main difference between them is whether extra conditions are required. Model-based methods construct a model without any extra conditional inputs and learn a set of model parameters that correspond to only one attribute during a training process. That means when another attribute needs to be changed, another training process is necessary. In contrast, extra condition-based methods take extra attribute vectors or reference images as input conditions, and can alter multiple attributes simultaneously by changing the corresponding values of the attribute vectors or by taking multiple exemplars with different attributes as references. Specifically, given the original to-be-manipulated images, extra conditional attribute vectors, such as one-hot vectors denoting the attribute presence, are concatenated with the latent codes of the original images, while extra conditional reference exemplars exchange specific attributes with the original images in the framework of image-to-image translation. Note that these reference images do not need to have the same identity as the original to-be-manipulated images. Recently, more and more researchers have shifted their focus to methods conditioned on reference exemplars, since more specific details that appear in the reference images can be exploited to generate more realistic images compared with altering attribute vectors manually.

A large number of facial attribute analysis approaches have been developed in recent years. In this paper, we provide the evolutions of FAE and FAM in Figure 1 and Figure 2, respectively. Note that only typical methods are listed to illustrate the timelines of the development.

We provide the general pipeline and evolution of deep FAE methods in Figure 1. To begin with, given a large-scale facial attribute dataset, data pre-processing is usually the first step, where face detection and alignment, as well as data augmentation, are commonly used schemes. After that, deep neural networks are designed to extract deep features, so that the presence of facial attributes can be estimated by attribute classifiers in the following step. The same goal can also be achieved in an end-to-end way, where feature learning and attribute classification are implemented in a unified framework. Specifically, for the construction of deep models, deep FAE methods follow two parallel routes, i.e., part-based methods and holistic methods. As mentioned above, the main difference between the two is that part-based methods introduce localization mechanisms into feature extraction networks, while holistic ones capture attribute relationships to utilize complementary information. With the continuous development of deep FAE methods, part-based methods place more emphasis on facial details for modeling localization cues, while holistic methods tend to design more specific networks and learn more discriminative features with attribute relationships.

The development of deep FAM methods is illustrated in Figure 2. Since model-based methods train a model corresponding to only one attribute, the process of model construction does not have a general form. Hence, we only provide the pipelines of the two extra condition-based approaches, i.e., those conditioned on attribute vectors and those conditioned on reference exemplars. The main difference between the two is whether conditions need to be given at the test stage: taking exemplars as references requires samples during both training and testing, while attribute vectors are only required at the training stage. Deep FAM methods conditioned on attribute vectors feed both the to-be-manipulated image and the desired attribute vectors into a conditional generative model to generate edited face images. By contrast, deep FAM methods conditioned on reference exemplars take both the to-be-manipulated image and a reference image as inputs and feed them into an image translation network. Then, the attributes can be exchanged between the two images in an adversarial paradigm, where a discriminator is used to distinguish real images from generated fake images, as well as to discriminate the attribute categories of all images. Recently, more and more researchers have shifted their focus to the reference exemplar-based methods. Moreover, when designing conditional generative models, the combination of GANs and VAEs has become a widespread tendency.

Although a large number of methods achieve appealing performance in FAE and FAM, some challenges remain in facial attribute analysis. In FAE, facial attribute data has a significantly class-imbalanced distribution, where some classes have a much higher number of examples than others, corresponding to the majority and minority classes, respectively. As a result, the learned model is biased towards the majority class samples and performs poorly on real-world data with a balanced distribution. Therefore, class-imbalanced learning is an urgent issue to be solved in FAE. In FAM, methods taking attribute vectors as conditions can alter multiple attributes simultaneously but cannot change attributes continuously, while methods conditioned on reference exemplars are capable of editing an arbitrary attribute but incapable of manipulating multiple attributes at once. Besides, how to keep the attribute-irrelevant details unchanged is still a common challenge faced by the two types of methods.

In this paper, we make an in-depth survey of deep learning-based facial attribute analysis, covering both FAE and FAM. The primary goal is to provide an overview of the two issues, as well as to point out their respective strengths and weaknesses, giving newcomers the essential background for facial attribute analysis. The rest of this paper is organized as follows. In Section 2, we provide some basics of data pre-processing and facial attribute analysis modeling. For FAE, three main steps are required: pre-processing, feature extraction, and classification, while for FAM, two steps are taken: pre-processing and generative model construction. In Section 3, we summarize commonly used publicly available facial attribute datasets and metrics. Section 4 and Section 5 provide detailed overviews of state-of-the-art deep FAE and FAM methods, as well as their strengths and weaknesses, respectively. Additional related issues and challenge discussions are introduced in Section 6 and Section 7, respectively. Finally, we conclude this paper in Section 8.

Figure 1: The evolution of facial attribute estimation methods
Figure 2: The evolution of extra condition-based facial attribute manipulation methods

2 Facial Attribute Analysis Preliminaries

In this section, we provide some basics about data pre-processing and model construction in facial attribute analysis. First, several commonly used data pre-processing strategies in both FAE and FAM are introduced, including face detection and alignment, as well as data augmentation. Since current FAM algorithms almost always operate on aligned face images, data pre-processing is not especially emphasized for FAM, so we do not elaborate on it further. Second, we provide the general processes of model construction in FAE and FAM, respectively. Specifically, we provide the basics of facial attribute feature extraction and classification, which are two crucial parts when designing deep FAE networks. Furthermore, we review the basic VAE and GAN, as well as their respective conditional versions, which are important backbones in deep FAM.

2.1 Pre-processing

Variations in illumination, pose, occlusion, and low image quality are the primary influencing factors in unconstrained scenarios for facial attribute analysis. Therefore, pre-processing is a necessary step before training deep networks, especially for FAE and FAM, as the detailed information used for estimation and manipulation may be damaged by these adverse effects.

2.1.1 Face Detection and Alignment

Before databases with richer facial attribute annotations were released, most attribute prediction methods zhang2014panda ; kumar2008facetracer ; gkioxari2015wholeandparts had to take whole human images (more than face regions) as inputs, and only several well-marked facial attributes could be estimated, such as smile, gender, and has glasses. There is no doubt that much irrelevant information is involved, resulting in redundant computation. Hence, face detection and alignment become crucial steps for locating face areas.

In face detection, Ranjan et al. ranjan2017hyperface first recognize the gender attribute with a HyperFace detector that locates faces and landmarks. Günther et al. GuntherRB17Affact further predict 40 facial attributes simultaneously with the HyperFace detector. Kumar et al. kumar2008facetracer use the poselet part detector Bourdev2009Poselets to detect different parts corresponding to different poses, where the face is taken as one part of an image of the whole person region. Compared with such a poselet detector, which operates as a sliding window based on gradient orientation features, Gkioxari et al. gkioxari2015wholeandparts propose a 'deep' version of the poselet, which trains a sliding window detector on deep feature pyramids. Specifically, the deep poselet detector divides the human body into three parts (head, torso, and legs) and clusters the fiducial key points of each part into many different poselets. However, as all the face detectors mentioned above are used to find the rough facial part and other body parts, more subtle detail-related facial attributes, such as 'eyebrows', cannot be predicted.

In face alignment, well-annotated databases with fiducial key points bring significant benefits for both FAE and FAM methods. With the help of these fiducial key point annotations, more detailed facial regions of attributes can be captured, so that misalignment errors do not make features corresponding to specific regions fall outside of these parts. The All-in-One Face framework ranjan2017allinone provides fiducial key points along with the full face. Based on these fiducial key points, Mahbub et al. mahbub2018segment divide a face into 14 facial segments related to different facial regions. As a result, attribute prediction for partial faces can be handled. Kumar et al. kumar2008facetracer artificially break up the face into 10 functional parts, including hair, forehead, eyebrows, eyes, nose, cheeks, upper lip, mouth, and chin. Such a partition ensures that these regions of the face are wide enough to be robust against discrepancies among individual faces and overlap slightly. In this way, the geometry shared by different faces can be exploited, and small errors in alignment do not degrade the performance of the features.

Note that there is a trend towards integrating face detection and alignment into the training process of facial attribute analysis. He et al. he2017jointlyFAEandDetec treat face detection as a special case of general semi-rigid object detection. Consequently, the performance of both face detection and attribute estimation is improved through the well-designed architecture and joint learning. More importantly, input images from the wild, which vary a lot in illumination and occlusion, can also be handled to predict facial attributes without cropping and aligning. Ding et al. ding2017deepCascadeFAE propose a cascade network to localize the face regions according to different attributes and perform facial attribute estimation simultaneously, with no need to align faces. Li et al. li2018landmark design an AFFAIR network that learns a hierarchy of spatial transformations and predicts facial attributes without landmarks. To sum up, more and more research is devoted to integrating face detection and alignment into the training process, and even to exploring schemes for estimating attributes without face detection and alignment.

2.1.2 Data Augmentation

In most face processing tasks, data augmentation is a vital strategy for dealing with insufficient training data and overfitting in deep learning, and facial attribute analysis is no exception: there are not many publicly available datasets with a sufficient quantity of images for training. By imposing perturbations and distortions on the input images, the data can be extended to improve deep learning models.

Günther et al. GuntherRB17Affact propose an Alignment-Free Facial Attribute Classification Technique (AFFACT) with data augmentation. More specifically, AFFACT leverages shifts, rotations, and scales of images to make facial attribute feature extraction more reliable in both the training and testing stages. In the training stage, face images are first rotated, scaled, cropped into RGB images, and horizontally flipped with 50% probability using defined coordinates. Then, a Gaussian filter is applied to emulate smaller image resolutions and yield blurred upscaled images. In the testing stage, AFFACT first rescales the test images to 256x256 resolution. Then, it transforms each image into 10 crops, including a center crop and the four corners of the original image, as well as their horizontally flipped versions. Finally, it averages the per-attribute scores from the deep network over these ten crops to make the final prediction. Apart from taking only crops, AFFACT also uses all combinations of shifts, scales, and angles, as well as their mirrored versions. Consequently, AFFACT effectively enhances the performance of the deep network in FAE.
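For illustration, the ten-crop averaging scheme described above can be reproduced with standard tooling. The following sketch is not the authors' code; it uses torchvision transforms and assumes a `model` that outputs one score per attribute for 224x224 crops.

```python
# Sketch of AFFACT-style ten-crop test-time averaging (illustrative, not the original code).
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),                    # rescale the test image
    transforms.TenCrop(224),                   # center, four corner crops, and their mirrored versions
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_attributes(model, pil_image):
    crops = ten_crop(pil_image)                # tensor of shape (10, 3, 224, 224)
    with torch.no_grad():
        scores = model(crops)                  # (10, num_attributes), one row of scores per crop
    return scores.mean(dim=0)                  # average the per-attribute scores over the ten crops
```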

2.2 Facial Attribute Estimation Basis

A straightforward way to predict facial attributes is to first extract discriminative features and then train attribute classifiers. Therefore, in this section, we introduce the feature extraction and attribute classification methods that are generally utilized in FAE.

2.2.1 Deep Facial Attribute Feature Extraction

Deep convolutional neural networks (CNNs) play a significant role in learning discriminative representations for FAE and achieve state-of-the-art performance. In general, arbitrary classical CNN architectures that learn to focus on detailed properties, such as VGG parkhi2015deep and ResNet he2016resnet , can be used to extract deep facial attribute features. Zhong et al. zhong2016face simply apply the FaceNet and VGG-16 networks for learning facial features. Günther et al. GuntherRB17Affact investigate the performance of ResNet for extracting features on the FAE task.
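As a concrete illustration of reusing a classification backbone for attribute features, the sketch below (our own example, not tied to a particular paper) attaches a 40-way output head to a torchvision ResNet-50; the attribute count follows CelebA/LFWA.

```python
# Sketch: a classification CNN reused as a facial attribute feature extractor with a 40-way head.
import torch.nn as nn
from torchvision import models

class AttributeNet(nn.Module):
    def __init__(self, num_attributes=40):
        super().__init__()
        backbone = models.resnet50(weights=None)      # any classical backbone (VGG, ResNet, ...) works
        backbone.fc = nn.Identity()                   # expose the 2048-d pooled feature
        self.backbone = backbone
        self.head = nn.Linear(2048, num_attributes)   # one logit per attribute

    def forward(self, x):
        return self.head(self.backbone(x))            # raw logits; apply a sigmoid for probabilities
```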

To analyze how features at different levels of networks affect FAE performance, Zhong et al. zhong2016leveraging take mid-level CNN features as an alternative to the high-level ones for attribute prediction. The experiments demonstrate that even the early convolution layers achieve performance comparable to the state of the art on most facial attributes, and that such mid-level representations are capable of breaking the bounds of the interconnections between convolutional and fully connected (FC) layers, allowing arbitrary receptive field sizes of CNNs to be accepted.

According to the two types of deep FAE methods, i.e., part-based methods and holistic methods, researchers consider the following two problems when designing networks for discriminative feature extraction.

(1) how to make networks focus more on the locations of attributes?

(2) how to make the best of the relationships among attributes?

We provide more details for the two concerns when introducing state-of-the-art methods hereinafter.

Besides, there exist methods designing specific network architectures for learning discriminative features. Lu et al. lu2017fully design an automatically constructed compact multi-task deep learning architecture, which starts with a thin multi-layer network and dynamically widens in a greedy manner. Belghazi et al. belghazi2018hierarchical build a hierarchical generative model and a corresponding inference model through the adversarial learning paradigm.

2.2.2 Deep Facial Attribute Classifiers

Early methods learn deep feature representations with deep networks but make predictions with traditional classifiers, such as support vector machines (SVMs) cortes1995support ; bourdev2011describing , decision trees luo2013sum-product , and the k-nearest neighbor (kNN) algorithm huang2016learning ; huang2018deep . For example, Kumar et al. kumar2009attribute train multiple SVMs cortes1995support with radial basis function (RBF) kernels for multiple attribute prediction, where each SVM corresponds to one facial attribute. Bourdev et al. bourdev2011describing present a feedforward classification system with linear SVMs and classify attributes at the image patch level, the whole image level, and the semantic relationship level, respectively. Luo et al. luo2013sum-product construct a sum-product decision tree network for the prediction of facial attributes. Recently, even the kNN algorithm huang2016learning ; huang2018deep has been adopted for dealing with the class-imbalanced FAE problem.

Regarding attribute classifiers based on deep networks, FC layers attached to the end of deep feature extraction networks serve as the basic facial attribute estimators. What matters most, however, is measuring the losses between the outputs of the FC layers and the ground truths with proper loss functions to reduce the classification error.

Rudd et al. first take multiple facial attribute classification as a regression issue, minimizing the mean squared error (MSE) loss, i.e., the Euclidean loss, by mixing the errors of all attributes. They train a mixed objective optimization network rudd2016moon (MOON) based on the VGG-16 topology parkhi2015deep . In this way, multiple attribute labels can be obtained simultaneously via a single deep convolutional neural network (DCNN). Rozsa et al. rozsa2016facial also adopt the Euclidean loss to train multiple DCNNs, one for each facial attribute.

Apart from the Euclidean loss, the most commonly used loss function is the sigmoid cross-entropy loss, which treats each attribute prediction as a binary classification task. Hand and Chellappa hand2017attributes develop a multi-task deep CNN (MCNN) and use the sigmoid cross-entropy loss to evaluate its output and calculate the scores of all facial attributes. Furthermore, the output scores are fed into an auxiliary network (AUX) for learning attribute correlations at the score level. Moreover, Günther et al. GuntherRB17Affact provide an evaluation of the Euclidean loss and the sigmoid cross-entropy loss. Experiments with the same network but different loss functions show that the two achieve comparable performance, which illustrates that the choice between these loss functions does not affect performance much under the same network.
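To make the two objectives concrete, the snippet below sketches the Euclidean (MSE) loss used in MOON-style regression and the sigmoid cross-entropy loss used in MCNN-style binary classification. The tensor shapes of (batch, 40) and the binary targets are assumptions for illustration.

```python
# Sketch of the two multi-attribute losses discussed above (shapes and targets are assumptions).
import torch.nn.functional as F

def euclidean_loss(outputs, targets):
    # MSE ("Euclidean") loss: attribute estimation treated as regression of the label values.
    return F.mse_loss(outputs, targets.float())

def sigmoid_cross_entropy_loss(logits, targets):
    # Sigmoid cross-entropy: each attribute treated as an independent binary classification task.
    return F.binary_cross_entropy_with_logits(logits, targets.float())
```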

2.3 Facial Attribute Manipulation Basis

Mainstream FAM methods edit attributes under the framework of the conditional generative model, taking the attribute vectors or reference exemplars as conditions. As the backbones of generative models, VAEs and GANs goodfellow2014gan play important roles in FAM. In this section, we briefly review the two basic models and their corresponding conditional versions.

2.3.1 Variational autoencoder

In general, a VAE has two components: the encoder, which encodes a data sample $x$ into a latent representation $z$, i.e., $q_\phi(z|x)$, and the decoder, which maps the obtained latent representation $z$ back to the data space, i.e., $p_\theta(x|z)$. VAE first samples $z$ from the distribution of the encoder $q_\phi(z|x)$, where the prior $p(z)$ is typically a standard Gaussian $\mathcal{N}(0, I)$. Then, the obtained sample is fed into the differentiable generator network to obtain $x$ by sampling from $p_\theta(x|z)$. The key of VAE is training to maximize the variational lower bound $\mathcal{L}(\theta, \phi; x)$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right), \quad (1)$$

where $D_{KL}$ denotes the Kullback-Leibler divergence.

As for the conditional version of VAE, given the attribute vector $y$ and latent representation $z$, it aims to build a model for generating images containing the desired attributes, taking $y$ and $z$ as conditional variables. Such an image generation task follows a two-step process: the first step randomly samples the latent variable $z$ from the prior distribution $p(z)$, while the second step generates the image $x$ according to the given conditional variables. As a result, the variational lower bound of the conditional VAE can be written as

$$\mathcal{L}(\theta, \phi; x, y) = \mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(x|y,z)\right] - D_{KL}\left(q_\phi(z|x,y) \,\|\, p(z)\right), \quad (2)$$

where $q_\phi(z|x,y)$ is the approximation of the true posterior from the encoder.
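A minimal sketch of how the lower bound in Eq. (2) is typically optimized in practice is given below, assuming a simple MLP encoder/decoder over flattened images x and attribute vectors y; the layer sizes are our own illustrative choices, not a configuration from the surveyed papers.

```python
# Sketch of a conditional VAE and the (negative) lower bound of Eq. (2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, x_dim, y_dim, z_dim=64, h_dim=256):
        super().__init__()
        self.enc = nn.Linear(x_dim + y_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + y_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x, y):
        h = F.relu(self.enc(torch.cat([x, y], dim=1)))            # encoder conditioned on x and y
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        x_rec = self.dec(torch.cat([z, y], dim=1))                # decoder conditioned on z and y
        return x_rec, mu, logvar

def cvae_loss(x, x_rec, mu, logvar):
    # Negative of the lower bound in Eq. (2): reconstruction term plus KL term.
    recon = F.binary_cross_entropy(x_rec, x, reduction='sum')     # -E[log p(x|y,z)], x assumed in [0, 1]
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # D_KL(q(z|x,y) || N(0, I))
    return recon + kl
```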

2.3.2 Generative adversarial network

A generative adversarial network (GAN) contains two parts: the generator $G$ and the discriminator $D$, where $G$ tries to synthesize data from a random vector $z$ obeying a prior noise distribution $p_z(z)$, and $D$ attempts to discriminate whether the data comes from the realistic data distribution $p_{data}(x)$ or from $G$. Given data $x$, the generator $G$ and discriminator $D$ are trained in an adversarial manner by playing a min-max game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]. \quad (3)$$

Similarly, the conditional version of GAN is more frequently used by feeding the attribute vector $y$ into both $G$ and $D$ as an additional input layer. Specifically, the attribute vector $y$ is concatenated with the prior input noise $z$ in the generator, while $y$ is taken as an input along with $x$ into the discriminative function. As a consequence, the min-max game of the conditional GAN can be denoted as

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x|y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z|y)|y)\right)\right]. \quad (4)$$
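For concreteness, a minimal training step for the conditional min-max game in Eq. (4) is sketched below. The MLP generator and discriminator, the dimensions, and the optimizer settings are illustrative assumptions; real FAM models are convolutional.

```python
# Sketch: one training step of a conditional GAN, concatenating the attribute vector with the inputs.
import torch
import torch.nn as nn

z_dim, y_dim, x_dim = 100, 40, 64 * 64 * 3       # noise size, attribute vector size, flattened image size

G = nn.Sequential(nn.Linear(z_dim + y_dim, 512), nn.ReLU(),
                  nn.Linear(512, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim + y_dim, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real, y):
    batch = x_real.size(0)
    z = torch.randn(batch, z_dim)
    x_fake = G(torch.cat([z, y], dim=1))          # generator conditioned on the attribute vector y

    # Discriminator step: real images conditioned on y are labeled 1, generated ones 0.
    d_loss = bce(D(torch.cat([x_real, y], dim=1)), torch.ones(batch, 1)) + \
             bce(D(torch.cat([x_fake.detach(), y], dim=1)), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator on the conditioned fake images.
    g_loss = bce(D(torch.cat([x_fake, y], dim=1)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```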

3 Facial Attribute Analysis Datasets and Metrics

3.1 Facial Attribute Analysis Datasets

In the following, we provide an overview of publicly available facial attribute datasets, including the collection conditions and the numbers of subjects and image or video samples.

FaceTracer Dataset is an extensive collection of real-world face images collected from the internet. There are 15,000 faces with fiducial key points marked, as well as 10 groups of attributes with 5,000 labels in total, where 7 groups of facial attributes are made up of 19 attribute values, while the remaining 3 groups describe the qualities of the images and the environment. The dataset provides the URL of each image out of consideration for privacy and copyright issues. Besides, FaceTracer takes 80% of the labeled data for training and the remaining 20% for testing, with 5-fold cross-validation.

The Labeled Faces in the Wild (LFW) dataset consists of 13,233 images of cropped, centered frontal faces derived from the work of T. Berg et al., Names and Faces in the News miller2007names . It is collected from 5,749 people using online news sources, and 1,680 people have two or more images. Kumar et al. kumar2009attribute first collect 65 attribute labels, denoted as LFW-65, through Amazon Mechanical Turk (AMT) AmazonMechanicalTurk , and then expand them to 73 attributes kumar2011visual_attributes , denoted as LFW-73 in Table 2. Liu et al. liu2015deep annotate LFW images with 40 face attributes and 5 fiducial key points through a professional labeling company, resulting in LFWA, which is one of the most commonly used datasets in facial attribute analysis. LFWA is partitioned into one half for training and the rest for testing. Specifically, there are 6,263 images for training and the remainder for testing.

PubFig dataset is a large, real-world face dataset comprising 58,797 images of 200 people collected from the internet under uncontrolled conditions, which results in considerable variation in pose, lighting, expression, and scene. It labels the same 73 facial attributes as LFW-73 but includes fewer individuals. As for the protocol, PubFig is divided into a development set containing 60 identities and an evaluation set containing the remaining 140 identities.

Celeb-Faces Attributes Dataset (CelebA) is constructed by labeling images selected from Celeb-Faces sun2014deep ; it is a large-scale face attribute dataset covering large pose variations and background clutter. It contains 10,177 identities and 202,599 face images with 5 landmark locations and 40 binary attribute annotations per image. In the experiments, CelebA is partitioned into three parts: the images of the first 8,000 identities (160,000 images) for training, the images of another 1,000 identities (20,000 images) for validation, and the rest for testing.
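For readers who want to experiment with CelebA directly, torchvision exposes it as a ready-made dataset. The sketch below is one possible loading recipe; the root path, crop, and image size are placeholder choices, and the built-in split follows the official partition file rather than the identity-based split described above.

```python
# Sketch: loading CelebA images with their 40 binary attribute labels via torchvision.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.CenterCrop(178),
                                transforms.Resize(128),
                                transforms.ToTensor()])
train_set = datasets.CelebA(root='data/', split='train',
                            target_type='attr', transform=transform, download=True)
loader = DataLoader(train_set, batch_size=64, shuffle=True)

images, attrs = next(iter(loader))     # attrs: (64, 40) tensor of 0/1 attribute labels
```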

Berkeley Human Attributes Dataset is collected from the H3D Bourdev2009Poselets dataset and the PASCAL VOC 2010 wang2016walk training and validation datasets, containing 8,053 images, each centered on the full body of a person. There is wide variation in pose, viewpoint, and occlusion, so that many existing attribute methods that work on frontal faces do not perform well on this dataset. Amazon Mechanical Turk (AMT) is also used to provide labels for all 9 attributes on all annotations by 5 independent annotators. The dataset is partitioned into 2,003 images for training, 2,010 for validation, and 4,022 for testing.

Attribute 25K dataset is collected from Facebook and contains 24,963 people split into 8,737 training, 8,737 validation, and 7,489 test examples. Since the images exhibit large variation in viewpoint, pose, and occlusion, not every attribute can be inferred from every image. For instance, the 'wearing hat' category cannot be labeled when the head of the person is not visible.

Ego-Humans Dataset extracts images from videos that track casual walkers in New York City over two months, using the OpenCV frontal face detector and facial landmark tracking. Compared with all the datasets mentioned above, the Ego-Humans dataset covers location and weather information by clustering GPS coordinates. Moreover, nearly five million face pairs along with their same/not-same labels are extracted under the constraints of temporal information and geo-location. Wang et al. wang2016walk manually annotate 2,714 images, randomly selected from these five million images, with 17 facial attributes. As for the testing protocol, 80% of the images are selected randomly for training and the rest for testing.

University of Maryland Attribute Evaluation Dataset (UMD-AED) is collected via image search, using 40 attributes as search terms and HyperFace as the face detector ranjan2017hyperface . UMD-AED is constructed for class-imbalanced learning in FAE and is made up of 2,800 face images, each labeled with a subset of the 40 attributes from CelebA and LFWA. Each attribute has 50 positive and 50 negative samples, which means not every attribute is labeled in each image. Such a collection and labeling scheme makes UMD-AED more representative of real-world data than CelebA and LFWA, while remaining high-quality. Since the dataset was only recently proposed and is used only for a less biased evaluation, there is no relatively mature test protocol.

All the labels in the LFW dataset, which has the maximum number of attributes, are listed in Table 2. Different facial attribute datasets take different subsets of these attribute labels for FAE and FAM. Note that in Table 2, 'Common' denotes the attributes shared by all variant versions of LFW, 34 categories in total. LFWA and CelebA have become the most commonly used FAE datasets by adding 6 attributes to the 'Common' set; these six attributes, together with the underlined 'flushed face' and 'brown eyes', are added to LFW-65 to constitute LFW-73.

3.2 Facial Attribute Analysis Metrics

3.2.1 Facial Attribute Estimation Metrics

In this section, we list several frequently used metrics of FAE as follows.

  • The Accuracy and Error Rate

The classification accuracy and error rate are the most commonly used measures for evaluating classification performance. Formally, the accuracy can be defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{P + N}, \quad (5)$$

where $P$ and $N$ denote the numbers of positive and negative samples, and $TP$ and $TN$ denote the numbers of true positives and true negatives huang2016learning . Meanwhile, the error rate can be defined as

$$\mathrm{Error\ Rate} = 1 - \mathrm{Accuracy} = \frac{FP + FN}{P + N}, \quad (6)$$

where $FP$ and $FN$ denote the numbers of false positives and false negatives.
  • The Balanced Accuracy and Error Rate

When dealing with class-imbalanced data, the traditional classification accuracy is not suitable due to the bias towards the majority class. Hence, a balanced classification accuracy is defined as

$$\mathrm{Accuracy}^{bal} = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right). \quad (7)$$

Similarly, the balanced error rate can be defined as $\mathrm{Error\ Rate}^{bal} = 1 - \mathrm{Accuracy}^{bal}$. When dealing with the domain adaptation issue rudd2016moon , the balanced error rate is defined as

$$\mathrm{Error\ Rate}^{(t)} = D^{(t)}_{+} \cdot \frac{FN}{P} + D^{(t)}_{-} \cdot \frac{FP}{N}, \quad (8)$$

where $D^{(t)}_{+}$ and $D^{(t)}_{-}$ denote the target domain distributions of positive and negative examples, respectively. The superscript $(t)$ is used to denote the balanced error rate in domain adaptation. (A short computational sketch of the accuracy and balanced accuracy measures is given after this list.)

  • Mean Average Precision

As there is more than one label in multi-label image classification, the mean Average Precision (mAP) becomes a popular metric, which computes the average of the precision-recall curve from recall 0 to recall 1 mAP . Meanwhile, mAP is the mean of the average precision scores over a set of queries. The specific definitions and formulations can be found in many works yue2007support ; philbin2007object , so we do not provide a more detailed description.
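As referenced in the balanced accuracy item above, the following is a small NumPy sketch of the per-attribute accuracy and balanced accuracy measures of Eqs. (5) and (7). Predictions and targets are assumed to be binary arrays, and every attribute is assumed to have at least one positive and one negative sample.

```python
# Sketch: per-attribute accuracy and balanced accuracy for multi-attribute predictions.
import numpy as np

def attribute_metrics(pred, target):
    # pred, target: binary arrays of shape (num_samples, num_attributes).
    tp = ((pred == 1) & (target == 1)).sum(axis=0)
    tn = ((pred == 0) & (target == 0)).sum(axis=0)
    p = (target == 1).sum(axis=0)                  # positives per attribute (assumed > 0)
    n = (target == 0).sum(axis=0)                  # negatives per attribute (assumed > 0)
    accuracy = (tp + tn) / (p + n)                 # Eq. (5)
    balanced_accuracy = 0.5 * (tp / p + tn / n)    # Eq. (7): average of the per-class recalls
    return accuracy, balanced_accuracy
```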

3.2.2 Facial Attribute Manipulation Metrics

There are two types of metrics in FAM: qualitative and quantitative measurements, where qualitative metrics evaluate the performance of generated images through statistical surveys, and quantitative ones give the evaluation according to the degree to which facial information is preserved after manipulation, such as identity preservation, facial landmark detection gain, and attribute prediction. We provide more detailed descriptions of the two categories of FAM metrics below.

  • Qualitative Metrics

Statistical survey is the most intuitive way to qualitatively evaluate the quality of generated images in most generative model construction tasks. By setting specific rules in advance, subjects vote for the generated images with appealing visual fidelity, and researchers draw conclusions according to the statistical analysis of the votes. For example, Choi et al. choi2017stargan evaluate the performance of generated images in a survey format using Amazon Mechanical Turk (AMT) AmazonMechanicalTurk . Given an input image, the Turkers are instructed to select the best generated image based on perceptual realism, quality of the attribute manipulation, and preservation of the original identity. Each Turker is asked a set number of questions, along with a few logical yet straightforward questions for validating human effort. Zhang et al. zhang2017age conduct a statistical survey comparing with prior works. Specifically, volunteers are required to choose, via voting, the better result between the proposed CAAE and prior works, or to indicate that it is hard to tell. Sun et al. sun2018mask instruct volunteers to rank facial attribute manipulation approaches based on perceptual realism, quality of the transferred attribute, and preservation of personal features, and then calculate the average rank (between 1 and 7) of each approach. Lample et al. lample2017fader perform an evaluation on Mechanical Turk covering two different aspects of the generated images: naturalness, measuring the quality of generated images, and accuracy, measuring the degree to which the swapped attribute is reflected in the generation.

  • Quantitative Metrics

Measuring the Distance Between Distributions helps to quantify the differences between real images and generated face images. Xiao et al. xiao2018elegant achieve this goal with the Fréchet Inception Distance heusel2017gans (FID), using the means and covariance matrices of the two distributions before and after editing facial attributes. Wang et al. wang2018weakly compute the Peak Signal-to-Noise Ratio (PSNR) to measure pixel differences, the Structural SIMilarity index (SSIM) and its multi-scale version MS-SSIM wang2004assess to estimate structural distortion, as well as the identity distance to evaluate the high-level similarity of two face images. In light of this, face identity preservation becomes a popular evaluation for measuring the ability to preserve details other than the edited attribute. He et al. he2017arbitrary use an Inception-ResNet szegedy2017inception to train a face recognizer for evaluating identity preservation ability with rank-1 recognition accuracy.
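Pixel- and structure-level comparisons of this kind can be computed with common toolkits. A possible sketch using scikit-image is shown below; the images are assumed to be same-sized uint8 RGB arrays, and older scikit-image versions use multichannel=True instead of channel_axis.

```python
# Sketch: PSNR and SSIM between an original face image and its manipulated version.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fam_similarity(original, manipulated):
    # original, manipulated: same-sized uint8 RGB arrays (H, W, 3).
    psnr = peak_signal_noise_ratio(original, manipulated)                 # pixel-level differences
    ssim = structural_similarity(original, manipulated, channel_axis=2)   # structural distortion
    return psnr, ssim
```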

Facial Landmark Detection Gain uses the accuracy gain of landmark detection before and after attribute editing to evaluate the quality of synthesized images. For example, He et al. he2016dual adopt a landmark detection algorithm, the ERT method kazemi2014one trained on the 300-W dataset sagonas2013semi , to achieve this goal. During the testing, researchers partition the test sets into three components: the first one containing images with the positive attribute labels, the second containing images with the negative labels, and the last one containing the manipulated images from the first part. Then, the average normalized distance error is computed to evaluate the discrepancy of landmarks between generated images and the ground truths.

Facial Attribute Estimation Auxiliary constructs attribute prediction networks to measure the performance of FAM according to the classification accuracy. Perarnau et al. perarnau2016invertible first design an Anet that predicts attributes on the manipulated facial images. The closer the output attribute labels of the Anet are to the original attribute labels, the better the generator can be considered to perform. This means that almost all facial attribute estimation metrics can be used in this case. Besides, considering that visual facial attributes produced by good generative models should be correctly recognized by regression models, Larsen et al. larsen2016autoencoding calculate the attribute similarity between the conditional attributes and the generated attributes employing an attribute prediction network. Specifically, face images are generated by retrieval from chosen attribute configurations and fed into a separately trained regressor network for predicting facial attributes. During testing, faces are sampled given the test set attributes and propagated through the attribute prediction network. As a consequence, attribute similarity scores, together with the cosine similarity and mean squared error, are computed over the test set, where the cosine similarity is defined as the best out of ten samples per attribute vector.

FaceTracer kumar2008facetracer. Source: Internet; Identities: 15,000; Samples: 15,000; Attributes: 10; Protocol: 80% train / 20% test, 5-fold cross-validation; Access: www.cs.columbia.edu/CAVE/databases/facetracer/

LFW huang2008labeled. Source: Names and Faces miller2007names; Identities: 5,749; Samples: 13,233; Attributes: 65/73; Protocol: 50% train (6,263) / 50% test (6,970); Access: http://vis-www.cs.umass.edu/lfw/

LFWA liu2015deep. Source: LFW; Identities: 5,749; Samples: 13,233; Attributes: 40; Protocol: 50% train (6,263) / 50% test (6,970); Access: http://vis-www.cs.umass.edu/lfw/

PubFig kumar2009attribute. Source: Internet; Identities: 200; Samples: 58,797; Attributes: 73; Protocol: development set of 60 identities / evaluation set of 140 identities; Access: http://www.cs.columbia.edu/CAVE/databases/pubfig/download/

CelebA liu2015deep. Source: Celeb-Faces; Identities: 10,177; Samples: 202,599; Attributes: 40; Protocol: 8,000 identities (160,000 images) for training / 1,000 identities (20,000 images) for testing; Access: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

The Berkeley Human Attributes Dataset bourdev2011describing. Source: H3D Bourdev2009Poselets and PASCAL VOC 2010 wang2016walk; Identities: -; Samples: 8,053; Attributes: 9; Protocol: 2,003 images train / 2,010 validation / 4,022 test; Access: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/poselets/

Attributes 25K Dataset zhang2014panda. Source: Facebook; Identities: 24,963; Samples: 24,963; Attributes: 8; Protocol: 8,737 identities train / 8,737 validation / 7,489 test; Access: -

Ego-Humans Dataset wang2016walk. Source: videos; Identities: -; Samples: 2,714; Attributes: 17; Protocol: 80% train / 20% test; Access: -

University of Maryland Attribute Evaluation Dataset (UMD-AED) hand2018doing. Source: image search; Identities: -; Samples: 2,800; Attributes: 40; Protocol: all used for testing; Access: https://www.cs.umd.edu/~emhand/research.html

Table 1: An overview of facial attribute datasets.
LFW Common (shared by all LFW variants, 34 attributes): Arched eyebrows, Attractive, Bags under eyes, Bald, Bangs, Big nose, Black hair, Blond hair, Blurry, Brown hair, Bushy eyebrows, Chubby, Double chin, Eyeglasses, Goatee, Gray hair, High cheekbones, Male, Mouth slightly open, Mustache, Narrow eyes, No beard, Oval face, Pale skin, Pointy nose, Receding hairline, Rosy cheeks, Sideburns, Smiling, Straight hair, Wavy hair, Wearing hat, Wearing lipstick, Young

LFW-73 and LFWA/CelebA (Common plus the following 6 attributes): Big lips, Heavy makeup, Wearing earrings, Wearing necklace, Wearing necktie, 5 o'clock shadow

LFW-65 (Common plus the following attributes): Asian, Baby, Black, Child, Color photo, Curly hair, Environment, Eyes open, Flash, Frowning, Fully visible forehead, Harsh lighting, Indian, Middle aged, Mouth wide open, Mouth closed, No eyewear, Obstructed forehead, Posed photo, Round jaw, Round face, Semi obscured forehead, Senior, Shiny skin, Soft lighting, Square face, Strong nose mouth lines, Sunglasses, Teeth not visible, Teeth visible, White (LFW-73 additionally includes the underlined Flushed face and Brown eyes)

Table 2: An overview of facial attributes

4 State-of-the-art Facial Attribute Estimation Methods

Generally, state-of-the-art deep FAE methods can be partitioned into two main classes: part-based methods and holistic methods. In this section, we provide an overview of these two types of methods concerning their algorithms and performance, as well as their pros and cons. A summary is provided in Table 3.

4.1 Part-based Deep FAE Methods

Deep part-based FAE methods first locate the areas where facial attributes exist and then classify corresponding attributes on each highlighted position. Moreover, depending on the schemes of attribute locating, deep part-based methods can be further divided into two sub-groups: separate auxiliary localization and end-to-end localization. In this section, we will provide more details about the two types of methods, respectively.

4.1.1 Separate Auxiliary Localization

Since facial attributes describe subtle details of face representations based on human vision, locating the positions of facial attributes enforces the subsequent feature extractors and attribute classifiers to focus more on attribute-relevant regions. The most intuitive and straightforward way is to take existing face part detectors as auxiliaries.

The poselet Bourdev2009Poselets ; bourdev2011describing is a valid part detector that describes a part of the human pose under a given viewpoint. Since these parts include evidence from different parts of the body at different scales, complementary information can be learned to benefit attribute prediction. Typically, given a whole person image, Zhang et al. zhang2014panda first use the poselet detector to decompose images into several image patches, named poselets, under various viewpoints and poses. Then, they design a PANDA network that trains a CNN per poselet and concatenates the features from all these poselets, as well as the whole image, to obtain the final part-based deep representations. Finally, PANDA branches out into multiple binary classifiers, each of which recognizes an attribute through binary classification. Based on PANDA, Gkioxari et al. gkioxari2015wholeandparts introduce a deep version of poselets and build a feature map pyramid in which each level computes a score for the corresponding attribute prediction.

However, the poselet detector discovers coarse body parts and cannot explore the local details of face images, which deserve more attention in attribute prediction. Considering that the probability of an attribute appearing in a face image is not uniform over the spatial domain, Kalayeh et al. kalayeh2017improving propose to employ semantic segmentation as a separate auxiliary localization scheme to build a prediction model. They exploit the localization cues obtained by semantic segmentation to guide the attention of attribute prediction to the naturally occurring areas of different attributes. In terms of model construction, a semantic segmentation network is first designed in the form of an encoder-decoder and trained on the Helen face dataset le2012interactive , taking semantic face parsing smith2013exemplar as an additional task to learn detailed pixel-level localization information. Once the localization cues from the semantic segmentation network are discovered, semantic segmentation-based pooling (SSP) and gating (SSG) mechanisms are presented to integrate the location information into attribute estimation. SSP decomposes the activations of the last convolution layer into distinct semantic regions and then aggregates the activations that reside in the same region. Meanwhile, SSG gates the output activations between the convolution layers and the batch normalization (BN) operations to control the activations of neurons from different semantic regions.
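The pooling idea can be illustrated with a simple masked-average operation: each semantic region map restricts average pooling of the convolutional activations to its own region. This is a schematic approximation, not the authors' implementation, and the tensor shapes are assumptions.

```python
# Schematic of semantic segmentation-based pooling: average features within each semantic region.
# features: (batch, channels, H, W) activations; masks: (batch, regions, H, W) soft region maps.
import torch

def semantic_pooling(features, masks):
    masks = masks / (masks.sum(dim=(2, 3), keepdim=True) + 1e-6)   # normalize each region map
    # Weighted average of the feature map under every region, giving one vector per region.
    pooled = torch.einsum('bchw,brhw->brc', features, masks)
    return pooled                                                   # (batch, regions, channels)
```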

Rather than taking semantic segmentation as the auxiliary locating task, Mahbub et al. mahbub2018segment consider a more straightforward way by segmenting faces into several image patches directly according to the key point marks. Then, these segments are fed into a set of facial segment networks to extract corresponding feature representations and learn prediction scores, along with the whole face image flowing into a full-face network. A global predictor network fuses the features from these segments, and two committee machines merge their scores for the final prediction.

In contrast to the above two methods that look for location clues of attributes directly, He et al. he2018harnessing harness synthesized facial abstraction images, which contain local facial parts and texture information, to achieve the same goal indirectly. A generative adversarial network is used to generate such facial abstraction images before feeding them into a dual-path facial attribute recognition network, along with the real original images. The dual-path network propagates the feature maps from the abstraction sub-network to the real original image sub-network and concatenates the two types of features for the final prediction. Despite the rich part locations and texture information of the generated facial abstraction images, the quality of these images may be a significant performance hit, as some subtle attribute information may be lost in the process of image abstraction.

Note that all the separate auxiliary localization FAE methods share a common drawback: they rely too much on accurate facial landmark localization, face detection, facial semantic segmentation and parsing, as well as facial partition according to specific criteria in the training process. That means that once these localization strategies are imprecise or landmark annotations are unavailable, the performance of FAE would be harmed.

4.1.2 End-to-end Localization Methods

Compared with the separate auxiliary localization methods that locate attribute regions and make the attribute prediction separately and independently, end-to-end localization methods jointly exploit the localizations where facial attributes exist and predict their presences in a unified framework.

Liu et al. liu2015deep first propose a cascaded deep learning framework for joint face localization and attribute prediction. Specifically, the cascaded CNN is made up of an LNet and an ANet, where the LNet locates the entire face region and the ANet extracts high-level face representations from the located area. As for the details of the LNet, it is first pre-trained by classifying massive general object categories to ensure good generalization capability, and then fine-tuned using the image-level attribute tags of training images to learn useful features for face localization in a weakly supervised manner. Note that the critical difference between the LNet and the separate auxiliary localization methods is that the LNet no longer needs face bounding boxes or landmark annotations. As for the ANet, it is first pre-trained by classifying massive face identities to cope with complex variations in unconstrained face images, and then fine-tuned by attribute estimation to extract discriminative face representations. Furthermore, rather than extracting features patch by patch, the ANet evaluates images with a fast feed-forward scheme, in which a one-pass feed-forward operation with locally shared filters and an interweaved operation are leveraged to learn the discriminative feature representations. Finally, SVMs are trained over these features to estimate attribute values per attribute, while the final prediction is made by averaging all values to deal with small misalignments of the face localization. The cascaded LNet and ANet framework shows the benefit of pre-training with massive object categories and massive identities in improving feature learning for face localization and attribute recognition. With such carefully designed pre-training schemes and a cascaded architecture for joint face localization and attribute prediction, the method shows outstanding robustness to background and face variations.

In light of this, Ding et al. ding2017deepCascadeFAE propose a cascade network to jointly learn to locate facial attribute-relevant regions and perform attribute classification. Specifically, they first design a face region localization network (FRL) that builds a branch for each attribute to automatically detect a corresponding relevant region. Then, the following parts and whole (PaW) attribute classification network selectively leverages information from all the attribute-relevant regions for the final estimation. Moreover, during attribute classification, Ding et al. define two fully connected layers: the region switch layer (RSL) and the attribute relation layer (ARL). The former selects the relevant sub-network for predicting attributes, while the latter models attribute relationships. In a word, the cascaded FRL and PaW model not only discovers semantic attribute regions but also explores rich relationships among facial attributes. Besides, since the model automatically detects face regions, alignment of the face image is not required, so it can achieve outstanding performance on unaligned datasets.

Note that the FRL-PaW method learns a location for each attribute, which makes the training process redundant and time-consuming, as several attributes exist in the same area. However, from the perspective of attribute locating, there is, to the best of our knowledge, no specific solution for tackling this issue so far. Current FAE methods are more inclined to take whole face images as inputs without paying attention to where attributes are, and to design networks that model correlations among attributes. This is precisely what holistic deep FAE methods do, and we will provide a detailed introduction to them in the following section. Before that, we summarize the part-based deep FAE methods as follows.

First, part-based deep FAE methods locate the positions where facial attributes appear. Two strategies can be adopted: separate auxiliary localization and end-to-end localization, where the former leverages existing part detectors or auxiliary localization-related algorithms, while the latter jointly exploits the locations in which facial attributes exist and predicts their presence. Compared with the separate auxiliary localization methods, which operate separately and independently, end-to-end localization methods locate and predict in a unified framework. Then, once the localization clues are obtained, features corresponding to the areas related to specific attributes can be extracted and further fed into attribute classifiers for estimation.

4.2 Holistic Deep FAE Methods

In contrast to part-based approaches detecting and utilizing facial components, holistic FAE methods extract features from entire face images without the need for facial parts, and explore the correlations among facial attributes by sharing common features and separating attribute-specific features to exploit complementary information for the final prediction.

Nevertheless, such an intuitive and straightforward network design scheme brings new challenges:

(1) How to properly assign shared information and task-specific information?

(2) How to explore correlations among attributes for learning richer features?

Existing methods have made many efforts to solve these two problems, and we provide a brief review of these methods below.

To the best of our knowledge, the earliest method using the holistic multi-task framework for FAE is MOON rudd2016moon , a mixed objective optimization network, which learns multiple attribute labels simultaneously via a single DCNN. MOON treats the FAE task as a regression problem for the first time and adopts a 16-layer VGG network as the primary network configuration, in which abstract high-level features are shared before the last FC layer, where the prediction scores of multiple attributes are calculated with the MSE loss to reduce the regression error. Similarly, Zhong et al. zhong2016leveraging replace the high-level CNN features in MOON with mid-level ones to identify the best representations for each attribute.

Compared with splitting networks at the last FC layer, Hand et al. hand2017attributes present a multi-task deep CNN (MCNN), which branches out into multiple groups at the mid-level convolutional layers, to model the correlations among facial attributes. Specifically, based on the assumption that many attributes are strongly correlated, MCNN divides all 40 attributes into 9 groups according to semantics, i.e., gender, nose, mouth, eyes, face, around head, facial hair, cheeks, and fat. For example, 'big nose' and 'pointy nose' are grouped into the 'Nose' category, while 'big lips', 'lipstick', 'mouth slightly open', and 'smiling' are clustered into the 'Mouth' category. Therefore, each group consists of similar attributes and learns high-level features independently. In the first two convolutional layers of MCNN, features are shared by all attributes. After this, each layer is divided into groups corresponding to the attribute groups, which means each attribute group has its own branch in the MCNN. At the end of the MCNN, an FC layer is added to create a two-layer auxiliary network (AUX) to model attribute relationships. The inputs to AUX are the attribute scores from the trained MCNN, while the outputs are the final attribute scores. Therefore, MCNN-AUX models the attribute relationships in three ways: first, by sharing the lowest layers for all attributes; second, by sharing the higher layers for spatially related attributes; and finally, by feeding the attribute scores from the MCNN into the AUX network to discover score-level relationships.
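The grouped-branch idea can be sketched as a shared trunk followed by one branch per attribute group. The layout below is purely illustrative: layer sizes and the group-to-attribute assignment are toy values, not the exact MCNN configuration.

```python
# Illustrative multi-branch layout in the spirit of MCNN: shared low-level layers,
# then one branch per semantically grouped set of attributes. Group sizes are toy values.
import torch
import torch.nn as nn

class GroupedAttributeNet(nn.Module):
    def __init__(self, group_sizes=(2, 4, 6, 28)):        # e.g. nose, mouth, eyes, remaining attributes
        super().__init__()
        self.shared = nn.Sequential(                       # layers shared by all attributes
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.branches = nn.ModuleList([                    # one branch per attribute group
            nn.Sequential(nn.Linear(64 * 16, 128), nn.ReLU(), nn.Linear(128, g))
            for g in group_sizes])

    def forward(self, x):
        shared = self.shared(x)
        return torch.cat([branch(shared) for branch in self.branches], dim=1)  # (batch, 40) logits
```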

However, there exists a limitation in MCNN in that shared information at low-level layers may vanish after the network splits. Therefore, shared features and attribute-specific features should be learned jointly at the same layer rather than sequentially.

Cao et al. cao2018partially design a partially shared structure based on MCNN, i.e., PS-MCNN. It divides all 40 attributes into 4 groups manually according to attribute positions, i.e., the upper group, middle group, lower group, and whole image group. Note that such a manual grouping strategy can be regarded as prior information based on human knowledge. The partially shared structure connects four attribute-specific networks (TSNets), corresponding to the four different groups of attributes, and one shared network (SNet) sharing features among all the attributes. Specifically, each TSNet learns features for a specific group of attributes, while the SNet shares informative features with each task. As for the connection mode between the two types of sub-networks, each layer of the SNet receives additional inputs from the previous layers of the TSNets. Then, features from the SNet are fed into the next layers of the shared and attribute-specific networks. In a word, PS-MCNN learns features at a certain layer based on both task-specific features and shared features. Besides, the shared features at a specific layer are closely related to the features of all previous layers, leading to informative shared representations.

Apart from attribute correlations, Han et al. han2017heterogeneous introduce the concept of attribute heterogeneity. They point out that individual attributes can be heterogeneous with respect to data type and scale, and to semantic meaning. In terms of data type and scale, attributes can be grouped into ordinal vs. nominal attributes. For instance, attributes like age and hair length are ordinal, while attributes like gender and race are nominal. Note that the difference between ordinal and nominal attributes is that ordinal attributes have an explicit ordering of their values, while nominal attributes usually have two or more classes with no intrinsic ordering among the categories. Similarly, in terms of semantic meaning, attributes such as 'age', 'gender', and 'race' describe the characteristics of the whole face, while 'pointy nose' and 'big lips' mainly describe the local characteristics of facial components. Therefore, these two categories of attributes are heterogeneous and can be grouped into holistic vs. local attributes for the prediction of different parts of a face image. Taking both attribute correlation and heterogeneity into consideration, Han et al. design a deep multi-task learning (DMTL) CNN that learns shared features for all attributes and category-specific features for heterogeneous attribute categories. The shared feature learning naturally exploits the relationships between tasks to achieve robust and discriminative feature representations, while the category-specific feature learning aims at fine-tuning the shared features towards the optimal estimation of each heterogeneous attribute category. Note that existing multi-task learning methods make no distinction between low-level and mid-level features for different attributes, which is suboptimal, as features at different levels of networks may have different relationships.

However, the above methods share features across tasks and split layers that encode attribute-specific features using hand-designed network architectures. Such manual exploration of the space of possible multi-task deep architectures is tedious and error-prone, as the space may be combinatorially large. Therefore, rather than constructing networks manually, Lu et al. lu2017fully present a method that automatically designs compact multi-task deep learning architectures, with no need to discover possible multi-task architectures by hand. The proposed network learns shared features in a fully adaptive way. Specifically, it starts with a thin multi-layer network (derived from VGG16) and dynamically widens it in a greedy manner during training, so that both task correlations and model complexity are taken into account, enabling task grouping decisions at each layer of the network. The initialization of the thin network is based on simultaneous orthogonal matching pursuit (SOMP) tropp2006SMOP, resulting in faster convergence and higher accuracy. The core of the fully adaptive feature sharing algorithm is to incrementally widen the current design in a layer-wise fashion, so a top-down, layer-wise model widening strategy is adopted. During training, the network decides with which tasks each task shares features at each layer, which leads to several branches at that layer; thus, the whole process proceeds in a multi-round branching manner. Finally, the number of branches at the last layer of the model equals the number of attribute categories to be predicted. This method estimates multiple facial attributes in a dynamic branching procedure through its self-constructed architecture and fully adaptive feature sharing strategy.

Table 3 summarizes the state-of-the-art deep facial attribute estimation approaches. Each entry lists the approach with its venue, the algorithm, the network architecture, the datasets, and the metrics with reported performance.

Part-based Methods:
  • PANDA zhang2014panda (CVPR2014). Algorithm: using part-based pose-aligned networks for learning pose-related features and linear SVM classifiers for attribute estimation. Architecture: PANDA. Datasets: Berkeley Human Attributes Dataset, Attributes 25K Dataset, LFW-gender. Performance (mean Average Precision, mAP): Berkeley Human Attributes Dataset (78.98%), Attributes 25K Dataset (70.74%), LFW-gender (99.54%).
  • Gkioxari et al. gkioxari2015wholeandparts (ICCV2015). Algorithm: using a deep version of poselets and capturing parts of the human body for action and attribute classification. Architecture: a 5-layer CNN feature pyramid and a pyramid of part scores. Dataset: Berkeley Human Attributes Dataset. Performance: mAP (89.5%).
  • LNet+ANet liu2015deep (ICCV2015). Algorithm: cascading the LNet CNN for localization and ANet for feature extraction. Architecture: LNet+ANet. Datasets: CelebA, LFWA. Performance (accuracy): CelebA (87%), LFWA (84%).
  • Off-the-shelf CNN zhong2016face (ICB2016). Algorithm: training off-the-shelf architectures for face recognition to construct facial representations. Architecture: off-the-shelf CNN. Datasets: CelebA, LFWA. Performance (accuracy): CelebA (86.6%), LFWA (84.7%).
  • Singh et al. singh2016end (ECCV2016). Algorithm: using a Spatial Transformer Network (STN) and a Ranker Network (RN) to jointly learn features, localization, and ranking of attributes. Architecture: STN and RN. Dataset: LFW-10attr. Performance: attribute ranking accuracy (86.91%).
  • SSP+SSG kalayeh2017improving (CVPR2017). Algorithm: using semantic segmentation to guide the attention of attribute prediction to the regions where different attributes naturally show up. Architecture: Semantic Segmentation-based Pooling (SSP) and Semantic Segmentation-based Gating (SSG). Dataset: CelebA. Performance: error rate (8.20%), average precision (81.45%), balanced accuracy (88.24%).
  • FRL-PaW ding2017deepCascadeFAE (AAAI2018). Algorithm: simultaneously learning to localize face regions specific to attributes and performing attribute classification without alignment in a cascade network. Architecture: Facial Region Localization (FRL) network and Parts and Whole (PaW) classification network. Dataset: unaligned CelebA. Performance: accuracy (91.23%).
  • SPLITFACE mahbub2018segment (arXiv2018). Algorithm: using facial segmentation for attribute detection in partially occluded faces. Architecture: Segmentwise, Partial, Localized Inference in Training Facial Attribute Classification Ensembles (SPLITFACE) network. Dataset: CelebA. Performance: accuracy (90.61%).
  • FMTNet zhuang2018DeepTranfer (PR2018). Algorithm: constructing three sub-networks for attribute transfer learning. Architecture: the Face detection Network (FNet), the Multi-label learning Network (MNet), and the Transfer learning Network (TNet). Datasets: CelebA, LFWA. Performance (accuracy): CelebA (91.66%), LFWA (84.34%).
  • He et al. he2018harnessing (IJCAI2018). Algorithm: generating abstraction images with a GAN as complementary features used for facial part localization. Architecture: GAN and a dual-path facial attribute recognition network. Datasets: CelebA, LFWA. Performance (accuracy): CelebA (91.81%), LFWA (85.2%).
  • AFFAIR li2018landmark (TIP2018). Algorithm: learning a hierarchy of spatial transformations for facial attribute prediction without landmarks. Architecture: lAndmark Free Face AttrIbute pRediction (AFFAIR) network. Datasets: CelebA, LFWA, MTFL. Performance (mean AP / accuracy): CelebA (79.63%/91.45%), LFWA (83.01%/86.13%), MTFL (-/86.55%).

Holistic Methods:
  • Wang et al. wang2016walk (CVPR2016). Algorithm: employing a Siamese structure and embedding location as well as weather contextual information for learning feature representations. Architecture: Siamese network. Datasets: CelebA, LFWA, Ego-Humans Dataset. Performance (accuracy): CelebA (88%), LFWA (87%), Ego-Humans Dataset (87%).
  • MOON rudd2016moon (ICCV2016). Algorithm: treating attribute classification as a regression task and solving a domain adaptation problem. Architecture: Mixed-Objective Optimization Network (MOON, VGG16-based). Dataset: CelebA. Performance: error rate on CelebA (9.06%); balanced error rate on CelebAB (13.67%).
  • LMLE huang2016learning (CVPR2016). Algorithm: using a Large Margin Local Embedding (LMLE) method for large-scale imbalanced classification of binary facial attributes. Architecture: VGG-6 quintuplet CNN. Dataset: CelebA. Performance: balanced accuracy (84.25%).
  • Zhong et al. zhong2016leveraging (ICIP2016). Algorithm: studying the effect of mid-level CNN features for attribute prediction. Architecture: FaceNet NN.1 schroff2015facenet. Datasets: CelebA, LFWA. Performance (accuracy): CelebA (89.8%), LFWA (85.9%).
  • CRL dong2017CRL (ICCV2017). Algorithm: combining batch-wise incremental hard mining for class imbalance with the Class Rectification Loss (CRL) regularizer for attribute classification. Architecture: 5-layer DeepID2 sun2014deep CNN. Dataset: CelebA. Performance: balanced accuracy (86%).
  • AFFACT GuntherRB17Affact (IJCB2017). Algorithm: introducing the Alignment-Free Facial Attribute Classification Technique (AFFACT), a data augmentation technique for attribute classification without alignment. Architecture: AFFACT network (ResNet-based). Dataset: CelebA. Performance: error rate (8.03%).
  • MCNN-AUX hand2017attributes (AAAI2017). Algorithm: considering attribute relationships and constructing a multi-task deep CNN (MCNN) with an auxiliary network (AUX) for performance improvement. Architecture: MCNN-AUX. Datasets: CelebA, LFWA. Performance (accuracy): CelebA (91.22%), LFWA (86.31%).
  • DMTL han2017heterogeneous (TPAMI2017). Algorithm: introducing deep multi-task feature learning (DMTL) for the joint estimation of multiple heterogeneous attributes. Architecture: DMTL (AlexNet-based). Datasets: CelebA, LFWA. Performance (accuracy): CelebA (93%), LFWA (86%).
  • Lu et al. lu2017fully (CVPR2017). Algorithm: automatically designing compact multi-task deep learning architectures, starting with a thin multi-layer network and dynamically widening it in a greedy manner. Architecture: automatic top-down layer-wise widening. Dataset: CelebA. Performance: accuracy (91.02%), top-10 recall (71.38%).
  • AttCNN hand2018doing (AAAI2018). Algorithm: selective learning with domain-adaptive batch re-sampling for multi-label attribute prediction. Architecture: AttCNN network. Datasets: CelebA, LFWA, UMD-AED. Performance (accuracy on the balanced domain): CelebA (85.05%), LFWA (73.03%), UMD-AED (71.11%).
  • R-Codean sethi2018residual (PRLetters2018). Algorithm: incorporating a cosine-similarity-based loss function into the Euclidean distance for constructing an R-Codean autoencoder. Architecture: Residual Codean Autoencoder. Datasets: CelebA, LFWA. Performance (accuracy): CelebA (90.14%), LFWA (84.90%).
  • PS-MCNN cao2018partially (CVPR2018). Algorithm: considering identity information and attribute relationships simultaneously and constructing a Partially Shared Multi-task Convolutional Neural Network (PS-MCNN). Architecture: PS-MCNN. Datasets: CelebA, LFWA. Performance (error rate): CelebA (7.02%), LFWA (12.64%).

Table 3: State-of-the-art deep facial attribute estimation approaches.

5 State-of-the-art Facial Attribute Manipulation Methods

Generally, state-of-the-art deep FAM methods can be grouped into two categories: model-based methods and extra condition-based methods. In this section, we provide an overview of these two types of methods in terms of their algorithms, network architectures, and pros and cons. A summary is provided in Table 4.

5.1 Model-based Methods

Model-based methods map an image from the source domain to the target domain and then distinguish the generated target distribution from the real target distribution under the constraint of an adversarial loss. Such model-based methods are highly task-specific, which results in more photo-realistic facial attribute images.

Typically, Li et al. li2016IDaware first propose the DIAT model, which follows the standard paradigm of model-based methods. DIAT takes unedited images as inputs and generates target facial images that not only possess the target attributes, enforced by an adversarial loss, but also keep the same or a similar identity to the input images, enforced by an identity loss. Zhu et al. zhu2017unpaired add an inverse mapping from the target domain back to the source domain and propose CycleGAN, in which the two mappings are coupled by a cycle consistency loss. This design is based on the intuition that if we translate from one domain to the other and back again, we should arrive where we started. Building on CycleGAN, Liu et al. liu2017unsupervised propose the UNIT model, which maps a pair of corresponding images in the source and target domains to the same latent representation in a shared latent space; each branch from a domain to the latent space implements a CycleGAN.
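A minimal sketch of the cycle consistency idea is shown below, with toy 1x1-convolution "generators" standing in for the two domain mappings and an L1 penalty assumed as the reconstruction measure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

# toy 1x1-conv "generators" standing in for the two domain mappings (illustrative only)
G = nn.Conv2d(3, 3, 1)    # source domain -> target domain
Fm = nn.Conv2d(3, 3, 1)   # target domain -> source domain

def cycle_consistency_loss(x, y):
    """Translate to the other domain and back; an L1 penalty (a common choice)
    enforces that we arrive back where we started."""
    return nnf.l1_loss(Fm(G(x)), x) + nnf.l1_loss(G(Fm(y)), y)

loss = cycle_consistency_loss(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```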

However, all of the above-mentioned methods operate on the whole image directly, which means that when a certain attribute is edited, other related attributes and regions may be changed uncontrollably at the same time.

Therefore, in order to modify the attribute-specific face areas while keeping the other parts unchanged, Shen et al. shen2017learning propose learning residual images, so that facial attributes can be manipulated efficiently with modest pixel modifications over the attribute-specific regions. A residual image is defined as the discrepancy between the images before and after the attribute manipulation. A ResGAN is constructed to learn the residual representations of the desired attributes; it contains two image transformation networks and a discriminative network, where the former are responsible for the attribute manipulation and its dual (inverse) operation, and the latter distinguishes the generated images from real images. The two image transformation networks first take two images with opposite attribute labels as inputs in turn and perform the inverse attribute manipulation operations to output residual images. The obtained residual images are then added to the original input images, yielding the final outputs with manipulated attributes. In the end, all of these images, i.e., the two original input images and the two images produced by the transformation networks, are taken as inputs to the discriminative network, which divides its inputs into three categories: images generated by the two transformation networks, original images with positive attribute labels, and original images with negative attribute labels.

Moreover, drawing on the dual learning applied in machine translation he2016dual, a given image with a negative attribute label flows into the first transformation network to add the desired attribute, and the obtained image is then fed to the second transformation network to remove the synthetic attribute. In this cycle, adding the attribute is the primal task, while removing it is regarded as the dual task. In this way, the image yielded by the dual pass is expected to have the same attribute label as the original input image. The experiments demonstrate that the dual learning process is beneficial for generating high-quality images, and that residual image-based facial attribute manipulation successfully makes the manipulation pay more attention to the local areas where attributes show up, especially for local attributes.
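The following sketch illustrates the residual-image manipulation and its dual pass; the toy single-convolution transformation networks and the L1 consistency term are assumptions made for illustration, not the architecture or loss used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy residual transformation networks (architectures are assumptions)
G_add = nn.Conv2d(3, 3, 3, padding=1)     # outputs a residual that adds the attribute
G_remove = nn.Conv2d(3, 3, 3, padding=1)  # outputs a residual that removes it

def manipulate(x, net):
    """Residual-image manipulation: only the attribute-specific discrepancy is
    learned and added back onto the input face."""
    return x + net(x)

x_neg = torch.randn(2, 3, 64, 64)               # faces without the attribute
x_with_attr = manipulate(x_neg, G_add)          # primal task: add the attribute
x_cycled = manipulate(x_with_attr, G_remove)    # dual task: remove the synthetic attribute
dual_loss = F.l1_loss(x_cycled, x_neg)          # encourages consistency with the input
```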

Since no extra conditional constraints are imposed, model-based methods can edit only one attribute per training process, with a corresponding set of model parameters, and the whole manipulation is supervised only by discriminating real from generated images. This means that when multiple attributes need to be changed, multiple training processes are inevitable, which is time-consuming. In contrast, manipulating facial attributes with extra conditions is a more commonly used approach, as multiple attributes can be edited within one training process by controlling multiple conditions. Hence, extra condition-based methods attract more attention; here, extra attribute vectors and reference exemplars are taken as the input conditions. Specifically, attribute vectors can be concatenated with the latent codes of the to-be-manipulated images to control facial attributes, while reference exemplars exchange specific attributes with the to-be-manipulated images in an image-to-image translation framework. More details about the extra condition-based deep FAM methods are introduced below.

Table 4 summarizes the state-of-the-art facial attribute manipulation approaches. Each entry lists the approach with its venue, the algorithm, the network architecture, and the datasets.

Model-based Methods:
  • DIAT li2016IDaware (arXiv1610). Algorithm: transferring input images to each reference attribute label while keeping the same or similar identity, for Deep Identity-Aware Transfer (DIAT) of facial attributes. Architecture: GAN. Dataset: CelebA.
  • InfoGAN chen2016infogan (NIPS2016). Algorithm: maximizing mutual information for interpretable representations and discovering visual concepts of facial attributes. Architecture: GAN. Dataset: CelebA.
  • UNIT liu2017unsupervised (NIPS2017). Algorithm: proposing an UNsupervised Image-to-image Translation (UNIT) framework under a shared-latent-space assumption. Architecture: GAN+VAE. Dataset: CelebA.
  • Residual Image shen2017learning (CVPR2017). Algorithm: learning residual images to avoid operating on the entire face with redundant, attribute-irrelevant information. Architecture: GAN. Dataset: CelebA.
  • Wang et al. wang2018weakly (WACV2018). Algorithm: combining a perceptual content loss and two adversarial losses to guarantee global consistency and produce more realistic images. Architecture: GAN. Datasets: CelebA, LFW.
  • SG-GAN zhang2018sparsely (ACMMM2018). Algorithm: constructing Sparsely Grouped Generative Adversarial Networks (SG-GAN) for sparsely grouped datasets where most training data are mixed and only a few are labelled. Architecture: GAN. Dataset: CelebA.

Extra Condition-based Methods, conditioned on attribute vectors:
  • VAE/GAN larsen2016autoencoding (ICML2016). Algorithm: using learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective. Architecture: GAN+VAE. Dataset: LFW.
  • CVAE yan2016attribute2image (ECCV2016). Algorithm: learning a layered foreground-background generative conditional variational auto-encoder for complex images. Architecture: VAE. Dataset: LFW.
  • IcGAN perarnau2016invertible (arXiv1611). Algorithm: combining an encoder with a cGAN to obtain the Invertible cGAN (IcGAN). Architecture: GAN+VAE. Dataset: CelebA.
  • Fader Network lample2017fader (NIPS2017). Algorithm: disentangling the salient information of face images and the values of attributes directly in the latent space for modifying facial attributes continuously. Architecture: AE. Dataset: CelebA.
  • CAAE zhang2017age (CVPR2017). Algorithm: learning a face manifold for smooth age progression and regression simultaneously in a conditional adversarial autoencoder (CAAE). Architecture: AE. Dataset: FGNET.
  • cCycleGAN lu2018attribute (ECCV2018). Algorithm: extending CycleGAN zhu2017unpaired conditioned on facial attributes with the cycle consistency loss. Architecture: GAN. Dataset: CelebA.
  • StarGAN choi2017stargan (CVPR2018). Algorithm: constructing StarGAN for multi-domain image-to-image translation. Architecture: GAN. Dataset: CelebA.
  • CRGAN li2018facial (Springer2018). Algorithm: introducing a recycle reconstruction loss to maintain personal facial identity and directly learning facial transformations with attribute annotations. Architecture: GAN. Dataset: CelebA.
  • SaGAN zhang2018generative (ECCV2018). Algorithm: introducing a spatial attention mechanism to modify only the attribute-specific region and keep the rest unchanged. Architecture: GAN. Dataset: CelebA.

Extra Condition-based Methods, conditioned on reference exemplars:
  • GeneGAN zhou2017genegan (arXiv1705). Algorithm: recombining the latent representation information of two paired attribute images for swapping specific attributes. Architecture: GAN. Dataset: CelebA.
  • ELEGANT xiao2018elegant (arXiv1803). Algorithm: Exchanging Latent Encodings with GAN for Transferring multiple face attributes (ELEGANT), performing image generation by exemplars and producing high-quality generated images. Architecture: GAN+VAE. Dataset: CelebA.

Table 4: State-of-the-art facial attribute manipulation approaches.

5.2 Extra Condition-based Methods

Deep FAM methods conditioned on extra attribute vectors alter desired attributes with given conditional attribute vectors, such as one-hot or binary vectors in which each value denotes whether an attribute is present. During training, the conditional attribute vectors are concatenated with the latent codes of the to-be-manipulated images.
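A minimal sketch of this conditioning mechanism is given below: an attribute vector is concatenated with the latent code before decoding. All dimensions and layers are illustrative assumptions and do not correspond to any specific method surveyed here.

```python
import torch
import torch.nn as nn

class ConditionalDecoderSketch(nn.Module):
    """Concatenate a binary attribute vector with the latent image code before
    decoding (dimensions are illustrative assumptions)."""
    def __init__(self, z_dim=128, n_attrs=40):
        super().__init__()
        self.fc = nn.Linear(z_dim + n_attrs, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, z, attrs):
        h = self.fc(torch.cat([z, attrs], dim=1)).view(-1, 64, 8, 8)
        return self.deconv(h)                    # 3 x 32 x 32 output

img = ConditionalDecoderSketch()(torch.randn(2, 128),
                                 torch.randint(0, 2, (2, 40)).float())
```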

Zhang et al. zhang2017age propose a conditional adversarial autoencoder (CAAE) for age progression and regression. CAAE first maps a face image to a latent vector through an encoder. Then, the latent vector, concatenated with an age label vector, is fed into a generator to learn a face manifold. The age label condition controls the change of age, while the latent vector ensures that the personalized facial features are preserved. Yan et al. yan2016attribute2image introduce a conditional variational auto-encoder (CVAE) to generate images from visual attributes. CVAE disentangles an image into foreground and background parts, each of which is combined with the defined attribute vector, so that the generation quality of complex images is significantly improved because much more attention is paid to the pivotal foreground areas. Perarnau et al. perarnau2016invertible propose invertible conditional GANs (IcGAN) to edit multiple facial attributes through explicit representations of the generated images. Given an input image, IcGAN first learns a representation consisting of a latent variable and a conditional vector via an encoder. Then, IcGAN modifies the latent variable and the conditional vector and re-generates the image through a conditional GAN mirza2014conditional. In this way, by changing the conditional vector produced by its encoder, IcGAN can achieve arbitrary attribute manipulation.

Beyond autoencoders, VAEs, GANs, and their variations, Larsen et al. larsen2016autoencoding combine VAEs and GANs into an unsupervised generative model, i.e., VAE/GAN. In this model, the GAN discriminator learns feature representations that serve as the basis of the VAE reconstruction objective, which means the VAE decoder and the GAN generator are collapsed into one network by sharing parameters and being trained jointly. Hence, VAE/GAN contains three parts: the encoder, the decoder, and the discriminator. By concatenating attribute vectors with the feature representations of these three parts, VAE/GAN achieves better performance than ordinary VAEs and GANs.

Recently, treating multiple attribute manipulation as a domain transfer task, Choi et al. choi2017stargan propose StarGAN, which learns the mappings among multiple domains with only a single generator and a single discriminator trained on all domains, where each domain corresponds to an attribute and the domain information is denoted by one-hot vectors. Specifically, the discriminator first distinguishes real images from fake images and classifies the real images into their corresponding domains. Then, the generator is trained to translate an input image into an output image conditioned on a randomly generated target domain label vector. In this way, the generator is capable of translating the input image flexibly. In short, StarGAN takes the domain labels as an extra supervision condition, which makes it possible to incorporate multiple datasets containing different types of labels simultaneously.
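The sketch below shows the shape of such a multi-domain objective: fool the discriminator, be classified into the target domain, and reconstruct the input when translated back with the original domain label. The toy generator and discriminator, the WGAN-style adversarial term, and the loss weights are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyG(nn.Module):
    """Toy conditional generator: the target-domain vector is tiled spatially
    and concatenated with the image channels."""
    def __init__(self, n_domains=5):
        super().__init__()
        self.net = nn.Conv2d(3 + n_domains, 3, 3, padding=1)
    def forward(self, x, c):
        c_map = c[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return torch.tanh(self.net(torch.cat([x, c_map], dim=1)))

class ToyD(nn.Module):
    """Toy discriminator with a real/fake head and a domain-classification head."""
    def __init__(self, n_domains=5):
        super().__init__()
        self.feat = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.src = nn.Linear(16, 1)
        self.cls = nn.Linear(16, n_domains)
    def forward(self, x):
        h = self.feat(x)
        return self.src(h), self.cls(h)

def generator_loss(G, D, x, c_orig, c_trg, lambda_cls=1.0, lambda_rec=10.0):
    x_fake = G(x, c_trg)
    src_logit, cls_logit = D(x_fake)
    adv = -src_logit.mean()                                      # fool the discriminator
    cls = F.binary_cross_entropy_with_logits(cls_logit, c_trg)   # land in the target domain
    rec = F.l1_loss(G(x_fake, c_orig), x)                        # translate back and reconstruct
    return adv + lambda_cls * cls + lambda_rec * rec

x = torch.randn(2, 3, 32, 32)
c_orig = torch.eye(5)[torch.tensor([0, 1])]
c_trg = torch.eye(5)[torch.tensor([2, 3])]
loss = generator_loss(ToyG(), ToyD(), x, c_orig, c_trg)
```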

As we can see, all the above methods are capable of editing multiple facial attributes simultaneously by changing multiple values of attribute vectors. However, none of them can continuously change the values of attributes.

In light of this, Lample et al. lample2017fader present the Fader network, which uses continuous attribute values to modify attributes through sliding knobs, like faders on a mixing console. For example, one can gradually change the value of the gender attribute to control the transition from ‘man’ to ‘woman’. The Fader network is composed of three ingredients: an encoder, a decoder, and a discriminator. With an image-attribute pair as input, the Fader network first maps the image to a latent representation via its encoder, while its discriminator tries to predict the attribute vector from this latent representation. It is worth noting that the encoder is trained adversarially so that the discriminator becomes incapable of predicting the attributes from the latent representation alone, making the latent code attribute-invariant. Then, the decoder reconstructs the image from the learned latent representation and the attribute vector. At test time, the discriminator is discarded, and images with various attributes can be generated by feeding different attribute values.
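A toy sketch of this encoder, attribute-conditioned decoder, and latent discriminator follows; the architectures, the MSE reconstruction term, and the adversarial formulation on the latent code are simplified assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy components: encoder, attribute-conditioned decoder, and a latent
# discriminator that tries to predict the attribute from the latent code
enc = nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
                    nn.Flatten(), nn.Linear(16 * 16 * 16, 64))
dec = nn.Sequential(nn.Linear(64 + 1, 3 * 32 * 32), nn.Tanh())
latent_disc = nn.Linear(64, 1)

x = torch.randn(4, 3, 32, 32)
attr = torch.randint(0, 2, (4, 1)).float()          # a single binary attribute value

z = enc(x)
recon = dec(torch.cat([z, attr], dim=1)).view(-1, 3, 32, 32)
rec_loss = F.mse_loss(recon, x)                      # reconstruct from code + attribute
disc_loss = F.binary_cross_entropy_with_logits(latent_disc(z.detach()), attr)
# the encoder is trained to *fool* the latent discriminator, making z attribute-invariant
fader_loss = rec_loss + F.binary_cross_entropy_with_logits(latent_disc(z), 1.0 - attr)
```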

Note that all of the methods mentioned above edit attributes over the whole face image, so attribute-irrelevant details might be changed at the same time. Hence, how to keep the areas outside the attribute-relevant regions unchanged is still a challenge.

In order to tackle this, Zhang et al. zhang2018generative introduce a spatial attention mechanism into GANs to locate attribute-relevant areas and manipulate facial attributes more precisely. The proposed GAN with spatial attention, dubbed SaGAN, follows the standard adversarial learning paradigm of FAM conditioned on attribute vectors, where a generator and a discriminator play a min-max game and extra attribute vectors are used to edit specific attributes. Specifically, to keep the attribute-irrelevant regions unchanged, the generator consists of an attribute manipulation network (AMN) and a spatial attention network (SAN). Given a face image, SAN learns a spatial attention mask in which attribute-relevant regions have attention values close to one while the rest are close to zero; in this way, the region where the desired attribute shows up can be located. AMN takes the face image and the attribute vector as inputs and outputs an image whose attribute is manipulated only within the specific region located by SAN.
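The sketch below illustrates the mask-based blending at the core of this idea: the edited pixels are used only where the attention mask is active and the input pixels are copied elsewhere. The single-convolution AMN and SAN are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SaGANGeneratorSketch(nn.Module):
    """Attribute manipulation network (AMN) plus spatial attention network (SAN):
    only the attended region is taken from the edited image, the rest is copied
    from the input (layer sizes are illustrative assumptions)."""
    def __init__(self, n_attrs=1):
        super().__init__()
        self.amn = nn.Conv2d(3 + n_attrs, 3, 3, padding=1)               # edits the face
        self.san = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())  # mask in [0,1]

    def forward(self, x, attr):
        a_map = attr[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        edited = torch.tanh(self.amn(torch.cat([x, a_map], dim=1)))
        mask = self.san(x)                       # attribute-relevant regions -> values near 1
        return mask * edited + (1 - mask) * x    # attribute-irrelevant regions stay unchanged

out = SaGANGeneratorSketch()(torch.randn(2, 3, 64, 64), torch.ones(2, 1))
```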

Rather than taking attribute vectors as extra conditions, deep FAM methods conditioned on reference exemplars exchange specific attributes with the to-be-manipulated images in an image-to-image translation framework. Note that these reference images do not need to share the same identity with the original to-be-manipulated images, and all the generated attributes are ones that appear in the real world. In this way, more specific details that appear in the reference images can be exploited to generate more realistic images, instead of manually altering attribute vectors. That is why more and more researchers pay attention to such reference exemplar-based methods.

Typically, Zhou et al. zhou2017genegan first design GeneGAN to achieve basic reference exemplar-based facial attribute manipulation, in which an image is encoded into two complementary codes, i.e., an attribute-specific code and an attribute-irrelevant code. By exchanging the attribute-specific codes between the reference exemplars and the to-be-manipulated images, desired attributes can be transferred from one image to another.

Considering that GeneGAN transfers only one attribute at a time, Xiao et al. xiao2018elegant construct the ELEGANT model, which exchanges latent encodings with a GAN to transfer multiple face attributes by exemplars. Specifically, since all the attributes are encoded in the latent space in a disentangled manner, one can exchange specific parts of the encodings and manipulate several attributes simultaneously. Besides, residual image learning and multi-scale discriminators for adversarial training enable the proposed model to train on higher-resolution images and to generate high-quality images with more delicate details and fewer artifacts. At the beginning of training, the ELEGANT model receives two sets of training images as inputs, i.e., a positive set and a negative set, which do not need to be paired. An encoder is then used to obtain the latent encodings of both the positive and negative images. After this, if the i-th attribute is required to be transferred, the only thing that needs to be done is to exchange the i-th part of the latent encodings of the positive and negative images. At this point the encoding part is finished, and a suitable structure is needed to decipher the latent encodings back into images. ELEGANT recombines the latent encodings and employs a decoder to do this job; together with the encoder, they play the role of the image generator. Finally, multi-scale discriminators, which consist of two discriminators with identical network structures operating at different scales, are used to obtain the final images with manipulated facial attributes.
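The latent-exchange step can be sketched as follows; the assumption that the encoding is split evenly into one fixed-size chunk per attribute is made purely for illustration.

```python
import torch

def exchange_attribute(z_pos, z_neg, i, n_attrs):
    """Swap the i-th chunk of two latent encodings (each encoding is assumed to be
    split evenly into n_attrs disentangled parts, one per attribute)."""
    chunk = z_pos.size(1) // n_attrs
    sl = slice(i * chunk, (i + 1) * chunk)
    z_pos_new, z_neg_new = z_pos.clone(), z_neg.clone()
    z_pos_new[:, sl], z_neg_new[:, sl] = z_neg[:, sl], z_pos[:, sl]
    return z_pos_new, z_neg_new

z_a = torch.randn(1, 40 * 8)   # hypothetical encoding: 40 attributes x 8 dims each
z_b = torch.randn(1, 40 * 8)
z_a_swapped, z_b_swapped = exchange_attribute(z_a, z_b, i=15, n_attrs=40)  # transfer attribute 15
```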

6 Additional Related Issues

6.1 Class-imbalanced Learning in Facial Attribute Analysis

Class-imbalanced data refers to datasets in which some classes have a much greater number of examples than others, corresponding to the majority and minority classes haixiang2017learning, respectively. For example, the largest imbalance ratio between a minority and majority attribute in the CelebA dataset is 1:43. In general, class-imbalanced data is handled with data re-sampling methods or cost-sensitive learning methods in many real-world applications. In the following, we list several strategies used in facial attribute analysis for tackling class-imbalanced data.

MOON rudd2016moon weights the back-propagated error according to a given balanced target distribution in a cost-sensitive way. Broadly speaking, MOON treats the class-imbalance problem as a domain adaptation task, using the balanced target domain distribution to guide the imbalanced source domain. Since MOON does not consider the label imbalance within each batch, AttCNN hand2018doing applies domain-adaptive re-sampling at the batch level via its selective learning algorithm; in this way, the data bias in each batch can be corrected toward the desired distribution.
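A simplified sketch of cost-sensitive re-weighting toward a balanced target distribution is shown below; this is a generic illustration of the idea, not MOON's exact weighting scheme.

```python
import torch
import torch.nn.functional as F

def balanced_bce_loss(logits, labels, pos_rate, target_rate=0.5):
    """Cost-sensitive binary cross-entropy: errors are re-weighted so that the
    effective label distribution matches a balanced target distribution."""
    pos_rate = pos_rate.clamp(1e-6, 1 - 1e-6)
    w_pos = target_rate / pos_rate                    # up-weight rare positives
    w_neg = (1 - target_rate) / (1 - pos_rate)        # down-weight frequent negatives
    weights = labels * w_pos + (1 - labels) * w_neg
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)

logits = torch.randn(8, 40)                           # 8 faces, 40 attributes
labels = torch.randint(0, 2, (8, 40)).float()
pos_rate = torch.full((40,), 0.1)                     # e.g. only 10% positives per attribute
loss = balanced_bce_loss(logits, labels, pos_rate)
```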

In contrast to the above methods, which handle class-imbalanced data in the form of domain adaptation, Huang et al. huang2016learning formulate a quintuplet sampling method with an associated triple-header loss, called large margin local embedding (LMLE), from the perspective of re-sampling. LMLE enforces the preservation of locality across clusters and discrimination between classes. A fast cluster-wise kNN search is then executed, followed by a local large-margin decision. LMLE thus learns embedded features that remain discriminative in the presence of local class imbalance.

However, Dong et al. dong2017CRL point out that LMLE has several fundamental weaknesses, including the separation of feature extraction and classification, the cost of computing quintuplets, and the offline clustering of data. In light of this, they present an end-to-end method that exploits batch-wise incremental hard mining on minority attribute classes and formulates a class rectification loss (CRL) based on the mined minority examples. As for the hard mining strategy, the profiles of minority hard-positives and hard-negatives are first defined; then, for the minority class of a specific attribute, the K hard-positives are selected as the bottom-K scored examples of the minority class, and the K hard-negatives as the top-K scored examples, given the pre-defined profiles and the current model. This process is executed in each batch and incrementally over subsequent batches, i.e., batch-wise incremental hard mining.
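A minimal sketch of this selection rule, for one attribute within one batch, is given below; it only illustrates how hard positives and negatives could be mined from the scores, not how the CRL itself is computed.

```python
import torch

def mine_hard_examples(scores, labels, k=3):
    """Batch-wise hard mining for one attribute's minority (positive) class:
    hard positives are the lowest-scored positives, hard negatives the
    highest-scored negatives."""
    pos_scores = scores[labels == 1]
    neg_scores = scores[labels == 0]
    k_pos = min(k, pos_scores.numel())
    k_neg = min(k, neg_scores.numel())
    hard_pos = pos_scores.topk(k_pos, largest=False).values   # bottom-K positives
    hard_neg = neg_scores.topk(k_neg, largest=True).values    # top-K negatives
    return hard_pos, hard_neg

scores = torch.rand(32)                      # predicted scores for one attribute in a batch
labels = (torch.rand(32) > 0.8).long()       # imbalanced: roughly 20% positives
hard_pos, hard_neg = mine_hard_examples(scores, labels)
```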

Based on LMLE, Huang et al. huang2018deep present cluster-based large margin local embedding (CLMLE), an improved version of LMLE. CLMLE aims to learn more discriminative deep representations with a CLMLE loss function that preserves inter-cluster margins both within and between classes. Different from LMLE, which enforces Euclidean distances on a hypersphere manifold, CLMLE designs angular margins enforced between the involved cluster distributions and adopts spherical k-means to obtain K clusters of the same size. CLMLE achieves better performance than both CRL and LMLE.

6.2 Relative Attribute Ranking in Facial Attribute Analysis

Relative attribute learning aims at constructing functions that rank the relative strength of attributes chen2014predicting, and it can be widely applied to object detection fan2013relative, fine-grained visual comparison shi2018fine, and facial attribute estimation li2018landmark. The insight in this line of work is either to learn global image representations in a unified framework lampert2009learning ; parikh2011relative, or to localize part-based representations bourdev2011describing ; sandeep2014relative ; zhang2014panda through pre-trained part detectors. However, the former ignores the localization of attributes, and the latter ignores the correlations between attributes, both of which degrade the performance of relative attribute ranking.

Xiao et al. xiao2015discovering first propose to automatically discover the spatial extent of relevant attributes by establishing a set of visual chains that indicate local and transitive connections, so that the localization of attributes can be learned automatically in an end-to-end way. Although no pre-trained detectors are used, the optimization pipeline still contains several independent modules, resulting in a suboptimal solution. To tackle this issue, Singh et al. singh2016end construct an end-to-end deep CNN that simultaneously learns the features, localization, and ranking of facial attributes, taking weakly supervised pairwise images as inputs. Specifically, given pairs of training images ordered according to the relative strength of an attribute, two Siamese networks are used, each taking one image as input. Each Siamese branch contains two components: a Spatial Transformer Network (STN), which generates image transformation parameters for localizing the most relevant regions, and a Ranker Network (RN), which outputs the predicted attribute scores for the images. The qualitative results on LFW-10attr show good performance in both attribute region localization and ranking accuracy.
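A pairwise ranking loss of the following form is one common way to train such a ranker from ordered image pairs; this RankNet-style formulation is a representative choice, not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_a, score_b, a_stronger):
    """If image A shows the attribute more strongly than image B, its predicted
    score should be higher; the score difference is pushed through a logistic loss."""
    target = a_stronger.float()                       # 1 if A > B, else 0
    return F.binary_cross_entropy_with_logits(score_a - score_b, target)

score_a = torch.randn(16)            # ranker outputs for the two Siamese branches
score_b = torch.randn(16)
a_stronger = torch.randint(0, 2, (16,))
loss = pairwise_ranking_loss(score_a, score_b, a_stronger)
```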

To model the pair-wise relationships between images for multiple attributes, Meng et al. meng2018efficient construct a graph model in which each node represents an image and the edges model the relationships between images and attributes, as well as between images. The overall framework is made up of two components: a CNN that extracts the primary features of the node images, and a graph neural network (GNN) that learns the edge features and performs the subsequent updates. After the feature representations are learned by the CNN, the relationships among all the images are first modeled by a fully connected graph over these features. Then, a gated recurrent unit (GRU) takes each node and its corresponding information as inputs and outputs the updated node. As a result, the correlations among attributes can be modeled by using information from the neighbors of each node and by updating each node's state based on its previous state.

6.3 Adversarial Robustness in Facial Attribute Analysis

Adversarial images, which can be generated for networks regardless of their topology, training process, and hyperparameter settings, are inputs with slight artificial perturbations such that the original input is classified correctly while the adversarial input is misclassified. Szegedy et al. szegedy2013intriguing first show that neural networks can be induced to misclassify an image by carefully chosen perturbations that are imperceptible to humans. Hence, the study of adversarial images has entered the horizon of researchers.

Rozsa et al. rozsa2017facial attempt to induce small artificial perturbations on existing misclassified inputs to correct their classification in facial attribute analysis. Specifically, adversarial images are generated over a random subset of CelebA via the fast flipping attribute (FFA) technique. The FFA algorithm leverages the back-propagation of a Euclidean loss, without ground-truth labels, to generate adversarial images by flipping the binary decision of the deep network. Separate facial attribute networks are trained for each attribute, and the robustness of these networks is tested with FFA. The experiments demonstrate that although FFA can create more adversarial images than the related fast gradient sign (FGS) method goodfellow2015explaining, only a few other attributes are affected when the targeted attributes are flipped. Such attribute flipping also motivates the task of attribute anonymization.

Chhabra et al. chhabra2018anonymizing propose to anonymize, via adversarial perturbation, multiple facial attributes that one does not want to share, while simultaneously preserving the other attributes and the visual quality of the images. An algorithm is derived to partition pre-defined attributes into a preservation set and a suppression set; images with adversarial perturbations can then be generated based on these attribute sets. As a result, the prediction of a suppressed attribute is shifted from its true category to a different target category.

To sum up, the study of adversarial robustness contributes to improving the representational stability of current facial attribute estimation algorithms. Moreover, by withstanding attacks from adversarial examples, deep facial attribute estimation models become more robust and achieve impressive performance.

7 Challenges and Opportunities

As discussed before, facial attribute analysis is an important computer vision task both in theory and in real-world applications songlx2018geometry ; lu2018conditional ; hu2018pose. Firstly, we described what facial attribute analysis is and what its two main sub-issues cover, i.e., facial attribute estimation (FAE) and facial attribute manipulation (FAM). Secondly, following the general face image processing pipeline, we reviewed the evolution of both sub-tasks and introduced their pipelines respectively. Thirdly, we summarized the public databases as well as the commonly used metrics. Fourthly, we presented the current state of the art in both FAE and FAM with respective taxonomies, i.e., part-based methods and holistic methods for FAE, and model-based methods and extra condition-based methods for FAM. It is our firm belief that such taxonomies help readers understand the facial attribute analysis task better.

Despite the promising performance achieved by the many current algorithms for facial attribute analysis, there are still several challenging issues that deserve more attention. In this section, we discuss the challenges and describe the future trends in both FAE and FAM from the perspectives of databases, algorithms, and real-world applications.

7.1 Discussion of Facial Attribute Estimation

7.1.1 Data

The development of deep neural networks makes facial attribute estimation a data-driven task, which means that a large number of training samples is required to train deep neural networks sufficiently to capture attribute-relevant facial details. However, current facial attribute databases face the problem of insufficient data, which is a primary and major challenge.

Taking two frequently used datasets, CelebA and LFWA, as examples, we analyze some problems of existing facial attribute databases.

Firstly, from the perspective of data sources, CelebA collects and labels facial attribute data from celebrities, while the samples of LFWA come from online news. There is no doubt that these databases are biased and do not match the general data distributions of the real world, since celebrities or people in the news have some attributes that ordinary persons do not have. For example, the ‘bald’ attribute corresponds to a small number of samples in CelebA, but in the real world it is a very common attribute among ordinary people. Hence, more complementary facial attribute datasets that cover more real-world scenarios and a wider range of facial attributes need to be constructed in the future. We anticipate that such efforts on databases would significantly reduce the over-fitting problem of deep neural networks.

Secondly, insufficient data leads to the data imbalance problem. We discuss this problem from two aspects: the imbalance in distribution and the imbalance in category. The imbalance in distribution corresponds to the domain adaptation problem, and the imbalance in category is the class-imbalance issue discussed in the previous section of this paper. We analyze the crucial problems of these two imbalance settings below.

When predicting attributes on other datasets that collect samples from ordinary persons or from other data sources, e.g., the internet or videos, or indeed on any dataset whose distribution differs from that of the training data, domain adaptation becomes a new challenge that researchers have to face. The discrepancies between databases have a negative effect on generalizability to unseen test data and lead to significant performance deterioration. Few studies consider the domain adaptation issue in facial attribute estimation, and future work needs to pay more attention to it.

Looking at the details of the class-imbalance issue, such imbalance on certain attributes can bias the classification performance, since the features of minority attributes cannot be sufficiently learned. As mentioned above, only a small amount of literature focuses on this issue; data pre-processing and cost-sensitive learning remain the keys to dealing with the class-imbalance problem in future facial attribute estimation.

7.1.2 Algorithms

Two types of deep FAE methods have developed in parallel, i.e., part-based methods and holistic methods; the former pay more attention to locating attributes, while the latter are more committed to modeling attribute relationships. We outline the main challenges faced by both from the perspective of algorithms and analyze future trends.

For the part-based methods, earlier methods draw support from existing part detectors to discover facial components. However, these detected face parts are coarse and attribute-independent, as such detectors only distinguish the whole face from face-irrelevant parts, such as the background of an image. Considering that existing detectors are not specifically designed for facial attribute estimation, some researchers turn to other face-related auxiliary tasks that focus on facial details rather than the whole face. There are also studies that utilize labeled key points to partition facial regions. However, well-labeled facial images are not always available, and the performance of the auxiliary tasks limits the accuracy of the subsequent classification tasks. In the future, we believe there is an urgent need for an end-to-end strategy that learns attribute-relevant regions over whole images and predicts the corresponding attributes over these regions in a unified framework. Although Ding et al. ding2017deepCascadeFAE have made an attempt to tackle this, learning a separate region for each attribute is redundant, since several attributes might show up in the same region. How to model attribute relationships while locating their positions is a challenge for future part-based methods.

Besides, part-based methods show great superiority when dealing with data under in-the-wild environmental conditions, such as illumination variation, occlusions, and non-frontal faces. By learning the locations of different attributes, part-based methods can integrate information from non-occluded areas to predict attributes in occluded areas. Mahbub et al. mahbub2018segment deal with this issue by partitioning facial parts manually according to key points; however, as mentioned above, such labels are not always available, so learning and integrating these non-occluded areas adaptively is a future trend. Moreover, the lack of in-the-wild data is still a challenge for training deep neural networks for facial attribute estimation in the wild; for instance, Mahbub et al. mahbub2018segment test their model by adding occlusions manually, which is not a standardized test protocol.

As for the holistic methods, the state of the art designs different network architectures for sharing common features and learning attribute-specific features at different layers. However, these methods define attribute relationships for constructing networks by grouping attributes manually, rather than learning such relationships adaptively. Extra prior information given by researchers is thus introduced, while different individuals might partition attribute groups differently according to locations or semantics. It is hard to determine whether the current facial attribute groups are suitable and optimal. How to discover attribute relationships adaptively, without artificially given prior information, should be a focus of future work.

Also, facial attributes have been taken as auxiliary and complementary information for many face-related tasks, such as face recognition, face detection, and facial landmark localization. Despite the promising performance achieved by pure facial attribute estimation over face images, joint learning of these relevant tasks can further enhance their respective robustness and performance by discovering complementary information among them. For example, considering the inherent dependencies among face-related tasks, Zhuang et al. design a cascaded CNN that simultaneously learns face detection, facial landmark localization, and facial attribute estimation under a multi-task framework to improve the performance of each task. They further attempt to jointly learn face recognition and facial attribute estimation, taking the relationship between identities and attributes into consideration. Therefore, it is reasonable to expect that the combination of different face-related tasks will become a promising research direction due to the complementarity among them.

7.1.3 Applications

Current research mainly focuses on image data for predicting facial attributes, while video data, which contains a more extensive collection of images of the same person from varied viewpoints, is largely ignored. Earlier work wang2016walk creates the Ego-Humans dataset, which draws images from videos of casual walkers tracked in New York City over two months. The Ego-Humans dataset extracts five million face pairs along with their same/not-same labels and their weather and location context, which facilitates facial attribute prediction through additional features related to location and weather information. For example, sunny days encourage people to wear sunglasses, and individuals in India Square are more likely to come from South Asia. Note that the video data itself is not directly utilized but is reduced to image data, so different viewpoints of the same person are not used to assist the facial attribute estimation task.

On the one hand, such viewpoint diversification helps to learn richer features of the same person and to maintain identity-attribute consistency, which aligns the attributes of the same identity when predicting attributes. On the other hand, attribute inconsistency becomes a new challenge, as images from different viewpoints might differ in attributes even for the same identity. For example, side-face images might disagree with frontal-face images in the prediction of ‘high cheekbones’, as this attribute is not emphasized in side-face images and even humans cannot discriminate it precisely. Liu et al. lu2018attribute work on such attribute inconsistency for multiple face images, proposing a probabilistic confidence criterion and an image quality criterion to address the inconsistency issue. Specifically, the probabilistic confidence criterion first extracts the most confident image for each subject and then chooses the result corresponding to the highest confidence as the final prediction of each attribute for that subject. However, the study of this attribute inconsistency issue in video data still has a lacuna to be filled, and we anticipate that more complementary algorithms and facial attribute video databases will be developed for deep facial attribute estimation in the future.

Besides, nowadays many real-world devices contain valuable personal information of users, such as bank accounts and personal emails, and such personal details make these devices targets of various attacks. Hence, biological characteristics, such as fingerprints and irises, have been widely used as passwords on these devices to further protect users' private information, i.e., biometric identification. Moreover, more and more biometrics-based algorithms have emerged as solutions for continuous authentication on these devices, and many researchers are committed to designing active authentication algorithms based on face biometrics. For example, studies in fathy2015face ; gunther20132013 ; hadid2007face detect faces from camera sensor images and further extract low-level features for the authentication of smartphone users.

Inspired by these methods, we consider that facial attributes contain more detailed characteristics than full-face identification, and much more attention should be paid to such attribute information to further advance biometric identification on mobile devices. Samangouei et al. samangouei2017facial have made an attempt at active authentication of mobile devices employing facial attributes. A set of binary attribute classifiers is trained to estimate whether attributes are present in images of the current user of a mobile device. Then, authentication can be implemented by comparing the recognized attributes with the originally enrolled ones.

However, the features extracted by Samangouei et al. samangouei2017facial are traditional features, such as LBP, which are not task-specific for facial attribute estimation and are less discriminative than deep features. Besides, the binary attribute classifiers are also traditional SVMs. To some extent, the use of these traditional features and classifiers aims to balance verification accuracy and mobile performance, since some methods with good accuracy might have tremendous computation or memory costs. Hence, the future challenges lie not only in how to better apply facial attributes to mobile device authentication, but also in how to explore more discriminative deep features and classifiers under the constraint of the trade-off between verification accuracy and mobile performance.

7.2 Discussion of Facial Attribute Manipulation

7.2.1 Data

Generally, facial attribute manipulation is a conditional generative task, which synthesizes images according to given conditions. Note that the conditions discussed here are different from the conditions mentioned when we created the taxonomy of state-of-the-art facial attribute manipulation methods. The conditions here denote a broader concept, covering the given input images, i.e., the model-based methods, as well as additional extra conditions, such as attribute vectors and reference exemplars. Compared with unconditional models that synthesize images only from random noise, facial attribute manipulation is a more controllable process. Moreover, facial attribute manipulation is an image-to-image translation issue, which typically learns feature mappings from one image domain to another or discovers joint representations across different image domains, i.e., multi-domain transfer. This means that many image-to-image translation algorithms can be directly used for facial attribute manipulation. As mentioned above, we only list studies that specifically conduct experiments on facial attribute databases, i.e., CelebA and LFWA.

From the perspective of data, note that current FAM methods only edit several conspicuous attributes on these two databases, such as smile, glasses, gender, and hair color, while attributes that represent high-level semantic or abstract details, such as attractive, shadow, and blurry, cannot yet be manipulated. To some extent, this is because it is difficult to define explicit concepts for these attributes, since the original images are labeled with them subjectively. However, the synthesis of these abstract attributes matters a great deal for real-world beauty and makeup applications. Moreover, the heavy makeup and age attributes have become essential branches of image transformation, i.e., makeup manipulation li2018beautygan ; chang2018pairedcyclegan and face aging suo2010compositional, which are tougher tasks that attract the attention of many researchers. We believe they will remain hot research topics in the future.

Further, from the perspective of metrics, as mentioned in Section 3, existing methods either evaluate generated images by statistical surveys or draw on other face-related tasks, such as attribute estimation and landmark detection. Unified and standard metric systems have not yet been formed in terms of qualitative and quantitative analysis. We anticipate that this will be a research direction that deserves more attention in the future.

7.2.2 Algorithms

State-of-the-art deep FAM methods can be grouped into two categories: model-based methods and extra condition-based methods. Model-based methods tackle an attribute domain transfer problem and use an adversarial loss to supervise the image generation process, while extra condition-based methods alter desired attributes with given conditional attributes that are concatenated with the latent codes of the to-be-manipulated images. The main difference between the two types of methods is whether extra conditions are required.

Model-based methods take no extra conditions as inputs, and one trained model can change only one corresponding attribute. This strategy is task-specific and helps to generate more photo-realistic images, but it is difficult to guarantee that attribute-irrelevant details remain unchanged, because the model operates on the whole image directly. Few methods have focused on this issue, except the ResGAN proposed by Shen et al. shen2017learning. However, ResGAN generates residual images for locating attribute-relevant regions under a sparsity constraint, which relies heavily on a control parameter rather than on the attribute itself. Hence, how to design networks that synthesize the desired photo-realistic attributes while keeping other attribute-irrelevant details unchanged remains a significant challenge. In addition, multi-domain transfer has become the focus of many researchers liu2018unified ; zhang2018xogan ; we expect these novel algorithms to migrate to the facial attribute field and yield more appealing performance for multiple attribute manipulation.

Extra condition-based methods take attribute vectors or reference exemplars as conditions to edit facial attributes by changing the values of the vectors or the latent codes of the reference exemplars. One advantage of this type of method is that multiple attributes can be manipulated simultaneously by altering multiple corresponding condition values. However, these methods cannot change attributes continuously, and we believe this can be solved by interpolation strategies berthelot2018understanding in the future. In addition, as mentioned above, algorithms based on reference exemplars are becoming a promising research direction, since more specific details appearing in the reference images can be exploited to generate more realistic images, compared with manually altering attribute vectors when attribute vectors are taken as conditions.

7.2.3 Applications

Facial makeup is a sub-issue of facial attribute manipulation and is widely applied on various mobile devices, such as beauty cameras. Facial makeup pays more attention to facial details related to makeup, such as the types of eyeshadows and the colors of lipsticks. The focus of makeup studies is facial makeup transfer and removal chang2018pairedcyclegan, where makeup transfer aims to map one makeup style to another for generating different makeup styles li2018beautygan, and the removal task performs the opposite process, which cleans off the existing makeup and supports makeup-invariant face verification.

As an essential branch of facial attribute manipulation, facial makeup transfer and removal algorithms follow the general strategy of FAM by treating makeup transfer as a domain transfer or style transfer task among images of different makeup styles, or between makeup and non-makeup images. More recently, facial makeup transfer based on reference exemplars has become a popular research topic. For example, BeautyGAN li2018beautygan transfers makeup by incorporating the strategies of domain transfer and reference exemplar transfer. Specifically, discriminators distinguish generated images from real ones at the domain level, while a pixel-level histogram loss acts on separate local facial regions at the reference exemplar level. In this way, delicate makeup information can be learned and transferred to generate photo-realistic makeup images with different styles.

In addition, note that existing methods only work within a restricted range of resolutions. This constraint offers an opportunity to combine face super-resolution with facial attribute manipulation. For example, Lu et al. lu2018attribute propose a conditional version of CycleGAN zhu2017unpaired to generate face images under the guidance of attributes for face super-resolution. Taking a pair of low/high-resolution faces as input, as well as an attribute vector extracted from the high-resolution one, the attribute-guided conditional CycleGAN learns to generate a high-resolution version of the original low-resolution image with the attributes of the original high-resolution image. Besides, Dorta et al. dorta2018gan recently apply smooth warp fields to GANs for manipulating face images at very high resolutions with a deep network operating at a lower resolution. All these schemes inspire us to draw on state-of-the-art face super-resolution methods to solve facial attribute manipulation related issues in the future.

Besides, there is still a challenge to face: facial attribute manipulation in videos has not yet been studied, and it is a more difficult issue since video frames are dynamic. We believe this is due to the lack of relevant data and expect that more focus will shift to this field in the future.

So far, we have provided some possible challenges to be faced in FAE and FAM, and in the following, we will discuss the relationship between the two issues and how they assist each other.

For facial attribute estimation, facial attribute manipulation can serve as a vital scheme of data augmentation, where generated facial attribute images can significantly increase the amount of data available for training deep neural networks. As a result, over-fitting can be reduced, further improving the accuracy of attribute prediction. When it comes to facial attribute manipulation, facial attribute estimation can serve as a significant quantitative performance evaluation criterion, where the accuracy gap between real and generated images reflects the performance of facial attribute manipulation algorithms. However, despite the mutual assistance provided by the two tasks, there are still some issues to be tackled.

Firstly, generated facial attribute images may not contain enough delicate facial information, which means there is still a gap between real images and the generated images used for augmentation, and such a gap might harm the estimation performance. Hence, how to close this gap is an essential research direction for data augmentation in the future. Secondly, because the assessment of facial attribute manipulation is indirect, the performance of facial attribute estimation has an indirect effect on the manipulation task. Therefore, how to balance the metric with the prediction performance is another challenge to be dealt with. We anticipate that facial attribute estimation and facial attribute manipulation will become more complementary and significantly improve each other's performance in the near future.

8 Conclusion

As semantic features describing visual properties of face images, facial attributes have received considerable attention in the computer vision field. Facial attribute analysis, including facial attribute estimation and manipulation, provides an opportunity to improve the performance of numerous real-world applications. This paper provides a comprehensive review of recent advances in facial attribute estimation and manipulation. The commonly used databases and metrics are summarized, and taxonomies of both issues are created, together with their pros and cons. In addition, future challenges and opportunities are highlighted in terms of data, algorithms, and applications. We look forward to further studies that tackle these challenges and promote the development of facial attribute analysis.

References

  • (1) N. Kumar, A. C. Berg, P. N. Belhumeur, S. K. Nayar, Attribute and simile classifiers for face verification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2009, pp. 365–372.
  • (2) T. Berg, P. N. Belhumeur, Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 955–962.
  • (3) F. Song, X. Tan, S. Chen, Exploiting relationship between attributes for improved face verification, Computer Vision and Image Understanding 122 (2014) 143–154.
  • (4) S. Zhang, R. He, Z. Sun, T. Tan, Demeshnet: Blind face inpainting for deep meshface verification, IEEE Transactions on Information Forensics and Security (TIFS) 13 (3) (2018) 637–647.
  • (5) R. He, X. Wu, Z. Sun, T. Tan, Wasserstein cnn: Learning invariant features for nir-vis face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • (6) Z. Shi, T. M. Hospedales, T. Xiang, Transferring a semantic representation for person re-identification and search, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015, pp. 4184–4193.
  • (7) R. He, T. Tan, L. Davis, Z. Sun, Learning structured ordinal measures for video based face recognition, Pattern Recognition 75 (2018) 4–14.
  • (8) L. Song, M. Zhang, X. Wu, R. He, Adversarial discriminative heterogeneous face recognition, in: Proceedings of the Conference on Artificial Intelligence (AAAI), 2018.
  • (9) Y. Li, R. Wang, H. Liu, H. Jiang, S. Shan, X. Chen, Two birds, one stone: Jointly learning binary code for large-scale face image retrieval and attributes prediction, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 3819–3827.
  • (10) H. M. Nguyen, N. Q. Ly, T. T. Phung, Large-scale face image retrieval system at attribute level based on facial attribute ontology and deep neuron network, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2018, pp. 539–549.
  • (11) Y. Fang, Q. Yuan, Attribute-enhanced metric learning for face retrieval, EURASIP Journal on Image and Video Processing 2018 (1) (2018) 44.
  • (12) H. Huang, Z. Li, R. He, Z. Sun, T. Tan, Introvae: Introspective variational autoencoders for photographic image synthesis, arXiv preprint arXiv:1807.06358.
  • (13) J. Cao, Y. Hu, B. Yu, R. He, Z. Sun, Load balanced gans for multi-view face image synthesis, arXiv preprint arXiv:1802.07447.
  • (14) H. Huang, L. Song, R. He, Z. Sun, T. Tan, Variational capsules for image analysis and synthesis, arXiv preprint arXiv:1807.04099.
  • (15) L. Song, J. Cao, L. Song, Y. Hu, R. He, Geometry-aware face completion and editing, in: Proceedings of the Conference on Artificial Intelligence(AAAI), 2019.
  • (16) U. Mahbub, S. Sarkar, R. Chellappa, Segment-based methods for facial attribute detection from partial faces, arXiv preprint arXiv:1801.03546.
  • (17) M. M. Kalayeh, B. Gong, M. Shah, Improving facial attribute prediction using semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 4227–4235.
  • (18) N. Zhang, M. Paluri, M. Ranzato, T. Darrell, L. Bourdev, Panda: Pose aligned networks for deep attribute modeling, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2014, pp. 1637–1644.
  • (19) N. Kumar, P. Belhumeur, S. Nayar, Facetracer: A search engine for large collections of images with faces, in: European Conference on Computer Vision (ECCV), Springer, 2008, pp. 340–353.
  • (20) G. Gkioxari, R. Girshick, J. Malik, Actions and attributes from wholes and parts, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 2470–2478.
  • (21) R. Ranjan, V. M. Patel, R. Chellappa, Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • (22) M. Günther, A. Rozsa, T. E. Boult, AFFACT: alignment-free facial attribute classification technique, in: Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), IEEE, 2017, pp. 90–99.
  • (23) L. Bourdev, J. Malik, Poselets: Body part detectors trained using 3d human pose annotations, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2009, pp. 1365–1372.
  • (24) R. Ranjan, S. Sankaranarayanan, C. D. Castillo, R. Chellappa, An all-in-one convolutional neural network for face analysis, in: Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG), IEEE, 2017, pp. 17–24.
  • (25) K. He, Y. Fu, X. Xue, A jointly learned deep architecture for facial attribute analysis and face detection in the wild, arXiv preprint arXiv:1707.08705.
  • (26) H. Ding, H. Zhou, S. K. Zhou, R. Chellappa, A deep cascade network for unaligned face attribute classification, arXiv preprint arXiv:1709.03851.
  • (27) J. Li, F. Zhao, J. Feng, S. Roy, S. Yan, T. Sim, Landmark free face attribute prediction, IEEE Transactions on Image Processing 27 (9) (2018) 4651–4662.
  • (28) O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., Deep face recognition., in: Proceedings of the British Machine Vision Conference 2015, (BMVC), Vol. 1, 2015, p. 6.
  • (29) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2016, pp. 770–778.
  • (30) Y. Zhong, J. Sullivan, H. Li, Face attribute prediction using off-the-shelf cnn features, in: Proceedings of the IEEE International Conference on Biometrics (ICB), IEEE, 2016, pp. 1–7.
  • (31) Y. Zhong, J. Sullivan, H. Li, Leveraging mid-level deep representations for predicting face attributes in the wild, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 3239–3243.
  • (32) Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, R. Feris, Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2017, p. 6.
  • (33) M. I. Belghazi, S. Rajeswar, O. Mastropietro, N. Rostamzadeh, J. Mitrovic, A. Courville, Hierarchical adversarially learned inference, arXiv preprint arXiv:1802.01071.
  • (34) C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.
  • (35) L. Bourdev, S. Maji, J. Malik, Describing people: A poselet-based approach to attribute classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 1543–1550.
  • (36) P. Luo, X. Wang, X. Tang, A deep sum-product architecture for robust facial attributes analysis, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2013, pp. 2864–2871.
  • (37) C. Huang, Y. Li, C. Change Loy, X. Tang, Learning deep representation for imbalanced classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5375–5384.
  • (38) C. Huang, Y. Li, C. C. Loy, X. Tang, Deep imbalanced learning for face recognition and attribute prediction, arXiv preprint arXiv:1806.00194.
  • (39) E. M. Rudd, M. Günther, T. E. Boult, Moon: A mixed objective optimization network for the recognition of facial attributes, in: European Conference on Computer Vision (ECCV), Springer, 2016, pp. 19–35.
  • (40) A. Rozsa, M. Günther, E. M. Rudd, T. E. Boult, Are facial attributes adversarially robust?, in: Proceedings of the International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 3121–3127.
  • (41) E. M. Hand, R. Chellappa, Attributes for improved attributes: a multi-task network utilizing implicit and explicit relationships for facial attribute classification, in: Proceedings of the Conference on Artificial Intelligence(AAAI), 2017, pp. 4068–4074.
  • (42) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
  • (43) T. L. Miller, A. C. Berg, J. A. Edwards, M. R. Maire, R. M. White, Y.-W. Teh, E. Learned-Miller, D. Forsyth, Names and faces.
  • (44) Amazon, Amazon mechanical turk, https://www.mturk.com/.
  • (45) N. Kumar, A. Berg, P. N. Belhumeur, S. Nayar, Describable visual attributes for face verification and image search, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 33 (10) (2011) 1962–1977.
  • (46) Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3730–3738.
  • (47) Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1988–1996.
  • (48) J. Wang, Y. Cheng, R. Schmidt Feris, Walk and learn: Facial attribute representation learning from egocentric video and contextual data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2295–2304.
  • (49) Wikipedia, Evaluation measures (information retrieval), https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision.
  • (50) Y. Yue, T. Finley, F. Radlinski, T. Joachims, A support vector method for optimizing average precision, in: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2007, pp. 271–278.
  • (51) J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2007, pp. 1–8.
  • (52) Y. Choi, M. Choi, M. Kim, Stargan: Unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8789–8797.
  • (53) Z. Zhang, Y. Song, H. Qi, Age progression/regression by conditional adversarial autoencoder, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, 2017, pp. 4352–4360.
  • (54) R. Sun, C. Huang, J. Shi, L. Ma, Mask-aware photorealistic face attribute manipulation, arXiv preprint arXiv:1804.08882.
  • (55) G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al., Fader networks: Manipulating images by sliding attributes, in: Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5967–5976.
  • (56) T. Xiao, J. Hong, J. Ma, Elegant: Exchanging latent encodings with gan for transferring multiple face attributes, arXiv preprint arXiv:1803.10562.
  • (57) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two time-scale update rule converge to a local nash equilibrium, in: Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6626–6637.
  • (58) Y. Wang, S. Wang, G. Qi, J. Tang, B. Li, Weakly supervised facial attribute manipulation via deep adversarial network, in: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 112–121.
  • (59) Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (4) (2004) 600–612.
  • (60) Z. He, W. Zuo, M. Kan, S. Shan, X. Chen, Arbitrary facial attribute editing: Only change what you want, arXiv preprint arXiv:1711.10678.
  • (61) C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Proceedings of the Conference on Artificial Intelligence (AAAI), Vol. 4, 2017, p. 12.
  • (62) D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, W.-Y. Ma, Dual learning for machine translation, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 820–828.
  • (63) V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1867–1874.
  • (64) C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, M. Pantic, A semi-automatic methodology for facial landmark annotation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2013, pp. 896–903.
  • (65) G. Perarnau, J. van de Weijer, B. Raducanu, J. M. Álvarez, Invertible conditional gans for image editing, arXiv preprint arXiv:1611.06355.
  • (66) A. B. L. Larsen, S. K. Sønderby, H. Larochelle, O. Winther, Autoencoding beyond pixels using a learned similarity metric, in: Proceedings of the IEEE International Conference on Machine Learning (ICML), 2016, pp. 1558–1566.
  • (67) G. B. Huang, M. Mattar, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, in: Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
  • (68) E. M. Hand, C. D. Castillo, R. Chellappa, Doing the best we can with what we have: Multi-label balancing with selective learning for attribute prediction, in: Proceedings of the Conference on Artificial Intelligence (AAAI), 2018, pp. 6878–6885.
  • (69) V. Le, J. Brandt, Z. Lin, L. Bourdev, T. S. Huang, Interactive facial feature localization, in: European Conference on Computer Vision (ECCV), Springer, 2012, pp. 679–692.
  • (70) B. M. Smith, L. Zhang, J. Brandt, Z. Lin, J. Yang, Exemplar-based face parsing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 3484–3491.
  • (71) K. He, Y. Fu, W. Zhang, C. Wang, Y.-G. Jiang, F. Huang, X. Xue, Harnessing synthesized abstraction images to improve facial attribute recognition, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018.
  • (72) J. Cao, Y. Li, Z. Zhang, Partially shared multi-task convolutional neural network with local constraint for face attribute learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4290–4299.
  • (73) H. Han, A. K. Jain, S. Shan, X. Chen, Heterogeneous face attribute estimation: A deep multi-task learning approach, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • (74) J. A. Tropp, A. C. Gilbert, M. J. Strauss, Algorithms for simultaneous sparse approximation. part i: Greedy pursuit, Signal Processing 86 (3) (2006) 572–588.
  • (75) K. K. Singh, Y. J. Lee, End-to-end localization and ranking for relative attributes, in: European Conference on Computer Vision (ECCV), Springer, 2016, pp. 753–769.
  • (76) N. Zhuang, Y. Yan, S. Chen, H. Wang, C. Shen, Multi-label learning based deep transfer neural network for facial attribute classification, Pattern Recognition 80 (2018) 225–240.
  • (77) F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
  • (78) Q. Dong, S. Gong, X. Zhu, Class rectification hard mining for imbalanced deep learning, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 1869–1878.
  • (79) A. Sethi, M. Singh, R. Singh, M. Vatsa, Residual codean autoencoder for facial attribute analysis, Pattern Recognition Letters.
  • (80) M. Li, W. Zuo, D. Zhang, Deep identity-aware transfer of facial attributes, arXiv preprint arXiv:1610.05586.
  • (81) J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
  • (82) M.-Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, in: Advances in Neural Information Processing Systems (NIPS), 2017, pp. 700–708.
  • (83) W. Shen, R. Liu, Learning residual images for face attribute manipulation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 1225–1233.
  • (84) X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, Infogan: Interpretable representation learning by information maximizing generative adversarial nets, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2172–2180.
  • (85) J. Zhang, Y. Shu, S. Xu, G. Cao, F. Zhong, X. Qin, Sparsely grouped multi-task generative adversarial networks for facial attribute manipulation, arXiv preprint arXiv:1805.07509.
  • (86) X. Yan, J. Yang, K. Sohn, H. Lee, Attribute2image: Conditional image generation from visual attributes, in: European Conference on Computer Vision (ECCV), Springer, 2016, pp. 776–791.
  • (87) Y. Lu, Y.-W. Tai, C.-K. Tang, Attribute-guided face generation using conditional cyclegan, in: European Conference on Computer Vision (ECCV), Springer, 2018, pp. 293–308.
  • (88) H.-Y. Li, W.-M. Dong, B.-G. Hu, Facial image attributes transformation via conditional recycle generative adversarial networks, Journal of Computer Science and Technology 33 (3) (2018) 511–521.
  • (89) G. Zhang, M. Kan, S. Shan, X. Chen, Generative adversarial network with spatial attention for face attribute editing, in: European Conference on Computer Vision (ECCV), 2018, pp. 417–432.
  • (90) S. Zhou, T. Xiao, Y. Yang, D. Feng, Q. He, W. He, Genegan: Learning object transfiguration and attribute subspace from unpaired data, arXiv preprint arXiv:1705.04932.
  • (91) M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784.
  • (92) G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, G. Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Systems with Applications 73 (2017) 220–239.
  • (93) L. Chen, Q. Zhang, B. Li, Predicting multiple attributes via relative multi-task learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1027–1034.
  • (94) Q. Fan, P. Gabbur, S. Pankanti, Relative attributes for large-scale abandoned object detection, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2736–2743.
  • (95) H. Shi, L. Tao, Fine-grained visual comparison based on relative attribute quadratic discriminant analysis, IEEE Transactions on Systems, Man, and Cybernetics: Systems.
  • (96) C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 951–958.
  • (97) D. Parikh, K. Grauman, Relative attributes, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 503–510.
  • (98) R. N. Sandeep, Y. Verma, C. Jawahar, Relative parts: Distinctive parts for learning relative attributes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3614–3621.
  • (99) F. Xiao, Y. Jae Lee, Discovering the spatial extent of relative attributes, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1458–1466.
  • (100) Z. Meng, N. Adluru, H. J. Kim, G. Fung, V. Singh, Efficient relative attribute learning using graph neural networks, in: European Conference on Computer Vision (ECCV), 2018, pp. 552–567.
  • (101) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, in: International Conference on Learning Representations(ICLR), 2014.
  • (102) A. Rozsa, M. Günther, E. M. Rudd, T. E. Boult, Facial attributes: Accuracy and adversarial robustness, Pattern Recognition Letters.
  • (103) I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, in: International Conference on Learning Representations(ICLR), 2015.
  • (104) S. Chhabra, R. Singh, M. Vatsa, G. Gupta, Anonymizing k-facial attributes via adversarial perturbations, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 656–662.
  • (105) L. Song, Z. Lu, R. He, Z. Sun, T. Tan, Geometry guided adversarial facial expression synthesis, in: Proceedings of the ACM International Conference on Multimedia (ACMMM), ACM, 2018, pp. 627–635.
  • (106) Z. Lu, T. Hu, L. Song, Z. Zhang, R. He, Conditional expression synthesis with face parsing transformation, in: Proceedings of the ACM International Conference on Multimedia (ACMMM), ACM, 2018, pp. 1083–1091.
  • (107) Y. Hu, X. Wu, B. Yu, R. He, Z. Sun, Pose-guided photorealistic face rotation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • (108) M. E. Fathy, V. M. Patel, R. Chellappa, Face-based active authentication on mobile devices, in: Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 1687–1691.
  • (109) M. Günther, A. Costa-Pazo, C. Ding, E. Boutellaa, G. Chiachia, H. Zhang, M. de Assis Angeloni, V. Štruc, E. Khoury, E. Vazquez-Fernandez, et al., The 2013 face recognition evaluation in mobile environment, in: Proceedings of the International Conference on Biometrics (ICB), IEEE, 2013, pp. 1–7.
  • (110) A. Hadid, J. Heikkila, O. Silvén, M. Pietikainen, Face and eye detection for person authentication in mobile phones, in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, IEEE, 2007, pp. 101–108.
  • (111) P. Samangouei, V. M. Patel, R. Chellappa, Facial attributes for active authentication on mobile devices, Image and Vision Computing 58 (2017) 181–192.
  • (112) T. Li, R. Qian, C. Dong, S. Liu, Q. Yan, W. Zhu, L. Lin, Beautygan: Instance-level facial makeup transfer with deep generative adversarial network, in: Proceedings of the ACM International Conference on Multimedia (ACMMM), ACM, 2018, pp. 645–653.
  • (113) H. Chang, J. Lu, F. Yu, A. Finkelstein, Pairedcyclegan: Asymmetric style transfer for applying and removing makeup, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 40–48.
  • (114) J. Suo, S.-C. Zhu, S. Shan, X. Chen, A compositional and dynamic model for face aging, IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI) 32 (3) (2010) 385–401.
  • (115) A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, Y.-C. F. Wang, A unified feature disentangler for multi-domain image translation and manipulation, in: Advances in Neural Information Processing Systems (NIPS), 2018, pp. 2591–2600.
  • (116) Y. Zhang, Xogan: One-to-many unsupervised image-to-image translation, arXiv preprint arXiv:1805.07277.
  • (117) D. Berthelot, C. Raffel, A. Roy, I. Goodfellow, Understanding and improving interpolation in autoencoders via an adversarial regularizer, arXiv preprint arXiv:1807.07543.
  • (118) G. Dorta, S. Vicente, N. D. Campbell, I. Simpson, The gan that warped: Semantic attribute editing with unpaired data, arXiv preprint arXiv:1811.12784.