The human face contains a lot of important information related to individual characteristics, such as identity, expression, age, attractiveness and gender. Such information has been widely applied in real-world applications such as video surveillance, customer profiling, human-computer interaction and person identification. Among these tasks, developing automatic age and attractiveness estimation methods has become an attractive yet challenging topic in recent years.
Why is it challenging to estimate age or attractiveness from facial images? First, compared with image classification deng2009imagenet or face recognition parkhi2015deep; guo2016ms; shakeel2019deep, existing facial attribute datasets are small, because it is very hard to gather a dataset that is both complete and sufficiently labeled. Table 1 lists detailed information on various facial attribute datasets. For example, there are only 2476 training images in the ChaLearn15 apparent age estimation challenge escalera2015chalearn. Second, the number of images is highly imbalanced across label groups, and the label distributions also differ considerably between datasets. As Fig. 1 depicts, there are two peaks on ChaLearn16 (the earlier one at around 2 years old and the later one at 26 years old) and on Morph (the earlier one at around 20 years old and the later one at 40 years old), while ChaLearn15 has only one peak. A similar phenomenon appears in facial attractiveness datasets. These imbalances pose a serious challenge for developing an unbiased estimation system. Third, compared to other facial attributes such as gender or expression, age/attractiveness estimation is a very fine-grained recognition task; e.g., humans can hardly perceive the change in a person's facial characteristics as he/she grows from 25 to 26 years old.
The common evaluation metric for age/attractiveness estimation is the Mean Absolute Error (MAE) between the predicted value and the ground-truth. Thus, it is natural to treat facial attribute estimation as a metric regression problem ranjan2017all which minimizes the MAE. However, such methods usually cannot achieve satisfactory performance because outliers may cause a large error term, which leads to an unstable training procedure. Later, Rothe et al. rothe2016deep trained a deep convolutional neural network (CNN) for age estimation as multi-class classification, which maximizes the probability of the ground-truth class without considering other classes. This method easily over-fits because of the imbalance among classes and the limited number of training images gao2017deep.
Recently, ranking CNN niu2016ordinal; chen2017using; Chen2017Deep; li2017d2c and deep label distribution learning (DLDL) gao2017deep; fan2017label techniques achieved state-of-the-art performance on facial age estimation. Both methods use the correlation information among adjacent ages at different levels. The ranking method transforms single-value estimation into a series of binary classification problems at the training stage. Then, at the prediction stage, the output of the rankers is aggregated directly from these binary outputs. DLDL first converts a real value to a discrete label distribution. Then, the aim of training is to fit the entire distribution. At the inference stage, as in rothe2016deep, the expected value over the predicted distribution is taken as the final output. In all these methods, there is an inconsistency between the training objectives and the evaluation metric. Thus, they may be suboptimal. We expect their performance to improve if this inconsistency is removed.
In addition, we observe that almost all state-of-the-art facial attribute estimation methods rodriguez2017age; fan2017label; rothe2016deep; gao2017deep; antipov2016apparent are initialized by a pre-trained model which is trained on large-scale image classification (e.g., ImageNet deng2009imagenet) or face recognition (e.g., VGGFace parkhi2015deep) datasets, and fine-tuned on the target dataset. These pre-trained models adopt popular and powerful architectures (e.g., VGGNet simonyan2015very). Unfortunately, these models often have huge computational cost and storage overhead. Taking VGG16 as an example, it has 138.34 million parameters, taking up more than 500MB of storage space. Therefore, it is hard to deploy on resource-constrained devices, e.g., mobile phones. Recently, some researchers have worked on compressing these pre-trained models so that the number of parameters is reduced while accuracy is preserved iccv2017ThiNet. Unlike these compression methods, we directly design a thin and deep network architecture and train it from scratch.
In this paper, we integrate label distribution learning geng2016label and expectation regression into a unified framework to alleviate the inconsistency between training and evaluation stages with a simple and lightweight CNN architecture. The proposed approach effectively and efficiently improves the performance of the previous DLDL on both prediction error and inference speed for facial attributes estimation, so we call it DLDL-v2. Our contributions are summarized as follows.
We provide, to the best of our knowledge, the first analysis and show that the ranking method is in fact learning label distribution implicitly. This result thus unifies existing state-of-the-art facial attributes estimation methods into the DLDL framework;
We propose an end-to-end learning framework which jointly learns label distribution with the correlation information among neighboring labels and regresses single label ground-truth in both feature learning and classifier learning;
We create new state-of-the-art results on facial age and attractiveness estimation tasks using a single small model, without external age/attractiveness labeled data or multi-model ensemble;
Our proposed framework is partly interpretable. We find that the network employs different patterns to estimate age for people at different age stages. Meanwhile, we also quantitatively analyze the sensitivity of our approach to different face regions.
We organize the rest of this paper as follows. Related works on facial attributes (e.g., age and attractiveness) estimation are introduced in Section 2. Then, Section 3 presents the proposed DLDL-v2 approach including the problem definition, the relationship between existing methods, and our joint learning framework and its model architecture. After that, the experiments are reported in Section 4. In Section 5, we discuss how DLDL-v2 makes the final determination for an input facial image and analyze why it can work well. Finally, the conclusion is given in Section 6. Some preliminary results have been published in a conference presentation gao2018dldlv2.
2 Related Works
In the past two decades, many researchers have worked on facial attribute estimation. Earlier studies are two-stage solutions, consisting of feature extraction and model learning. More recently, deep learning methods have been proposed, which integrate both stages into an end-to-end framework. In this section, we briefly review these two types of frameworks.
Two stage methods. The task of the first stage is to extract discriminative features from facial images. The active appearance model (AAM) cootes2001active is the earliest method, which extracts shape and appearance features of face images. Later, the Bio-inspired feature (BIF) guo2009human, the most successful age feature, became widely used in age estimation. In face attractiveness analysis, however, geometric features zhang2011quantitative and texture features kagian2007humanlike that depend on facial landmark positions are widely used, since the BIF feature may be suboptimal for facial attractiveness prediction. The obvious drawback of hand-designed features is that one needs to re-design the feature extraction method when facing a new task, which usually requires domain knowledge and a lot of effort. The task of the second stage is to estimate facial attributes accurately using these designed features. Classification and regression models are often used to estimate facial attributes. The former include k-nearest neighbors (KNN), multilayer perceptron (MLP) and support vector machine (SVM), and the latter include quadratic regression, support vector regression (SVR) and soft-margin mixture regression huang2017soft. Instead of classification and regression, ranking techniques chang2011ordinal; chen2013cumulative; Wang2015Relative; Li2015Human; Wan2018Auxiliary utilize the ordinal information of age to learn a model for facial age estimation.
In addition, Geng et al. proposed a label distribution learning (LDL) approach to utilize the correlation among adjacent labels, which improved performance on age estimation geng2013facial and beauty sensing rensense. Recently, some improvements of LDL xing2016logistic; he2017data have been proposed. Xing et al. xing2016logistic used logistic boosting regression instead of the maximum entropy model in LDL. Meanwhile, He et al. he2017data generated age label distributions through weighted linear combination of the label of input image and that of its context-neighboring images. These methods only learn a classifier, but not the visual representations.
Single stage methods. Deep CNNs have achieved impressive performance on various visual recognition tasks. The greatest success is learning feature representations instead of using hand-crafted features via the single stage learning strategy. Existing facial attribute estimation techniques fall into four categories: metric regression (MR) ranjan2017all, multi-class classification (DEX) rothe2016deep, Ranking niu2016ordinal; chen2017using; Chen2017Deep and DLDL gao2017deep.
MR treats age estimation as a real-valued regression problem. The training procedure usually minimizes the squared difference between the estimated value and the ground-truth.
DEX adopts a general image classification framework which maximizes the probability of the ground-truth class during training. In the inference stage, Rothe et al. rothe2016deep empirically showed that the expected value over the softmax-normalized output probabilities can achieve better performance than the class prediction of maximum probability. However, both MR and DEX easily lead to unstable training gao2017deep.
Ranking methods transform facial attribute regression into a series of binary classification problems. Niu et al. niu2016ordinal proposed a multi-output CNN by integrating multiple binary classification problems into a single CNN. Then, Chen et al. chen2017using; Chen2017Deep trained a series of binary classification CNNs to get better performance. Given a testing image, the output of the rankers is aggregated directly from these binary outputs.
DLDL converts a single value to a label distribution and learns it in an end-to-end fashion. Recently, Shen et al. shen2017label proposed LDLFs by combining DLDL and differentiable decision trees. Hu et al. hu2017facial exploited age difference information to improve the age estimation accuracy. These approaches have achieved state-of-the-art performance on age estimation. In addition, Yang et al. yangjoint proposed a multi-task deep framework that jointly optimizes image classification and distribution learning for emotion recognition. However, these methods may be suboptimal, because there is an inconsistency between the training objectives and the evaluation metric.
In this paper, we focus on how to alleviate or remove this inconsistency in a deep CNN with fewer parameters. Age and attractiveness estimation from still face images are suitable applications of the proposed research.
3 Our Approach
In this section, we firstly give the definition of the joint learning problem. Next, we show that ranking is implicitly learning label distribution. Finally, we present our framework and network architecture.
3.1 The Joint Learning Problem
Notation. We use boldface lowercase letters like $\boldsymbol{h}$ to denote vectors, and the $i$-th element of $\boldsymbol{h}$ is denoted as $h_i$. $\boldsymbol{1}$ denotes a vector of ones. Boldface uppercase letters like $\boldsymbol{W}$ are used to denote matrices, and the element in the $i$-th row and $j$-th column is denoted as $W_{ij}$. The circle operator $\circ$ is used to denote element-wise multiplication.
The input space is $\mathcal{X} = \mathbb{R}^{h \times w \times c}$, where $h$, $w$ and $c$ are the height, width and number of channels of an input image, respectively. The label space $\mathcal{Y} = \mathbb{R}$ is real-valued. A training set with $N$ instances is denoted as $D = \{(\boldsymbol{x}_n, y_n)\}_{n=1}^{N}$, where $\boldsymbol{x}_n$ denotes the $n$-th input image and $y_n$ its corresponding label. We may omit the image index for clarity. The joint learning aims to learn a mapping function $\mathcal{F}: \mathcal{X} \to \mathcal{Y}$ such that the error between the prediction $\hat{y}$ and the ground-truth $y$ is as small as possible on a given input image $\boldsymbol{x}$.
However, metric regression often cannot achieve satisfactory performance. We observe that, in real life, people usually describe another person's apparent age in a way like "around 25 years old", which indicates using not only 25 but also neighboring ages (e.g., 24 and 26) to describe the face. A similar situation arises in facial attractiveness assessment. Based on this observation, label distribution learning methods exploit this information by transforming the single-value regression problem into a label distribution learning problem.
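To fulfill this goal, instead of outputting a single value $y$ for an input $\boldsymbol{x}$, we quantize the range of possible $y$ values into several labels. For example, it is reasonable to assume that $y$ is bounded, $y_{\min} \le y \le y_{\max}$, in age estimation. Thus, we can define $\boldsymbol{l} = [y_{\min} : \Delta l : y_{\max}]$ (MATLAB notation) as the ordered label vector, where $\Delta l$ is a fixed real number. A label distribution is then $\boldsymbol{p} = [p_1, p_2, \dots, p_K]$, where $p_k$ is the probability that $y = l_k$ (i.e., $p_k = P(y = l_k)$ for $k = 1, 2, \dots, K$) gao2017deep. Since we use an equal step size $\Delta l$ in quantizing $y$, we generate the ground-truth distribution $\boldsymbol{p}$ from $y$ and $\sigma$ with a normal p.d.f.:
$$p_k = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(l_k - y)^2}{2\sigma^2}\right), \tag{1}$$
where $\sigma$ is a hyper-parameter. The goal of label distribution learning is to maximize the similarity between $\boldsymbol{p}$ and the CNN-generated distribution $\hat{\boldsymbol{p}}$ at the training stage. At the prediction stage, the predicted distribution $\hat{\boldsymbol{p}}$ is converted back to a single value $\hat{y}$ by a special inference function. This is suboptimal because the training objective is inconsistent with the evaluation metric. We aim not only to learn the label distribution but also to regress a real value, within one framework and in an end-to-end manner.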
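The label encoding described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `label_distribution`, the label grid 0 to 100 with unit step, and the final normalization (so that the entries sum to one) are assumptions made for the example.

```python
import math

def label_distribution(y, sigma, labels, normalize=True):
    """Discretized normal p.d.f. centred on the true label y."""
    coef = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    p = [coef * math.exp(-((l - y) ** 2) / (2.0 * sigma ** 2)) for l in labels]
    if normalize:
        # Assumption: rescale so that the entries form a proper distribution.
        total = sum(p)
        p = [v / total for v in p]
    return p

# Hypothetical setting: ages 0..100 in steps of one year, "around 25 years old".
labels = list(range(0, 101))
p = label_distribution(y=25.0, sigma=2.0, labels=labels)
peak = labels[max(range(len(p)), key=p.__getitem__)]
```

A larger `sigma` spreads more probability mass onto neighboring ages, which is how the encoding expresses uncertainty about the single ground-truth value.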
3.2 Ranking is Learning Label Distribution
The ranking-based niu2016ordinal; chen2017using; Chen2017Deep and DLDL-based gao2017deep; shen2017deep; shen2017label; fan2017label methods have achieved state-of-the-art performance in facial age/attractiveness estimation problems. In this section, we analyze the essential relationship between them.
We explore their relationship from the perspective of label encoding. In DLDL-based approaches, for a face image $\boldsymbol{x}$ with true label $y$ and hyper-parameter $\sigma$, the target vector $\boldsymbol{p}$ (i.e., label distribution) is generated by a normal p.d.f. (Eq. (1)). For example, the target vector of a 50-year-old face is shown in Fig. 2a. In ranking CNN, $K-1$ binary classifiers are required for $K$ ranks because the $k$-th binary classifier focuses on determining whether the age rank of an image is greater than $l_k$ or not. For a face image $\boldsymbol{x}$ with true label $y$, the target vector $\boldsymbol{p}^{rank}$ with length $K-1$ is encoded as $[1, \dots, 1, 0, \dots, 0]$, where the first values (those with $l_k < y$) are 1 and the rest are 0. The target ranking vector of a 50-year-old face is shown in Fig. 2c as the dark line.
(a) and (b) show p.d.f. and c.d.f. curves with the same mean and different standard deviation. (c) shows the curves of one minus c.d.f. and ranking encoding (Best viewed in color).
As is well known, for a generic normal distribution with p.d.f. $f(x)$, mean $\mu$ and standard deviation $\sigma$, the cumulative distribution function (c.d.f.) is
$$F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right].$$
As shown in Fig. 2c, the curve of $1 - F(x)$ is very close to that of the ranking encoding when $\sigma$ is a small positive real number. Thus,
$$p^{rank}_k \approx 1 - F(l_k). \tag{4}$$
Eq. (4) shows that ranking encoding is a specific case of label distribution learning, where the distribution is the cumulative one with a small $\sigma$. That is to say, Ranking essentially learns a c.d.f., while DLDL aims at learning a p.d.f. More generally, for any $\sigma$ we have
$$1 - F(l_k) = 1 - \int_{-\infty}^{l_k} f(x)\,\mathrm{d}x.$$
Therefore, there is a linear relationship between Ranking encoding and label distribution. The label distribution encoding can represent more meaningful age/attractiveness information with different $\sigma$, but ranking encoding cannot. Furthermore, DLDL is more efficient, because only one network has to be trained.
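The relationship discussed in this subsection can be checked numerically. The sketch below is illustrative only: the label grid 0 to 100, the values of `sigma`, and all function names are assumptions for the example. It compares the ranking encoding with one minus the normal c.d.f. for a small standard deviation, and then compares one minus the running sum of a discretized normal p.d.f. (a linear function of the label distribution) with one minus the c.d.f. evaluated half a step to the right of each label.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ranking_encoding(y, labels):
    # k-th binary target: is the true label greater than l_k?
    return [1.0 if y > l else 0.0 for l in labels]

def discretized_normal(y, sigma, labels):
    w = [math.exp(-((l - y) ** 2) / (2.0 * sigma ** 2)) for l in labels]
    s = sum(w)
    return [v / s for v in w]

labels = list(range(0, 101))   # assumed label grid with unit step
y = 50

# (a) Ranking encoding versus 1 - c.d.f. with a small sigma.
rank = ranking_encoding(y, labels)
soft = [1.0 - normal_cdf(l, y, 0.1) for l in labels]
gaps = [abs(r - s) for r, s in zip(rank, soft)]

# (b) 1 - cumulative sum of the label distribution versus 1 - c.d.f.
p = discretized_normal(y, 2.0, labels)
cum, running = [], 0.0
for v in p:
    running += v
    cum.append(running)
lin_gap = max(abs((1.0 - c) - (1.0 - normal_cdf(l + 0.5, y, 2.0)))
              for c, l in zip(cum, labels))
```

In this toy setting the two encodings in (a) differ only at the label equal to `y`, where the c.d.f. passes through one half, and the cumulative sum in (b) tracks the c.d.f. closely.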
However, as discussed earlier, all these methods may be suboptimal because there exists inconsistency between training objective and evaluation metric.
3.3 Joint Learning Framework
In order to jointly learn label distribution and output the expectation, in this section we propose the DLDL-v2 framework.
3.3.1 The Label Distribution Learning Module
In order to utilize the good properties of label distribution learning, we integrate it into our framework as a label distribution learning module. As shown in Fig. 3, this module includes a fully connected layer, a softmax layer and a loss layer. This module follows the DLDL method in gao2017deep.
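Specifically, given an input image $\boldsymbol{x}$ and the corresponding label distribution $\boldsymbol{p}$, we assume $\boldsymbol{f} = \mathcal{F}(\boldsymbol{x}; \boldsymbol{\theta})$ is the activation of the last layer of the CNN, where $\boldsymbol{\theta}$ denotes the parameters of the CNN. A fully connected layer transfers $\boldsymbol{f}$ to $\boldsymbol{z}$ by
$$\boldsymbol{z} = \boldsymbol{W}^{T}\boldsymbol{f} + \boldsymbol{b}. \tag{7}$$
Then, we use a softmax function to turn $\boldsymbol{z}$ into a probability distribution, that is,
$$\hat{p}_k = \frac{\exp(z_k)}{\sum_{t}\exp(z_t)}. \tag{8}$$
Given an input image, the goal of the label distribution learning module is to find $\boldsymbol{\theta}$, $\boldsymbol{W}$ and $\boldsymbol{b}$ to generate $\hat{\boldsymbol{p}}$ that is similar to $\boldsymbol{p}$.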
3.3.2 The Expectation Regression Module
Note that the label distribution learning module only learns a label distribution but cannot regress a precise value. In order to reduce the inconsistency between training and evaluation stages, we propose an expectation regression module to further refine the predicted value. As shown in Fig. 3, this module includes an expectation layer and a loss layer.
The expectation layer takes the predicted distribution $\hat{\boldsymbol{p}}$ and the label set $\boldsymbol{l}$ as input and emits its expectation
$$\hat{y} = \sum_{k} \hat{p}_k l_k, \tag{10}$$
where $\hat{p}_k$ denotes the predicted probability that the input image belongs to label $l_k$. Given an input image, the expectation regression module minimizes the error between the expected value $\hat{y}$ and the ground-truth $y$. We use the $\ell_1$ loss as the error measurement as follows:
$$L_{er} = |\hat{y} - y|,$$
where $|\cdot|$ denotes absolute value. Note that this module does not introduce any new parameter.
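The forward pass of the expectation regression module can be sketched as follows. This is a toy, framework-free illustration: the five-label grid, the logits and the target value are made up for the example, and the function names are not taken from the original implementation.

```python
import math

def softmax(z):
    m = max(z)                          # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def expectation(p_hat, labels):
    # Expected label under the predicted distribution.
    return sum(p * l for p, l in zip(p_hat, labels))

def l1_loss(y_hat, y):
    return abs(y_hat - y)

labels = [0.0, 1.0, 2.0, 3.0, 4.0]      # hypothetical label grid
logits = [0.0, 0.0, 2.0, 0.0, 0.0]      # symmetric around label 2
p_hat = softmax(logits)
y_hat = expectation(p_hat, labels)
loss = l1_loss(y_hat, 3.0)
```

Because the module only combines the softmax output with the fixed label vector, it adds no trainable parameters, in line with the remark above.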
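Given a training data set $D$, the learning goal of our framework is to find $\boldsymbol{\theta}$, $\boldsymbol{W}$ and $\boldsymbol{b}$ via jointly learning label distribution and expectation regression. Thus, our final loss function is a weighted combination of the label distribution loss $L_{ld}$ and the expectation regression loss $L_{er}$:
$$L = L_{ld} + \lambda L_{er},$$
where $\lambda$ is the weight balancing the two terms and, following DLDL gao2017deep, $L_{ld} = \sum_{k} p_k \ln (p_k / \hat{p}_k)$ is the Kullback-Leibler divergence between the ground-truth and the predicted distributions.
We adopt stochastic gradient descent to optimize the parameters of our model. The derivative of $L$ with respect to $\hat{p}_k$ is
$$\frac{\partial L}{\partial \hat{p}_k} = -\frac{p_k}{\hat{p}_k} + \lambda\,\mathrm{sign}(\hat{y} - y)\, l_k.$$
For any $k$ and $j$, the derivative of softmax (Eq. (8)) with respect to its input $z_j$ (the output of the fully connected layer in Eq. (7)) is well known, as
$$\frac{\partial \hat{p}_k}{\partial z_j} = \hat{p}_k(\delta_{kj} - \hat{p}_j),$$
where $\delta_{kj}$ is 1 if $k = j$, and 0 otherwise. According to the chain rule, and using $\sum_{k} p_k = 1$, we have
$$\frac{\partial L}{\partial z_j} = \hat{p}_j - p_j + \lambda\,\mathrm{sign}(\hat{y} - y)\,\hat{p}_j (l_j - \hat{y}).$$
Applying the chain rule to Eq. (7) again, the derivatives of $L$ with respect to $\boldsymbol{W}$, $\boldsymbol{b}$ and $\boldsymbol{\theta}$ are easily obtained, as
$$\frac{\partial L}{\partial \boldsymbol{W}} = \frac{\partial L}{\partial \boldsymbol{z}}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}, \quad \frac{\partial L}{\partial \boldsymbol{b}} = \frac{\partial L}{\partial \boldsymbol{z}}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{b}}, \quad \frac{\partial L}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \boldsymbol{z}}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{f}}\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{\theta}}.$$
Once $\boldsymbol{\theta}$, $\boldsymbol{W}$ and $\boldsymbol{b}$ are learned, the predicted value $\hat{y}$ of any new instance $\boldsymbol{x}$ is generated by Eq. (10) in a forward network computation.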
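To make the joint objective concrete, the following sketch evaluates a weighted sum of a label distribution loss and the expectation regression loss on a toy example, and checks an analytic gradient with respect to the softmax inputs against finite differences. Several choices here are assumptions for illustration: the Kullback-Leibler divergence is used as the label distribution loss (as in DLDL), the weight `lam` is set to 1.0, the target distribution is normalized to sum to one, and all names and numbers are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def joint_loss(z, p, labels, y, lam=1.0):
    p_hat = softmax(z)
    # Assumed label distribution loss: KL(p || p_hat).
    kl = sum(pk * math.log(pk / qk) for pk, qk in zip(p, p_hat) if pk > 0.0)
    y_hat = sum(qk * lk for qk, lk in zip(p_hat, labels))
    return kl + lam * abs(y_hat - y)

def joint_grad(z, p, labels, y, lam=1.0):
    # Analytic gradient w.r.t. the logits, valid when sum(p) == 1.
    p_hat = softmax(z)
    y_hat = sum(qk * lk for qk, lk in zip(p_hat, labels))
    sign = (y_hat > y) - (y_hat < y)
    return [(qk - pk) + lam * sign * qk * (lk - y_hat)
            for qk, pk, lk in zip(p_hat, p, labels)]

def numeric_grad(z, p, labels, y, lam=1.0, h=1e-6):
    grad = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + h] + z[k + 1:]
        down = z[:k] + [z[k] - h] + z[k + 1:]
        diff = joint_loss(up, p, labels, y, lam) - joint_loss(down, p, labels, y, lam)
        grad.append(diff / (2.0 * h))
    return grad

labels = [0.0, 1.0, 2.0, 3.0, 4.0]
p = [0.05, 0.25, 0.40, 0.25, 0.05]      # toy target distribution
z = [0.3, -0.2, 0.1, 0.8, -0.5]         # toy logits
analytic = joint_grad(z, p, labels, y=3.0)
numeric = numeric_grad(z, p, labels, y=3.0)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The first gradient term pulls the predicted distribution toward the target distribution, while the second shifts probability mass so that the expected value moves toward the ground-truth.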
3.4 Network Architecture
Considering both model size and efficiency, we modify VGG16 simonyan2015very in four aspects. VGG16 consists of 13 convolution (Conv) layers, five max-pooling (MP) layers and three fully connected (FC) layers, and each Conv layer and FC layer is followed by a ReLU layer.
First, we observe that the three FC layers contain roughly 90% of the parameters of the whole model. We remove all FC layers and add a hybrid-pooling (HP) layer, which is constructed from an MP layer and a global average-pooling (GAP) layer. We find that the HP strategy is more effective than a single GAP. Second, to further reduce model size, we reduce the number of filters in each Conv layer to make the network thinner. Third, batch normalization (BN) ioffe2015batch has been widely used in recent architectures such as ResNet he2016deep. Thus, we add a BN layer after each Conv layer to accelerate network training. Last but not least, we add the label distribution learning module and the expectation regression module after the HP layer, as shown in Fig. 3.
Since we design the network for age/attractiveness estimation and its architecture is thinner than the original VGG16, we call our model ThinAgeNet or ThinAttNet; it employs a compression rate of 0.5 and has 3.7M parameters (a compression rate of 0.5 means that every Conv layer has only 50% as many channels as in VGG16). We also train a very small model with a compression rate of 0.25, which we call TinyAgeNet or TinyAttNet and which has only 0.9M parameters.
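The parameter budget discussed in this subsection can be approximated with simple arithmetic. The sketch below counts weights and biases of the standard VGG16 layout and of width-reduced variants. It is a rough estimate under stated assumptions, not the authors' exact accounting: 3×3 convolutions with biases, two BN parameters per channel, no FC layers other than a final 101-way layer (ages 0 to 100), and hypothetical function names.

```python
# Output channels of the 13 Conv layers in the standard VGG16 layout.
VGG16_CONV = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

def conv_params(widths, in_ch=3, k=3, batch_norm=False):
    total = 0
    for out_ch in widths:
        total += in_ch * out_ch * k * k + out_ch   # weights + biases
        if batch_norm:
            total += 2 * out_ch                     # BN scale and shift
        in_ch = out_ch
    return total

def vgg16_params():
    conv = conv_params(VGG16_CONV)
    fc = ((7 * 7 * 512 * 4096 + 4096)     # fc6
          + (4096 * 4096 + 4096)          # fc7
          + (4096 * 1000 + 1000))         # fc8 (1000 ImageNet classes)
    return conv, fc

def thin_params(rate, num_labels=101):
    # Assumption: every Conv layer keeps `rate` of the VGG16 channels,
    # followed by pooling and a single FC layer over `num_labels` labels.
    widths = [int(c * rate) for c in VGG16_CONV]
    conv = conv_params(widths, batch_norm=True)
    head = widths[-1] * num_labels + num_labels
    return conv + head

conv, fc = vgg16_params()
fc_share = fc / (conv + fc)      # fraction of VGG16 parameters in FC layers
thin = thin_params(0.5)
tiny = thin_params(0.25)
```

Under these assumptions the FC layers account for just under 90% of VGG16, and the two reduced-width variants land close to the 3.7M and 0.9M figures quoted above.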
4 Experiments
In this section, we conduct experiments to validate the effectiveness of the proposed DLDL-v2 approach on facial age and attractiveness datasets, based on the open source framework Torch7. All experiments are conducted on an NVIDIA M40 GPU. To make all results in this paper reproducible, we will release source code and pre-trained models upon paper acceptance.
4.1 Implementation Details
Pre-processing. We use multi-task cascaded CNN zhang2016joint to conduct face detection and facial landmark detection for all images. Then, based on these facial points, we align faces to the upright pose. Finally, all faces are cropped and resized to a fixed input size. Before being fed to the network, each resized image is normalized by subtracting the mean and dividing by the standard deviation of each color channel.
| Method | Venue | Single model? | #Param (M) | Time (ms) | ChaLearn15 MAE | ChaLearn15 $\epsilon$-error | ChaLearn16 MAE | ChaLearn16 $\epsilon$-error | Morph MAE |
|---|---|---|---|---|---|---|---|---|---|
| Human han2015demographic | TPAMI 2015 | - | - | - | - | 0.340 | - | - | 6.30 |
| OR-CNN niu2016ordinal | CVPR 2016 | Yes | - | - | - | - | - | - | 3.27 |
| DEX rothe2016deep | IJCV 2016 | Yes | 134.6 | 133.30 | 5.369 | 0.456 | - | - | 3.25 |
| DEX rothe2016deep | IJCV 2016 | Yes | 134.6 | 133.30 | 3.252 | 0.282 | - | - | 2.68 |
| DLDL gao2017deep | TIP 2017 | Yes | 134.6 | 133.30 | 3.510 | 0.310 | - | - | 2.42¹ |
| Rank-CNN chen2017using; Chen2017Deep | CVPR 2017 | No | - | - | - | - | - | - | 2.96 |
| LDAE antipov2016apparent; antipov2017effective | PR 2017 | No | 1480.6 | 1446.30 | - | - | - | 0.241² | 2.35 |
| DLDLF shen2017label | NIPS 2017 | Yes | - | - | - | - | - | - | 2.24 |
| DRFs shen2017deep | CVPR 2018 | Yes | - | - | - | - | - | - | 2.17 |
¹ Used 90% of Morph images for training and 10% for evaluation;
² Used multi-model ensemble.
Data Augmentation. There are many uncontrolled environmental factors such as face position, illumination, diverse backgrounds, image color (i.e., gray and color) and image quality, especially in the ChaLearn datasets. To handle these issues, we apply data augmentation techniques to every training image, so that the network sees a different variation of the original image at each training epoch. Specifically, we employ five types of augmentation for a cropped and resized training image: random horizontal flipping, random scaling, random color/gray changing, random rotation and standard color jittering.
Training Details. We pre-train a deep CNN model with softmax loss for face recognition on a subset of the MS-Celeb-1M dataset guo2016ms. One issue with this dataset is that a small fraction of identities have a large number of images while the others have only a few. To avoid this imbalance among identities, we discard those identities whose number of images is lower than a threshold. In our experiments, we use about 5M images of 54K identities as training data.
After pre-training is finished, we remove the classification layer of the network and add the label distribution learning and expectation regression modules. Then, fine-tuning is conducted on target datasets. We set in Eq. (14). The ordered label vector is defined as (MATLAB notation). For age estimation, we set , , and . For attractiveness estimation, we set and . Because there are different scoring rules on SCUT-FBP and CFD dataset, is set to 5 and 7, respectively. The label distribution of each image is generated using Eq. (1). The ground-truth (age or attractiveness score) is provided in all datasets. The standard deviation, however, is provided in ChaLearn15, ChaLearn16 and SCUT-FBP, but not Morph and CFD. We simply set in Morph and in CFD. All networks are optimized by Adam, with , and . The initial learning rate is 0.001 for all models, and it is decreased by a factor of 10 every 30 epochs. Each model is trained 60 epochs using mini-batches of 128.
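The step-wise learning rate schedule described above (initial rate 0.001, divided by 10 every 30 epochs, 60 epochs in total) can be written as a one-line function. The function and argument names are illustrative only.

```python
def learning_rate(epoch, base_lr=0.001, drop=10.0, step=30):
    """Step decay: divide the base rate by `drop` every `step` epochs."""
    return base_lr / (drop ** (epoch // step))

# Learning rate for each of the 60 training epochs (0-indexed).
schedule = [learning_rate(epoch) for epoch in range(60)]
```

With these settings the rate changes only once during training, at the start of epoch 30.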
Inference Details. At the inference stage, we feed a testing image and its horizontally flipped copy into the network and average the two predictions as the final estimate for the image.
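The flip-averaging step at inference can be sketched as follows. The image is represented as a plain list of rows, and `toy_predict` is a hypothetical stand-in for the trained network, introduced only so that the example runs on its own.

```python
def hflip(image):
    """Horizontally flip an image given as a list of rows."""
    return [row[::-1] for row in image]

def predict_with_flip(predict, image):
    # Average the prediction on the image and on its mirrored copy.
    return 0.5 * (predict(image) + predict(hflip(image)))

def toy_predict(image):
    # Hypothetical model: a position-weighted sum, so flipping changes the output.
    return sum(v * (j + 1) for row in image for j, v in enumerate(row))

image = [[1.0, 0.0],
         [0.0, 0.0]]
estimate = predict_with_flip(toy_predict, image)
```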
4.2 Evaluation Metrics
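MAE is used to evaluate the performance of facial age or attractiveness estimation,
$$\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N} |\hat{y}_n - y_n|,$$
where $N$ is the number of testing images, and $\hat{y}_n$ and $y_n$ are the estimated and the ground-truth value of the $n$-th testing image, respectively. In addition, a special measurement ($\epsilon$-error) is defined by the ChaLearn competition, as
$$\epsilon\text{-error} = \frac{1}{N}\sum_{n=1}^{N}\left(1 - \exp\left(-\frac{(\hat{y}_n - y_n)^2}{2\sigma_n^2}\right)\right),$$
where $\sigma_n$ is the standard deviation of the $n$-th testing image.
We also follow xie2015scut; fan2017label to compute the Root Mean Squared Error (RMSE) and the Pearson Correlation (PC), which can be computed as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N} (\hat{y}_n - y_n)^2}, \qquad \mathrm{PC} = \frac{\sum_{n=1}^{N}(y_n - \bar{y})(\hat{y}_n - \bar{\hat{y}})}{\sqrt{\sum_{n=1}^{N}(y_n - \bar{y})^2}\,\sqrt{\sum_{n=1}^{N}(\hat{y}_n - \bar{\hat{y}})^2}},$$
where $\bar{y}$ and $\bar{\hat{y}}$ are the mean values of the ground-truth and predicted scores over all testing images. These two evaluation metrics are only used to evaluate the performance of facial attractiveness estimation.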
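The four evaluation metrics in this subsection can be computed with a few standard-library helpers. This is a minimal sketch with made-up inputs. The $\epsilon$-error follows the usual ChaLearn apparent-age definition, which is assumed here to be the one intended, and the function names are illustrative.

```python
import math

def mae(pred, gt):
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def epsilon_error(pred, gt, sigma):
    # Per-image error in [0, 1), scaled by the annotators' standard deviation.
    terms = [1.0 - math.exp(-((p - g) ** 2) / (2.0 * s ** 2))
             for p, g, s in zip(pred, gt, sigma)]
    return sum(terms) / len(terms)

def rmse(pred, gt):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

def pearson(pred, gt):
    mp, mg = sum(pred) / len(pred), sum(gt) / len(gt)
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gt)
    return cov / math.sqrt(var_p * var_g)

# Toy predictions and ground-truth values (not taken from any dataset).
pred = [1.0, 2.0, 3.0]
gt = [2.0, 2.0, 5.0]
sigma = [1.0, 1.0, 1.0]
scores = (mae(pred, gt), epsilon_error(pred, gt, sigma), rmse(pred, gt), pearson(pred, gt))
```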
4.3 Experiments on Age Estimation
4.3.1 Age Estimation Datasets
Two types of datasets are used in our experiments. The first type contains two small-scale apparent age datasets (ChaLearn15 escalera2015chalearn and ChaLearn16 escalera2016chalearn) which are collected in the wild. The second type is a large-scale real age dataset (Morph) ricanek2006morph. We follow the experimental setting in gao2018dldlv2 for evaluation.
4.3.2 Age Estimation Results
We compare our approach with the state-of-the-art in both prediction performance and inference time.
Low Error. Table 2 compares the MAE and $\epsilon$-error of our method with those of previous state-of-the-art methods on three age estimation datasets.
In the ChaLearn15 challenge, the best result came from DEX. The success of DEX relies on a large number of external age-labeled training images (260282 additional photos). Under the same setting (without external data), our method outperforms DEX by a large margin, as shown in Table 6. On ChaLearn16, the $\epsilon$-error of our approach is the closest to the best competition result of 0.241 antipov2016apparent on the testing set. Note that our result is based on a single model without external age-labeled data, whereas antipov2016apparent not only used external age-labeled data but also employed a multi-model ensemble. On Morph, our method sets a new state-of-the-art MAE of 1.969. To the best of our knowledge, this is the first time an MAE below two years has been achieved on the Morph dataset.
In short, our DLDL-v2 (ThinAgeNet) outperforms the state-of-the-art methods without external age-labeled data and multi-model ensemble on ChaLearn15, ChaLearn16 and Morph.
High Efficiency. We measure the speed on one M40 GPU with batch size 32, accelerated by cuDNN v5.1. The number of parameters and the forward computation time of our approach and some previous methods are reported in Table 2. Since niu2016ordinal and chen2017using do not release pre-trained models, we can neither measure the running time nor report the number of parameters of these models. rothe2016deep, gao2017deep and antipov2016apparent all used similar network architectures (i.e., VGG16 or VGGFace). Since antipov2016apparent employed 11 models, its model size and running time are 11 times those of rothe2016deep and gao2017deep.
| Method | Venue | #Param (M) | Time (ms) | SCUT-FBP MAE | SCUT-FBP RMSE | SCUT-FBP PC | CFD MAE | CFD RMSE | CFD PC |
|---|---|---|---|---|---|---|---|---|---|
| Regression (G+Tfeats) xie2015scut | SMC 2015 | - | - | 0.393 | 0.515 | 0.648 | - | - | - |
| CNN (six-layer) xie2015scut | SMC 2015 | - | - | - | - | 0.819 | - | - | - |
| SLDL (LBP+Hog+Gabor) rensense¹ | IJCAI 2017 | - | - | 0.302 | 0.408 | - | - | - | - |
| LDL (ResNet50) fan2017label | TMM 2017 | 23.6 | 108.28 | 0.217 | 0.300 | 0.917 | - | - | - |
| LDL (ResNet50+GFeats) fan2017label² | TMM 2017 | 23.6 | 108.28 | 0.213 | 0.278 | 0.930 | - | - | - |
¹ Used ten-fold cross validation, 90% of images for training and 10% for evaluation;
² Used multi-features fusion.
Compared to the state-of-the-art, DLDL-v2 (ThinAgeNet) achieves the best performance using a single model with 36× fewer parameters and a 2.6× reduction in inference time. Furthermore, we also report DLDL-v2's TinyAgeNet results on these datasets. The tiny model can achieve a better result (with 150× fewer parameters and a 5.5× speed improvement) than the original DLDL gao2017deep.
4.3.3 Visual Assessment
Fig. 4 shows some examples on ChaLearn16 testing images using our DLDL-v2 ThinAgeNet. In many cases, our solution is able to predict the age of faces accurately. Failures may come from some special cases such as occlusion, low resolution, heavy makeup and extreme pose.
4.4 Experiments on Attractiveness Estimation
4.4.1 Attractiveness Estimation Datasets
To further demonstrate the effectiveness of the proposed DLDL-v2, we perform extensive experiments on two facial attractiveness datasets: SCUT-FBP xie2015scut and CFD ma2015chicago.
SCUT-FBP xie2015scut is a widely used facial beauty assessment dataset. It contains 500 Asian female faces with neutral expressions, simple backgrounds, no accessories, and minimal occlusion. Each face is scored by 75 workers on a 5-point scale, in which 1 means strong agreement about the face being the least attractive and 5 means strong agreement about the face being the most attractive. For each face, its mean score and the corresponding standard deviation are given. We follow the setting in fan2017label and xie2015scut: 80% of the images are randomly selected as the training set, and the remaining 20% form the testing set.
CFD ma2015chicago provides high-resolution and standardized photographs with meaningful annotations (e.g., attractiveness, babyfacedness and expression). Unlike SCUT-FBP, this dataset includes male and female faces of multiple ethnicities (Asian, Black, Latino, and White) between the ages of 17 and 65. Similar to SCUT-FBP, each face is scored by participants with diverse backgrounds on a 7-point scale (1 = Not at all, 7 = Extremely). In this study, we employ all 597 faces with neutral expression and the corresponding attractiveness scores for experiments. We use 80% of the images for training and the remaining 20% for testing.
4.4.2 Attractiveness Estimation Results
In Table 3, we report the performance on SCUT-FBP and CFD and compare with the state-of-the-art methods in the literature.
Compared with methods using hand-crafted features, such as Regression xie2015scut and SLDL rensense, the proposed DLDL-v2 (ThinAttNet) achieves 0.930 PC and 0.212 MAE on SCUT-FBP. It outperforms Regression xie2015scut by 0.282 in PC, and improves on SLDL rensense by 0.135 in RMSE. Moreover, our DLDL-v2 still outperforms methods using deep label distribution, such as LDL (ResNet50) fan2017label, one of the state-of-the-art methods. Furthermore, our method is comparable to the fusion of deep features and geometric features in fan2017label. There are two major reasons. First, our pre-trained model is trained on a face recognition dataset, which is closer to facial attractiveness than the object classification datasets used in fan2017label (ResNet50 is trained on ImageNet). Second, we jointly learn the label distribution and regress the facial attractiveness score in DLDL-v2, which can effectively remove the inconsistency between the training objective and the evaluation metric (MAE).
In terms of model size and inference time, as reported in Table 3, our DLDL-v2 (ThinAttNet), with 6× fewer parameters and 2.1× faster speed, performs comparably to the state-of-the-art fan2017label, which is a fusion of deep features and geometric features. Meanwhile, we also report the performance of DLDL-v2 (TinyAttNet), with 26× fewer parameters and 4.5× faster inference speed, which is still comparable to the one using only ResNet50 in fan2017label.
4.4.3 Visual Assessment
To visualize the prediction performance of our DLDL-v2 on the facial attractiveness task, we show in Fig. 5 the top eight and bottom eight test images ranked by the prediction scores of DLDL-v2 with ThinAttNet. Of the 16 selected testing images, the prediction scores of 12 closely match those of human raters. This result qualitatively demonstrates that our DLDL-v2 is able to generate human-like results. In addition, some possible facial attractiveness cues may be observed by comparing the top and bottom faces. Generally speaking, faces with higher attractiveness scores have smoother and lighter skin, an oval face with larger eyes, a narrower nose with a pointed tip, and better harmony among facial organs than those with lower scores.
4.5 Ablation Study and Diagnostic Experiments
DLDL-v2 (ThinAgeNet) is employed for the ablation study on facial age datasets in this section. We first investigate the efficacy of the proposed data augmentation and pooling strategy. For fair comparison, we fix both hyper-parameters (see Section 4.5.5). Then, to investigate the effectiveness of the proposed joint learning mechanism, we compare it with two stage and single stage methods under the same setting. Finally, we explore the sensitivity of DLDL-v2 to its hyper-parameters.
4.5.1 Influence of Data Augmentation
Data augmentation increases the amount of training data using only information in the training set, which is an effective way to reduce over-fitting in deep models. From Table 4, we observe MAE improvements of 0.26 to 0.27 on the apparent age datasets and 0.38 on Morph when using data augmentation. This indicates that data augmentation substantially improves age estimation performance.
4.5.2 Effectiveness of Pooling Strategy
GAP is one of the most popular and simple methods for aggregating the spatial information of feature maps in state-of-the-art network architectures such as ResNet he2016deep. It outputs the spatial average of each feature map of the last convolution layer. Max-pooling takes the maximal value of each small region in a feature map as its output. HP is constructed from a max-pooling layer followed by a GAP layer. HP first encourages the network to learn a discriminative feature in each small region via max-pooling, and then aggregates all discriminative features by GAP. Thus, the feature produced by HP is more discriminative than that of GAP. If we directly use global max-pooling instead of HP, the network easily over-fits during training. To explore the effect of the pooling strategy, we replace the traditional GAP with HP while keeping data augmentation. Table 4 shows that the proposed HP consistently reduces the prediction error on all datasets.
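The difference between GAP and HP can be illustrated on a single feature map. The following is a minimal pure-Python sketch, not the network's actual layer: the function names and the non-overlapping 2×2 max-pooling window are assumptions made for illustration, since the window size and stride are not given here.

```python
def max_pool2d(fmap, k=2):
    """Non-overlapping k x k max-pooling over a 2-D list (window size is assumed)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i + di][j + dj] for di in range(k) for dj in range(k))
         for j in range(0, w - k + 1, k)]
        for i in range(0, h - k + 1, k)
    ]


def global_avg_pool(fmap):
    """Global average pooling: the mean of all entries of the feature map."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def hybrid_pool(fmap, k=2):
    """Hybrid pooling sketch: local max-pooling followed by global averaging."""
    return global_avg_pool(max_pool2d(fmap, k))


# Toy 4 x 4 feature map with one strong local response (8.0).
feature_map = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 8.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 2.0, 1.0, 1.0],
]
gap = global_avg_pool(feature_map)  # 1.5
hp = hybrid_pool(feature_map)       # 3.75
```

In this toy case the HP output lies between the GAP output (1.5) and the global maximum (8.0): local peaks are emphasized, but the result still averages over all regions rather than depending on a single position.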
4.5.3 Comparisons with Two Stage Methods
We compare the proposed approach with two stage methods using two types of features. The first is BIF guo2009human, the most successful hand-crafted age feature, which has been widely used in age estimation. The second is CNN features extracted from our pre-trained face recognition model. For BIF, we adopt 6 bands and 8 orientations guo2009human, which produces 4616-dimensional features. The CNN features are extracted from the hybrid pooling layer of the pre-trained model and have 256 dimensions. These features are normalized, and no dimensionality reduction technique is applied.
We choose three classical age estimation algorithms: SVR guo2009human, OHRank chang2011ordinal and BFGS-LDL geng2013facial. For SVR and OHRank, the Liblinear software (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) is used to train the regression or classification models. For BFGS-LDL, we use the open source LDL package (http://ldl.herokuapp.com/download). Instead of predicting the age with the maximal probability as in geng2013facial, we use the expected value over the predicted distribution because it performs better.
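The two decoding rules mentioned above, maximal probability versus expected value, can be contrasted with a small sketch. The function names, label set and probabilities below are hypothetical values chosen for illustration, not outputs of BFGS-LDL.

```python
def predict_argmax(probs, labels):
    """Return the label with the highest predicted probability."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return labels[best]


def predict_expectation(probs, labels):
    """Return the expected label under the predicted distribution."""
    return sum(p * l for p, l in zip(probs, labels))


# Hypothetical predicted distribution over five age labels.
labels = [20, 21, 22, 23, 24]
probs = [0.05, 0.40, 0.35, 0.15, 0.05]

age_by_argmax = predict_argmax(probs, labels)            # 21
age_by_expectation = predict_expectation(probs, labels)  # about 21.75
```

The expected value uses the whole distribution and can fall between discrete labels, whereas the maximal-probability rule ignores the mass placed on neighboring ages.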
The experimental results are shown in Table 5. First, OHRank and BFGS-LDL have similar performance on all datasets with both BIF and CNN features. This further supports our previous analysis that ranking methods learn a label distribution. Second, our proposed approach significantly outperforms all baseline methods. The major reason is that two stage methods cannot learn visual representations. This suggests that it is crucial to jointly learn the visual features and the recognition model in an end-to-end manner. Finally, OHRank and BFGS-LDL are much better than SVR, which indicates that learning a label distribution helps to improve estimation performance.
4.5.4 Comparisons with Single Stage Methods
We employ six very strong methods under the same setting as baselines:
MR: In MR, the ground-truth label is projected to a normalized range to speed up convergence. MR is trained with two different loss functions.
DEX: In DEX, the true label is quantized into label groups, each of which is treated as a class. To train DEX, we only need to remove the expectation module from DLDL-v2 and change the loss function to cross-entropy loss. At inference time, the expected value over the predicted probabilities is used as the final estimate.
Ranking: In chen2017using; Chen2017Deep, multiple binary classification networks are trained independently, which makes training time-consuming and increases the storage overhead of the model. We propose a new multiple-output CNN and train these binary classifiers jointly. Specifically, we first remove the label distribution and expectation modules from DLDL-v2. Then, we add an FC layer followed by a sigmoid layer. To train the Ranking CNN, we employ binary cross-entropy loss. At the inference stage, the prediction is computed from the binary outputs using the truth-test operator, which is 1 if the inner condition is true and 0 otherwise. Our experiments showed that this new setup has a lower MAE than that in niu2016ordinal; chen2017using; Chen2017Deep.
ER: We employ only the expectation regression (ER) loss to optimize the parameters of DLDL-v2, removing the label distribution loss in Eq. (12).
DLDL: We set the loss weight in Eq. (12) so that only the label distribution loss remains when training DLDL-v2.
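To make the Ranking baseline more concrete, the following is a minimal sketch of one common way to encode a label as binary ranking targets and to decode the sigmoid outputs with a truth-test count. The exact inference formula is not reproduced here, so the thresholds, the 0.5 cutoff, the `min_label` offset and the function names are all assumptions for illustration.

```python
def encode_rank_targets(label, thresholds):
    """Binary targets: 1 if the label exceeds the threshold, else 0."""
    return [1 if label > t else 0 for t in thresholds]


def decode_rank_outputs(sigmoid_outputs, min_label=0, cutoff=0.5):
    """Count the binary classifiers that fire (truth test) and add the smallest label."""
    return min_label + sum(1 for p in sigmoid_outputs if p > cutoff)


# Assumed label range: ages 0..100 in unit steps, hence 100 thresholds.
thresholds = list(range(0, 100))
targets = encode_rank_targets(30, thresholds)

# Decoding the ideal targets recovers the original label.
recovered = decode_rank_outputs(targets)                           # 30
# Decoding noisy sigmoid outputs counts those above the cutoff.
noisy_example = decode_rank_outputs([0.9, 0.8, 0.6, 0.4, 0.1])     # 3
```

Training all binary outputs in one network, as described above, only changes how the sigmoid outputs are produced; the decoding step stays the same.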
Table 6 reports the results of all single stage methods. We can see that the MAE and ε-error of the Ranking, ER and DLDL methods are significantly lower than those of MR and DEX on all datasets. This indicates that utilizing the label distribution helps to reduce age estimation error. Meanwhile, we also find that the prediction error of Ranking is close to that of DLDL, which conforms to the analysis in Section 3.2. Furthermore, DLDL performs better than Ranking, which suggests that learning the p.d.f. is more effective than learning the c.d.f. It is noteworthy that ER and DLDL are two extreme cases of our DLDL-v2. DLDL-v2 consistently outperforms ER and DLDL on all datasets, which indicates that joint learning eases the difficulty of network optimization. Table 6 also shows that the proposed joint learning achieves the best performance among all methods. This means that removing the inconsistency between the training and evaluation stages helps to make better predictions.
4.5.5 Sensitivity of Hyper-parameters
We explore the influence of two hyper-parameters: a weight that balances the label distribution loss against the expectation regression loss, and the number of discrete labels. In Table 7, we report results on all three age datasets with different values of these hyper-parameters. We can see that our method is not sensitive to either hyper-parameter over a range of values. Note that too many discrete labels leave few training samples per class in DEX rothe2016deep, which may make the prediction less precise. Our method eases this problem, because the number of training samples associated with each class label is significantly increased without actually increasing the total number of training examples. Surprisingly, the performance remains good even when the number of output neurons is 26. In our experiments, we fixed both hyper-parameters without carefully tuning them. In practice, it is convenient to find optimal hyper-parameters using a hold-out set.
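The role of the balancing weight can be sketched with a toy joint loss. Eq. (12) is not reproduced in this section, so the form below is an assumption: a KL divergence between a Gaussian-shaped target label distribution and the prediction, plus a weight `lam` times the absolute error of the expected value. The function names, the Gaussian shape and `sigma=2.0` are illustrative choices, not the paper's settings.

```python
import math


def gaussian_label_distribution(y, labels, sigma=2.0):
    """Discrete, normalized Gaussian centered at y (shape and sigma are assumed)."""
    weights = [math.exp(-((l - y) ** 2) / (2 * sigma ** 2)) for l in labels]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred), skipping entries where the target has zero mass."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, pred) if t > 0)


def expectation(probs, labels):
    """Expected label under a discrete distribution."""
    return sum(p * l for p, l in zip(probs, labels))


def joint_loss(pred, y, labels, lam=1.0, sigma=2.0):
    """Label distribution loss plus lam times expectation regression loss (assumed form)."""
    target = gaussian_label_distribution(y, labels, sigma)
    ld_loss = kl_divergence(target, pred)
    er_loss = abs(expectation(pred, labels) - y)
    return ld_loss + lam * er_loss


labels = list(range(0, 101))  # assumed: 101 discrete age labels
target = gaussian_label_distribution(30, labels)
uniform = [1.0 / len(labels)] * len(labels)

loss_perfect = joint_loss(target, 30, labels)            # close to zero
loss_uniform = joint_loss(uniform, 30, labels)           # large
loss_ld_only = joint_loss(uniform, 30, labels, lam=0.0)  # label distribution term only
```

Under this assumed form, setting `lam=0.0` leaves only the label distribution term, while dropping the KL term leaves only expectation regression, which mirrors the two extreme cases discussed in Section 4.5.4.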
5 Understanding DLDL-v2
We have demonstrated that DLDL-v2 has excellent performance for facial age and attractiveness estimation. A natural question is how DLDL-v2 makes the final decision for an input facial image. In this section, we try to answer this question. Then, we analyze why it works well compared with existing methods.
5.1 How Does DLDL-v2 Estimate Facial Attributes?
To understand how DLDL-v2 makes the final decision for an input facial image, we visualize a score map that intuitively shows which regions of the face image are related to the network's decision. To obtain the score map, we first employ a class-discriminative localization technique zhou2016learning that generates class activation maps. Then, these activation maps are aggregated using the predicted probabilities.
Let us briefly review our framework. The last convolution block produces a set of activation maps. These activations are then spatially pooled by hybrid pooling and linearly transformed (i.e., Eq. (7)) to produce probabilities with a label distribution module. To produce class activation maps, we apply the linear transformation layer to the activation maps as follows
Then, the score map can be derived by
In Eq. (24), each value of the score map represents the contribution to the network's decision at the corresponding row and column position. Larger values mean larger contributions, and vice versa. To compare the highlighted regions of the score map with the input image, we rescale the score map to the size of the input image.
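The two steps described above, class activation maps followed by probability-weighted aggregation, can be sketched as follows. This is an illustrative reading rather than a transcription of the paper's equations: the array layouts `features[c][i][j]` and `weights[k][c]`, the omission of any bias term, and the function names are assumptions.

```python
def class_activation_maps(features, weights):
    """Apply the linear layer at every spatial position.

    features[c][i][j]: activation of channel c at row i, column j.
    weights[k][c]: linear-layer weight from channel c to class k (bias omitted).
    Returns maps[k][i][j].
    """
    h, w = len(features[0]), len(features[0][0])
    return [
        [[sum(wk[c] * features[c][i][j] for c in range(len(features)))
          for j in range(w)] for i in range(h)]
        for wk in weights
    ]


def score_map(cams, probs):
    """Aggregate class activation maps, weighting each by its predicted probability."""
    h, w = len(cams[0]), len(cams[0][0])
    return [[sum(p * cam[i][j] for p, cam in zip(probs, cams))
             for j in range(w)] for i in range(h)]


# Toy example: 2 channels, 2 classes, 2 x 2 spatial grid.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = [[1.0, 0.0], [0.0, 1.0]]
probs = [0.25, 0.75]

cams = class_activation_maps(features, weights)
scores = score_map(cams, probs)  # [[0.25, 0.0], [0.0, 1.5]]
```

In the toy example the bottom-right position receives the largest score because it is strongly activated for the class with the higher predicted probability.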
In Fig. 6, we visualize the score maps of testing images (ChaLearn16) from different age groups. We can see that the highlighted regions (i.e., red regions) differ significantly across age groups. For infants, the highlighted region is located in the center of the two eyes. For adults, the strong areas include the two eyes, nose and mouth. For senior people, the highlighted regions consist of the forehead, brow, two eyes and nose. In short, the network uses different patterns to estimate different ages.
We also show some examples from the SCUT-FBP testing images in Fig. 7. We observe that the highlighted regions do not differ significantly between faces with higher attractiveness scores and those with lower scores. One explanation is that DLDL-v2 may be able to estimate facial attractiveness by simply comparing differences in common facial traits such as eyebrows, eyes, nose and mouth. In fact, the SCUT-FBP dataset has lower complexity (female faces with simple backgrounds, no accessories, and minimal occlusion) than ChaLearn16.
5.2 Sensitivity to Different Face Regions
To quantitatively analyze the sensitivity of DLDL-v2 to different face regions, we occlude different portions of the input image by setting them to the mean values of all training images. Specifically, we use two types of occlusion, a small square region (32×32) and a horizontal stripe (32×224), as in zeiler2014visualizing; rothe2016deep. We occlude the input images (224×224) with these two types of occlusion in a sliding window fashion. In all, we obtain 49+7 occluded inputs for each input image. For each occlusion position, we record the prediction performance (i.e., MAE) on all testing images. Finally, we compute the relative performance loss between the occluded and unoccluded inputs to measure the sensitivity to each occluded region.
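The occlusion protocol can be sketched in a few lines. The stride is assumed to equal the occluder size (non-overlapping windows), which is consistent with the 49+7 count for a 224×224 input; the scalar fill value, the relative-loss formula and the function names are likewise assumptions for illustration.

```python
def square_occlusions(image_size=224, patch=32):
    """Boxes (top, left, height, width) for non-overlapping square occluders."""
    return [(top, left, patch, patch)
            for top in range(0, image_size, patch)
            for left in range(0, image_size, patch)]


def stripe_occlusions(image_size=224, height=32):
    """Boxes for full-width horizontal stripe occluders."""
    return [(top, 0, height, image_size)
            for top in range(0, image_size, height)]


def occlude(image, box, fill):
    """Return a copy of a 2-D image with the box region set to a fill value."""
    top, left, h, w = box
    out = [row[:] for row in image]
    for i in range(top, top + h):
        for j in range(left, left + w):
            out[i][j] = fill
    return out


def relative_loss(mae_occluded, mae_clean):
    """Relative performance loss caused by an occlusion (assumed definition)."""
    return (mae_occluded - mae_clean) / mae_clean


boxes = square_occlusions() + stripe_occlusions()
num_occluded_inputs = len(boxes)  # 49 + 7 = 56

# Tiny illustration on a 4 x 4 "image" with a 2 x 2 occluder.
toy = [[1.0] * 4 for _ in range(4)]
toy_occluded = occlude(toy, (0, 0, 2, 2), fill=0.0)
```

In a full experiment, each box would be applied to every test image, the MAE recomputed, and `relative_loss` evaluated per box to build sensitivity maps like those reported in Fig. 8.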
In Fig. 8, we show the quantitative results under different occlusions. First, we observe that larger values usually appear in specific regions such as the forehead, two eyes, nose, mouth, and chin. This indicates that the decision of DLDL-v2 depends heavily on these crucial regions. Second, these values differ significantly across regions. For example, on the ChaLearn16 testing images, the largest and second largest values appear around the nose and eyes, which suggests that the nose and eyes are the most important facial traits for age estimation. Third, although SCUT-FBP and CFD are both facial attractiveness datasets, the locations of their largest values differ greatly. The largest value for the former appears in the eye region, and that for the latter appears in the mouth and chin region. In fact, SCUT-FBP consists of Asian female faces scored by Chinese raters, while CFD consists of multi-race faces scored by annotators with diverse backgrounds. Therefore, this difference may arise because people of different races may have an inconsistent understanding of facial attractiveness.
5.3 Why Does DLDL-v2 Make Good Estimations?
Compared to MR, the training procedure of DLDL-v2 is more stable because it not only regresses a single value with the expectation module but also learns a label distribution. Compared to DEX, introducing the label distribution learning module into DLDL-v2 significantly increases the number of training instances associated with each class label without actually increasing the total number of training images, which effectively alleviates the risk of over-fitting.
For Ranking and DLDL-based methods, we have shown that both learn a label distribution at different levels. Therefore, they share the advantages of label distribution learning. However, there are three major differences in network architecture between these methods and our DLDL-v2. First, these methods depend heavily on a pre-trained model such as VGGNet or VGGFace with more parameters, while DLDL-v2 has a thinner architecture with fewer parameters. Therefore, DLDL-v2 is more efficient in inference time and storage overhead. Second, DLDL-v2 effectively avoids the inconsistency between the training objective and the evaluation metric by introducing the expectation regression module. Third, DLDL-v2 is a fully convolutional network, which removes all but the final fully connected layer. This is very helpful for understanding how DLDL-v2 makes the final decision. In short, these differences give DLDL-v2 good performance in accuracy, speed, model size and interpretability.
In this paper, we proposed a solution for facial age and attractiveness estimation problems. First, we showed that Ranking-based methods implicitly learn a label distribution, as DLDL-based methods do. This result unifies existing state-of-the-art facial age and attractiveness estimation methods into the DLDL framework. Second, our proposed DLDL-v2 framework effectively removes the inconsistency between the training and evaluation stages by jointly learning a label distribution and regressing a single value with a thin and deep network architecture. It achieves new state-of-the-art results on facial age and attractiveness estimation tasks with fewer parameters and faster speed, which indicates that it can be easily deployed on resource-constrained devices. In addition, DLDL-v2 is a partly interpretable deep framework that employs different patterns to estimate facial attributes.
It is noteworthy that our approach can be easily extended to other tasks with label uncertainty, such as skeletal maturity assessment on pediatric hand radiographs larson2017performance, head pose estimation schwarz2017driveahead, popularity of selfies kalayeh2015selfie, and image aesthetic assessment deng2017image. In addition, a further theoretical study of the relationship between the ranking-CNN and DLDL will be our future work.