1 Introduction
The human face contains a lot of important information related to individual characteristics, such as identity, expression, age, attractiveness and gender. Such information has been widely applied in real-world applications such as video surveillance, customer profiling, human-computer interaction and person identification. Among these tasks, developing automatic age and attractiveness estimation methods has become an attractive yet challenging topic in recent years.
Why is it challenging to estimate age/attractiveness from facial images? First, compared with image classification deng2009imagenet or face recognition parkhi2015deep; guo2016ms; shakeel2019deep, the existing facial attribute datasets are always limited in size, because it is very hard to gather a complete and sufficiently large labeled dataset. In Table 1, we list detailed information of various facial attribute datasets. For example, there are only 2476 training images in the ChaLearn15 apparent age estimation challenge escalera2015chalearn. Second, the number of images is very imbalanced across different label groups. What is more, the distributions of different datasets are also very different. As Fig. 1 depicts, there are two peaks on ChaLearn16 (an early one at around 2 years old and a later one at 26 years old) and Morph (an early one at around 20 years old and a later one at 40 years old), while ChaLearn15 has only one peak. A similar phenomenon also appears in facial attractiveness datasets. These imbalances pose a serious challenge for developing an unbiased estimation system. Third, compared to other facial attributes, such as gender or expression, age/attractiveness estimation is a very fine-grained recognition task; e.g., we humans can hardly sense the change in a person's facial characteristics when he/she grows from 25 to 26 years old.

The common evaluation metric of age/attractiveness estimation is the Mean Absolute Error (MAE) between the predicted value and the ground-truth. Thus, it is very natural to treat facial attribute estimation as a metric regression problem ranjan2017all
which minimizes the MAE. However, such methods usually cannot achieve satisfactory performance, because some outliers may cause a large error term, which leads to an unstable training procedure. Later, Rothe et al. rothe2016deep trained a deep convolutional neural network (CNN) for age estimation as multi-class classification, which maximizes the probability of the ground-truth class without considering other classes. This method easily falls into over-fitting because of the imbalance problem among classes and the limited training images gao2017deep.

Recently, ranking CNN niu2016ordinal; chen2017using; Chen2017Deep; li2017d2c and deep label distribution learning (DLDL) gao2017deep; fan2017label techniques have achieved state-of-the-art performance on facial age estimation. Both methods use the correlation information among adjacent ages, at different levels. The ranking method transforms single-value estimation into a series of binary classification problems at the training stage; the outputs of the rankers are then aggregated directly from these binary outputs at the prediction stage. DLDL first converts a real value to a discrete label distribution; the aim of training is then to fit the entire distribution. At the inference stage, like rothe2016deep, an expected value over the predicted distribution is taken as the final output. We can easily find that there is an inconsistency between the training objectives and the evaluation metric in all these methods. Thus, they may be suboptimal. We expect to improve their performance if this inconsistency is removed.
In addition, we observe that almost all state-of-the-art facial attribute estimation methods rodriguez2017age; fan2017label; rothe2016deep; gao2017deep; antipov2016apparent are initialized by a pre-trained model which is trained on large-scale image classification (e.g., ImageNet deng2009imagenet) or face recognition (e.g., VGG-Face parkhi2015deep) datasets, and fine-tuned on the target dataset. These pre-trained models adopt popular and powerful architectures (e.g., VGGNet simonyan2015very). Unfortunately, such models often have huge computational cost and storage overhead. Taking VGG16 as an example, it has 138.34 million parameters, taking up more than 500MB of storage space. Therefore, it is hard to deploy on resource-constrained devices, e.g., mobile phones. Recently, some researchers have devoted themselves to compressing these pre-trained models so that both reducing the number of parameters and keeping accuracy are possible iccv2017ThiNet. Unlike these compression methods, we directly design a thin and deep network architecture and train it from scratch.

Table 1: Detailed information of various facial attribute datasets.

| Group | Dataset | #Images | Label range |
|---|---|---|---|
| Age | ChaLearn15 escalera2015chalearn | 2476+1136 | 3–85 |
| Age | ChaLearn16 escalera2016chalearn | 5613+1978 | 0–96 |
| Age | Morph ricanek2006morph | 55134 | 16–70 |
| Attractiveness | SCUT-FBP xie2015scut | 500 | 1–5 |
| Attractiveness | CFD ma2015chicago | 597 | 1–7 |
In this paper, we integrate label distribution learning geng2016label and expectation regression into a unified framework to alleviate the inconsistency between the training and evaluation stages, with a simple and lightweight CNN architecture. The proposed approach effectively and efficiently improves the previous DLDL in both prediction error and inference speed for facial attribute estimation, so we call it DLDL-v2. Our contributions are summarized as follows.

We provide, to the best of our knowledge, the first analysis showing that the ranking method is in fact learning a label distribution implicitly. This result thus unifies existing state-of-the-art facial attribute estimation methods into the DLDL framework;

We propose an end-to-end learning framework which jointly learns the label distribution with the correlation information among neighboring labels and regresses the single-value ground-truth in both feature learning and classifier learning;

We create new state-of-the-art results on facial age and attractiveness estimation tasks using a single and small model without external age/attractiveness labeled data or a multi-model ensemble;

Our proposed framework is partly interpretable. We find that the network employs different patterns to estimate the age of people at different age stages. Meanwhile, we also quantitatively analyze the sensitivity of our approach to different face regions.
We organize the rest of this paper as follows. Related works on facial attribute (e.g., age and attractiveness) estimation are introduced in Section 2. Then, Section 3 presents the proposed DLDL-v2 approach, including the problem definition, the relationship between existing methods, and our joint learning framework and its model architecture. After that, experiments are reported in Section 4. In Section 5, we discuss how DLDL-v2 makes the final determination for an input facial image and analyze why it works well. Finally, the conclusion is given in Section 6. Some preliminary results have been published in a conference presentation gao2018dldlv2.
2 Related Works
In the past two decades, many researchers have worked on facial attribute estimation. Earlier approaches are two-stage solutions, including feature extraction and model learning. Recently, deep learning methods have been proposed, which integrate both stages into an end-to-end framework. In this section, we briefly review these two types of frameworks.
Two-stage methods. The task of the first stage is to extract discriminative features from facial images. The active appearance model (AAM) cootes2001active is the earliest method, extracting shape and appearance features of face images. Later, the bio-inspired feature (BIF) guo2009human, as the most successful age feature, was widely used in age estimation. In facial attractiveness analysis, however, geometric features zhang2011quantitative and texture features kagian2007humanlike that depend on facial landmark positions are widely used, since the BIF feature may be suboptimal for facial attractiveness prediction. Obviously, the drawback of hand-designed features is that one needs to redesign the feature extraction method when facing a new task, which usually requires domain knowledge and a lot of effort. The second stage is to estimate facial attributes from these designed features. Classification and regression models are often used: the former includes k-nearest neighbors (KNN), the multilayer perceptron (MLP) and the support vector machine (SVM), while the latter contains quadratic regression, support vector regression (SVR) and soft-margin mixture regression huang2017soft. Instead of classification and regression, ranking techniques chang2011ordinal; chen2013cumulative; Wang2015Relative; Li2015Human; Wan2018Auxiliary utilize the ordinal information of age to learn a model for facial age estimation. In addition, Geng et al. proposed a label distribution learning (LDL) approach to utilize the correlation among adjacent labels, which improved performance on age estimation geng2013facial and beauty sensing rensense. Recently, some improvements of LDL xing2016logistic; he2017data have been proposed. Xing et al. xing2016logistic used logistic boosting regression instead of the maximum entropy model in LDL. Meanwhile, He et al. he2017data generated age label distributions through a weighted linear combination of the label of the input image and those of its context-neighboring images. These methods only learn a classifier, but not the visual representations.
Single-stage methods. Deep CNNs have achieved impressive performance on various visual recognition tasks. Their greatest success is learning feature representations instead of using hand-crafted features via the single-stage learning strategy. Existing facial attribute estimation techniques fall into four categories: metric regression (MR) ranjan2017all, multi-class classification (DEX) rothe2016deep, ranking niu2016ordinal; chen2017using; Chen2017Deep and DLDL gao2017deep.
MR treats age estimation as a real-valued regression problem. The training procedure usually minimizes the squared difference between the estimated value and the ground-truth.
DEX adopts a general image classification framework which maximizes the probability of the ground-truth class during training. In the inference stage, Rothe et al. rothe2016deep empirically showed that the expected value over the softmax-normalized output probabilities achieves better performance than the class with maximum probability. However, both MR and DEX easily lead to unstable training gao2017deep.
Ranking methods transform facial attribute regression into a series of binary classification problems. Niu et al. niu2016ordinal proposed a multi-output CNN by integrating multiple binary classification problems into one CNN. Then, Chen et al. chen2017using; Chen2017Deep trained a series of binary classification CNNs to get better performance. Given a testing image, the outputs of the rankers are aggregated directly from these binary outputs.
DLDL converts a single value to a label distribution and learns it in an end-to-end fashion. Recently, Shen et al. shen2017label proposed LDLFs by combining DLDL and differentiable decision trees. Hu et al. hu2017facial exploited age difference information to improve age estimation accuracy. These approaches have achieved state-of-the-art performance on age estimation. In addition, Yang et al. yangjoint proposed a multi-task deep framework that jointly optimizes image classification and distribution learning for emotion recognition. However, these methods may be suboptimal, because there is an inconsistency between the training objectives and the evaluation metric.

In this paper, we focus on how to alleviate or remove this inconsistency in a deep CNN with fewer parameters. Age and attractiveness estimation from still face images are suitable applications of the proposed research.
3 Our Approach
In this section, we first give the definition of the joint learning problem. Next, we show that ranking is implicitly learning a label distribution. Finally, we present our framework and network architecture.
3.1 The Joint Learning Problem
Notation. We use boldface lowercase letters like $\mathbf{x}$ to denote vectors, and the $k$-th element of $\mathbf{x}$ is denoted as $x_k$. $\mathbf{1}$ denotes a vector of ones. Boldface uppercase letters like $\mathbf{X}$ are used to denote matrices, and the element in the $i$-th row and $j$-th column of $\mathbf{X}$ is denoted as $X_{ij}$. The circle operator $\circ$ is used to denote element-wise multiplication.
The input space is $\mathcal{X} = \mathbb{R}^{H \times W \times C}$, where $H$, $W$ and $C$ are the height, width and number of channels of an input image, respectively. The label space $\mathcal{Y} \subseteq \mathbb{R}$ is real-valued. A training set with $N$ instances is denoted as $D = \{(\mathbf{X}^n, y^n)\}_{n=1}^{N}$, where $\mathbf{X}^n \in \mathcal{X}$ denotes the $n$-th input image and $y^n \in \mathcal{Y}$ its corresponding label. We may omit the image index $n$ for clarity. The joint learning aims to learn a mapping function $F: \mathcal{X} \rightarrow \mathcal{Y}$ such that the error between the prediction $\hat{y}$ and the ground-truth $y$ is as small as possible on a given input image $\mathbf{X}$.
However, metric regression often cannot achieve satisfactory performance. We observe that in real life people usually predict another person's apparent age in a way like "around 25 years old", which indicates using not only 25 but also neighboring ages (e.g., 24 and 26) to describe the face. A similar case also arises in facial attractiveness assessment. Based on this observation, label distribution learning methods can utilize this information by transforming the single-value regression problem into a label distribution learning problem.
To fulfill this goal, instead of outputting a single value for an input $\mathbf{X}$, we quantize the range of possible values into several labels. For example, it is reasonable to assume that $0 \le y \le 100$ in age estimation. Thus, we can define $\mathbf{l} = l_{\min}\!:\!\Delta l\!:\!l_{\max}$ (MATLAB notation) as the ordered label vector, where $\Delta l$ is a fixed real number. A label distribution is then $\mathbf{p} = [p_1, p_2, \ldots, p_K]$, where $p_k$ is the probability that $y = l_k$ for $k = 1, 2, \ldots, K$ gao2017deep. Since we use an equal step size in quantizing $y$, the probability density function (p.d.f.) of a normal distribution is a natural choice to generate the ground-truth $\mathbf{p}$ from $y$ and $\sigma$:

$$p_k = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(l_k - y)^2}{2\sigma^2}\right), \qquad (1)$$
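The quantization and distribution generation above can be sketched in a few lines of NumPy (a minimal illustration, not the authors' code; the renormalization step assumes the discretized p.d.f. should sum to one):

```python
import numpy as np

def label_distribution(y, labels, sigma):
    """Discretize a normal p.d.f. centered at the ground-truth y (cf. Eq. (1))."""
    p = np.exp(-(labels - y) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    return p / p.sum()  # renormalize so the discrete distribution sums to 1

labels = np.arange(0, 101, 1.0)   # l = 0:1:100 for age estimation
p = label_distribution(25.0, labels, sigma=2.0)
# the probability mass peaks at label 25 and spreads to neighboring ages
```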
where $\sigma$ is a hyper-parameter. The goal of label distribution learning is to maximize the similarity between $\mathbf{p}$ and the CNN-generated distribution $\hat{\mathbf{p}}$ at the training stage. At the prediction stage, the predicted distribution $\hat{\mathbf{p}}$ is reversed to a single value by a special inference function. This is suboptimal because there exists an inconsistency between the training objective and the evaluation metric. We are interested in not only learning the label distribution but also regressing a real value in one framework in an end-to-end manner.
3.2 Ranking is Learning Label Distribution
The ranking-based niu2016ordinal; chen2017using; Chen2017Deep and DLDL-based gao2017deep; shen2017deep; shen2017label; fan2017label methods have achieved state-of-the-art performance on facial age/attractiveness estimation problems. In this section, we analyze the essential relationship between them.
We explore their relationship from the perspective of label encoding. In DLDL-based approaches, for a face image with true label $y$ and hyper-parameter $\sigma$, the target vector (i.e., label distribution) $\mathbf{p}$ is generated by a normal p.d.f. (Eq. (1)). For example, the target vector of a 50-year-old face is shown in Fig. 2(a). In ranking CNN, $K-1$ binary classifiers are required for $K$ ranks, because the $k$-th binary classifier focuses on determining whether the age rank of an image is greater than $l_k$ or not. For a face image with true label $y$, the target vector with length $K-1$ is encoded as $\mathbf{r} = [1, \ldots, 1, 0, \ldots, 0]$, where $r_k = 1$ if $y > l_k$ and $r_k = 0$ otherwise. The target ranking vector of a 50-year-old face is shown in Fig. 2(c) as the dark line.
(a) and (b) show p.d.f. and c.d.f. curves with the same mean and different standard deviations. (c) shows the curves of one minus the c.d.f. and the ranking encoding (best viewed in color).
As we all know, for a generic normal distribution with mean $\mu$ and standard deviation $\sigma$, the cumulative distribution function (c.d.f.) is

$$F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right], \qquad (2)$$

where $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$. Fig. 2(b) shows the c.d.f. corresponding to the p.d.f. in Fig. 2(a). From Eq. (2), we know

$$1 - F(x) = \frac{1}{2}\left[1 - \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]. \qquad (3)$$
As shown in Fig. 2(c), the curve of $1 - F(x)$ is very close to that of the ranking encoding when $\sigma$ is a small positive real number. Thus,

$$r_k \approx 1 - F(l_k), \qquad (4)$$

where $r_k$ is the $k$-th element of the ranking target vector $\mathbf{r}$.
Eq. (4) shows that ranking is a specific case of label distribution learning, where the distribution is the cumulative one with a small $\sigma$. That is to say, Ranking essentially learns a c.d.f., while DLDL aims at learning a p.d.f. More generally, we have

$$F(l_k) \approx \sum_{t \le k} p_t, \quad \text{i.e.,} \quad \mathbf{F} \approx T\mathbf{p}, \qquad (5)$$

where $T$ is a transformation matrix with $T_{kt} = 1$ for all $t \le k$ and $T_{kt} = 0$ when $t > k$. Substituting (5) into (4), we have

$$\mathbf{r} \approx \mathbf{1} - T\mathbf{p}. \qquad (6)$$

Therefore, there is a linear relationship between the Ranking encoding and the label distribution. The label distribution encoding can represent more meaningful age/attractiveness information with different $\sigma$, but the ranking encoding cannot. Furthermore, DLDL is more efficient, because only one network has to be trained.
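This relationship is easy to verify numerically. The sketch below (our illustration, with the transformation matrix taken as lower-triangular ones) compares the ranking encoding with one minus the cumulative sum of a sharply peaked label distribution:

```python
import numpy as np

labels = np.arange(0, 101, 1.0)
y, sigma = 50.0, 0.5              # a small sigma makes 1 - F(l_k) close to the 0/1 code

# discretized (and renormalized) normal p.d.f., as in Eq. (1)
p = np.exp(-(labels - y) ** 2 / (2.0 * sigma ** 2))
p /= p.sum()

# ranking encoding: r_k = 1 if the true label exceeds l_k, else 0
r = (labels < y).astype(float)

# F(l_k) ~ sum_{t<=k} p_t, so r ~ 1 - T p with T lower-triangular ones
T = np.tril(np.ones((labels.size, labels.size)))
approx = 1.0 - T @ p

max_gap = np.abs(r - approx).max()  # small everywhere except right at the true label
```

With a small sigma the two encodings agree almost everywhere; the residual gap is concentrated at the labels adjacent to the true age.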
However, as discussed earlier, all these methods may be suboptimal because there exists inconsistency between training objective and evaluation metric.
3.3 Joint Learning Framework
In order to jointly learn the label distribution and output the expectation, in this section we propose the DLDL-v2 framework.
3.3.1 The Label Distribution Learning Module
In order to utilize the good properties of label distribution learning, we integrate it into our framework as a label distribution learning module. As shown in Fig. 3, this module includes a fully connected layer, a softmax layer and a loss layer, and follows the DLDL method in gao2017deep. Specifically, given an input image $\mathbf{X}$ and the corresponding label distribution $\mathbf{p}$, we assume $\mathbf{f} \in \mathbb{R}^D$ is the activation of the last layer of the CNN, where $\boldsymbol{\theta}$ denotes the parameters of the CNN. A fully connected layer transfers $\mathbf{f}$ to $\mathbf{x} \in \mathbb{R}^K$ by

$$\mathbf{x} = \mathbf{W}^{\mathsf{T}} \mathbf{f} + \mathbf{b}. \qquad (7)$$

Then, we use a softmax function to turn $\mathbf{x}$ into a probability distribution, that is,

$$\hat{p}_k = \frac{\exp(x_k)}{\sum_{t} \exp(x_t)}. \qquad (8)$$
Given an input image, the goal of the label distribution learning module is to find $\boldsymbol{\theta}$, $\mathbf{W}$ and $\mathbf{b}$ to generate a $\hat{\mathbf{p}}$ that is similar to $\mathbf{p}$. We employ the Kullback-Leibler (KL) divergence as the measurement of the dissimilarity between the ground-truth label distribution and the predicted distribution. Thus, we can define a loss function on one training sample as follows gao2017deep:

$$L_{ld} = \sum_{k} p_k \ln\frac{p_k}{\hat{p}_k}. \qquad (9)$$
3.3.2 The Expectation Regression Module
Note that the label distribution learning module only learns a label distribution but cannot regress a precise value. In order to reduce the inconsistency between training and evaluation stages, we propose an expectation regression module to further refine the predicted value. As shown in Fig. 3, this module includes an expectation layer and a loss layer.
The expectation layer takes the predicted distribution $\hat{\mathbf{p}}$ and the label set $\mathbf{l}$ as input and emits the expectation

$$\hat{y} = \sum_{k} \hat{p}_k l_k, \qquad (10)$$

where $\hat{p}_k$ denotes the predicted probability that the input image belongs to label $l_k$. Given an input image, the expectation regression module minimizes the error between the expected value $\hat{y}$ and the ground-truth $y$. We use the $\ell_1$ loss as the error measurement:

$$L_{er} = |\hat{y} - y|, \qquad (11)$$

where $|\cdot|$ denotes the absolute value. Note that this module does not introduce any new parameters.
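A forward pass through the two modules can be sketched as follows (a NumPy illustration under the paper's notation; the variable names and the toy inputs are ours):

```python
import numpy as np

def forward(f, W, b, labels, p, y, lam=1.0):
    """Forward pass of the two DLDL-v2 modules; returns prediction and joint loss."""
    x = W.T @ f + b                        # Eq. (7): fully connected layer
    e = np.exp(x - x.max())                # softmax with the usual max-shift for stability
    p_hat = e / e.sum()                    # Eq. (8): predicted label distribution
    m = p > 0                              # skip 0*log(0) terms in the KL divergence
    L_ld = np.sum(p[m] * np.log(p[m] / p_hat[m]))   # Eq. (9)
    y_hat = np.sum(p_hat * labels)         # Eq. (10): expectation layer
    L_er = np.abs(y_hat - y)               # Eq. (11): L1 expectation regression loss
    return y_hat, L_ld + lam * L_er        # weighted joint loss

rng = np.random.default_rng(0)
labels = np.arange(0, 101, 1.0)
y = 25.0
p = np.exp(-(labels - y) ** 2 / 8.0); p /= p.sum()
f, W, b = rng.normal(size=64), rng.normal(size=(64, 101)) * 0.01, np.zeros(101)
y_hat, L = forward(f, W, b, labels, p, y)
```

Since the prediction is an expectation over the label set, it is always inside the label range, and both loss terms are non-negative.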
3.3.3 Learning
Given a training data set $D$, the learning goal of our framework is to find $\boldsymbol{\theta}$, $\mathbf{W}$ and $\mathbf{b}$ via jointly learning label distribution and expectation regression. Thus, our final loss function is a weighted combination of the label distribution loss $L_{ld}$ and the expectation regression loss $L_{er}$:

$$L = L_{ld} + \lambda L_{er}, \qquad (12)$$

where $\lambda$ is a weight which balances the importance of the two types of losses. Substituting Eq. (9), Eq. (10) and Eq. (11) into Eq. (12), we have

$$L = \sum_{k} p_k \ln\frac{p_k}{\hat{p}_k} + \lambda \left|\sum_{k} \hat{p}_k l_k - y\right|. \qquad (13)$$
We adopt stochastic gradient descent to optimize the parameters of our model. The derivative of $L$ with respect to $\hat{p}_k$ is

$$\frac{\partial L}{\partial \hat{p}_k} = -\frac{p_k}{\hat{p}_k} + \lambda \operatorname{sign}(\hat{y} - y)\, l_k. \qquad (14)$$
For any $k$ and $j$, the derivative of the softmax (Eq. (8)) is well known, as

$$\frac{\partial \hat{p}_k}{\partial x_j} = \hat{p}_k \left(\mathbb{1}[k = j] - \hat{p}_j\right), \qquad (15)$$

where $\mathbb{1}[k = j]$ is 1 if $k = j$, and 0 otherwise. According to the chain rule, we have
$$\frac{\partial L}{\partial x_j} = \sum_{k} \frac{\partial L}{\partial \hat{p}_k} \frac{\partial \hat{p}_k}{\partial x_j} \qquad (16)$$

$$= (\hat{p}_j - p_j) + \lambda \operatorname{sign}(\hat{y} - y)\, \hat{p}_j (l_j - \hat{y}). \qquad (17)$$
Applying the chain rule for Eq. (7) again, the derivatives of $L$ with respect to $\mathbf{W}$, $\mathbf{b}$ and $\mathbf{f}$ are easily obtained, as

$$\frac{\partial L}{\partial \mathbf{W}} = \mathbf{f} \left(\frac{\partial L}{\partial \mathbf{x}}\right)^{\!\mathsf{T}}, \quad \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{x}}, \quad \frac{\partial L}{\partial \mathbf{f}} = \mathbf{W} \frac{\partial L}{\partial \mathbf{x}}. \qquad (18)$$
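The gradient in Eq. (17) can be checked numerically against finite differences of the joint loss in Eq. (13); the sketch below is our verification, not part of the original work:

```python
import numpy as np

def loss(x, p, labels, y, lam=1.0):
    """Joint loss of Eq. (13) as a function of the pre-softmax scores x."""
    e = np.exp(x - x.max()); p_hat = e / e.sum()
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / p_hat[m])) + lam * abs(p_hat @ labels - y)

def grad_x(x, p, labels, y, lam=1.0):
    """Analytic gradient of Eq. (17)."""
    e = np.exp(x - x.max()); p_hat = e / e.sum()
    y_hat = p_hat @ labels
    return (p_hat - p) + lam * np.sign(y_hat - y) * p_hat * (labels - y_hat)

rng = np.random.default_rng(1)
labels = np.arange(0.0, 11.0)
y = 4.0
p = np.exp(-(labels - y) ** 2 / 8.0); p /= p.sum()
x = rng.normal(size=labels.size)

# central finite differences over each coordinate of x
h = 1e-6
num = np.array([(loss(x + h * np.eye(labels.size)[j], p, labels, y)
                 - loss(x - h * np.eye(labels.size)[j], p, labels, y)) / (2 * h)
                for j in range(labels.size)])
gap = np.abs(grad_x(x, p, labels, y) - num).max()
```

The two gradients agree to within finite-difference precision, confirming the derivation above term by term.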
Once $\boldsymbol{\theta}$, $\mathbf{W}$ and $\mathbf{b}$ are learned, the prediction $\hat{y}$ for any new instance is generated by Eq. (10) in a single forward computation.
3.4 Network Architecture
Considering both model size and efficiency, we modify VGG16 simonyan2015very in four aspects. VGG16 consists of 13 convolution (Conv) layers, five max-pooling (MP) layers and three fully connected (FC) layers, and each Conv layer and FC layer is followed by a ReLU layer.

First, we observe that the three FC layers contain roughly 90% of the parameters of the whole model. We remove all FC layers and add a hybrid-pooling (HP) layer which is constructed by an MP layer and a global average-pooling (GAP) layer. We find that the HP strategy is more effective than a single GAP. Second, to further reduce the model size, we reduce the number of filters in each Conv layer to make the network thinner. Third, batch normalization (BN) ioffe2015batch has been widely used in recent architectures such as ResNet he2016deep. Thus, we add a BN layer after each Conv layer to accelerate network training. Last but not least, we add the label distribution learning module and the expectation regression module after the HP layer, as shown in Fig. 3.

Since we design the network for age/attractiveness estimation and its architecture is thinner than the original VGG16, we call our model ThinAgeNet or ThinAttNet; it employs a compression rate of 0.5 and has 3.7M parameters.^1 We also train a very small model with a compression rate of 0.25, called TinyAgeNet or TinyAttNet, which only has 0.9M parameters.

^1 A compression rate of 0.5 means every Conv layer has only 50% of the channels of the corresponding layer in VGG16.
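The claim that the FC layers dominate VGG16's parameter count is easy to verify from the architecture's standard configuration (a back-of-the-envelope sketch; the layer sizes below are VGG16's published ones for 224x224 inputs):

```python
# VGG16 parameter counts: 13 conv layers (3x3 kernels) plus fc6/fc7/fc8.
conv_cfg = [(3, 64), (64, 64), (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]
conv = sum(cin * cout * 9 + cout for cin, cout in conv_cfg)   # weights + biases
fc = (25088 * 4096 + 4096) + (4096 * 4096 + 4096) + (4096 * 1000 + 1000)
total = conv + fc
print(round(total / 1e6, 2), round(fc / total, 2))  # -> 138.36 0.89
```

So about 89% of VGG16's roughly 138M parameters sit in the three FC layers, which is why removing them in favor of pooling shrinks the model so dramatically.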
4 Experiments
In this section, we conduct experiments to validate the effectiveness of the proposed DLDL-v2 approach on facial age and attractiveness datasets, based on the open source framework Torch7. All experiments are conducted on an NVIDIA M40 GPU. In order to make all results in this paper reproducible, we will release source code and pre-trained models upon paper acceptance.
4.1 Implementation Details
Pre-processing. We use a multi-task cascaded CNN zhang2016joint to conduct face detection and facial landmark detection for all images. Then, based on these facial landmarks, we align faces to the upright pose. Finally, all faces are cropped and resized to $224 \times 224$. Before being fed to the network, every resized image is normalized by subtracting the mean and dividing by the standard deviation of each color channel.
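The per-channel standardization step can be written as follows (a sketch; the statistics shown are the widely used ImageNet values, inserted here only as placeholders for whatever mean/std the pipeline actually computes):

```python
import numpy as np

def standardize(img, mean, std):
    """Subtract the per-channel mean and divide by the per-channel std (H, W, C layout)."""
    return (img - mean) / std

# placeholder statistics (common ImageNet values, not the paper's own)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

face = np.random.rand(224, 224, 3)
out = standardize(face, mean, std)
```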
Table 2: Comparisons with state-of-the-art methods on three age estimation datasets ("–" denotes a value that is not available).

| Methods | Date | External data | Single model | #Param (M) | Time (ms) | ChaLearn15 MAE | ChaLearn15 ε-error | ChaLearn16 MAE | ChaLearn16 ε-error | Morph MAE |
|---|---|---|---|---|---|---|---|---|---|---|
| Human han2015demographic | TPAMI 2015 | – | – | – | – | – | 0.340 | – | – | 6.30 |
| OR-CNN niu2016ordinal | CVPR 2016 | – | Yes | – | – | – | – | – | – | 3.27 |
| DEX rothe2016deep | IJCV 2016 | No | Yes | 134.6 | 133.30 | 5.369 | 0.456 | – | – | 3.25 |
| DEX rothe2016deep | IJCV 2016 | Yes | Yes | 134.6 | 133.30 | 3.252 | 0.282 | – | – | 2.68 |
| DLDL gao2017deep | TIP 2017 | – | Yes | 134.6 | 133.30 | 3.510 | 0.310 | – | – | 2.42^1 |
| Ranking-CNN chen2017using; Chen2017Deep | CVPR 2017 | – | No | – | – | – | – | – | – | 2.96 |
| LDAE antipov2016apparent; antipov2017effective | PR 2017 | Yes | No | 1480.6 | 1446.30 | – | – | – | 0.241^2 | 2.35 |
| LDLFs shen2017label | NIPS 2017 | – | Yes | – | – | – | – | – | – | 2.24 |
| DRFs shen2017deep | CVPR 2018 | – | Yes | – | – | – | – | – | – | 2.17 |
| DLDL-v2 (TinyAgeNet) | – | No | Yes | 0.9 | 24.26 | 3.427 | 0.301 | 3.765 | 0.291 | 2.291 |
| DLDL-v2 (ThinAgeNet) | – | No | Yes | 3.7 | 51.05 | 3.135 | 0.272 | 3.452 | 0.267 | 1.969 |

^1 Used 90% of Morph images for training and 10% for evaluation;
^2 Used a multi-model ensemble.
Data Augmentation. There are many non-controlled environmental factors, such as face position, illumination, diverse backgrounds, image color (i.e., gray and color) and image quality, especially in the ChaLearn datasets. To handle these issues, we apply data augmentation techniques to every training image, so that the network can take a different variation of the original image as input at each epoch of training. Specifically, we mainly employ five types of augmentation for a cropped and resized training image: random horizontal flipping, random scaling, random color/gray changing, random rotation and standard color jittering.
Training Details. We pre-train a deep CNN model with the softmax loss for face recognition on a subset of the MS-Celeb-1M dataset guo2016ms. One issue is that a small portion of the identities in this dataset have a large number of images while others have only a few. To avoid the imbalance problem among identities, we remove those identities whose number of images is lower than a threshold. In our experiments, we use about 5M images of 54K identities as training data.
After pre-training is finished, we remove the classification layer of the network and add the label distribution learning and expectation regression modules. Then, fine-tuning is conducted on the target datasets. We set the loss weight $\lambda = 1$. The ordered label vector is defined as $\mathbf{l} = l_{\min}\!:\!\Delta l\!:\!l_{\max}$ (MATLAB notation). For age estimation, we set $l_{\min} = 0$, $l_{\max} = 100$ and $\Delta l = 1$. For attractiveness estimation, we set $l_{\min} = 1$ and $\Delta l = 0.1$; because there are different scoring rules on the SCUT-FBP and CFD datasets, $l_{\max}$ is set to 5 and 7, respectively. The label distribution of each image is generated using Eq. (1). The ground-truth (age or attractiveness score) $y$ is provided in all datasets. The standard deviation $\sigma$, however, is provided in ChaLearn15, ChaLearn16 and SCUT-FBP, but not in Morph and CFD, so we simply set $\sigma$ to a fixed value for Morph and for CFD. All networks are optimized by Adam with its default hyper-parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\varepsilon = 10^{-8}$). The initial learning rate is 0.001 for all models, and it is decreased by a factor of 10 every 30 epochs. Each model is trained for 60 epochs using mini-batches of 128.
Inference Details. At the inference stage, we feed a testing image and its horizontally flipping copy into the network and average their predictions as the final estimation for the image.
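The flip-and-average step amounts to simple test-time augmentation; an illustrative sketch (predict_fn stands in for the trained network):

```python
import numpy as np

def tta_predict(predict_fn, img):
    """Average predictions for an image and its horizontally flipped copy."""
    flipped = img[:, ::-1, :]                     # flip along the width axis (H, W, C)
    return 0.5 * (predict_fn(img) + predict_fn(flipped))

# with a flip-invariant stand-in predictor, TTA leaves the output unchanged
img = np.random.rand(224, 224, 3)
pred = tta_predict(lambda im: float(im.mean()), img)
```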
4.2 Evaluation Metrics
MAE is used to evaluate the performance of facial age or attractiveness estimation:

$$\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N} \left|\hat{y}^n - y^n\right|, \qquad (19)$$
where $\hat{y}^n$ and $y^n$ are the estimated and ground-truth values of the $n$-th testing image, respectively. In addition, a special measurement ($\epsilon$-error) is defined by the ChaLearn competition, as

$$\epsilon = 1 - \frac{1}{N}\sum_{n=1}^{N} \exp\!\left(-\frac{(\hat{y}^n - y^n)^2}{2(\sigma^n)^2}\right), \qquad (20)$$

where $\sigma^n$ is the standard deviation of the $n$-th testing image.
We also follow xie2015scut; fan2017label to compute the Root Mean Squared Error (RMSE) and Pearson Correlation (PC), which can be computed as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N} (\hat{y}^n - y^n)^2}, \qquad (21)$$

$$\mathrm{PC} = \frac{\sum_{n=1}^{N} (y^n - \bar{y})(\hat{y}^n - \bar{\hat{y}})}{\sqrt{\sum_{n=1}^{N} (y^n - \bar{y})^2}\,\sqrt{\sum_{n=1}^{N} (\hat{y}^n - \bar{\hat{y}})^2}}, \qquad (22)$$

where $\bar{y}$ and $\bar{\hat{y}}$ are the mean values of the ground-truth and predicted scores over all testing images. These two evaluation metrics are only utilized to evaluate the performance of facial attractiveness estimation.
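All four metrics are one-liners in NumPy (our sketch of Eqs. (19)-(22)):

```python
import numpy as np

def mae(y_hat, y):                 # Eq. (19)
    return np.mean(np.abs(y_hat - y))

def eps_error(y_hat, y, sd):       # Eq. (20); sd holds each image's annotation std
    return np.mean(1.0 - np.exp(-(y_hat - y) ** 2 / (2.0 * sd ** 2)))

def rmse(y_hat, y):                # Eq. (21)
    return np.sqrt(np.mean((y_hat - y) ** 2))

def pc(y_hat, y):                  # Eq. (22): Pearson correlation
    yc, yhc = y - y.mean(), y_hat - y_hat.mean()
    return np.sum(yc * yhc) / np.sqrt(np.sum(yc ** 2) * np.sum(yhc ** 2))
```

Note that a perfect prediction gives 0 for MAE, RMSE and ε-error but 1 for PC, and that PC is invariant to a constant offset while MAE is not.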
4.3 Experiments on Age Estimation
4.3.1 Age Estimation Datasets
Two types of datasets are used in our experiments. The first type contains two small-scale apparent age datasets (ChaLearn15 escalera2015chalearn and ChaLearn16 escalera2016chalearn) which were collected in the wild. The second type is a large-scale real age dataset (Morph ricanek2006morph). We follow the experimental setting in gao2018dldlv2 for evaluation.
4.3.2 Age Estimation Results
We compare our approach with the state of the art in both prediction performance and inference time.
Low Error. Table 2 reports the MAE and $\epsilon$-error of our method and previous state-of-the-art methods on the three age estimation datasets.
In the ChaLearn15 challenge, the best result came from DEX, whose success relies on a large amount of external age-labeled training images (260,282 additional photos). Under the same setting (without external data), our method outperforms DEX by a large margin, as shown in Table 2. On ChaLearn16, the $\epsilon$-error of our approach is closest to the best competition result of 0.241 antipov2016apparent on the testing set. Note that our result is based on a single model without external age-labeled data, whereas antipov2016apparent not only used external age-labeled data but also employed a multi-model ensemble. On Morph, our method sets a new state of the art of 1.969 MAE. To our best knowledge, this is the first time an MAE below two years has been reported on the Morph dataset.
In short, our DLDL-v2 (ThinAgeNet) outperforms the state-of-the-art methods on ChaLearn15, ChaLearn16 and Morph, without external age-labeled data or a multi-model ensemble.
High Efficiency. We measure the speed on one M40 GPU with batch size 32, accelerated by cuDNN v5.1. The number of parameters and the forward running time of our approach and some previous methods are reported in Table 2. Since niu2016ordinal and chen2017using do not release pre-trained models, we cannot test the running time or report the number of parameters of these models. rothe2016deep, gao2017deep and antipov2016apparent all used similar network architectures (i.e., VGG16 or VGG-Face). Since antipov2016apparent employed 11 models, its model size and running time are 11 times those of rothe2016deep and gao2017deep.

Table 3: Comparisons with state-of-the-art methods on two facial attractiveness datasets ("–" denotes a value that is not available).

| Methods | Date | #Param (M) | Time (ms) | SCUT-FBP MAE | SCUT-FBP RMSE | SCUT-FBP PC | CFD MAE | CFD RMSE | CFD PC |
|---|---|---|---|---|---|---|---|---|---|
| Regression (G+T feats) xie2015scut | SMC 2015 | – | – | 0.393 | 0.515 | 0.648 | – | – | – |
| CNN (six-layer) xie2015scut | SMC 2015 | – | – | – | – | 0.819 | – | – | – |
| SLDL (LBP+HOG+Gabor) rensense^1 | IJCAI 2017 | – | – | 0.302 | 0.408 | – | – | – | – |
| LDL (ResNet50) fan2017label | TMM 2017 | 23.6 | 108.28 | 0.217 | 0.300 | 0.917 | – | – | – |
| LDL (ResNet50+GFeats) fan2017label^2 | TMM 2017 | 23.6 | 108.28 | 0.213 | 0.278 | 0.930 | – | – | – |
| DLDL-v2 (TinyAttNet) | – | 0.9 | 24.26 | 0.221 | 0.294 | 0.915 | 0.400 | 0.521 | 0.716 |
| DLDL-v2 (ThinAttNet) | – | 3.7 | 51.05 | 0.212 | 0.273 | 0.930 | 0.364 | 0.472 | 0.766 |

^1 Used ten-fold cross validation, 90% of images for training and 10% for evaluation;
^2 Used multi-feature fusion.

Compared to the state of the art, DLDL-v2 (ThinAgeNet) achieves the best performance using a single model with 36× fewer parameters and a 2.6× reduction in inference time. Furthermore, we also report DLDL-v2's TinyAgeNet results on these datasets. The tiny model can achieve a better result (150× fewer parameters and a 5.5× speed-up) than the original DLDL gao2017deep.
4.3.3 Visual Assessment
Fig. 4 shows some examples of ChaLearn16 testing images processed by our DLDL-v2 ThinAgeNet. In many cases, our solution is able to predict the age of faces accurately. Failures may come from some special cases such as occlusion, low resolution, heavy makeup and extreme pose.
4.4 Experiments on Attractiveness Estimation
4.4.1 Attractiveness Estimation Datasets
To further demonstrate the effectiveness of the proposed DLDL-v2, we perform extensive experiments on two facial attractiveness datasets: SCUT-FBP xie2015scut and CFD ma2015chicago.
SCUT-FBP xie2015scut is a widely used facial beauty assessment dataset. It contains 500 Asian female faces with neutral expressions, simple backgrounds, no accessories, and minimal occlusion. Each face is scored by 75 workers on a 5-point scale, in which 1 means strong agreement that the face is the least attractive and 5 means strong agreement that the face is the most attractive. For each face, its mean score and the corresponding standard deviation are given. We follow the setting in fan2017label and xie2015scut: 80% of the images are randomly selected as the training set, and the remaining 20% as the testing set.
CFD ma2015chicago provides high-resolution and standardized photographs with meaningful annotations (e.g., attractiveness, babyfacedness and expression). Unlike SCUT-FBP, this dataset includes male and female faces of multiple ethnicities (Asian, Black, Latino, and White) between the ages of 17 and 65. Similar to SCUT-FBP, each face is scored by participants with diverse backgrounds on a 7-point scale (1 = Not at all, 7 = Extremely). In this study, we employ all 597 faces with neutral expression and the corresponding attractiveness scores for experiments. We use 80% of the images for training and the remaining 20% for testing.
4.4.2 Attractiveness Estimation Results
In Table 3, we report the performance on SCUT-FBP and CFD and compare with the state-of-the-art methods in the literature.
Compared with methods using hand-crafted features, such as Regression xie2015scut and SLDL rensense, the proposed DLDLv2 (ThinAttNet) achieves 0.930 PC and 0.212 MAE on SCUT-FBP, outperforming Regression xie2015scut by 0.282 in PC and improving on SLDL rensense by 0.135 in RMSE. What is more, our DLDLv2 still outperforms methods using deep label distribution learning, such as LDL (ResNet50) fan2017label, one of the state-of-the-art methods. Furthermore, our method is comparable to the fusion of deep features and geometric features in fan2017label. There are two major reasons. First, our pre-trained model is trained on a face recognition dataset, which is closer to facial attractiveness estimation than the object classification datasets used in fan2017label (whose ResNet50 is trained on ImageNet). Second, DLDLv2 jointly learns the label distribution and regresses the facial attractiveness score, which effectively erases the inconsistency between the training objective and the evaluation metric (MAE). In terms of model parameters and inference time, as reported in Table 3, DLDLv2 (ThinAttNet), with about 6× fewer parameters and 2.1× faster speed, is comparable to the state-of-the-art fan2017label, a fusion of deep features and geometric features. Meanwhile, DLDLv2 (TinyAttNet), with about 26× fewer parameters and 4.5× faster inference, is still comparable to the variant of fan2017label that uses only ResNet50.
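To make the joint objective concrete, here is a minimal NumPy sketch, assuming a KL-divergence loss for the label distribution term and an L1 loss for the expectation regression term; the function and variable names are illustrative, not the paper's code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_loss(logits, target_dist, target_value, labels, lam=1.0):
    """Sketch of the joint objective: a KL-divergence label-distribution
    term plus an L1 expectation-regression term weighted by lam.
    The exact loss forms are assumptions for illustration."""
    p = softmax(logits)
    eps = 1e-12
    # KL divergence between ground-truth and predicted label distributions
    ld = np.sum(target_dist * (np.log(target_dist + eps) - np.log(p + eps)))
    # expected label value under the predicted distribution
    expectation = float(np.dot(p, labels))
    er = abs(expectation - target_value)
    return ld + lam * er
```

Minimizing both terms together keeps the training signal consistent with the MAE metric used at evaluation time.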
4.4.3 Visual Assessment
To intuitively visualize the prediction performance of DLDLv2 on the facial attractiveness task, we show the top eight and bottom eight test images ranked by the prediction scores of DLDLv2 with ThinAttNet in Fig. 5. Among these 16 testing images, the prediction scores of 12 images closely match those of human raters. This result qualitatively demonstrates that DLDLv2 is able to generate human-like results. In addition, some possible facial attractiveness cues may be observed by comparing the top and bottom faces. Generally speaking, faces with higher attractiveness scores have smoother and lighter skin, an oval face with larger eyes, a narrower nose with a pointed tip, and better harmony among facial organs than those with lower scores.
4.5 Ablation Study and Diagnostic Experiments
DLDLv2 (ThinAgeNet) is employed for the ablation studies on the facial age datasets in this section. We first investigate the efficacy of the proposed data augmentation and pooling strategies; for a fair comparison, the hyperparameters are fixed. Then, to investigate the effectiveness of the proposed joint learning mechanism, we compare it with two-stage and single-stage methods under the same settings. Finally, we explore the sensitivity of the hyperparameters in DLDLv2.
4.5.1 Influence of Data Augmentation
Data augmentation techniques increase the amount of training data using only information in the training set, which is an effective way to reduce overfitting in deep models. From Table 4, we observe MAE improvements of 0.26-0.27 on the apparent age datasets and 0.38 on Morph when using data augmentation. This indicates that data augmentation can greatly improve age estimation performance.
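As an illustration, a typical augmentation pipeline of this kind can be sketched as follows; random cropping and horizontal flipping are assumed here, and the paper's exact recipe may differ:

```python
import numpy as np

def augment(image, crop=224):
    """Data-augmentation sketch: random crop plus random horizontal flip.
    Both transforms and the 224-pixel crop size are assumptions."""
    h, w = image.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]  # horizontal flip
    return patch
```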
Aug  Pooling  ChaLearn15 MAE  ChaLearn15 error  ChaLearn16 MAE  ChaLearn16 error  Morph MAE
no   HP       3.399           0.303             3.717           0.290             2.346
yes  GAP      3.210           0.282             3.539           0.274             2.039
yes  HP       3.135           0.272             3.452           0.267             1.969
4.5.2 Effectiveness of Pooling Strategy
GAP is one of the simplest and most popular methods for aggregating the spatial information of feature maps in state-of-the-art network architectures such as ResNet he2016deep. It outputs the spatial average of each feature map of the last convolution layer. Max-pooling takes the maximal value of each small region in a feature map as its output. HP is constructed from a max-pooling layer followed by a GAP layer: HP first encourages the network to learn discriminative features in small regions via max-pooling, and then aggregates all discriminative features with GAP. Thus, the features produced by HP are more discriminative than those of GAP. If we directly use global max-pooling instead of HP, the network easily overfits. To explore the effect of the pooling strategy, we use HP in place of the traditional GAP when combining it with data augmentation. As Table 4 shows, the proposed HP consistently reduces the prediction error on all datasets.
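The HP operation described above can be sketched in a few lines; the 2x2 max-pooling region size is an assumption, since the text elides it:

```python
import numpy as np

def hybrid_pooling(fmap, k=2):
    """Hybrid pooling (HP) sketch for one feature map: k-by-k max-pooling
    followed by global average pooling; k=2 is an assumed region size."""
    h, w = fmap.shape
    h2, w2 = h - h % k, w - w % k              # drop ragged border for simplicity
    blocks = fmap[:h2, :w2].reshape(h2 // k, k, w2 // k, k)
    pooled = blocks.max(axis=(1, 3))           # local max responses
    return pooled.mean()                       # aggregate with GAP
```

With k equal to the full spatial size this degenerates to global max-pooling, which the text notes leads to overfitting; the intermediate k keeps some spatial averaging.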
4.5.3 Comparisons with Two-Stage Methods
We compare the proposed approach with two-stage methods using two types of features. The first is BIF guo2009human, the most successful hand-crafted age feature, which has been widely used in age estimation. The second is CNN features extracted from our pre-trained face recognition model. For BIF, we adopt 6 bands and 8 orientations guo2009human, which produces 4,616-dimensional features. The CNN features are extracted from the hybrid pooling layer of the pre-trained model and have 256 dimensions. These features are normalized without using any dimensionality reduction technique.
We choose three classical age estimation algorithms: SVR guo2009human, OHRank chang2011ordinal, and BFGS-LDL geng2013facial. For SVR and OHRank, the Liblinear software (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) is used to train the regression or classification models. For BFGS-LDL, we use the open-source LDL package (http://ldl.herokuapp.com/download). Instead of predicting the age with the maximal probability as in geng2013facial, we use the expected value over the prediction distribution because it performs better.
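The difference between maximal-probability prediction and expectation-based prediction can be sketched as follows (the 0-100 label range is an assumption for illustration):

```python
import numpy as np

labels = np.arange(0, 101)  # assumed discrete age labels

def argmax_prediction(probs):
    """Pick the single most probable label (the geng2013facial default)."""
    return int(labels[np.argmax(probs)])

def expected_prediction(probs):
    """Take the expectation over the whole predicted distribution;
    every probability contributes, which usually lowers MAE."""
    return float(np.dot(probs, labels))
```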
Feats  Methods   ChaLearn15 MAE  ChaLearn15 error  ChaLearn16 MAE  ChaLearn16 error  Morph MAE
BIF    SVR       6.832           0.545             9.225           0.595             4.303
BIF    OHRank    6.403           0.525             7.680           0.533             3.841
BIF    BFGS-LDL  6.441           0.505             7.626           0.515             3.883
CNN    SVR       5.333           0.471             6.348           0.495             4.370
CNN    OHRank    4.202           0.383             4.668           0.380             3.919
CNN    BFGS-LDL  4.037           0.359             4.457           0.345             3.865
-      DLDLv2    3.135           0.272             3.452           0.267             1.969
The experimental results are shown in Table 5. First, OHRank and BFGS-LDL have similar performance on all datasets, with both BIF and CNN features. This further validates our earlier analysis that ranking is implicitly learning a label distribution. Second, our proposed approach significantly outperforms all baseline methods. The major reason is that two-stage methods cannot learn visual representations, which suggests it is crucially important to jointly learn visual features and the recognition model in an end-to-end manner. Finally, OHRank and BFGS-LDL are much better than SVR, which indicates that learning a label distribution indeed improves estimation performance.
4.5.4 Comparisons with Single-Stage Methods
We employ six strong methods under the same setting as baselines:

- MR: In MR, the ground-truth label is projected to a fixed interval by a linear transform. For MR, we make a small modification to DLDLv2: we add an FC layer with a single output after HP, followed by a hyperbolic tangent activation function to speed up convergence. The l1 and l2 loss functions are used to train the two MR variants in Table 6.
- DEX: In DEX, the true label is quantized into different label groups, each treated as a class. To train DEX, we only need to remove the expectation module and change the loss function to cross-entropy in DLDLv2. At inference time, the expected value over the prediction probabilities is used for the final estimate.
- Ranking: In chen2017using; Chen2017Deep, multiple binary classification networks are trained independently, which leads to time-consuming training and model storage overhead. We propose a new multiple-output CNN and jointly train these binary classifiers. Specifically, we first remove the label distribution and expectation modules in DLDLv2. Then, we add an FC layer with multiple output units followed by a sigmoid layer. For training the Ranking CNN, we employ the binary cross-entropy loss. At inference time, the prediction is the sum of thresholded binary outputs, i.e., the number of binary classifiers whose output exceeds 0.5, where the truth-test operator is 1 if the inner condition is true and 0 otherwise. Our experiments showed that this new setup has lower MAE than that in niu2016ordinal; chen2017using; Chen2017Deep.
- ER: We employ only the expectation regression (ER) loss to optimize DLDLv2's parameters, removing the label distribution loss in Eq. (12).
- DLDL: We set the weight of the expectation regression loss in Eq. (12) to zero, reducing DLDLv2 to DLDL.
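The Ranking inference rule described above can be sketched as follows, with the conventional 0.5 threshold on each sigmoid output:

```python
import numpy as np

def ranking_prediction(sigmoid_outputs, threshold=0.5):
    """Ranking-CNN inference sketch: the estimate is the number of binary
    classifiers voting 'label is greater than rank k'; the 0.5 threshold
    is the conventional choice for sigmoid outputs."""
    return int(np.sum(np.asarray(sigmoid_outputs) > threshold))
```

Because the k-th output approximates P(label > k), summing the thresholded outputs effectively inverts a cumulative distribution, which is why ranking relates to the c.d.f. view discussed in the text.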
Table 6 reports the results of all single-stage methods. The MAE and error of the Ranking, ER, and DLDL methods are significantly lower than those of MR and DEX on all datasets, which indicates that utilizing the label distribution helps reduce age estimation error. Meanwhile, the prediction error of Ranking is close to that of DLDL, which conforms to the analysis in Section 3.2. Furthermore, DLDL performs better than Ranking, which suggests that learning the p.d.f. is more effective than learning the c.d.f. It is noteworthy that ER and DLDL are two extreme cases of our DLDLv2. DLDLv2 consistently outperforms both on all datasets, which indicates that joint learning eases the difficulty of network optimization. As Table 6 shows, the proposed joint learning achieves the best performance among all methods, meaning that erasing the inconsistency between the training and evaluation stages helps us make better predictions.
Methods  ChaLearn15 MAE  ChaLearn15 error  ChaLearn16 MAE  ChaLearn16 error  Morph MAE
MR ()    3.665           0.337             3.696           0.294             2.282
MR ()    3.655           0.334             3.722           0.301             2.347
DEX      3.558           0.306             4.163           0.332             2.311
Ranking  3.365           0.298             3.645           0.290             2.164
ER       3.287           0.291             3.641           0.282             2.214
DLDL     3.228           0.285             3.509           0.272             2.132
DLDLv2   3.135           0.272             3.452           0.267             1.969
4.5.5 Sensitivity of Hyperparameters
Weight  Step (# labels)  ChaLearn15 MAE  ChaLearn15 error  ChaLearn16 MAE  ChaLearn16 error  Morph MAE
0.01    1 (101)          3.223           0.282             3.493           0.270             1.960
0.10    1 (101)          3.188           0.278             3.455           0.268             1.972
1.00    1 (101)          3.135           0.272             3.452           0.267             1.969
10.00   1 (101)          3.144           0.273             3.487           0.270             1.977
1.00    4 (26)           3.182           0.276             3.473           0.270             1.963
1.00    2 (51)           3.184           0.274             3.484           0.271             1.963
1.00    0.50 (201)       3.184           0.278             3.484           0.269             1.992
1.00    0.25 (401)       3.167           0.274             3.459           0.265             2.028
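The label-step column of Table 7 (step sizes with the resulting numbers of discrete labels in parentheses) can be reproduced with a small sketch; building a normalized Gaussian target from the annotated mean and standard deviation is the standard DLDL construction, though the exact form here is illustrative:

```python
import numpy as np

def label_distribution(mean, std, lo=0.0, hi=100.0, step=1.0):
    """Discretize the label range [lo, hi] with the given step and place a
    normalized Gaussian at the annotated mean/std (illustrative sketch;
    the 0-100 range is an assumption)."""
    labels = np.arange(lo, hi + 1e-9, step)    # e.g. step=1 -> 101 labels
    dist = np.exp(-0.5 * ((labels - mean) / std) ** 2)
    return labels, dist / dist.sum()
```

With step 4 the same range yields 26 labels, matching the row of Table 7 where only 26 output neurons remain.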
We explore the influence of two hyperparameters: the weight that balances the label distribution loss and the expectation regression loss, and the label step, which determines the number of discrete labels. In Table 7, we report results on all three age datasets under different values of these hyperparameters. We can see that our method is not sensitive to either of them over a wide range. Note that too many discrete labels lead to few training samples per class in DEX rothe2016deep, which may make the prediction less precise. Our method eases this problem because the number of training samples associated with each class label is significantly increased without actually increasing the total number of training examples. Surprisingly, performance remains good even when the number of output neurons is only 26. In our experiments, we fixed the hyperparameters without carefully tuning them. In practice, it is convenient to find optimal hyperparameters using a holdout set.

5 Understanding DLDLv2
We have demonstrated that DLDLv2 has excellent performance for facial age and attractiveness estimation. A natural question is how DLDLv2 makes its final decision for an input facial image. In this section, we try to answer this question and then analyze why DLDLv2 works well compared with existing methods.
5.1 How Does DLDLv2 Estimate Facial Attributes?
To understand how DLDLv2 makes its final decision for an input facial image, we visualize a score map that intuitively shows which regions of the face image are related to the network's decision. To obtain the score map, we first employ a class-discriminative localization technique zhou2016learning to generate class activation maps. Then, these activation maps are aggregated using the predicted probabilities.
Let us briefly review our framework. The last convolution block produces C activation maps A_1, ..., A_C. These activations are spatially pooled by hybrid pooling and linearly transformed (i.e., Eq. (7)) to produce probabilities p_1, ..., p_K with a label distribution module. To produce class activation maps, we apply the same linear transform to the activation maps as follows:

M_k = sum_c w_{c,k} A_c,  (23)

where w_{c,k} is the linear-transform weight connecting the c-th feature map to the k-th label. Then, the score map can be derived by

S = sum_k p_k M_k.  (24)

In Eq. (24), the value of S at the i-th row and j-th column represents the contribution to the network's decision at that position: bigger values mean greater contributions and vice versa. To compare the correspondence between the highlighted regions in S and an input image, we scale S to the size of the input image.
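The two aggregation steps above can be sketched as follows (shapes and names are illustrative):

```python
import numpy as np

def score_map(activations, weights, probs):
    """Score-map sketch in the spirit of Eqs. (23)-(24).
    activations: (C, H, W) last-conv feature maps
    weights:     (C, K) linear-transform weights
    probs:       (K,)  predicted label distribution"""
    # Eq. (23): M_k = sum_c w_{c,k} * A_c  ->  (K, H, W)
    cams = np.tensordot(weights.T, activations, axes=1)
    # Eq. (24): S = sum_k p_k * M_k        ->  (H, W)
    return np.tensordot(probs, cams, axes=1)
```

The resulting map would then be resized to the input resolution for visualization, as the text describes.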
In Fig. 6, we visualize the score maps of testing images (ChaLearn16) from different age groups. We can see that the highlighted regions (i.e., red regions) differ significantly across age groups. For infants, the highlighted region is centered between the two eyes. For adults, the strong areas include the two eyes, nose, and mouth. For senior people, the highlighted regions consist of the forehead, brow, two eyes, and nose. In short, the network uses different patterns to estimate different ages.
We also show some examples from the SCUT-FBP testing images in Fig. 7. We observe that the highlighted regions do not differ significantly between faces with higher attractiveness scores and those with lower scores. One explanation is that DLDLv2 may estimate facial attractiveness by simply comparing differences in common facial traits such as the eyebrows, eyes, nose, mouth, etc. In fact, the SCUT-FBP dataset has lower complexity (female faces with simple backgrounds, no accessories, and minimal occlusion) than age estimation on ChaLearn16.
5.2 Sensitivity to Different Face Regions
To further quantitatively analyze the sensitivity of DLDLv2 to different face regions, we occlude different portions of the input image by setting them to the mean values of all training images. Specifically, we use two types of occluders, a small square region (32×32) and a horizontal stripe (32×224), as in zeiler2014visualizing; rothe2016deep. We occlude the input images (224×224) with these two types of occluders in a sliding-window fashion, obtaining 49 + 7 occluded inputs per image. For each occluded input, we record the prediction performance (i.e., MAE) on all testing images. Finally, we compute the relative performance loss between predictions with and without occlusion to measure the sensitivity to each occluded region.
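A minimal sketch of the square-occluder analysis, assuming a scalar stand-in for the training-set mean and a user-supplied `predict_mae` callable (both hypothetical; the stripe variant is analogous):

```python
import numpy as np

def occlusion_sensitivity(image, predict_mae, patch=32):
    """Occlusion-analysis sketch: slide a patch-sized square over the image,
    fill it with an assumed training-set mean, and record the relative
    change in MAE per position. `predict_mae` is a hypothetical callable
    mapping an image to a model's MAE-style score."""
    mean_val = 0.5                            # stand-in for the training mean
    h, w = image.shape[:2]
    base = predict_mae(image)                 # score without occlusion
    heat = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch,
                     j * patch:(j + 1) * patch] = mean_val
            heat[i, j] = (predict_mae(occluded) - base) / base
    return heat
```

Positions where the relative loss is largest mark the regions the model depends on most, which is exactly what Fig. 8 visualizes.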
In Fig. 8, we show the quantitative results under different occlusions. First, we observe that larger relative performance losses usually appear in specific regions such as the forehead, eyes, nose, mouth, and chin, indicating that the decision of DLDLv2 heavily depends on these crucial regions. Second, the values differ significantly across regions. For example, on the ChaLearn16 testing images, the largest and second-largest values appear around the nose and eyes, suggesting that the nose and eyes are the most important facial traits for age estimation. Third, although SCUT-FBP and CFD are both used to evaluate facial attractiveness, the distributions of the largest values are quite different: for the former the largest value appears in the region of the eyes, while for the latter it appears in the region of the mouth and chin. In fact, the faces in SCUT-FBP are Asian females scored by Chinese raters, while the CFD dataset consists of multi-race faces scored by annotators from diverse backgrounds. Therefore, this difference may be due to different races having inconsistent understandings of facial attractiveness.
5.3 Why Does DLDLv2 Make Good Estimates?
Compared to MR, the training procedure of DLDLv2 is more stable because it not only regresses a single value with the expectation module but also learns a label distribution. Compared to DEX, introducing the label distribution learning module into DLDLv2 significantly increases the number of training instances associated with each class label without actually increasing the total number of training images, which effectively alleviates the risk of overfitting.
For the Ranking- and DLDL-based methods, we have proved that both learn a label distribution, at different levels, and therefore both share the advantages of label distribution learning. However, there are three major architectural differences between these methods and our DLDLv2. First, these methods depend heavily on a pre-trained model such as VGGNet or VGGFace with many parameters, while DLDLv2 has a thinner architecture with far fewer parameters; DLDLv2 therefore has higher efficiency in inference time and storage overhead. Second, DLDLv2 effectively avoids the inconsistency between the training objective and the evaluation metric by introducing the expectation regression module. Third, DLDLv2 is a fully convolutional network that removes all but the final fully connected layer, which is very helpful for understanding how DLDLv2 makes its final decision. In a word, these differences give DLDLv2 good performance in accuracy, speed, model size, and interpretability.
6 Conclusion
In this paper, we proposed a solution to the facial age and attractiveness estimation problems. We first showed that Ranking-based methods implicitly learn a label distribution, just as DLDL-based methods do. This result unifies existing state-of-the-art facial age and attractiveness estimation methods into the DLDL framework. Second, our proposed DLDLv2 framework effectively erases the inconsistency between the training and evaluation stages by jointly learning the label distribution and regressing a single value with a thin and deep network architecture. It sets new state-of-the-art results on facial age and attractiveness estimation tasks with fewer parameters and faster speed, indicating that it can easily be deployed on resource-constrained devices. In addition, DLDLv2 is a partly interpretable deep framework which employs different patterns to estimate facial attributes.
It is noteworthy that our approach easily scales to other label uncertainty tasks, such as skeletal maturity assessment on pediatric hand radiographs larson2017performance, head pose estimation schwarz2017driveahead, selfie popularity prediction kalayeh2015selfie, and image aesthetic assessment deng2017image. In addition, a further theoretical study of the relation between Ranking-CNN and DLDL will be part of our future work.