1 Introduction
The attractiveness of a face influences many social endeavors. Beautiful faces can affect a person's personality, career prospects, and personal relationships OHFP11_Calder. By integrating machine learning and computer vision techniques, automatic facial attractiveness computation has become an ever-growing topic.
Although substantial progress has been achieved in research on facial attractiveness computation, challenges remain. The first challenge is the lack of true attractiveness labels (scores). To derive an approximation of the ground truth, a typical approach is to survey a diverse group of human raters who assign scores to a set of faces. The average score of each face is then defined as the ground-truth label for the subsequent classification or regression task myMTAP. However, the average score is not always a good indicator of universally accepted preference, especially for controversial faces. In contrast, the score distribution collected from different raters provides more information about the degree of attractiveness of a face than a single label. Fig. 1 gives an example of the average attractiveness score versus the score distribution of one face. The score distribution covers a number of labels and represents the extent to which each beauty level describes the overall attractiveness of the face. In this sense, the score distribution can be viewed as a natural representation of a label distribution. Therefore, we recast facial attractiveness computation as a Label Distribution Learning (LDL) problem PAMI13_Geng, a recently proposed machine learning paradigm. LDL is able to deal with insufficient and incomplete training data, since each face is expected to contribute to the learning of a number of attractiveness levels.
The second challenge is the lack of an accurate face representation that captures the salient elements of attractiveness. Many heuristic rules have been quantitatively studied over the years, inspiring researchers to hand-design diverse features in most previous studies. The features can be geometric, color or texture based NC06_Eisenthal ; NIPS06_Kagian ; PR08_Schmid, as well as appearance descriptors ECCV10_Gray ; PRIACVA12_Bottino, either at local or holistic scale. Recently, up-to-date deep learning methods, especially the convolutional neural network (CNN), have been applied to automatically learn a hierarchical and higher-level face representation for the facial attractiveness computation task Neucom14_Gan ; arXiv15_Xu. Even with shallow and plain architectures, i.e., the two-layer convolutional restricted Boltzmann machine in Neucom14_Gan and the six-layer CNN in arXiv15_Xu, these methods outperform previous work based on traditional handcrafted features. Since face beauty is a complex concept with no universally accepted representation, we are motivated to determine the most appropriate visual characteristics from the raw RGB image with very deep networks.
In this work, we intend to explore deeper facial aesthetics features by learning them directly from raw images through a deeper architecture. A very deep convolutional residual network (ResNet) CVPR16_ResNet is utilized, which takes RGB pixels as inputs and automatically learns an effective face representation. We also combine LDL and ResNet in facial attractiveness computation, and show the advantages of our method on the newly constructed benchmark dataset SCUT-FBP SMC15_Xie.
2 Our method
2.1 Network architecture
In order to use the deep convolutional residual network CVPR16_ResNet, the input images need to be normalized to a fixed size in this work. We first extract the face region from the original image, making it possible to concentrate solely on the attractiveness of the face itself. We then pad the shorter side of the image with zero-valued border pixels to generate a normalized square input for our network.
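The zero-padding step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `pad_to_square` is hypothetical, the image is represented as a plain list of rows, and splitting the padding evenly between both sides is an assumption (the text only says that the shorter side is padded with zeros). Face extraction and the final resize to the fixed network input size are omitted.

```python
def pad_to_square(image, fill=0):
    """Zero-pad the shorter side of a 2-D image (list of rows) to make it square.

    Assumption: the original content is centered, so the padding is split
    as evenly as possible between the two borders of the shorter side.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    side = max(height, width)
    top = (side - height) // 2    # rows of padding above the image
    left = (side - width) // 2    # columns of padding left of the image

    padded = [[fill] * side for _ in range(side)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            padded[top + r][left + c] = value
    return padded


# A hypothetical 2 x 4 face crop becomes a 4 x 4 square input.
crop = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
square = pad_to_square(crop)
```

Padding rather than stretching keeps the aspect ratio of the face unchanged, which matters when the proportions of the face are part of what the network is supposed to judge.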
The overall architecture of the ResNet used for facial attractiveness computation is shown in Fig. 2. It contains a convolutional layer, a max-pooling layer, and four convolutional “bottleneck” blocks, followed by an average-pooling layer, a fully-connected layer with 5 neurons, and a softmax layer. Batch normalization is performed after each convolutional layer. With different values of the four repetition parameters of the bottleneck blocks, residual networks of different depths can be constructed. More details about the architecture can be found in CVPR16_ResNet. In this work, we consider the 50-layer and 101-layer ResNets, where the four repetition parameters are {2,3,5,2} and {2,3,22,2}, respectively.
2.2 Loss function
While existing work uses a single label (the average score) for supervised regression, we use the score distribution to describe the attractiveness of a face, which can be viewed as a natural representation of a label distribution. In this work, therefore, we recast facial attractiveness computation as an LDL problem.
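To make the difference between the two kinds of supervision concrete, the sketch below turns a list of rater scores into both an average score and a score distribution. The function names and the two example faces are hypothetical and are not taken from the dataset; the five levels 1 to 5 are assumed to match the 5 output neurons of the network.

```python
from collections import Counter


def score_distribution(ratings, levels=(1, 2, 3, 4, 5)):
    """Return the fraction of raters who chose each attractiveness level."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in levels for r in ratings):
        raise ValueError("ratings must be drawn from the given levels")
    counts = Counter(ratings)
    return [counts.get(level, 0) / len(ratings) for level in levels]


def average_score(ratings):
    """Return the single-label target used by conventional regression."""
    return sum(ratings) / len(ratings)


# Two hypothetical faces with the same mean but very different agreement.
consensus_face = [3] * 10
controversial_face = [1] * 5 + [5] * 5

consensus_dist = score_distribution(consensus_face)
controversial_dist = score_distribution(controversial_face)
```

Both faces have an average score of 3.0, yet their distributions differ completely, which is the information a single label discards.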
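Similar to the definitions in PAMI13_Geng, suppose that for the $i$-th image, the feature representation extracted from the 5-d fc layer is $x_i$, the complete set of possible labels (scores) is $Y = \{y_1, y_2, \dots, y_5\}$, and the score distribution is $D_i = \{d_{x_i}^{y_1}, d_{x_i}^{y_2}, \dots, d_{x_i}^{y_5}\}$, where $d_{x_i}^{y_j}$ denotes the description degree (distribution) of the label $y_j$ to the instance $x_i$ and $\sum_{j=1}^{5} d_{x_i}^{y_j} = 1$. Given a training set $S = \{(x_1, D_1), \dots, (x_n, D_n)\}$, the purpose is to train a set of parameters $\theta$ to generate a distribution $p(y \mid x_i; \theta)$ similar to $D_i$. Here the Euclidean distance and the Kullback-Leibler divergence are used to measure the similarity of these two distributions. The last softmax layer is trained by minimizing either of the following overall loss functions:
$$L_{\mathrm{Euc}}(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{5} \left( d_{x_i}^{y_j} - p(y_j \mid x_i; \theta) \right)^2 \qquad (1)$$
$$L_{\mathrm{KL}}(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{5} d_{x_i}^{y_j} \ln \frac{d_{x_i}^{y_j}}{p(y_j \mid x_i; \theta)} \qquad (2)$$
Given a new test image, the normalized input and its patch are fed into the network to compute the feature representation $x$, and the predicted label distribution can be generated by $p(y \mid x; \theta)$, which provides the distribution of attractiveness scores. We can also predict the exact score of this face as the weighted mean of the score distribution.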
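The two distribution losses and the weighted-mean prediction described above can be illustrated for a single image with a few lines of code. This is a sketch under stated assumptions rather than the training code of the paper: the function names are hypothetical, the losses are written in their standard textbook form without any normalization constant, and the `eps` guard in the KL term is an added numerical safeguard.

```python
import math


def softmax(logits):
    """Convert the 5 fc-layer outputs into a probability distribution."""
    peak = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean_loss(target, predicted):
    """Squared Euclidean distance between two distributions."""
    return sum((d - p) ** 2 for d, p in zip(target, predicted))


def kl_divergence(target, predicted, eps=1e-12):
    """KL divergence from the rater distribution to the prediction.

    Terms with zero target mass contribute nothing, by the usual convention.
    """
    return sum(d * math.log(d / max(p, eps))
               for d, p in zip(target, predicted) if d > 0)


def expected_score(distribution, levels=(1, 2, 3, 4, 5)):
    """Weighted mean of the score levels under the given distribution."""
    return sum(level * p for level, p in zip(levels, distribution))


# Hypothetical rater distribution and an untrained (uniform) prediction.
target = [0.1, 0.2, 0.4, 0.2, 0.1]
predicted = softmax([0.0, 0.0, 0.0, 0.0, 0.0])

loss_euc = euclidean_loss(target, predicted)
loss_kl = kl_divergence(target, predicted)
score = expected_score(target)
```

Summing either loss over all training images gives the overall objective; both are zero only when the predicted distribution matches the rater distribution exactly.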
3 Experiments
We evaluate our proposed method on the SCUT-FBP dataset SMC15_Xie, which contains 500 facial images, each with around 70 attractiveness scores ranging from 1 to 5. Following the data partition setting in arXiv15_Xu, 400 images are randomly selected as the training set, and the remaining 100 images are used as the testing set. Since the comparable work aims at exact attractiveness estimation, we calculate the weighted mean score $\hat{s}$ from the predicted distribution by $\hat{s} = \sum_{j=1}^{5} y_j \, p(y_j \mid x; \theta)$, where $p(y_j \mid x; \theta)$ is the predicted probability of score $y_j$ for the test image $x$. The Pearson correlation (PC) between $\hat{s}$ and the ground-truth average score from raters is used to evaluate the performance of our model.
Our experiments are conducted on the Caffe deep learning platform ACMM14_Caffe. We fix the batch size at 32 and the weight decay at 0.0005. The learning rate is initialized to 0.001 and reduced by a factor of 10 every 4,000 iterations. The maximum number of iterations is set to 17,000. The learning rate of the last layer is increased by a factor of 10 to speed up convergence, and the weight decay is multiplied by 100 to avoid overfitting.
To start with, we identify the role of deeper networks in facial attractiveness computation in the traditional regression context. Due to the very limited number of training samples, we fine-tune three classic models, ResNet-50, ResNet-101 and VGGNet-19 ICLR15_VGGNet
, which were pretrained on the ImageNet dataset CVPR09_Imagenet. The performance of the three networks is compared in Table 1. Note that the baseline result taken from arXiv15_Xu is the correlation of 0.83, rather than the highest value of 0.88, which was achieved with several input channels instead of raw images. All three networks outperform the baseline because of their much deeper architectures. The 50-layer ResNet performs best, and VGGNet worst. To reveal the possible reasons, Fig. 3 presents the behavior of the three networks throughout the training procedure. Without the residual structure, VGGNet cannot be trained effectively due to the degradation problem described in CVPR16_ResNet; thus the training error of VGGNet is the highest during the whole training. Compared to ResNet-50, the inferior performance of the 101-layer ResNet is caused by overfitting, which leads to a lower training error and a fluctuating testing error, as shown in Fig. 3. We also experimented with both the Euclidean distance and the KL divergence in the loss function, and the Euclidean distance performs slightly better. Therefore, all the results are obtained under this setting.
Table 1: Pearson correlation (PC) of the baseline and the three fine-tuned networks.
Networks   arXiv15_Xu   ResNet-50   ResNet-101   VGG-19
PC         0.83         0.87        0.85         0.84
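The training schedule and the evaluation metric described in this section can be sketched as follows. The code is illustrative only: the function names are hypothetical, the learning rate is assumed to drop as a step function of the iteration count (which matches the description "reduced by a factor of 10 every 4,000 iterations"), and the Pearson correlation is written out by hand instead of calling a library.

```python
import math


def step_learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=4000):
    """Learning rate after `iteration` steps under a step-decay schedule."""
    return base_lr * gamma ** (iteration // step_size)


def pearson_correlation(xs, ys):
    """Pearson correlation between predicted and ground-truth scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Learning rate at the start and at the last of the 17,000 iterations.
lr_start = step_learning_rate(0)
lr_end = step_learning_rate(16999)

# Hypothetical predicted and rater-average scores for four test faces.
predicted_scores = [2.1, 3.4, 2.9, 4.2]
rater_averages = [2.0, 3.5, 3.1, 4.0]
pc = pearson_correlation(predicted_scores, rater_averages)
```

With these settings the rate is lowered four times before training stops, so the final iterations run at a rate four orders of magnitude below the initial 0.001.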
With the network fixed to ResNet-50, we then identify the role of the label distribution in facial attractiveness computation. As shown in Table 2, introducing LDL to our task increases the correlation by 0.02, from 0.87 to 0.89. This reinforces the advantage of the label distribution, which provides more information about the degree of attractiveness of a face than the average score. This framework potentially allows a more universally accepted attractiveness score, in contrast to models tuned to a particular ethnicity or viewer gender. In order to further boost the accuracy of attractiveness computation, several augmentation techniques, including the standard color augmentation NIPS12_AlexNet, rotation, contrast enhancement, etc., are used to enlarge the training data to 8,000 images. In this way, the highest correlation of 0.92 is achieved.
Table 2: Pearson correlation (PC) of ResNet-50 with LDL and data augmentation.
Improvements   ResNet-50   ResNet-50 + LDL   ResNet-50 + LDL + Augmentation
PC             0.87        0.89              0.92
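As a rough illustration of how augmentation enlarges the training set, the sketch below generates several contrast-adjusted copies of one image. It is not the augmentation pipeline of the paper: the function names and the contrast factors are hypothetical, the linear scaling around the image mean is only one simple way to change contrast, and the color and rotation augmentations mentioned in the text are not shown.

```python
def adjust_contrast(image, factor):
    """Scale pixel deviations from the image mean by `factor`.

    `image` is a list of rows of grayscale values in [0, 255]; results are
    clipped back to that range.
    """
    pixels = [value for row in image for value in row]
    mean = sum(pixels) / len(pixels)
    return [[min(255.0, max(0.0, mean + factor * (value - mean)))
             for value in row]
            for row in image]


def augment(image, factors):
    """Return one contrast-adjusted copy of `image` per factor."""
    return [adjust_contrast(image, factor) for factor in factors]


# A hypothetical 2 x 2 image and three illustrative contrast factors.
image = [[0, 100],
         [100, 200]]
copies = augment(image, factors=[0.5, 1.5, 2.0])

# Going from 400 training images to 8,000 means 20 variants per image.
variants_per_image = 8000 // 400
```

Each augmented copy keeps the score distribution of its source image, so the label side of the training set needs no extra annotation.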
4 Conclusion
The purpose of this paper is to present a new deep learning based framework for facial attractiveness computation. Rather than using the average attractiveness score as the ground truth for single-label supervised learning, we recast this task as a label distribution learning (LDL) problem. In order to extract an aesthetics-related face representation, a very deep residual network is utilized. Experiments on the SCUT-FBP dataset show that our approach achieves a higher Pearson correlation (0.92) than the previous baseline (0.83). Based on this work, we believe that the construction of large-scale benchmarks and more effective deep networks deserves attention in the future.
References
 (1) A. Calder, G. Rhodes, M. Johnson, J. Haxby, The Oxford handbook of face perception, Oxford University Press, Oxford, 2011.
 (2) S. Liu, Y. Fan, A. Samal, Z. Guo, Advances in computational facial attractiveness methods, Multimedia Tools and Applications (2016) 1–31. doi:10.1007/s11042-016-3830-3.
 (3) X. Geng, C. Yin, Z. H. Zhou, Facial age estimation by learning from label distributions, PAMI 35 (10) (2013) 2401–2412.
 (4) D. Xie, L. Liang, L. Jin, J. Xu, M. Li, SCUT-FBP: A benchmark dataset for facial beauty perception, in: SMC, 2015, pp. 1821–1826.
 (5) Y. Eisenthal, G. Dror, E. Ruppin, Facial attractiveness: beauty and the machine, Neural Computation 18 (1) (2006) 119–142.
 (6) A. Kagian, G. Dror, T. Leyvand, D. Cohen-Or, E. Ruppin, A humanlike predictor of facial attractiveness, in: NIPS, 2006, pp. 649–656.
 (7) K. Schmid, D. Marx, A. Samal, Computation of face attractiveness index based on neoclassic canons, symmetry and golden ratio, PR 41 (8) (2008) 2710–2717.
 (8) D. Gray, K. Yu, W. Xu, Y. Gong, Predicting facial beauty without landmarks, in: ECCV, 2010, pp. 434–447.

 (9) A. Bottino, A. Laurentini, The intrinsic dimensionality of attractiveness: A study in face profiles, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2012, pp. 59–66.
 (10) J. Gan, L. Li, Y. Zhai, Y. Liu, Deep self-taught learning for facial beauty prediction, Neurocomputing 144 (2014) 295–303.
 (11) J. Xu, L. Jin, L. Liang, Z. Feng, D. Xie, A new humanlike facial attractiveness predictor with cascaded fine-tuning deep learning model, arXiv preprint arXiv:1511.02465.
 (12) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
 (13) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: ACMM, 2014, pp. 675–678.
 (14) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015.
 (15) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, F.F. Li, Imagenet: A large-scale hierarchical image database, in: CVPR, 2009, pp. 248–255.
 (16) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.