In the mobile computing era, billions of images are acquired and uploaded to social networks and online platforms every day, creating demand for better image processing and analysis technology. Recently, thanks to big data and high-performance computational hardware, computational and data-driven approaches have been proposed for problems such as face recognition, facial expression recognition, and facial beauty analysis.
Existing methods resort to machine learning and computer vision techniques to analyze facial beauty and achieve promising results. These methods typically combine image feature descriptors (such as HOG, SIFT, and LBP) with supervised machine learning predictors (such as SVM, KNN, DNN, and LR).
To explore a facial beauty prediction approach that precisely maps high-level features to facial beauty ratings, we propose a method that combines transfer learning and Bayesian regression. The method achieves improved or comparable performance on the SCUT-FBP dataset and the ECCV HotOrNot dataset.
The main contributions of this paper are as follows:
We apply transfer learning to facial beauty prediction for feature extraction. Experimental results show that the transferred deep features outperform traditional image feature descriptors such as HOG, LBP, and gray value features.
We present a detailed analysis of deep features based on knowledge adaptation. Additionally, we apply a feature fusion strategy to build more informative facial features for the facial beauty prediction task.
Neural networks are known to lack satisfactory interpretability. We conduct ablation studies by visualizing the face features and reveal the elements that influence facial beauty perception.
The rest of this paper is organized as follows. Section II reviews related work on facial descriptors and learning methods. Section III describes our proposed method in detail, including deep feature extraction and Bayesian ridge regression. Experimental results and comparisons are presented in Section IV, and Section V concludes the paper with a summary and future work.
II Related Work
II-A Facial Descriptors and Machine Learning Predictors
Many researchers focus on developing new machine learning algorithms to achieve better classification or regression performance, while others focus on designing better facial feature descriptors. Zhang et al. [7] use a vector of gray values created by concatenating the rows or columns of an image. Huang et al. propose a method to learn hierarchical representations with convolutional deep belief networks. Xie et al. resort to deep learning to train a predictor and achieve state-of-the-art performance. Kagian et al. use numerous facial features that describe facial geometry, color, and texture to predict facial attractiveness. Lu et al. detect face landmarks with ASM and then extract facial features based on Blocked-LBP, achieving a Pearson Correlation of 0.874 on 400 high-quality female face images. Zhang et al. compute geometric distances between feature points, build ratio vectors from these distances, and treat them as features for a machine learning algorithm. Because labeled images are scarce, fine-tuning the architecture and parameters of a deep neural network to reach competitive results while avoiding overfitting is often time-consuming.
In addition, some research works toward developing or improving machine learning algorithms. Eisenthal et al. employ KNN and SVM as classifiers to assign faces to different attractiveness levels. Gan et al. use deep self-taught learning to obtain hierarchical representations and learn the concept of facial beauty. Xu et al. propose a convolutional neural network (CNN) for facial beauty prediction that uses a new deep cascaded fine-tuning scheme with various face input channels. Wang et al. use deep auto-encoders to extract features and a low-rank fusion method to integrate scores, and their method achieves promising results. Xu et al. propose a “psychologically inspired CNN (PI-CNN)” for automatic facial beauty prediction.
II-B Deep CNN and Transfer Learning
Deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. A CNN is a type of neural network designed to process data that come in the form of multiple arrays. Deep learning has become a powerful tool in computer vision tasks such as image recognition [17, 18, 19, 20]. Features are extracted automatically via stacked layers, and the networks are trained with the back-propagation algorithm to minimize a cost function.
Deep convolutional neural networks show greater capacity for feature extraction than traditional hand-crafted descriptors. However, designing a task-specific architecture and training a deep neural network almost from scratch imposes a heavy computational burden. Transfer learning allows us to fine-tune the higher layers of a pretrained model, or even to treat the pretrained model simply as a feature extractor.
Yosinski et al. show that initializing a network with transferred features from almost any number of layers can improve generalization even after fine-tuning on the target domain dataset. Bengio explores why unsupervised pre-training of representations can be useful and how it can be exploited in the transfer learning scenario. Donahue et al. show that features extracted by deep convolutional neural networks pretrained on ImageNet can outperform many algorithms on a wide range of classification tasks, which illustrates the generality and transferability of deep convolutional neural networks.
III-A VGG Network
We include a brief review of VGG, which is employed by our proposed method. VGG consists of 16 to 19 weight layers and uses very small (3×3) convolution filters. Fig. 1 shows the overall architecture of the VGG16 network. Although the VGG architecture is simple, it is widely used in many computer vision tasks. In our experiments, we use a VGG face model pretrained on a face verification task. Although the original task differs from facial beauty prediction, the model performs remarkably well on it. We attribute this mainly to the strong feature representation power of deep CNNs.
III-B Deep Feature Extraction
Several research works [22, 16] show that deep convolutional neural networks learn increasingly powerful representations as the feature hierarchy becomes deeper. However, because labeled face images are limited, training a deep convolutional neural network directly may lead to severe overfitting. Transfer learning has recently attracted much attention; it enables us to fine-tune a pretrained model or simply treat the learned network as a feature extractor for our task.
We extract facial features with a VGG face model pretrained on a face verification task. Although its target task differs from facial beauty prediction, the features achieve remarkable performance, which indicates, to some degree, the strong feature representation power of CNNs. Research shows that features in lower layers contain more detailed information, while features in higher layers carry more semantic meaning. Our method concatenates features from both a relatively low layer and a relatively high layer as the facial representation. We also use HOG, grayscale, and LBP features in our experiments as baselines to evaluate the feature extraction capacity of deep CNNs.
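The fusion of a lower-layer and a higher-layer feature map can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `fuse_feature_maps` is hypothetical, random arrays stand in for real activations, and the shapes assume a VGG16-style network with 224×224 input, where conv4_1 produces a 28×28×512 map and conv5_1 a 14×14×512 map.

```python
import numpy as np


def fuse_feature_maps(low_map, high_map):
    """Flatten two feature maps and concatenate them into one vector.

    Hypothetical helper for illustration; the paper does not name its
    fusion routine.
    """
    low_vec = np.asarray(low_map, dtype=np.float32).ravel()
    high_vec = np.asarray(high_map, dtype=np.float32).ravel()
    return np.concatenate([low_vec, high_vec])


# Random stand-ins for activations of a VGG16-style network (assumed shapes).
rng = np.random.default_rng(0)
conv4_1 = rng.standard_normal((28, 28, 512))  # relatively low layer
conv5_1 = rng.standard_normal((14, 14, 512))  # relatively high layer

fused = fuse_feature_maps(conv4_1, conv5_1)
print(fused.shape)  # (501760,) = 28*28*512 + 14*14*512
```

The fused vector is what a downstream regressor would consume; in practice the two maps would come from a forward pass of the pretrained face model rather than from a random generator.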
III-C Bayesian Ridge Regression
We feed the concatenated feature vectors into a Bayesian ridge regressor. Bayesian ridge regression includes the regularization parameters in the estimation procedure: the regularization term is not fixed by hand in the cost function, but tuned to the data distribution. The $\ell_2$ regularization used in ridge regression is equivalent to maximum a posteriori estimation of the parameters $w$ under a Gaussian prior with precision $\lambda$.
The output $y$ is assumed to be Gaussian distributed around $Xw$ in order to form a fully probabilistic model:
$$p(y \mid X, w, \alpha) = \mathcal{N}(y \mid Xw, \alpha^{-1})$$
The Bayesian ridge regressor estimates a probabilistic model of the regression problem. The prior for the parameter $w$ is given by a spherical Gaussian:
$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I)$$
The priors over $\alpha$ and $\lambda$ are chosen to be Gamma distributions. The parameters $w$, $\alpha$, and $\lambda$ are estimated jointly during the fit procedure. The remaining hyperparameters are the parameters of the Gamma priors over $\alpha$ and $\lambda$. All the parameters are tuned by maximizing the marginal log likelihood.
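A minimal sketch of fitting such a regressor with Scikit-Learn's `BayesianRidge` is given below. The data are synthetic stand-ins (assumed here purely for illustration) for the fused deep features and the beauty scores; the default Gamma hyperprior parameters of the estimator are used, which may differ from the settings used in the experiments.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic stand-ins: 200 samples with 10-dimensional features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_w = rng.standard_normal(10)
y = X @ true_w + 0.1 * rng.standard_normal(200)

# Default Gamma hyperpriors (alpha_1, alpha_2, lambda_1, lambda_2 = 1e-6).
model = BayesianRidge()
model.fit(X[:160], y[:160])

# Predictive mean and standard deviation for the held-out samples.
y_mean, y_std = model.predict(X[160:], return_std=True)

print("estimated noise precision:", model.alpha_)
print("estimated weight precision:", model.lambda_)
```

Because the noise and weight precisions are estimated from the data, no separate cross-validation loop over a regularization strength is needed in this sketch.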
We implement our method with TensorFlow and Scikit-Learn on an Ubuntu server with an NVIDIA Tesla K80 GPU and an Intel Xeon CPU.
IV-A SCUT-FBP Dataset
The SCUT-FBP dataset contains images of 500 Asian females. Each image is scored by 10 raters, and the main task is to build a computational model that predicts the average score of each portrait image.
Since the images in SCUT-FBP are not the same size and deep CNNs accept only fixed-size square input, we apply three methods, named “Crop”, “Warp”, and “Padding”, to obtain square images. In the “Crop” setting, we detect the face with a face detector, crop the face region, and resize it to 224×224. In the “Warp” setting, we simply warp the whole image to 224×224, ignoring its aspect ratio. In the “Padding” setting, we resize the longer side to 224 and zero-pad the shorter side to form a 224×224 image (see Fig. 2). We also normalize the input image by subtracting the mean and dividing by the standard deviation of the pixels. Furthermore, if face detection fails, we manually crop the central region of the image and use it as the network input. On the SCUT-FBP dataset, we concatenate the features of the conv5_1 and conv4_1 layers. The pipeline is shown in Fig. 3.
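The “Padding” setting and the pixel normalization can be sketched in NumPy as follows. The function names are hypothetical, nearest-neighbor resizing is used only to keep the example self-contained, and centering the image on the zero canvas is an assumption, since the text does not say where the padding is placed.

```python
import numpy as np


def resize_nearest(img, new_h, new_w):
    """Nearest-neighbor resize (a stand-in for a proper image library)."""
    h, w = img.shape[:2]
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return img[rows][:, cols]


def pad_to_square(img, side=224):
    """Resize the longer side to `side`, then zero-pad the shorter side."""
    h, w = img.shape[:2]
    scale = side / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    resized = resize_nearest(img, new_h, new_w)
    canvas = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    # Assumption: the resized image is centered on the zero canvas.
    top, left = (side - new_h) // 2, (side - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas


def standardize(img):
    """Subtract the pixel mean and divide by the pixel standard deviation."""
    img = img.astype(np.float32)
    return (img - img.mean()) / (img.std() + 1e-8)
```

For a 300×200 portrait, `pad_to_square` yields a 224×224 image whose content occupies a 224×149 band, with zero columns on both sides.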
IV-B Performance Evaluation
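In our experiments, we use Pearson Correlation (PC), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) as the criteria for evaluating our method:
$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| h(x_i) - y_i \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right)^2}$$
$$\mathrm{PC} = \frac{\sum_{i=1}^{m} \left( h(x_i) - \bar{h} \right) \left( y_i - \bar{y} \right)}{\sqrt{\sum_{i=1}^{m} \left( h(x_i) - \bar{h} \right)^2} \sqrt{\sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2}}$$
where $m$ denotes the number of images, $x_i$ denotes the input feature vector of image $i$, $h$ denotes the learning algorithm, $y_i$ denotes the ground-truth attractiveness score of image $i$, and $\bar{h}$ and $\bar{y}$ denote the means of $h(x_i)$ and $y_i$ over the $m$ images.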
MAE and RMSE measure the fit quality of the learning algorithm; the closer the value is to zero, the better the performance. PC measures the linear correlation between the predicted scores and the ground-truth scores. Its value lies between -1 and 1, where 1 means perfect positive linear correlation, 0 means no linear correlation, and -1 means perfect negative linear correlation.
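The three criteria can be computed with a few lines of plain Python. This sketch uses the standard textbook definitions; the function names are illustrative, and `pred` and `truth` are assumed to be equal-length sequences of predicted and ground-truth scores.

```python
import math


def mae(pred, truth):
    """Mean absolute error between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def rmse(pred, truth):
    """Root mean squared error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def pearson(pred, truth):
    """Pearson correlation between predictions and ground truth."""
    m = len(truth)
    mean_p = sum(pred) / m
    mean_t = sum(truth) / m
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)
```

Note that a constant offset in the predictions leaves PC unchanged while increasing MAE and RMSE, which is why the three criteria are reported together.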
To make the prediction more reliable and reproducible, we follow the protocol of prior work for fair comparison. We randomly select 400 images as the training set and use the remaining 100 images as the test set. We repeat the experiment 5 times and average the results as the final performance to reduce sampling variance. The results are shown in TABLE I.
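The evaluation protocol, five random 400/100 splits whose results are averaged, can be sketched as below. The helper names and the `evaluate` callback are hypothetical; `evaluate` is assumed to train on the given training indices and return one metric value for the test indices.

```python
import random
from statistics import mean


def repeated_holdout(n_samples, n_train, n_runs, seed=0):
    """Yield (train_idx, test_idx) pairs from independent random shuffles."""
    rng = random.Random(seed)
    for _ in range(n_runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


def average_metric(evaluate, n_samples=500, n_train=400, n_runs=5, seed=0):
    """Average a metric over repeated random train/test splits.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that fits a
    model on the training indices and returns a score on the test indices.
    """
    scores = [
        evaluate(train_idx, test_idx)
        for train_idx, test_idx in repeated_holdout(n_samples, n_train, n_runs, seed)
    ]
    return mean(scores)
```

Fixing the seed makes the five splits reproducible across runs, which matters when comparing feature descriptors on the same partitions.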
TABLE II compares our performance with that of other methods. The best result is marked in bold and the second best is underlined. Our method ranks second on the SCUT-FBP dataset.
|Combined Features+SVR ||0.5120||0.3961||0.6433|
|Combined Features+Gaussian Reg ||0.5149||0.3931||0.6482|
Performance comparison with other methods. Our method ranks second on PC and first on RMSE and MAE. The best and second-best results are shown in bold and underlined, respectively. The RMSE and MAE of the CNN-based methods are not reported in the original papers and are hence denoted with “-”.
IV-C Ablation Analysis
It is almost common sense in machine learning practice that “features matter”. To illustrate the feature extraction capability of deep learning, we conduct experiments with different features, including HOG, LBP, gray image, and transferred deep features, for performance comparison and visualization:
Raw Grayscale: we convert the RGB facial images into grayscale, and the flattened grayscale pixel values are used as the feature.
HOG: HOG is an image feature descriptor that is widely used in computer vision and image processing for object detection tasks. Details can be found in the original HOG paper.
LBP: LBP is a feature descriptor that focuses on texture details and is widely used in many machine vision tasks.
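Two of the baseline descriptors listed above can be sketched in NumPy. The following is an illustrative version only: the luminance weights are the common ITU-R BT.601 coefficients (an assumption, since the conversion is not specified), and the LBP shown is the basic 3×3, 8-neighbor variant with a global 256-bin histogram, which may differ from the exact variant used in the experiments.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale (BT.601 weights, assumed)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.asarray(rgb, dtype=np.float64)[..., :3] @ weights


def raw_gray_feature(rgb):
    """Flattened grayscale pixel values used directly as the feature vector."""
    return to_grayscale(rgb).ravel()


def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 3x3 local binary patterns."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Eight neighbors, visited clockwise starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit where the neighbor is at least as bright as the center.
        codes |= (neighbor >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()
```

Both functions return fixed-length vectors for a fixed input size, so they can be fed to the same regressor as the deep features for a like-for-like comparison.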
Performance comparison with other feature descriptors using Bayesian ridge regression. Our transferred deep features outperform the other descriptors by a large margin. The best results are given in bold.
In addition, we compare the feature performance from different layers to find which layer produces the most discriminative features (See Fig. 5).
Moreover, among the three preprocessing methods (Crop, Warp, and Padding), Crop achieves the best performance on SCUT-FBP, which indicates that the facial region plays the more significant part in beauty perception, while the background may act as noise in facial beauty prediction on the SCUT-FBP dataset (see TABLE IV).
Fig. 5 shows that performance improves as the layer goes deeper and peaks at conv5_1. However, when the feature maps are flattened into vectors, we see a sharp drop in performance, which may be attributed to the heavy loss of spatial information.
Performance of different preprocessing methods (“Crop”, “Warp”, and “Padding”) on SCUT-FBP. “Crop” achieves the best performance.
IV-D ECCV HotOrNot Dataset
The ECCV HotOrNot dataset contains 2056 faces collected from the Internet. Each face is labeled with a score, and the dataset has already been split into 5 training/test partitions. Unlike the SCUT-FBP dataset, the faces in the ECCV HotOrNot dataset are more challenging because of varied postures, cluttered backgrounds, illumination changes, low resolution, and unaligned faces, all of which make facial beauty prediction more difficult (see Fig. 6).
The ECCV HotOrNot dataset uses Pearson Correlation (PC) as its performance metric. We also report MAE and RMSE for a more detailed comparison.
IV-E Ablation Study
We concatenate the feature maps of two layers and flatten them to form more informative features. The concatenated features are then fed into the Bayesian ridge regression algorithm.
We implement two solutions to evaluate the impact of preprocessing. In solution A, we run a face detector to detect 68 facial landmarks and the facial region. For grayscale images, we replicate the gray pixel values twice to form a three-channel image. We then calculate the inclination angle between the line through the two eye coordinates and the horizontal. When this angle meets the rotation condition, we rotate the face around its central point to compensate for the inclination and crop the facial region. The mean pixel value is subtracted from the cropped image, which is then divided by its standard deviation. Solution B applies only mean subtraction and division by the standard deviation to the original images. No additional preprocessing is performed.
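The eye-based alignment step in solution A can be illustrated with a small geometric sketch. The function names are hypothetical, the eye coordinates are assumed to be (x, y) pixel positions taken from the detected landmarks, and the rotation criterion itself is not modeled because its value is not recoverable from the text.

```python
import math


def eye_inclination_deg(left_eye, right_eye):
    """Angle in degrees between the eye line and the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, center, angle_deg):
    """Rotate `point` around `center` by `angle_deg` (standard 2D rotation)."""
    a = math.radians(angle_deg)
    x = point[0] - center[0]
    y = point[1] - center[1]
    return (
        center[0] + x * math.cos(a) - y * math.sin(a),
        center[1] + x * math.sin(a) + y * math.cos(a),
    )


# Example with made-up landmark positions: rotating by the negative of the
# inclination angle brings both eyes onto the same horizontal line.
left, right = (100.0, 120.0), (160.0, 150.0)
center = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)
angle = eye_inclination_deg(left, right)
left_aligned = rotate_point(left, center, -angle)
right_aligned = rotate_point(right, center, -angle)
```

In a full pipeline the same rotation would be applied to the image itself with an image library; the point-wise version here only shows how the angle is derived and used.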
|solution A||solution B|
The ECCV HotOrNot dataset has been divided into 5 parts, each containing a training set and a test set. We compare the performance of solution A and solution B; the results can be found in TABLE V. Much to our surprise, solution B outperforms solution A by a large margin. We believe the main reason is that the annotators also took non-facial information such as hairstyle, posture, and clothing into consideration when labeling the facial beauty scores, instead of judging the face region alone.
Additionally, we define an error measure between the predicted facial beauty score and the ground-truth beauty score. If the error is above an upper threshold, we consider the prediction severely biased relative to the ground-truth score. If the error is below a lower threshold, we consider that our algorithm fits the sample well.
In this part, we fix both thresholds for detailed analysis (see Fig. 7 and Fig. 8). We believe the performance could be greatly improved through face alignment techniques. In addition, posture and facial expression may also contribute to beauty perception, since our algorithm fails on samples with varied postures.
Table VI compares the Pearson Correlation of our proposed method with five state-of-the-art methods. Our method outperforms the other methods and achieves the best performance on the ECCV HotOrNot dataset without face alignment.
|Single Layer Model||0.417|
|Two Layer Model||0.438|
|Multiscale Model ||0.458|
|Auto Encoder ||0.437|
In this paper, we propose a method that extracts rich deep facial features through knowledge adaptation and then trains a Bayesian ridge regression algorithm for facial beauty prediction. Although the VGG model is pretrained for a different task, it captures more descriptive information than conventional hand-crafted features and even outperforms many deep learning-based methods in our facial beauty prediction task, which shows the generality of deep features in transfer learning. With our feature fusion strategy, our method achieves state-of-the-art performance on the ECCV HotOrNot dataset without face alignment and comparable performance on the SCUT-FBP dataset. In future work, we plan to explore 3D face alignment and novel network architectures for extracting more descriptive features.
This work was primarily supported by Foundation Research Funds for the Central Universities (Program No.2662017JC049) and State Scholarship Fund (NO.261606765054).
-  D. I. Perrett, K. A. May, and S. Yoshikawa, “Facial shape and judgements of female attractiveness,” Nature, vol. 368, no. 6468, pp. 239–42, 1994.
-  Y. Liu, Z. Xie, X. Yuan, J. Chen, and W. Song, “Multi-level structured hybrid forest for joint head detection and pose estimation,” Neurocomputing, vol. 266, no. Aug, pp. 206–215, 2017.
-  D. Zhang, F. Chen, and Y. Xu, Computer models for facial beauty analysis. Springer, 2016.
-  D. Xie, L. Liang, L. Jin, J. Xu, and M. Li, “Scut-fbp: A benchmark dataset for facial beauty perception,” in Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on. IEEE, 2015, pp. 1821–1826.
-  D. Gray, K. Yu, W. Xu, and Y. Gong, “Predicting facial beauty without landmarks,” ECCV 2010, pp. 434–447, 2010.
-  F. Chen, X. Xiao, and D. Zhang, “Data-driven facial beauty analysis: Prediction, retrieval and manipulation,” IEEE Transactions on Affective Computing, 2016.
-  Y. Eisenthal, G. Dror, and E. Ruppin, “Facial attractiveness: Beauty and the machine,” Neural Computation, vol. 18, no. 1, pp. 119–142, 2006.
-  G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical representations for face verification with convolutional deep belief networks,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2518–2525.
-  A. Kagian, G. Dror, T. Leyvand, D. Cohen-Or, and E. Ruppin, “A humanlike predictor of facial attractiveness,” in Advances in Neural Information Processing Systems, 2007, pp. 649–656.
-  G. Lu, X. Xiao, and F. Chen, “A new face beauty prediction model based on blocked lbp,” in International Conference on Computer Vision Theory and Applications, 2016, pp. 87–92.
-  D. Zhang, Q. Zhao, and F. Chen, “Quantitative analysis of human facial beauty using geometric features,” Pattern Recognition, vol. 44, no. 4, pp. 940–950, 2011.
-  J. Gan, L. Li, Y. Zhai, and Y. Liu, “Deep self-taught learning for facial beauty prediction,” Neurocomputing, vol. 144, pp. 295–303, 2014.
-  J. Xu, L. Jin, L. Liang, Z. Feng, and D. Xie, “A new humanlike facial attractiveness predictor with cascaded fine-tuning deep learning model,” arXiv preprint arXiv:1511.02465, 2015.
-  S. Wang, M. Shao, and Y. Fu, “Attractive or not?: Beauty prediction with attractiveness-aware encoders and robust late fusion,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 805–808.
-  J. Xu, L. Jin, L. Liang, Z. Feng, D. Xie, and H. Mao, “Facial attractiveness prediction using psychologically inspired convolutional neural network (pi-cnn),” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 1657–1661.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on CVPR, 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on CVPR, 2016, pp. 770–778.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in neural information processing systems, 2014, pp. 3320–3328.
-  Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in International conference on machine learning, 2014, pp. 647–655.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition.” in BMVC, vol. 1, no. 3, 2015, p. 6.
-  X. Yuan, D. Li, D. Mohapatra, and M. Elhoseny, “Automatic removal of complex shadows from indoor videos using transfer learning and dynamic thresholding,” Computers and Electrical Engineering, in press, doi:10.1016/j.compeleceng.2017.12.026, 2018.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” J. of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  D. E. King, Dlib-ml: A Machine Learning Toolkit. JMLR.org, 2009.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
-  D. J. MacKay, “Bayesian interpolation,” Neural computation, vol. 4, no. 3, pp. 415–447, 1992.