GAN-Based Facial Attractiveness Enhancement

06/04/2020 ∙ by Yuhongze Zhou, et al. ∙ 95

We propose a generative framework based on generative adversarial networks (GANs) that enhances facial attractiveness while preserving facial identity and high fidelity. Given a portrait image as input, we apply gradient descent to recover a latent vector from which this framework can synthesize an image resembling the input; semantic beauty editing of the recovered latent vector, based on InterFaceGAN, then lets the framework beautify the facial image. We compare our system with Beholder-GAN and with our proposed result-enhanced version of Beholder-GAN, and our framework obtains state-of-the-art attractiveness enhancement results. The code is available at




1 Introduction

In contemporary society, the pursuit of beauty permeates daily life, and a person with an attractive facial appearance enjoys more compliments and opportunities than someone who is less facially attractive but otherwise comparable. Digital facial beautification is in high demand and has produced many applications, e.g. professional retouching of facial images for commercial use and social life, support for plastic surgery and orthodontics, selection of make-up and hair styles, and beautification of applicants in screening situations such as entertainment and modeling [21].

Nowadays, beautifying facial images with GANs is still an intriguing and relatively new topic worth delving into, because work in this area remains scarce. Existing GAN-based methods are [9, 28]. Although these methods have opened a new field of vision in generating beautified images, they also raise challenging problems, such as loss of image quality and failure to keep identity. CGANs dominate in both methods, which costs synthesis resolution and quality. Beholder-GAN [9] does shed light on the biological and seemingly ingrained congruence of high beauty ratings across race, social class, gender and age [21]: in its results, the higher the beauty score, the more a person tends to be feminized, rejuvenated and whitened. Even so, identity preservation in Beholder-GAN still cannot meet the usual expectation that key identity attributes, such as sex, are kept. We therefore conclude that the beautification of Beholder-GAN cannot control the identity of generated images well.

Major contributions in this paper include:

  • We take advantage of the high-resolution, high-quality image generation of StyleGAN [18] and of the semantic face editing found by interpreting the latent space of GANs [34] to generate high-quality beautified images.

  • We address the drawbacks of Beholder-GAN by building on it, and compare our method with Beholder-GAN and a modified Beholder-GAN in four respects: realism of synthesized images, quality of synthesized images, identity preservation of generated images, and the likelihood of successful beautification.

Identity preservation here mainly means successful reconstruction of the original image and an identity-preserving beautified image even at a fairly high beauty level. Too much beautification usually pushes the image away from the original identity, but the degree of beautification can be controlled by parameters, so this should not affect the evaluation of beautification. Our proposed approach achieves excellent performance in both image quality and identity preservation.

2 Overview

The rest of the paper is organized as follows: Section 3 reviews previous work on facial beautification, facial attractiveness estimation and the GANs that provide the foundation for beautification. Section 4 describes our approach. Evaluation experiments are detailed in Section 5. Section 6 presents further discussion and limitations. Section 7 concludes the paper.

3 Related Work

3.1 Popular Beautification Methods

Traditional face beautification can be divided into 2D and 3D approaches. Make-up and hair-style transfer can also achieve excellent beautifying effects.

3.1.1 2D Facial Image Beautification

In facial image beautification, apart from some novel approaches, many achievements have been obtained in facial shape beautification and facial skin beautification [41]. In other words, image pixel color correction [41, 4] and face geometric correction [23, 22, 29] are prevailing and effective directions that do not affect identity.

[41] improved a multi-level median filtering method and performed the filtering operation seven times on each facial image to remove facial blemishes, after using the ASM model to obtain facial landmarks. [4] proposed a system that removes wrinkles and spots but preserves skin roughness using a nonlinear filter.

[23] trains a Deep Beauty Predictor (DBP) to capture latent relations between face shape (facial landmarks) and beauty score, and then modifies the facial landmarks of the original image to achieve beautification. [22] computes distances among facial landmarks to form a “face space”, searches this space for a nearby point with a higher beauty score, and uses a 2D warp to map the input image to the beautified output image.

Most of the methods mentioned above build on automatic beauty rating; beauty is such a subjective concept that data-driven methods for facial attractiveness estimation [10, 40, 9, 39] have attracted increasing attention in the computer vision and machine learning communities. Automatic, human-like beauty prediction also reduces the investment of human effort.

3.1.2 3D Facial Beautification

3D facial beautification can be applied to more professional fields than 2D, such as entertainment and medicine. [19, 25] both corrected facial asymmetry according to the preference for more symmetrical faces, but [25] also applied face-proportion optimization based on neoclassical canons and golden ratios. For the medical beauty industry, there are also many developed systems [7] that help with surgery planning and visualization.

3.2 GANs Related to Beautification

3.2.1 Generative Adversarial Networks (GANs)

In recent years, Generative Adversarial Networks (GANs) [11] have evolved from lacking stability in the training process to being stable and capable of generating high-quality images, e.g. WGAN [5], WGAN-GP [13] and PGGAN [17], and then to more diversified and novel extensions, e.g. StyleGAN [18], which automatically and unsupervisedly separates high-level facial attributes from stochastic variation and generates highly realistic images, a network mapping edge maps to colorized images [16], and image-to-image translation [8, 43, 27].

3.2.2 Conditional GANs (CGANs)

CGANs have been used to generate images with a given feature or attribute, e.g. class-conditional images [12], images of different ages but similar identity [3], face pose synthesis [36], attribute editing [31] and animation reconstruction from facial expressions [32].

3.2.3 Previous GAN-Based Facial Image Beautification

[9] introduced Beholder-GAN, which builds on Progressive Growing of GANs [17], learning from low-resolution to high-resolution images, and on Conditional GANs (CGANs) [30], generating images conditioned on some attribute, e.g. a class label or feature vector. Beholder-GAN uses a variant of PGGAN conditioned on a beauty score to generate realistic facial images. Specifically, it uses the SCUT-FBP5500 dataset [24], in which a distribution of 60 beauty ratings is reported for each subject, to train a beauty rater, and uses this model to label the CelebA-HQ dataset [17] to enrich the GAN training data. Recovery of the latent vector and beauty score from the input image, followed by beautified image generation with the trained Beholder-GAN, completes the pipeline. [28] quantifies correlations between beauty and facial attributes and extends StarGAN [8] to transfer facial attributes for beautification.

3.2.4 Image2LatentVector

One critical part of preserving the identity of a GAN-generated image is to find a latent vector corresponding to an image similar to the original one. In general, there are two approaches: 1) train an encoder that maps an image to a latent vector [20]; 2) initialize a latent vector and optimize it by gradient descent [26, 1]. We follow the more popular and stable way, i.e. latent vector recovery by optimization.
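As a concrete illustration of approach 2), the loop below recovers a latent vector by gradient descent. The generator is reduced to a toy linear map W so the sketch stays self-contained; the real pipeline differentiates through a pretrained GAN with perceptual losses, so this is a minimal sketch, not the paper's implementation:

```python
import numpy as np

def recover_latent(W, target, steps=5000, lr=0.002):
    """Gradient descent on ||W z - target||^2, the simplest form of
    latent recovery; W stands in for a differentiable generator."""
    rng = np.random.default_rng(0)
    z = rng.standard_normal(W.shape[1])   # randomly initialized latent
    for _ in range(steps):
        residual = W @ z - target         # generator output minus target
        z -= lr * 2.0 * W.T @ residual    # gradient of the squared error
    return z
```

In practice the objective adds perceptual and regularization terms, and techniques such as the stochastic clipping of [26] keep the recovered latent inside the prior's typical range.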

3.2.5 Image Attributes Editing

One possible approach, inspired by previous beautification work, is to translate images between two domains, as in CycleGAN [43], image-to-image translation [27] and its extended multi-class version, StarGAN [8]. Another viable method, unlike those above that feed data containing class/attribute information into the network, is to investigate the latent spaces of traditional GANs and search for internal relations between image semantics and the input latent vector [34].

4 Approach

From our experiments, making a GAN generate images at varying beauty levels, or one beautified image from an input image, involves two main tasks: 1) recover/map an input image to a latent vector from which the generator can produce an image resembling the original; 2) control the GAN input related to the beauty of the generated image so as to realize beautification. The framework we describe here is StyleGAN- and InterFaceGAN-based beautification.

4.1 InterFaceGAN

A well-trained traditional GAN can be regarded as a deterministic function g: Z → X that maps a latent vector z ∈ Z ⊆ R^d, usually drawn from a Gaussian distribution and carrying semantic information, to an image x = g(z), where x ∈ X is the generated image. A further scoring function f_S: X → S maps images into a semantic space S ⊆ R^m with m attributes.

In [18], while proposing the separability metric, the authors also mentioned that it should be possible to find direction vectors corresponding to individual factors of gradual image change in a sufficiently disentangled latent space. There are some 2D linear latent-code interpolation face-morphing works based on StyleGAN as well. However, [34] officially introduced the assumption that, for any binary attribute, there exists a hyperplane in the latent space acting as the separation boundary: semantics stay consistent while the latent code remains on one side of the hyperplane and turn to the opposite on the other side. They then empirically demonstrated and evaluated this.

The “distance” defined in [34] is d(n, z) = nᵀz, where n ∈ R^d is the unit normal vector of a given hyperplane and z is a sample latent code.

When a latent code “vertically” crosses such a hyperplane for a certain binary attribute from one side to the opposite, i.e. the edited code z′ = z + αn moves d(n, z) across zero, this attribute of the image varies accordingly from negative to positive, with a high probability that other attributes are preserved. Besides, according to Property 2 in [34], random latent codes are very likely to lie near enough to the given hyperplane. Therefore, images can be edited from their original state to one containing a certain positive attribute.
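In code, the manipulation is a one-line vector operation; `edit_latent` and its argument names are illustrative rather than taken from the InterFaceGAN codebase:

```python
import numpy as np

def edit_latent(z, n, alpha):
    """Shift latent code z along the unit normal n of an attribute
    hyperplane; since d(n, z) = n . z, adding alpha * n raises the
    signed distance by exactly alpha."""
    n = n / np.linalg.norm(n)   # ensure a unit normal
    return z + alpha * n
```

Sweeping alpha over a range of “distances” moves the image gradually toward the attribute-positive side while other attributes tend to stay fixed.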

4.2 StyleGAN and InterFaceGAN Based Beautification

Given the state-of-the-art performance of StyleGAN, we decided to combine it with InterFaceGAN. [34] successfully evaluated the correctness of its framework's assumption that, for an arbitrary well-trained traditional GAN, there is a hyperplane in its latent space that can separate any binary attribute. We followed the novel single-attribute manipulation in latent space proposed by [34] and found that, in the latent space (with the truncation trick), beauty score, a continuous value, can be separated like a binary attribute.

Figure 1: Framework of StyleGAN and InterFaceGAN Based Beautification

4.2.1 Generate hyperplane train dataset

We randomly generated 40K sample images from StyleGAN as the dataset and stored their corresponding latent vectors and beauty scores, which were predicted by the same rating model as in [9].

4.2.2 Train beauty hyperplane

In the latent space, we trained a linear SVM on beauty score with 5600 positive and 5600 negative samples, and then evaluated it on the validation set (4800 samples) and on the remaining samples of the full 40K random set. The accuracy on both sets showed that a beauty hyperplane exists in the latent space of StyleGAN and that the face editing proposed in [34] can reasonably be applied to beautification.
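A minimal sketch of fitting such a separation boundary on latent codes: the paper trains a linear SVM, which is replaced here by plain logistic regression so the snippet needs no external dependency; both produce a normal vector w whose hyperplane separates high-beauty from low-beauty codes:

```python
import numpy as np

def fit_hyperplane(latents, labels, steps=2000, lr=0.5):
    """Fit a linear separator (logistic-regression stand-in for the
    paper's linear SVM) on latent codes labeled 0/1 by beauty."""
    y = labels * 2.0 - 1.0                    # {0, 1} -> {-1, +1}
    w = np.zeros(latents.shape[1])
    b = 0.0
    for _ in range(steps):
        margins = np.clip(y * (latents @ w + b), -500, 500)
        coef = -y / (1.0 + np.exp(margins))   # log-loss gradient wrt margin
        w -= lr * (latents.T @ coef) / len(y)
        b -= lr * coef.mean()
    norm = np.linalg.norm(w)
    return w / norm, b / norm                 # unit normal and offset
```

The returned unit normal plays the role of the beauty hyperplane normal used for editing in Section 4.1.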

4.2.3 Beautification

We used the StyleGAN-Encoder [6] to obtain a recovered latent vector from the input image. The loss used for optimization combines several terms between the target and predicted images: a perceptual term on the feature output of the j-th VGG-16 layer; a regularization term between the estimated latent vector and the average latent vector of the batch; Multi-Scale Structural Similarity (MS-SSIM); the perceptual metric proposed by [42]; and an adversarial term that treats the discriminator output for the predicted image as part of the loss. Face editing was then applied with a start “distance” of 0.0 and an end “distance” of 3.0.
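Structurally, this objective is just a weighted sum of loss terms. In the sketch below, `recovery_loss` and the term names are placeholders; the actual VGG-16 feature distance, MS-SSIM, LPIPS [42] and discriminator score are assumed to be supplied as callables:

```python
def recovery_loss(target, pred, terms, weights):
    """Weighted aggregate of reconstruction terms; `terms` maps a name
    (e.g. 'vgg', 'msssim', 'lpips', 'disc') to a callable loss."""
    return sum(weights[name] * fn(target, pred) for name, fn in terms.items())
```

Each callable receives the target and predicted images, so individual terms can be swapped or re-weighted without touching the optimization loop.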

5 Experiments

In this section, we compare our framework with the current method, Beholder-GAN, and with an enhanced version of it, which, to some extent, solves the identity-preserving weakness of Beholder-GAN.

5.1 Enhanced Version of Beholder-GAN

5.1.1 A GAN conditioned on a beauty score and identity feature

This version of Beholder-GAN is inspired by identity-preserving GANs that use facial features as generator input and propose their own identity-preserving loss functions [35, 15]. Based on Beholder-GAN, whose generator is conditioned on a beauty score, we added an identity feature as another condition and reconstructed the generator loss function and the discriminator output. To enrich the training dataset, following [9], we used its beauty rating model and FaceNet [33] to label the FFHQ dataset with beauty scores and identity features.


The generator synthesizes I′ = G(z, b, f), where b denotes the beauty score rating vector, f is a normalized 512-D feature embedding extracted from FaceNet, z is a latent vector drawn from a Gaussian distribution and I′ is the generated image. The generator loss function is


L_G = L_adv + λ_ip · L_ip, where L_adv is an adversarial loss distinguishing real images from synthesized fake images [14], L_ip denotes an identity-preserving loss, and its weight λ_ip is 0.5 in our network. Preserving identity during training is one key part of our method, alongside the subsequent identity-preserving image reconstruction from the input image. In this work, we adapted the identity-preserving loss originally exploited by [15], computing a cross entropy between the FaceNet predictions on the real and fake images respectively.
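Assuming the recognition network outputs identity class probabilities, the cross-entropy identity term can be sketched as follows; `identity_loss` is an illustrative stand-in rather than the paper's code:

```python
import numpy as np

def identity_loss(p_real, p_fake, eps=1e-8):
    """Cross entropy between the recognition network's predictions on
    the real image (p_real) and the generated image (p_fake);
    minimizing it pulls the fake face's identity toward the real one."""
    p_fake = np.clip(p_fake, eps, 1.0)   # avoid log(0)
    return -np.sum(p_real * np.log(p_fake))
```

In the generator objective this term would be scaled by its weight of 0.5 and added to the adversarial loss.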

Besides the original Beholder-GAN's discriminator predicting the beauty score as well as the real-vs-fake probability, we reconstructed the PGGAN structure of the discriminator to fit the additional feature condition, and applied a regression loss between the estimated and target condition as the score-estimation term of the discriminator loss, where the condition is the concatenation of the beauty score and the identity feature.


5.1.2 Beautification

Building on the enhanced Beholder-GAN, which can learn beauty through the mixed input label, another modification is a more stable and accurate face reconstruction from the input image, despite the larger-dimensional label input and additional parameters.

We initialized the input of the gradient-descent latent vector recovery with a ResNet-estimated latent vector and random labels, and used an aggregate of separate loss functions between the target and predicted images, each with its corresponding label, as the final objective, optimized with a constant learning rate, to recover the image-corresponding latent vector. The target label is estimated by FaceNet and the beauty rater we trained. We used a method similar to the stochastic clipping proposed by [26] to clip the labels to a certain range after every descent step.

After recovering the latent vector z, beauty score b and facial feature f, we fed the recovered, fixed z and f together with a continuously increased beauty score b′ > b, using the recovered b as its baseline, into the feed-forward model G(z, b′, f) to obtain beautified facial images.
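The clipping step can be sketched as below; `stochastic_clip` is an illustrative helper in the spirit of [26], not code from the paper:

```python
import numpy as np

def stochastic_clip(values, lo, hi, rng=None):
    """Stochastic clipping: entries pushed outside [lo, hi] by a
    gradient step are resampled uniformly inside the range instead of
    being pinned to the boundary."""
    rng = rng or np.random.default_rng()
    out = values.copy()
    mask = (values < lo) | (values > hi)
    out[mask] = rng.uniform(lo, hi, size=int(mask.sum()))
    return out
```

Applied to the recovered labels after every descent step, this keeps the beauty score and identity feature inside their valid ranges without accumulating mass at the range limits.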

The combination of these modifications endows the GAN with the ability to keep basic personal attributes, especially during face reconstruction and when a fairly high beauty score pushes the generated image away from the person's identity.

5.2 Evaluation

Method | FID | OB | RB | B-IP | OFD | RA
① BGAN | 371.8965 | 7.0443 | 4.9705 | 29.4391% | 1.3456 | N/A
② | 371.8965 | 7.0443 | 12.0087 | -70.4754% | 1.0958 | N/A
③ | 371.8965 | 7.0443 | 11.4624 | -62.7197% | 0.9900 | N/A
④ | 86.0071 | 17.2466 | 6.9791 | 59.5334% | 1.3392 | 90.1944%
⑤ | 86.0071 | 17.2466 | 15.1309 | 12.2675% | 1.2118 | N/A
⑥ | 86.0071 | 17.2211 | 14.1685 | 17.7259% | 1.0059 | N/A
⑦ | 123.9326 | 17.2211 | 22.5412 | -30.8931% | 0.9342 | N/A
⑧ MBGAN | 172.2134 | 17.2211 | 12.8857 | 25.1745% | 0.8912 | 94.7667%
⑨ Our Approach | 4.4159 [18] | 9.2128 | 8.9666 | 2.6717% | 0.1299 | 90.9583%
⑩ Our Approach | 9.1837 | 17.2211 | 19.3427 | -12.3202% | 0.1571 | N/A

128×128 resolution, 1024×1024 resolution, 12000K, 6011K, 8908K, 6001K (the number of images used for training)

Table 1: Comparison of Fréchet Inception Distance (FID), Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) of generated images, OpenFace Feature Distance (OFD) between the input and recovered images, and Rating Agreement (RA) of our online survey, among Beholder-GAN (BGAN), Modified Beholder-GAN (MBGAN) and StyleGAN & InterFaceGAN Based Beautification (Our Approach). (OB: Original Input Image BRISQUE; RB: Result Image BRISQUE; B-IP: BRISQUE Incremental Percent)
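The B-IP column is consistent, up to rounding of the tabulated inputs, with the relative BRISQUE change (OB − RB) / OB; a quick check:

```python
def brisque_incremental_percent(ob, rb):
    """B-IP: relative drop in BRISQUE from the original input (OB) to
    the result image (RB). BRISQUE is lower-is-better, so a positive
    B-IP means the result image's quality score improved."""
    return (ob - rb) / ob * 100.0

# Method 1 (BGAN): OB = 7.0443, RB = 4.9705 gives roughly 29.44%
print(brisque_incremental_percent(7.0443, 4.9705))
```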

For evaluation, we calculated the FIDs using the same method as in [18] (random selection of 50,000 images from the training set, with the lowest distance encountered during training reported). We randomly chose 400 images as test inputs to calculate BRISQUE using the ImageMetrics part of [37] and the OpenFace feature distance using [2]. All quantitative evaluations are presented in Table 1. To verify the relationship between beauty score/level and the beautified output, we used the results of these 400 images to conduct an online user survey with 400 paired two-image items, with a beauty-score difference of 0.1 for our reproduced Beholder-GAN and the modified version, and a “distance” of 1.2 from the trained beauty hyperplane for our proposed approach, respectively. The raters were asked to point out samples showing failures of our methods. Fig. 3 shows a few examples where the raters disagreed with the results of these beautification models. In Fig. 2 and 3, the first image in each row is the input image, and the following images increase by 0.05 in beauty score or by 0.3 in “distance”, starting from the recovered beauty score and the beauty hyperplane, from left to right, for the Beholder-GAN-based methods and our method respectively.

Images used by Methods ①-③ are not aligned twice, while Methods ④-⑩ use image alignment. Image alignment lowers image quality a bit.

6 Discussion and Limitation

Although all issues stated below can, to some extent, be controlled and avoided through parameters, improvements in these aspects would make beautification algorithms more robust and practical.

6.1 Discussion

In general, there are two ways to perform GAN-based beautification: a CGAN-based approach and the use of a traditional GAN, which rest on different mechanisms.

In the experiments, once variants of PGGAN reach their maximum resolution, the longer they are trained, the better the FID scores become, but overfitting of the GANs also becomes more likely. Despite FID score fluctuations, the relative relationship of FID scores among these models should still be reflected by these evaluation experiments. The larger the dimension of the label (not one-hot coded variables) added to a CGAN, the worse the realism and quality of the generated images, which our evaluation and results verify and which can be considered a weakness of CGANs, whereas traditional GANs can synthesize more realistic images. We found that the combination of the identity condition and the identity-preserving loss on top of Beholder-GAN, i.e. MBGAN, achieves better results than applying only one of them to Beholder-GAN, which mainly improves the face reconstruction process.

One issue we encountered while modifying Beholder-GAN is that, due to the use of PGGAN and unlike [15], the initial generated images are blurry, which somewhat weakens the functionality of our generator loss in MBGAN and also narrows the choice of face recognition networks, since the recognition network must be insensitive to the pose and quality of facial images. Replacing the network structure with one that works around these issues is one way to improve, as is finding more informative identity features as generator input.

Besides, one common pattern in these beautifications is that beautification of skin and makeup comes first; as the beauty score and the positive distance from the beauty hyperplane grow, it is then combined with facial geometric modification, such as face shape and the five sense organs. The empirical thresholds for this in our experiments are 0.1 and 1.2 for the Beholder-GAN-based methods and our method, respectively, and within these thresholds the beautified images retain more realism and practical value.

6.2 Limitation

For the image recovery results, it is hard for the networks to recover detailed accessories in facial images (Fig. 2). Improving this requires more diverse, comprehensive, accessory-rich, high-quality face databases.

Unlike the advantage the latent space of StyleGAN possesses, how to make conditional variables truly control the information in CGAN-generated images is still a valuable problem worth digging into. From our experimental results and related results in other papers, we found that the latent space of StyleGAN performs excellently at distinguishing binary attributes, while Beholder-GAN and its modified version still lose control over various important facial features of the original images during beautification, such as gender and eyeglasses. For example, given facial images with eyeglasses, despite successfully recovering them, modified Beholder-GAN fails to maintain the eyeglasses attribute as the beauty score increases, in contrast to our approach (Fig. 2), although in these respects modified Beholder-GAN may slightly outperform Beholder-GAN. We assume this disadvantage results from the inability of the input identity features to significantly separate obvious facial attributes from each other. To examine the relationship between the input identity features and binary attributes, we visualize some features using t-SNE [38] (Fig. 2). FaceNet cannot separate these features very well, which might give rise to attributes changing with increased beauty score. Current well-performing face recognition networks are driven by mass data; therefore, obtaining informative, binary-attribute-separated features that can represent any face properly (an analogy to a traditional GAN generating any face, with its latent space likely containing one hyperplane separating each binary attribute) demands very comprehensive and challenging face datasets. One-hot coding of attributes as conditional variables over well-labeled face datasets might be a workaround worth trying.

Another thing we discovered is that the high-resolution StyleGAN produces recovered images of higher sharpness and quality than its low-resolution counterpart, although both achieve excellent identity preservation.

Figure 2: FaceNet feature space of 2000 images from the FFHQ dataset. Each color represents a different attribute a person has in one plot. The attribute for the left plot is gender; for the right plot, wearing eyeglasses.
(a) Method ①
(b) Method ⑤
(c) Method ⑥
Figure 3: Beautification-Disagreed Examples of Three Methods from the Online Survey (the right image is the beautified one)
(a) Beholder-GAN (Method ①)
(b) Method ③
(c) Beholder-GAN (Method ④)
(d) Method ⑥
(e) Method ⑧
(f) Our Approach (Method ⑨)
Figure 2: Six-Method Results for Female, Male, Accessories Loss in Recovered Images and Eyeglasses, from Top to Bottom in a Single Subfigure
(a) Beholder-GAN (Method ①)
(b) Method ③
(c) Beholder-GAN (Method ④)
(d) Method ⑥
(e) Method ⑧
(f) Our Approach (Method ⑨)
Figure 3: Other Selected Six-Method Results

7 Conclusion and Future Work

Our work shows, to a large extent, state-of-the-art results compared with former ones. The discovery and study of the patterns of beauty embedded in GANs is intriguing, whichever aspect we try to disclose, i.e. the traditional latent space or the functions of conditional variables. GAN-based facial attractiveness enhancement still has room for improvement, especially for CGAN-based methods, e.g. more effective and informative identity features as conditional labels, more novel CGAN constructions for beautification, and improvements to the image quality and resolution of CGANs.


  • [1] R. Abdal, Y. Qin, and P. Wonka (2019) Image2StyleGAN: how to embed images into the stylegan latent space?. External Links: 1904.03189 Cited by: §3.2.4.
  • [2] B. Amos, B. Ludwiczuk, and M. Satyanarayanan (2016) OpenFace: a general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118, CMU School of Computer Science. Cited by: §5.2.
  • [3] G. Antipov, M. Baccouche, and J. Dugelay (2017) Face aging with conditional generative adversarial networks. External Links: 1702.01983 Cited by: §3.2.2.
  • [4] K. Arakawa and K. Nomoto (2005-12) A system for beautifying face images using interactive evolutionary computing. In 2005 International Symposium on Intelligent Signal Processing and Communication Systems, pp. 9–12. External Links: Document Cited by: §3.1.1, §3.1.1.
  • [5] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. External Links: 1701.07875 Cited by: §3.2.1.
  • [6] Stylegan-encoder External Links: Link Cited by: §4.2.3.
  • [7] A. Bottino, M. Simone, A. Laurentini, and C. Sforza (2012-09) A new 3-d tool for planning plastic surgery. IEEE transactions on bio-medical engineering 59, pp. . External Links: Document Cited by: §3.1.2.
  • [8] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2017) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. External Links: 1711.09020 Cited by: §3.2.1, §3.2.3, §3.2.5.
  • [9] N. Diamant, D. Zadok, C. Baskin, E. Schwartz, and A. M. Bronstein (2019) Beholder-gan: generation and beautification of facial images with conditioning on their beauty level. External Links: 1902.02593 Cited by: §1, §3.1.1, §3.2.3, §4.2.1, §5.1.1.
  • [10] Y. Eisenthal, G. Dror, and E. Ruppin (2006-02) Facial attractiveness: beauty and the machine. Neural computation 18, pp. 119–42. External Links: Document Cited by: §3.1.1.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. External Links: 1406.2661 Cited by: §3.2.1.
  • [12] G. L. Grinblat, L. C. Uzal, and P. M. Granitto (2017) Class-splitting generative adversarial networks. External Links: 1709.07359 Cited by: §3.2.2.
  • [13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved training of wasserstein gans. External Links: 1704.00028 Cited by: §3.2.1.
  • [14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved training of wasserstein gans. External Links: 1704.00028 Cited by: §5.1.1.
  • [15] R. Huang, S. Zhang, T. Li, and R. He (2017-10) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. pp. 2458–2467. External Links: Document Cited by: §5.1.1, §5.1.1, §6.1.
  • [16] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. External Links: 1611.07004 Cited by: §3.2.1.
  • [17] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. External Links: 1710.10196 Cited by: §3.2.1, §3.2.3.
  • [18] T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. External Links: 1812.04948 Cited by: 1st item, §3.2.1, §4.1, §5.2, Table 1.
  • [19] J. Kim and S. Choi (2009-06) Symmetric deformation of 3d face scans using facial features and curvatures. Journal of Visualization and Computer Animation 20, pp. 289–300. External Links: Document Cited by: §3.1.2.
  • [20] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. External Links: 1312.6114 Cited by: §3.2.4.
  • [21] A. Laurentini and A. Bottino (2014-08) Computer analysis of face beauty: a survey. Computer Vision and Image Understanding 125, pp. . External Links: Document Cited by: §1, §1.
  • [22] T. Leyvand, D. Cohen-Or, G. Dror, and D. Lischinski (2008-08) Data-driven enhancement of facial attractiveness. ACM Trans. Graph. 27, pp. . External Links: Document Cited by: §3.1.1, §3.1.1.
  • [23] J. Li, C. Xiong, L. Liu, X. Shu, and S. Yan (2015) Deep face beautification. In MM ’15, Cited by: §3.1.1, §3.1.1.
  • [24] L. Liang, L. Lin, L. Jin, D. Xie, and M. Li (2018) SCUT-fbp5500: a diverse benchmark dataset for multi-paradigm facial beauty prediction. External Links: 1801.06345 Cited by: §3.2.3.
  • [25] Q. Liao, X. Jin, and W. Zeng (2012-01) Enhancing the symmetry and proportion of 3d face geometry. IEEE transactions on visualization and computer graphics 18, pp. . External Links: Document Cited by: §3.1.2.
  • [26] Z. C. Lipton and S. Tripathi (2017) Precise recovery of latent vectors from generative adversarial networks. External Links: 1702.04782 Cited by: §3.2.4, §5.1.2.
  • [27] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. External Links: 1703.00848 Cited by: §3.2.1, §3.2.5.
  • [28] X. Liu, T. Li, H. Peng, I. C. Ouyang, T. Kim, and R. Wang (2019) Understanding beauty via deep facial features. External Links: 1902.05380 Cited by: §1, §3.2.3.
  • [29] S. Melacci, L. Sarti, M. Maggini, and M. Gori (2010-08) A template-based approach to automatic face enhancement. Pattern Anal. Appl. 13, pp. 289–300. External Links: Document Cited by: §3.1.1.
  • [30] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. External Links: 1411.1784 Cited by: §3.2.3.
  • [31] G. Perarnau, J. Weijer, B. Raducanu, and J. M. Alvarez (2016-11) Invertible conditional gans for image editing. pp. . Cited by: §3.2.2.
  • [32] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer (2018) GANimation: anatomically-aware facial animation from a single image. External Links: 1807.09251 Cited by: §3.2.2.
  • [33] F. Schroff, D. Kalenichenko, and J. Philbin (2015-06) FaceNet: a unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781467369640, Link, Document Cited by: §5.1.1.
  • [34] Y. Shen, J. Gu, X. Tang, and B. Zhou (2019) Interpreting the latent space of gans for semantic face editing. External Links: 1907.10786 Cited by: 1st item, §3.2.5, §4.1, §4.1, §4.1, §4.2.2, §4.2.
  • [35] Y. Shen, P. Luo, J. Yan, X. Wang, and X. Tang (2018-06) FaceID-gan: learning a symmetry three-player gan for identity-preserving face synthesis. pp. 821–830. External Links: Document Cited by: §5.1.1.
  • [36] Cited by: §3.2.2.
  • [37] Learnopencv External Links: Link Cited by: §5.2.
  • [38] L. van der Maaten and G. E. Hinton (2008) Visualizing data using t-sne. Cited by: §6.2.
  • [39] J. Xu, L. Jin, L. Liang, Z. Feng, and D. Xie (2015) A new humanlike facial attractiveness predictor with cascaded fine-tuning deep learning model. External Links: 1511.02465 Cited by: §3.1.1.
  • [40] L. Xu, J. Xiang, and X. Yuan (2018-03) Transferring rich deep features for facial beauty prediction. Cited by: §3.1.1.
  • [41] B. Zhang, X. Xiao, and G. Lu (2018-05-01) Facial beauty analysis based on features prediction and beautification models. Pattern Analysis and Applications 21 (2), pp. 529–542. External Links: ISSN 1433-755X, Document, Link Cited by: §3.1.1, §3.1.1.
  • [42] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. External Links: 1801.03924 Cited by: §4.2.3.
  • [43] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. External Links: 1703.10593 Cited by: §3.2.1, §3.2.5.