As one of the most representative regions of the human body, the face plays a vital role in biometrics. Facial appearance is often deemed a crucial feature for identifying individuals, and most face verification methods determine whether a pair of face images refers to the same person by comparing facial features extracted from the given images. Following the remarkable progress of face verification research [26, 27, 24, 13, 30, 10, 29], the related technology brings great convenience to our lives, with applications ranging from social media to security services. Nevertheless, it is worth noticing that most of these application scenarios involve information security, so the reliability of the algorithms is self-evidently important. Faced with various special situations, there is still a long way to go before the problem is solved thoroughly.
Although the face is an inherent surface area of a person, its appearance in images may alter due to many factors, e.g., view angle, expression and makeup changes. Among these factors, makeup has attracted more attention than ever because of the striking development of facial cosmetics and virtual face beautification. Makeup is usually used to conceal flaws on the face, enhance attractiveness or alter appearance. It is not only females: more and more males regard makeup as a daily necessity and even a courtesy on certain occasions. Compared with view angle and expression changes, which mostly stem from uncooperative behavior, makeup is a special case that may be unavoidable during photographing. However, the appearance changes caused by makeup also decrease verification performance, just like other factors. This paper studies how to remove facial makeup, even in cases with pose and expression variations.
A typical facial makeup style can be divided into three main parts according to location: eye shadow, lipstick and foundation [3, 16]. The combined effects of various cosmetics can lead to significant changes in facial appearance. Since there are theoretically innumerable makeup styles, this nondeterminacy and variability make it quite difficult to match images before and after makeup. Besides, makeup differs from plastic surgery in that it concerns non-permanent changes and is easy to cleanse. In contrast to skin care products, cosmetics are often regarded as harmful to skin health, and thus most people wear makeup only when meeting others. The convenient facial changes induced by makeup have been reported to pose a severe challenge to face verification systems, which often depend heavily on capturing various cues from facial appearance. In this paper, we study the effect of makeup and contrive to alleviate its impact on face verification performance in a generative way.
To address makeup-invariant face verification, many efforts have been made in the past decade, which we sum up in two main streams. Early methods concentrate on extracting discriminative features directly from input images. By maximizing the correlation between images of the same subject, these methods map images into a latent space for better classification performance. The mapping function can be designed manually [11, 7, 4] or learned by a deep network. Another stream resorts to the success of deep generative models and uses makeup removal outputs for verification, represented by [17, 18]. Given a face image with makeup, these methods first generate a non-makeup, identity-preserving image based on the input; the generated images, along with real non-makeup images, are then used for verification. Since this stream makes changes at the image level, it has the advantage of adapting existing verification methods to makeup problems without retraining.
However, [17, 18] formulate makeup removal as a one-to-one image translation problem, which in fact lacks rationality. During training, given a pair of makeup and non-makeup images of a subject, the networks in [17, 18] force the output to resemble the non-makeup ground truth instead of just removing the facial makeup. We show a sample pair in Figure 1, where obvious data misalignment can be observed. Apart from makeup, many other factors differ between the two images, including hair style and background. In this paper, we argue that makeup removal is naturally a task with unpaired data, for it is impossible to capture a pair of images with and without makeup simultaneously. When the makeup is removed, all other factors of the output are expected to remain the same as in the input, on the premise of realistic visual quality.
To this end, we propose a Semantic-Aware Makeup Cleanser (SAMC) to facilitate makeup-invariant face verification via generation. As mentioned above, facial makeup is the result of applying multiple cosmetics, and different regions can exhibit different effects. We therefore raise the idea of removing the cosmetics with tailored schemes. Concretely, we adopt two semantic-aware learning strategies in SAMC, one unsupervised and one supervised. For an input image with makeup, we first utilize an attention module to locate the cosmetics and estimate their degrees. The attention module is jointly learned with the generator in an unsupervised way, since no applicable attention maps are available as supervision. The aim of this process is to obtain explicit knowledge of the makeup automatically and adaptively. Although the attention module suggests where to focus, its confidence and accuracy are not as satisfactory as desired, especially at the beginning of the training phase. To address this issue, we propose a semantic-aware texture loss that constrains the synthesized texture of the different cosmetic regions. In addition, the misalignment problem is overcome by means of face image warping. By warping the non-makeup ground truth according to the facial keypoints of the input, we obtain pixel-wise aligned data, benefiting precise supervision and favorable generation.
2.1 Motivation and Overview
In reality, makeup images are generally acquired under unrestricted conditions, i.e., in the wild. Besides makeup, there exist complex variations in pose, expression and illumination. Although these factors may also affect verification performance and could be adjusted, the proposed SAMC focuses on makeup removal and leaves other factors untouched for better user experience: when a customer uses a makeup removal system, the changes are expected to be limited to makeup. Therefore, given an image with makeup as input, our method aims at removing the makeup in a generative manner while retaining all other information. The output is expected to achieve appealing visual quality as well as support face verification. Different from image style transfer, the effect of cosmetics involves only certain face areas rather than the global image. Moreover, despite the inaccessibility of exactly aligned data, we can acquire image pairs of a subject, one with makeup and the other without. With proper processing and strategies, these image pairs can provide strong supervision.
To fully utilize image pairs without exact alignment, we propose a semantic-aware makeup cleanser that removes facial cosmetics with tailored strategies. This simple yet effective method is implemented by a network that mainly contains a generator and an attention module. We employ image pairs to train the network, where each pair consists of an image with makeup and one without. Note that the two images refer to the same identity but may differ in expression, pose, background, etc. To obtain pixel-wise supervision, we warp the non-makeup ground truth according to the makeup image, yielding a warped non-makeup ground truth. The warping process consists of two steps: 1) detecting 68 facial keypoints, and 2) applying a non-linear transformation. In the following, we elaborate the network and loss functions in detail.
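The keypoint-based warping step can be illustrated with a much-simplified sketch. The paper uses a non-linear scattered-data transformation over 68 landmarks; the toy code below instead fits a closed-form least-squares similarity transform (scale, rotation, translation) between two keypoint sets, using complex arithmetic for the 2D points. All names and data are illustrative, not from the paper.

```python
def fit_similarity(src_pts, dst_pts):
    """Fit q ~ a*p + b in the least-squares sense, where the complex
    coefficient a encodes scale and rotation and b is a translation."""
    p = [complex(x, y) for x, y in src_pts]
    q = [complex(x, y) for x, y in dst_pts]
    mp = sum(p) / len(p)          # centroid of source keypoints
    mq = sum(q) / len(q)          # centroid of target keypoints
    num = sum((pi - mp).conjugate() * (qi - mq) for pi, qi in zip(p, q))
    den = sum(abs(pi - mp) ** 2 for pi in p)
    a = num / den
    b = mq - a * mp
    return a, b

def warp_point(pt, a, b):
    """Apply the fitted transform to a single (x, y) point."""
    z = a * complex(*pt) + b
    return (z.real, z.imag)
```

A real implementation would fit a non-linear warp and resample the whole image, but the principle of matching detected keypoints is the same.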
2.2 Basic Model
We first describe the basic model of SAMC, which is essentially a conditional generative adversarial network [6, 12] with an identity constraint, similar to prior work. The diagram of the basic model lies in the top right corner of Figure 2. Taking the makeup image as input, the basic model produces the makeup removal result through the generator. Different from methods that force the output to resemble the non-makeup ground truth, we expect the changes to be limited to cosmetic areas while other regions are kept the same as in the input. Therefore, the U-net structure [21, 12] is adopted in the generator, as its skip connections help to maintain abundant context information during the forward pass. Instead of mapping to the output directly, the network learns a residual image as a bridge, inspired by the success of ResNet. The final output is obtained by adding the residual to the input.
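The residual formulation can be sketched as follows. Treating images as flat pixel lists and clipping to an assumed [0, 1] intensity range, the output is simply the input plus the predicted residual, so a zero residual leaves a region untouched.

```python
def apply_residual(input_img, residual, lo=0.0, hi=1.0):
    """Add a predicted residual to the input image, pixel by pixel,
    clipping the result to the valid intensity range."""
    return [min(hi, max(lo, i + r)) for i, r in zip(input_img, residual)]
```

This is why untouched areas (background, hair) pass through unchanged: the generator only needs to predict non-zero residuals where cosmetics are present.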
The generator receives two types of losses to update the parameters of the basic model, i.e., an adversarial loss and an identity loss. The vanilla GAN tactfully uses the idea of a two-player game to achieve the most promising synthetic quality. The two players are a generator and a discriminator that compete with each other: the generator aims at producing samples that fool the discriminator, while the discriminator endeavours to tell real and fake data apart. To refrain from artifacts and blur in the output, we train a discriminator to serve the adversarial loss.
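The original equation was not recoverable from the text; assuming the standard GAN objective, with G the generator, D the discriminator, x a makeup input and y a real non-makeup image (notation ours), a plausible reconstruction is:

```latex
\mathcal{L}_{adv} = \mathbb{E}_{y}\big[\log D(y)\big]
                  + \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big]
```

The discriminator maximizes this objective while the generator minimizes it.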
In addition to removing makeup, we also expect the generated images to maintain the identity of the original image, contributing to improved verification performance across makeup status. Different from visual quality, verification performance is calculated by comparing image features extracted by tailored schemes. For face verification, the key issue is to generate images whose features reliably indicate identity. Thus we also use an identity loss to keep the identity information consistent. The identity loss is a classical constraint in a wide range of applications, e.g., super-resolution, face frontalization, age estimation, and makeup removal. Following prior work, we employ one of the leading face verification networks, Light CNN, to obtain the identity information of a face image. The ID loss penalizes the distance between the features of the generated image and those of the ground truth, where the features come from the pre-trained extractor, whose parameters stay constant during training.
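A minimal sketch of the ID loss, assuming a squared-L2 form (the exact distance in the lost equation is an assumption). Feature vectors would come from the frozen Light CNN extractor; here they are plain lists.

```python
def id_loss(feat_generated, feat_ground_truth):
    """Squared L2 distance between two identity feature vectors.
    The vectors are assumed to come from a frozen, pre-trained
    extractor (Light CNN in the paper); only the generator is updated."""
    return sum((a - b) ** 2 for a, b in zip(feat_generated, feat_ground_truth))
```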
2.3 Attention Module
By comparing images of a subject before and after makeup, we observe that the facial changes caused by makeup concern only a few typical regions, such as the eye shadows and lips. Existing makeup removal methods treat the problem as a one-to-one image translation task and force the output to resemble the non-makeup image in the dataset. This behavior violates the observation above and thus cannot generate satisfying results. Instead, when removing makeup, we should concentrate on the cosmetic regions and leave other, unrelated image areas alone. In this paper, an attention module is developed to enhance the basic model by indicating, in a pixel-wise way, where the cosmetics are and how heavy they are.
On the other hand, the attention map can also help to ignore the severe distortion in the warped non-makeup image shown in Figure 2. The aim of the warping process is to alleviate data misalignment and provide better supervision. However, the warping is based on matching the facial keypoints of two images, and some distortion is inevitable after the non-linear transformation. If provided with distortion-polluted supervision, the generator may be misguided into producing results with fallacious artifacts and distortion. To mitigate this risk, we adopt the attention map to adaptively decide where to trust the warped image. We present a learned attention map in Figure 2; each of its elements is normalized between 0 and 1, and we adjust its colour for better visualization. It can be seen that the makeup concentrates on the eye shadows and lips, in accord with common sense. In general, there are dark and light pixels in the attention map: a dark pixel (value close to 0) indicates that the makeup removal result at this location should resemble the input, while a light pixel indicates it should resemble the warped ground truth. In this way, the distortion pixels are successfully neglected owing to their weights being close to 0. We formulate this intuition as the reconstruction loss.
Element-wise multiplication between matrices is used to weight each pixel's contribution by the attention map.
Nevertheless, there is another problem: the attention map easily converges to all zeros or all ones. When the attention map is all zeros, the output is driven to be a reconstruction of the input; when it is all ones, the output is induced to imitate the warped ground truth. Neither case is desirable. To avert these cases, prior work introduces a threshold to control the learning of the attention map, but it is difficult to choose an appropriate threshold value. Instead, we employ a simple regularization, implemented as an additional L1-norm constraint between the two terms in Equation 3. This regularization prevents either term from becoming too large or too small, and thus restricts the values in the attention map accordingly. A balance weight, set empirically, controls the contribution of the regularization.
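A sketch of the attention-weighted reconstruction loss together with the balancing regularization described above. Since the paper's Equation 3 was not recoverable verbatim, the exact form (L1 pixel differences, the regularization weight) is an assumption; images and attention maps are flat lists here.

```python
def rec_loss(attn, out, x, y_w, reg_weight=0.01):
    """Attention-weighted reconstruction loss (sketch of Eq. 3).
    attn : attention values in [0, 1]; 0 trusts the input x,
           1 trusts the warped non-makeup ground truth y_w.
    out  : generator output pixels.
    reg_weight is an illustrative value, not from the paper."""
    # Light pixels (attn near 1) pull the output toward y_w.
    term_gt = sum(a * abs(o - g) for a, o, g in zip(attn, out, y_w))
    # Dark pixels (attn near 0) pull the output toward the input x.
    term_in = sum((1 - a) * abs(o - i) for a, o, i in zip(attn, out, x))
    # L1 constraint between the two terms keeps the attention map
    # from collapsing to all zeros or all ones.
    reg = abs(term_gt - term_in)
    return (term_gt + term_in + reg_weight * reg) / len(attn)
```

When the output matches whichever target each pixel's weight selects, the loss vanishes; a lopsided attention map inflates the regularization term.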
2.4 Semantic-Aware Texture Loss
The attention map is learned along with the generator to indicate makeup at the image level. Since it is learned in an unsupervised manner, its confidence and accuracy cannot be ensured, especially at the beginning of training. Hence, we explore additional supervision to pursue better generation quality and stable network training. To this end, a Semantic-Aware Texture (SAT) loss is proposed to make the synthesized texture of different cosmetic regions realistic. As analyzed above, makeup is essentially a combination of cosmetics applied to multiple facial regions; a typical makeup effect can be divided into foundation, eye shadow and lipstick. Based on this, we resort to the progress of face parsing and further adapt the parsing results to obtain the different cosmetic regions. Figure 3 presents a set of parsing results; the three colours in Figure 3(c) each stand for a cosmetic region.
The aim of the SAT loss is to make the local texture of the generated image resemble that of the warped ground truth. After acquiring the cosmetic region label map, the statistics of each feature region are calculated accordingly. Note that 1) we assume the generated image shares its label map with the input, since their appearance differs merely in makeup effects, and 2) the label map is resized to match the corresponding feature map. As the texture extractor, we continue to adopt the pre-trained feature network but use only the output of its second convolutional layer.
The mean and standard deviation of the features are computed within each of the three cosmetic regions.
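A sketch of the SAT loss: match the mean and standard deviation of features inside each parsed cosmetic region (foundation, eye shadow, lipstick). The L1 form of the statistic matching is an assumption, since the original equation was lost; features are flat lists and region labels are integer ids, whereas real code would operate on convolutional feature maps.

```python
import math

def region_stats(feats, labels, region):
    """Mean and standard deviation of the features in one region."""
    vals = [f for f, l in zip(feats, labels) if l == region]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, math.sqrt(var)

def sat_loss(gen_feats, gt_feats, labels, regions=(0, 1, 2)):
    """Sum of statistic mismatches over the three cosmetic regions."""
    loss = 0.0
    for r in regions:
        gm, gs = region_stats(gen_feats, labels, r)
        tm, ts = region_stats(gt_feats, labels, r)
        loss += abs(gm - tm) + abs(gs - ts)
    return loss
```

Matching first- and second-order statistics per region constrains texture without requiring pixel-exact alignment, which is exactly why it complements the warped supervision.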
In total, the generator updates its parameters based on the four losses elaborated above, and the full objective is their weighted combination.
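In assumed notation (the paper's balance weights and symbols were not recoverable), the full objective plausibly reads:

```latex
\mathcal{L} = \mathcal{L}_{adv}
            + \lambda_{id}\,\mathcal{L}_{id}
            + \lambda_{rec}\,\mathcal{L}_{rec}
            + \lambda_{sat}\,\mathcal{L}_{sat}
```

where the \(\lambda\) coefficients balance the identity, reconstruction, and texture terms against the adversarial loss.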
3.1 Datasets and Training Details
We use three public datasets to build our training and test sets. Dataset1 contains 501 identities; for each identity, a makeup image and a non-makeup image are included. Dataset2 contains 406 images, likewise with a makeup image and a non-makeup image per identity. Dataset3 (FAM) consists of 519 pairs of images. For fair comparison with other methods, we resize all the images to the same resolution in our experiments. We also employ the high-quality makeup data collected by our lab to train and test the network, referred to as the Cross-Makeup Face (CMF) dataset. It contains 2600+ high-resolution image pairs of 1400+ identities, covering both makeup variations and identity information.
The experiments involve images at two resolutions. Our network is implemented in PyTorch. We train SAMC on the CMF dataset and test it on all four datasets mentioned above; therefore, the experiments on Datasets 1–3 are conducted in a cross-dataset setting. There are two main subnetworks in SAMC, i.e., the generator and the attention module. In the implementation, they share the same architecture but update their parameters separately. As the feature extractor in Eqs. 2 and 4, we utilize the released Light CNN model trained on MS-Celeb-1M. The model is trained with the Adam optimizer, and the batch size, learning rate and loss balance weights are set empirically.
3.2 Visualization of Makeup Removal
Figure 4 presents sample makeup removal results, along with the attention maps learned by SAMC. Note that we modify the display of the attention maps for better visualization. We compare our results with other methods, including Pix2pix, CycleGAN and BLAN. Pix2pix and CycleGAN are widely considered representative methods of supervised and unsupervised image-to-image translation, respectively. To the best of our knowledge, BLAN was the first to propose generating non-makeup images from makeup ones for makeup-invariant face verification. We train these networks from scratch on the CMF training data with the configurations described in their papers.
In Figure 4, it can be observed that Pix2pix and BLAN generate images with severe artifacts and distortion. The reason is that these methods assume the existence of well-aligned data pairs and formulate the image translation problem at the pixel level. As analyzed above, makeup images inherently lack paired data, making the problem more difficult to tackle. Although CycleGAN is learned in an unsupervised manner and produces images of higher quality than the former two, apparent cosmetic residues remain due to the lack of proper supervision. As for the attention map, it demonstrates that our attention module can locate the cosmetics; we discuss its contribution in the ablation studies. Compared with the other methods, our network achieves the most promising results: not only are the cosmetics successfully removed, but other image details are also well preserved.
3.3 Makeup-Invariant Face Verification
In this paper, we propose a makeup removal method with the aim of facilitating makeup-invariant face verification via generation. To quantitatively evaluate our makeup cleanser, we show the verification results of SAMC and related algorithms in Table 1. As introduced above, we adopt the released Light CNN model as the feature extractor: each image passes through Light CNN and becomes a fixed-length feature vector. The similarity metric used in all experiments is the cosine distance. For the baseline, we use the original images with and without makeup as inputs. For the other methods, we instead use the makeup removal results and the original non-makeup images as inputs for verification.
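The verification step can be sketched as follows: cosine similarity between two feature vectors, with a decision threshold (the threshold value here is illustrative, not from the paper).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def same_person(feat_a, feat_b, threshold=0.5):
    """Declare a match when the similarity exceeds the threshold."""
    return cosine_similarity(feat_a, feat_b) >= threshold
```

In practice the threshold is chosen on a validation set to trade off false accepts against false rejects.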
On the CMF dataset, our approach brings significant improvements on all three criteria. Instead of forcing the output to resemble the ground truth non-makeup image in the dataset as BLAN does, we learn to locate and remove the cosmetics while maintaining other information, including pose and expression. The accuracy improvements demonstrate that our network alleviates the side effects of makeup on face verification by generating high-quality images. On the other hand, CycleGAN fails to preserve identity information during generation, even though it produces outputs of moderate quality; Pix2pix and CycleGAN are designed for general image translation and take no identity discrimination into account. For fair comparison, we further conduct experiments on three public makeup datasets, with the results exhibited in Table 2. It is worth noticing that BLAN is trained on these datasets, while our SAMC is trained on CMF and tested on them without adaptation. Thanks to the stability of our network, SAMC still outperforms the other methods in this cross-dataset setting.
Besides, ablation studies are conducted to evaluate the contribution of each component in SAMC. We remove or modify one component at a time and observe the changes in the corresponding metrics. To study the loss functions used to train the generator, we build three variants: 1) training without the ID loss, 2) training without the SAT loss, and 3) training without the adversarial loss. Since the reconstruction loss involves the learned attention map, we utilize different attention schemes to analyze the effect of the attention module; in particular, the attention map is set to all zeros and all ones, respectively. The quantitative verification results are reported in Table 1 for comprehensive and fair comparison, where "w/o" stands for "without". As expected, removing any component causes a performance drop. Observing the accuracies in Table 1, we find an apparent decline when removing the ID loss, indicating its effectiveness and importance in preserving discriminative information at the feature level.
In this paper, we focus on the negative impact of facial makeup on verification and propose a semantic-aware makeup cleanser (SAMC) to remove cosmetics. Instead of considering makeup as an overall effect, we argue that makeup is a combination of various cosmetics applied to different facial regions. Therefore, the makeup cleanser network is designed with two elaborate schemes. At the image level, an attention module is learned along with the generator to locate the cosmetics in an unsupervised manner; the elements in the attention map range from 0 to 1, with different values indicating the makeup degree. At the feature level, a semantic-aware texture loss is designed to complement the attention module and provide supervision. Experiments are conducted on four makeup datasets. Both appealing makeup removal images and promising makeup-invariant face verification accuracies are achieved, verifying the effectiveness of SAMC.
This work is partially funded by National Natural Science Foundation of China (Grant No. 61622310), Beijing Natural Science Foundation (Grant No. JQ18017), and Youth Innovation Promotion Association CAS (Grant No. 2015109).
-  A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, pages 1021–1030, 2017.
-  J. Cao, Y. Hu, H. Zhang, R. He, and Z. Sun. Learning a high fidelity pose invariant model for high-resolution face frontalization. In Advances in Neural Information Processing Systems, pages 2872–2882, 2018.
-  H. Chang, J. Lu, F. Yu, and A. Finkelstein. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  C. Chen, A. Dantcheva, and A. Ross. An ensemble of patch-based subspaces for makeup-robust face recognition. Information Fusion, 32:80–92, 2016.
-  A. Dantcheva, C. Chen, and A. Ross. Can facial cosmetics affect the matching accuracy of face recognition systems? In the Fifth International Conference on Biometrics: Theory, Applications and Systems, pages 391–398. IEEE, 2012.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  G. Guo, L. Wen, and S. Yan. Face authentication with makeup changes. IEEE Transactions on Circuits and Systems for Video Technology, 24(5):814–825, 2014.
-  Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  R. He, X. Wu, Z. Sun, and T. Tan. Learning invariant deep representation for nir-vis face recognition. In The Thirty-First AAAI Conference on Artificial Intelligence, pages 2000–2006. AAAI Press, 2017.
-  J. Hu, Y. Ge, J. Lu, and X. Feng. Makeup-robust face verification. In International Conference on Acoustics, Speech and Signal Processing, pages 2342–2346, 2013.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5967–5976. IEEE, 2017.
-  X.-Y. Jing, F. Wu, X. Zhu, X. Dong, F. Ma, and Z. Li. Multi-spectral low-rank structured dictionary learning for face recognition. Pattern Recognition, 59:14–25, 2016.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  P. Li, Y. Hu, Q. Li, R. He, and Z. Sun. Global and local consistent age generative adversarial networks. arXiv preprint arXiv:1801.08390, 2018.
-  T. Li, R. Qian, C. Dong, S. Liu, Q. Yan, W. Zhu, and L. Lin. Beautygan: Instance-level facial makeup transfer with deep generative adversarial network. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 645–653. ACM, 2018.
-  Y. Li, L. Song, X. Wu, R. He, and T. Tan. Anti-makeup: Learning a bi-level adversarial network for makeup-invariant face verification. In The Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Y. Li, L. Song, X. Wu, R. He, and T. Tan. Learning a bi-level adversarial network with global and local perception for makeup-invariant face verification. Pattern Recognition, 2019.
-  Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim. Unsupervised attention-guided image to image translation. In Thirty-second Conference on Neural Information Processing Systems, 2018.
-  H. V. Nguyen and L. Bai. Cosine similarity metric learning for face verification. In Asian Conference on Computer Vision, pages 709–720. Springer, 2010.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
-  D. Ruprecht and H. Muller. Image warping with scattered data interpolation. IEEE Computer Graphics and Applications, 15(2):37–43, 1995.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
-  Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
-  Y. Sun, L. Ren, Z. Wei, B. Liu, Y. Zhai, and S. Liu. A weakly supervised method for makeup-invariant face verification. Pattern Recognition, 66:153–159, 2017.
-  Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In the IEEE International Conference on Computer Vision, pages 1489–1496, 2013.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
-  Z. Wei, Y. Sun, J. Wang, H. Lai, and S. Liu. Learning adaptive receptive fields for deep image parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2434–2442, 2017.
-  X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
-  S. Zhang, R. He, Z. Sun, and T. Tan. Multi-task convnet for blind face inpainting with application to face verification. In International Conference on Biometrics, pages 1–8, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.