In this work, we address the makeup transfer task, namely transferring the makeup from an arbitrary reference image to a non-makeup source image. It is in wide demand in many popular portrait beautifying applications. Recent methods [15, 3, 2] for makeup transfer are mostly based on Generative Adversarial Networks (GANs) [9]. They generally adopt the framework of CycleGAN [29] and train on two sets of images, namely, non-makeup images and with-makeup images. However, the existing methods only work well on frontal facial images, since they have no specially designed module to handle the pose and expression differences between source and reference images. Also, they cannot be directly used for partial makeup transfer during testing, since they lack the ability to extract makeup features in a spatial-aware way.
To address these weaknesses, we aim to build a model that better matches real-world scenarios. First, an ideal model should be pose-robust, which means it should be able to generate high-quality results even if source images and reference images show different poses. For example, it is expected that the makeup can be transferred from a profile face to a frontal face. Second, we hope our model can realize a part-by-part transfer process. That is, the desired method is able to transfer the makeup of specified face regions separately, e.g., eyeshadow or lipstick. Third, controllable shade is in high demand: the shade of the transferred makeup is expected to be controllable from light to heavy. Among existing methods, [3] proposes to disentangle the images into makeup latent code and face latent code to realize shade-controllable transfer. But this work does not consider the other two requirements.
In this paper, we propose a novel Pose-robust Spatial-aware GAN (PSGAN). Our PSGAN mainly contains three parts, namely a Makeup Distillation Network (MDNet), an Attentive Makeup Morphing (AMM) module and a De-makeup Re-makeup Network (DRNet). Inspired by current style transfer methods, we perform makeup transfer by scaling and shifting the feature map only once using makeup matrices. But makeup transfer is a more complex problem than style transfer, since it needs to consider both the realism of results and subtle details of makeup styles. We propose to use MDNet to distill the makeup from the reference image into two makeup matrices $\gamma$ and $\beta$ that have the same spatial dimensions as the visual features, instead of scalars. Then, $\gamma$ and $\beta$ are morphed and adapted to the source image by the AMM module to produce adaptive makeup matrices $\gamma'$ and $\beta'$. The AMM module resolves the misalignment caused by pose differences and makes PSGAN pose-robust. Finally, the proposed DRNet first conducts de-makeup on the source image and then performs re-makeup by applying the matrices $\gamma'$ and $\beta'$ to the de-makeup result through pixel-wise weighted multiplication and addition.
Since the makeup style has been distilled in a spatial-aware way, part-by-part transfer can be realized by setting the weights in pixel-wise operations according to the face parsing results. For example, in the top left panel of Figure LABEL:p1, the lip gloss, skin and eye shadow can be individually transferred from the reference image to the source image. Controllable shade can be realized by setting the weights to a scalar within $[0, 1]$. An example is shown in the bottom left panel of Figure LABEL:p1, where the makeup shade becomes increasingly heavier. The novel AMM module effectively handles pose and expression differences. The top right and bottom right panels of Figure LABEL:p1 illustrate examples with large pose and expression differences, respectively. With the three novel modules, PSGAN is able to satisfy the requirements we pose for an ideal customizable makeup transfer method.
We make the following contributions:
We propose PSGAN, the first GAN-based method that can simultaneously realize part-by-part transfer, shade controllable transfer, and pose-robust transfer.
A novel AMM module is proposed which morphs the makeup matrices distilled from the reference image to the source image. The module enables PSGAN to transfer makeup styles between images that have variant poses.
Extensive qualitative and quantitative results have demonstrated the effectiveness of PSGAN.
Makeup transfer has been studied extensively in recent years [23, 10, 14, 18, 17, 1]. Recent advances in makeup transfer are mostly based on Generative Adversarial Networks (GANs) for their ability to generate realistic images. [15] first proposed a GAN framework with dual input and output for simultaneous makeup transfer and removal. They also introduced a makeup loss that matches the color histogram in different parts of faces for instance-level makeup transfer. [3] proposed a similar idea on the Glow framework and decomposed the makeup component and the non-makeup component. [2] employed an additional discriminator to guide makeup transfer using pseudo transferred images generated by warping the reference face to the source face. Among these methods, [3] achieved makeup transfer with different lightness. Compared with them, our PSGAN can not only adjust the lightness but also transfer the chosen partial makeup during testing. Besides, PSGAN is able to transfer makeup between different poses.
Style transfer has been investigated extensively [7, 6, 13, 19, 22]. [8] proposed to derive image representations from a CNN, which can be separated and recombined to synthesize images. Besides, some methods have been developed to solve the fast style transfer problem. [5] found the vital role of normalization in style transfer networks and achieved fast style transfer with conditional instance normalization. However, their method can only transfer a fixed set of styles and cannot adapt to arbitrary new styles. After that, [12] proposed adaptive instance normalization (AdaIN), which aligns the mean and variance of the content features with those of the style features and achieves arbitrary style transfer. Makeup transfer can be regarded as a particular type of style transfer. We propose spatial-aware makeup transfer for each pixel rather than transferring a general style from the reference.
[24] proposed the attention mechanism in natural language processing, leveraging a self-attention module to compute the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. [25] proposed the non-local network, which computes the response at a position as a weighted sum of the features at all positions. Inspired by these works, we explore the application of the attention module by calculating the attention between two feature maps. Unlike the non-local network, which only considers visual appearance similarities, our proposed AMM module computes the weighted sum of another feature map by considering both visual appearances and locations.
Let $X$ and $Y$ be the source image domain and reference image domain. Also, we utilize $x$ and $y$ to represent the examples of the two domains, respectively. Note that a paired dataset is not required. That is, the source and reference images may have different identities. We assume $x$ is sampled from $X$ according to the distribution $P_X$ and $y$ is sampled from $Y$ according to the distribution $P_Y$; $H$ and $W$ denote the height and width of the images. Our proposed Pose-robust Spatial-aware GAN (PSGAN) learns a transfer function $G$, where the transferred image $\tilde{x} = G(x, y)$ has the makeup style of the reference image $y$ and preserves the identity of the source image $x$.
Overall. The framework of PSGAN is shown in Figure 1. Mathematically, it is formulated as $\tilde{x} = G(x, y)$. It can be divided into three parts. 1) Makeup distillation network. MDNet extracts the makeup style from the reference image $y$ and represents it as two makeup matrices $\gamma$ and $\beta$, which have the same height and width as the feature map instead of being scalars. 2) Attentive makeup morphing module. Since source images and reference images may have large discrepancies in expressions and poses, the extracted makeup matrices cannot be directly applied to the source image $x$. We therefore propose an AMM module to morph the two makeup matrices into two new matrices $\gamma'$ and $\beta'$, which are adaptive to the source image, by considering the similarities between pixels of the source and reference. 3) De-makeup & Re-makeup network. The original makeup (if any) of the source image is first removed. Then the adaptive makeup matrices $\gamma'$ and $\beta'$ are applied to the bottleneck of the DRNet to perform makeup transfer with pixel-level guidance.
Makeup distillation network. The MDNet utilizes the encoder-bottleneck architecture used in [4, 15] without the decoder part. It disentangles the makeup-related features, e.g., lip gloss and eye shadow, from the intrinsic facial features, e.g., facial shape and eye size. The makeup-related features are represented as two makeup matrices $\gamma$ and $\beta$, which are used to transfer the makeup by pixel-level operations. As shown in Figure 1(B), the feature maps $V_y \in \mathbb{R}^{C \times H \times W}$ of the reference image are fed into two convolution layers to produce $\gamma \in \mathbb{R}^{1 \times H \times W}$ and $\beta \in \mathbb{R}^{1 \times H \times W}$, where $H$, $W$ and $C$ are the height, width and number of channels of the feature map.
Attentive makeup morphing module. Since source and reference images may have different poses and expressions, the obtained spatial-aware $\gamma$ and $\beta$ cannot be applied directly to the source image. The proposed AMM module calculates an attentive matrix $A \in \mathbb{R}^{HW \times HW}$ to specify how a pixel in the source image $x$ is morphed from the pixels in the reference image $y$, where $A_{i,j}$ indicates the attentive value between the $i$-th pixel $x_i$ in image $x$ and the $j$-th pixel $y_j$ in image $y$.
We argue that the makeup should be transferred between the pixels with similar relative positions on the face, and the attentive values between these pixels should be high. For example, the lip gloss region of the transferred result $\tilde{x}$ should be sampled from the corresponding lip gloss region of the reference image $y$. To describe the relative positions, each pixel is represented by a relative position vector $p$. More specifically, the relative position of pixel $x_i$ is represented by $p_i$, which is reflected in the differences of coordinates between the pixel and the $K$ facial landmarks, calculated by
$$p_i = [f(x_i) - f(l_1), \ldots, f(x_i) - f(l_K),\; g(x_i) - g(l_1), \ldots, g(x_i) - g(l_K)], \tag{1}$$
where $f(\cdot)$ and $g(\cdot)$ indicate the coordinates on the $x$ and $y$ axes, and $l_k$ indicates the $k$-th facial landmark obtained by the 2D facial landmark detector [27], which serves as the anchor point when calculating $p_i$.
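The relative position vector can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `relative_positions`, the `(x, y)` landmark layout, and the toy landmark coordinates are hypothetical, and a real pipeline would take the landmarks from a facial landmark detector.

```python
import numpy as np

def relative_positions(height, width, landmarks):
    """Return an (H*W, 2K) array: per-pixel x- and y-offsets to K landmarks."""
    landmarks = np.asarray(landmarks, dtype=np.float64)   # (K, 2), stored as (x, y)
    rows, cols = np.mgrid[0:height, 0:width]              # pixel grid coordinates
    xs = cols.reshape(-1, 1).astype(np.float64)           # (H*W, 1) x-coordinates
    ys = rows.reshape(-1, 1).astype(np.float64)           # (H*W, 1) y-coordinates
    dx = xs - landmarks[:, 0]                             # (H*W, K) x-offsets
    dy = ys - landmarks[:, 1]                             # (H*W, K) y-offsets
    return np.concatenate([dx, dy], axis=1)               # all x-offsets, then all y-offsets

# Toy example: a 4x5 grid with two hypothetical landmarks given as (x, y).
p = relative_positions(4, 5, [[1.0, 2.0], [3.0, 0.0]])
```

Pixels are indexed in row-major order, so row `i` of `p` corresponds to the pixel at `(i // width, i % width)`.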
Moreover, to avoid unreasonably sampling pixels with similar relative positions but different semantics, we also consider the visual similarities between pixels (e.g., $x_i$ and $y_j$), whose visual features are denoted as $v_i$ and $v_j$ and are extracted from the third bottleneck of DRNet and MDNet, respectively. As Figure 1(B) shows, the attentive matrix $A$ is computed by considering the similarity of both visual features and relative positions via
$$A_{i,j} = \frac{\exp\left([v_i, p_i]^{\top} [v_j, p_j]\right)\, \mathbb{I}(m_x^i = m_y^j)}{\sum_{j'} \exp\left([v_i, p_i]^{\top} [v_{j'}, p_{j'}]\right)\, \mathbb{I}(m_x^i = m_y^{j'})}, \tag{2}$$
where $[\cdot, \cdot]$ denotes the concatenation operation between tensors, $v$ and $p$ indicate the visual features and relative position vectors, and $\mathbb{I}(\cdot)$ is an indicator function whose value is $1$ if the inside formula is true and $0$ otherwise. $m_x$ and $m_y$ are the face parsing maps of source image $x$ and reference image $y$, where $N$ stands for the number of facial regions ($N$ is $3$ in our experiments, including eyes, lip and skin), and $m_x^i$ and $m_y^j$ indicate the facial regions that $x_i$ and $y_j$ belong to. Note that when calculating $A_{i,j}$, we only consider the pixels belonging to the same facial region, i.e., $m_x^i = m_y^j$, by applying the indicator function $\mathbb{I}(\cdot)$.
Given a specific red point in the lower-left corner of the nose of the source image, the middle image of Figure 1(C) shows its attentive values by reshaping a specific row of the attentive matrix $A$ to $H \times W$. We can see that only the pixels around the left corner of the nose have large values. After applying softmax, the attentive values become more concentrated. This verifies that our proposed AMM module is able to locate semantically similar pixels to attend to.
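The attentive matrix described above can be illustrated as a region-masked softmax over dot products of concatenated visual and positional features. The sketch below is a minimal NumPy version under assumptions: the function name `attention_matrix`, the flattened `(HW, ...)` input layout, and the use of integer region labels in place of one-hot parsing maps are choices made for this example, and any feature weighting or normalization used in the actual model is omitted.

```python
import numpy as np

def attention_matrix(v_x, p_x, m_x, v_y, p_y, m_y):
    """v_*: (HW, C) visual features, p_*: (HW, D) relative positions,
    m_*: (HW,) integer face-region labels. Returns A of shape (HW, HW)."""
    q = np.concatenate([v_x, p_x], axis=1)            # source descriptors [v, p]
    k = np.concatenate([v_y, p_y], axis=1)            # reference descriptors [v, p]
    logits = q @ k.T                                  # pairwise dot products
    same_region = m_x[:, None] == m_y[None, :]        # indicator: same facial region
    logits = np.where(same_region, logits, -np.inf)   # mask out other regions
    row_max = logits.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(logits - row_max)                # masked entries become 0
    denom = weights.sum(axis=1, keepdims=True)
    # Rows with no matching reference pixel stay all-zero instead of NaN.
    return np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)

rng = np.random.default_rng(0)
v_x, v_y = rng.normal(size=(6, 3)), rng.normal(size=(6, 3))
p_x, p_y = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
m_x = np.array([0, 0, 1, 1, 2, 2])
m_y = np.array([0, 1, 1, 2, 2, 2])
A = attention_matrix(v_x, p_x, m_x, v_y, p_y, m_y)
```

Each row of `A` sums to one over the reference pixels of the matching region, so a source pixel never attends to a reference pixel of a different region.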
We multiply the attentive matrix $A$ by $\gamma$ and $\beta$ to get the morphed makeup matrices $\gamma'$ and $\beta'$, which are then used to scale and shift the feature map of the source image only once. More specifically, the matrices $\gamma'$ and $\beta'$ are computed by
$$\gamma'_i = \sum_j A_{i,j}\, \gamma_j, \qquad \beta'_i = \sum_j A_{i,j}\, \beta_j, \tag{3}$$
where $i$ and $j$ are the pixel indices of $x$ and $y$. After that, the matrices $\gamma'$ and $\beta'$ are duplicated and expanded along the channel dimension to produce the makeup tensors $\Gamma'$ and $B'$, which will be the input of DRNet.
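The morphing step is a matrix-vector product followed by a broadcast along channels, which the following NumPy sketch makes concrete. The function name `morph_makeup` and the toy permutation-style attention matrix are assumptions for illustration only.

```python
import numpy as np

def morph_makeup(A, gamma, beta, channels):
    """A: (HW, HW) attention; gamma, beta: (H, W) makeup matrices.
    Returns morphed makeup tensors of shape (channels, H, W)."""
    h, w = gamma.shape
    gamma_m = (A @ gamma.reshape(-1)).reshape(h, w)    # weighted sum over reference pixels
    beta_m = (A @ beta.reshape(-1)).reshape(h, w)
    # Duplicate the single-channel matrices along the channel dimension.
    Gamma = np.broadcast_to(gamma_m, (channels, h, w)).copy()
    B = np.broadcast_to(beta_m, (channels, h, w)).copy()
    return Gamma, B

# Toy attention that maps source pixel i to reference pixel (HW - 1 - i).
gamma = np.arange(6.0).reshape(2, 3)
beta = gamma + 10.0
A = np.eye(6)[::-1]
Gamma, B = morph_makeup(A, gamma, beta, channels=4)
```

With this toy `A`, the morphed matrices are simply the reference matrices in reversed pixel order, which makes the effect of the attention easy to inspect.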
De-makeup & Re-makeup network. As shown in Figure 1(A), the encoder part of DRNet shares the same architecture with MDNet, but they do not share parameters since their functions are different. In the encoder part, we use instance normalization without affine parameters, which normalizes each channel of the feature map to zero mean and unit variance and can be seen as the process of de-makeup.
In the bottleneck part, the morphed makeup tensors $\Gamma'$ and $B'$ obtained by the AMM module are applied to the source image feature map $V_x$. The activation value of the transferred feature map $V'_x$ is calculated by
$$V'_x = \Gamma' \odot V_x + B', \tag{4}$$
where $\odot$ denotes element-wise multiplication. Eq. (4) gives the function of makeup transfer. The updated feature map $V'_x$ is then fed to the subsequent decoder part of DRNet.
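The de-makeup and re-makeup steps can be sketched as an affine-free instance normalization followed by the element-wise scale and shift of Eq. (4). The helper names `instance_norm` and `apply_makeup`, the `eps` value, and the random toy feature map are assumptions of this sketch, not details taken from the model.

```python
import numpy as np

def instance_norm(v, eps=1e-5):
    """Normalize each channel of a (C, H, W) feature map, without affine parameters."""
    mean = v.mean(axis=(1, 2), keepdims=True)
    var = v.var(axis=(1, 2), keepdims=True)
    return (v - mean) / np.sqrt(var + eps)

def apply_makeup(v_x, Gamma, B):
    """Scale and shift the source feature map element-wise, as in Eq. (4)."""
    return Gamma * v_x + B

rng = np.random.default_rng(0)
v = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 4))    # toy source feature map
v_n = instance_norm(v)                                 # "de-makeup" step
out = apply_makeup(v_n, np.full_like(v_n, 2.0), np.ones_like(v_n))
```

Because the normalization has no learned scale or shift, all per-pixel scaling and shifting of the source features comes from the makeup tensors.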
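Adversarial loss. We utilize two discriminators $D_X$ and $D_Y$ for the source image domain $X$ and the reference image domain $Y$, which try to discriminate between generated images and real images and thus help the generators synthesize realistic outputs. Therefore, the adversarial losses $L_D^{adv}$ and $L_G^{adv}$ for the discriminator and generator are computed by
$$\begin{aligned} L_D^{adv} = {} & -\mathbb{E}_{x \sim P_X}[\log D_X(x)] - \mathbb{E}_{y \sim P_Y}[\log D_Y(y)] \\ & - \mathbb{E}_{x \sim P_X, y \sim P_Y}[\log(1 - D_X(G(y, x)))] - \mathbb{E}_{x \sim P_X, y \sim P_Y}[\log(1 - D_Y(G(x, y)))], \\ L_G^{adv} = {} & -\mathbb{E}_{x \sim P_X, y \sim P_Y}[\log D_X(G(y, x))] - \mathbb{E}_{x \sim P_X, y \sim P_Y}[\log D_Y(G(x, y))]. \end{aligned}$$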
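Cycle consistency loss. Due to the lack of triplet data (source image, reference image, and transferred image), we train the network in an unsupervised way. Here, we introduce the cycle consistency loss proposed by [29]. We use the L1 loss to constrain the reconstructed images and define the cycle consistency loss $L_G^{cyc}$ as
$$L_G^{cyc} = \mathbb{E}_{x \sim P_X, y \sim P_Y}\left[\lVert G(G(x, y), x) - x \rVert_1\right] + \mathbb{E}_{x \sim P_X, y \sim P_Y}\left[\lVert G(G(y, x), y) - y \rVert_1\right].$$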
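Perceptual loss. When transferring the makeup style, the transferred image is required to preserve personal identity. Instead of directly measuring differences at the pixel level, we utilize a VGG-16 model pre-trained on ImageNet to compare the activations of source images and generated images in a hidden layer. Let $F_l(\cdot)$ denote the output of the $l$-th layer of the VGG-16 model. We introduce the perceptual loss $L_G^{per}$ to measure their differences using the L2 loss:
$$L_G^{per} = \mathbb{E}_{x \sim P_X, y \sim P_Y}\left[\lVert F_l(G(x, y)) - F_l(x) \rVert_2\right] + \mathbb{E}_{x \sim P_X, y \sim P_Y}\left[\lVert F_l(G(y, x)) - F_l(y) \rVert_2\right].$$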
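Makeup loss. To provide coarse guidance for makeup transfer, we utilize the makeup loss proposed by [15]. Specifically, we perform histogram matching on the same facial regions of $x$ and $y$ separately and then recombine the results, denoted as $HM(x, y)$. As a kind of pseudo ground truth, $HM(x, y)$ preserves the identity of $x$ and shares a similar color distribution with $y$. Then we calculate the makeup loss $L_G^{make}$ as coarse guidance by
$$L_G^{make} = \mathbb{E}_{x \sim P_X, y \sim P_Y}\left[\lVert G(x, y) - HM(x, y) \rVert_2\right] + \mathbb{E}_{x \sim P_X, y \sim P_Y}\left[\lVert G(y, x) - HM(y, x) \rVert_2\right].$$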
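Region-wise histogram matching, used to build the pseudo ground truth, can be illustrated with a simple CDF-based mapping. The sketch below is a simplified stand-in under stated assumptions: it works on single-channel images, uses integer region labels, and the names `match_histogram`, `pseudo_ground_truth` and `makeup_loss` are hypothetical; a real implementation would typically match each color channel.

```python
import numpy as np

def match_histogram(source, reference):
    """Map source values so their empirical CDF follows that of reference."""
    src_vals, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    mapped = np.interp(src_cdf, ref_cdf, ref_vals)    # CDF-to-value lookup
    return mapped[src_idx].reshape(source.shape)

def pseudo_ground_truth(x, y, m_x, m_y, regions):
    """Histogram-match each facial region of x to the same region of y."""
    out = x.astype(np.float64).copy()
    for r in regions:
        sx, sy = m_x == r, m_y == r
        if sx.any() and sy.any():
            out[sx] = match_histogram(x[sx], y[sy])
    return out

def makeup_loss(generated, target):
    """L2 distance between the generated image and the pseudo ground truth."""
    return float(np.linalg.norm(generated - target))

x = np.array([[0.0, 1.0], [2.0, 3.0]])
y = np.array([[10.0, 20.0], [30.0, 40.0]])
m_x = np.array([[0, 0], [1, 1]])
m_y = np.array([[0, 0], [1, 1]])
hm = pseudo_ground_truth(x, y, m_x, m_y, regions=[0, 1])
```

The result keeps the spatial layout (and thus the identity) of `x` while taking on the value distribution of `y` within each region.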
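Total loss. The losses $L_D$ and $L_G$ for the discriminator and generator of our approach can be expressed as
$$L_D = \lambda_{adv} L_D^{adv}, \qquad L_G = \lambda_{adv} L_G^{adv} + \lambda_{cyc} L_G^{cyc} + \lambda_{per} L_G^{per} + \lambda_{make} L_G^{make},$$
where $\lambda_{adv}$, $\lambda_{cyc}$, $\lambda_{per}$, $\lambda_{make}$ are the weights to balance the multiple objectives.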
Experimental Setting and Details
We train and test our network using the MT (Makeup Transfer) dataset [15, 3]. It contains both source images and reference images. These images are mostly well-aligned and come with the corresponding face parsing results. We follow the existing splitting strategy for frontal face experiments since most examples in the test set are frontal face images. To further demonstrate the effectiveness of PSGAN in handling pose and expression differences, we select images with various poses and expressions from the MT dataset to construct an MT-wild test set. Additionally, we train the model using the training set that excludes the MT-wild test set.
Attentive makeup morphing module. In PSGAN, the AMM module morphs the distilled makeup matrices $\gamma$ and $\beta$ to $\gamma'$ and $\beta'$. It alleviates the pose and expression differences between source and reference images. The effectiveness of the AMM module is shown in Figure 2. In the first row, the poses of the source and reference images are very different. Without AMM, the bangs of the reference image are transferred to the skin of the source image. By applying AMM, the pose misalignment is well resolved. A similar observation can be made in the second row: the expressions of the source and reference images are smiling and neutral respectively, and without the AMM module the lip gloss is applied to the teeth region, as shown in the third column. After integrating AMM, the lip gloss is applied to the lip region, bypassing the teeth area. The experiments demonstrate that the AMM module can specify how a pixel in the source image is morphed from pixels of the reference, instead of mapping the makeup from the same location directly.
Figure 3 demonstrates that considering only the relative positions is not enough for makeup matrix morphing. If only the relative position is considered, the attentive map in the middle is similar to a 2D Gaussian distribution. In the first row of Figure 3, the red point on the skin of the source may receive makeup from the nostril area in the reference image. After considering the appearance feature, the attentive maps bypass the nostrils and focus more on the skin. In the second row, the attentive map crosses the face boundary and covers the earrings. By considering the appearance similarity, the attentive map concentrates on the skin area, avoiding interference from the background. Besides, adding appearance features also makes PSGAN robust to face alignment errors.
Partial and Interpolated Makeup
Since the makeup matrices $\gamma$ and $\beta$ are spatial-aware, partial and interpolated transfer can be realized during testing. To obtain partial makeup generation, we compute the new makeup matrices by weighting the matrices using the face parsing results. Let $x$, $y$ denote a source image and a reference image. We can obtain $\Gamma_x$, $B_x$ and $\Gamma'_y$, $B'_y$ by MDNet. In addition, we can obtain the face parsing mask $m_x$ of $x$ through the existing deep learning method [28]. Suppose we only want to perform makeup transfer in specific regions of $x$, e.g., the lip, which can be denoted as a binary mask $m_x^l$. PSGAN can realize partial makeup transfer by assigning different makeup parameters to different pixels.
By modifying Eq. (4), the part-by-part transferred feature map $V'_x$ can be calculated by
$$V'_x = \left(m_x^l \odot \Gamma'_y + (1 - m_x^l) \odot \Gamma_x\right) \odot V_x + \left(m_x^l \odot B'_y + (1 - m_x^l) \odot B_x\right).$$
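Part-by-part transfer amounts to blending two sets of makeup tensors with a binary region mask before the scale-and-shift step. The following NumPy sketch illustrates this under assumptions: the function name `partial_transfer`, the argument names, and the constant toy tensors are hypothetical and chosen only to make the masking visible.

```python
import numpy as np

def partial_transfer(v_x, mask, Gamma_ref, B_ref, Gamma_src, B_src):
    """Use reference makeup tensors inside the mask and source tensors elsewhere.
    v_x and the makeup tensors are (C, H, W); mask is a binary (H, W) array."""
    Gamma = mask * Gamma_ref + (1 - mask) * Gamma_src   # mask broadcasts over channels
    B = mask * B_ref + (1 - mask) * B_src
    return Gamma * v_x + B

c, h, w = 2, 2, 3
v_x = np.ones((c, h, w))
mask = np.array([[1, 0, 0], [0, 0, 1]])                 # hypothetical lip region
Gamma_ref, B_ref = np.full((c, h, w), 3.0), np.full((c, h, w), 5.0)
Gamma_src, B_src = np.ones((c, h, w)), np.zeros((c, h, w))
out = partial_transfer(v_x, mask, Gamma_ref, B_ref, Gamma_src, B_src)
```

Pixels inside the mask receive the reference makeup (here `3 * 1 + 5`), while pixels outside keep the source parameters (here `1 * 1 + 0`).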
Figure 5 shows the results of partially mixing the makeup styles from two references. The results in the third column recombine the lip makeup from reference 1 and the rest of the makeup from reference 2, and they are natural and realistic. This partial makeup feature enables PSGAN to realize customizable makeup transfer.
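Moreover, we can interpolate the makeup of two reference images using a coefficient $\alpha \in [0, 1]$. We first get the makeup tensors of the two references $y_1$ and $y_2$, and then compute the new parameters by weighting them with $\alpha$. As a special case, the same weighting can be used to adjust the shade of the transfer from a single reference image. The resulting feature map $V'_x$ is calculated by
$$V'_x = \left(\alpha \Gamma'_{y_1} + (1 - \alpha) \Gamma'_{y_2}\right) \odot V_x + \left(\alpha B'_{y_1} + (1 - \alpha) B'_{y_2}\right).$$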
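Interpolated transfer is a convex combination of two sets of makeup tensors followed by the same scale-and-shift operation. The sketch below is illustrative only: the function name `interpolate_transfer`, the coefficient name `alpha`, and the constant toy tensors are assumptions of this example.

```python
import numpy as np

def interpolate_transfer(v_x, alpha, Gamma_1, B_1, Gamma_2, B_2):
    """Blend the makeup tensors of two references with weight alpha in [0, 1]."""
    Gamma = alpha * Gamma_1 + (1.0 - alpha) * Gamma_2
    B = alpha * B_1 + (1.0 - alpha) * B_2
    return Gamma * v_x + B

shape = (2, 2, 2)
v_x = np.ones(shape)
Gamma_1, B_1 = np.full(shape, 2.0), np.full(shape, 4.0)   # toy reference 1
Gamma_2, B_2 = np.full(shape, 6.0), np.zeros(shape)       # toy reference 2
only_first = interpolate_transfer(v_x, 1.0, Gamma_1, B_1, Gamma_2, B_2)
halfway = interpolate_transfer(v_x, 0.5, Gamma_1, B_1, Gamma_2, B_2)
```

Sweeping `alpha` from 0 to 1 moves the result smoothly from the second makeup style to the first; passing the source's own tensors as one of the two sets gives a shade control for a single reference.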
Figure 4 shows the interpolated makeup transfer results with one and two reference images. By feeding the new makeup tensors into DRNet, we obtain a smooth transition between two reference makeup styles. The generated results demonstrate that PSGAN can not only control the shade of makeup transfer but also generate a new makeup style by mixing the makeup tensors of two makeup styles. The above experiments also demonstrate that PSGAN significantly broadens the application range of makeup transfer.
We conduct a user study for quantitative evaluation on Amazon Mechanical Turk (AMT). We use BGAN (BeautyGAN) [15], CGAN (CycleGAN) [29] and DIA [16] as baselines. For a fair comparison, we only compare with methods whose code or pre-trained models are released. We randomly select 20 source images and 20 reference images from both the MT test set and the MT-wild test set. After using the above methods to perform makeup transfer between these images, we obtain 800 images for each method (20 × 20 pairs on each of the two test sets). Then, 5 different workers are given unlimited time to choose the best image among those generated by the four methods, considering image realism and similarity to the reference makeup style. Table 1 shows the human evaluation results on both the MT and MT-wild test sets. PSGAN outperforms the other methods by a large margin, especially on the MT-wild test set.
Figure 6 shows the qualitative comparison of PSGAN with other state-of-the-art methods on frontal faces with neutral expressions. The result produced by DIA has an unnatural color on the hair and background since it performs makeup transfer on the whole image. Comparatively, the result of CycleGAN is more realistic than that of DIA, but CycleGAN can only synthesize general makeup that is not similar to the reference. The results of BeautyGAN and BeautyGlow outperform the previous methods. However, the result of BeautyGlow fails to preserve the color of pupils and does not have the same foundation makeup as the reference. Compared to the baselines, our method is able to generate vivid images with the same makeup styles as the reference.
We also conduct a comparison on the MT-wild test set with the state-of-the-art methods that provide code, as shown in Figure 7. Since the current methods perform makeup transfer by calculating the loss over whole facial regions and lack an explicit mechanism to guide the direction of makeup transfer at the pixel level, the makeup is applied to the wrong region of the face when dealing with images with different poses. For example, the bangs are transferred to the skin in the first row of Figure 7, and the eye makeup deviates to the right in the second row. In contrast, our proposed AMM module can accurately assign the makeup for every pixel by calculating similarities, which makes our results cleaner and more natural.
In order to perform customizable makeup transfer, we propose the Pose-robust Spatial-aware GAN (PSGAN), which first distills the makeup style of the reference into two makeup matrices and then leverages an Attentive Makeup Morphing (AMM) module to conduct makeup transfer accurately. The experiments demonstrate that our approach achieves state-of-the-art transfer results on both frontal face images and in-the-wild face images with various poses and expressions. Also, with the spatial-aware makeup matrices, PSGAN can transfer the makeup partially and adjust the shade of transfer, which greatly broadens the application range of makeup transfer. In the future, we plan to apply our framework to other conditional image synthesis problems that require customizable and precise synthesis.
[1] Examples-rules guided deep neural network for makeup recommendation. In AAAI.
[2] (2018) PairedCycleGAN: asymmetric style transfer for applying and removing makeup. In CVPR.
[3] (2019) BeautyGlow: on-demand makeup transfer framework with reversible generative network. In CVPR.
[4] StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR.
[5] (2016) A learned representation for artistic style. ArXiv abs/1610.07629.
[6] (2016) Preserving color in neural artistic style transfer. ArXiv abs/1606.05897.
[7] (2015) A neural algorithm of artistic style. ArXiv abs/1508.06576.
[8] Image style transfer using convolutional neural networks. In CVPR.
[9] (2014) Generative adversarial nets. In NeurIPS.
[10] (2009) Digital face makeup by example. In CVPR.
[11] (2017) Squeeze-and-excitation networks. In CVPR.
[12] (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV.
[13] Perceptual losses for real-time style transfer and super-resolution. In ECCV.
[14] (2015) Simulating makeup through physics-based manipulation of intrinsic image layers. In CVPR.
[15] (2018) BeautyGAN: instance-level facial makeup transfer with deep generative adversarial network. In ACM MM.
[16] (2017) Visual attribute transfer through deep image analogy. ACM TOG.
[17] (2013) "Wow! You are so beautiful today!". In ACM MM.
[18] (2016) Makeup like a superstar: deep localized makeup transfer network. In IJCAI.
[19] (2017) Deep photo style transfer. In CVPR.
[20] (2014) Recurrent models of visual attention. In NeurIPS.
[21] A neural attention model for abstractive sentence summarization. In EMNLP.
[22] (2016) Unsupervised cross-domain image generation. ICLR.
[23] (2007) Example-based cosmetic transfer. In PG.
[24] (2017) Attention is all you need. In NeurIPS.
[25] (2017) Non-local neural networks. In CVPR.
[26] (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML.
[27] Joint face detection and alignment using multitask cascaded convolutional networks. Signal Processing Letters.
[28] (2016) Pyramid scene parsing network. In CVPR.
[29] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.