We propose to incorporate local adversarial discriminators into an image domain translation network for detail transfer between two images, and apply these discriminators on overlapping image regions to achieve image-based facial makeup transfer and removal. By encouraging cross-cycle consistency between input and output, we can disentangle the makeup latent variable from other factors in a single facial image. By increasing the number of overlapping local discriminators, complex makeup styles with high-frequency details can be seamlessly transferred or removed while facial identity and structure are both preserved. See Figure 1.
The contributions of our paper are:
By utilizing local adversarial discriminators rather than cropping the image into separate local patches, our network can seamlessly transfer and remove dramatic makeup styles;
By incorporating asymmetric loss functions on the makeup transfer and removal branches, the network is forced to disentangle the makeup latent variable from other factors, and thus can generate photo-realistic results in which facial identity is largely preserved;
A dataset of unpaired before-makeup and after-makeup facial images will be released for non-commercial purposes upon the paper's acceptance.
Our target application, digital facial makeup [taaz, meitu], has become increasingly popular. Its inverse, known as facial de-makeup [Wang_2016_facebehind, makeupgo], is also starting to gain attention. However, current deep learning methods only work for, or only demonstrate, conventional and relatively simple makeup styles, possibly due to limitations of their network architectures and overfitting to their datasets. In particular, existing methods often fail to transfer or remove dramatic makeup, even though previewing a dramatic style, which may take hours to apply physically, is oftentimes the main use of such an application.
Given an image of a clean face without makeup as the source, and another image of an after-makeup face as the reference, the makeup transfer problem is to synthesize a new image where the specific makeup style from the reference is applied on the face of the source (Figure 1). The main difficulty stems from extracting the makeup-only latent variable, which must be disentangled from other factors in a given facial image; this problem is often referred to as content-style separation. Most existing works address it through region-specific style transfer and rendering [Liu_2016_IJCAI, Chang_2018_CVPR, Liu_2013_ACMMM, Nguyen_2017_SmartM, Alashkar_2017_AAAI]. This approach can precisely extract the makeup style in specific, well-defined facial regions such as the eyes and mouth where makeup is normally applied, but it restricts the method to the vicinity of these regions, and thus fails to transfer or remove more dramatic makeup whose color and texture details can lie far from these facial features.
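In notation introduced here purely for clarity (not a formulation taken from any particular prior work), the task is to learn a transfer function $G$ and a removal function $F$ such that

```latex
\hat{y} = G(x, y_{\mathrm{ref}}), \qquad \hat{x} = F(\hat{y}), \qquad
F\big(G(x, y_{\mathrm{ref}})\big) \approx x,
```

where $x$ is the source face without makeup, $y_{\mathrm{ref}}$ is the makeup reference, and $\hat{y}$ is the transfer result. The cross-cycle constraint $F(G(x, y_{\mathrm{ref}})) \approx x$ ties removal back to the source, which is what pressures the network to separate the makeup factor from identity and structure.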
By incorporating multiple and overlapping local discriminators in a content-style disentangling network, we successfully perform transfer (resp. removal) of complex/dramatic makeup styles with all details faithfully transferred (resp. removed).
2 Related Work
Given the bulk of deep learning work on photographic image synthesis, we review related work on image translation and style transfer, and on makeup transfer. We also review approaches that involve global and local discriminators, and describe the differences between our approach and theirs.
| Discriminator setting | Representative methods |
| --- | --- |
| Global discriminator | GAN [gan] |
| Single local discriminator | Image completion [IizukaSIGGRAPH2017, GFC_CVPR_2017], PatchGAN [Li2016PrecomputedRT], CycleGAN [cyclegan] |
| Multiple overlapping local discriminators | LADN (ours) |
Style transfer and image domain translation. Style transfer can be formulated as an image domain translation problem, first posed by Taigman et al. [taigman_2016_unsupervised] as learning a generative function that maps a sample image from a source domain to a target domain. Isola et al. [pix2pix2016] proposed the pix2pix framework, which adopts a conditional GAN to model the generative function. This method, however, requires cross-domain, paired image data for training. Zhu et al. [cyclegan] introduced CycleGAN to relax this paired-data requirement, by incorporating a cycle consistency loss into the generative network so that generated images satisfy the distribution of the desired domain. Lee et al. [DRIT] recently proposed a disentangled representation framework, DRIT, to diversify the outputs with unpaired training data by adding a reference image from the target domain as input. They encode images into a domain-invariant content space and a domain-specific attribute space. By disentangling content and attribute, the generated output adopts the content of an image in another domain while preserving the attributes of its own domain. However, in the context of makeup transfer and removal, DRIT can only be applied when the relevant makeup style transfer can be formulated as image-to-image translation. As our experiments show, this means that only light makeup styles can be handled.
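For reference, the cycle consistency constraint used by CycleGAN (standard notation, with $G : X \to Y$ and $F : Y \to X$ the two generators) is

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\, \lVert F(G(x)) - x \rVert_1 \,\big]
+ \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\, \lVert G(F(y)) - y \rVert_1 \,\big],
```

which penalizes a round trip through both domains that fails to reconstruct the input; our cross-cycle consistency plays the analogous role with the makeup reference added as input.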
Makeup transfer and removal. Tong et al. [Tong_2007_CVPR] first tackled this problem by solving the mapping of cosmetic contributions of color and subtle surface geometry. However, their method requires the input to be pairs of well-aligned before-makeup and after-makeup images, which limits its practicability. Guo et al. [Guo_2009_CVPR] proposed to decompose the source and reference images into face structure, skin detail, and color layers, and then transfer information on each layer correspondingly. Li et al. [Li_2015_CVPR] decomposed the image into intrinsic image layers, and used physically-based reflectance models to manipulate each layer to achieve makeup transfer. Recently, a number of makeup recommendation and synthesis systems have been developed [Liu_2013_ACMMM, Nguyen_2017_SmartM, Alashkar_2017_AAAI], but their contribution is on makeup recommendation, and their capability for makeup transfer is limited. As the style transfer problem has recently been successfully formulated as maximizing feature similarities in deep neural networks, Liu et al. [Liu_2016_IJCAI] proposed to transfer makeup style by locally applying the style transfer technique on facial components.
In addition to makeup transfer, the problem of digitally removing makeup from portraits has also gained attention from researchers [Wang_2016_facebehind, makeupgo], but these works treat makeup transfer and removal as separate problems. Chang et al. [Chang_2018_CVPR] formulated makeup transfer and removal as an unsupervised image domain transfer problem. They augmented CycleGAN with a makeup reference, so that the specific makeup style of the reference image can be transferred to the non-makeup face to generate photo-realistic results. However, since they crop out the eye and mouth regions and train them separately as local branches, more emphasis is given to these regions; the makeup style on other regions (such as the nose, cheeks, forehead, or the overall skin tone/foundation) cannot be handled properly. Very recently, Li et al. [Li_2018_beautygan] also tackled makeup transfer and removal together by incorporating a "makeup loss" into CycleGAN. Although their network structure is somewhat similar to ours, we are the first to disentangle the makeup latent variable and to achieve transfer and removal of extreme, dramatic makeup styles.
Global and local discriminators. Since Goodfellow et al. [gan] proposed generative adversarial networks (GANs), many related works have employed discriminators in a global setting. In the domain translation problem, while a global discriminator can distinguish images from different domains, it only captures global structures for a generator to learn. Local (patch) discriminators compensate for this by assuming independence between pixels separated by more than a patch diameter, modeling images as Markov random fields. Li et al. [Li2016PrecomputedRT] first utilized a discriminator loss over different local patches to train a generative neural network. Such a "PatchGAN" structure was also used in [pix2pix2016], where a local discriminator was combined with an L1 loss to encourage the generator to capture local high-frequency details. In image completion [IizukaSIGGRAPH2017, GFC_CVPR_2017], a global discriminator was used to maintain global consistency of image structures, while a local discriminator was used to ensure consistency of the generated patches in the completed region with the image context. Azadi et al. [azadi2017multi] similarly combined a local discriminator with a global discriminator for the font style transfer problem.
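A minimal NumPy sketch of the patch-based idea: instead of producing one score for the whole image, a local discriminator scores every overlapping patch, and the per-patch losses are aggregated. Here `score_fn` is a trivial stand-in statistic so the sketch is runnable; in practice the patch discriminator is a learned convolutional network.

```python
import numpy as np

def patch_scores(img, patch=16, stride=8, score_fn=None):
    """Score every overlapping patch of `img` (H x W x C).

    `score_fn` stands in for a learned patch discriminator; here it
    defaults to a simple statistic so the sketch is self-contained.
    """
    if score_fn is None:
        score_fn = lambda p: float(p.mean())  # placeholder "realness" score
    h, w = img.shape[:2]
    scores = []
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            scores.append(score_fn(img[i:i + patch, j:j + patch]))
    return np.array(scores)

img = np.random.rand(64, 64, 3)
s = patch_scores(img)
# A 64x64 input with 16x16 patches at stride 8 yields a 7 x 7 grid:
print(len(s))  # 49 overlapping patch scores
```

Because the stride is smaller than the patch size, adjacent patches overlap, so no pixel falls on a hard seam between discriminator territories.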
Contrary to all previous works, where only a single local discriminator is used and local patches are sampled uniformly, we incorporate multiple style discriminators specialized for different facial patches defined by facial landmarks. These discriminators can therefore distinguish whether the generated facial makeup is consistent with the makeup reference, forcing the generator to learn to transfer the specific makeup style of the reference.
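How landmark-defined local regions might be gathered for per-patch discriminators can be sketched as follows. The landmark coordinates and patch size below are illustrative placeholders, not the paper's actual configuration; in practice landmarks come from a facial landmark detector.

```python
import numpy as np

def landmark_patches(img, landmarks, size=24):
    """Crop a fixed-size patch centered on each facial landmark.

    Neighboring patches overlap whenever two landmarks are closer than
    `size` pixels, so adjacent discriminators see shared context.
    """
    h, w = img.shape[:2]
    half = size // 2
    patches = []
    for (x, y) in landmarks:
        # Clamp the top-left corner so the crop stays inside the image.
        x0 = min(max(x - half, 0), w - size)
        y0 = min(max(y - half, 0), h - size)
        patches.append(img[y0:y0 + size, x0:x0 + size])
    return patches

img = np.random.rand(128, 128, 3)
# Illustrative landmark positions (eye corners, nose tip, mouth corners).
landmarks = [(40, 55), (88, 55), (64, 80), (48, 100), (80, 100)]
patches = landmark_patches(img, landmarks)
print(len(patches), patches[0].shape)  # 5 (24, 24, 3)
```

Each fixed-size patch would then be fed to its own specialized discriminator, one per landmark-defined region, with the corresponding patch of the makeup reference serving as the real example.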