Texture Deformation Based Generative Adversarial Networks for Face Editing

by WenTing Chen, et al.
Shenzhen University

Despite the significant success of image-to-image translation and latent-representation-based methods in facial attribute editing and expression synthesis, existing approaches still have limitations in the sharpness of details, the distinctness of the translation and identity preservation. To address these issues, we propose a Texture Deformation Based GAN, namely TDB-GAN, which disentangles texture from the original image and transfers domains based on the extracted texture. The approach uses the texture to transfer facial attributes and expressions without considering the object pose, which leads to sharper details and a more distinct visual effect in the synthesized faces, and also brings faster convergence during training. The effectiveness of the proposed method is validated through extensive ablation studies. We also evaluate our approach qualitatively and quantitatively on facial attribute and facial expression synthesis. The results on both the CelebA and RaFD datasets suggest that TDB-GAN achieves better performance than existing methods.



1 Introduction

Face editing aims to change or enhance facial attributes (e.g. hair color, expression, gender and age), and to add virtual makeup to human faces. In recent years, face editing has attracted great interest in the computer vision field [1, 22, 16]. Several image-to-image translation methods [8, 27, 24] have achieved facial attribute and expression manipulation on single or multiple domains. Most methods are based on generative adversarial networks (GANs) [3], such as CycleGAN [27], IcGAN [14] and StarGAN [2]. However, the generators of most image-to-image translation approaches are fed with the input image directly. When altering facial expressions, the synthesized expressions are often not genuine enough, since these methods only modify the face moderately.

Moreover, the task of face editing can also be tackled with an Encoder-Decoder architecture, by decoding the latent representation from the encoder conditioned on target attributes. This kind of architecture aims to figure out the relationship between facial attributes and the latent representation, and applies the latent representation to face editing [26, 21, 17, 13]. Commonly, most Encoder-Decoder architectures encode the input image into a low-dimensional latent representation, which may lead to a loss of information and representational capability. Besides, most approaches fail to preserve the identity during face editing.

To solve the issues raised by both image-to-image translation and latent representation approaches, we first adopt the DAE model [15] to decompose the input image into three different physical image signals, i.e. shading, albedo and deformation, with an Encoder-Decoder architecture, and then combine the shading and albedo to generate a pure, well-aligned texture image that presents the illumination effects and the characteristic appearance of the face. Next, we feed both the generated texture and the target domain labels into a GAN model to synthesize a new texture image with the target attributes. Finally, we warp the generated texture with the spatial deformation to generate the ultimate result, and we employ an identity loss between the generated image and the input image to preserve identity. Overall, our main contributions are summarized as follows:

1. We propose the Texture Deformation Based GAN, a novel framework that learns the mappings among multiple domains based on disentangled texture and warps the generated texture spatially to generate the face image with target domain features.

2. We empirically demonstrate the effectiveness of our TDB-GAN through ablation studies on facial attribute editing and expression synthesis. We validate the superiority of texture-to-image translation over image-to-image translation, and demonstrate the effectiveness of the identity loss through face verification.

3. The proposed TDB-GAN is evaluated on facial attributes and expression synthesis both qualitatively and quantitatively. The results suggest TDB-GAN outperforms the existing methods.

2 Related works

The popularity of generative models has a great effect on face editing. The Encoder-Decoder architecture and Generative Adversarial Network (GAN) [3] are the two major categories of methods for this task.

Intrinsic Deforming Autoencoder (DAE) [15] is a novel generative model which decomposes the input image into texture and deformation. DAE follows the deformable template paradigm and models image generation through texture synthesis and spatial deformation. DAE can obtain the prototypical object by removing the deformation. Discarding the variability due to deformations, the texture encoded from the original image is a purer representation. Moreover, by modeling the face image in terms of a low-dimensional latent code, we can more easily control the facial attributes and expression during the generative process.

The Generative Adversarial Network (GAN) [3] is a promising generative model that has been used to solve various computer vision tasks such as image generation [6, 23, 20], image translation [8, 27, 24] and face image editing [22, 2, 13]. A GAN is mainly designed to learn a generator G that generates fake samples and a discriminator D that distinguishes real samples from fake ones. Besides the typical adversarial loss, a reconstruction loss is often employed [2, 4] to make the generated faces as realistic as possible. In our approach, an identity loss is additionally proposed to ensure that the generated faces preserve the original identity.

Pix2Pix [8] is a typical image-to-image translation based method. The approach learns the mapping between input and output domains and has achieved impressive results in several image translation tasks [27, 24, 11]. Pix2Pix combines an adversarial loss with an L1 loss to translate images in a paired way. For unpaired images, several frameworks such as MUNIT [7], CycleGAN [27] and Invertible Conditional GAN [14] have been proposed. However, all these frameworks try to learn the joint distribution between two domains, which prevents them from handling multiple domains at the same time.

StarGAN [2] is the first generative model to achieve multi-domain image-to-image translation across different datasets with a single generator. It also consists of two modules: a discriminator D that distinguishes real images from fake ones and classifies the real images into their corresponding domains, and a generator G that generates a fake image from both the input image and a target domain label (a binary or one-hot vector). One of the novelties of StarGAN is that its generator G is trained to reconstruct the original image from the fake image given the original domain label. StarGAN also utilizes a mask vector together with the domain label to enable joint training on domains from different datasets. However, StarGAN is an image-to-image model and does not involve any latent representation, so its capability of changing facial attributes is limited.

AttGAN [4] is a multiple facial attribute editing model that contains three components at training time: an attribute classification constraint, reconstruction learning and adversarial learning. The content that the latent representation delivers is uncertain and limited; hence, imposing the attribute label on the latent representation might change other, unexpected parts. Similar to StarGAN, AttGAN applies an attribute classification constraint to guarantee correct attribute manipulation on the generated image and reconstruction learning to preserve the attribute-excluding details. AttGAN tries to free the latent representation from the attribute-independent constraint, while our approach encodes the input into separate latent representations to generate the texture, and employs texture-to-image translation to achieve face editing.

Figure 2: Overview of Texture Deformation Based GAN.

3 Texture Deformation Based GAN

In this section, we introduce the Texture Deformation Based GAN (TDB-GAN) framework for face attribute editing. As shown in Figure 2, TDB-GAN consists of two major modules, i.e. the intrinsic deforming autoencoder (DAE) and a GAN-based texture-to-image translation module.

3.1 Intrinsic Deforming Autoencoder

Recent works [2] aim at translating an original face image into a new face image with different attributes. However, the pose and shape of the face might influence facial attribute and expression synthesis. Thus, we utilize the Intrinsic DAE [15] to separate a face image into texture and deformation, disentangling this variation. DAE adopts an intrinsic decomposition regularization loss to model the physical properties of shading and albedo. The shading and the albedo are then combined to generate the texture, which eliminates the geometric information and represents the identity, illumination, face attributes and so on, whereas the deformation describes the spatial gradient of the warping field (spatial transformation).

3.1.1 The architecture of encoder

In this module, we feed the encoder E, a densely connected convolutional network, with an input image x. The encoder then generates a latent representation Z for the following decoders. In particular, the latent representation can be decomposed as Z = [Z_s, Z_a, Z_d], where Z_s, Z_a and Z_d are the shading-related, albedo-related and deformation-related representations, respectively.
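As a minimal NumPy sketch (the latent dimensionality and the 64/64/32 split are illustrative assumptions, not the paper's actual sizes), the three-way decomposition of the joint latent code amounts to slicing it:

```python
import numpy as np

def split_latent(z, dims=(64, 64, 32)):
    """Split the joint latent code Z into shading-, albedo- and
    deformation-related parts (the sizes here are hypothetical)."""
    d_s, d_a, d_d = dims
    z_s = z[:d_s]
    z_a = z[d_s:d_s + d_a]
    z_d = z[d_s + d_a:d_s + d_a + d_d]
    return z_s, z_a, z_d

z = np.random.randn(160)
z_s, z_a, z_d = split_latent(z)
assert z_s.shape == (64,) and z_a.shape == (64,) and z_d.shape == (32,)
```

Each slice is then routed to its own decoder, so gradients from the three reconstruction branches shape disjoint parts of the code.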


3.1.2 Decomposition of shading, albedo and deformation

As visualized in Figure 2, we introduce three separate decoders for shading, albedo and deformation, denoted D_s, D_a and D_d. The inputs to these decoders are delivered by the joint encoder network: the shading-related, albedo-related and deformation-related decoders are fed with the latent representations Z_s, Z_a and Z_d, respectively. The decoders provide a clear separation of shading, albedo and deformation:

S = D_s(Z_s),  A = D_a(Z_a),  d = D_d(Z_d),

where S, A and d denote the shading, the albedo and the deformation. The texture of the input image can then be computed from the shading and the albedo with the Hadamard product:

T = S ⊙ A.

Finally, the generated texture is warped spatially with the deformation to synthesize the ultimate image x̂ = W(T, d), where W denotes the operation of spatial warping.
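The texture composition and warping steps can be sketched as follows. This is a simplified stand-in, not the paper's implementation: it uses nearest-neighbour sampling instead of a differentiable bilinear sampler, and the function names are ours:

```python
import numpy as np

def compose_texture(shading, albedo):
    """Texture as the Hadamard (element-wise) product of shading and albedo."""
    return shading * albedo

def warp_nearest(texture, grid):
    """Warp a (H, W) texture by sampling it at `grid` (H, W, 2) of
    (row, col) coordinates, with nearest-neighbour rounding."""
    h, w = texture.shape
    rows = np.clip(np.rint(grid[..., 0]).astype(int), 0, h - 1)
    cols = np.clip(np.rint(grid[..., 1]).astype(int), 0, w - 1)
    return texture[rows, cols]

# Sanity check: warping with the identity grid reproduces the texture.
h, w = 4, 4
texture = compose_texture(np.full((h, w), 0.5), np.random.rand(h, w))
identity_grid = np.stack(
    np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
assert np.allclose(warp_nearest(texture, identity_grid), texture)
```

A non-identity grid displaces pixels, which is how the deformation branch re-introduces the pose and shape that were removed from the texture.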


3.1.3 The objective function

The objective function is composed of the reconstruction loss L_rec, the warping-field costs (a smoothness cost L_smooth and a bias reduce loss L_bias) and the shading loss L_shading. It can be written as:

L_DAE = L_rec + L_smooth + L_bias + L_shading,

where the reconstruction loss is defined as:

L_rec = || x̂ − x ||₂²,

the smoothness cost is given by:

L_smooth = λ_1 ( || ∇w_x ||₁ + || ∇w_y ||₁ ),

the bias reduce loss is formulated as:

L_bias = λ_2 || θ̄ − θ_I ||₂² + λ_3 || d̄ − d_I ||₂²,

and the shading loss is written as:

L_shading = λ_4 || ∇S ||₂².

In the equations above, x and x̂ represent the input and reconstructed images, respectively; w_x and w_y stand for the components of the local warping field; θ_I and θ̄ denote the identity affine transform and the average affine transform within the minibatch; and d_I and d̄ represent the identity grid and the average deformation grid within a minibatch.

3.2 Multi-Domain Texture-to-Image Translation

Similar to StarGAN, our goal is to train a multi-domain texture-to-image translation network. We first feed the generator with the texture and target domain labels randomly sampled from the training data. Then, we warp the generated texture with the deformation to synthesize the fake face image. We also impose a domain classification loss to classify the domain of the fake face image. Furthermore, we use a reconstruction loss and an identity loss to supervise the generator to synthesize more realistic and identity-preserving face images, respectively.

3.2.1 Adversarial loss

We utilize an adversarial loss to make the generated images as genuine as the real samples. The adversarial loss can be written as:

L_adv = E_t [ log D_src(t) ] + E_{t,c} [ log ( 1 − D_src( G(t, c) ) ) ].

In this loss function, G generates a new texture G(t, c) conditioned on both the face texture t and the target domain label c, while D strives to differentiate the real face texture from the generated one. Here, D_src(t) denotes a probability distribution over sources given by D. The discriminator tries to maximize this objective, whereas the generator tries to minimize it.
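As a toy illustration of how this objective behaves (plain NumPy on source probabilities rather than network outputs; the function name is ours):

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    """E[log D_src(t)] + E[log(1 - D_src(G(t, c)))] for source
    probabilities in (0, 1). D maximizes this value, G minimizes it."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# An undecided discriminator (p = 0.5 everywhere) yields 2 * log(0.5);
# a confident one pushes the objective toward its maximum of 0.
undecided = adversarial_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
confident = adversarial_loss(np.array([0.999]), np.array([0.001]))
assert confident > undecided
```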

3.2.2 Domain classification loss

To enable the generator to generate fake images of the target domain, we add a domain classifier on top of D. For the optimization of D and G, we define the domain classification loss of the real texture as follows:

L_cls^r = E_{t,c′} [ − log D_cls( c′ | t ) ],

where c′ stands for the original domain label of the real face texture and D_cls(c′ | t) represents a probability distribution over domain labels produced by D. In addition, the domain classification loss of the fake face texture is defined as:

L_cls^f = E_{t,c} [ − log D_cls( c | G(t, c) ) ].
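Both classification terms are plain negative log-likelihoods over the predicted per-domain distribution; a small sketch (shapes and names are illustrative):

```python
import numpy as np

def domain_cls_loss(probs, labels):
    """-E[log D_cls(c | t)]: negative log-likelihood of the target
    domain labels under the predicted distribution `probs` (N, K)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = domain_cls_loss(probs, labels)  # small when the right domains get high mass
```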
3.2.3 Reconstruction loss

By optimizing the adversarial and classification losses, G is able to generate realistic face textures with the proper attributes. Nonetheless, we cannot guarantee that the generated face texture preserves the content of the input face texture while changing only its domain-related parts. Therefore, reconstruction losses are imposed on the reconstructed texture and the reconstructed image, respectively. For the texture image, we apply a cycle consistency loss proposed by Zhu et al. [27] to our generator, defined as:

L_rec^t = E_{t,c,c′} [ || t − G( G(t, c), c′ ) ||₁ ],

where G takes the generated face texture G(t, c) and the original domain label c′ as input and tries to reconstruct the original face texture t. We utilize the L1 norm to compute our reconstruction loss.

For the reconstructed image, the generator synthesizes a new texture from the original texture t and the original domain label c′. The new texture is then warped with the deformation d to generate the output image. The L1 norm of the difference between the input image x and the generated image is defined as:

L_rec^x = E_{x,t,c′} [ || x − W( G(t, c′), d ) ||₁ ].
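Both reconstruction terms reduce to a mean absolute difference; a small sketch (the helper name is ours):

```python
import numpy as np

def l1_recon(a, b):
    """Mean absolute difference: the L1 norm used for both the texture
    cycle term and the image-level reconstruction term."""
    return np.mean(np.abs(a - b))

# A perfect cycle (G(G(t, c), c') == t) drives the loss to zero.
t = np.random.rand(8, 8)
assert l1_recon(t, t) == 0.0
```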
3.2.4 Identity loss

Even though the reconstruction loss can preserve some unrelated content of the input face texture, the generator might still change the identity of the output face texture. The generator would learn not only the attribute-related parts but also the identity characteristics of the people with label c in the training set. For example, the majority of the celebrities in the CelebA [12] face images come from Europe or America, and only a few are from Asia. Therefore, when learning attributes from European or American faces, an Asian face might not preserve its own particular facial features.

Therefore, we exploit an identity preserving network to retain the identity discrimination of the synthesized face texture, and an identity loss to preserve personal facial features. This approach is derived from the work proposed by Huang et al. [5]. Let φ denote a feature extractor applied to the synthesized face texture G(t, c) and the real face texture t. We select LightCNN [19] as our feature extractor and fix its parameters during training. Specifically, we apply the output of the second-to-last fully connected layer of φ to the identity loss L_id:

L_id = E_{t,c} [ || φ( G(t, c) ) − φ(t) ||₂² ],

where ||·||₂ denotes the L2 norm.
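The identity term itself is a squared L2 distance in feature space; a sketch with hypothetical feature vectors standing in for the LightCNN embeddings:

```python
import numpy as np

def identity_loss(feat_fake, feat_real):
    """Squared L2 distance between identity features of the synthesized
    and the real face texture (the extractor's parameters stay frozen)."""
    return np.sum((feat_fake - feat_real) ** 2)

# Identical embeddings incur no penalty.
f = np.array([0.3, -1.2, 0.7])
assert identity_loss(f, f) == 0.0
```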

3.2.5 GAN-related objective function

Overall, the final objective functions used to optimize D and G are:

L_D = − L_adv + λ_cls L_cls^r,
L_G = L_adv + λ_cls L_cls^f + λ_rec ( L_rec^t + L_rec^x ) + λ_id L_id,

where λ_cls, λ_rec and λ_id are hyper-parameters that control the weights of the domain classification, reconstruction and identity losses.
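Assembling the pieces, the two objectives can be sketched as below; the default weights are placeholders, not the values used in the paper:

```python
def d_objective(adv, cls_real, lam_cls=1.0):
    """Discriminator objective: minimizing it maximizes the adversarial
    term while improving domain classification on real textures."""
    return -adv + lam_cls * cls_real

def g_objective(adv, cls_fake, rec, idl,
                lam_cls=1.0, lam_rec=10.0, lam_id=1.0):
    """Generator objective: adversarial term plus weighted domain
    classification, reconstruction and identity terms."""
    return adv + lam_cls * cls_fake + lam_rec * rec + lam_id * idl
```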

4 Implementation

In this section, we describe the network architecture and how we stabilize the training process.

4.1 Network Architecture

Since the proposed TDB-GAN consists of two major modules, we directly utilize the encoder and decoder architectures from DAE [15]. The generator and discriminator architectures are adopted from StarGAN [2] in our framework. We also leverage PatchGANs[8, 27] for the discriminator to distinguish the real images from synthesized images.

4.2 Training Strategy

In order to stabilize and accelerate the training procedure of TDB-GAN, we propose a multi-stage training strategy. In the first stage, we optimize only the DAE model, namely the encoder E and the decoders D_s, D_a and D_d. We then fix the pretrained weights of the DAE model while the generator G and discriminator D are trained with the L_G (with the identity term disabled) and L_D losses, respectively. Finally, we jointly train the DAE, G and D. Note that we impose the identity loss only in the final training stage, to ensure that the generated image preserves the identity.
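The staging can be summarized as a small lookup of which modules receive gradient updates in each phase (stage names and the module grouping are our shorthand, not the paper's code):

```python
def trainable_modules(stage):
    """Modules updated in each training stage of the schedule described
    above; the identity loss is only active in the final stage."""
    if stage == "dae":      # stage 1: DAE alone
        return {"encoder", "decoders"}
    if stage == "gan":      # stage 2: DAE frozen, GAN trained without L_id
        return {"generator", "discriminator"}
    if stage == "joint":    # stage 3: everything trained jointly, L_id on
        return {"encoder", "decoders", "generator", "discriminator"}
    raise ValueError(f"unknown stage: {stage}")
```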

5 Experiments

In this section, we first compare TDB-GAN with and without the DAE module on facial attribute transfer. In addition, we demonstrate empirically that TDB-GAN with the identity loss preserves more identity information than TDB-GAN without it.

5.1 Datasets

The CelebFaces Attributes (CelebA) dataset [12] contains 202,599 face images of 10,177 celebrities, each annotated with 40 binary attributes. We resize all aligned images to the network input resolution. We randomly select 2,000 images as the test set and use the remaining images for training. We mainly test ten domains defined by the following attributes: expression (smiling/not smiling), skin color (pale skin/normal skin), accessory (eyeglasses/no eyeglasses), gender (male/female) and age (young/old).

The Radboud Faces Database (RaFD) [10] consists of 4,824 images collected from 67 subjects. Each subject shows eight facial expressions in three different gaze directions, captured from three different angles. We first detect all faces with MTCNN [25], crop the images so that the faces are centered, and resize them to the network input resolution.

5.2 Training

All the models are optimized with Adam [9]. We flip the images horizontally with a probability of 0.5 to augment the training set. We perform one generator update after five discriminator updates, as described in [2]. The batch size is set to 100 for all experiments. For the experiments on CelebA, we first train the DAE module for 5 epochs with a learning rate of 0.0002. Then, we train the generator and discriminator with a learning rate of 0.0001 for the first 100 epochs and linearly decay the learning rate to 0 over the next 100 epochs. Next, we impose the identity loss on the GAN module and train the GAN-related part for 29 epochs with a learning rate of 0.0001, applying the same decaying strategy over the next 29 epochs. The training strategy for RaFD is similar to that for CelebA. The trade-off weights of the individual loss terms in the training objective are set empirically.

5.3 Ablation studies

A unique advantage of TDB-GAN is its capability of disentangling the texture from the input image and editing the facial attributes and expression without the impact of the pose and shape of the face. We conduct an experiment on TDB-GAN with and without the DAE module. Additionally, we show through face verification results that the proposed identity loss helps to preserve more identity information.

5.3.1 Results with/without DAE

In TDB-GAN, we choose to separate the texture and deformation from the input image, hypothesizing that deformation information significantly affects the quality of face editing and the convergence of the domain classification loss on fake face textures during training.

Figure 3: Facial attribute transfer results on the CelebA dataset. The first column demonstrates the input image, next five columns show the single attribute transfer results. The odd rows display the results generated by the TDB-GAN without DAE module, while the even rows show the results produced with DAE.

As illustrated in Figure 3, the eyeglasses generated by TDB-GAN with DAE are more obvious. For example, no glasses can be observed on the faces in rows C, E, G and I generated by TDB-GAN without DAE. The images generated by TDB-GAN without DAE (A, C) do not show the pale skin as realistically as those generated with DAE. While the faces of C and E generated by TDB-GAN without DAE are still smiling, TDB-GAN with DAE transfers the face image to smiling or not smiling correctly and naturally. Lastly, our proposed method produces more genuine changes in feminization, masculinization, aging and rejuvenation than TDB-GAN without the DAE module. The main reason is that DAE disentangles texture and deformation from the input image: the former preserves the main features and identity of the face, whereas the latter contains information about the pose of the head, the shape of the face and so on. When we feed the generator with the well-aligned texture, face editing does not need to consider shape invariance. By contrast, TDB-GAN without DAE transfers the attributes under the constraint of shape invariance. It can also be observed from Figure 4 that TDB-GAN with DAE achieves a lower domain classification loss on fake face textures than TDB-GAN without DAE, with a clear margin between the curves in the chart. The lower domain classification loss on fake face textures indicates better attribute transfer.

Figure 4: The domain classification loss of the fake face textures generated by the TDB-GAN with/without DAE module.

5.3.2 Results for identity loss

While transferring domains, the network tends to transfer the average features of the target domain to decrease the domain classification loss on the fake face textures, even though the domain classification loss on the real face textures plays against it. Thus, we propose the identity loss to ensure identity invariance, and we evaluate its contribution to domain transfer in terms of face verification accuracy. In the following sections, we present verification results for TDB-GAN with and without the identity loss.

In this experiment, we train our model on RaFD to synthesize facial expressions. There are eight different expressions on RaFD. We fix the input domain as the ‘neutral’ expression and set the target domain to the seven remaining expressions. Thus, the proposed task aims to impose a particular expression to a neutral face.

We randomly split the RaFD dataset into training and testing sets with a 90%:10% ratio, namely 4,320 training images and 504 testing images, including 63 neutral faces. For each of the neutral faces, we apply our network to generate seven facial expression images, i.e. 441 fake facial expression images in total. Based on the 441 generated faces and 504 test images, we randomly generate 3,000 client accesses and 3,000 impostor accesses. The network proposed by Wen et al. [18] is employed to extract 512-dimensional identity features from the face images, and the cosine distance is adopted to measure the similarity of two faces. The similarity is compared with a threshold (e.g. 0.5) to decide whether the two faces are from the same person. In this work, TPR (True Positive Rate), FPR (False Positive Rate), EER (Equal Error Rate), AP (Average Precision) and AUC (Area Under Curve) are used to evaluate the performance of face verification. Higher scores indicate better results for all of these metrics except EER, where lower is better.
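The operating-point metrics can be computed directly from the genuine (client) and impostor similarity scores; a NumPy sketch, where the EER threshold sweep is a simple approximation rather than the exact procedure used in the paper:

```python
import numpy as np

def tpr_at_fpr(genuine, impostor, target_fpr):
    """TPR at the similarity threshold where the impostor acceptance
    rate (FPR) equals `target_fpr`; higher scores mean 'same person'."""
    thr = np.quantile(impostor, 1.0 - target_fpr)
    return np.mean(genuine > thr)

def equal_error_rate(genuine, impostor):
    """EER: the error rate where false acceptance and false rejection
    cross, found by sweeping every observed score as a threshold."""
    best = 1.0
    for thr in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= thr)   # impostors wrongly accepted
        frr = np.mean(genuine < thr)     # clients wrongly rejected
        best = min(best, max(far, frr))
    return best

# Well-separated score distributions give a perfect operating point.
genuine = np.array([0.90, 0.80, 0.95])
impostor = np.array([0.10, 0.20, 0.05])
assert equal_error_rate(genuine, impostor) == 0.0
```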

Figure 5 and Table 1 show the ROC curves and the verification results of TDB-GAN with and without the identity loss. As shown in Table 1, while the TPR@FPR=1% of TDB-GAN without the identity loss is 8.70, the identity loss increases the TPR@FPR=1% to 11.07. The identity loss also more than doubles the TPR@FPR=0.1% and nearly doubles the TPR@FPR=0% of TDB-GAN. Table 1 further suggests that TDB-GAN with the identity loss achieves a lower EER and higher AP and AUC than TDB-GAN without it.

Figure 5: ROC curves on the test set of RaFD dataset.
Metric with identity loss w/o identity loss
TPR@FPR=1% 11.07 8.70
TPR@FPR=0.1% 1.60 0.60
TPR@FPR=0% 0.23 0.13
EER (%) 23.60 24.50
AP (%) 81.89 80.29
AUC (%) 83.73 82.82
Table 1: Verification performance on RaFD dataset.
Figure 6: Facial attribute transfer results on the CelebA dataset.

5.4 Qualitative and quantitative evaluation on CelebA

We first display qualitative results of facial attribute transfer on the CelebA dataset. Then, the quantitative results are evaluated with a user questionnaire.

5.4.1 Qualitative evaluation

Figure 6 shows the face images generated by IcGAN, CycleGAN, StarGAN and our TDB-GAN for attribute transfer in smiling, pale skin, eyeglasses, gender and age. As visualized in the figure, the images generated by the image-to-image translation approaches are better than those generated by IcGAN. Our texture representation retains more information than the low-dimensional latent representation and also preserves attribute-independent information, such as hairstyle. The faces generated by TDB-GAN for gender and age transfer are better than those generated by StarGAN, and the eyeglasses added by TDB-GAN are more natural than those added by CycleGAN. Furthermore, our proposed method not only achieves higher visual quality but also preserves the identity of the input image, due to the effect of the identity loss.

5.4.2 Quantitative evaluation

For a quantitative evaluation, we perform a user study on the visual effect of transferred facial attributes to assess IcGAN [14], CycleGAN [27], StarGAN [2] and TDB-GAN. Each of the four approaches was applied to transfer the five facial attributes, i.e. smile, pale skin, eyeglasses, gender and age, on faces from twenty individuals. For each of the five attributes transferred for the 20 subjects, the four images synthesized by the different models were shown to volunteers, who were asked to select the best one in terms of realism, preservation of identity and quality of the facial attribute synthesis. As 15 volunteers participated in the questionnaire, a maximum of 300 votes (15 volunteers × 20 subjects) can be received for each approach and attribute. Table 2 lists the ratio of votes received for each model and attribute. While StarGAN received the most votes for pale skin transfer, our TDB-GAN received the most votes for four of the five attributes, i.e. smile, eyeglasses, gender and age.

Smile 2.33% 21.33% 19.00% 57.33%
Pale skin 2.00% 37.00% 36.67% 24.33%
Eyeglasses 0 28.00% 30.33% 41.67%
Gender 1.33% 35.00% 9.67% 54.00%
Age 0.33% 20.00% 17.67% 62.00%
Table 2: The perceptual evaluation of different models. Note that the percentages in each row may not sum exactly to 100% due to rounding.
Figure 7: Facial expression synthesis results on RaFD dataset.

5.5 Qualitative and quantitative evaluation on RaFD

In the following sections, we demonstrate the qualitative and quantitative evaluation results on the RaFD dataset.

5.5.1 Qualitative evaluation

Figure 7 shows an example of seven facial expressions synthesized by IcGAN [14], CycleGAN [27], StarGAN [2] and our TDB-GAN. As shown in the figure, the images generated by StarGAN and our TDB-GAN have better visual quality than those generated by IcGAN and CycleGAN. IcGAN transfers the neutral expression to various expressions, but the generated fake images have the lowest quality; we believe the latent vector extracted by IcGAN lacks representational capacity. While the performance of CycleGAN is considerably better than that of IcGAN, the fake images generated by CycleGAN are still ambiguous. The fake faces synthesized by StarGAN have much more natural and distinct expressions. Nonetheless, TDB-GAN is superior to StarGAN in terms of sharper details and more distinguishable expressions. For example, the faces generated by our TDB-GAN for the angry, fearful and surprised expressions are much more representative than those of StarGAN, especially in the eye regions.

We believe TDB-GAN's ability to separate texture and deformation contributes most to the image quality, as it allows TDB-GAN to focus on the facial expression editing itself, instead of the pose, shape and so on.

5.5.2 Quantitative evaluation

For a quantitative evaluation, we compute the classification error of facial expression recognition on the generated images.

We first train a facial expression classifier with the 4,320 training images. Then we train all the GAN models using the same training set.

For testing, we first use the trained GANs to transfer all the neutral expression of the testing images to seven different expressions. Then we use the aforementioned classifier to classify these synthesized expressions. Table 3 lists the accuracies of the facial expression classifier on the images synthesized by different GAN models.

As shown in Table 3, the images synthesized by the TDB-GAN model achieve the highest accuracy, which suggests that TDB-GAN synthesizes the most realistic facial expressions among the compared methods.

Models Accuracy (%)
IcGAN 91.61
CycleGAN 88.44
StarGAN 92.06
TDB-GAN 97.28
Table 3: The expression classification accuracies of images synthesized by different GAN models.

6 Conclusion

In this paper, we proposed the Texture Deformation Based GAN to perform texture-to-image translation among multiple domains. Thanks to the disentangled texture and deformation and to the identity loss, the proposed TDB-GAN generates images with higher quality and better identity preservation than the existing methods.


  • [1] Y.-C. Chen, H. Lin, M. Shu, R. Li, X. Tao, Y. Ye, X. Shen, and J. Jia. Facelet-bank for fast portrait manipulation. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.
  • [2] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [4] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678, 2017.
  • [5] R. Huang, S. Zhang, T. Li, R. He, et al. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. ECCV, 2017.
  • [6] X. Huang, Y. Li, O. Poursaeed, J. E. Hopcroft, and S. J. Belongie. Stacked generative adversarial networks. In CVPR, volume 2, page 3, 2017.
  • [7] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
  • [8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
  • [9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [10] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg. Presentation and validation of the radboud faces database. Cognition and emotion, 24(8):1377–1388, 2010.
  • [11] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
  • [12] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
  • [13] R. Natsume, T. Yatagawa, and S. Morishima. Rsgan: Face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447, 2018.
  • [14] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible Conditional GANs for image editing. In NIPS Workshop on Adversarial Training, 2016.
  • [15] Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. arXiv preprint arXiv:1806.06503, 2018.
  • [16] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017.
  • [17] R. Sun, C. Huang, J. Shi, and L. Ma. Mask-aware photorealistic face attribute manipulation. arXiv preprint arXiv:1804.08882, 2018.
  • [18] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
  • [19] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [20] W. Xian, P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Texturegan: Controlling deep image synthesis with texture patches. arXiv preprint, 2017.
  • [21] T. Xiao, J. Hong, and J. Ma. Dna-gan: Learning disentangled representations from multi-attribute images. International Conference on Learning Representations, Workshop, 2018.
  • [22] T. Xiao, J. Hong, and J. Ma. Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–187, September 2018.
  • [23] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
  • [24] Z. Yi, H. R. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, pages 2868–2876, 2017.
  • [25] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
  • [26] S. Zhou, T. Xiao, Y. Yang, D. Feng, Q. He, and W. He. Genegan: Learning object transfiguration and attribute subspace from unpaired data. In Proceedings of the British Machine Vision Conference (BMVC), 2017.
  • [27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.