1 Introduction
Our digital age has witnessed a soaring demand for flexible, high-quality portrait manipulation, not only from smartphone apps but also from the photography industry, e-commerce promotion, and movie production. Portrait manipulation has also been studied extensively [34, 5, 8, 18, 1, 33] in the computer vision and computer graphics communities. Previous methods are dedicated to adding makeup [23, 6], performing style transfer [9, 14, 24, 12], age progression [42], and expression manipulation [1, 39], to name a few. However, these methods are tailored to a specific task and cannot be transferred to continuous and general multi-modality portrait manipulation.
Recently, generative adversarial networks have demonstrated compelling results in synthesis and image translation [15, 38, 4, 35, 44, 13], among which [44, 40] proposed cycle-consistency for unpaired image translation. In this paper, we extend this idea to a conditional setting by leveraging additional facial landmark information, which is capable of capturing intricate expression changes. This simple modification brings two benefits. First, cycle mapping can effectively prevent many-to-one mapping [44, 45], also known as mode collapse. In the context of face/pose manipulation, cycle-consistency also induces identity preservation and bidirectional manipulation, whereas the previous method [1] assumes a neutral face to begin with, and others are unidirectional [26, 29] and thus manipulate within the same domain. Second, face images of different textures or styles are considered different modalities, and current landmark detectors do not work on such stylized images. With our design, we can pair samples from multiple domains and translate between each pair of them, thus enabling landmark extraction indirectly on stylized portraits. Our framework can also be extended to makeup/makeup removal, aging manipulation, etc. once corresponding data is collected. Considering the lack of ground-truth data for many face manipulation tasks, we leverage the results of [14] to generate pseudo-targets for learning simultaneous expression and modality manipulation, but they can be replaced with any desired target domain.
However, two main challenges remain in achieving high-quality portrait manipulation. First, we propose to learn a single generator as in [7], but StarGAN [7] deals with discrete manipulation and fails on high-resolution images, producing irremovable artifacts. To synthesize images of photo-realistic quality (512x512), we propose multi-level adversarial supervision inspired by [37, 41], where synthesized images at different resolutions are propagated and combined before being fed into multi-level discriminators. Second, to avoid texture inconsistency and artifacts during translation between different domains, we integrate the Gram matrix [9] as a measure of texture distance into our model, as it is differentiable and can be trained end-to-end using back-propagation. Fig. 1 shows the results of our model.
Extensive evaluations show, both quantitatively and qualitatively, that our method is comparable or superior to state-of-the-art generative models in performing high-quality portrait manipulation (see Section 4.2). Our model is bidirectional, which circumvents the need to start from a neutral face or a fixed domain. This feature also ensures stable training and identity preservation, and scales easily to other desired domain manipulations. In the following section, we review related work and point out the differences. Details of PortraitGAN are elaborated in Section 3. We evaluate our approach in Section 4 and conclude the paper in Section 5.
2 Related Work
Face editing
Face editing or manipulation has been widely studied in computer vision and graphics, including face morphing [3], expression editing [32, 21], age progression [16], and facial reenactment [2, 34, 1]. However, these models are designed for a particular task and thus rely heavily on domain knowledge and certain assumptions. For example, [1] assumes neutral and frontal faces to begin with, while [34] assumes the availability of target videos with variation in both pose and expression. Our model differs from them in that it is a data-driven approach that does not require domain knowledge and is designed to handle general face manipulations.
Image translation
Our work can be categorized as image translation with generative adversarial networks [13, 43, 4, 11, 22, 40, 37], whose goal is to learn a mapping that induces a distribution indistinguishable from that of the target domain, through adversarial training between a generator and a discriminator. For example, Isola et al. [13] take an image as a condition for general image-to-image translation trained on paired samples. Later, Zhu et al. [44] extend [13] by introducing a cycle-consistency loss that obviates the need for matched training pairs. In addition, it alleviates many-to-one mapping during the training of generative adversarial networks, also known as mode collapse. Inspired by this, we integrate this loss into our model for identity preservation between different domains.
Another seminal work that inspired our design is StarGAN [7], where target facial attributes are encoded into a one-hot vector. In StarGAN, each attribute is treated as a different domain, and an auxiliary classifier used to distinguish these attributes is essential for supervising the training process. In contrast to StarGAN, our goal is to perform continuous edits in pixel space that cannot be enumerated with discrete labels. This implies a smooth and continuous latent space in which each point encodes a meaningful axis of variation in the data. We treat different style modalities as domains in this paper and use the two terms interchangeably. In this sense, applications such as beautification/de-beautification, aging/rejuvenation, and adding/removing a beard can also be included in our general framework. We compare our approach against CycleGAN [44] and StarGAN [7] in Section 4 and describe our design in more detail in Section 3.
Pose image generation
We are aware of works that use pose as a condition for person image generation in the task of person re-identification [36, 20, 31, 29]. For example, [26] concatenates one-hot pose feature maps channel-wise to control pose generation, similar to [30], where keypoints and segmentation masks of birds are used to manipulate the locations and poses of birds. To synthesize more plausible human poses, Siarohin et al. [31] develop deformable skip connections and compute a set of affine transformations to approximate joint deformations. These works share some similarity with ours, as both facial landmarks and human skeletons can be seen as forms of pose representation. However, all of those works deal with manipulation in the original domain and do not preserve identity. Moreover, the results generated in those works are low-resolution, whereas our model can successfully generate 512x512 images of photo-realistic quality.
Style transfer
Neural style transfer was first proposed by Gatys et al. [9]. The idea is to preserve content from the original image and mimic the “style” of a reference image. We adopt the Gram matrix in our model to enforce texture consistency and replace L-BFGS iterations with back-propagation for end-to-end training. Also, considering the lack of ground-truth data for many face manipulation tasks, we apply a fast neural style transfer algorithm [14] to generate pseudo-targets for multi-modality manipulation. Note that our model is easily extensible to any desired target domain with the current design unchanged.
3 Proposed Method
3.1 Overall Framework
Problem formulation
Given domains of different modalities, our goal is to learn a single general mapping function
(1) 
that transforms from domain to in domain with continuous shape edits (Figure 1). Equation 1 also implies that is bidirectional given desired conditions. We use facial landmark to denote facial expression in domain . Facial expressions are represented as a vector of 2D keypoints with , where each point is the th pixel location in . We use attribute vector to represent the target domain. Formally, our input/output are tuples of the form .
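As a rough illustration of this input format, the sketch below assembles one generator input by stacking an RGB image, a one-channel landmark map, and a spatially replicated one-hot domain vector along the channel axis. The function name `build_generator_input`, the one-channel landmark map, and the two-domain setup are assumptions made for illustration, not details confirmed by the formulation above.

```python
import numpy as np

def build_generator_input(image, landmark_map, domain_index, num_domains):
    """Stack image (3,H,W), landmark map (1,H,W) and one-hot domain planes (n,H,W)."""
    _, h, w = image.shape
    one_hot = np.zeros(num_domains, dtype=image.dtype)
    one_hot[domain_index] = 1.0
    # Replicate the one-hot vector over every pixel location.
    domain_planes = np.broadcast_to(one_hot[:, None, None], (num_domains, h, w))
    return np.concatenate([image, landmark_map, domain_planes], axis=0)

image = np.random.default_rng(0).random((3, 64, 64))  # stand-in for a portrait
landmark_map = np.zeros((1, 64, 64))                  # stand-in for rasterized landmarks
x = build_generator_input(image, landmark_map, domain_index=1, num_domains=2)
print(x.shape)
```

With two domains this gives a 6-channel input (3 image + 1 landmark + 2 domain planes); switching the target domain only changes the constant domain planes.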
Model architecture
The overall pipeline of our approach is straightforward. As shown in Figure 2, it consists of three main components: (1) A generator, which renders an input face in one domain as the same person in another domain given conditional facial landmarks. The generator is bidirectional and is reused in both the forward and the backward cycle: it first maps the input to the target domain and then maps the result back, given the corresponding conditional pair. (2) A set of discriminators at different levels of resolution that distinguish generated samples from real ones. Instead of mapping to a single scalar that signifies “real” or “fake”, we adopt PatchGAN [44], a fully convolutional network that outputs a matrix in which each element represents the probability that an overlapping patch is real. Traced back to the original image, each output element has a fixed-size receptive field. (3) A loss function that takes into account identity preservation and texture consistency between different domains. In the following subsections, we elaborate on each module individually and then combine them to construct PortraitGAN.
3.2 Base Model
To begin with, we consider manipulation of emotions within the same domain, i.e., the source and target images share the same texture and style but differ in face shape, as denoted by their facial landmarks. In this scenario, it is sufficient to incorporate only the forward cycle, and the conditional vector is not needed. The adversarial loss conditioned on facial landmarks follows Equation 2.
(2) 
A face verification loss is desirable to enforce identity consistency between the input and the generated image. However, in our experiments we find an L1 loss to be sufficient, and it is preferable to an L2 loss because it alleviates blurry output and acts as an additional regularizer [13].
(3) 
The overall loss is a weighted combination of the adversarial loss and the L1 loss.
(4) 
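The combination in Equation 4 can be sketched numerically as follows. This is a minimal sketch, assuming a least-squares form for the adversarial term and an L1 term between an image and its reconstruction; the function names and the weight `lam=10.0` are placeholders rather than reported settings.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute difference between two images of equal shape."""
    return float(np.mean(np.abs(a - b)))

def lsgan_generator_loss(d_fake):
    """Least-squares generator loss: push discriminator outputs on fakes toward 1."""
    return float(np.mean((d_fake - 1.0) ** 2))

def base_objective(d_fake, reconstructed, original, lam=10.0):
    """Adversarial term plus a lam-weighted L1 term (lam is a placeholder value)."""
    return lsgan_generator_loss(d_fake) + lam * l1_loss(reconstructed, original)

original = np.zeros((3, 4, 4))              # stand-in for the input image
reconstructed = np.full((3, 4, 4), 0.5)     # stand-in for the generator output
d_fake = np.ones((2, 2))                    # patch scores that already look "real"
loss = base_objective(d_fake, reconstructed, original, lam=10.0)
print(loss)
```

Here the adversarial term vanishes because every patch score equals 1, so the total is driven entirely by the weighted L1 term.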
3.3 Multi-level Adversarial Supervision
Manipulation at the landmark level requires high-resolution synthesis, which is challenging for generative adversarial networks. This is because training the whole system involves optimizing two individual networks, where each update to either component can shift the entire equilibrium. We first describe how we improve the training process and then propose a novel multi-level adversarial supervision that expedites training convergence.
Here we use two major strategies for improving generation quality and training stability. The first is to provide additional constraints on the training process. For example, the facial landmarks can be seen as a constraint on generation, much like the one-hot vector in cGAN [28]. We also adopt a feature matching loss in our framework that minimizes the distance between the feature representations learned from real samples and from fake ones:
(5) 
In Equation 5, the real face is randomly chosen from a pool that queues authentic samples, similar to the experience replay strategy used in reinforcement learning. The discriminator acts as a feature extraction function that “passes” its strong feature representation to the relatively weak generator. Our generator follows an encoder-decoder structure with residual blocks [10] in the middle.
Our second strategy is to provide fine-grained guidance by incorporating multi-level adversarial supervision. Cascaded upsampling layers in the generator are connected to auxiliary convolutional branches that provide images at different scales, one per upsampling block. Images generated at each intermediate stage, together with correspondingly downsampled images from the last stage, are fed into the discriminator for that scale, which is trained to classify real samples from generated ones by minimizing the following loss,
(6) 
(7) 
where is sampled from real distribution and from model distribution at scale . indicates all possible values of . The auxiliary branches at different stages of generation provide more gradient signals for training the whole network, hence the name multi-level adversarial supervision. Compared to [7, 41], the discriminators responsible for different levels are optimized as a whole rather than individually for each level, leading to a faster training process. The increased discriminative ability of the discriminators in turn provides further guidance when training the generator, and the two are optimized alternately until convergence (Equation 7).
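To make the idea of optimizing all levels jointly concrete, the sketch below sums a least-squares discriminator loss over an image pyramid. It is only an illustration under stated assumptions: `toy_discriminator` is a hypothetical stand-in for a PatchGAN, the pyramid is built by 2x average pooling, and in the actual model the generated images at coarser scales would come from the auxiliary branches rather than from downsampling.

```python
import numpy as np

def downsample2x(img):
    """Average-pool a (C, H, W) image by a factor of 2 (H and W must be even)."""
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: real outputs -> 1, fake outputs -> 0."""
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def multilevel_d_loss(discriminators, real_pyramid, fake_pyramid):
    """Sum per-scale losses so that all levels form a single objective."""
    return sum(
        lsgan_d_loss(d(real), d(fake))
        for d, real, fake in zip(discriminators, real_pyramid, fake_pyramid)
    )

def toy_discriminator(img):
    """Hypothetical patch scorer: mean intensity over non-overlapping 2x2 patches."""
    return downsample2x(img).mean(axis=0)

real = np.ones((3, 8, 8))    # stand-in for a real image
fake = np.zeros((3, 8, 8))   # stand-in for a generated image
real_pyramid = [real, downsample2x(real)]
fake_pyramid = [fake, downsample2x(fake)]
discriminators = [toy_discriminator, toy_discriminator]
loss = multilevel_d_loss(discriminators, real_pyramid, fake_pyramid)
print(loss)
```

Because the per-scale terms are added into one scalar, a single backward pass would update every discriminator at once, which is the property the paragraph above attributes to joint optimization.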
3.4 Texture consistency
When translating between different modalities at high resolution, texture differences become easy to observe. To enforce texture consistency, we adapt the style loss proposed in [9] by replacing L-BFGS iterative optimization with end-to-end back-propagation. Formally, consider the vectorized feature maps extracted from an image by a neural network at a given layer. The Gram matrix of the image at that layer is defined as
(8) 
where each entry is the inner product of a pair of vectorized feature maps. The Gram matrix can be seen as a measure of the correlation between pairs of feature maps; its size depends only on the number of feature maps at that layer, not on the size of the image. For two images, the texture loss at a given layer is
(9) 
We observe a clear improvement in texture quality for modality manipulation, as evaluated in Section 4.2. We use a pre-trained VGG19 in our experiments, with its parameters frozen during updates.
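The texture term can be illustrated with a small NumPy sketch of the Gram matrix and a Gram-based distance. The feature stack here is random data standing in for VGG19 activations, and the unnormalized squared Frobenius distance is an assumption; the exact normalization in Equation 9 may differ.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (K, H, W) feature stack: inner products of flattened maps."""
    k = features.shape[0]
    flat = features.reshape(k, -1)   # each row is one vectorized feature map
    return flat @ flat.T             # shape (K, K), independent of H and W

def texture_loss(features_a, features_b):
    """Squared Frobenius distance between Gram matrices (normalization assumed)."""
    diff = gram_matrix(features_a) - gram_matrix(features_b)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
feats = rng.random((4, 6, 6))        # stand-in for K=4 feature maps of size 6x6
gram = gram_matrix(feats)

# Shuffle pixel positions identically in every map: layout changes, statistics do not.
perm = rng.permutation(36)
shuffled = feats.reshape(4, -1)[:, perm].reshape(feats.shape)
print(gram.shape, texture_loss(feats, shuffled))
```

Shuffling pixel positions identically across all feature maps leaves the Gram matrix essentially unchanged, which is why it captures texture statistics rather than spatial layout.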
3.5 Going Beyond: Bidirectional Transfer
Bringing all the pieces together, we are now ready to extend the base model of Section 3.2 to PortraitGAN by incorporating bidirectional mapping and a conditional vector that represents the target domain. Equation 2 now becomes
(10) 
The forward and backward cycles encourage one-to-one mapping between different modalities and thus help preserve identity,
(11) 
where the two conditional vectors encode different modalities. Therefore, only one set of generator/discriminator is used for bidirectional manipulation. We find that both the forward and the backward cycle are essential for translation between domains, which is consistent with the observation in [44]. The remaining terms can be written in a similar fashion, and our full objective is given below,
(12) 
where the four weights control the contributions of the cycle-consistency loss, feature matching loss, identity loss, and texture loss, respectively.
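The round trip behind the cycle term and the weighted sum of the full objective can be sketched as follows. This is a toy illustration under stated assumptions: `toy_generator` is a hypothetical stand-in that only shifts brightness according to the target domain, and the default weights reuse the values 2, 10, 5, 10 reported in the implementation details, assuming they follow the order cycle-consistency, feature matching, identity, texture.

```python
import numpy as np

def toy_generator(image, landmarks, domain_one_hot):
    """Hypothetical generator: brightness shift per domain, landmarks ignored."""
    shift = 0.5 * (domain_one_hot[1] - domain_one_hot[0])
    return image + shift

def cycle_loss(generator, image_a, lm_a, c_a, lm_b, c_b):
    """L1 distance between image_a and its round trip A -> B -> A."""
    fake_b = generator(image_a, lm_b, c_b)   # forward mapping
    rec_a = generator(fake_b, lm_a, c_a)     # backward mapping, same generator
    return float(np.mean(np.abs(rec_a - image_a)))

def full_objective(adv, cyc, fm, idt, tex, weights=(2.0, 10.0, 5.0, 10.0)):
    """Adversarial loss plus weighted auxiliary losses (weight order is assumed)."""
    w_cyc, w_fm, w_idt, w_tex = weights
    return adv + w_cyc * cyc + w_fm * fm + w_idt * idt + w_tex * tex

rng = np.random.default_rng(0)
image_a = rng.random((3, 8, 8))
lm_a = lm_b = np.zeros((1, 8, 8))            # placeholder landmark maps
c_a, c_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
cyc = cycle_loss(toy_generator, image_a, lm_a, c_a, lm_b, c_b)
total = full_objective(adv=1.0, cyc=cyc, fm=0.0, idt=0.0, tex=0.0)
print(cyc, total)
```

The same generator is called twice with swapped conditions, which mirrors the use of a single generator for both directions.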
4 Experimental Evaluation
Implementation Details
Each training step takes as input a tuple of four images, randomly chosen from the possible modalities of the same identity. The attribute conditional vector, represented as a one-hot vector, is replicated spatially before channel-wise concatenation with the corresponding image and facial landmarks. Our generator uses 4 stride-2 convolution layers, followed by 9 residual blocks and 4 stride-2 transposed convolutions, while each auxiliary branch uses a one-channel convolution for fusion of channels. We use two 3-layer PatchGAN [44] discriminators for multi-level adversarial supervision and the least-squares loss [27] for stable training. We set the four loss weights to 2, 10, 5, and 10 for evaluation. Training PortraitGAN takes about 50 hours on a single Nvidia 1080 GPU.
Dataset
Training and validation: The Radboud Faces Database [19] contains 4,824 images of 67 participants, each performing 8 canonical emotional expressions: anger, disgust, fear, happiness, sadness, surprise, contempt, and neutral. The iCV Multi-Emotion Facial Expression Dataset [25] is designed for micro-emotion recognition (5184x3456 resolution) and includes 31,250 facial expressions covering 50 different emotions. Testing: We collect 20 high-resolution videos from YouTube (abbreviated as the HRY dataset) of people giving speeches or addresses. For the above datasets, we use dlib [17] for facial landmark extraction and a neural style transfer algorithm [14] for generating portraits of multiple modalities. Note that during testing, ground truths are used only for evaluation purposes.
4.1 Quantitative Evaluation
In this section, we evaluate our model quantitatively using both standard evaluation metrics and a subjective user study.
Evaluation metrics
CycleGAN [44] can only translate between two domains. StarGAN [7] extends CycleGAN using a single generator but requires an additional classifier for supervision and can therefore only work on discrete domains. In comparison, our model is the only one of the three that also enables continuous editing. We randomly choose 368 images from the HRY dataset with different identities and expressions for natural-to-single-stylized-modality evaluation. For a fair comparison, we retrain 512x512 versions of CycleGAN and StarGAN with the domain dimension set to two. We keep the extracted landmarks unchanged during evaluation for PortraitGAN.
Table 1: MSE, SSIM, and inference time for each method.
Method       MSE    SSIM   Inference time (s)
CycleGAN     0.028  0.473  0.365
StarGAN      0.029  0.483  0.277
PortraitGAN  0.025  0.517  0.290
As Table 1 shows, CycleGAN and StarGAN are close to each other in terms of MSE and SSIM. Our model achieves the best MSE and SSIM while keeping its inference time close to that of StarGAN, the fastest method. Note that these metrics are only indicative, since mean squared error may not correspond faithfully to human perception [44].
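For reference, the two metrics can be sketched as below. This is a simplified illustration, not the evaluation code behind Table 1: the SSIM here is computed once over the whole image with the usual constants (0.01 and 0.03 times the data range, squared), whereas standard implementations average the same expression over local windows. The images are random stand-ins.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of equal shape."""
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
target = rng.random((64, 64))                                   # stand-in ground truth
noisy = np.clip(target + rng.normal(0.0, 0.2, target.shape), 0.0, 1.0)
print(mse(target, noisy), global_ssim(target, noisy))
```

Lower MSE and higher SSIM indicate a closer match to the ground truth, which is the direction of improvement reported in Table 1.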
Table 2: User preference (%) for each method in the subjective user study.
Method    1st round  2nd round  Average
StarGAN   29.5       31.5       30.5
CycleGAN  32.5       31.5       32.0
Ours      38.0       37.0       37.5
Subjective user study
As pointed out in [13], traditional metrics can be biased when evaluating GANs; we therefore adopt the same evaluation protocol as in [13, 44, 37, 7] for a human subjective study on natural-to-single-stylized-modality manipulation. Due to limited resources, we collect responses from two users based on their preferences among the images displayed in each group, in terms of perceptual realism and how well the original figure’s identity is preserved. Each group consists of one input photo and three randomly shuffled manipulated images generated by CycleGAN [44], StarGAN [7], and our proposed PortraitGAN with landmarks unchanged (see Section 4.2 for more details). There are 100 images in total, and each user is asked to rank the three methods on each image twice. Our method receives the best score among the three methods, as shown in Table 2.
4.2 Qualitative Evaluation
In this section, we conduct an ablation study and validate the effectiveness of our design in continuous editing. We also compare against state-of-the-art generative models on the tasks of continuous shape editing and simultaneous shape and modality manipulation. Our framework can be easily adapted to any desired modality once corresponding data is acquired. Finally, we show some manipulation cases using our interactive interface.
Ablation study
Each component is crucial for the proper performance of the system: the cycle-consistency and identity losses for identity preservation between modalities, multi-level adversarial supervision and the feature matching loss for high-resolution generation, and the texture loss for texture consistency. Removing any of these elements would harm our network. For example, Figure 3 shows the effect of multi-level adversarial supervision. As can be seen, the result generated with this component displays better perceptual quality with more high-frequency details. Texture quality would be compromised without the texture loss (Figure 9). Last but not least, bidirectional cycle-consistency eliminates the need for the classifier used in [7] for multi-domain manipulation.
Continuous shape editing
Figure 4 shows expressions interpolated by our model on RaFD, which go beyond its original 8 canonical expressions. Note that CycleGAN cannot transfer within the same domain. On the iCV dataset, we train StarGAN on 50 discrete micro-emotions, but it collapses (Figure 6). This is because StarGAN requires a strong classification loss for supervision, which is hard to obtain on the iCV dataset. Our model, on the other hand, operates in a continuous space that captures subtle variations in face shape. Another intriguing observation is that the boundary width of the landmarks has no obvious influence on the output (Figure 5). More results are available in Figure 9 (columns 1-3).
Simultaneous shape and modality manipulation
Simultaneous shape and modality manipulation on the HRY dataset is shown in Figure 9 (columns 4-8). On close inspection, our model is capable of hallucinating teeth (1st row) and capturing details such as earrings (5th row). If the landmarks are fixed, our model acts as a modality transfer model, except that it can achieve bidirectional modality transfer with a minor change to the attribute conditional vector.
To compare our approach with CycleGAN [44] and StarGAN [7], we use the following pipeline. Given a pair of images from two different domains, CycleGAN translates the first image into an output that has the content of the first image and the modality of the second. The same can be achieved with our approach by leaving the landmarks unchanged. Similarly, we treat modalities as visual attributes and train StarGAN accordingly. Figure 8 compares the results of our approach with those of CycleGAN and StarGAN. As can be seen, ours are much sharper and more visually appealing than those of StarGAN and CycleGAN. This is because PortraitGAN leverages the texture loss to generate more coherent textures.
Interactive user interface
Compared to discrete conditional labels, facial landmarks give full freedom for continuous shape edits. To test the limits of our model, we develop an online interactive editing interface, where users can manipulate facial landmarks manually and evaluate the model directly. This proves to be more challenging than landmark interpolation, as these edits may go far beyond the normal expressions in the training set. Figure 7 shows some interesting results. As can be seen, our model can successfully perform simultaneous face-slimming and modality manipulation from an input of the original modality. In addition, our model supports bidirectional manipulation among modalities, owing to the design of bidirectional mapping.
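The landmark interpolation referred to above can be sketched as a simple linear blend between two landmark sets, each of which would then be fed to the generator as a condition. This is a minimal sketch: the function name `interpolate_landmarks`, the use of 68 keypoints, and the purely linear blend are assumptions for illustration, and manual edits from an interactive interface would replace the blend with arbitrary user-supplied coordinates.

```python
import numpy as np

def interpolate_landmarks(source, target, num_steps):
    """Linearly blend two (N, 2) landmark arrays; returns (num_steps, N, 2)."""
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None, None]
    return (1.0 - alphas) * source + alphas * target

source = np.zeros((68, 2))        # stand-in for landmarks of a starting expression
target = np.full((68, 2), 4.0)    # stand-in for landmarks of a target expression
frames = interpolate_landmarks(source, target, num_steps=5)
print(frames.shape)
```

Each intermediate frame is a valid set of 2D keypoints, so the number of distinct conditions is limited only by `num_steps` rather than by a fixed label set.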
5 Conclusion
Simultaneous shape and multi-modality portrait manipulation at high resolution is not an easy task. In this paper, our proposed PortraitGAN pushes the limits of cycle consistency by incorporating additional facial landmarks and an attribute vector as conditions. For bidirectional mapping, we use only one generator, similar to [7], but with a different training scheme. This enables us to perform multi-modality manipulations simultaneously and in a continuous manner. We validate our approach with expression interpolation and different style modalities. For better image quality, we adopt multi-level adversarial supervision to provide stronger guidance during training, where generated images at different scales are combined and propagated to discriminators at different scales. We also leverage a texture loss to enforce texture consistency among modalities. However, due to the lack of data for many face manipulation tasks, modality manipulations beyond style transfer are not presented. Nonetheless, our proposed framework is a step towards interactive manipulation and could be extended to manipulation across more modalities once corresponding data is obtained, which we leave as future work.
References
 [1] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bringing portraits to life. ACM Transactions on Graphics (TOG), 36(6):196, 2017.
 [2] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. In Computer graphics forum, volume 22, pages 641–650. Wiley Online Library, 2003.
 [3] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.
 [4] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In The IEEE International Conference on Computer Vision (ICCV), volume 1, 2017.
 [5] Y.-C. Chen, H. Lin, M. Shu, R. Li, X. Tao, X. Shen, Y. Ye, and J. Jia. Facelet-bank for fast portrait manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3541–3549, 2018.
 [6] Y.-C. Chen, X. Shen, and J. Jia. Makeup-go: Blind reversion of portrait edit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4501–4509, 2017.
 [7] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint arXiv:1711.09020, 2017.
 [8] O. Fried, E. Shechtman, D. B. Goldman, and A. Finkelstein. Perspective-aware manipulation of portrait photos. ACM Transactions on Graphics (TOG), 35(4):128, 2016.
 [9] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [11] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
 [12] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017.
 [13] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 [14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
 [15] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 [16] I. Kemelmacher-Shlizerman, S. Suwajanakorn, and S. M. Seitz. Illumination-aware age progression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3334–3341, 2014.
 [17] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
 [18] I. Korshunova, W. Shi, J. Dambre, and L. Theis. Fast face-swap using convolutional neural networks. In The IEEE International Conference on Computer Vision, 2017.
 [19] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg. Presentation and validation of the radboud faces database. Cognition and emotion, 24(8):1377–1388, 2010.
 [20] C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model of people in clothing. arXiv preprint arXiv:1705.04098, 2017.
 [21] M. Lau, J. Chai, Y.-Q. Xu, and H.-Y. Shum. Face poser: Interactive modeling of 3d facial expressions using facial priors. ACM Transactions on Graphics (TOG), 29(1):3, 2009.
 [22] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
 [23] S. Liu, X. Ou, R. Qian, W. Wang, and X. Cao. Makeup like a superstar: Deep localized makeup transfer network. arXiv preprint arXiv:1604.07102, 2016.
 [24] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. CoRR, abs/1703.07511, 2017.
 [25] I. Lüsi, J. C. J. Junior, J. Gorbova, X. Baró, S. Escalera, H. Demirel, J. Allik, C. Ozcinar, and G. Anbarjafari. Joint challenge on dominant and complementary emotion recognition using micro emotion features and head-pose estimation: Databases. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 809–813. IEEE, 2017.
 [26] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems, pages 405–415, 2017.
 [27] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821. IEEE, 2017.
 [28] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [29] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8620–8628, 2018.
 [30] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.
 [31] A. Siarohin, E. Sangineto, S. Lathuiliere, and N. Sebe. Deformable gans for pose-based human image generation. arXiv preprint arXiv:1801.00055, 2017.
 [32] T. Sucontphunt, Z. Mo, U. Neumann, and Z. Deng. Interactive 3d facial expression posing through 2d portrait manipulation. In Proceedings of graphics interface 2008, pages 177–184. Canadian Information Processing Society, 2008.
 [33] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.
 [34] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
 [35] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
 [36] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3352–3361. IEEE, 2017.
 [37] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.
 [38] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
 [39] F. Yang, L. Bourdev, E. Shechtman, J. Wang, and D. Metaxas. Facial expression editing in video using a temporally-smooth factorization. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 861–868. IEEE, 2012.
 [40] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. arXiv preprint, 2017.
 [41] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. arXiv: 1710.10916, 2017.
 [42] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
 [43] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
 [44] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
 [45] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.