In image-to-image translation tasks, mappings between two visual domains are learned. Many computer vision and graphics problems are formulated within the image-to-image translation framework, including super-resolution [30, 28, 29, 49], inpainting [38, 21], style transfer [23, 34] and photorealistic image synthesis [22, 6, 46]. In photorealistic image synthesis, images are generated from abstract semantic label maps such as pixel-wise segmentation maps or sparse landmarks. In this paper, we study the problem of example-guided image synthesis. Given an input semantic label map $x$ and a guidance image $I$, the goal is to synthesize a photorealistic image $y$ which is semantically consistent with the label map $x$ while being style-consistent with the exemplar $I$, i.e. $y = G(x, I)$. What style consistency means is determined by the application: in portraits, it refers to the fact that we want our synthetic output to be plausibly of the same genetic type as the input exemplar; in full-body images it means the same clothing; and in street scenes it includes such things as the same weather and time of day. Representative applications are shown in Figure 1.
Example-guided image synthesis cannot be solved with a straightforward combination of photorealistic image synthesis based on pix2pixHD [22, 46] and style transfer: the style of the input exemplar is not well preserved in the synthetic result (see Figure 14). Recently, example-guided image-to-image translation frameworks [20, 31, 2] have been proposed that use disentangled models to represent content and style, or identity and attributes; however, they fail to synthesize photorealistic results from abstract semantic label maps. The challenges are multi-fold: first, a ground-truth photorealistic result for each label map given an arbitrary exemplar is not available for training; second, the synthetic results should be photorealistic while semantically consistent with the source label maps; last but not least, the synthetic result should be style-consistent with the corresponding image exemplar.
We present a method for this example-guided image synthesis problem using conditional generative adversarial networks. We build on the recent pix2pixHD framework for image synthesis to ensure photorealism, with the following crucial contributions:
2 Related Work
Generative Adversarial Networks.
In recent years, generative adversarial networks (GANs) [11, 1] for image generation have progressed rapidly [22, 46]. Driven by adversarial losses, generators and discriminators compete with each other: discriminators aim to distinguish the generated fake images from the target domain, while generators try to fool the discriminators. Techniques to improve GANs include progressive GANs [19, 48, 24] and improved training objectives and procedures [42, 1, 37, 43]. In this paper, we use GANs for example-guided image generation with style-consistency awareness.
Image-to-Image Translation and Photorealistic Image Synthesis.
The goal of image-to-image translation is to translate images from a source domain to a target domain. Isola et al. proposed the conditional GAN framework for various image-to-image translation tasks with paired images for supervision. Wang et al. extended this work to high-resolution image synthesis and interactive manipulation. Recently, researchers have addressed the unsupervised image-to-image translation problem, using cycle consistency to overcome the lack of paired training data [51, 25, 33, 52, 20, 31, 5]. Photorealistic image synthesis [6, 39, 46] is a specific application of image-to-image translation in which images are synthesized semantically from abstract label maps. Chen et al. proposed a cascaded framework to synthesize high-resolution images from pixel-wise labeling maps. Wang et al. proposed a framework for instance-level image synthesis with conditional GANs.
Very recently, a few works [16, 20, 31, 35] have been proposed to transfer the style or attributes of an exemplar to the source image, where the images belong to photorealistic domains (aka domain adaptation). Our goal differs from these works in that we aim to synthesize photos from an abstract semantic label domain rather than a photorealistic image domain. Zheng et al. proposed a clothes-changing system to change the clothing of a person in an image. Chan et al. presented a network to synthesize a dance video from a target dance video and a source exemplar video; unlike our model, it must be trained for every input exemplar video. Ma et al. proposed to synthesize person images from pose keypoints. We show in Section 4 that our method outperforms these state-of-the-art methods.
Style transfer is a long-standing problem in computer vision and graphics, which aims to transfer the style of a source image to a target image or target domain. Some approaches [14, 10, 23, 34, 18, 32, 12, 5, 17] transfer style based on a single exemplar, whereas others learn a holistic style of a target domain [51, 20, 31, 7]. Similar to our model, the PairedCycleGAN model uses a style discriminator to decide whether a pair of facial images wear the same make-up. However, in their discriminator the input image pair must be accurately aligned via warping, and a separate generator is learned for each facial component. Our style-consistency discriminator, in contrast, provides a general solution for image synthesis from both sparse labels (e.g. sketch and pose) and pixel-wise dense labels (e.g. scene parsing).
3 Example-guided Image Synthesis
In this section, we first review the baseline model pix2pixHD , then describe our method, a conditional generative adversarial network for synthesizing photorealistic images from semantic label maps given specific exemplars. Finally we show how to appropriately prepare training data for our framework.
3.1 The pix2pixHD Baseline
pix2pixHD is a powerful image synthesis and interactive manipulation framework based on the pioneering conditional image-to-image translation method pix2pix. Let $x$ be a label map from a semantic label domain $\mathcal{X}$; the goal of pix2pixHD is to synthesize an image $y$ from $x$: $y = G(x)$. It consists of a hierarchically integrated generator $G$ and multi-scale discriminators $D$ to handle high-resolution synthesis tasks. The goal of the generator is to translate semantic label maps to photorealistic images, while the objective of the discriminators is to distinguish generated fake images from real ones at different resolutions. The training dataset consists of pairs $(x, y)$ of a label map $x$ and a corresponding real image $y$.
pix2pixHD optimizes a multi-task problem combining a standard GAN loss $\mathcal{L}_{GAN}$ and a feature matching loss $\mathcal{L}_{FM}$:

$$\min_G \max_D \; \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}_{FM}(G, D),$$

where $\mathcal{L}_{GAN}$ is the standard GAN loss given by:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{(x, y)}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))],$$

and $\mathcal{L}_{FM}$ is the feature matching loss given by:

$$\mathcal{L}_{FM}(G, D) = \mathbb{E}_{(x, y)} \sum_{i=1}^{T} \frac{1}{N_i} \left\| D^{(i)}(x, y) - D^{(i)}(x, G(x)) \right\|_1,$$

where $T$ is the number of discriminator layers and $N_i$ is the number of elements in the corresponding discriminator layer. An optional perceptual loss is introduced as the $\ell_1$ loss between pre-trained VGG network features.
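As a concrete reading of the feature matching term, the per-layer normalization by the layer size can be sketched in plain Python; lists stand in for discriminator feature tensors, and the function name is ours, not from the paper:

```python
def feature_matching_loss(real_feats, fake_feats):
    """Sum over discriminator layers of the mean absolute feature
    difference, i.e. (1/N_i) * L1 per layer, as in L_FM above."""
    assert len(real_feats) == len(fake_feats)
    loss = 0.0
    for real, fake in zip(real_feats, fake_feats):
        n_i = len(real)  # N_i: number of elements in this layer
        loss += sum(abs(r - f) for r, f in zip(real, fake)) / n_i
    return loss
```

Identical feature stacks give zero loss; the 1/N_i factor keeps large layers from dominating the sum.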
One appealing feature of pix2pixHD is instance-level image manipulation via a feature embedding technique. Given an instance-level segmentation map, pix2pixHD is able to synthesize an image with a specific appearance taken from an instance exemplar of the same object category. We will show that, without an input instance-level pixel-wise segmentation map as a constraint, our model is still able to synthesize images with styles automatically transferred from exemplar images.
3.2 Our Model
Let $I$ be a guidance image from a natural image domain $\mathcal{I}$. Our goal is to synthesize an image $y$ from a semantic label map $x$ and the image $I$: $y = G(x, I)$. The role of $I$ is to provide a style constraint on image synthesis: the output image must be style-consistent with the exemplar $I$. Our problem is more difficult than the one solved by pix2pixHD. One particular challenge we face is that, given an input label map $x$, ground-truth images for arbitrary guided style exemplars are missing. To solve this weakly-supervised problem, we learn style consistency between pairs of images, which may be style-consistent or style-inconsistent (see Section 3.3).
An overview of our method is illustrated in Figure 2. It builds upon a single-scale version of pix2pixHD and contains: (i) a generator $G$, which takes the semantic map $x$, the style exemplar $I$ and its corresponding label map $F(I)$ as input and outputs a synthetic image; (ii) a standard discriminator $D$ to distinguish real images from fake ones given conditional inputs; and (iii) a style-consistency discriminator $D_S$, operating on image pairs from domain $\mathcal{I}$, which detects whether the synthetic image and the guidance image are style-compatible. Here, $F$ is an operator which, given an image, produces a set of semantic labels that represent the image (choices of $F$ are given in Section 4.2); for convenience, $F(I)$ can be visualized as an image, provided the viewer recalls that the image contains semantic labels. Our objective function contains three losses: a standard adversarial loss, a novel adversarial style-consistency loss, and a novel adaptive semantic consistency loss.
3.2.1 Standard Adversarial Loss
We apply a standard adversarial loss via the standard discriminator $D$:

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{(x, y)}[\log D(x, y)] + \mathbb{E}_{(x, I)}[\log(1 - D(x, G(x, I)))],$$

where the generator $G$ tries to synthesize images that look similar to real images from the image domain $\mathcal{I}$ regardless of specific styles, while, given an image conditioned on the corresponding label map, the discriminator $D$ aims to determine whether the image is real or fake.
3.2.2 Adversarial Style Consistency Loss
With the standard adversarial loss, the generator is able to synthesize images matching the data distribution of domain $\mathcal{I}$; however, the synthetic results are not guaranteed to be style-consistent with the corresponding guidance $I$. We introduce a style-consistency loss using a discriminator $D_S$ associated with a pair of images — either both real, or one real and one synthetic:

$$\mathcal{L}_{style}(G, D_S) = \mathbb{E}[\log D_S(I_1, I_2)] + \mathbb{E}[\log(1 - D_S(I_3, I_4))] + \mathbb{E}[\log(1 - D_S(I, G(x, I)))],$$

where $(I_1, I_2)$ is a pair of real images sampled from domain $\mathcal{I}$ with the same style, and $(I_3, I_4)$ is a pair of real images sampled from domain $\mathcal{I}$ with different styles. We introduce the data sampling strategy in Section 3.3.
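The three terms of the style-consistency objective — a same-style real pair labeled 1, a different-style real pair labeled 0, and an exemplar/synthetic pair labeled 0 — can be sketched as follows. Here `d_s` is a stand-in callable returning the probability that a pair shares a style (a real implementation would be a CNN over image pairs), and all names are illustrative assumptions:

```python
import math

def style_consistency_d_loss(d_s, consistent_pair, inconsistent_pair,
                             exemplar, fake):
    """Negative log-likelihood for the style discriminator over the three
    kinds of pairs used by the adversarial style-consistency loss."""
    i1, i2 = consistent_pair    # both real, same style  -> target 1
    i3, i4 = inconsistent_pair  # both real, diff styles -> target 0
    return -(math.log(d_s(i1, i2))
             + math.log(1.0 - d_s(i3, i4))
             + math.log(1.0 - d_s(exemplar, fake)))  # exemplar vs. synthetic -> target 0
```

The generator's corresponding term would instead maximize `d_s(exemplar, fake)`, i.e. try to make the synthetic image indistinguishable in style from the exemplar.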
With the proposed adversarial style-consistency loss $\mathcal{L}_{style}$, the discriminator $D_S$ learns awareness of style consistency between a pair of images, while the generator $G$ tries to fool $D_S$ by generating an image with the same style as the exemplar $I$.
3.2.3 Adaptive Semantic Consistency Loss
The semantic consistency loss is introduced so that the synthesized image reconstructs the label map $x$ in the semantic sense of, e.g., a sketch. It may appear that we could use the error between the input labels $x$ and the labels predicted from the synthetic image, $F(G(x, I))$, for example $\|x - F(G(x, I))\|_1$ or some variant thereof. However, different applications give distinct meanings to the semantic label maps, with the consequence that the gradient of the loss will, in general, vary between applications. This would mean selecting hyper-parameters to combine losses on a per-application basis.
We avoid this problem by always computing semantic consistency losses between images: the synthetic image and, specifically, an image $y$ which is a priori known to be consistent with the given semantic map $x$. Typically the image $y$ is drawn from the training dataset and we have $x = F(y)$. A particular issue with our adopted scheme is that such losses will try to converge the network output to the image $y$, which by choice is photorealistic and semantically consistent with $x$. This behavior works perfectly when $I$ and $y$ are sampled from images with the same style, but could force the output away from the desired style when $I$ and $y$ are "style-wise" different.
Our solution is to use a novel adaptive VGG loss, computed via a pre-trained model, between the synthetic image $G(x, I)$ and the real image $y$ of label map $x$. An adaptive weighting scheme is proposed for the per-layer VGG loss computation, to ensure semantic consistency of the synthetic image to $x$:

$$\mathcal{L}_{sem}(G) = \sum_{i} \alpha_i \left\| \Phi^{(i)}(G(x, I)) - \Phi^{(i)}(y) \right\|_1,$$

where $\Phi^{(i)}$ represents the $i$-th layer feature extractor of the VGG network and $\alpha_i$ is the adaptive weight for the $i$-th layer. We increase the weights of shallow layers, to gain the impact of details, when $I$ and $y$ come from style-consistent sampled pairs, and decrease them, to suppress the impact of detail matching, for style-inconsistent pairs; the weights are normalized by $N_i$, the number of elements in the $i$-th feature layer. The adaptive weighting scheme is illustrated in Figure 3.
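A minimal sketch of the adaptive weighting idea follows, assuming (purely as an illustration) that "shallow" means the first half of the layers and that the boost/suppress factors are free hyper-parameters; nested lists stand in for VGG feature tensors:

```python
def adaptive_semantic_loss(fake_feats, ref_feats, style_consistent,
                           shallow_boost=1.0, shallow_suppress=0.1):
    """Per-layer L1 loss on VGG-like features with style-adaptive weights:
    shallow layers count fully when the reference image shares the
    exemplar's style, and are damped otherwise. The split point and the
    two weight constants are illustrative assumptions."""
    n_layers = len(fake_feats)
    loss = 0.0
    for i, (f, r) in enumerate(zip(fake_feats, ref_feats)):
        shallow = i < n_layers // 2
        w = (shallow_boost if style_consistent else shallow_suppress) if shallow else 1.0
        # normalize by the layer size, as for N_i above
        loss += w * sum(abs(a - b) for a, b in zip(f, r)) / len(f)
    return loss
```

With a style-inconsistent reference, detail mismatches in shallow layers are down-weighted, so the loss mainly enforces the deeper, more semantic features.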
The final loss is formulated as:

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_s \mathcal{L}_{style} + \lambda_c \mathcal{L}_{sem},$$

where $\lambda_s$ and $\lambda_c$ control the relative importance of the terms. Our full objective is given by:

$$G^* = \arg\min_G \max_{D, D_S} \mathcal{L}.$$
3.3 Sampling Strategy for Style-consistent and Style-inconsistent Image Pairs
So far, we have introduced the core techniques of our network. One prerequisite of our method, however, is obtaining style-consistent image pairs and style-inconsistent image pairs. The datasets used in prior image-to-image translation works [22, 46, 51, 31, 20] are therefore not suitable for our training.
A key idea for training data acquisition is to collect image pairs from videos. In the face and dance synthesis tasks, we observe that: (i) within a short temporal window of a video, the style of the frame contents is ensured to be the same, and (ii) frames from different videos probably have different styles (e.g. different gender, hairstyle, skin color and make-up in the face image synthesis application). We thus randomly sample pairs of frames within a short temporal window of a video and regard them as style-consistent pairs. For style-inconsistent pairs, we first randomly sample pairs of frames from different videos, then manually label whether the images of each sampled pair are style-consistent or not.
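The video-based sampling strategy can be sketched as follows. The temporal window width and the data layout are assumptions, and the cross-video sampler only produces candidates that would still need manual style labeling:

```python
import random

def sample_consistent_pair(videos, window=30, rng=random):
    """Two frames from the same video, at most `window` frames apart
    (style-consistent by observation (i) above)."""
    vid = rng.choice(sorted(videos))
    frames = videos[vid]
    i = rng.randrange(len(frames))
    j = min(len(frames) - 1, i + rng.randrange(1, window + 1))
    return frames[i], frames[j]

def sample_inconsistent_candidate(videos, rng=random):
    """Two frames from two different videos; style inconsistency still
    has to be confirmed by manual labeling."""
    v1, v2 = rng.sample(sorted(videos), 2)
    return rng.choice(videos[v1]), rng.choice(videos[v2])
```

`videos` maps a video id to its ordered frame list; passing a seeded `random.Random` makes the sampling reproducible.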
In the street view synthesis task, since large-scale street view videos with different styles are not easy to collect, we use images from the BDD100K dataset. In BDD100K, street view images are provided together with weather and time-of-day attributes. We coarsely categorize the images into style groups based on these attributes, then sample style-consistent image pairs inside each group and style-inconsistent image pairs between groups. Figure 4 shows representative sampled pairs of images.
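The attribute-based grouping and pair sampling for the street view task might look like this sketch (the attribute keys and helper names are assumptions, not the dataset's API):

```python
from collections import defaultdict
import random

def group_by_style(images):
    """Bucket image records by their (weather, timeofday) attributes."""
    groups = defaultdict(list)
    for img in images:
        groups[(img["weather"], img["timeofday"])].append(img["name"])
    return groups

def sample_within(groups, key, rng=random):
    """Style-consistent pair: two distinct images from one group."""
    return tuple(rng.sample(groups[key], 2))

def sample_between(groups, key_a, key_b, rng=random):
    """Style-inconsistent pair: one image from each of two groups."""
    return rng.choice(groups[key_a]), rng.choice(groups[key_b])
```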
4.1 Implementation Details
We implement our model based on the single-scale pix2pixHD framework and experiment with images of a fixed resolution (a different resolution for street view synthesis). The generator consists of Convolution-InstanceNorm-ReLU-Stride-2 downsampling layers, residual blocks [13], and finally several Convolution-InstanceNorm-ReLU-Stride-0.5 layers to synthesize images. For both discriminators $D$ and $D_S$, we use PatchGANs with several Convolution-InstanceNorm-LeakyReLU-Stride-2 layers, with the exception that InstanceNorm is not applied in the first layer. The slope of the LeakyReLU is set to 0.2. For all the experiments, we use fixed values of $\lambda_s$ and $\lambda_c$ in Equation 7. All the networks are trained from scratch on an NVIDIA GTX 1080 Ti GPU using the Adam solver with a batch size of 1. The learning rate is initially fixed at 0.0002 for the first 500K iterations and linearly decayed to zero over the next 500K iterations. We use LSGANs for stable training. For more details, please refer to the supplementary material.
We evaluate our method on face, dance and street view image synthesis tasks, using the following datasets:
SketchFace. We use the real videos in the FaceForensics dataset, which contains videos of reporters broadcasting news. We use the image sampling strategy described in Section 3.3 to acquire training image pairs from video, then apply a face alignment algorithm to localize facial landmarks, crop facial regions and resize them to a fixed size. The detected facial landmarks are connected to create face sketches; this defines the semantic function $F$.
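As a toy stand-in for the sketch-producing function, landmarks can be connected in order and rasterized onto a blank canvas. The real system presumably uses the standard facial-landmark connectivity; this linear-chain version is only illustrative:

```python
def draw_sketch(landmarks, size):
    """Rasterize a chain of (x, y) landmarks into a size x size binary
    image (list of rows), connecting consecutive points with line
    segments. Hypothetical stand-in for the face-sketch function."""
    img = [[0] * size for _ in range(size)]
    for (x0, y0), (x1, y1) in zip(landmarks, landmarks[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for s in range(steps + 1):
            x = round(x0 + (x1 - x0) * s / steps)
            y = round(y0 + (y1 - y0) * s / steps)
            if 0 <= x < size and 0 <= y < size:
                img[y][x] = 1
    return img
```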
PoseDance. We download solo dance videos from YouTube, crop out the central body regions and resize them to a fixed size. As the number of videos is small, we evenly split each video into a first part and a second part along the timeline, then sample training data only from the first parts and testing data only from the second parts of all the videos. The function $F$ is implemented by concatenating pre-trained DensePose and OpenPose pose detection results to provide pose labels.
We compare our method with the following algorithms:
pix2pixHD, and pix2pixHD with DPST. pix2pixHD is the image-to-image translation baseline. A default image is synthesized using pix2pixHD, and its style is then adjusted towards the guided exemplar using the Deep Photo Style Transfer (DPST) method.
MUNIT and PairedMUNIT. MUNIT is a state-of-the-art unsupervised image-to-image translation method with disentangled content and style representations, able to translate images according to given exemplars. We modify MUNIT by integrating pairwise style information into the original model and adaptively computing losses with style (denoted PairedMUNIT).
Ours without the style-consistency loss, without the semantic loss, or without the adaptive weights, for ablation studies. All methods are trained on the datasets introduced in Section 4.2.
4.4 Evaluation Metrics
Photorealism and Semantic Consistency. We use the Fréchet Inception Distance (FID) to evaluate the realism and faithfulness of the synthetic results. This metric is widely used for implicit generative models because it correlates with the visual quality of generated samples; a smaller FID is often favored by human subjects. We further evaluate semantic consistency by translating the synthetic images back to the label domain and comparing the result with the input labels. For the SketchFace and PoseDance tasks, we use the labeling endpoint error (LEPE) between the input label map and the labels generated by the semantic function $F$ to compute label accuracy. For the SceneParsingStreetView task, we use the scene parsing score (SPS) on synthetic street view images to measure segmentation accuracy.
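For landmark-style labels, one plausible reading of the labeling endpoint error is the mean Euclidean distance between corresponding input points and points re-detected on the synthetic image; this concrete formulation is our assumption based on the text:

```python
import math

def lepe(input_points, detected_points):
    """Mean endpoint error between corresponding 2D label points
    (illustrative reading of LEPE; lower is better)."""
    assert len(input_points) == len(detected_points)
    return sum(math.dist(p, q)
               for p, q in zip(input_points, detected_points)) / len(input_points)
```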
Style Consistency. We perform a human perceptual study to compare style consistency from a human point of view. We show pairs consisting of our result and a baseline result to invited subjects and ask which one they consider closer to the guidance's style.
Main Results. In Figure 14, we show our results (column 3) and the results of the baseline methods for the SketchFace synthesis application on the test set. While pix2pixHD is able to generate photorealistic images consistent with the input semantic labels, it cannot keep the style (e.g. gender, hair, skin color) of the input exemplars in the synthetic results, even when enhanced by deep photo style transfer (columns 7 and 8). The unsupervised method MUNIT and its improvement PairedMUNIT fail to generate photorealistic results from semantic maps in this application (columns 9 and 10). A possible reason for their failure is their assumption that the input and output domains share the same content space, which does not hold when synthesizing images from semantic label maps.
Table 1 gives a quantitative evaluation of photorealism, measured by FID, for the various image synthesis tasks; our method performs best. The semantic consistency of the synthetic results with the input labels, measured by LEPE, is given in Table 2. pix2pixHD obtains the best semantic consistency with the input labels because, by totally ignoring style consistency, it loses no semantic accuracy. Our method outperforms MUNIT and PairedMUNIT.
For style consistency evaluation, we conduct a human perception study of the kind commonly used in image-to-image translation works [22, 51, 6, 46, 8]. The input exemplars and pairwise synthetic results sampled from our method and a baseline method are shown to the subjects with unlimited viewing time. The subjects are then asked "Which image is closer to the exemplar in terms of style?" Images for the user study were randomly sampled from the test set; each pair was shown in random order and guaranteed to be examined by at least 30 subjects. The ratios of votes our method received over the baseline methods are given in Table 3. Our method won more user preferences in pairwise comparisons. These quantitative results show that our results are more photorealistic and more style-consistent with the exemplars.
We conducted ablation studies to verify our model. As can be seen in Figure 14, without the adaptive weighting scheme in the semantic loss, the quality of the results is slightly reduced; without the semantic loss, semantic consistency is lost; without the style-consistency adversarial loss, the target style is not maintained. Quantitative photorealism statistics reported in Table 4 validate these observations. We further extract eye patches from synthetic images and exemplars and compute the VGG feature distance between them. Table 5 indicates that the weight adaptation quantitatively improves style consistency.
Figure 15 shows in-the-wild synthesis results from our model using Internet images. The results indicate that the model generalizes well to "unseen" cases. We provide more results in the supplementary material.
PoseDance Synthesis. Figure 7 shows a visual comparison of our method and the baselines in the PoseDance synthesis application. The semantic consistency of the synthetic results with the input labels, measured using LEPE, is given in Table 2. Although the facial regions of our results are blurry, since facial landmarks are not included in the input pose labels, our model still produces images that are style-consistent with the guidance images and consistent with the semantic labels. Figure 8 shows a visual comparison with Ma et al. on the dancing dataset; the generated poses and clothes in our results are visually better.
SceneParsingStreetView Synthesis. A comparison of our method and the baselines in the SceneParsingStreetView task is given in Figure 10. The semantic consistency of the synthetic results with the input labels, measured using SPS, is given in Table 6. Although the scenes in the guidance images differ from the semantics of the input label maps, our model is able to produce images that are semantically consistent with the segmentation map and style-consistent with the guidance image.
Figure 9 shows more results. Our network can faithfully synthesize images from various semantic labels and exemplars. Please find more results in the supplementary file.
In this paper, we presented a novel method for example-guided, style-consistent image synthesis from general-form semantic labels. During network training, we propose to sample style-consistent and style-inconsistent image pairs from videos to give the model style awareness. Beyond that, we introduce the style-consistency adversarial loss with a style-consistency discriminator, as well as a semantic consistency loss with adaptive weights, to produce plausible results. Qualitative and quantitative results on different applications show that the proposed model produces more realistic and style-consistent images than prior art.
Limitations and Future Work. Our network is mainly trained on cropped video data of limited resolution; we did not use the multi-scale architecture of pix2pixHD for high-resolution image synthesis. Moreover, the synthetic background in the face and dance image synthesis tasks may be blurry, because the semantic labels do not specify any background scene. Lastly, while we have demonstrated the effectiveness of our method in several synthesis applications, results in other applications could be affected by the performance of the state-of-the-art semantic labeling function $F$. In the future, we plan to extend this framework to the video domain and synthesize videos that are style-consistent with given exemplars.
Acknowledgements. We thank the anonymous reviewers for the valuable discussions. This work was supported by the Natural Science Foundation of China (Project Number: 61521002, 61561146393). Shi-Min Hu is the corresponding author.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Towards open-set identity preserving face synthesis. In CVPR, 2018.
-  Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. arXiv preprint arXiv:1808.07371, 2018.
-  Huiwen Chang, Jingwan Lu, Fisher Yu, and Adam Finkelstein. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018.
-  Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
-  Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. Cartoongan: Generative adversarial networks for photo cartoonization. In CVPR, pages 9465–9474, 2018.
-  Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
-  Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983, 2018.
-  Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
-  Shuyang Gu, Congliang Chen, Jing Liao, and Lu Yuan. Arbitrary style transfer with deep feature reshuffle. In CVPR, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies. In SIGGRAPH, pages 327–340, 2001.
-  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
-  Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
-  Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. Real-time neural style transfer for videos. In CVPR, July 2017.
-  Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
-  Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative adversarial networks. In CVPR, 2017.
-  Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
-  Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and Locally Consistent Image Completion. ACM Trans. Graph., 36(4):107:1–107:14, 2017.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2016.
-  Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
-  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
-  Davis E King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul):1755–1758, 2009.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
-  Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
-  Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
-  Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. ACM Trans. Graph., 36(4), 2017.
-  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, pages 700–708, 2017.
-  Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. arXiv preprint arXiv:1703.07511, 2017.
-  Liqian Ma, Xu Jia, Stamatios Georgoulis, Tinne Tuytelaars, and Luc Van Gool. Exemplar guided unsupervised image-to-image translation. arXiv preprint arXiv:1805.11145, 2018.
-  Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NeurIPS, pages 405–415, 2017.
-  Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017.
-  Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
-  Xiaojuan Qi, Qifeng Chen, Jiaya Jia, and Vladlen Koltun. Semi-parametric image synthesis. In CVPR, 2018.
-  Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. arXiv, 2018.
-  Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv, 2018.
-  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.
-  Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In NIPS, 2018.
-  Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.
-  Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
-  Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint, 2017.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
-  Zhao-Heng Zheng, Hao-Tian Zhang, Fang-Lue Zhang, and Tai-Jiang Mu. Image-based clothes changing system. Computational Visual Media, 3(4):337–347, 2017.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
-  Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017.
Appendix A Datasets
As described in the main manuscript, we evaluate our model on face, dance and street view image synthesis tasks, using the following datasets and semantic functions:
SketchFace. We use the real videos in the FaceForensics dataset, which contains videos of reporters broadcasting news. We use the image sampling strategy described in Section 3.3 of the main manuscript to acquire training image pairs from video, then apply a face alignment algorithm to localize facial landmarks, crop facial regions and resize them to a fixed size. Training images are sampled from one set of videos and testing images from distinct videos. The detected facial landmarks are connected to create face sketches; this is the function $F$, in both the training set and the test set. For each sketch extracted from a training image, we randomly sample guidance images from other videos for training, and for each testing sketch, we randomly sample guidance images from other videos for testing.
SceneParsingStreetView. We use the BDD100K dataset to synthesize street view images from pixel-wise semantic label (i.e. scene parsing) maps. For each street view image in the dataset, the corresponding scene parsing map and weather and time-of-day attributes are provided. Based on these attributes, we divide the images into the style groups listed in Table 7, then sample style-consistent image pairs inside each group and style-inconsistent image pairs between groups. Training and test images are resized to a fixed width. We use the scene parsing network DANet as the function $F$ for each street view image during testing. For each scene parsing map, we randomly select an image inside each style group as the guidance, in both the training and testing phases.
|Group||Weather||Time of Day|
|2||Foggy||Dawn or Dusk|
|4||Rainy||Dawn or Dusk|
|5||Snowy||Dawn or Dusk|
|6||Clear||Dawn or Dusk|
|8||Partly cloudy||Dawn or Dusk|
|12||Overcast||Dawn or Dusk|
PoseDance. We downloaded solo dance videos from YouTube, cropped out the central body regions and resized them to a fixed size. As the number of videos is small, we evenly split each video into a first part and a second part along the timeline, then sample training data only from the first parts and testing data only from the second parts of all the videos. The function $F$ is implemented by concatenating pre-trained DensePose and OpenPose pose detection results to provide pose labels. For each pose extracted from a training image, we randomly sample guidance images from other dancing videos, and for each testing pose, we randomly sample guidance images from other dancing videos.
Appendix B Network Architectures
We follow the naming convention used in Johnson et al., CycleGAN and pix2pixHD. Let c7s1-k denote a 7×7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a 3×3 Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding is used to reduce boundary artifacts. Rk,t denotes t residual blocks, each containing two 3×3 convolutional layers with k filters. uk denotes a 3×3 fractional-strided Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2.
The architecture of generator is represented as:
c7s1-64, d128, d256, d512, d1024, R1024,9, u512, u256, u128, u64, c7s1-3
We use PatchGANs in both of the two discriminators $D$ and $D_S$. Let Ck denote a 4×4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. The output of the last layer is fed to an extra convolution layer to produce a 1-dimensional output. InstanceNorm is not used for the first C64 layer. The LeakyReLU slope is set to 0.2.
The architectures of and are identical:
C64, C128, C256, C512
Appendix C Training Details
All the networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. For the first 250K iterations, the learning rate was fixed at 0.0002 with the adversarial style-consistency loss turned off. For the next 250K iterations, we turned that loss on. Over the final 500K iterations, the learning rate linearly decayed to zero with all losses turned on.
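The two-phase learning-rate schedule described above (constant, then linear decay to zero) can be expressed as a small helper; the function name is ours, the step counts and base rate come from the text:

```python
def learning_rate(step, base_lr=2e-4, fixed_steps=500_000, decay_steps=500_000):
    """Constant base rate for the first `fixed_steps` iterations, then a
    linear decay to zero over the following `decay_steps` iterations."""
    if step < fixed_steps:
        return base_lr
    t = min(step - fixed_steps, decay_steps)
    return base_lr * (1.0 - t / decay_steps)
```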
The models were trained on an NVIDIA GTX 1080 Ti GPU with 11 GB of memory. Inference takes about 8-10 milliseconds per image.
Appendix D Additional Results
In Figure 11 and following pages, we show further experimental results from our method and baselines.