Example-Guided Style Consistent Image Synthesis from Semantic Labeling

06/04/2019 ∙ by Miao Wang, et al.

Example-guided image synthesis aims to synthesize an image from a semantic label map and an exemplary image indicating style. We use the term "style" in this problem to refer to implicit characteristics of images; for example, in portraits "style" includes gender, racial identity, age and hairstyle, in full-body pictures it includes clothing, and in street scenes it refers to properties such as weather and time of day. A semantic label map in these cases indicates facial expression, full-body pose, or scene segmentation. We propose a solution to the example-guided image synthesis problem using conditional generative adversarial networks with style consistency. Our key contributions are (i) a novel style consistency discriminator to determine whether a pair of images are consistent in style; (ii) an adaptive semantic consistency loss; and (iii) a training data sampling strategy, for synthesizing results that are style-consistent with the exemplar.


1 Introduction

In image-to-image translation tasks, mappings between two visual domains are learnt. Various computer vision and graphics problems are addressed and formulated within the image-to-image translation framework, including super-resolution [30, 28], colorization [29, 49], inpainting [38, 21], style transfer [23, 34] and photorealistic image synthesis [22, 6, 46]. In the photorealistic image synthesis problem, images are generated from abstract semantic label maps such as pixel-wise segmentation maps or sparse landmarks. In this paper, we study the problem of example-guided image synthesis: given an input semantic label map and a guidance image, the goal is to synthesize a photorealistic image that is semantically consistent with the label map while being style-consistent with the exemplar. What counts as style consistency depends on the application: in portraits, it means that the synthetic output should plausibly show the same type of person as the input exemplar; in full-body images it means the same clothing; and in street scenes it includes such things as the same weather and time of day. Representative applications are shown in Figure 1.

Example-guided image synthesis cannot be solved with a straightforward combination of photorealistic image synthesis based on pix2pixHD [22, 46] and style transfer [34]: the style of the input exemplar is not well preserved in the synthetic result, see Figure 5. Recently, example-guided image-to-image translation frameworks [20, 31, 2] have been proposed that use disentangled representations of content and style, or of identity and attributes; however, they fail to synthesize photorealistic results from abstract semantic label maps. The challenges are multi-fold: first, the ground-truth photorealistic result for each label map given an arbitrary exemplar is not available for training; second, the synthetic results should be photorealistic while semantically consistent with the source label maps; last but not least, the synthetic result should be stylistically consistent with the corresponding image exemplar.

We present a method for this example-guided image synthesis problem with conditional generative adversarial networks. We build on the recent pix2pixHD [46] for image synthesis to ensure photorealism, with the crucial contributions of:

  • a novel style consistency discriminator to enforce style consistency of a pair of images (see Section 3.2.2);

  • an adaptive semantic consistency loss to ensure quality (see Section 3.2.3);

  • a data sampling strategy that ensures we need only a weakly supervised approach for training (see Section 3.3).

2 Related Work

Generative Adversarial Networks.

In recent years, generative adversarial networks (GANs) [11, 1] for image generation have progressed rapidly [22, 46]. Driven by adversarial losses, generators and discriminators compete with each other: discriminators aim to distinguish generated fake images from real images of the target domain, while generators try to fool the discriminators. Techniques to improve GANs include progressive GANs [19, 48, 24] and better training objectives and procedures [42, 1, 37, 43]. In this paper, we use GANs for example-guided image generation with style consistency awareness.

Image-to-Image Translation and Photorealistic Image Synthesis.

The goal of image-to-image translation is to translate images from a source domain to a target domain. Isola et al. [22] proposed the conditional GAN framework for various image-to-image translation tasks with paired images for supervision. Wang et al. [46] extended this work to high-resolution image synthesis and interactive manipulation. Recently, researchers have proposed to solve the unsupervised image-to-image translation problem with cycle consistency, overcoming the lack of paired training data [51, 25, 33, 52, 20, 31, 5]. Photorealistic image synthesis [6, 39, 46] is a specific application of image-to-image translation, where images are synthesized from abstract semantic label maps. Chen et al. [6] proposed a cascaded framework to synthesize high-resolution images from pixel-wise labeling maps. Wang et al. [46] proposed a framework for instance-level image synthesis with conditional GANs.

Very recently, a few works [16, 20, 31, 35] have been proposed to transfer the style or attributes of an exemplar to a source image, where both images belong to photorealistic domains (a.k.a. domain adaptation). Our goal differs from these works in that we aim to synthesize photos from an abstract semantic label domain rather than from a photorealistic image domain. Zheng et al. [50] proposed a clothes-changing system to change the clothing of a person in an image. Chan et al. [4] presented a network to synthesize a dance video from a target dance video and a source exemplar video; unlike our model, their network is trained for every input exemplar video. Ma et al. [36] proposed to synthesize person images from pose keypoints. We show in Section 4 that our method outperforms the state-of-the-art methods.

Style Transfer.

Style transfer is a long-standing problem in computer vision and graphics, which aims to transfer the style of a source image to a target image or target domain. Some approaches [14, 10, 23, 34, 18, 32, 12, 5, 17] transfer style based on a single exemplar, while others learn a general, holistic style of a target domain [51, 20, 31, 7]. Similar to our model, the PairedCycleGAN model [5] uses a style discriminator in its make-up application to decide whether a pair of facial images wear the same make-up. However, in their discriminator the input image pair must be accurately aligned via warping, and a generator is learned for each facial component. Our style consistency discriminator, in contrast, provides a general solution for image synthesis from both sparse labels (e.g. sketches and poses) and pixel-wise dense labels (e.g. scene parsing maps).

3 Example-guided Image Synthesis

In this section, we first review the baseline model pix2pixHD [46], then describe our method, a conditional generative adversarial network for synthesizing photorealistic images from semantic label maps given specific exemplars. Finally we show how to appropriately prepare training data for our framework.

3.1 The pix2pixHD Baseline

pix2pixHD [46] is a powerful image synthesis and interactive manipulation framework based on the pioneering conditional image-to-image translation method pix2pix [22]. Let x be a label map from a semantic label domain X; the goal of pix2pixHD is to synthesize an image ŷ from x: ŷ = G(x). It consists of a hierarchically integrated generator and multi-scale discriminators to handle high-resolution synthesis tasks. The goal of the generator is to translate semantic label maps to photorealistic images, and the objective of the discriminators is to distinguish generated fake images from real ones at different resolutions. The training dataset consists of pairs (x, y) of a label map x and the corresponding real image y.

pix2pixHD optimizes a multi-task objective combining a standard GAN loss L_GAN and a feature matching loss L_FM:

min_G max_D  L_GAN(G, D) + λ L_FM(G, D)    (1)

where L_GAN is the standard GAN loss given by:

L_GAN(G, D) = E_(x,y)[log D(x, y)] + E_x[log(1 − D(x, G(x)))]    (2)

and L_FM is the feature matching loss given by:

L_FM(G, D) = E_(x,y) Σ_{i=1..T} (1/N_i) ‖ D^(i)(x, y) − D^(i)(x, G(x)) ‖_1    (3)

where T is the number of discriminator layers and N_i is the number of elements in the i-th discriminator layer. An optional perceptual loss can be added as an L1 loss between features of a pre-trained VGG network [44].
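For concreteness, the following PyTorch-style sketch shows how the GAN and feature-matching terms of Equations 1-3 could be computed; it assumes the discriminator returns its intermediate feature maps along with the final prediction, and all function and variable names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as TF

def pix2pixhd_losses(G, D, x, y, lam=10.0):
    """Sketch of Eqs. (1)-(3): standard GAN loss plus feature matching.

    Assumes D(label, image) returns a list of intermediate feature maps whose
    last element is the real/fake prediction map (an illustrative API).
    """
    y_fake = G(x)                                    # synthesize from the label map
    feats_real = D(x, y)                             # features + prediction, real pair
    feats_fake = D(x, y_fake)                        # features + prediction, fake pair
    pred_real, pred_fake = feats_real[-1], feats_fake[-1]

    # Standard GAN loss (Eq. 2), written in its non-saturating BCE form.
    d_loss = TF.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) + \
             TF.binary_cross_entropy_with_logits(pred_fake.detach(), torch.zeros_like(pred_fake))
    g_gan = TF.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))

    # Feature matching loss (Eq. 3): L1 distance between discriminator features;
    # l1_loss averages over elements, playing the role of the 1/N_i term.
    n_layers = max(len(feats_fake) - 1, 1)
    g_fm = sum(TF.l1_loss(f_fake, f_real.detach())
               for f_fake, f_real in zip(feats_fake[:-1], feats_real[:-1])) / n_layers

    g_loss = g_gan + lam * g_fm                      # generator side of Eq. (1)
    return g_loss, d_loss
```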

One appealing feature of pix2pixHD is instance-level image manipulation via a feature embedding technique. Given an instance-level segmentation map, pix2pixHD is able to synthesize an image with a specific appearance taken from an instance exemplar of the same object category. We will show that, even without an input instance-level pixel-wise segmentation map as a constraint, our model is still able to synthesize images whose style is automatically transferred from exemplar images.

Figure 2: Overview of our framework, consisting of a generator G and two discriminators D and D_S. (a) Given an input label map, a guidance example and its labels generated by a known function F, the generator tries to synthesize an image that is semantically consistent with the labels while being style-consistent with the exemplar. (b) The standard discriminator D learns to distinguish between real and synthetic images given the conditional input. (c) The style consistency discriminator D_S aims to distinguish between style-consistent and style-inconsistent image pairs.

3.2 Our Model

Let z be a guidance image from a natural image domain Y. Our goal is to synthesize an image ŷ from a semantic label map x and the image z: ŷ = G(x, z). The role of z is to provide a style constraint on the image synthesis: the output image must be style-consistent with the exemplar z. Our problem is more difficult than the one solved by pix2pixHD. One particular challenge we face is that, given an input label map x, ground-truth images for arbitrary guiding style exemplars are missing. To solve this weakly-supervised problem, we learn style consistency between pairs of images, which may be style-consistent or style-inconsistent (see Section 3.3).

An overview of our method is illustrated in Figure 2. It builds upon a single-scale version of pix2pixHD and contains: (i) a generator G that takes the semantic map x, the style exemplar z and its corresponding labels F(z) as input and outputs a synthetic image ŷ = G(x, z, F(z)); (ii) a standard discriminator D that distinguishes real images from fake ones given conditional inputs; and (iii) a style consistency discriminator D_S that detects whether the synthetic image and the guidance image are style-compatible, operating on image pairs from domain Y. Here, F is an operator which, given an image, produces a set of semantic labels that represent the image (choices of F are given in Section 4.2); for convenience F(z) can be visualized as an image, provided the viewer recalls that the image contains semantic labels. Our objective function contains three losses: a standard adversarial loss, a novel adversarial style consistency loss, and a novel adaptive semantic consistency loss.

3.2.1 Standard Adversarial Loss

We apply a standard adversarial loss via the standard discriminator D:

L_adv(G, D) = E_(x,y)[log D(x, y)] + E_(x,z)[log(1 − D(x, G(x, z, F(z))))]    (4)

where the generator G tries to synthesize images that look like real images from the image domain Y regardless of specific styles, while the discriminator D, given an image conditioned on the corresponding label map, aims to determine whether the image is real or fake.
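A minimal sketch of the conditional adversarial loss of Equation 4 is given below, written in the LSGAN form that Section 4.1 reports using for stable training; the (label map, image) interface of D and the helper name are assumptions for illustration.

```python
import torch
import torch.nn as nn

_lsgan = nn.MSELoss()  # LSGAN objective, which Section 4.1 reports using for stability

def adversarial_loss(D, x, y, y_hat):
    """Standard conditional adversarial loss (Eq. 4), in LSGAN form.

    D is assumed to take (label map, image) and return a patch-wise
    real/fake prediction map; names are illustrative.
    """
    pred_real = D(x, y)
    pred_fake_d = D(x, y_hat.detach())   # discriminator branch: no gradient into G
    d_loss = _lsgan(pred_real, torch.ones_like(pred_real)) + \
             _lsgan(pred_fake_d, torch.zeros_like(pred_fake_d))

    pred_fake_g = D(x, y_hat)            # generator branch: make the fake look real
    g_loss = _lsgan(pred_fake_g, torch.ones_like(pred_fake_g))
    return g_loss, d_loss
```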

3.2.2 Adversarial Style Consistency Loss

With the standard adversarial loss, the generator is able to synthesize images matching the data distribution of domain Y; however, the synthetic results are not guaranteed to be style-consistent with the corresponding guidance z. We introduce a style consistency loss using a discriminator D_S that operates on a pair of images, either both real, or one real and one synthetic:

L_sc(G, D_S) = E[log D_S(y_1, y_2)] + E[log(1 − D_S(y'_1, y'_2))] + E[log(1 − D_S(z, G(x, z, F(z))))]    (5)

where (y_1, y_2) is a pair of real images sampled from domain Y with the same style, and (y'_1, y'_2) is a pair of real images sampled from Y with different styles. We introduce the data sampling strategy in Section 3.3.

With the proposed adversarial style consistency loss L_sc, the discriminator D_S learns to judge the style consistency of a pair of images, while the generator G tries to fool D_S by generating an image with the same style as the exemplar z.
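The pair-based style consistency term could be realized as sketched below: D_S is a PatchGAN-like network applied to the channel-wise concatenation of two images, and the loss treats same-style real pairs as "real" and both different-style real pairs and (exemplar, synthetic) pairs as "fake". The concatenation design and all names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StyleConsistencyD(nn.Module):
    """Pair discriminator D_S: judges whether two images share the same style.

    A PatchGAN-like network over the channel-wise concatenation of two RGB
    images (an assumed design choice, consistent with Section 3.2.2).
    """
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 2, ch * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(ch * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 4, 1, 4, stride=1, padding=1),  # patch-wise consistency score
        )

    def forward(self, img_a, img_b):
        return self.net(torch.cat([img_a, img_b], dim=1))


_lsgan_sc = nn.MSELoss()

def style_consistency_loss(D_S, same_pair, diff_pair, exemplar, fake):
    """Eq. (5) in LSGAN form: same-style real pairs are treated as 'real';
    different-style real pairs and (exemplar, synthetic) pairs as 'fake'."""
    pred_same = D_S(*same_pair)
    pred_diff = D_S(*diff_pair)
    pred_fake = D_S(exemplar, fake.detach())
    d_loss = _lsgan_sc(pred_same, torch.ones_like(pred_same)) + \
             _lsgan_sc(pred_diff, torch.zeros_like(pred_diff)) + \
             _lsgan_sc(pred_fake, torch.zeros_like(pred_fake))
    g_loss = _lsgan_sc(D_S(exemplar, fake), torch.ones_like(pred_fake))
    return g_loss, d_loss
```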

3.2.3 Adaptive Semantic Consistency Loss

The semantic consistency loss is introduced so that the synthesized image respects the semantics of the label map (e.g. a sketch). It may appear that we could use the error between the input labels x and the labels F(G(x, z, F(z))) predicted from the synthetic image, for example ‖x − F(G(x, z, F(z)))‖ or some variant thereof. However, different applications give distinct meanings to the semantic label maps, with the consequence that the gradient of such a loss will, in general, vary between applications. This would mean selecting hyper-parameters to combine losses on a per-application basis.

We avoid this problem by always computing semantic consistency losses between images: the synthetic image and, specifically, an image y that is a priori known to be consistent with the given semantic map x. Typically y is drawn from the training dataset, so that x = F(y). A particular issue with this scheme is that such a loss will try to converge the network output to the image y, which by choice is photorealistic and semantically consistent with x. Such behavior works perfectly when y and the exemplar z are sampled from images with the same style, but could force the output away from the desired style when y and z are "style-wise" different.

Our solution is to use a novel adaptive VGG loss, computed via a pre-trained model [23], between the synthetic image G(x, z, F(z)) and the real image y of the label map x. An adaptive weighting scheme is used for the per-layer VGG loss computation, to ensure the semantic consistency of the synthetic image with x:

L_sem(G) = Σ_l α_l ‖ φ_l(G(x, z, F(z))) − φ_l(y) ‖_1    (6)

where φ_l represents the l-th layer feature extractor of the VGG network and α_l is the adaptive weight for the l-th layer, normalized by N_l, the number of elements in the l-th feature layer. We set larger shallow-layer weights to gain the impact of fine details when y and z come from a style-consistent sampled pair, and smaller shallow-layer weights to suppress the impact of detail matching for a style-inconsistent pair. The adaptive weighting scheme is illustrated in Figure 3.
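A possible realization of the adaptive semantic consistency loss is sketched below; the VGG layer choices and the particular down-weighting of shallow layers for style-inconsistent pairs are illustrative assumptions, since the paper's exact per-layer weights are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as TF
from torchvision.models import vgg19

class AdaptiveVGGLoss(nn.Module):
    """Sketch of the adaptive semantic consistency loss (Eq. 6).

    Shallow VGG layers carry fine detail; their contribution is reduced when
    the exemplar is style-inconsistent with the ground-truth image of the
    label map. Layer indices and weights are illustrative, not the paper's.
    """
    def __init__(self, layer_ids=(3, 8, 17, 26, 35)):  # relu1_2 ... relu5_4 (assumed choice)
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features.eval()  # torchvision >= 0.13
        for p in features.parameters():
            p.requires_grad_(False)
        self.features = features
        self.layer_ids = set(layer_ids)

    def _extract(self, img):
        feats, h = [], img
        for i, layer in enumerate(self.features):
            h = layer(h)
            if i in self.layer_ids:
                feats.append(h)
        return feats

    def forward(self, synth, real, style_consistent: bool):
        f_s, f_r = self._extract(synth), self._extract(real)
        loss = 0.0
        for l, (a, b) in enumerate(zip(f_s, f_r)):
            # Down-weight shallow layers for style-inconsistent pairs (illustrative rule);
            # l1_loss averages over elements, which plays the role of the 1/N_l term.
            w = 1.0 if style_consistent else float(l + 1) / len(f_s)
            loss = loss + w * TF.l1_loss(a, b.detach())
        return loss
```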

Figure 3: Adaptive weight for semantic consistency loss.
Full Objective.

The final loss is formulated as:

L(G, D, D_S) = L_adv(G, D) + λ_sc L_sc(G, D_S) + λ_sem L_sem(G)    (7)

where λ_sc and λ_sem control the relative importance of the terms. Our full objective is given by:

G* = arg min_G max_{D, D_S} L(G, D, D_S)    (8)
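Putting the pieces together, one training iteration for the full objective of Equations 7-8 might look as follows; it reuses the loss sketches above (adversarial_loss, style_consistency_loss and the adaptive VGG loss), and the batch keys and the weights lam_sc / lam_sem are placeholders rather than the paper's actual values.

```python
def training_step(G, D, D_S, vgg_loss, batch, opt_g, opt_d, lam_sc=1.0, lam_sem=10.0):
    """One optimization step for the full objective (Eqs. 7-8).

    `batch` is assumed to hold a label map x, its real image y, an exemplar z
    with labels Fz, a style-consistency flag for (y, z), and real same-style /
    different-style pairs for D_S. lam_sc and lam_sem are placeholder weights.
    """
    x, y = batch["label"], batch["image"]
    z, Fz = batch["exemplar"], batch["exemplar_label"]
    y_hat = G(x, z, Fz)

    # --- update the discriminators D and D_S ---
    opt_d.zero_grad()
    _, d_adv = adversarial_loss(D, x, y, y_hat)
    _, d_sc = style_consistency_loss(D_S, batch["same_style_pair"],
                                     batch["diff_style_pair"], z, y_hat)
    (d_adv + d_sc).backward()
    opt_d.step()

    # --- update the generator G (Eq. 7) ---
    opt_g.zero_grad()
    g_adv, _ = adversarial_loss(D, x, y, y_hat)
    g_sc, _ = style_consistency_loss(D_S, batch["same_style_pair"],
                                     batch["diff_style_pair"], z, y_hat)
    g_sem = vgg_loss(y_hat, y, style_consistent=batch["style_consistent"])
    (g_adv + lam_sc * g_sc + lam_sem * g_sem).backward()
    opt_g.step()
```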

3.3 Sampling Strategy for Style-consistent and Style-inconsistent Image Pairs

So far, we have introduced the core techniques of our network. However, one prerequisite of our method is obtaining style-consistent image pairs (y_1, y_2) and style-inconsistent image pairs (y'_1, y'_2). The datasets used by prior image-to-image translation works [22, 46, 51, 31, 20] are therefore not suitable for our training.

A key idea for training data acquisition is to collect image pairs from videos. In the face and dance synthesis tasks, we observe that: (i) within a short temporal period of a video, the style of the frame contents is guaranteed to be the same, and (ii) frames from different videos probably have different styles (e.g. different gender, hairstyle, skin color and make-up in the face image synthesis application). We thus randomly sample pairs of frames within a short temporal window of a video and regard them as style-consistent pairs (y_1, y_2). For style-inconsistent pairs (y'_1, y'_2), we first randomly sample pairs of frames from different videos, then manually label whether the images in each sampled pair are style-consistent or not.

In the street view synthesis task, since large-scale street view videos with different styles are not easy to collect, we use images from the BDD100K dataset [47]. In BDD100K, street view images are provided together with weather and time-of-day attributes. We coarsely categorize the images into style groups based on these attributes, then sample style-consistent image pairs inside each group and style-inconsistent image pairs between groups. Figure 4 shows representative sampled pairs of images.
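The sampling strategy can be summarized by the following sketch, which draws a style-consistent pair from nearby frames of one video and a candidate style-inconsistent pair from two different videos or attribute groups; the window size, the attribute lookup and all names are assumptions for illustration (and, as noted above, face/dance candidates are still verified manually).

```python
import random

def sample_pairs(videos, attribute_of=None, window=30):
    """Sketch of the pair sampling in Section 3.3.

    `videos` maps a video id to its ordered frame list; `attribute_of`, if
    given, maps a video id to a coarse style group (e.g. BDD100K weather /
    time-of-day groups). The window size and all names are assumptions.
    """
    vid_a, vid_b = random.sample(list(videos), 2)

    # Style-consistent pair: two frames close in time within one video.
    frames = videos[vid_a]
    i = random.randrange(len(frames))
    j = min(len(frames) - 1, i + random.randint(1, window))
    same_pair = (frames[i], frames[j])

    # Candidate style-inconsistent pair: frames from different videos or
    # different attribute groups; in the face/dance setting such candidates
    # are additionally verified by manual labeling.
    if attribute_of is None or attribute_of[vid_a] != attribute_of[vid_b]:
        diff_pair = (random.choice(videos[vid_a]), random.choice(videos[vid_b]))
    else:
        diff_pair = None  # same style group: reject and resample in practice
    return same_pair, diff_pair
```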

Figure 4: Representative sampled data for training our networks, from the FaceForensics [41], YouTube Dance and BDD100K [47] datasets. Each row shows pairs of images sampled from one of the three datasets.

4 Experiments

4.1 Implementation Details

Figure 5: Example-based face image synthesis on the FaceForensics dataset. The first column shows the input labels, the second column shows the input style example, and the remaining columns show the results from our method and our ablation studies, pix2pixHD, pix2pixHD with DPST, MUNIT and PairedMUNIT.

We implement our model based on the single-scale pix2pixHD framework and experiment with fixed-size image crops (using a wider crop for street view synthesis). The generator G contains several Convolution-InstanceNorm-ReLU layers with stride 2 to encode deep features, then 9 residual blocks [13], and finally several Convolution-InstanceNorm-ReLU layers with stride 1/2 to synthesize images. For both discriminators D and D_S, we use PatchGANs [22] with several Convolution-InstanceNorm-LeakyReLU layers with stride 2, with the exception that InstanceNorm is not applied in the first layer. The slope for LeakyReLU is set to 0.2. For all the experiments, the weights λ_sc and λ_sem in Equation 7 are kept fixed. All the networks are trained from scratch on an NVIDIA GTX 1080 Ti GPU using the Adam solver [27] with a batch size of 1. The learning rate is initially fixed at 0.0002 for the first 500K iterations and linearly decayed to zero over the next 500K iterations. We use LSGANs [37] for stable training. For more details, please refer to the supplementary material.
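A small sketch of the optimizer and learning-rate schedule described above (Adam, constant rate for 500K iterations, then linear decay to zero over the next 500K); the Adam betas are the usual GAN defaults and are an assumption here.

```python
import torch

def make_optimizer_and_schedule(params, lr=2e-4, fixed_iters=500_000, decay_iters=500_000):
    """Adam optimizer with the schedule described above: a constant learning
    rate for the first 500K iterations, then linear decay to zero over the
    next 500K. The betas are the usual GAN defaults, assumed here."""
    opt = torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))

    def lr_lambda(it):
        if it < fixed_iters:
            return 1.0
        return max(0.0, 1.0 - (it - fixed_iters) / float(decay_iters))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```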

4.2 Datasets

We evaluate our method on face, dance and street view image synthesis tasks, using the following datasets:

Sketch→Face. We use the real videos in the FaceForensics dataset [41], which contains videos of reporters broadcasting news. We use the image sampling strategy described in Section 3.3 to acquire training image pairs from the videos, then apply a face alignment algorithm [26] to localize facial landmarks, crop the facial regions and resize them to a fixed size. The detected facial landmarks are connected to create face sketches; this serves as the labeling function F (an illustrative sketch of this step is given after this list).

Pose→Dance. We download solo dance videos from YouTube, crop out the central body regions and resize them to a fixed size. As the number of videos is small, we evenly split each video into a first part and a second part along the timeline, then sample training data only from the first parts and testing data only from the second parts of all the videos. The function F is implemented by concatenating pre-trained DensePose [40] and OpenPose [3] pose detection results to provide pose labels.

Scene parsing→Street view. We use the BDD100K dataset [47] to synthesize street view images from pixel-wise semantic labels (i.e. scene parsing maps). We use the state-of-the-art scene parsing network DANet [9] as the function F. Please find more details in our supplementary material.
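As an illustration of the labeling function F for the Sketch→Face task, the following sketch connects detected 68-point facial landmarks into a sparse sketch image with OpenCV; the landmark grouping and drawing details are assumptions, not the authors' exact procedure.

```python
import cv2
import numpy as np

def landmarks_to_sketch(landmarks, size=(256, 256)):
    """Illustrative version of F for Sketch->Face: connect detected facial
    landmarks into a sparse sketch image. The grouping below follows the
    common 68-point convention and is an assumption, not the paper's code."""
    groups = [range(0, 17),  range(17, 22), range(22, 27), range(27, 31),
              range(31, 36), range(36, 42), range(42, 48), range(48, 60),
              range(60, 68)]
    sketch = np.zeros((size[1], size[0]), dtype=np.uint8)
    pts = np.asarray(landmarks, dtype=np.int32)
    for g in groups:
        poly = pts[list(g)].reshape(-1, 1, 2)
        cv2.polylines(sketch, [poly], isClosed=False, color=255, thickness=1)
    return sketch
```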

4.3 Baselines

We compare our method with the following algorithms:

pix2pixHD [46] and pix2pixHD with DPST [34]. pix2pixHD is the image-to-image translation baseline. A default image is first synthesized with pix2pixHD; its style is then transferred toward the guided example using the Deep Photo Style Transfer (DPST) method.

MUNIT [20] and PairedMUNIT. MUNIT is a state-of-the-art unsupervised image-to-image translation method with disentangled content and style representations, able to translate images according to given exemplars. We modify MUNIT by integrating pairwise style information into the original model and adaptively computing the style-related losses (denoted PairedMUNIT).

Ours without L_sc, without L_sem, or without adaptive weights, for ablation studies. All of the methods are trained on the datasets introduced in Section 4.2.

4.4 Evaluation Metrics

Photorealism and Semantic Consistency. We use the Fréchet Inception Distance (FID) [15] to evaluate the realism and faithfulness of the synthetic results. This metric is widely used for implicit generative models because it correlates with the visual quality of generated samples; a smaller FID generally corresponds to results favored by human subjects. We further evaluate semantic consistency by translating the synthetic images back to the label domain and comparing the result to the input labels. For the Sketch→Face and Pose→Dance tasks, we use the labeling endpoint error (LEPE) between the input label map and the labels generated by F from the synthetic image to compute the label accuracy. For the Scene parsing→Street view task, we use the scene parsing score (SPS) [9] on the synthetic street view images to measure segmentation accuracy.
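The labeling endpoint error could be computed as in the sketch below, i.e. the mean Euclidean distance between the input landmarks and the landmarks re-detected on the synthetic image, normalized by the image size; the exact normalization used in the paper is assumed.

```python
import numpy as np

def label_endpoint_error(input_landmarks, resynth_landmarks, image_size):
    """Labeling endpoint error (LEPE): mean Euclidean distance between the
    input landmarks and the landmarks re-detected by F on the synthetic
    image, normalized by the image size (normalization assumed)."""
    a = np.asarray(input_landmarks, dtype=np.float64)    # shape (K, 2)
    b = np.asarray(resynth_landmarks, dtype=np.float64)  # shape (K, 2)
    return float(np.linalg.norm(a - b, axis=1).mean() / image_size)
```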

Style Consistency. We perform a human perceptual study to compare style consistency from a human point of view. We show pairs consisting of our result and the result of a baseline method to invited subjects and ask which one they see as closer to the guidance's style.

Method Sketch→Face Pose→Dance Parsing→Street
pix2pixHD 39.86 92.33 157.46
MUNIT 148.57 158.47 235.84
PairedMUNIT 142.08 161.22 259.95
Ours 31.26 33.39 96.23
Table 1: Photorealism comparison measured by Fréchet Inception Distance (FID) [15].
Method Sketch→Face Pose→Dance
pix2pixHD 0.0050 0.0163
MUNIT 0.0107 0.0958
PairedMUNIT 0.0080 0.0502
Ours 0.0085 0.0186
Table 2: Semantic consistency measured by normalized label endpoint error for different methods in face and dance image synthesis tasks.

4.5 Results

Main Results. In Figure 5, we show our results (column 3) and the results from the baseline methods in the Sketch→Face synthesis application on the test set. While pix2pixHD is able to generate photorealistic images consistent with the input semantic labels, it is not able to keep the style (e.g. gender, hair, skin color) of the input exemplars in the synthetic results, even when enhanced by deep photo style transfer (columns 7 and 8). The unsupervised method MUNIT and its improved variant PairedMUNIT fail to generate photorealistic results from semantic maps in this application (columns 9 and 10). A possible reason for their failure is their assumption that the input and output domains share the same content space, which does not hold when synthesizing images from semantic label maps.

Table 1 gives the quantitative evaluation of photorealism measured by FID on the various image synthesis tasks, where our method performs best. The semantic consistency of the synthetic results with the input labels is given by LEPE in Table 2. pix2pixHD obtains the best semantic consistency with the input labels because, by ignoring style consistency entirely, it sacrifices no semantic accuracy. Our method outperforms MUNIT and PairedMUNIT.

For style consistency evaluation, we conduct a human perception study of the kind commonly used in image-to-image translation works [22, 51, 6, 46, 8]. The input exemplars and pairs of synthetic results sampled from our method and a baseline method were shown to the subjects with unlimited viewing time. The subjects were then asked: "Which image is closer to the exemplar in terms of style?" Images for the user study were randomly sampled from the test set; each pair was shown in random order and examined by at least 30 subjects. The percentages of votes our method received over the baseline methods are given in Table 3; our method won the majority of user preferences in every pairwise comparison. These quantitative results show that our results are more photorealistic and more style-consistent with the exemplars.

We conducted ablation studies to validate our model. As can be seen in Figure 5, without the adaptive weight scheme in L_sem, the quality of the results is slightly reduced; without the semantic consistency loss L_sem, semantic consistency is lost; and without the style consistency adversarial loss L_sc, the target style is not maintained. The quantitative photorealism statistics reported in Table 4 confirm these observations. We further extract eye patches from the synthetic images and the exemplars and compute the VGG feature distance between them. Table 5 indicates that the weight adaptation yields a quantitative improvement in style consistency.

Figure 15 shows in-the-wild synthesis results from our model on Internet images. The results indicate that the model generalizes well to "unseen" cases. We provide more results in the supplementary material.

pix2pixHD pix2pixHD+DPST PairedMUNIT
Ours 89.12% 80.67% 90.88%
Table 3: Style consistency evaluation by user study on the Sketch→Face synthesis task. Each cell lists the percentage of comparisons in which our result is preferred over the other method.
Method FID Method FID
Ours w/o adaptive weights 35.59 Ours w/o 58.08
Ours w/o 76.59 Ours 31.26
Table 4: Ablation study: Fréchet Inception Distance (FID) of our results and alternatives on the Sketch→Face synthesis task.
Ours Ours w/o adapt. weights w/o w/o
VGG Dist. 0.643 0.654 0.898 0.703
Table 5: VGG feature distance of eye patches between synthetic image and exemplar.
Figure 6: In-the-wild Sketch→Face synthesis.
Figure 7: Dance synthesis from pose maps.
Figure 8: Pose→Dance comparison with Ma et al. [36].
Figure 9: More results of example-based image synthesis on face, dance and street view synthesis tasks.
Figure 10: Street view synthesis from scene parsing maps and corresponding exemplars.

Pose→Dance Synthesis. Figure 7 shows a visual comparison of our method and the baselines in the Pose→Dance synthesis application. The semantic consistency of the synthetic results with the input labels, measured using LEPE, is given in Table 2. Although the facial regions of our results are blurry, since facial landmarks are not included in the input pose labels, our model still produces images that are style-consistent with the guidance images while remaining consistent with the semantic labels. Figure 8 shows a visual comparison with Ma et al. [36] on the dancing dataset; the generated poses and clothes in our results are visually better.

Scene parsing→Street view Synthesis. A comparison of our method and the baselines in the Scene parsing→Street view task is given in Figure 10. The semantic consistency of the synthetic results with the input labels, measured using SPS, is given in Table 6. Although the scenes in the guidance images do not share exactly the same semantics as the input label maps, our model is able to produce images that are semantically consistent with the segmentation map and style-consistent with the guidance image.

Figure 9 shows more results. Our network can faithfully synthesize images from various semantic labels and exemplars. Please find more results in the supplementary file.

Method Per-pixel acc. Per-class acc. Class IOU
pix2pixHD 83.85 36.17 0.310
MUNIT 58.58 18.99 0.139
PairedMUNIT 62.96 22.41 0.160
Ours 84.71 39.44 0.333
Original image 86.74 52.25 0.452
Table 6: Semantic consistency measured by scene parsing score [9] for different methods on the street view image synthesis task.

5 Conclusions

In this paper, we have presented a novel method for example-guided, style-consistent image synthesis from general-form semantic labels. During network training, we propose to sample style-consistent and style-inconsistent image pairs from videos to provide style awareness to the model. Beyond that, we introduce the style consistency adversarial loss and the style consistency discriminator, as well as the semantic consistency loss with adaptive weights, to produce plausible results. Qualitative and quantitative results on different applications show that the proposed model produces more realistic and style-consistent images than prior art.

Limitations and Future Work. Our network is mainly trained on cropped video data of limited resolution, and we did not use the multi-scale architecture that pix2pixHD employs for high-resolution image synthesis. Moreover, the synthesized background in the face and dance image synthesis tasks may be blurry, because the semantic labels do not specify any background scene. Lastly, while we have demonstrated the effectiveness of our method in several synthesis applications, the results in other applications could be affected by the performance of the state-of-the-art semantic labeling function F. In the future, we plan to extend this framework to the video domain [45] and synthesize style-consistent videos for given exemplars.

Acknowledgements. We thank the anonymous reviewers for the valuable discussions. This work was supported by the Natural Science Foundation of China (Project Number: 61521002, 61561146393). Shi-Min Hu is the corresponding author.

References

Appendix A Datasets

As described in the main manuscript, we evaluate our model on face, dance and street view image synthesis tasks, using the following datasets and semantic labeling functions:

Sketch→Face. We use the real videos in the FaceForensics dataset [41], which contains videos of reporters broadcasting news. We use the image sampling strategy described in Section 3.3 of the main manuscript to acquire training image pairs from the videos, then apply a face alignment algorithm [26] to localize facial landmarks, crop the facial regions and resize them to a fixed size. Training images are sampled from one set of videos and testing images from a distinct set of videos. The detected facial landmarks are connected to create face sketches; this is the function F, in both the training and test sets. For each sketch extracted from a training image, we randomly sample guidance images from other videos for training, and for each testing sketch, we randomly sample guidance images from other videos for testing.

Scene parsing→Street view. We use the BDD100K dataset [47] to synthesize street view images from pixel-wise semantic labels (i.e. scene parsing maps). For each street view image in the dataset, the corresponding scene parsing map and weather and time-of-day attributes are provided. Based on these attributes, we divide the images into the style groups listed in Table 7, then sample style-consistent image pairs inside each group and style-inconsistent image pairs between groups. Training and test images are resized to a fixed width. We use the scene parsing network DANet [9] as the function F for each street view image during testing. For each scene parsing map, we randomly select an image inside each style group as the guidance, in both the training and testing phases.

Group Weather Time of day
1 - Night
2 Foggy Dawn or Dusk
3 Overcast Daytime
4 Rainy Dawn or Dusk
5 Snowy Dawn or Dusk
6 Clear Dawn or Dusk
7 Foggy Daytime
8 Partly cloudy Dawn or Dusk
9 Rainy Daytime
10 Snowy Daytime
11 Clear Daytime
12 Overcast Dawn or Dusk
13 Partly cloudy Daytime
Table 7: Style groups we used to categorize BDD100K street view images.

Pose→Dance. We downloaded solo dance videos from YouTube, cropped out the central body regions and resized them to a fixed size. As the number of videos is small, we evenly split each video into a first part and a second part along the timeline, then sample training data only from the first parts and testing data only from the second parts of all the videos. The function F is implemented by concatenating pre-trained DensePose [40] and OpenPose [3] pose detection results to provide pose labels. For each pose extracted from a training image, we randomly sample guidance images from other dancing videos, and for each testing pose, we randomly sample guidance images from other dancing videos.

Appendix B Network Architectures

B.1 Generator

We follow the naming convention used in Johnson et al. [23], CycleGAN [51] and pix2pixHD [46]. Let c7s1-k denote a 7×7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding is used to reduce boundary artifacts. Rk×t denotes a stack of t residual blocks, each containing two convolutional layers with k filters. uk denotes a fractionally-strided Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2.

The architecture of the generator is represented as:

c7s1-64, d128, d256, d512, d1024, R1024×9, u512, u256, u128, u64, c7s1-3
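For reference, the generator specification above can be instantiated in PyTorch roughly as follows; the input channel count (concatenated label map, exemplar and exemplar labels) and other minor details are assumptions.

```python
import torch.nn as nn

def down_block(in_ch, out_ch):
    return [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.InstanceNorm2d(out_ch), nn.ReLU(True)]

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """c7s1-64, d128, d256, d512, d1024, R1024x9, u512, u256, u128, u64, c7s1-3."""
    def __init__(self, in_ch, out_ch=3, ngf=64):
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, ngf, 7),
                  nn.InstanceNorm2d(ngf), nn.ReLU(True)]          # c7s1-64
        ch = ngf
        for _ in range(4):                                        # d128 ... d1024
            layers += down_block(ch, ch * 2)
            ch *= 2
        layers += [ResidualBlock(ch) for _ in range(9)]           # R1024 x 9
        for _ in range(4):                                        # u512 ... u64
            layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2, padding=1, output_padding=1),
                       nn.InstanceNorm2d(ch // 2), nn.ReLU(True)]
            ch //= 2
        layers += [nn.ReflectionPad2d(3), nn.Conv2d(ch, out_ch, 7), nn.Tanh()]  # c7s1-3
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```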

B.2 Discriminators

We use PatchGANs [22] for both discriminators D and D_S. Let Ck denote a Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. The output of the last layer is fed to an extra convolution layer to produce a 1-dimensional output. InstanceNorm is not used for the first C64 layer. The LeakyReLU slope is set to 0.2.

The architectures of D and D_S are identical:

C64, C128, C256, C512
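A corresponding sketch of the C64-C128-C256-C512 PatchGAN discriminator; the input channel count depends on what is concatenated (label map plus image for D, an image pair for D_S), and kernel/padding details are assumptions.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """C64, C128, C256, C512, followed by a 1-channel prediction layer.
    InstanceNorm is skipped in the first layer; in_ch depends on what is
    concatenated (label map + image for D, an image pair for D_S)."""
    def __init__(self, in_ch, ndf=64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, ndf, 4, stride=2, padding=2), nn.LeakyReLU(0.2, True)]
        ch = ndf
        for _ in range(3):                                   # C128, C256, C512
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=2),
                       nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=2)]  # patch-wise real/fake map
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```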

Appendix C Training Details

All the networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. For the first 250K iterations, the learning rate was fixed at 0.0002 with the adversarial style consistency loss L_sc turned off. For the next 250K iterations, we turned L_sc on. For the final 500K iterations, the learning rate was linearly decayed to zero with all the losses turned on.

The models were trained on an NVIDIA GTX 1080 Ti GPU with 11GB memory. The inference time is about 8-10 milliseconds per image.

Appendix D Additional Results

In Figure 11 and on the following pages, we show further experimental results from our method and the baselines.

Figure 11: Example-based dance image synthesis on the YouTube Dance dataset. The first column shows the input pose labels, the second column shows the input style examples, and the remaining columns show the results from our method, pix2pixHD and PairedMUNIT.
Figure 12: More results of dance synthesis. The first column shows input pose maps. The first row shows input dance exemplars. Other images are the synthetic dance results.
Figure 13: Example-based face image synthesis on the FaceForensics dataset. The first column shows the input labels, the second column shows the input style example, and the remaining columns show the results from our method and our ablation studies, pix2pixHD, pix2pixHD+DPST and PairedMUNIT.
Figure 14: More results of face synthesis. The first column shows input sketch maps. The first row shows input face exemplars. Other images are the synthetic face results.
Figure 15: More in-the-wild Sketch→Face results. The model is trained on our training dataset and tested on Internet images.
Figure 16: More results of street view synthesis. The first column shows input segmentation maps. The first row shows input exemplars. Other images are the synthetic street view results.