Edge Guided GANs with Semantic Preserving for Semantic Image Synthesis

March 31, 2020 · Hao Tang et al.

We propose a novel Edge guided Generative Adversarial Network (EdgeGAN) for photo-realistic image synthesis from semantic layouts. Although considerable improvements have been achieved, the quality of synthesized images is still far from satisfactory due to two largely unresolved challenges. First, semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. Second, widely adopted CNN operations such as convolution, down-sampling and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects). To tackle the first challenge, we propose to use the edge as an intermediate representation, which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional edge generator and provides detailed structural cues. To tackle the second challenge, we design an effective semantic preserving module that selectively highlights class-dependent feature maps according to the original semantic layout. Extensive experiments on two challenging datasets show that the proposed EdgeGAN generates significantly better results than state-of-the-art methods. The source code and trained models are available at https://github.com/Ha0Tang/EdgeGAN.


1 Introduction

Figure 1: Overview of the proposed EdgeGAN framework. It consists of a parameter-sharing encoder, an edge generator, an image generator, and a multi-modality discriminator. The edge and image generators are connected by the proposed attention guided edge transfer module at two levels, i.e., the edge feature level and the edge content level, in order to generate realistic images. The semantic preserving module is proposed to preserve the semantic information of the input semantic labels. The discriminator aims to distinguish the outputs in the two modalities, i.e., edge and image. The whole framework can be trained end-to-end. The symbol ⓒ denotes channel-wise concatenation.

Semantic image synthesis refers to the task of generating photo-realistic images conditioned on pixel-level semantic labels. This task has a wide range of applications such as image editing and content generation [9, 27, 45, 20, 4, 5]. Although existing approaches such as [9, 27, 45, 20, 38, 46] have conducted interesting explorations, we still observe unsatisfactory aspects, mainly in the generated local structures and small-scale objects, which we attribute to two reasons. First, conventional methods [45, 59, 38] generally take the semantic label map directly as input. However, the label map provides only structural information between different semantic-class regions and no structural information within each region, making it difficult to synthesize rich local structures within each class. Taking the label map in Fig. 1 as an example, the generator does not have enough structural guidance to produce a realistic bed, window and curtain from the input label alone. Second, classic deep network architectures are constructed by stacking convolutional, down-sampling, normalization, non-linearity and up-sampling layers, which causes spatial resolution loss of the input semantic labels.

To address both issues, in this paper, we propose a novel Edge guided Generative Adversarial Network (EdgeGAN) for semantic image synthesis. The overall framework of the proposed EdgeGAN is shown in Fig. 1. We first propose an edge generator to produce edge features and edge maps, which are then selectively transferred to the image generator through the proposed attention guided edge transfer module to improve the quality of the synthesized images. Moreover, to tackle the spatial resolution losses caused by common operations in deep networks, we propose an effective semantic preserving module that selectively highlights class-dependent feature maps according to the original semantic layout. Finally, we develop a multi-modality discriminator to simultaneously distinguish the outputs in two modal spaces, i.e., the edge space and the image space. All the proposed modules are jointly optimized in an end-to-end fashion so that they can benefit from one another during training.

We conduct extensive experiments on two challenging datasets, i.e., Cityscapes [13] and ADE20K [67]. Both qualitative and quantitative results show that the proposed EdgeGAN is able to produce remarkably better results than existing baseline models such as CRN [9], Pix2pixHD [59], SIMS [46], GauGAN [45] and CC-FPSE [38], regarding both the visual fidelity and the alignment with the input semantic labels.

To summarize, the contributions of this paper are as follows:


  • We propose a novel Edge Guided GAN (EdgeGAN) for challenging semantic image synthesis tasks. To the best of our knowledge, we are the first to explore edge generation from semantic layouts and to use the generated edges to guide the synthesis of realistic images.

  • We propose an effective attention guided edge transfer module to selectively transfer useful edge structure information from the edge generation branch to the image generation branch. We also design a new semantic preserving module that highlights class-dependent feature maps based on the input semantic label map for generating semantically consistent results, which has not been investigated by existing GAN-based generation works.

  • Extensive experiments clearly demonstrate the effectiveness of the different proposed modules and of the overall EdgeGAN framework, and establish new state-of-the-art results on two challenging datasets, i.e., Cityscapes [13] and ADE20K [67]. The code will be made publicly available.

2 Related Work

Generative Adversarial Networks (GANs) [19] have two important components, i.e., a generator and a discriminator. Both are trained in an adversarial way to achieve a balance. Recently, GANs have shown the capability of generating realistic images [65, 6, 30, 49, 26, 21, 36, 16, 50, 37, 48, 29, 12, 18, 28, 51, 57]. Moreover, to generate user-specific images, Conditional GANs (CGANs) [40] have been proposed. CGANs usually combine a vanilla GAN and some external information such as class labels [11, 60], human poses [14, 69, 52, 8], conditional images [27, 56], text descriptions [34, 64] and segmentation maps [59, 45, 58, 55, 20, 2, 44].

Image-to-Image Translation aims to generate the target image based on an input image. CGANs have achieved decent results in both paired [27, 1, 54] and unpaired [17, 68] image translation tasks. For instance, Isola et al. propose Pix2pix [27], which employs a CGAN to learn a translation mapping from input to output image domains, such as map-to-photo and day-to-night. To further improve the quality of the generated images, the attention mechanism has been recently investigated in image-to-image translation tasks, such as [55, 31, 39, 10, 62].

Different from previous attention-related image generation works, we propose a novel attention guided edge transfer module to transfer useful edge structure information from the edge generation branch to the image generation branch at two different levels, i.e., feature level and content level. To the best of our knowledge, our module is the first attempt to incorporate both edge feature attention and edge content attention within a GAN framework for image-to-image translation tasks.

Edge Guided Image Generation. Edge maps are usually adopted in image inpainting [47, 42, 35] and image super-resolution [43] tasks to reconstruct the missing structure information of the inputs. For example, Nazeri et al. [42] propose an edge generator to hallucinate edges in the missing regions given the edges of the observed regions, which can be regarded as an edge completion problem. Using edge images as structural guidance, [42] achieves good results even for some highly structured scenes. Moreover, Ghosh et al. [15] propose an interactive GAN-based sketch-to-image translation model that helps users easily create images of some simple objects. Pix2pix [27] adopts edge maps as input and aims to generate photo-realistic shoe and handbag images, which can be seen as an edge-to-image translation problem.

Different from previous works, we propose a novel edge generator to perform a new task, i.e., semantic label-to-edge translation. To the best of our knowledge, this is the first attempt to generate realistic edge maps from semantic labels. The generated edge maps, which carry richer local structure information, are then used to improve the quality of the synthesized images.

Semantic Image Synthesis aims to generate a photo-realistic image from a semantic label map [59, 9, 46, 45, 38, 3]. With the semantic information as guidance, existing methods have achieved promising performance. However, we can still observe unsatisfying aspects, especially in the generation of small-scale objects, which we believe is mainly due to the spatial resolution loss associated with deep network operations such as convolution, normalization and down-sampling.

To alleviate this problem, Park et al. propose GauGAN [45], which uses the input semantic labels to modulate the activations in normalization layers through a spatially-adaptive transformation. However, the spatial resolution losses caused by other operations, such as convolution and down-sampling, remain unresolved. Moreover, we observe that an input label map typically contains only a small subset of the semantic classes of the entire dataset, so the generator should focus on the classes that are actually present rather than on all classes.

To tackle both limitations, we propose a novel semantic preserving module, which aims to selectively highlight class-dependent feature maps according to the input label for generating semantically consistent images. This idea has not been investigated by existing GAN-based generation works.

3 Edge Guided GANs with Semantic Preserving

In this section, we describe the proposed Edge Guided GAN (EdgeGAN) for semantic image synthesis. We first give an overview of the proposed EdgeGAN, then describe each module in detail, and finally present the optimization objective.

Framework Overview. Fig. 1 shows the overall structure of the proposed EdgeGAN for semantic image synthesis, which consists of a semantic and edge guided generator and a multi-modality discriminator. The generator consists of five components: 1) a parameter-sharing convolutional encoder that produces deep feature maps from the input label; 2) an edge generator that generates edge maps from the encoder features; 3) an image generator that produces intermediate images; 4) an attention guided edge transfer module that forwards useful structure information from the edge generator to the image generator; and 5) a semantic preserving module that selectively highlights class-dependent feature maps according to the input label to generate semantically consistent final images. Meanwhile, to effectively train the network, we propose a multi-modality discriminator that simultaneously distinguishes the outputs in the two modality spaces, i.e., edge and image.

EdgeGAN takes a semantic layout as input and outputs a semantically corresponding photo-realistic image. During training, the ground truth edge map is extracted from the corresponding ground truth image with the Canny edge detector [7].
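For concreteness, a minimal sketch of this edge-extraction step with OpenCV is given below; the function name and the Canny hysteresis thresholds are illustrative assumptions rather than values taken from the paper.

```python
import cv2
import numpy as np

def extract_gt_edges(image_path, low_thresh=100, high_thresh=200):
    """Extract a ground-truth edge map with the Canny detector.

    The two hysteresis thresholds are illustrative; the paper does not state
    the exact values used for training.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)   # ground-truth image in grayscale
    edges = cv2.Canny(gray, low_thresh, high_thresh)      # uint8 edge map in {0, 255}
    return (edges > 0).astype(np.float32)                 # binary edge map in {0, 1}
```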

3.1 Edge Guided Semantic Image Synthesis

Parameter-Sharing Encoder. The backbone encoder can employ any network structure, such as the commonly used AlexNet [33], VGG [53] or ResNet [22]. We directly utilize the feature maps from the last convolutional layer of the encoder as the deep feature representation; the encoder takes as input the semantic label map, whose spatial dimensions are the width and height of the layout and whose channel dimension equals the total number of semantic classes. Optionally, one can always combine multiple intermediate feature maps to enhance the feature representation.

The encoder is shared by the edge generator and the image generator, so the gradients from both generators contribute to updating the encoder parameters. This compact design can potentially enhance the deep representations, since the encoder simultaneously learns structure (from the edge generation branch) and appearance (from the image generation branch) representations.
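The sketch below illustrates the parameter-sharing idea in PyTorch: a single convolutional trunk produces features consumed by both branches, so gradients from both the edge and image losses update the same encoder weights. The layer configuration and the class count are assumptions, not the authors' exact ResNet-based encoder.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """A convolutional encoder shared by the edge and image generators."""
    def __init__(self, num_classes, feat_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, label_map):
        # label_map: one-hot semantic layout of shape (B, num_classes, H, W)
        return self.trunk(label_map)

encoder = SharedEncoder(num_classes=35)                  # 35 is an illustrative class count
shared_features = encoder(torch.zeros(1, 35, 256, 512))  # fed to both generators
```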

Edge Guided Image Generation. As discussed, the lack of detailed structure or geometry guidance makes it extremely difficult for the generator to produce realistic local structures and details. To overcome this limitation, we propose to adopt the edge as guidance. A novel edge generator is designed to directly generate edge maps from the input semantic labels. This also encourages the shared encoder to learn more local structure of the target images. Meanwhile, the image generator aims to generate photo-realistic images from the input labels, which drives the encoder to learn the appearance information of the target images.

Previous works [45, 38, 46, 9, 59] directly use deep networks to output the target image, which is challenging since the network needs to simultaneously learn appearance and structure information from the input labels. In contrast, our EdgeGAN learns structure and appearance separately via the proposed edge generator and image generator. Moreover, the explicit supervision from ground truth edge maps also benefits the training of the encoder.

The structures of the proposed edge and image generators are illustrated in Fig. 2. Given the feature maps from the last convolutional layer of the encoder, the edge generator produces intermediate edge features and an edge map, which are further utilized to guide the image generator in producing the intermediate image.

The edge generator contains a stack of convolution layers that produce intermediate edge feature maps. Another convolution layer with Tanh(·) activation is then used to generate the edge map. Meanwhile, the encoder features are also fed into the image generator to produce intermediate image feature maps, followed by another convolution with Tanh(·) activation that produces the intermediate image. The intermediate edge feature maps and the edge map are then used to guide the generation of the image feature maps and the intermediate image via the attention guided edge transfer module, as detailed below.
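A rough sketch of the two branches is given below; the number of layers and channel sizes are assumptions. Each branch stacks convolutions on the shared features and ends with a convolution plus Tanh that produces the edge map or the intermediate image.

```python
import torch
import torch.nn as nn

class GeneratorBranch(nn.Module):
    """A generation branch: stacked convolutions plus a Tanh output head.

    Instantiated once as the edge generator (out_ch=1) and once as the
    image generator (out_ch=3); the depth used here is an assumption.
    """
    def __init__(self, feat_dim=256, out_ch=3, num_layers=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(num_layers)
        )
        self.head = nn.Sequential(nn.Conv2d(feat_dim, out_ch, 3, padding=1), nn.Tanh())

    def forward(self, x):
        intermediate = []               # intermediate features used by the transfer module
        for block in self.blocks:
            x = block(x)
            intermediate.append(x)
        return intermediate, self.head(x)

edge_generator = GeneratorBranch(out_ch=1)   # shared features -> edge map
image_generator = GeneratorBranch(out_ch=3)  # shared features -> intermediate image
```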

Figure 2: Structure of the proposed edge generator (first row), the proposed attention guided edge transfer module (middle row) and the proposed image generator (third row). The edge generator selectively transfers useful local structure information to the image generator via the attention guided transfer module. The symbols ⊕, ⊗ and σ denote element-wise addition, element-wise multiplication and the Sigmoid activation function, respectively.

Attention Guided Edge Transfer. We further propose a novel attention guided edge transfer module to explicitly employ the edge structure information to refine the intermediate image representations. The architecture of the proposed transfer module is illustrated in Fig. 2.

To transfer useful structure information from the edge feature maps to the image feature maps, the edge feature maps are first processed by a Sigmoid(·) activation to generate the corresponding attention maps. Then, we multiply the generated attention maps with the corresponding image feature maps to obtain refined maps that incorporate local structures and details, as in Eq. (1). Finally, the refined features are summed element-wise with the original image features to produce the final edge-refined features, which are further fed to the next convolution layer.

$f_i' = f_i \oplus (\sigma(f_e) \otimes f_i), \qquad (1)$

where $f_e$ and $f_i$ denote the edge and image feature maps at the same layer, $\sigma(\cdot)$ is the Sigmoid function, and $\oplus$ and $\otimes$ denote element-wise addition and multiplication, respectively.

In this way, the image feature maps also carry the local structure information provided by the edge feature maps.

Similarly, to directly employ the structure information of the generated edge map for image generation, we use the attention guided edge transfer module to refine the generated intermediate image with the edge map, as in Eq. (2).

$\hat{I} = I' \oplus (\sigma(E') \otimes I'), \qquad (2)$

where $E'$ and $I'$ denote the generated edge map and the intermediate image, respectively.
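A minimal PyTorch sketch of the two transfer operations is shown below, assuming the edge and image tensors have broadcast-compatible shapes; it illustrates Eq. (1) and Eq. (2) as described above rather than reproducing the released implementation.

```python
import torch

def transfer_features(edge_feat, img_feat):
    """Feature-level transfer (Eq. 1): a Sigmoid attention map computed from the
    edge features reweights the image features, and the result is added back."""
    attention = torch.sigmoid(edge_feat)      # attention values in [0, 1]
    return img_feat + attention * img_feat    # edge-refined image features

def transfer_content(edge_map, intermediate_img):
    """Content-level transfer (Eq. 2): the generated edge map refines the
    intermediate image in the same attentive way (broadcast over RGB channels)."""
    attention = torch.sigmoid(edge_map)
    return intermediate_img + attention * intermediate_img
```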

3.2 Semantic Preserving Image Enhancement

Due to the spatial resolution loss caused by convolution, normalization and down-sampling layers, existing models [59, 45, 46, 9] are not able to fully preserve the semantic information of the input labels, as illustrated in Fig. 8, e.g., the small ‘pole’ is missing and the large ‘fence’ is incomplete. To fix this problem, we propose a novel semantic preserving module, which aims to select class-dependent feature maps and further enhance them under the guidance of the original semantic layout. An overview of the proposed semantic preserving module is shown in Fig. 3.

Specifically, the input to the module is the channel-wise concatenation of the input label map, the generated intermediate edge map and image, and the deep features produced by the shared encoder.

Figure 3: Overview of the proposed semantic preserving module, which captures the semantic information and predicts scaling factors conditioned on the combined feature maps. These learned factors selectively highlight class-dependent feature maps, which are visualized in different colors. The symbols ⊕, ⊗ and σ denote element-wise addition, element-wise multiplication and the Sigmoid activation function, respectively.

Then, we apply a convolution on the concatenated input to produce a new feature map whose channel number equals the number of semantic categories, so that each channel corresponds to a specific semantic category. Next, we apply average pooling to obtain the global information of each class, followed by a Sigmoid(·) activation to derive the scaling factors in Eq. (3), where each value represents the importance of the corresponding class.

$\gamma = \sigma\big(\mathrm{AvgPool}(F)\big), \qquad (3)$

where $F$ denotes the class-wise feature map produced by the convolution on the concatenated input and $\gamma$ contains one scaling factor per semantic class.

Then, the scaling factors are used to reweight the feature map and highlight the corresponding class-dependent channels, as in Eq. (4). The reweighted feature map is further added to the original feature to compensate for the information loss caused by the multiplication:

$F' = F \oplus (\gamma \otimes F). \qquad (4)$

After that, we perform another convolution to enhance the representational capability of the feature. The output has the same size as the original input, which makes the module flexible: it can be plugged into existing architectures to refine their outputs without modifying other parts.

Finally, the feature map is fed into a convolution layer followed by a Tanh(·) activation to obtain the final result. The proposed semantic preserving module enhances the representational power of the model by adaptively re-calibrating semantic class-dependent feature maps, and shares a similar spirit with style transfer [25] and the recent SENet [24] and EncNet [66]. One intuitive example of the utility of the module is the generation of small object classes: such classes are easily missed in the generation results due to spatial resolution loss, whereas our scaling factors can put an emphasis on small objects and help preserve them.
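The following sketch mirrors the description above: a convolution (1x1 in this sketch) maps the concatenated input to one channel per semantic class, global average pooling and a Sigmoid yield the per-class scaling factors of Eq. (3), the rescaled features are added back as in Eq. (4), and a final convolution enhances the representation. All kernel and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class SemanticPreservingModule(nn.Module):
    """Selectively highlights class-dependent feature maps (Eqs. 3-4)."""
    def __init__(self, in_ch, num_classes, out_ch):
        super().__init__()
        self.to_classwise = nn.Conv2d(in_ch, num_classes, 1)  # one channel per semantic class
        self.pool = nn.AdaptiveAvgPool2d(1)                   # global information per class
        self.refine = nn.Conv2d(num_classes, out_ch, 3, padding=1)

    def forward(self, x):
        f = self.to_classwise(x)                # class-dependent feature maps
        gamma = torch.sigmoid(self.pool(f))     # Eq. (3): one scaling factor per class
        f = f + gamma * f                       # Eq. (4): reweight, then add the original back
        return self.refine(f)                   # enhance the representational capability

spm = SemanticPreservingModule(in_ch=260, num_classes=35, out_ch=256)  # illustrative sizes
```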

3.3 Model Training

Multi-Modality Discriminator. To facilitate training the proposed EdgeGAN for high-quality edge and image generation, a novel multi-modality discriminator is developed to simultaneously distinguish outputs in two modality spaces, i.e., edge and image. The structure of the proposed multi-modality discriminator is shown in Fig. 1; it is capable of discriminating both real/fake images and real/fake edges. To discriminate edges, the discriminator loss conditioned on the semantic label and the generated edge (or the real edge) is given in Eq. (5), which guides the model to distinguish real edges from generated ones.

$\mathcal{L}_{edge} = \mathbb{E}_{(S,E)}\big[\log D(S, E)\big] + \mathbb{E}_{S}\big[\log\big(1 - D(S, E')\big)\big], \qquad (5)$

where $S$, $E$ and $E'$ denote the input semantic label, the real edge map and the generated edge map, respectively.

Further, to discriminate real/fake images, the discriminator loss conditioned on the semantic label and the generated intermediate and final images (or the real image) is given in Eq. (6), which guides the model to distinguish real images from generated ones.

$\mathcal{L}_{img} = \mathbb{E}_{(S,I)}\big[\log D(S, I)\big] + \mathbb{E}_{S}\big[\log\big(1 - D(S, I')\big)\big] + \mathbb{E}_{S}\big[\log\big(1 - D(S, I'')\big)\big], \qquad (6)$

where $I$, $I'$ and $I''$ denote the real image, the generated intermediate image and the generated final image, respectively.

Therefore, the total loss of the proposed multi-modality discriminator can be written as $\mathcal{L}_D = \mathcal{L}_{edge} + \mathcal{L}_{img}$.

Optimization Objective. Equipped with the multi-modality discriminator, we elaborate the training objective for the generator as follows. Three losses, i.e., the conditional adversarial loss $\mathcal{L}_{cgan}$, the discriminator feature matching loss $\mathcal{L}_{f}$ and the perceptual loss $\mathcal{L}_{p}$, are used to optimize the proposed EdgeGAN,

$\mathcal{L} = \lambda_{cgan}\mathcal{L}_{cgan} + \lambda_{f}\mathcal{L}_{f} + \lambda_{p}\mathcal{L}_{p}, \qquad (7)$

where $\lambda_{cgan}$, $\lambda_{f}$ and $\lambda_{p}$ control the contribution of the corresponding loss to the total objective $\mathcal{L}$; the feature matching loss $\mathcal{L}_{f}$ matches the intermediate discriminator features of the generated images/edges and the real images/edges; and the perceptual loss $\mathcal{L}_{p}$ matches VGG [53] features extracted from the generated and real images/edges. Through the adversarial game with the discriminator, the generator is encouraged to simultaneously generate reasonable edge maps that capture local structure information and photo-realistic images that are semantically aligned with the input semantic labels.
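As a schematic of the composite objective in Eq. (7), the sketch below combines the three terms with the weights 1, 10 and 10 reported in Section 4; the individual loss values are assumed to be computed elsewhere, and the feature matching helper shown is only one common way to realize that term.

```python
import torch
import torch.nn.functional as F

def generator_objective(adv_loss, fm_loss, perc_loss,
                        lambda_adv=1.0, lambda_fm=10.0, lambda_perc=10.0):
    """Eq. (7): weighted sum of the conditional adversarial, discriminator
    feature matching and perceptual losses."""
    return lambda_adv * adv_loss + lambda_fm * fm_loss + lambda_perc * perc_loss

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between intermediate discriminator features extracted from
    real and generated samples (applied to both edges and images)."""
    return sum(F.l1_loss(fake, real) for real, fake in zip(real_feats, fake_feats))
```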

4 Experiments

Implementation Details. We adopt ResNet [22] as the structure of our encoder. Moreover, we employ the SPADE residual block [45] in our generator, which has been shown to be very effective for semantic image synthesis. For both the image generator and the edge generator, the kernel and padding sizes of the convolutions are chosen to preserve the feature map size (padding 1). For the semantic preserving module, we adopt adaptive average pooling. Spectral normalization [41] is applied to all layers in both the generator and the discriminator.

We follow the standard GAN training procedure and alternately train the generator and the discriminator, i.e., one gradient descent step on the discriminator and then one on the generator: we first update the discriminator with the generator fixed, and then update the generator with the discriminator fixed. We use the Adam solver [32]. The weights $\lambda_{cgan}$, $\lambda_{f}$ and $\lambda_{p}$ in Eq. (7) are set to 1, 10 and 10, respectively; all remaining hyper-parameters in Eq. (6) and Eq. (7) are set to 2. We conduct the experiments on an NVIDIA DGX-1 with 8 V100 GPUs.
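A schematic of one alternating update is sketched below, assuming the generator returns an edge map and an image and that the loss callables implement Eqs. (5)-(7); this is an illustration of the training scheme described above, not the authors' training script.

```python
import torch

def train_step(generator, discriminator, opt_g, opt_d,
               label, real_image, real_edge, d_loss_fn, g_loss_fn):
    """One alternating update: a discriminator step with the generator fixed,
    followed by a generator step with the discriminator fixed."""
    # 1) Discriminator step (generator outputs produced without gradients).
    with torch.no_grad():
        fake_edge, fake_image = generator(label)
    d_loss = d_loss_fn(discriminator, label, real_edge, real_image, fake_edge, fake_image)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator step (discriminator parameters are not updated here).
    fake_edge, fake_image = generator(label)
    g_loss = g_loss_fn(discriminator, label, real_edge, real_image, fake_edge, fake_image)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```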

Datasets and Evaluation Metrics. We follow GauGAN [45] and conduct experiments on two challenging datasets, i.e., Cityscapes [13] and ADE20K [67]. The training and validation sets of Cityscapes contain 3,000 and 500 images, respectively. ADE20K contains challenging scenes with 150 semantic classes and consists of 20,210 training and 2,000 validation images. We adopt the mean Intersection-over-Union (mIoU), pixel accuracy (Acc) and Fréchet Inception Distance (FID) [23] as the evaluation metrics. All images from Cityscapes and ADE20K are re-scaled to a fixed resolution per dataset. For both datasets, we perform 200 epochs of training with batch size 32, and the learning rate is linearly decayed to 0 from epoch 100 to 200.
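The linear decay schedule (constant learning rate for the first 100 epochs, then decayed linearly to zero by epoch 200) can be written with a standard PyTorch scheduler as sketched below; the base learning rate and the stand-in module are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in module; any generator/discriminator works here
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # base LR is illustrative

def linear_decay(epoch, total_epochs=200, decay_start=100):
    """Multiplicative LR factor: 1.0 until `decay_start`, then linear decay to 0."""
    if epoch < decay_start:
        return 1.0
    return max(0.0, 1.0 - (epoch - decay_start) / float(total_epochs - decay_start))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)

for epoch in range(200):
    # ... one epoch of alternating generator/discriminator updates (batch size 32) ...
    scheduler.step()
```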

Figure 4: Visual results generated by different methods on Cityscapes (top) and ADE20K (bottom).
AMT Cityscapes ADE20K
Ours vs. CRN [9] 70.28 81.35
Ours vs. Pix2pixHD [59] 60.85 85.18
Ours vs. SIMS [46] 57.67 N/A
Ours vs. GauGAN [45] 56.54 60.49
Ours vs. CC-FPSE [38] 55.81 57.75
Table 1: User preference study on Cityscapes and ADE20K. The numbers indicate the percentage of users who favor the results of the proposed EdgeGAN over the competing method. For this metric, higher is better.

4.1 Comparisons with State-of-the-Art

Qualitative Comparisons. Qualitative results on Cityscapes and ADE20K compared with existing methods, i.e., Pix2pixHD [59], CRN [9] and SIMS [46], are shown in Fig. 4. We can see that the proposed EdgeGAN achieves significantly better results with fewer visual artifacts than the other baselines.

To further validate the effectiveness of the proposed EdgeGAN, we compare it with two stronger baselines, i.e., GauGAN [45] and CC-FPSE [38]. Note that we downloaded their released pretrained models and generated the results ourselves for a fair comparison. Results compared with GauGAN are shown in Fig. 5. We observe that the proposed EdgeGAN generates sharper images than GauGAN, especially in local structures and details. Moreover, as shown in Fig. 6, the proposed EdgeGAN achieves significantly better results than CC-FPSE, which tends to produce results with many visual artifacts on both datasets.

User Study. We follow the evaluation protocol of GauGAN [45] and conduct a user study. Specifically, we show the participants an input semantic label and two images translated by different models, and ask them to choose the generated image that better corresponds to the semantic label. The participants are given unlimited time to make the decision.

Figure 5: Visual results generated by GauGAN [45] and the proposed EdgeGAN on Cityscapes (top) and ADE20K (bottom).
Method Cityscapes ADE20K
mIoU Acc FID mIoU Acc FID
CRN [9] 52.4 77.1 104.7 22.4 68.8 73.3
SIMS [46] 47.2 75.5 49.7 N/A N/A N/A
Pix2pixHD [59] 58.3 81.4 95.0 20.3 69.2 81.8
GauGAN [45] 62.3 81.9 71.8 38.5 79.9 33.9
CC-FPSE [38] 65.5 82.3 54.3 43.7 82.9 31.7
EdgeGAN (Ours) 64.5 82.5 57.1 42.0 82.0 32.4
Table 2: Our method achieves very competitive results compared to the current leading methods in semantic segmentation scores (mIoU and Acc) and FID. For mIoU and Acc, higher is better. For FID, lower is better.

Results compared with Pix2pixHD [59], CRN [9], SIMS [46], GauGAN [45] and CC-FPSE [38] are shown in Table 1. Users favor our synthesized results on both datasets over all competing methods, including GauGAN and CC-FPSE, further validating that the images generated by the proposed EdgeGAN are more natural and photo-realistic.

Quantitative Comparisons. Quantitative results in terms of mIoU, Acc and FID are shown in Table 2. The proposed EdgeGAN outperforms most existing leading methods by a large margin and is only slightly behind CC-FPSE [38]; however, CC-FPSE generates significantly worse visual results than ours, as shown in Fig. 6. Moreover, we report the number of model parameters in Table 3. The proposed EdgeGAN has far fewer parameters than CC-FPSE on both datasets, which means that it can be trained with less training time and GPU memory.

Figure 6: Visual results generated by CC-FPSE [38] and the proposed EdgeGAN on Cityscapes (top) and ADE20K (bottom).
Method Cityscapes ADE20K
G D Total G D Total
GauGAN [45] 93.0M 5.6M 98.6M 96.5M 5.8M 102.3M
CC-FPSE [38] 138.6M 5.2M 143.8M (+45.2M) 151.2M 5.2M 156.4M (+54.1M)
EdgeGAN (Ours) 93.2M 5.6M 98.8M (+0.2M) 97.2M 5.8M 103.0M (+0.7M)
Table 3: Comparison of the number of model parameters. ‘G’ and ‘D’ denote Generator and Discriminator, respectively.

Visualization of Edge and Attention Maps. We also visualize the generated edge and attention maps on both datasets in Fig. 7. We observe that the proposed EdgeGAN can generate reasonable edge maps according to the input labels, thus the generated edge maps can be used to provide more local structure information for generating more photo-realistic images.

Visualization of Segmentation Maps. We follow GauGAN [45] and apply pre-trained segmentation networks [63, 61] on the generated images to produce segmentation maps. Results compared with GauGAN [45] are shown in Fig. 8. We consistently observe that the proposed EdgeGAN generates better semantic labels than GauGAN on both datasets.

4.2 Ablation Study

Variants of EdgeGAN. We conduct extensive ablation studies on Cityscapes [13] to evaluate the different components of the proposed EdgeGAN. As shown in Table 4, we consider four variants: (i) a baseline that uses only the shared encoder and the proposed image generator to synthesize the target images; (ii) a variant that adds the proposed edge generator to simultaneously produce edge maps and images; (iii) a variant that further connects the image generator and the edge generator via the proposed attention guided edge transfer module; and (iv) our full model, which additionally employs the proposed semantic preserving module to further improve the quality of the final results.

Effect of Edge Guided Generation Strategy. The results of the ablation study are shown in Table 4. When using the proposed edge generator to produce the corresponding edge map from the input label, performance on all evaluation metrics is improved: we obtain gains of 1.6 and 0.3 points on mIoU and Acc, and a 4.7-point reduction in FID, which confirms the effectiveness of the proposed edge guided generation strategy.

Figure 7: Edge and attention maps generated by the proposed EdgeGAN on Cityscapes (top) and ADE20K (bottom).
Variants of EdgeGAN mIoU Acc FID
(i) Encoder + image generator 58.6 (+0.0) 81.4 (+0.0) 65.7 (-0.0)
(ii) + edge generator 60.2 (+1.6) 81.7 (+0.3) 61.0 (-4.7)
(iii) + attention guided edge transfer 61.5 (+1.3) 82.0 (+0.3) 59.0 (-2.0)
(iv) + semantic preserving module (full) 64.5 (+3.0) 82.5 (+0.5) 57.1 (-1.9)
Total gain (+5.9) (+1.1) (-8.6)
Table 4: Quantitative comparison of different variants of the proposed EdgeGAN on Cityscapes. For mIoU and Acc, higher is better. For FID, lower is better.

Effect of Attention Guided Edge Transfer Module. We observe that the edge structure information implicitly learned by variant (ii) is not enough for such a challenging task. Thus we further adopt the proposed attention guided edge transfer module to transfer useful edge structure information from the edge generation branch to the image generation branch, obtaining further gains of 1.3 and 0.3 points on mIoU and Acc and a 2.0-point reduction in FID. This means that the proposed transfer module indeed learns richer feature representations with more convincing structure cues and details and transfers them from the edge generator to the image generator, confirming our design motivation.

Effect of Semantic Preserving Module. By adding the proposed semantic preserving module, the overall performance is further boosted by 3.0 and 0.5 points on mIoU and Acc and a 1.9-point reduction in FID. This means the proposed semantic preserving module indeed learns and highlights class-specific semantic feature maps, leading to better generation results.

Figure 8: Segmentation maps generated by GauGAN [45] and the proposed EdgeGAN on Cityscapes (top) and ADE20K (bottom). ‘EdgeGAN I’ and ‘EdgeGAN II’ denote the intermediate output before the proposed semantic preserving module and the final output after it, respectively.
Stages of EdgeGAN Cityscapes ADE20K
mIoU Acc FID mIoU Acc FID
EdgeGAN I 61.7 82.1 59.1 39.6 80.9 34.2
EdgeGAN II 64.5 82.5 57.1 42.0 82.0 32.4
Table 5: Performance before (‘EdgeGAN I’) and after (‘EdgeGAN II’) using the proposed semantic preserving module. For mIoU and Acc, higher is better. For FID, lower is better.

In Fig. 8, we show some samples of the generated semantic maps. We observe that the semantic maps produced from the results after the proposed semantic preserving module (i.e., ‘Label by EdgeGAN II’ in Fig. 8) are more accurate than those produced without it (‘Label by EdgeGAN I’ in Fig. 8). Moreover, we provide quantitative results on both datasets in Table 5. The proposed semantic preserving module indeed learns better class-specific feature representations, leading to better performance on both datasets. Lastly, we also observe that our generated semantic maps are much better than those generated by GauGAN [45]. Both quantitative and qualitative results confirm the effectiveness of the proposed semantic preserving module.

5 Conclusions

We propose a novel Edge guided GAN (EdgeGAN) for challenging semantic image synthesis tasks. EdgeGAN introduces three core components: an edge guided image generation strategy, an attention guided edge transfer module and a semantic preserving module. The first component generates edge maps from the input semantic labels. The second selectively transfers useful structure information from the edge generation branch to the image generation branch. The third alleviates the spatial resolution losses caused by various operations in deep networks. Extensive experiments show that EdgeGAN achieves significantly better results than existing methods. Furthermore, we believe that the proposed modules can be easily plugged into existing GAN architectures to address other generation tasks.

References

  • [1] B. AlBahar and J. Huang (2019) Guided image-to-image translation with bi-directional feature transformation. In ICCV, Cited by: §2.
  • [2] S. Azadi, M. Tschannen, E. Tzeng, S. Gelly, T. Darrell, and M. Lucic (2019) Semantic bottleneck scene generation. arXiv preprint arXiv:1911.11357. Cited by: §2.
  • [3] A. Bansal, Y. Sheikh, and D. Ramanan (2019) Shapes and context: in-the-wild image synthesis & manipulation. In CVPR, Cited by: §2.
  • [4] D. Bau, H. Strobelt, W. Peebles, J. Wulff, B. Zhou, J. Zhu, and A. Torralba (2019) Semantic photo manipulation with a generative image prior. ACM TOG 38 (4), pp. 1–11. Cited by: §1.
  • [5] D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba (2019) Gan dissection: visualizing and understanding generative adversarial networks. In ICLR, Cited by: §1.
  • [6] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. In ICLR, Cited by: §2.
  • [7] J. Canny (1986) A computational approach to edge detection. IEEE TPAMI (6), pp. 679–698. Cited by: §3.
  • [8] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In ICCV, Cited by: §2.
  • [9] Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In ICCV, Cited by: §1, §1, §2, §3.1, §3.2, §4.1, §4.1, Table 1, Table 2.
  • [10] X. Chen, C. Xu, X. Yang, and D. Tao (2018) Attention-gan for object transfiguration in wild images. In ECCV, Cited by: §2.
  • [11] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §2.
  • [12] Y. Choi, Y. Uh, J. Yoo, and J. Ha (2020) StarGAN v2: diverse image synthesis for multiple domains. In CVPR, Cited by: §2.
  • [13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: 3rd item, §1, §4.2, §4, §6, §8.
  • [14] P. Esser, E. Sutter, and B. Ommer (2018) A variational u-net for conditional appearance and shape generation. In CVPR, Cited by: §2.
  • [15] A. Ghosh, R. Zhang, P. K. Dokania, O. Wang, A. A. Efros, P. H. Torr, and E. Shechtman (2019) Interactive sketch & fill: multiclass sketch-to-image translation. In ICCV, Cited by: §2.
  • [16] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola (2019) GANalyze: toward visual definitions of cognitive image properties. In ICCV, Cited by: §2.
  • [17] R. Gong, W. Li, Y. Chen, and L. V. Gool (2019) DLOW: domain flow for adaptation and generalization. In CVPR, Cited by: §2.
  • [18] X. Gong, S. Chang, Y. Jiang, and Z. Wang (2019) Autogan: neural architecture search for generative adversarial networks. In ICCV, Cited by: §2.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §2.
  • [20] S. Gu, J. Bao, H. Yang, D. Chen, F. Wen, and L. Yuan (2019) Mask-guided portrait editing with conditional gans. In CVPR, Cited by: §1, §2.
  • [21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved training of wasserstein gans. In NeurIPS, Cited by: §2.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1, §4.
  • [23] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, Cited by: §4.
  • [24] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §3.2.
  • [25] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: §3.2.
  • [26] M. Huh, S. Sun, and N. Zhang (2019) Feedback adversarial learning: spatial feedback for improving generative adversarial networks. In CVPR, Cited by: §2.
  • [27] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §1, §2, §2, §2.
  • [28] A. Jahanian, L. Chai, and P. Isola (2020) On the “steerability” of generative adversarial networks. In ICLR, Cited by: §2.
  • [29] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR, Cited by: §2.
  • [30] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, Cited by: §2.
  • [31] J. Kim, M. Kim, H. Kang, and K. Lee (2020) U-gat-it: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. In ICLR, Cited by: §2.
  • [32] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.
  • [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §3.1.
  • [34] B. Li, X. Qi, T. Lukasiewicz, and P. Torr (2019) Controllable text-to-image generation. In NeurIPS, Cited by: §2.
  • [35] J. Li, F. He, L. Zhang, B. Du, and D. Tao (2019) Progressive reconstruction of visual structure for image inpainting. In ICCV, Cited by: §2.
  • [36] C. H. Lin, C. Chang, Y. Chen, D. Juan, W. Wei, and H. Chen (2019) COCO-gan: generation by parts via conditional coordinating. In ICCV, Cited by: §2.
  • [37] M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019) Few-shot unsupervised image-to-image translation. In ICCV, Cited by: §2.
  • [38] X. Liu, G. Yin, J. Shao, X. Wang, et al. (2019) Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In NeurIPS, Cited by: §1, §1, §2, §3.1, Figure 6, §4.1, §4.1, §4.1, Table 1, Table 2, Table 3, §6, Figure 10, Figure 11, Figure 12, Figure 9, Edge Guided GANs with Semantic Preserving for Semantic Image Synthesis.
  • [39] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim (2018) Unsupervised attention-guided image-to-image translation. In NeurIPS, Cited by: §2.
  • [40] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.
  • [41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §4.
  • [42] K. Nazeri, E. Ng, T. Joseph, F. Qureshi, and M. Ebrahimi (2019) EdgeConnect: structure guided image inpainting using edge prediction. In ICCV Workshops, Cited by: §2.
  • [43] K. Nazeri, H. Thasarathan, and M. Ebrahimi (2019) Edge-informed single image super-resolution. In ICCV Workshops, Cited by: §2.
  • [44] J. Pan, C. Wang, X. Jia, J. Shao, L. Sheng, J. Yan, and X. Wang (2019) Video generation from single semantic label map. In CVPR, Cited by: §2.
  • [45] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, Cited by: §1, §1, §2, §2, §2, §3.1, §3.2, Figure 5, Figure 8, §4.1, §4.1, §4.1, §4.1, §4.2, Table 1, Table 2, Table 3, §4, §4, Figure 20, Figure 21, Figure 22, Figure 23, Figure 24, Figure 25, Figure 26, Figure 27, Figure 28, Figure 29, §8, §8.
  • [46] X. Qi, Q. Chen, J. Jia, and V. Koltun (2018) Semi-parametric image synthesis. In CVPR, Cited by: §1, §1, §2, §3.1, §3.2, §4.1, §4.1, Table 1, Table 2.
  • [47] Y. Ren, X. Yu, R. Zhang, T. H. Li, S. Liu, and G. Li (2019) StructureFlow: image inpainting via structure-aware appearance flow. In ICCV, Cited by: §2.
  • [48] T. R. Shaham, T. Dekel, and T. Michaeli (2019) SinGAN: learning a generative model from a single natural image. In ICCV, Cited by: §2.
  • [49] F. Shama, R. Mechrez, A. Shoshan, and L. Zelnik-Manor (2019) Adversarial feedback loop. In ICCV, Cited by: §2.
  • [50] A. Shocher, S. Bagon, P. Isola, and M. Irani (2019) Ingan: capturing and remapping the “dna” of a natural image. In ICCV, Cited by: §2.
  • [51] A. Shocher, S. Bagon, P. Isola, and M. Irani (2019) InGAN: capturing and remapping the “dna” of a natural image. In ICCV, Cited by: §2.
  • [52] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe (2018) Deformable gans for pose-based human image generation. In CVPR, Cited by: §2.
  • [53] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §3.1, §3.3.
  • [54] H. Tang, W. Wang, D. Xu, Y. Yan, and N. Sebe (2018) Gesturegan for hand gesture-to-gesture translation in the wild. In ACM MM, Cited by: §2.
  • [55] H. Tang, D. Xu, N. Sebe, Y. Wang, J. J. Corso, and Y. Yan (2019) Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In CVPR, Cited by: §2, §2.
  • [56] M. Wang, G. Yang, R. Li, R. Liang, S. Zhang, P. M. Hall, and S. Hu (2019) Example-guided style-consistent image synthesis from semantic labeling. In CVPR, Cited by: §2.
  • [57] S. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros (2020) CNN-generated images are surprisingly easy to spot… for now. In CVPR, Cited by: §2.
  • [58] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. In NeurIPS, Cited by: §2.
  • [59] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, Cited by: §1, §1, §2, §2, §3.1, §3.2, §4.1, §4.1, Table 1, Table 2.
  • [60] P. Wu, Y. Lin, C. Chang, E. Y. Chang, and S. Liao (2019) Relgan: multi-domain image-to-image translation via relative attributes. In ICCV, Cited by: §2.
  • [61] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In ECCV, Cited by: §4.1, §8.
  • [62] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In CVPR, Cited by: §2.
  • [63] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In CVPR, Cited by: §4.1, §8.
  • [64] X. Yu, Y. Chen, S. Liu, T. Li, and G. Li (2019) Multi-mapping image-to-image translation via learning disentanglement. In NeurIPS, Cited by: §2.
  • [65] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In ICML, Cited by: §2.
  • [66] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In CVPR, Cited by: §3.2.
  • [67] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In CVPR, Cited by: 3rd item, §1, §4, §6, §8.
  • [68] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.
  • [69] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai (2019) Progressive pose attention transfer for person image generation. In CVPR, Cited by: §2.

6 Comparisons with State-of-the-Art

We provide more generation results of the proposed EdgeGAN and CC-FPSE [38] on both the Cityscapes [13] and ADE20K [67] datasets. Note that we generated the results of CC-FPSE [38] using the pretrained models released by the authors (https://github.com/xh-liu/CC-FPSE) for a fair comparison.

Results are shown in Fig. 9, 10, 11 and 12. We observe that the proposed EdgeGAN consistently achieves photo-realistic results with fewer visual artifacts than CC-FPSE on both challenging datasets.

7 Visualization of Edge and Attention Maps

We visualize the generated edge and attention maps on both datasets in Fig. 13, 14, 15, 16, 17, 18 and 19. We observe that the proposed EdgeGAN generates good edge maps according to the input semantic labels; these edge maps thus provide more local structure guidance for generating more photo-realistic images. These visualization results further demonstrate the effectiveness of the proposed edge guided image generation strategy.

8 Visualization of Segmentation Maps

We follow GauGAN [45] and use the state-of-the-art segmentation networks on the generated images to produce the corresponding segmentation maps: DRN-D-105 [63] for Cityscapes [13] and UperNet101 [61] for ADE20K [67].

In Fig. 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29, we show samples of the generated segmentation maps on both datasets. ‘EdgeGAN I’ and ‘EdgeGAN II’ in these figures denote the intermediate output before the proposed semantic preserving module and the final output after it, respectively.

We observe that the segmentation maps produced by the results after the proposed semantic preserving module (i.e., ‘Label by EdgeGAN II’ in these figures) are more accurate than those without using the proposed semantic preserving module (‘Label by EdgeGAN I’ in these figures), which further validates the effectiveness of the proposed semantic preserving module.

Moreover, we consistently observe in these figures that the proposed EdgeGAN generates significantly better segmentation maps than GauGAN [45], especially on local texture and small-scale objects.

Figure 9: Visual results generated by CC-FPSE [38] and EdgeGAN on Cityscapes.
Figure 10: Visual results generated by CC-FPSE [38] and EdgeGAN on Cityscapes.
Figure 11: Visual results generated by CC-FPSE [38] and EdgeGAN on ADE20K.
Figure 12: Visual results generated by CC-FPSE [38] and EdgeGAN on ADE20K.
Figure 13: Edge and attention maps generated by EdgeGAN on Cityscapes.
Figure 14: Edge and attention maps generated by EdgeGAN on Cityscapes.
Figure 15: Edge and attention maps generated by EdgeGAN on ADE20K.
Figure 16: Edge and attention maps generated by EdgeGAN on ADE20K.
Figure 17: Edge and attention maps generated by EdgeGAN on ADE20K.
Figure 18: Edge and attention maps generated by EdgeGAN on ADE20K.
Figure 19: Edge and attention maps generated by EdgeGAN on ADE20K.
Figure 20: Semantic maps generated by GauGAN [45] and EdgeGAN on Cityscapes.
Figure 21: Semantic maps generated by GauGAN [45] and EdgeGAN on Cityscapes.
Figure 22: Semantic maps generated by GauGAN [45] and EdgeGAN on Cityscapes.
Figure 23: Semantic maps generated by GauGAN [45] and EdgeGAN on Cityscapes.
Figure 24: Semantic maps generated by GauGAN [45] and EdgeGAN on Cityscapes.
Figure 25: Semantic maps generated by GauGAN [45] and EdgeGAN on ADE20K.
Figure 26: Semantic maps generated by GauGAN [45] and EdgeGAN on ADE20K.
Figure 27: Semantic maps generated by GauGAN [45] and EdgeGAN on ADE20K.
Figure 28: Semantic maps generated by GauGAN [45] and EdgeGAN on ADE20K.
Figure 29: Semantic maps generated by GauGAN [45] and EdgeGAN on ADE20K.