Edge Guided GANs with Semantic Preserving for Semantic Image Synthesis
We propose a novel Edge guided Generative Adversarial Network (EdgeGAN) for photo-realistic image synthesis from semantic layouts. Although considerable improvement has been achieved on this task, the quality of synthesized images is still far from satisfactory due to two largely unresolved challenges. First, semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. Second, widely adopted CNN operations such as convolution, down-sampling and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects). To tackle the first challenge, we propose to use edges as an intermediate representation, which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. Further, to preserve the semantic information, we design an effective module that selectively highlights class-dependent feature maps according to the original semantic layout. Extensive experiments on two challenging datasets show that the proposed EdgeGAN can generate significantly better results than state-of-the-art methods. The source code and trained models are available at https://github.com/Ha0Tang/EdgeGAN.
Semantic image synthesis refers to the task of generating photo-realistic images conditioned on pixel-level semantic labels. This task has a wide range of applications such as image editing and content generation [9, 27, 45, 20, 4, 5]. Although existing approaches such as [9, 27, 45, 20, 38, 46] have conducted interesting explorations, we still observe unsatisfactory aspects, mainly in the generated local structures and small-scale objects, which we believe are due to two main reasons. First, conventional methods [45, 59, 38] generally take the semantic label map as input directly. However, the input label map provides only structural information between different semantic-class regions and does not contain any structural information within each semantic-class region, making it difficult to synthesize rich local structures within each class. Taking the label map in Fig. 1 as an example, the generator does not have enough structural guidance to produce a realistic bed, window and curtain from the input label alone. Second, classic deep network architectures are constructed by stacking convolutional, down-sampling, normalization, non-linearity and up-sampling layers, which causes spatial resolution loss of the input semantic labels.
To address both issues, in this paper we propose a novel Edge guided Generative Adversarial Network (EdgeGAN) for semantic image synthesis. The overall framework of the proposed EdgeGAN is shown in Fig. 1. We first propose an edge generator to produce edge features and edge maps; the generated edge features and edge maps are then selectively transferred to the image generator via the proposed attention guided edge transfer module to improve the quality of the synthesized images. Moreover, to tackle the spatial resolution loss caused by common operations in deep networks, we propose an effective semantic preserving module, which selectively highlights class-dependent feature maps according to the original semantic layout. Finally, we develop a multi-modality discriminator to simultaneously distinguish the outputs in two modal spaces, i.e., the edge and the image space. All the proposed modules are jointly optimized in an end-to-end fashion so that they can benefit from one another during training.
We conduct extensive experiments on two challenging datasets, i.e., Cityscapes  and ADE20K . Both qualitative and quantitative results show that the proposed EdgeGAN is able to produce remarkably better results than existing baseline models such as CRN , Pix2pixHD , SIMS , GauGAN  and CC-FPSE , regarding both the visual fidelity and the alignment with the input semantic labels.
To summarize, the contributions of this paper are as follows:
We propose a novel Edge Guided GAN (EdgeGAN) for challenging semantic image synthesis tasks. To the best of our knowledge, we are the first to explore the edge generation from semantic layouts and then utilize the generated edges to guide the generation of realistic images.
We propose an effective attention guided edge transfer module to selectively transfer useful edge structure information from the edge generation branch to the image generation branch. We also design a new semantic preserving module that highlights class-dependent feature maps based on the input semantic label map to generate semantically consistent results, which has not been investigated by existing GAN-based generation works.
Generative Adversarial Networks (GANs)  have two important components, i.e., a generator and a discriminator. Both are trained in an adversarial way to achieve a balance. Recently, GANs have shown the capability of generating realistic images [65, 6, 30, 49, 26, 21, 36, 16, 50, 37, 48, 29, 12, 18, 28, 51, 57]. Moreover, to generate user-specific images, Conditional GANs (CGANs)  have been proposed. CGANs usually combine a vanilla GAN and some external information such as class labels [11, 60], human poses [14, 69, 52, 8], conditional images [27, 56], text descriptions [34, 64] and segmentation maps [59, 45, 58, 55, 20, 2, 44].
CGANs have been widely applied to image-to-image translation tasks. For instance, Isola et al. propose Pix2pix, which employs a CGAN to learn a translation mapping from input to output image domains, such as map-to-photo and day-to-night. To further improve the quality of the generated images, the attention mechanism has recently been investigated in image-to-image translation tasks, such as [55, 31, 39, 10, 62].
Different from previous attention-related image generation works, we propose a novel attention guided edge transfer module to transfer useful edge structure information from the edge generation branch to the image generation branch at two different levels, i.e., feature level and content level. To the best of our knowledge, our module is the first attempt to incorporate both edge feature attention and edge content attention within a GAN framework for image-to-image translation tasks.
Edge Guided Image Generation.
Edge maps are usually adopted in image inpainting [47, 42, 35] and image super-resolution tasks to reconstruct the missing structure information of the inputs. For example, Nazeri et al.  propose an edge generator to hallucinate edges in the missing regions given the edges of the known regions, which can be regarded as an edge completion problem. Using edge images as structural guidance,  achieves good results even for some highly structured scenes. Moreover, Ghosh et al.  propose an interactive GAN-based sketch-to-image translation model that helps users easily create images of simple objects. Pix2pix  adopts edge maps as input and aims to generate photo-realistic shoe and handbag images, which can be seen as an edge-to-image translation problem.
Different from previous works, we propose a novel edge generator to perform a new task, i.e., semantic label-to-edge translation. To the best of our knowledge, this is the first attempt to generate realistic edge maps from semantic labels. The generated edge maps, which carry richer local structure information, are then used to improve the quality of the synthesized images.
Semantic Image Synthesis aims to generate a photo-realistic image from a semantic label map [59, 9, 46, 45, 38, 3]. With the semantic information as guidance, existing methods have achieved promising performance. However, we can still observe unsatisfying aspects, especially on the generation of the small-scale objects, which we believe is mainly due to the problem of spatial resolution losses associated with deep network operations such as convolution, normalization and down-sampling, etc.
To alleviate this problem, Park et al. propose GauGAN , which uses the input semantic labels to modulate the activations in normalization layers through a spatially-adaptive transformation. However, the spatial resolution losses caused by other operations such as convolution and down-sampling remain unresolved. Moreover, we observe that a given input label map typically contains only a few of the semantic classes in the entire dataset. Thus the generator should focus more on learning these present semantic classes rather than all of them.
To tackle both limitations, we propose a novel semantic preserving module, which aims to selectively highlight class-dependent feature maps according to the input label for generating semantically consistent image. This idea is not investigated by existing GAN-based generation works.
In this section, we describe the proposed Edge Guided GAN (EdgeGAN) for semantic image synthesis. We first introduce an overview of the proposed EdgeGAN, and then introduce the details of each module. Finally, we present the optimization objective.
Framework Overview. Fig. 1 shows the overall structure of the proposed EdgeGAN for semantic image synthesis, which consists of a semantic and edge guided generator and a multi-modality discriminator. The generator consists of five components: 1) a parameter-sharing convolutional encoder that produces deep feature maps; 2) an edge generator that generates edge maps taking as input deep features from the encoder; 3) an image generator that produces intermediate images; 4) an attention guided edge transfer module that forwards useful structure information from the edge generator to the image generator; and 5) a semantic preserving module that selectively highlights class-dependent feature maps according to the input label for generating semantically consistent images. Meanwhile, to effectively train the network, we propose a multi-modality discriminator that simultaneously distinguishes the outputs in two modalities, i.e., edge and image space.
EdgeGAN takes a semantic layout as input and outputs a semantically corresponding photo-realistic image. During training, the ground truth edge map is extracted from the corresponding ground truth image with the Canny edge detector .
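The paper extracts ground-truth edges with the Canny detector; as a minimal, dependency-free illustration of this preprocessing step, the sketch below thresholds a Sobel gradient magnitude instead (a simpler stand-in for Canny; the function name and threshold are our own illustrative choices):

```python
import numpy as np

def sobel_edges(gray, thresh=0.25):
    """Binary edge map from normalized gradient magnitude.

    A simple stand-in for the Canny detector used in the paper:
    no Gaussian smoothing, non-maximum suppression, or hysteresis.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # Sobel x
    ky = kx.T                                                          # Sobel y
    H, W = gray.shape
    padded = np.pad(gray.astype(float), 1, mode="edge")
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    # Correlate with the 3x3 kernels via shifted views.
    for i in range(3):
        for j in range(3):
            patch = padded[i:i + H, j:j + W]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    mag = np.hypot(gx, gy)
    mag /= (mag.max() + 1e-8)  # normalize to [0, 1]
    return (mag > thresh).astype(np.uint8)
```

A vertical intensity step, for example, yields an edge response only along the columns adjacent to the step.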
Parameter-Sharing Encoder. The backbone encoder can employ any network structure, such as the commonly used AlexNet , VGG  or ResNet . We directly utilize the feature maps from the last convolutional layer as the deep feature representation, where the input label has the spatial size of the semantic layout and one channel per semantic class. Optionally, one can always combine multiple intermediate feature maps to enhance the feature representation.
The encoder is shared by the edge generator and the image generator. Then, the gradients from the two generators all contribute to updating the parameters of the encoder. This compact design can potentially enhance the deep representations since the encoder can simultaneously learn structure (from the edge generation branch) and appearance (from the image generation branch) representations.
Edge Guided Image Generation. As discussed, the lack of detailed structure or geometry guidance makes it extremely difficult for the generator to produce realistic local structures and details. To overcome this limitation, we propose to adopt the edge as guidance. A novel edge generator is designed to directly generate the edge maps from the input semantic labels. This also facilitates the shared encoder to learn more local structures of the targeted images. Meanwhile, the image generator aims to generate photo-realistic images from the input labels. In this way, the encoder is promoted to learn the appearance information of the targeted images.
Previous works [45, 38, 46, 9, 59] directly use deep networks to output the target image, which is challenging since the network needs to simultaneously learn appearance and structure information from the input labels. In contrast, our EdgeGAN learns structure and appearance separately via the proposed edge generator and image generator. Moreover, the explicit guidance from ground truth edge maps also benefits training the encoder.
The framework of the proposed edge and image generators is illustrated in Fig. 2. Given the feature maps from the last convolutional layer of the encoder, the edge generator produces edge features and edge maps, which are further utilized to guide the image generator in producing the intermediate image.
The edge generator contains several convolution layers and correspondingly produces intermediate feature maps. After that, another convolution layer with a Tanh(·) non-linear activation is utilized to generate the edge map. Meanwhile, the encoder feature maps are also fed into the image generator to generate intermediate feature maps, and another convolution operation with a Tanh(·) non-linear activation is adopted to produce the intermediate image. In addition, the intermediate edge feature maps and the edge map are utilized to guide the generation of the image feature maps and the intermediate image via the attention guided edge transfer module detailed below.
Attention Guided Edge Transfer. We further propose a novel attention guided edge transfer module to explicitly employ the edge structure information to refine the intermediate image representations. The architecture of the proposed transfer module is illustrated in Fig. 2.
To transfer useful structure information from the edge feature maps to the image feature maps, the edge feature maps are first processed by a Sigmoid(·) activation function to generate the corresponding attention maps. Then, we multiply the generated attention maps with the corresponding image feature maps to obtain refined maps that incorporate local structures and details, as in Eq. (1). Finally, the edge refined features are summed element-wise with the original image features to produce the final edge refined features, which are further fed to the next convolution layer.
In this way, the image feature maps also carry the local structure information provided by the edge feature maps.
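The feature-level transfer described above reduces to a sigmoid gate followed by a residual sum. A minimal NumPy sketch (function names are our own; the paper operates on convolutional feature tensors):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_transfer(img_feat, edge_feat):
    """Attention guided edge transfer at the feature level.

    attention = Sigmoid(edge features); the attended image features
    are then summed element-wise with the original image features.
    """
    attn = sigmoid(edge_feat)       # attention maps in (0, 1)
    refined = attn * img_feat       # emphasize edge-aligned responses
    return img_feat + refined       # residual sum keeps the original content
```

When the edge features saturate the sigmoid toward 1, the image features are roughly doubled at those locations; toward 0, they pass through unchanged.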
Similarly, to directly employ the structure information from the generated edge map for image generation, we adopt the attention guided edge transfer module to refine the generated image directly with edge information as Eq. (2).
Due to the spatial resolution loss caused by convolution, normalization and down-sampling layers, existing models [59, 45, 46, 9] are unable to fully preserve the semantic information of the input labels, as illustrated in Fig. 8, e.g., the small ‘pole’ is missing and the large ‘fence’ is incomplete. To fix this problem, we propose a novel semantic preserving module, which aims to select class-dependent feature maps and further enhance them under the guidance of the original semantic layout. An overview of the proposed semantic preserving module is shown in Fig. 3.
Specifically, the input denoted as to the module is the concatenation of the input label , the generated intermediate edge map and image , and the deep feature produced from the shared encoder .
Then, we apply a convolution operation on the input to produce a new feature map whose channel number equals the number of semantic categories, where each channel corresponds to a specific semantic category. Next, we apply an average pooling operation on this feature map to obtain the global information of each class, followed by a Sigmoid(·) activation function to derive scaling factors as in Eq. (3), where each value represents the importance of the corresponding class.
Then, the scaling factors are adopted to reweight the feature map and highlight the corresponding class-dependent feature maps as in Eq. (4). The reweighted feature map is further added to the original feature to compensate for the information loss due to the multiplication.
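The class-reweighting step can be sketched as a squeeze-and-excitation-style operation. The NumPy illustration below assumes the feature map already has one channel per semantic class (the preceding convolution is omitted; names are our own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_preserve(feat):
    """Semantic preserving reweighting.

    feat: array of shape (C, H, W), one channel per semantic class.
    Global average pooling -> sigmoid scaling factors -> per-class
    reweighting, plus a residual sum to compensate information loss.
    """
    gamma = sigmoid(feat.mean(axis=(1, 2)))   # (C,) per-class importance
    reweighted = gamma[:, None, None] * feat  # highlight class-dependent maps
    return feat + reweighted                  # residual compensation
```

Channels with strong activations receive scaling factors near 1 and are amplified, while inactive channels are left essentially untouched.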
After that, we perform another convolution operation to enhance the representational capability of the feature. The output has the same size as the original input to the module, which makes the module flexible: it can be plugged into other existing architectures to refine their outputs without modifying other parts.
Finally, the feature map is fed into a convolution layer followed by a Tanh(·) non-linear activation layer to obtain the final result. The proposed semantic preserving module enhances the representational power of the model by adaptively re-calibrating semantic class-dependent feature maps, and shares a similar spirit with style transfer  and the recent SENet  and EncNet . One intuitive example of the module's utility is the generation of small object classes: small objects are easily missed in the generation results due to spatial resolution loss, while our scaling factors can put an emphasis on small objects and help preserve them.
Multi-Modality Discriminator. To facilitate training the proposed EdgeGAN for high-quality edge and image generation, a novel multi-modality discriminator is developed to simultaneously distinguish outputs in two modality spaces, i.e., edge and image. The framework of the proposed multi-modality discriminator is shown in Fig. 1; it is capable of discriminating both real/fake images and edges. To discriminate real/fake edges, the discriminator loss over the semantic label and the generated edge (or the real edge) is given in Eq. (5), which guides the model to distinguish real edges from generated ones.
Further, to discriminate real/fake images, the discriminator loss over the semantic label and the generated images (or the real image) is given in Eq. (6), which guides the model to discriminate real from fake images.
Therefore, the total loss of the proposed multi-modality discriminator is the sum of the edge and image discriminator losses.
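The equations referenced above were lost in extraction; a standard conditional-GAN form consistent with the surrounding description would be the following (a hedged reconstruction with our own symbols, not the paper's exact notation):

```latex
\mathcal{L}_{D}^{edge} = \mathbb{E}_{S,E}\!\left[\log D_e(S, E)\right]
  + \mathbb{E}_{S}\!\left[\log\big(1 - D_e(S, \hat{E})\big)\right],
\qquad
\mathcal{L}_{D}^{img} = \mathbb{E}_{S,I}\!\left[\log D_i(S, I)\right]
  + \mathbb{E}_{S}\!\left[\log\big(1 - D_i(S, \hat{I})\big)\right],
\qquad
\mathcal{L}_{D} = \mathcal{L}_{D}^{edge} + \mathcal{L}_{D}^{img},
```

where $S$ denotes the semantic label, $E$ and $\hat{E}$ the real and generated edge maps, and $I$ and $\hat{I}$ the real and generated images.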
Optimization Objective. Equipped with the multi-modality discriminator, we elaborate the training objective for the generator as below. Three different losses, i.e., the conditional adversarial loss , the discriminator feature matching loss and the perceptual loss , are used to optimize the proposed EdgeGAN,
where the three weights balance the contributions of the corresponding losses to the total objective; the feature matching loss matches the discriminator's intermediate features between the generated images/edges and the real ones, and the perceptual loss matches VGG  extracted features between the generated images/edges and the real ones. With this objective, the generator is encouraged to simultaneously generate reasonable edge maps that capture local-aware structure information and photo-realistic images semantically aligned with the input semantic labels.
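A plausible form of the total generator objective in Eq. (7), combining the three losses named above (the $\lambda$ symbol names are our own):

```latex
\mathcal{L}_{G} = \lambda_{cgan}\,\mathcal{L}_{cgan}
  + \lambda_{fm}\,\mathcal{L}_{fm}
  + \lambda_{perc}\,\mathcal{L}_{perc}
```

Each $\lambda$ weights the contribution of the corresponding loss; the values used in our experiments are given in the implementation details below.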
Implementation Details. We adopt ResNet  as the structure of our encoder. Moreover, we employ the SPADE residual block  in our generator, which has been shown to be very effective in semantic image synthesis tasks. For both the image generator and edge generator, the convolutions use a padding size of 1, with the kernel size chosen to preserve the feature map size. For the semantic preserving module, we adopt adaptive average pooling. Spectral normalization  is applied to all layers in both the generator and discriminator.
We follow the standard GAN training procedure and alternately train the generator and discriminator, i.e., one gradient descent step on the discriminator and generator in turn: we first train the generator with the discriminator fixed, and then train the discriminator with the generator fixed. We use the Adam solver . The three loss weights in Eq. (7) are set to 1, 10 and 10, respectively. All in both Eq. (6) and (7) are set to 2. We conduct the experiments on an NVIDIA DGX1 with 8 V100 GPUs.
Datasets and Evaluation Metrics. We follow GauGAN  and conduct experiments on two challenging datasets, i.e., Cityscapes  and ADE20K . The training and validation sets of Cityscapes contain 3,000 and 500 images, respectively. ADE20K contains challenging scenes with 150 semantic classes and consists of 20,210 training and 2,000 validation images. We adopt the mean Intersection-over-Union (mIoU), pixel accuracy (Acc) and Fréchet Inception Distance (FID)  as the evaluation metrics. All images on Cityscapes and ADE20K are re-scaled to the respective training resolutions of the two datasets. For both datasets, we perform 200 epochs of training with batch size 32, and the learning rate is linearly decayed to 0 from epoch 100 to 200.
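The learning-rate schedule described above (constant for the first 100 epochs, then linear decay to 0 at epoch 200) can be written as a small helper; the base learning rate of 0.0002 is our own illustrative assumption, as the paper's value did not survive extraction:

```python
def linear_decay_lr(epoch, base_lr=0.0002, total_epochs=200, decay_start=100):
    """Constant LR up to `decay_start`, then linear decay to 0 at `total_epochs`.

    NOTE: base_lr=0.0002 is a hypothetical placeholder, not the paper's value.
    """
    if epoch <= decay_start:
        return base_lr
    frac = (total_epochs - epoch) / (total_epochs - decay_start)
    return base_lr * max(frac, 0.0)
```

For example, the rate is unchanged at epoch 50, halved at epoch 150, and reaches 0 at epoch 200.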
Table 1: User study results. The numbers indicate the percentage of participants who favor our results over the competing method (SIMS provides results only on Cityscapes).

| Comparison | Cityscapes (%) | ADE20K (%) |
| --- | --- | --- |
| Ours vs. CRN  | 70.28 | 81.35 |
| Ours vs. Pix2pixHD  | 60.85 | 85.18 |
| Ours vs. SIMS  | 57.67 | N/A |
| Ours vs. GauGAN  | 56.54 | 60.49 |
| Ours vs. CC-FPSE  | 55.81 | 57.75 |
Qualitative Comparisons. Qualitative results on Cityscapes and ADE20K compared with existing methods, i.e., Pix2pixHD , CRN  and SIMS , are shown in Fig. 4. We can see that the proposed EdgeGAN achieves significantly better results with fewer visual artifacts than the other baselines.
To further validate the effectiveness of the proposed EdgeGAN, we compare it with two stronger baselines, i.e., GauGAN  and CC-FPSE . Note that we download their well-trained models and generate the results for fair comparison. Results compared with GauGAN are shown in Fig. 5. We observe that the proposed EdgeGAN generates sharper images than GauGAN, especially at local structures and details. Besides, as shown in Fig. 6, the proposed EdgeGAN achieves significantly better results than CC-FPSE, which often produces results with noticeable visual artifacts on both datasets.
User Study. We follow the evaluation protocol of GauGAN  and conduct a user study. Specifically, we give the participants an input semantic label and two translated images from different models and ask them to choose the generated image that looks more like a corresponding image of the semantic label. The participants are given unlimited time to make the decision.
Results compared with Pix2pixHD , CRN , SIMS , GauGAN  and CC-FPSE  are shown in Table 1. We observe that users favor our synthesized results on both datasets compared with other competing methods including GauGAN and CC-FPSE, further validating that the generated images by the proposed EdgeGAN are more natural and photo-realistic.
Quantitative Comparisons. Quantitative results on the mIoU, Acc and FID metrics are shown in Table 2. It is clear that the proposed EdgeGAN outperforms most existing leading methods by a large margin, except CC-FPSE . However, CC-FPSE generates significantly worse visual results than ours, as shown in Fig. 6. Moreover, we report the number of model parameters in Table 3. The proposed EdgeGAN has far fewer parameters than CC-FPSE on both datasets, which means it can be trained with less time and GPU memory.
Table 3: Number of model parameters (the sums indicate generator, discriminator and total parameters on each dataset).

| Method | Cityscapes G | Cityscapes D | Cityscapes Total | ADE20K G | ADE20K D | ADE20K Total |
| --- | --- | --- | --- | --- | --- | --- |
| CC-FPSE  | 138.6M | 5.2M | 143.8M (+45.2M) | 151.2M | 5.2M | 156.4M (+54.1M) |
| EdgeGAN (Ours) | 93.2M | 5.6M | 98.8M (+0.2M) | 97.2M | 5.8M | 103.0M (+0.7M) |
Visualization of Edge and Attention Maps. We also visualize the generated edge and attention maps on both datasets in Fig. 7. The proposed EdgeGAN generates reasonable edge maps according to the input labels, and these edge maps in turn provide more local structure information for producing more photo-realistic images.
Visualization of Segmentation Maps. We follow GauGAN  and apply pre-trained segmentation networks [63, 61] on the generated images to produce segmentation maps. Results compared with GauGAN  are shown in Fig. 8. We consistently observe that the proposed EdgeGAN generates better semantic labels than GauGAN on both datasets.
Variants of EdgeGAN. We conduct extensive ablation studies on Cityscapes  to evaluate different components of the proposed EdgeGAN. The proposed EdgeGAN has four baselines as shown in Table 4: (i) ‘+’ means only using the encoder and the proposed image generator to synthesize the targeted images; (ii) ‘++’ means adopting the proposed image generator and edge generator to simultaneously produce both edge maps and images; (iii) ‘+++’ connects the image generator and the edge generator by using the proposed attention guided edge transfer module ; (iv) ‘++++’ is our full model and employs the proposed semantic preserving module to further improve the quality of the final results.
Effect of Edge Guided Generation Strategy. The results of the ablation study are shown in Table 4. When using the proposed edge generator to produce the corresponding edge map from the input label, performance on all evaluation metrics improves: specifically, we obtain gains of 1.6, 0.3 and 4.7 points on the mIoU, Acc and FID metrics, respectively, confirming the effectiveness of the proposed edge guided generation strategy.
| Variants of EdgeGAN | mIoU | Acc | FID |
| --- | --- | --- | --- |
| + | 58.6 (+0.0) | 81.4 (+0.0) | 65.7 (-0.0) |
| ++ | 60.2 (+1.6) | 81.7 (+0.3) | 61.0 (-4.7) |
| +++ | 61.5 (+1.3) | 82.0 (+0.3) | 59.0 (-2.0) |
| ++++ | 64.5 (+3.0) | 82.5 (+0.5) | 57.1 (-1.9) |
Effect of Attention Guided Edge Transfer Module. We observe that the implicitly learned edge structure information by the ‘++’ baseline is not enough for such a challenging task. Thus we further adopt the proposed attention guided edge transfer module to transfer useful edge structure information from the edge generation branch to the image generation branch. We observe that 1.3, 0.3 and 2.0 point gains are obtained on the mIoU, Acc and FID metrics, respectively. This means that the proposed transfer module indeed learns richer feature representations with more convincing structure cues and details, and then transfers them from the edge generator to the image generator , confirming our design motivation.
Effect of Semantic Preserving Module. By adding the proposed semantic preserving module , the overall performance is further boosted with 3.0, 0.5 and 1.9 point improvements on the mIoU, Acc and FID metrics, respectively. This means the proposed semantic preserving module indeed learns and highlights class-specific semantic feature maps, leading to better generation results.
| Stages of EdgeGAN | Cityscapes | ADE20K |
| --- | --- | --- |
In Fig. 8, we show some samples of the generated semantic maps. We observe that the semantic maps parsed from results generated with the proposed semantic preserving module (i.e., ‘Label by EdgeGAN II’ in Fig. 8) are more accurate than those obtained without it (‘Label by EdgeGAN I’ in Fig. 8). Moreover, we provide quantitative results on both datasets in Table 5. The proposed semantic preserving module indeed learns better class-specific feature representations, leading to better performance on both datasets. Lastly, our generated semantic maps are also much better than those generated by GauGAN . Both quantitative and qualitative results confirm the effectiveness of the proposed semantic preserving module.
We propose a novel Edge guided GAN (EdgeGAN) for challenging semantic image synthesis tasks. EdgeGAN introduces three core components: an edge guided image generation strategy, an attention guided edge transfer module and a semantic preserving module. The first component generates edge maps from the input semantic labels. The second selectively transfers useful structure information from the edge generation branch to the image generation branch. The third alleviates the spatial resolution losses caused by different operations in deep networks. Extensive experiments show that EdgeGAN achieves significantly better results than existing methods. Furthermore, we believe that the proposed modules can easily be plugged into existing GAN architectures to address other generation tasks.
We provide more generation results of the proposed EdgeGAN and CC-FPSE  on both the Cityscapes  and ADE20K  datasets. Note that we generated the results of CC-FPSE  using the well-trained models provided by the authors (https://github.com/xh-liu/CC-FPSE) for fair comparison.
We visualize the generated edge and attention maps on both datasets in Figs. 13, 14, 15, 16, 17, 18 and 19. We observe that the proposed EdgeGAN generates good edge maps according to the input semantic labels, and these edge maps provide more local structure guidance for generating more photo-realistic images. These visualization results further prove the effectiveness of the proposed edge guided image generation strategy.
We follow GauGAN  and use the state-of-the-art segmentation networks on the generated images to produce the corresponding segmentation maps: DRN-D-105  for Cityscapes  and UperNet101  for ADE20K .
We observe that the segmentation maps parsed from results generated with the proposed semantic preserving module (i.e., ‘Label by EdgeGAN II’ in these figures) are more accurate than those obtained without it (‘Label by EdgeGAN I’ in these figures), which further validates the effectiveness of the proposed semantic preserving module.
Moreover, we consistently observe in these figures that the proposed EdgeGAN generates significantly better segmentation maps than GauGAN , especially on local texture and small-scale objects.